On the Relevancy team at zulily, we are often required to load a large number of large strings into memory. This often causes memory issues. After looking at multiple ways to reduce memory pressure, we settled on Radix Trees to store these strings. Radix Trees provide very fast prefix searching and are great for auto-complete services and similar uses. This post focuses entirely on memory consumption.
What Is A Radix Tree?
Radix Trees take sequences of data and organize them in a tree structure. Strings with common prefixes end up sharing nodes toward the top of this structure, which is how memory savings is realized. Consider the following example, where we store “antidisestablishmentarian” and “antidisestablishmentarianism” in a Radix Tree:
+- antidisestablishmentarian (node 1) +- ism (node 2)
Two strings, totaling 53 characters, can be stored as two nodes in a tree. The first node stores the common prefix (25 characters) between it and its children. The second stores the rest (3 characters). In terms of character data stored, the Radix Tree stores the same information in approximately 53% of the space (not counting the additional overhead introduced by the tree structure itself).
If you add the string “antibacterial” to the tree, you need to break apart node 1 and shuffle things around. You end with:
+- anti (node 3) |- disestablishmentarian (node 4) | +- ism (node 2) +- bacterial (node 5)
We run a lot of software in the JVM, where memory performance can be tricky to measure. In order to validate our Radix Tree implementation and measure the impact, I pumped a bunch of pseudo-realistic data into various collections and captured memory snapshots with YourKit Java Profiler.
It didn’t take long to hack together some real-looking data in Ruby with Faker. I created four input files of approximately 1,000,000 strings that included a random selection of 12-digit numbers, bitcoin addresses, email addresses and ISBNs.
sreed:src/ $ head zulily-oss/radix-tree/12-digit-numbers.txt 141273396879 414492487489 353513537462 511391464467 633249176834 347155664352 632411507158 752672544343 483117282483 211673267195 sreed:src/ $ head zulily-oss/radix-tree/bitcoins.txt 1Mp85mezCtBXZDVHGSTn3NYZuriwRMmW6D 1N8ziuitNLmSnaXy2psYpLcXvugHw1Yc5s 18DnruBzLHmnVHQhDghoa6eDt6sDkfuWKr 1A3sRfAnP89HE4RgNQARa3kCq4xFEF9eev 12WR4DrsR4mM8gDHZCuqXe2h37VUSUPSNu 1PRmYuevwZXZamBEgANzLXe2SjFneGDsXp 1EpjPwt8Ap47XA6HwJhCTxUZRDH11GKWuQ 1P8MAgobhLw4FYcFHbw7a8t2FvQZg8K597 15xhiiLdkin8zi6S5KL9DkDDQyvLb1pjjT 1NPEZeEjgGu5TYdz5d3kxjVfLwxAZ2fK6f sreed:src/ $ head zulily-oss/radix-tree/emails.txt firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com sreed:src/ $ head zulily-oss/radix-tree/isbns.txt 216962073-7 640524955-7 955360834-5 429656067-0 605437693-4 204030847-4 037410069-1 239193083-6 182539755-4 034988227-4
Measuring Memory with YourKit
YourKit provides a measurement of “retained size” in its memory snapshots which is helpful when trying to understand how your code is impacting the heap. What isn’t necessarily intuitive about it, though, is what objects it excludes from this “retained size” measurement. Their documentation is very helpful here: only object references that are exclusively held by the object you’re measuring will be included. Instead of telling you “this is how much memory usage your object imposes on the VM,” retained size instead tells you “this is how much memory the VM would be able to garbage-collect if it were gone.” This is a subtle, but very real, difference if you wish to optimize memory consumption.
Thus, my memory testing needed to ensure that each collection held complete copies of the objects I wished to measure. In this case, each string key needed to be duplicated (I decided to intern and share every value I stored in order to measure only the memory gains from different key storage techniques).
// Results in shared reference, and inaccurate measurement map1.put(key, value); map2.put(key, value); // Results in shared char reference, and better but // still inaccurate measurement map1.put(new String(key), value); map2.put(new String(key), value); // Results in complete copy of keys, and accurate measurement map1.put(new String(key.toCharArray()), value); map2.put(new String(key.toCharArray()), value);
I tested our own Radix Tree implementation, ConcurrentRadixTree from https://code.google.com/p/concurrent-trees/, a string array, Guava’s ImmutableMap and Java’s HashMap, TreeMap, Hashtable and LinkedHashMap. Each collection stored the same values for each key.
Both zulily’s Radix Tree and the ConcurrentRadixTree from concurrent-trees were configured to store string data as UTF-8-encoded byte arrays.
ConcurrentRadixTree was included simply to ensure that our own version (to be open-sourced soon) was worth the effort. The others were measured simply to highlight the benefits of Radix Tree storage for different input types. Each collection has its own merits and in most ways they are all superior to the Radix Tree for storage (put/get performance, concurrency and other features).
First of all, Guava’s ImmutableMap is pretty good. It stored the same key and value data as java.util.HashMap in 92-95% of the space. The Radix Tree breaks keys into byte array sequences and stores them in a tree structure based on common prefixes. This resulted in a best case of 62% the size of the ImmutableMap for bitcoin addresses (strings which have many common prefixes) and a worst case 88% for random 12-digit numbers. We see that the memory used by this data structure is largely dependent on the type of data put into it. Large strings with many large common prefixes are stored very efficiently in a narrow tree structure. Unique strings create a lot of branches in the underlying tree, making it very wide and adding a lot of overhead.
Converting Java Strings to byte arrays accounts for most of the memory savings, but not all. Byte array storage was anywhere from 90% (bitcoin addresses) to 99% (ISBNs) in the tests I ran.
For us, storing byte-encoded representations of string data in a radix tree allowed us to reclaim valuable memory in our services. However it wasn’t until validating the implementation in an accurate manner with realistic data and trustworthy tools that we rested easy knowing we had set out what we wished to accomplish.