One of the primary attractions of Java is the incredible richness and robustness of its standard library, the Java Development Kit (JDK). “Basically bug-free” is a prerequisite for any standard library, but the JDK goes above and beyond when it comes to how well each of its subsystems is implemented for the general-purpose use case.
A corollary of this is that you should almost never implement your own basic types. Designing and implementing production-grade, high-performance algorithms and data-structures requires its own skillset, and even if you have it, chances are you’ll encounter some distant edge case in the far-off future. Suffice it to say that I’m generally not a fan of “implement your own priority queue” style whiteboard interviews.
But over the years at Sunshower, building infrastructure that’s, well, somewhat outside of the mainstream, I have encountered particular exceptions to this rule and open-sourced them under Sunshower-Arcus. One of the more niche exceptions is a data-structure supporting fast, persistent modifications.
What do we need Ropes for when we have Strings?
My current project is a new, structured infrastructure and deployment language called “Breeze”. As part of that, we need a good IDE supporting code-completion, symbol navigation, etc. I’ll go into Breeze in more detail at some point, but writing parsers for IDEs is somewhat different from writing parsers for other purposes. IDE-quality parsers must be fault-tolerant, which means that when they encounter an error, they must not just “give up”. Instead, they should provide helpful hints like “I see this is an assignment; we have these candidate symbols in the current scope” rather than “parse error: expected SYMBOL at LINE 22, COLUMN 15” or something to that effect.
Another aspect of IDE parsers is that the user spends a lot of time modifying sections of a document that are in pathological locations for contiguous, array-based strings. For example, when I add a method to a class in my IDE, each keystroke performs an insertion or deletion somewhere within the document string, usually towards the middle of it. This entails the following for contiguous-memory strings:
1. Copy the characters before the operation’s start index
2. Copy the characters after the operation’s end index
3. Place the operation’s new text
4. Concatenate (1, 3, 2)
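The copy-and-concatenate steps above can be sketched in plain Java. This is only an illustration of what an insertion costs on an immutable, contiguous string; `ContiguousInsert` and its `insert` method are hypothetical names, not part of the JDK or of Sunshower-Arcus.

```java
// Hypothetical helper (not a library API): inserts `text` into `doc` at `index`
// the way an immutable, contiguous-memory string has to.
final class ContiguousInsert {

    static String insert(String doc, int index, String text) {
        String before = doc.substring(0, index); // copy the characters before the edit
        String after = doc.substring(index);     // copy the characters after the edit
        // Concatenating copies every character of the document once more,
        // so each keystroke costs time proportional to the document length.
        return before + text + after;
    }

    public static void main(String[] args) {
        String doc = "class A { }";
        // prints: class A { void f() {} }
        System.out.println(insert(doc, 10, "void f() {} "));
    }
}
```

Every call allocates two temporary substrings plus the result, which is where the heap pressure comes from.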
For a typical-sized source file (several kB at a minimum), this is excruciatingly slow and generates an unacceptable amount of heap-pressure (creating short-lived objects that must be reclaimed). For this reason, many IDEs use a data-structure called a “gap-buffer”, which “freezes” the regions outside of the current edit-operations into contiguous-memory strings and “thaws” the regions within the current edit-operations into a friendlier data-structure like a linked list.
There is an even better data-structure for this sort of workload, one that is actually a reasonable substitute for effectively all string operations: ropes.
A rope is a balanced binary tree (not a binary-search tree) whose leaves are small contiguous-memory strings. For instance, a Rope constructed from the string “Hello my name is Josiah” may look like this:
rope(8,23)[null]
├╴rope(5,8)[null]
│ ├╴rope(5,5)[Hello]
│ └╴rope(3,3)[ my]
└╴rope(7,15)[null]
  ├╴rope(3,7)[null]
  │ ├╴rope(3,3)[ na]
  │ └╴rope(4,4)[me i]
  └╴rope(1,8)[null]
    ├╴rope(1,1)[s]
    └╴rope(7,7)[ Josiah]
The “internal” nodes are merely pointers, either to other internal nodes or to contiguous subsequences of characters. The balancing criteria for such a tree are:
1. Each internal node must have two children
2. Each leaf node must be “flat”: it contains a contiguous array of characters and has no children.
This is a less restrictive balancing than, say, AVL balancing (although it is similar), but it does guarantee us a maximum tree height of log2(n), where n is the number of characters in the tree. Each node is identified by its “weight”, or the number of characters contained within its left subtree. For leaf nodes, the weight is the length of its character array.
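To make the role of the weight concrete, here is a minimal sketch of a rope node and an index lookup that uses it. `RopeNode` is a hypothetical class written for this post and is not the Sunshower-Arcus API. Its two numbers correspond to the pairs shown in the tree dump above, which appear to be (weight, total length).

```java
// Minimal sketch; RopeNode is a hypothetical name, not the Sunshower-Arcus API.
final class RopeNode {
    final RopeNode left, right; // both null for a leaf
    final String text;          // null for an internal node
    final int weight;           // leaf: text length; internal: length of the left subtree
    final int length;           // total number of characters beneath this node

    RopeNode(String text) {
        this.left = null;
        this.right = null;
        this.text = text;
        this.weight = text.length();
        this.length = text.length();
    }

    RopeNode(RopeNode left, RopeNode right) {
        this.left = left;
        this.right = right;
        this.text = null;
        this.weight = left.length;
        this.length = left.length + right.length;
    }

    // Walks from the root to the leaf holding `index`, using the weights to steer.
    char charAt(int index) {
        RopeNode node = this;
        while (node.text == null) {
            if (index < node.weight) {
                node = node.left;      // the character lives in the left subtree
            } else {
                index -= node.weight;  // skip everything stored on the left
                node = node.right;
            }
        }
        return node.text.charAt(index);
    }
}
```

A lookup visits one node per level, so its cost is proportional to the height of the tree rather than to the length of the text.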
Persistence and Concurrency
The JDK designers designed java.lang.String to be immutable for quite a few reasons that span security, intrinsification, caching (interning) and concurrency-safety. There are “mutable” strings in Java such as the java.lang.StringBuilder type which can be more ergonomic and faster for some operations, but ultimately even StringBuilder relies on a contiguous array of memory. Amortized constant-time append operations to StringBuilder require doubling the size of its backing character array each time the array is resized, and StringBuilder is not thread-safe or concurrent (although StringBuffer is).
In the functional programming paradigm, many core data-structures are “tree-like”, and support a feature known as “persistence” to solve this problem. When a persistent data-structure is modified, it retains its structure and returns a “copy” containing the modifications. Many data-structures can share a substantial amount of substructure, reducing the cost of a copy operation. This is the approach to concurrency that our ropes take: modifying a rope will return another rope sharing as much of the original’s substructure as possible (such as leaf nodes). This is a trade-off between performance and concurrency-safety. As we’ll see, this is not as expensive as it might seem.
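As a rough illustration of structural sharing, the sketch below inserts into a rope without touching the original. `PersistentRope` is a hypothetical class, not the Sunshower-Arcus implementation, and it deliberately skips rebalancing to stay short. Only the nodes on the path from the root to the edited leaf are recreated; every other subtree is reused by reference.

```java
// Sketch of a persistent insert; PersistentRope is a hypothetical class
// and performs no rebalancing.
final class PersistentRope {
    final PersistentRope left, right; // null for a leaf
    final String text;                // null for an internal node
    final int length;

    PersistentRope(String text) {
        this.left = null;
        this.right = null;
        this.text = text;
        this.length = text.length();
    }

    PersistentRope(PersistentRope left, PersistentRope right) {
        this.left = left;
        this.right = right;
        this.text = null;
        this.length = left.length + right.length;
    }

    // Returns a NEW rope containing the insertion; `this` is never modified.
    PersistentRope insert(int index, String inserted) {
        if (text != null) {
            PersistentRope middle = new PersistentRope(inserted);
            if (index == 0) return new PersistentRope(middle, this);             // leaf reused as-is
            if (index == text.length()) return new PersistentRope(this, middle); // leaf reused as-is
            // Only the leaf being edited is split into two smaller leaves.
            return new PersistentRope(
                new PersistentRope(new PersistentRope(text.substring(0, index)), middle),
                new PersistentRope(text.substring(index)));
        }
        if (index <= left.length) {
            return new PersistentRope(left.insert(index, inserted), right);      // right subtree shared
        }
        return new PersistentRope(left, right.insert(index - left.length, inserted)); // left subtree shared
    }

    @Override
    public String toString() {
        return text != null ? text : left.toString() + right.toString();
    }
}
```

Because the original rope is never mutated, a reader on another thread can keep using it while an editor produces the new version.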
The benchmarks are run via JMH and are available at https://github.com/sunshower-io/sunshower-arcus/tree/master/arcus-lang/src/benchmarks/java/io/sunshower/arcus/lang/benchmarks.
CPU: Intel Xeon E3-1535M @ 3.10 GHz
OS: Debian (5.10.0-13-amd64 #1 SMP Debian 5.10.106-1)
Virtual Machine with 8 CPUs (2 sockets, 4 cores each), 32 GB RAM
VMWare Workstation 16.2.2
OpenJDK Corretto 188.8.131.52.1
ZGC
max heap-size: 8GB
default heap-size: 4GB
This configuration could certainly be tuned to support the operations that Ropes rely on, but benchmarking against defaults seemed to be the most reasonable approach to me.
Each benchmark was conducted on Strings and Ropes constructed from the same byte array ranging in size from 1 byte to 10 MB by factors of 10. Obviously, the axes are logarithmic.
String Construction vs. Rope Construction
This benchmark simply tests the amount of time required to construct a String vs. a Rope from the same character array:
Note that the construction time of ropes vs. strings is quite similar up until ~100 bytes. This is because ropes under 199 characters (our default split-size) degenerate to strings. As an aside, the split-size should be prime, and I selected 199 after some microbenchmarks indicated that it provided a good balance between minimizing internal nodes (tree-height) and providing good modification performance.
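The fallback behavior can be sketched as follows. `SplitRope` is a hypothetical class, and recursively halving the input is an assumed construction strategy used here for illustration; the actual library may split differently. The only detail taken from the post is the default split-size of 199.

```java
// Sketch only: SplitRope is a hypothetical class, and recursive halving is an
// assumed strategy, not necessarily what Sunshower-Arcus does.
final class SplitRope {
    static final int SPLIT_SIZE = 199; // default split-size quoted in the post

    final SplitRope left, right; // null for a leaf
    final String text;           // null for an internal node

    private SplitRope(String text) {
        this.left = null;
        this.right = null;
        this.text = text;
    }

    private SplitRope(SplitRope left, SplitRope right) {
        this.left = left;
        this.right = right;
        this.text = null;
    }

    static SplitRope of(String s) {
        return build(s, 0, s.length());
    }

    private static SplitRope build(String s, int from, int to) {
        if (to - from < SPLIT_SIZE) {
            // Short inputs degenerate to a single flat leaf, i.e. a plain string.
            return new SplitRope(s.substring(from, to));
        }
        int mid = from + (to - from) / 2;
        return new SplitRope(build(s, from, mid), build(s, mid, to));
    }

    boolean isLeaf() {
        return text != null;
    }

    int height() {
        return isLeaf() ? 0 : 1 + Math.max(left.height(), right.height());
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        appendTo(sb);
        return sb.toString();
    }

    private void appendTo(StringBuilder sb) {
        if (isLeaf()) {
            sb.append(text);
        } else {
            left.appendTo(sb);
            right.appendTo(sb);
        }
    }
}
```

With this scheme a 198-character input stays a single leaf, while a 1,000-character input becomes a tree of height 3 with eight 125-character leaves. A larger split-size means fewer internal nodes but more copying per edit.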
The ratio of throughput of rope construction vs. string construction is quite consistent once the fallback behavior is no longer relevant:
In any case, constructing a string was never a full order of magnitude faster than constructing a rope for any operand sizes in our benchmarks. This is quite remarkable given the amount of internal structure that ropes possess.
The next benchmark I performed is inserting a 100-byte string into the center of Ropes and Strings ranging in length from 1 byte to 10 MB. Object construction is not counted towards this benchmark’s throughput:
This is an operation where Ropes really shine. On the logarithmic scale presented, Ropes decrease in throughput relative to operand size by an (approximately) constant factor, and overtake Strings in throughput at around the 1 kB mark. Recall that a constant line on a log-scale chart is a logarithmic factor, which we do observe. Strings, by contrast, degrade in performance linearly (a linear downward trend) for this operation, until ropes are 81,000x faster for 100 MB operand sizes, at which point java.lang.String on my configuration is only capable of 21 operations/second.
The Sunshower rope structure is intended to be a drop-in replacement for java.lang.String. Although it’s still relatively new, it does what you’d expect it to do such as:
1. Works with JDK regular expressions and patterns
2. Is a good candidate for hashtable keys
3. Implements Comparable correctly
In addition, certain operations that Ropes are relatively slow at, such as String comparisons, charAt, and splitting over non-regex substrings, are heavily optimized. We provide high-quality string search algorithms such as tuned Boyer-Moore-Horspool implementations, which we’ve compared with competitors such as Knuth-Morris-Pratt, Rabin-Karp, and Raita (which was surprisingly poorly performing).
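For readers unfamiliar with it, here is a textbook Boyer-Moore-Horspool search over any CharSequence. This is a plain, untuned sketch and not the library's implementation; `Horspool` and `indexOf` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Textbook Boyer-Moore-Horspool; a sketch, not the tuned Sunshower-Arcus version.
final class Horspool {

    // Returns the index of the first occurrence of `pattern` in `text`, or -1.
    static int indexOf(CharSequence text, CharSequence pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: for each pattern character except the last, the distance
        // from its rightmost occurrence to the end of the pattern.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the current window right-to-left.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) return pos;
            // Slide the window based on the text character under the pattern's last
            // position; characters absent from the table allow a full-length skip.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }
}
```

For example, `Horspool.indexOf("Hello my name is Josiah", "name")` returns 9. On a rope, each `charAt` call walks the tree, so a tuned version would presumably scan leaf by leaf instead of indexing from the root every time.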
This rope implementation also provides a block-iteration structure granting efficient in-order access to the leaf nodes of the data-structure.
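One way such a block iterator could work is an explicit-stack, in-order walk that yields each leaf's string in turn. The sketch below uses a hypothetical `BlockRope` class and is not the library's actual iteration API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of in-order leaf iteration; BlockRope is a hypothetical class.
final class BlockRope implements Iterable<String> {
    final BlockRope left, right; // null for a leaf
    final String text;           // null for an internal node

    BlockRope(String text) {
        this.left = null;
        this.right = null;
        this.text = text;
    }

    BlockRope(BlockRope left, BlockRope right) {
        this.left = left;
        this.right = right;
        this.text = null;
    }

    @Override
    public Iterator<String> iterator() {
        Deque<BlockRope> stack = new ArrayDeque<>();
        stack.push(this);
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return !stack.isEmpty();
            }

            @Override
            public String next() {
                if (stack.isEmpty()) throw new NoSuchElementException();
                BlockRope node = stack.pop();
                // Descend to the leftmost leaf, remembering right siblings for later.
                while (node.text == null) {
                    stack.push(node.right);
                    node = node.left;
                }
                return node.text;
            }
        };
    }
}
```

Iterating block by block lets operations such as searching or copying work on whole leaf strings, rather than paying for a tree walk on every character.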
Thanks for reading! I’ll try to provide additional benchmarks as they’re requested. I’ve only documented several of the benchmarks that I’ve currently run. The benchmark data for the described benchmarks can be found at https://github.com/sunshower-io/sunshower-arcus/blob/master/arcus-lang/src/benchmarks/resources/benchmarks-4-11-22.txt. The benchmarks themselves can be found at: https://github.com/sunshower-io/sunshower-arcus/tree/master/arcus-lang/src/benchmarks/java/io/sunshower/arcus/lang/benchmarks.
I hope this post has demonstrated that a quality rope implementation can be a viable contiguous-memory string replacement for most or all operations! This library is fully open-source and available in Maven Central.
I’d like to thank Reddit user mauganra_it for their help with some implementation details such as recommending the use of java.lang.Strings as the primary leaf-node data-structure as opposed to character arrays. This had some measurable impacts on performance and memory utilization that I hope to discuss soon.