Ropes: Twining Together Strings for Editors

It's been a while since I've written about any data structures. But it just so happens that I'm currently implementing a really interesting and broadly useful data structure, called a rope.

A bit of background, to lead in. I've got this love-hate relationship with
some of the development tools that Rob Pike has built. (Rob is one of the Unix
guys from Bell Labs, and was one of the principal people involved in both the
Plan9 and Inferno operating systems.) Rob has implemented some amazing
development tools. The two that fascinate me are called Sam and Acme. The
best and worst features of both are a sort of extreme elegant minimalism.
There's no bloat in Rob's tools, no eye-candy, no redundancy. They're built to
do a job, and do it well - but not to do any more than their intended job.
(This can be contrasted against Emacs, which is a text editor that's grown
into a virtual operating system.) The positive side of this is that they're
incredibly effective, and they demonstrate just how simple a programmer's text
editor should be. I've never used another tool that is more effective than
Acme or Sam. In all seriousness, I can do more of my work more easily in Sam
than I can in Emacs (which is my everyday editor). But on the other hand, that
extreme minimalist aesthetic has the effect of strictly eliminating any
overlaps: there's one way to do things, and if you don't like it, tough. In
the case of Acme and Sam, that meant that you used the mouse for damn-near
everything. You couldn't even use the up and down arrows to move the cursor!

This is a non-starter for me. As I've mentioned once or twice, I've got terrible
RSI in my wrists. I can't use the mouse that much. I like to keep my hands on my
keyboard. I don't mind using the mouse when it's appropriate, but moving my hand from
the keyboard to the mouse every time I want to move the cursor? No. No
damned way. Just writing this much of this post, I would have had to go back and
forth between the keyboard and mouse over 50 times. (I was counting, but gave up when
I hit 50.) A full day of that, and I'd be in serious pain.

I recently got reminded of Acme, because my new project at work involves using a
programming language developed by Rob Pike. And Acme would really be incredibly
useful for my new project. But I can't use it. So I decided to bite the bullet, and
use my free time to put together an Acme-like tool. (Most of the pieces that you need
for a prototype of a tool like that are available as open-source components, so it's
just a matter of assembling them. Still a very non-trivial task, but a
possible one.)

Which, finally, leads us back to today's data structure. The fundamental piece of
a text editor is the data structure that you use to represent the text that you're
editing. For simplicity, I'm going to use Emacs terminology, and refer to the
editable contents of a file as a Buffer.

How do you represent a buffer?

As usual with data structures, you start by asking: What do I need it to do? What performance characteristics are important?

In the case of a text buffer, you can get by with a fairly small set
of basic operations:

  • Fast concatenation: concatenating blocks of text needs to be
    really fast.
  • Fast insert: given a point in a block
    of text, you need to be able to quickly insert text
    at that point.
  • Fast delete: given two points in a block of
    text, you need to be able to quickly remove the text between those points.
  • Reasonably fast traversal: Lots of algorithms,
    ranging from printing out the text to searching it, are based
    on linear traversals of the contents. This doesn't have to be
    incredibly fast - it is an intrinsically linear process, and it's
    usually done in the context of something with a non-trivial cost
    (I/O, regular-expression scanning). But you can't afford to make it
    expensive.
  • Size: you need to be able to store effectively unlimited
    amounts of text, without significant performance degradation in the
    operations described above.

Considering those requirements, you can eliminate a number of obvious
alternatives:

  • Simple character array: Inserts and deletes
    are very expensive for large documents.
  • Lists of lines: if you use a linked list, the memory cost
    for pointers becomes very expensive. If you use an array of
    lines, copy-insert costs of new lines becomes very expensive
    for large documents.
  • Gap buffers: a neat variation on character arrays; they have
    the same problem on large documents.
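As an aside, the gap-buffer idea is easy to sketch. The minimal version below is my own illustration (the names are mine): it stores the text before the gap reversed, so inserting and deleting at the gap are constant-time, while moving the gap costs time proportional to the distance moved.

```haskell
-- A minimal gap-buffer sketch. The text before the gap is stored
-- reversed, so edits at the gap are O(1); moving the gap is
-- O(distance); rebuilding the whole string is O(n).
type GapBuffer = (String, String)  -- (before-gap reversed, after-gap)

-- Insert a character at the gap: O(1).
insertChar :: Char -> GapBuffer -> GapBuffer
insertChar c (before, after) = (c : before, after)

-- Move the gap one position to the right: O(1) per step.
moveRight :: GapBuffer -> GapBuffer
moveRight (before, c : after) = (c : before, after)
moveRight buf                 = buf

-- Rebuild the full text: O(n).
toString :: GapBuffer -> String
toString (before, after) = reverse before ++ after
```

The "same problem on large documents" is visible here: every edit far from the current gap position forces the gap to be moved across the intervening text.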

Hybrids of the above can work nicely - a linked list of character arrays,
for example. In fact, the structure that I'm going to describe
is really a variation on the list of character arrays.

What works extremely well is to take the basic idea of a list of
mini-buffers (that is, smallish character arrays), and tweak it:
instead of keeping the sub-buffers in a list, you put
the sub-buffers into a binary tree. The result is a structure called
a rope. To the best of my knowledge, ropes were invented by
Hans Boehm (of garbage collection fame), and described in the paper
"Ropes: An Alternative to Strings" by Boehm, Atkinson, and Plass. Ropes are
a simple, elegant data structure that's excellent for representing large string-like
data structures. (The name is a pun: a real-world rope is made by twining together lots of smaller strings. A data-structure rope tangles together data-structure strings.)

I'm not going to present the full-blown version of ropes in this post; instead, I'm going to work up to that. In this post, I'm going to start by showing you a simplified version that illustrates the basic ideas. In later posts,
I'll work up to full-blown ropes, including things like sub-rope sharing, copy-avoidance via copy-on-write, and so on.

[Figure: a balanced rope-tree, with text in the leaves]

The basic idea of a rope is amazingly simple. A rope is a binary tree
with strings in the leaves. An internal node in a rope tree doesn't contain any text - just pointers to its left and right children. The text value of a rope is the concatenation of its leaves, in left-to-right order. (This basic structure is also called a concatenation tree.) An example of a rope is shown above.

So - if you've got two huge strings of text that you want to concatenate, you just create a new root node, and put the two strings in as children. Presto! You've got a new rope containing the concatenation! For example, below are two rope-trees - one for the text "abcdefghijkl", and one for the text "zyxwv". Beneath them is the result of concatenating the two ropes.

[Figure: concatenating two rope-trees by creating a new root node]

Concatenation is therefore constant time, regardless of the length of the strings
being concatenated - you can't ask for better than that! Traversing the text is also
really easy - you just do an in-order traversal of the rope-tree. In the worst case -
a degenerate tree with one character per node - that gets very expensive. But it's
easy to avoid the degenerate case, as we'll see in a later post. In the typical case,
where each leaf node has a reasonable amount of text, and the tree is reasonably
balanced, the additional cost of traversing the tree is negligible in comparison to
the cost of traversing the characters in the nodes. (Each pointer in the tree needs
to be traversed once, which makes the total cost O(n) in the length of the text. But
since there's typically a lot of text per leaf, that ends up meaning that the
increase in traversal cost is minimal.)
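The constant-time concatenation and the linear in-order traversal can be sketched with the same Rope representation used in the full Haskell code later in the post (this is a minimal sketch; it omits the empty-leaf special case that the full code handles):

```haskell
-- A minimal sketch of a rope: a binary tree with text only in the leaves.
data Rope = InternalRopeNode Rope Rope
          | LeafRopeNode String

-- O(1) concatenation: allocate one new root; both subtrees are shared.
concatRope :: Rope -> Rope -> Rope
concatRope = InternalRopeNode

-- Linear traversal: an in-order walk that concatenates the leaves.
ropeToString :: Rope -> String
ropeToString (LeafRopeNode s)       = s
ropeToString (InternalRopeNode l r) = ropeToString l ++ ropeToString r
```

So concatenating the two example ropes above is one allocation, and recovering the text is one walk over the tree.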

For string-mutation operations, we can define almost anything we want in terms of
two primitives: concatenate, and subrope. For example, we can define how to
do insert and deletion of chunks of text from a rope as follows:

  • Remove text: to remove the chunk of text that runs from point P1 to
    point P2 in a rope, we take the subrope from the start of the rope up
    to (but not including) P1, and the subrope from P2+1 through the end,
    and then we concatenate them. The result is the rope with the chunk
    removed.
  • Insert text: to insert a chunk of text C into a rope at a particular
    point P, we take the pre-subrope from the start up to (but not
    including) P, and the post-subrope from P through the end. Then we
    concatenate pre-subrope, C, and post-subrope.
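The two operations above can be sketched over plain strings, with a `sub` helper standing in for the subrope primitive (a sketch only; the helper names are mine, and the rope version of subrope appears later in the post):

```haskell
-- Sketch over plain String. "sub s p n" is the stand-in for the
-- subrope primitive: n characters starting at 0-based position p.
sub :: String -> Int -> Int -> String
sub s p n = take n (drop p s)

-- Remove positions p1..p2 inclusive: concatenate the subrope before
-- p1 with the subrope after p2.
removeText :: Int -> Int -> String -> String
removeText p1 p2 s = sub s 0 p1 ++ sub s (p2 + 1) (length s)

-- Insert chunk c at position p: pre-subrope ++ c ++ post-subrope.
insertText :: String -> Int -> String -> String
insertText c p s = sub s 0 p ++ c ++ sub s p (length s)
```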

So to provide those operations, we just need to figure out how to do substring in
an efficient way.

Substring isn't trivial in ropes, but it's not awful either. The main
difficulty is that there are a lot of cases to consider to get an
implementation correct.

The basic idea is fairly simple. You traverse the tree, keeping track
of your position in the traversal. When you get to the substring you
want, you start copying - usually just copying the pointers to nodes. You
keep copying until you get to the end of the desired substring. The catch,
as I said above, is keeping track of all of the cases.

An outline of substring is:

  • Inputs:
    • A rope, R.
    • The start position of the desired sub-rope, Start.
    • The length of the desired sub-rope, L.
  • Algorithm:
    1. Let currentPosition = 0
    2. Let remainingLengthToCopy = L
    3. Let result = EmptyRope
    4. For each node N in the tree:
    5. If N is an internal node: traverse the left subtree, and then the right
      subtree.
    6. If N is a leaf, then:
      1. The start position of this node is currentPosition;
        the end position of this node is currentPosition plus the
        length of the string in this node.
      2. If the start position of the desired subrope is
        beyond the end of this leaf, then just increment
        currentPosition by the size of the string in this leaf.
      3. If the start position of the desired subrope is inside
        of this leaf, then start copying - produce a new
        rope node containing the substring (There are sub-cases
        here for "subrope also ends in this node", and "subrope"
        doesn't end in this node.) Increment currentPosition by
        the size of the string in this node, and decrement
        remainingLengthToCopy by the size of the copied substring.
      4. If the start position of the desired subrope is before this
        node, and the end position of the desired subrope is after
        this node, then insert this node into the result rope,
        increment the position by the size of the string in this node,
        and decrement remainingLengthToCopy by the length of the string
        in this node.
      5. If the start position of the subrope is before this node,
        and the end position is in the node, then copy the appropriate
        substring to a new leaf, and concatenate with result. Then
        return result.
      6. If the start position of this node is after the end of the desired
        sub-rope, return result.

An example implementation is below. It's a quick one that I whipped together in
Haskell. (I actually wrote versions in Haskell, OCaml, Python, and Java, but in
the end, I think that the Haskell is the easiest to follow.)

module Ropes where

-- A rope is either an internal node with two children and no text,
-- or a leaf node with text and no children.
data Rope = InternalRopeNode Rope Rope
          | LeafRopeNode String
    deriving (Show)

concatRope :: Rope -> Rope -> Rope
concatRope (LeafRopeNode "") r = r
concatRope r (LeafRopeNode "") = r
concatRope r1 r2 = InternalRopeNode r1 r2

-- Subrope implemented using a fairly standard translation
-- of iteration to recursion. The state of each iteration is
-- represented by a triple consisting of the current position in
-- the rope, the number of characters still to be copied, and
-- the current subrope.
subRopeIter :: Rope -> Int -> Int -> (Int, Int, Rope) -> (Int, Int, Rope)

-- For an internal node, call subRope of the left child on the input
-- state, and then take the state resulting from the left-child invocation,
-- and use it as the input to the right child. There is a nice Haskell
-- syntax for this, but I think the let is clearer for non-Haskell
-- people.
subRopeIter (InternalRopeNode left right) start len (pos, remaining, result) =
    let (leftPos, leftRemaining, leftResult) = 
       subRopeIter left start len (pos, remaining, result)
    in subRopeIter right start len (leftPos, leftRemaining, leftResult)

-- The real work happens on leaf nodes. We define it by cases:
subRopeIter rope@(LeafRopeNode s) start len state@(pos, remaining, result)
  -- (1) remaining == 0: everything we needed has been copied, so
  --       just pass the state through unchanged.
  | remaining == 0 = state

  -- (2) pos + length s <= start: this node's content ends before the
  --       start point. Just advance the position past the node.
  | pos + length s <= start = (pos + length s, remaining, result)

  -- (3) pos <= start < pos + length s: the desired subrope starts
  --       inside this node, so start copying here. There are two
  --       sub-cases: (3a) the subrope extends past this node, and
  --       (3b) the subrope also ends inside this node.
  | pos <= start && pos + length s > start =
      let startPosInNode = start - pos
          substr = drop startPosInNode s
      in if length substr < remaining
           then (pos + length s,
                 remaining - length substr,
                 concatRope result (LeafRopeNode substr))
           else (pos + length s,
                 0,
                 concatRope result (LeafRopeNode (take remaining substr)))

  -- (4) pos > start, and the node lies entirely within the copy range:
  --       share the whole node by copying its pointer into the result.
  | pos > start && length s <= remaining =
      (pos + length s, remaining - length s, concatRope result rope)

  -- (5) pos > start, and the subrope ends inside this node: copy the
  --       prefix that's still needed, and the copy is complete.
  | pos > start && length s > remaining =
      (pos + length s, 0, concatRope result (LeafRopeNode (take remaining s)))
  | otherwise = state

subRope :: Rope -> Int -> Int -> Rope
subRope r start len =
   let (_, _, result) = subRopeIter r start len (0, len, LeafRopeNode(""))
   in result

For example, we can define a non-trivial rope as follows:

let r = InternalRopeNode 
         (InternalRopeNode 
           (LeafRopeNode "abc") 
           (InternalRopeNode 
             (LeafRopeNode "def") 
             (LeafRopeNode "ghi"))) 
         (InternalRopeNode 
            (LeafRopeNode "jkl")
            (LeafRopeNode "mno"))

That's a fairly complicated rope-tree for "abcdefghijklmno". In a real rope application, a string that short would really just be one node. But for illustration,
it'll do. To get the sub-rope of length 7 starting from position 5 (positions are zero-based), we'd run:
subRope r 5 7. When I do that in GHCI, I get the following:


InternalRopeNode (InternalRopeNode (LeafRopeNode "f") (LeafRopeNode "ghi")) (LeafRopeNode "jkl")

Or "fghijkl" - exactly right. And to do the substring operation, we needed
to traverse the rope tree, copy two pointers (to the "ghi" and "jkl" nodes),
create one new leaf node (for "f"), and create two new internal nodes for
the concatenations.


Just use plain old arrays :)

Seriously, I had to write a specialized text editor a few years ago. At first, I tried to use rope-based strings.

However, when I later replaced ropes with plain old std::string - I got absolutely no visible slowdowns.

Of course, if you're going to edit 1Gb texts it's a completely different story.

By Alex Besogonov (not verified) on 26 Jan 2009 #permalink

I know this is off topic but I am a masters candidate in mathematics, research area is graph theory, and a few fellow math students and I have just released a new podcast called Combinations and Permutations that I think that you and your audience may enjoy. Check it out in iTunes or Combinations and Permutations

I remember first encountering ropes when working on the ICFP 2007 contest (http://save-endo.cs.uu.nl/). The problem required fast manipulation of large strings, so it made a tremendous difference to get your data structures right.

I've found that it's convenient for some operations to cache the length of the string in each node. Substring operations, for example, may not need to traverse the whole tree.

Oh, and the other thing, of course, is that in practice you need to do some balancing on the spine of the rope to avoid degenerate behaviour.

By Pseudonym (not verified) on 27 Jan 2009 #permalink

Re #5:

The optimal length of nodes in a rope is going to be somewhat dependent on the specifics of the application. In general, you want to try to balance the benefits of having discontinuous storage with the cost of having to traverse pointers. In some very artificial experiments on a 64-bit linux machine, it looks like it's in the 256 character range for edit operations including frequent inserts. I expected it to be a bit higher than that - that's really a lot of fragmentation.

Re #6:

Yeah, caching node length is an obvious optimization. It can shift the balance of the ideal node length, because it adds to the size of the node, but that's not a big deal.

On the other hand, looking around at various rope implementations on the net, some people have tried to cache position. That's a losing proposition - the cost of maintaining that as you move things around is not trivial, and it means that you can't do node sharing.
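For illustration, the length-caching idea can be sketched as a hypothetical variant of the Rope type from the post (the names here are mine, not from the original code):

```haskell
-- Hypothetical variant of the post's Rope type that caches, in each
-- internal node, the total length of the text beneath it.
data SizedRope = SizedInternal Int SizedRope SizedRope
               | SizedLeaf String

ropeLen :: SizedRope -> Int
ropeLen (SizedInternal n _ _) = n
ropeLen (SizedLeaf s)         = length s

-- Concatenation stays O(1): the new cached length is just the sum
-- of the children's cached lengths.
concatSized :: SizedRope -> SizedRope -> SizedRope
concatSized l r = SizedInternal (ropeLen l + ropeLen r) l r

-- Indexing (and, by the same logic, substring) can now skip whole
-- subtrees by comparing against the cached left-subtree length.
charAt :: SizedRope -> Int -> Char
charAt (SizedLeaf s) i = s !! i
charAt (SizedInternal _ l r) i
  | i < ropeLen l = charAt l i
  | otherwise     = charAt r (i - ropeLen l)
```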

Ah, memories. We had some of these same concepts in Dystal, a language I implemented for a guy in the sociology department named Jim Sakoda. It was embedded in FORTRAN II. I was never clear on why this kind of processing was particularly important to sociologists.

I was once inspired by Pike's presentation of Plan 9 at an OS conference (Asilomar?) to go home and redesign the Secure Kernel I was implementing for MITRE/ARPA to follow its principles. Fortunately the project was canceled, as its philosophic contradictions might have caused CPUs to melt.

By Bob Munck (not verified) on 01 Feb 2009 #permalink

Mark:

Is there any chance you'd post the OCaml, Python and Java versions? It would be OK if they came with a disclaimer "Not fully tested" or something, but I think it'd make an interesting case study for comparison. I don't doubt that the Haskell version is the most readable -- this is the kind of task where Haskell shines.

-- Michael Chermside