Topic: Balanced Binary Trees (optional x-hour) Date: Oct. 22, 2009 Number: 14x We have seen binary trees, and in particular binary search trees. We have seen that insertion, search, delete, finding the max and min in a tree can be done in time proportional to the height of the tree. However, as we also saw, a binary tree can have height O(n) in the worst case. This occurs, for example, when all items are inserted in increasing order (there are many other such configurations, too). So, how do we insure that this doesn't happen? First idea: - Each time you insert or delete, "fix up" the tree so that it is always as close to balanced as possible. - Problem: This isn't as easy as it looks. Consider this tree: 50 / \ 30 70 / \ / 20 40 60 This is balanced just fine. Now, insert 10. A perfectly balanced tree with these keys would look like this: 40 / \ 20 60 / \ / \ 10 30 50 70 Note that not a single parent-child relationship is preserved between the original tree an the new, balanced one. So, it's possible that keeping maintaining balance could take O(n) time, on each update! That's no good, since the whole point was to make the dictionary operations run fast. Potential Solutions. We have a binary search tree, we could: 1. Give up "binary" -- allow nodes to have varying numbers of children. e.g., 2-3-4 trees, B-trees 2. Give up "perfect" -- keep it "close" to perfectly balanced. e.g. AVL trees, Red-Black trees - AVL trees keep the difference in height between the left and right subtrees of any node to be 0 or 1. We will first look at 2-3-4 trees, and then relate them to a structure called red-black trees, which are easier to implement. --- Idea of 2-3-4 tree: Intuition: - We'll allow multiple keys to be stored at each node. - A node will have one more child than it has keys: leftmost child -- all keys less than first key next child -- all keys between first and second keys ... etc ... last child -- all keys greater than the last key a a:b a:b:c / \ / | \ / / \ \ Rules for 2-3-4 Trees: 1. Every node has either 2, 3, or 4 children (1, 2, or 3 keys). 2. All the leaves of the tree (null pointers) are on the same level. To insert, you first go as far down the tree as you can. So we build a tree by inserting 41, 38, 31: 41 -> 38:41 -> 31:38:41 -> / \ / | \ / / \ \ In general, go down to a node with null children, and if a 2 or 3 node, expand to a 3 or 4 node, adding the new key in the right location. If the node that you get down to is a 4-node, SPLIT it into two 2-nodes, promote the middle key to join the parent node (creating a new root if the current root is split), and then continue following down into the correct 2-node. Note that promoting the middle key to the parent may add a fourth key to the parent, which then has to split, etc. Now let's insert 12: - No room in the 4 node, so split: 38 38 / \ / \ 31 41 now insert 12 --> 12:31 41 / \ / \ / | \ / \ Now let's insert 19: No problem, goes into the 3 node 38 / \ 12:19:31 41 / / \ \ / \ But now if we insert 8: - No room in the 4 node, so split, promoting 19 19:38 19:38 / | \ / | \ 12 31 41 now insert 8 --> 8:12 31 41 / \ / \ / \ / | \ / \ / \ Now insert 15 (unproblematic), and 17 (one split): 19:38 19:38 / | \ / | \ 8:12 31 41 -- 15 --> 8:12:15 31 41 -- 17 --> / | \ / \ / \ / / \ \ / \ / \ 12:19:38 / | | \ 8 15:17 31 41 /\ /|\ /\ /\ Now insert 33 and 35 (unproblematic): 12:19:38 12:19:38 / | | \ / / \ \ 8 15:17 31 41 8 15:17 31:33 41 /\ /|\ /\ /\ /\ /|\ /|\ /\ 12:19:38 / / \ \ 8 15:17 31:33:35 41 /\ / | \ / / \ \ /\ Now insert 20 (drives a double split): - We want to split 31:33:35, so we can insert 20, but we cannot, because 12:19:38 is already a 4-node (full). So first, we must split that one: 19 /------+------\ 12 38 << after first split (root) / \ /-----|-----\ 8 15:17 31:33:35 41 /\ / | \ / / \ \ / \ 19 /------+------\ 12 33:38 << after second split / \ / | \ 8 15:17 31 35 41 /\ / | \ /\ /\ /\ 19 /------+------\ 12 33:38 / \ / | \ 8 15:17 20:31 35 41 << 20 finally inserted /\ / | \ /|\ /\ /\ --- Problems with 2-3-4 nodes: This is tricky to implement. You could: - Have three different types of nodes, and keep creating new ones as you need to (complex to code, lots of allocations) - Have each node big enough to hold a 4-node (wastes a lot of space) The big advantage binary nodes have is that they're simple and small; you always need at least TWO children anyway, so you can keep track of what's on either side of a particular key. Around 1984, Guibas and Sedgewick came up with the following clever idea: - Translate each 2, 3, or 4-node into a miniature binary tree: - We'll "color" each vertex, so that we can tell which nodes belong together as part of a larger 2-3-4 tree node. 2-nodes: a -> a / \ / \ 3-nodes: a:b -> a OR b (Here a double // or \\ means / | \ / \\ // \ that the child is RED) b a / \ / \ 4-nodes: a:b:c -> b / / \ \ // \\ a c / \ / \ So, basically we have encoded each 2, 3, or 4 node using only binary nodes, but we've painted each node with a "color", either red or black. We paint a node red if it's part of a "bigger" node. This is called a RED-BLACK TREE. See tree above after inserting 33. Would look like: 12:19:38 / / \ \ 8 15:17 31:33 41 << 2-3-4 tree version /\ /|\ /|\ /\ 19 // \\ 12 38 / \ / \ 8 17 31 41 /\ // \ / \\ /\ 15 33 /\ /\ So, now, one simple way to think about Red-Black trees is to just apply the 2-3-4 tree rules, bearing in mind this encoding. --- Red-Black Properties But, you can also think of a Red-Black tree as a structure separate from a 2-3-4 tree. The rules are: 1. Every node is either RED or BLACK; by convention, the empty tree is a leaf, and is always considered BLACK. 2. The root is black. If an action colors it red, change it back to black. 3. For any node which is RED, both of its children are BLACK (this means: No two consecutive red nodes along any path) 4. Every path from any node to a descendant leaf has the same number of black nodes. Claim: - The Red-Black properties (1-4) mean that the depth of the tree is O(lg(n)) if there are n nodes in the tree. Informal justification: - Since every path from the root to a leaf has the same number of black nodes (by property 4), the shortest possible path would be one which has NO red nodes in it. - Suppose k is the number of black nodes along any path from the root to a leaf. Q: How many red nodes could there be? A: At most k. By property 3, anytime you get a red node, your next node must be black. You can only do that k times before you run out of black nodes. So, the LONGEST possible path is at most 2 times the length of the shortest, or h <= 2k It can be shown that if each path from the root to a leaf has k black nodes, there must be AT LEAST 2^k - 1 nodes in the tree. Since h <= 2k, i.e. k >= h/2, ... there must be at least 2^(h/2) - 1 nodes in the tree. If there are n nodes in the tree, that means n >= 2^(h/2) - 1 Adding 1 to both sides: n + 1 >= 2^(h/2) ...and taking the log (base 2) of both sides: lg(n+1) >= h/2 or 2 lg(n + 1) >= h, which is O(lg(n)). --- Inserting into a Red-Black Tree: As usual for a binary search tree, find the location where the new element goes, and insert it there. Initially, color it RED. This insures that rules 1, 2 and 4 are preserved. But rule 3 might be violated (a red node has black children). There are several cases (assume we're inserting a key x). 1. A 2-node (black node with black children) | | | a ==> a or a (becomes a 3-node) / \ // \ / \\ ? ? x ? ? x This is not problematic; no violation of rule 3. 2. A 4-node (black node with red children) | a // \\ b c / \ / \ << x to be inserted at one of these locations Here, we have to "split" the node, promoting the middle element to a higher point in the tree. But that's easy: Just join (a) with its parent, and "unjoin" (b) and (c) from (a). This amounts to a "color flip", e.g.: || a / \ b c / \ / \\ x << wherever x is, it's okay We haven't changed the black height of anything; the new node is RED, and we just switched the order of red/black in the subtree. So we haven't broken the other properties, and now rule 3 is okay here. But we must make sure the property hasn't been violated at a higher point in the tree! We'll come back to this. 3. A 3-node (black root with one red child) This case is a bit trickier. It depends where you are inserting: | | a a // \ or / \\ b <3> <3> b / \ / \ <1> <2> <2> <1> If you are inserting at <3>, no problem; you're basically creating a 4-node, adding x at the front or the back. No color problem: | | a a // \\ or // \\ << simple case, insert at <3> b x x b / \ / \ / \ / \ But if you insert at <1> or <2>, under the child, you get two red nodes in a row. So we have to fix that. Suppose you insert at position <1>. Then, you have two red nodes in a straight line: | | a a // \ or / \\ b <3> <3> b // \ / \\ x <2> <2> x Note in this case that x < b, and b < a. We could fix this operation by "rotating" this whole structure to move b up to the root of the structure, colored black with x on one side (colored red) and a on the other side (colored red): | | b b // \\ or // \\ x a a x / \ / \ <2> <3> <3> <2> Note that we haven't changed the number of black nodes along any path by doing this, and we've fixed the double red. And the sub- trees <2> and <3> are still ordered properly. This is called SINGLE ROTATION. - On the other hand, suppose you inserted at position <2>. Then you have two red nodes in a row, but in a "zig-zag" pattern, e.g.: | | a a // \ or / \\ b <3> <3> b / \\ // \ <1> x x <1> Now, the "middle" element is x, so we basically want to move x up to the root, and have b and a on either side, both colored red: | | x x // \\ or // \\ b a a b / \ / \ / \ / \ <1> <3> <3> <1> One way to think of this is to suppose that you do it as two single rotations -- once around b, and then around x. (Here it is shown for the left case only; the other one is symmetric: | | | a a x // \ // \ // \\ b <3> ==> x <3> ==> b a / \\ // \ / \ / \ <1> x b <1> <3> / \ <1> This is called DOUBLE ROTATION. The resulting 4-node will be the same regardless of which 3-node version you started from. --- Tree Maintenance Since a red-black tree is also a binary search tree, the time complexity of the `lookup' operation is O(h), which we just argued is O(lg(n)) in the worst case. Similarly, insertion works like in a binary search tree, except that we also have to fix up the color properties. But, even in the worst case, you only have to fix colors along the path from the new node to the root, so that's only a constant factor of additional work. Furthermore, it can be proved that you will never have to do more than one rotation or double-rotation to fix up the tree. All the other changes can be done with color-flips. So, insert is also O(lg(n)). Deleting from both these sorts of trees is a bit more tricky, but it doesn't increase the time complexity -- it can still be done in O(lg(n)) time, worst case. The additional complexity comes because you sometimes have to merge nodes together, e.g.: 10 / \ 7:8 12:13 / \ / \ To delete 10, we have to promote somebody to the root. We could take 12 for this, giving: 12 / \ 7:8 13 / \ / \ To delete 12, we can't take 13, because the remaining tree would violate the constraint that all leaves are at the same level. But we can take 8 (the largest in the left subtree): 8 / \ 7 13 / \ / \ Now, to delete 8, neither 7 nor 13 can be taken, because that would leave both nodes empty. So we have to "merge" them together, into a 3-node: 7:13 / | \ The situation with Red-Black trees is analogous, and leads to several cases. We will not look at the exact algorithm here. --- Synopsis: Binary search trees are a good data structure for implementing the Dictionary ADT, but their performance is best when they are kept in balance. - Keeping "perfect" balance turns out to be too much work. - But we can make it work if we either give up "binary" (have search trees with variable numbers of children), or give up "perfect" (allow trees to be "somewhat unbalanced") - 2-3-4 trees: Use the first strategy. All leaves at the same level, all paths the same length But clumsy to implement due to variable-size nodes - Red-Black trees: Use the second strategy. Encode 2-3-4 nodes as "mini-trees", nodes colored red to indicate they are conjoined with their parent. Use "rotations" and "color flips" to keep the tree in balance, takes no more than O(lg(n)) time. - Doing this, we get guaranteed O(lg(n)) runtime for all the Dictionary operations -- insert, lookup, delete.