The worst data structure in existence, and why you should never use it

Josh Weinstein
6 min read · Jul 22, 2024



Out of all the possible data structures developers can choose to use in their line of work, there is one that absolutely must be avoided. The sheer amount of memory and disk space it can require is simply ridiculous. It offers hardly any advantages over commonly used data structures like B+ trees or hash maps. And it will definitely bloat or slow down your application.

That data structure is a trie.

A trie is a tree data structure in which the position of a node within the tree itself encodes the key. Tries are commonly used to track prefixes of strings or sets of strings, but the key of a trie can be any sequence of integers. Each node of a trie represents one character or integer of the key, and each child of a node represents one of the next possible characters or integers in the key. An example can be found below.

A trie of the sequences “ac” and “c”

As you might notice above, each node of the trie has some N slots for children, but not all of them are occupied or pointing to child nodes. Although this can make a trie fast, since an integer or character in the key directly indexes the next child node, it causes a massive increase in space or memory usage as the trie grows larger.

To start, similar to the above, a trie with the sequences “ac” and “c” uses 4 nodes. Accounting for memory alignment and padding, we can assume such a node takes at minimum around 12 bytes of memory. This means the trie uses 48 bytes of memory to store a cumulative total of around 3 bytes of keys.

The rationale behind this excess storage is that both the insert and the lookup complexity of a trie become O(k), where k is the length of the key. The structure of the trie also allows for prefix queries, meaning that finding all keys which begin with some prefix is O(k), where k is the length of the prefix. These advantages come with a big cost, though, and that cost is space.
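
To make those O(k) operations concrete, below is a minimal in-memory sketch of a trie over lowercase ASCII keys. This is not the benchmark code used later in this article; the names (TrieNode, Trie, has_prefix), the fixed fan-out of 26, and the use of smart pointers are assumptions made purely for illustration.

```cpp
#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children{};  // one slot per letter
    bool is_key = false;  // true if a key ends at this node
};

class Trie {
public:
    // O(k) insert: walk (and create) one node per character of the key.
    void insert(const std::string& key) {
        TrieNode* node = &root_;
        for (char c : key) {
            auto& slot = node->children[c - 'a'];
            if (!slot) slot = std::make_unique<TrieNode>();
            node = slot.get();
        }
        node->is_key = true;
    }

    // O(k) lookup: follow one child slot per character.
    bool contains(const std::string& key) const {
        const TrieNode* node = walk(key);
        return node != nullptr && node->is_key;
    }

    // O(k) prefix query: every key stored below the returned node shares the prefix.
    bool has_prefix(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }

private:
    const TrieNode* walk(const std::string& s) const {
        const TrieNode* node = &root_;
        for (char c : s) {
            node = node->children[c - 'a'].get();
            if (!node) return nullptr;
        }
        return node;
    }

    TrieNode root_;
};
```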

Tries consume space in two ways. The first is the unused or unoccupied child slots on each node. The more possible characters in the keys for the trie, the more child slots on each node. For example:

  • DNA base pairs A, C, G, T: 4 child node slots
  • The upper- and lower-case English alphabet: 52 child node slots
  • All possible 8-bit numbers: 256 child node slots

As you can see, as the breadth of possible integers within a key increases, the node size increases. If we assume each child slot holds a 64-bit pointer or equivalent, a 256-slot node becomes 2 KB! This is not the only way tries increase in size though.
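
A quick back-of-the-envelope check of that growth, assuming one 64-bit pointer per child slot on a typical 64-bit platform (the template and constants below are purely illustrative):

```cpp
#include <cstddef>

// One node holds one pointer per possible next character in the key.
template <std::size_t Fanout>
struct Node {
    Node* children[Fanout];
};

static_assert(sizeof(Node<4>) == 4 * 8,     "DNA bases A, C, G, T: 32 bytes");
static_assert(sizeof(Node<52>) == 52 * 8,   "upper- and lower-case letters: 416 bytes");
static_assert(sizeof(Node<256>) == 256 * 8, "all 8-bit values: 2048 bytes, i.e. 2 KB");
```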

Tries also bloat space and memory as keys get longer. Longer keys make the trie a deeper and wider tree at the same time. Unlike other tree data structures, tries have very few central or shared paths among all keys. To illustrate this concept, consider the following list of keys:

  • ABCDEF
  • BBCDEF
  • CBCDEF
  • DBCDEF
  • EBCDEF
  • FBCDEF

There are 6 keys, each of which differs only by the first character, and each of which is 6 characters long. If one were to insert these keys into a trie, that trie would end up looking something like this, a trie with 31 nodes:

Trie with some child node slots not shown

This means that with maximal entropy, or “distance,” between keys, the trie becomes very wide very fast. If one inserted 100 keys beginning with C, and assuming those are unique keys, that would increase the number of nodes in the trie by at least a few hundred. But those few hundred new nodes would not be shared with, or reused by, keys starting with the other characters. To put this in terms of actual performance, though, we can build a benchmark around a real-world use case for a trie.

The Test

Databases need highly performant indexes over sets of data to make those data sets available for queries or other operations. The most commonly used types of indexes are B+ trees and log-structured merge (LSM) trees, optimized for reading and writing, respectively. Traditionally, tries are not used for database indexes because they require a larger number of pages (equal-sized portions of a database held between disk and memory). One of the goals of any reasonable database index is that it should not exceed the size of physically available RAM.

For this benchmark, using C++, we can make a trie that functions very similarly to a database index. The data of the index will be written to and read from disk, and will be split into pages. It will support only two operations: an insert operation and a contains operation. Every insert operation that adds a new node to the trie will also sync the respective page that node lives on back to disk.

To store the trie on disk, it must be laid out across different pages. If we consider that a page is some N bytes, we can divide each page into M blocks of storage. Each block will contain exactly 1 node, and the data of each node is the offsets of its child nodes within the file. An illustrated example is below:

The trie format on disk.

If the page size is 4096 bytes, and the range of the trie is 4 (meaning the values in any key are 0, 1, 2, or 3), we can determine the other storage sizes as follows:

  • The size of any node, assuming 64-bit offsets, is 32 bytes.
  • The number of blocks per page will be 4096 / 32 == 128.
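
Sketched in code, that layout looks roughly like the following. The struct and constant names here are illustrative rather than the actual benchmark code, and 64-bit file offsets are assumed:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;  // bytes per page
constexpr std::size_t kFanout   = 4;     // a key value can only be 0, 1, 2, or 3

// One block holds one node; a node's data is just the file offsets of its
// children, where an offset of 0 means "no child".
struct DiskNode {
    std::uint64_t children[kFanout];
};

static_assert(sizeof(DiskNode) == 32, "each node occupies one 32-byte block");
static_assert(kPageSize / sizeof(DiskNode) == 128, "128 blocks fit in a 4096-byte page");
```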

For the purposes of this benchmark, we will also use keys that are both 100 characters long and randomized. This gives some insight into how a trie fares against long keys that are distant from one another. High-entropy, long keys present the hardest case for a trie to handle, which will really put it to the test.

The format of the test will work as follows:

  • Generate 1,000,000 random sequences of length 100, using only (0, 1, 2, 3)
  • Insert 1,000,000 random sequences into an empty trie, with a page size of 4096 bytes

The benchmark will only use a single thread and a single file, so there is no concurrency or contention at play here.
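
A rough sketch of what that driver might look like is below. Only the key generation and the single-threaded loop are shown; the paged trie index itself is elided, and the names and the fixed seed are assumptions for illustration rather than the actual benchmark code:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t kNumKeys   = 1'000'000;
    constexpr std::size_t kKeyLength = 100;

    std::mt19937_64 rng(42);                         // fixed seed for repeatability
    std::uniform_int_distribution<int> digit(0, 3);  // key values are 0..3 only

    // Generate 1,000,000 random sequences of length 100.
    std::vector<std::vector<std::uint8_t>> keys(kNumKeys);
    for (auto& key : keys) {
        key.resize(kKeyLength);
        for (auto& value : key) value = static_cast<std::uint8_t>(digit(rng));
    }

    // Single thread, single file: insert every key and time the whole loop.
    // for (const auto& key : keys) index.insert(key);  // paged trie index elided
    return 0;
}
```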

The benchmark code was compiled with optimizations and run on the following specifications:

  • 8GB RAM
  • Apple M1 Pro
  • Apple clang version 14.0.3 (clang-1403.0.22.14.1)
  • Target: arm64-apple-darwin23.3.0

The results were… really bad.

Averaged across 30 runs, the time and space results for one million keys were:

  • 25.43 seconds
  • 704,034 pages used
  • 2.9 GB in disk / memory space used

If we compare the total size of the keys to the space used by the trie, that ratio is 100 MB / 2900 MB == 1/29. This means the index takes 29 times as much space as the keys by themselves do, along with being pretty slow in terms of operations. This is pretty bad performance and a bad use of space. The promises of tries are simply not worth it!

The Alternatives

But let’s say you still have a use for string-based prefix queries in your work as a developer. What do you do?

A possible alternative is the radix tree. The radix tree, also called a Patricia tree, is a space-optimized version of the trie that condenses prefixes or portions of keys into as few nodes as possible. Instead of each child node representing one integer or character, child nodes represent longer segments of a key, such that the tree uses as little breadth and depth as possible.

A radix tree example (credit: Wikipedia)

Radix trees are harder to insert into because of the need to maintain the optimized separation of key segments, but they solve the problem of space and memory bloat that tries have.
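
To make the difference concrete, here is a small sketch of a radix-tree lookup, where each edge carries a whole key segment instead of a single character. Insertion, which has to split segments, is omitted, and the structure and names are illustrative rather than any particular library’s API:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct RadixNode {
    // Each outgoing edge is labelled with a whole key segment, keyed by the
    // segment's first character for quick dispatch.
    std::map<char, std::pair<std::string, std::unique_ptr<RadixNode>>> edges;
    bool is_key = false;  // true if a key ends at this node
};

bool contains(const RadixNode& root, const std::string& key) {
    const RadixNode* node = &root;
    std::size_t pos = 0;
    while (pos < key.size()) {
        auto it = node->edges.find(key[pos]);
        if (it == node->edges.end()) return false;
        const std::string& segment = it->second.first;
        // The rest of the key must start with the entire edge segment.
        if (key.compare(pos, segment.size(), segment) != 0) return false;
        pos += segment.size();
        node = it->second.second.get();
    }
    return node->is_key;
}
```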

Written by Josh Weinstein

I’m an engineer on a mission to write the fastest software in the world.
