
Pick the right data structure

Sun, Jan 25 2026

I was reminded yesterday of the importance of finding the right data structure for a problem. I have a custom HTML parser I've written, and part of the implementation involves detecting named character references. Named character references are things like `&amp;` or `&gt;`. The HTML spec defines the complete list. There are about 2,000 of them, and a good parser needs to recognize them and replace them with the Unicode bytes they represent.

There are a few things to keep in mind:

  • The only reasonable way to parse HTML is byte by byte. (Technically you iterate over Unicode scalar values, but for this explanation that detail does not matter.)
  • Some named character references are prefixes of others (e.g., `&notin;` and `&notindot;`), so you can't simply stop on the first match.

My initial implementation stored the named character references in a hardcoded map. Once I hit an ampersand, I would begin accumulating subsequent bytes into a temporary buffer. For each new byte, I would iterate through the keys of the map, looking for named character references that started with whatever was currently in my temporary buffer. This worked, so I moved on.

Yesterday, I finally put the library to real use. The HTML file in question, as loaded by a browser, happened to have ~2.2MB of source HTML/CSS/JS. Not that big, but much bigger than any input I had used while testing the HTML parser. Parsing was taking ~8 seconds.

pprof revealed the CPU was spending most of its time in the state that parses character references. This makes sense: I was looping through all the keys in the map for each byte inside that state. But it wasn't necessary to do all that work. All I actually needed was to check whether at least one key in the hardcoded map started with the contents of my temporary buffer, and stop scanning as soon as one was found. With that fixed, runtime was down to ~3 seconds. Progress, but pprof still showed the same function taking up most of the execution time.

There were no real improvements I could make without changing how I stored and accessed the character reference map. I needed a way to quickly identify which character references were prefixed by the current contents of the temporary buffer. I knew about tries and that they were supposed to solve this exact problem, but I was not familiar with them and had not implemented one before.

So I pasted my code into Claude and asked:

I've written an html tokenizer in go. From profiling, it seems a lot of time is spent in the named character state. Specifically iterating through the pre-built character reference map. What can I do to speed that up?

Claude:

Option 1:

Use a Trie Data Structure (Best Option). Replace the map with a trie (prefix tree) for O(n) lookups, where n is the length of the reference, not the number of entries:


type trieNode struct {
    children    map[rune]*trieNode
    replacement string // non-empty if this is a valid end point
}

type charRefTrie struct {
    root *trieNode
}

func buildCharRefTrie() *charRefTrie {
    root := &trieNode{children: make(map[rune]*trieNode)}
    
    for key, value := range characterReferenceMap {
        node := root
        for _, ch := range key {
            if node.children[ch] == nil {
                node.children[ch] = &trieNode{children: make(map[rune]*trieNode)}
            }
            node = node.children[ch]
        }
        node.replacement = value
    }
    
    return &charRefTrie{root: root}
}

I pasted this code verbatim into my project, adjusted the named character state, fixed an off-by-one error, and profiled again: 76ms. (There could be a whole other post here about how useful LLMs are for learning from first principles. This is not that post.)

The fix makes sense intuitively: by using a trie, I am slowly narrowing down the list of potential named character references with each byte, instead of looking at all of them over and over again. But a speedup from ~8,000ms to 76ms almost makes me ashamed of how wasteful the original implementation was. All because I picked the wrong data structure.
