Pick the right data structure
Sun, Jan 25 2026
I was reminded yesterday of the importance of finding the right data structure for a problem.
I have this custom HTML parser I've written, and part of the implementation involves detecting named character references.
Named character references are things like `&amp;` or `&gt;`.
The HTML spec defines the complete list.
There are about 2,000 of them, and a good parser needs to recognize them and replace them with the Unicode characters they represent.
There are a few things to keep in mind:
- The only reasonable way to parse HTML is byte by byte. (Technically you iterate over Unicode scalar values, but for this explanation that detail does not matter.)
- Some of the named character references are prefixes of other ones (e.g., `&notin;` and `&notindot;`). So you can't simply stop on the first match.
My initial implementation stored the named character references in a hardcoded map. Once I hit an ampersand, I would begin accumulating subsequent bytes into a temporary buffer. For each new byte, I would iterate through the keys of the map, looking for named character references that started with whatever was currently in my temporary buffer. This worked, so I moved on.
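A rough sketch of what that per-byte scan looked like. The three-entry map is a toy stand-in for the real ~2,000-entry map, and `matchingRefs` is my reconstruction of the idea, not the actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Toy stand-in for the real ~2,000-entry map.
var characterReferenceMap = map[string]string{
	"amp;": "&",
	"gt;":  ">",
	"lt;":  "<",
}

// matchingRefs returns every reference that starts with the bytes
// accumulated so far — a full O(entries) scan on every new byte.
func matchingRefs(buf string) []string {
	var candidates []string
	for key := range characterReferenceMap {
		if strings.HasPrefix(key, buf) {
			candidates = append(candidates, key)
		}
	}
	return candidates
}

func main() {
	fmt.Println(matchingRefs("g")) // [gt;]
}
```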
Yesterday, I was finally putting the library to real use. The HTML file in question was ~2.2MB of HTML/CSS/JS. Not that big, but much bigger than any input I had used while testing the parser. Parsing was taking ~8 seconds.
pprof revealed the CPU was spending most of its time in the state for parsing character references.
This makes sense: I was looping through all the keys in the map for each byte inside that state.
But it wasn't necessary to do all that work.
All I wanted to do was check whether at least one key in the map started with the contents of my temporary buffer, which means the scan can stop at the first match instead of walking the entire map.
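A sketch of that early-exit check, again with a toy stand-in for the real map:

```go
package main

import (
	"fmt"
	"strings"
)

// Toy stand-in for the real map.
var characterReferenceMap = map[string]string{
	"amp;": "&",
	"gt;":  ">",
	"lt;":  "<",
}

// anyRefHasPrefix answers the only question the parser actually needs:
// could *any* reference still match? It returns as soon as one key does,
// instead of building a full candidate list.
func anyRefHasPrefix(buf string) bool {
	for key := range characterReferenceMap {
		if strings.HasPrefix(key, buf) {
			return true // early exit: one match is enough to keep going
		}
	}
	return false
}

func main() {
	fmt.Println(anyRefHasPrefix("g"), anyRefHasPrefix("zz")) // true false
}
```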
With that fixed, runtime was down to ~3 seconds.
Progress, but pprof still showed the same function taking up most of the execution time.
There were no real improvements I could make without changing how I stored and accessed the character reference map. I needed a way to quickly identify which character references were prefixed by the current contents of the temporary buffer. I knew about tries and that they were supposed to solve this exact problem, but I was not familiar with them and had not implemented one before.
So I pasted my code into Claude and asked:
I've written an html tokenizer in go. From profiling, it seems a lot of time is spent in the named character state. Specifically iterating through the pre-built character reference map. What can I do to speed that up?
Claude:
Option 1: Use a Trie Data Structure (Best Option)

Replace the map with a trie (prefix tree) for O(n) lookups, where n is the length of the reference, not the number of entries:
```go
type trieNode struct {
	children    map[rune]*trieNode
	replacement string // non-empty if this is a valid end point
}

type charRefTrie struct {
	root *trieNode
}

func buildCharRefTrie() *charRefTrie {
	root := &trieNode{children: make(map[rune]*trieNode)}
	for key, value := range characterReferenceMap {
		node := root
		for _, ch := range key {
			if node.children[ch] == nil {
				node.children[ch] = &trieNode{children: make(map[rune]*trieNode)}
			}
			node = node.children[ch]
		}
		node.replacement = value
	}
	return &charRefTrie{root: root}
}
```
I pasted this code verbatim into my project, adjusted the named character state, fixed an off-by-one error and profiled again: 76ms. (There could be a whole other post here about how useful LLMs are for learning from first principles. This is not that post.)
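Claude's snippet only builds the trie; the tokenizer still needs a lookup that walks it and remembers the longest complete match (so `&notin;` beats the legacy `&not`). Here's a hypothetical, self-contained sketch of that walk — `longestMatch` and the two-entry map are my illustration, not the actual code:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Toy stand-in for the real map; the trie types mirror the snippet above.
var characterReferenceMap = map[string]string{
	"not":    "\u00AC", // ¬ (legacy form without a semicolon)
	"notin;": "\u2209", // ∉
}

type trieNode struct {
	children    map[rune]*trieNode
	replacement string // non-empty if this node ends a valid reference
}

type charRefTrie struct {
	root *trieNode
}

func buildCharRefTrie() *charRefTrie {
	root := &trieNode{children: make(map[rune]*trieNode)}
	for key, value := range characterReferenceMap {
		node := root
		for _, ch := range key {
			if node.children[ch] == nil {
				node.children[ch] = &trieNode{children: make(map[rune]*trieNode)}
			}
			node = node.children[ch]
		}
		node.replacement = value
	}
	return &charRefTrie{root: root}
}

// longestMatch walks the trie one rune at a time, stops as soon as no
// reference can match, and remembers the longest complete reference seen.
func (t *charRefTrie) longestMatch(s string) (ref, replacement string) {
	node := t.root
	for i, ch := range s {
		node = node.children[ch]
		if node == nil {
			break // no reference starts with s[:i+...]; stop immediately
		}
		if node.replacement != "" {
			ref = s[:i+utf8.RuneLen(ch)]
			replacement = node.replacement
		}
	}
	return ref, replacement
}

func main() {
	trie := buildCharRefTrie()
	ref, rep := trie.longestMatch("notin; rest of input")
	fmt.Printf("%s → %s\n", ref, rep) // notin; → ∉
}
```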
The fix makes sense intuitively: by using a trie, I am narrowing down the list of potential named character references with each byte, instead of looking at all of them over and over again. But a speedup from ~8000ms to 76ms almost makes me ashamed of how wasteful the original implementation was. All because I picked the wrong data structure.