Speeding up an HTML tokenizer
Fri, May 29 2026
I wrote and HTML parser, and a week or so ago I became dissatisfied with how the tokenizer was implemented. I made some changes:
Before:
$ go test --run '$^' --bench 'BenchmarkHtmlTokenizer' --benchmem --count=5
goos: linux
goarch: amd64
pkg: eze.works/go/gsx
cpu: 13th Gen Intel(R) Core(TM) i5-13400T
BenchmarkHtmlTokenizer-16 4 270036084 ns/op 198183348 B/op 3218584 allocs/op
BenchmarkHtmlTokenizer-16 4 271043490 ns/op 197710172 B/op 3211666 allocs/op
BenchmarkHtmlTokenizer-16 4 273574498 ns/op 197710152 B/op 3211666 allocs/op
BenchmarkHtmlTokenizer-16 4 276339333 ns/op 197710200 B/op 3211667 allocs/op
BenchmarkHtmlTokenizer-16 4 275311635 ns/op 197710160 B/op 3211666 allocs/op
PASS
ok eze.works/go/gsx 5.536s
After:
$ go test --run '$^' --bench 'BenchmarkHtmlTokenizer' --benchmem --count=5
goos: linux
goarch: amd64
pkg: eze.works/go/gsx
cpu: 13th Gen Intel(R) Core(TM) i5-13400T
BenchmarkHtmlTokenizer-16 6 176753957 ns/op 25887232 B/op 237496 allocs/op
BenchmarkHtmlTokenizer-16 6 174922978 ns/op 25578853 B/op 232892 allocs/op
BenchmarkHtmlTokenizer-16 6 174573735 ns/op 25578850 B/op 232892 allocs/op
BenchmarkHtmlTokenizer-16 6 176337642 ns/op 25579737 B/op 232893 allocs/op
BenchmarkHtmlTokenizer-16 6 185298979 ns/op 25578866 B/op 232892 allocs/op
PASS
ok eze.works/go/gsx 5.363s
The tokenizer now runs roughly 35% faster and consumes 87% less memory. The benchmark is tokenizing a 16mb file: the raw HTML contents of the full HTML spec page. Go ahead, try to open that link, your browser will probably hiccup I usually append /multipage to the link when I'm actually trying to reference the spec. . It's the only page that seems to give my browser a problem. A worthy candidate.
The benchmark is simple. The file is read into memory from disk, then I track the resource usage of tokenizing the whole thing. And by "track", I mean, I just call the function. Go makes it stupid easy to get started with profiling.
I'll explain the changes I made using the following HTML snippet as an example:
<lol< cLAsS\x00="box">&</lol">
Imagine that \x00
is a null byte (i.e. a zero byte). A spec compliant tokenier must interpret this as:
(tagstart) name=lol< attributes=[(class�, box)]
(text) content=&
(tagend) name=lol"
(eof)
No, you are not missing a font. That's the actual unicode replacement character appended to the attribute name as the spec requires.
For tokenizing HTML, you can't use zero copy techniques to avoid allocating token data. The contents of the returned tokens might not even exist in the input.
The most obvious structure for a returned token is:
type Attr struct {
Name string
Value string
}
type Token struct {
Kind TokenKind
Data string
Attrs []Attr
}
My original tokenizer implementation was creating new strings as the input was processed. It would then return a []Token
at the very end.
This worked fine, and was not a problem for my usage. But unfortunately I have enough knowledge of how memory works to be disgusted by the repeated allocations the runtime is likely making. Additionally, from what I understand about the Go garbage collector, it needs to scan through all pointers in the program to figure out what objects are still "alive". This seems to imply that generally The Go runtime is complex enough that this is not always true. The only way to be sure is to measure and investigate. , the more pointers the program has, the more work the garbage colector has to do.
My tokenizer was materializing a huge slice of Token structs, filled to the brim with string pointers to arbitrary places on the heap. I started to feel bad for the machine.
The first change I made was to avoid returning the full list of tokens all at once. I made the tokenizer function like an iterator; returning one token at a time. The complete list is not necessary because the parser only ever processes one token at a time I think this change is worth doing in the context of a GC like go's that traces pointers. Fewer pointers for the GC to trace is a good thing. In the absence of a shadow process using resources to scan your heap, this change might not be as beneficial? Idk. .
The second change was to stuff all the token data into a contiguous buffer. This buffer is likely to be slightly smaller than the original input, so you can pre-allocate the capacity ahead of time This is actually important. The returned tokens keep the backing buffer alive. If in the middle of tokenizing, the buffer needs to be reallocated, you'll end up with two or more big buffers hanging out in memory. . This coalesces thousands of little allocations into one big one. So in our example, the tokenizer would gradually accumulate a single buffer whith the following contents:
lol<class�box&lol"
The strings within the tokens then become pointers into this buffer, at different offsets. To do this, the tokenizer would need to keep track of where the current token's data starts and ends in this big buffer. Then when a token is about to be emitted you do:
data := buf[start:end]
ptr := unsafe.SliceData(data)
return unsafe.String(ptr, len(data))
You can then return the same Token structure outlined earlier, but the strings are now all pointing into the same allocation. This incantation is necessary because in almost all cases using the natural string(buf[start:end])
will allocate, defeating the purpose of the buffer. Using unsafe.String here is safe to do because the buffer is append only, so the underlying bytes that form the string do not change after creation.