> No software decodes data by reading a stream byte-by-byte.

Maybe not in your bubble, but in my bubble this is the common case, and highly optimized components using SIMD parsing are the exception.

By byte-by-byte, I presume you mean logically, not invoking a syscall for every byte. A parsing library is often handed a buffer to parse. If every layer had its own buffering, you'd end up with too much data copying, which compounds and can spill your CPU caches in ways that microbenchmarks won't reflect, particularly in a streaming pipeline. So you stick to simpler, straightforward code and data paths unless and until you know a particular component is a bottleneck. I've had much better results optimizing globally first and locally second. You can usually go back and optimize locally whenever you want, but optimizing globally is more often a one-shot deal, as it typically requires non-local (i.e. cross-component) analysis and refactors, which nobody has time for.
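To make "logically byte-by-byte" concrete, here's a minimal sketch of the kind of parser I mean. The wire format (unsigned LEB128 varints) and the name `decode_varints` are just assumptions for illustration, not anything from a specific library. The caller owns the buffer, and the parser walks it one byte at a time without adding a buffering layer or copying the data.

```python
def decode_varints(buf):
    """Decode consecutive unsigned LEB128 varints from a caller-owned buffer.

    Hypothetical example format; the point is the access pattern, not the codec.
    Returns (values, consumed), where consumed is the number of bytes that
    formed complete values. Any trailing partial varint is left for the caller
    to re-present once more data arrives.
    """
    view = memoryview(buf)  # zero-copy window over the caller's buffer
    values = []
    value = 0
    shift = 0
    consumed = 0
    for i, byte in enumerate(view):  # one logical step per byte, no syscalls
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation bit set: more bytes belong to this value.
            shift += 7
        else:
            # Last byte of this value: emit it and reset the accumulator.
            values.append(value)
            value = 0
            shift = 0
            consumed = i + 1
    return values, consumed


if __name__ == "__main__":
    # 0x01 -> 1, 0xAC 0x02 -> 300, trailing 0x80 is an incomplete varint.
    print(decode_varints(bytes([0x01, 0xAC, 0x02, 0x80])))
```

Returning `consumed` instead of stashing the leftover bytes internally is the design choice that matters here: whatever layer did the read keeps the only buffer, so nothing gets copied a second time on the way through.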