Interesting you mention http headers. I had a program converted Python 2 -> Pyth...

takeda · on Jan 13, 2020

I believe the headers are encoded using ISO-8859-1 not Unicode. That encoding has 1:1 mapping with bytes so wouldn't break this way. Treating them as UTF-8 was the bug.

code_biologist · on Jan 13, 2020

This is exactly the sort of encoding issues that the python 2 to 3 transition has flushed out. People get frustrated with python 3, yet the actual failure was their mishandling of encoding issues -- papered over by python 2.

xorcist · on Jan 13, 2020

But that's not what frustrates people with the transition. It's that they suddenly get encoding issues where there should have been no encoding to begin with!

cbsmith · on Jan 13, 2020

No observed encoding issues.

CJefferson · on Jan 13, 2020

When I treated headers as bytes, there wasn't an "encoding".

What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

cbsmith · on Jan 14, 2020

> When I treated headers as bytes, there wasn't an "encoding".

If you are representing strings as bytes, you are intrinsically using an encoding.

> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.

But yes, this strategy largely avoids encoding issues... until it doesn't.

Dylan16807 · on Jan 14, 2020

> If you are representing strings as bytes, you are intrinsically using an encoding.

It's just binary data that might resemble a string. No encoding necessary.

fiedzia · on Jan 14, 2020

This is false more often than not. Many programs taking user input will treat it as string, assuming specific encoding or compatibility with screen output/some api, at least in some code paths. For example if you print an error message when you can't open some file, you are very likely to assume its encoded in a way terminal can handle, so its no longer "just binary data".

CJefferson · on Jan 14, 2020

Yes, I have to worry about how to make a "best effort" to show it to users, but in all internal code paths it must stay as "just binary data", else I lose information. This is exactly how chrome and Firefox handle headers internally.

cbsmith · on Jan 15, 2020

It might resemble a particular encoding of a string... and the way you got that string to that particular sequence of bytes is by... encoding it.

Dylan16807 · on Jan 15, 2020

> and the way you got that string to that particular sequence of bytes

No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.

cbsmith · on Jan 16, 2020

In that context, you aren't using strings. You are using bytes. HTML without interpreting it as strings isn't really HTML, nor is it a string. It's just a blob that is passing through.

takeda · on Jan 14, 2020

> When I treated headers as bytes, there wasn't an "encoding".

oh, actually there was (either us-ascii or more likely iso-8859-1) the bytes are just values 0-255 what these values mean is the encoding. You're confused because the encoding was implicit, rather than explicit.

It would perhaps be clearer to see it if you for example had to chose if you use ASCII or legacy EBCDIC encoding.

CJefferson · on Jan 13, 2020

I'll admit, I'm not positive what the encoding should be. However there is a bunch of people who do clearly send UTF-8, and I can also promise you there are headers out there which just have binary nonsense in them. See for example https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

If you want to handle all headers, you have to be prepared to just get binary data.

takeda · on Jan 14, 2020

Yes, and using ISO-8859-1 is the way to handle them without issues. You will never get error when decoding it that way. If you are using UTF-8 there are character combinations that are invalid.

dnautics · on Jan 13, 2020

> It is a stupid client which doesn't send valid ascii for http headers of course.

...or a smart malicious actor.