I've been involved in multiple non-trivial libraries and frameworks that support...

		ploxiln on Jan 13, 2020 \| parent \| context \| favorite \| on: Mercurial’s journey to and reflections on Python 3 I've been involved in multiple non-trivial libraries and frameworks that supported both python2 and python3 for many years with the same codebase ... and it really wasn't anything like this. The python3 "adaptation" effort for mercurial was just bungled by multiple terrible decisions. First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"". But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes is a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tryied to force explicit bytes everywhere (dict keys, http headers, etc) and was really tilting at windmills for most of these past 5 years. Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.

pdonis on Jan 13, 2020 | [–]

> The natural string type is the right type in many places

For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.

I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

jharsman on Jan 13, 2020 | | [–]

I was an early adopter of Mercurial and the teams insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support.

For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non ASCII characters in their names because Mercurial couldn't handle it. At least on Windows file names would either be broken in Explorer or in the command line.

In fact I just checked and it is STILL broken in Mercurial 4.8.2 which I happened to have installed on my work laptop with Windows. Any file with non ASCII characters in the name is shown as garbled in the command line interface on Windows.

I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.

pdonis on Jan 13, 2020 | | | [–]

> the teams insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.

hsivonen on Jan 14, 2020 | | | [–]

The Rust standard library has a solution for this that actually works: On Unix-like systems file paths are sequences of bytes and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API users sees a sequence of bytes and most of the time they match UTF-8.

This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.

pdonis on Jan 14, 2020 | | | [–]

> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.

simias on Jan 13, 2020 | | | | [–]

Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?

SSLy on Jan 13, 2020 | | | [–]

On windows? Of course you can.

edgyquant on Jan 14, 2020 | | | [–]

I'm certain you can on Linux as well. Only Macs old HFS would not allow it.

cataflam on Jan 14, 2020 | | | | [–]

Isn't this a fairly recent change?

amaranth on Jan 14, 2020 | | | [–]

NTFS has always been case sensitive, Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile` you can make files that differ only in case.

mathw on Jan 14, 2020 | | | [–]

Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!

rurban on Jan 15, 2020 | | | [–]

Even worse on Unix where it depends on the mount type. Haven't seen much proper long filename support in Unix apps or libs, it's much better in Windows land. Garbage in garbage out is also a security nightmare as names are not identifiable anymore. You can easily spoof such names.

gpderetta on Jan 13, 2020 | | | | [–]

Hum, any program that doesn't treat filenames as bytestreams on unix is broken. Doubly so if its primary purpose is preserving and archiving files.

Are you sure the issue wasn't something else?

lmm on Jan 14, 2020 | | | [–]

The point is that filenames aren't bytestreams on windows, and if you treat them as such then your program won't work.

WorldMaker on Jan 14, 2020 | | | [–]

By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)

lmm on Jan 14, 2020 | | | [–]

> By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.

WorldMaker on Jan 14, 2020 | | | [–]

Sure, but that's also not the only resulting option; instead of erroring you could also do something nice like help those users migrate to cleaner Unicode encodings of their filenames by asking them to correct mistakes or provide information about the original encoding. It takes more code to do that than just throwing an error, of course, but who knows how many users that might help that don't even realize why their repositories don't work correctly on, say, Windows.

Dylan16807 on Jan 14, 2020 | | | | [–]

Windows filenames basically are bytestreams. But the bytes come in pairs.

lmm on Jan 15, 2020 | | | [–]

Not really. Certain byte sequences are invalid.

Dylan16807 on Jan 15, 2020 | | | [–]

Certain byte sequences are invalid in unix filenames too. So that can't be the factor that decides if they are bytestreams or not.

xorcist on Jan 13, 2020 | | | | [–]

If hg borked on non-ascii characters, it sounds like the problem was rather that it didn't treat that data as a bag-of-bytes. Not the other way around?

ploxiln on Jan 13, 2020 | | | [–]

He was trying to use Windows. For Windows, you pretty much have to go through unicode to utf-16, can't be arbitrary bytes, can't be utf8.

(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)

Dylan16807 on Jan 14, 2020 | | | [–]

Windows uses arbitrary shorts that are sort of supposed to be utf-16. Just like Unix uses arbitrary bytes that are sort of supposed to be utf-8.

You have to convert between them, but neither uses proper Unicode to represent filenames.

cbsmith on Jan 13, 2020 | | | | [–]

Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.

mynegation on Jan 14, 2020 | | | [–]

Everything is bytes but the meaning assigned to bytes, matters. Let’s say I create a file named «Файл» on Unix in UTF8 and put it into git repo. For Unix it is a sequence of bytes that is representation of Russian letters in UTF8. So far so good. Now I clone this repo to Windows, what should happen? The file can not be restored with the name as encoded into bytes on Unix, that will be garbage (that even has a special name “Mojibake”) in the best case or fail outright in the worst. What should happen is decoding of those bytes from UTF8 (to get original Unicode code points) Into Unicode code points, then encoding using Windows native encoding (UTF-16).

mikepurvis on Jan 14, 2020 | | | [–]

True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.

cbsmith on Jan 14, 2020 | | | | [–]

Everything isn't bytes. Strings without an encoding don't have a specific byte representation.

takeda on Jan 14, 2020 | | | [–]

It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.

cbsmith on Jan 15, 2020 | | | [–]

We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.

lmm on Jan 14, 2020 | | | | [–]

UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.

cbsmith on Jan 14, 2020 | | | [–]

Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.

masklinn on Jan 13, 2020 | | | [–]

> For many programs, yes.

For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else[0], which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.

> Repository data is bytes, not Unicode.

It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.

[0] though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well

pdonis on Jan 13, 2020 | | | [–]

> For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

This is a problem with the Python 3 standard library; in many places it requires Unicode when it shouldn't.

takeda on Jan 14, 2020 | | | [–]

This is a really bad way of thinking. The distinction in Python 3 is between text (str) and bytes.

str is not Unicode in fact if you don't use fancy characters internally it stores text as a byte array.

You should think of text the same as of image or sound, what you see in the screen or hear in the speaker is the actual thing, but if you need to save it on disk you encode it as for example png or wav.

owl57 on Jan 14, 2020 | | | [–]

You can just read "requires text when it shouldn't". But I don't recommend this terminology: in most modern computer programs, including Python 3 implementations, "text" and "Unicode" mean the same thing, but outside of this context Unicode is just more precise: sometimes "text" means ASCII and sometimes it means things non-represantable in current version of Unicode.

pdonis on Jan 14, 2020 | | | | [–]

> The distinction in Python 3 is between text (str) and bytes.

Feel free to s/Unicode/str/ in what I posted if you prefer that terminology. The problem is still the same.

An example of the problem: Python's standard streams (stdin|out|err) in Python 2 are streams of bytes, but in Python 3 they're streams of Unicode (or str if you prefer that terminology) characters. The problem is twofold: first, if my standard streams are hooked to a console, Python can't always properly detect the encoding of the bytes coming from the console, so it can give me the wrong Unicode characters; second, if my standard streams are hooked to pipes, there is no encoding it can pick that is right, since the bytes aren't even coming from a console (where at least there is some plausible argument for saying the user meant to type Unicode characters, not bytes). What Python 3 should have done was keep the standard streams as bytes, since that's the only common denominator you can rely on, and then let the application decide how to decode them if it decides it needs to, just as in Python 2.

takeda on Jan 14, 2020 | | | [–]

I believe the behavior is correct though. Python uses encoding specified through LANG/LC_* which is the encoding that supposed to be used, and all properly behaved applications use it.

If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version. Most people will use it for text, so the defaults make sense. Personally I would like if there was no automatic conversion when using files/network/pipes etc. but I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

pdonis on Jan 15, 2020 | | | [–]

> Python uses encoding specified through LANG/LC_

Yes, that's the best you can do, but it's still not always correct. I agree that it should be, but "should be" and "is" aren't always the same.

> If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version.

Yes, but there are still standard library functions that will use the regular streams, and that might conflict with what your application is doing. There is no way to tell Python as a whole "use binary streams everywhere because they are pipes for this application".

> Personally I would like if there was no automatic conversion when using files/network/pipes etc.

That would work if (a) Python could always detect that condition (it can't) and (b) the entire standard library adjusted itself accordingly.

> I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

Python 2 worked fine with the standard streams being binary, and applications wrapping them to decode to Unicode when necessary. Python 2.7 even back ported the TextIOWrapper and similar classes to make the wrapping as simple as possible. A similar approach could have been taken in Python 3 (binary streams and a simple wrapper class), but it wasn't.

masklinn on Jan 14, 2020 | | | | [–]

Complaining that the world is not as it should be does not solve the issue.

ploxiln on Jan 13, 2020 | | | [–]

Repository data bytes does not show up as string literals in your code, or keyword argument names, or http header names. The vast majority of code involved in this struggle is misc business logic, not repository tracked file contents itself.

reubenmorais on Jan 13, 2020 | | | [–]

Python 3's approach means bytes/str poisons the whole expression. So if you want to do something like:

"%s/%s" % (repository_data_1, repository_data_2)

And have it work on Python 2 and 3, you're screwed.

luhn on Jan 13, 2020 | | | [–]

And Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things. Python 2 would let you do that, and it would often cause subtle bugs with non-ASCII data. Python 3 requires you to encode/decode, so you're working consistently and explicitly with binary or text.

I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

reubenmorais on Jan 14, 2020 | | | [–]

> I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

We're discussing the linked article, so I'm talking in the context of the linked article. I know it works now, but Python 3 initially removed %-formatting for bytes. I guess I should have used past in my comment, "you were" screwed instead of "you are". From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.

pdonis on Jan 14, 2020 | | | | [–]

> Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things.

Python 3's behavior as far as forcing you to explicitly recognize data type conversions is more correct, yes.

Python 3's behavior in assuming that nobody would ever need to do "text-like" operations like string formatting on byte sequences was not. At least this particular wart was fixed. But there are still a lot of places where Python makes you use the str "textual" data type when it's not the right one.

Python 3's behavior in making individual elements of a byte string integers instead of length-one byte strings is, frankly, braindead.

acdha on Jan 14, 2020 | | | | [–]

That example works fine in both Python 2 and 3 if you’re not mixing types incorrectly. If you are, it will appears to work on Python 2 before failing the first time you encounter non-ASCII data and tends to greatly confuse people with errors which would have been caught immediately on Python 3. I’ve seen teams waste hours trying to track down errors like that.

zo1 on Jan 14, 2020 | | | [–]

Exactly this. The amount of times I saw juniors fixing thses sort of obscure subtle bugs with str_var.decode("utf-8").encode("latin-1") and this after attempting every which combination of the above two de/encode operations is mind boggling.

reubenmorais on Jan 14, 2020 | | | | [–]

It works after Python 3.5. From the article:

takeda on Jan 13, 2020 | | | [–]

The rule of thumb (not just for Python, but anything that deals with encoding) is to use binary encoding at the bounds of your program (reading/writing files, sending/receiving data over network etc) it applies to everything including tools like this. If you follow it your life will be simpler.

You just need to be aware that in some cases the work is already done for you by the language, for example in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

masklinn on Jan 13, 2020 | | | [–]

> You just need to be aware that in some cases the work is already done for you by the language, for example in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least possible convenient time and on the devices least accessible for debugging.

Do not ever use text-mode `open` without specifying an encoding.

mintplant on Jan 14, 2020 | | | [–]

Node.js tries to be helpful in defaulting file writes to UTF-8, but defaults file reads to returning a raw byte buffer [0]. So you have to either remember to treat the two operations differently, or, like in Python, manually specify the encoding for both.

[0] I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.

takeda on Jan 14, 2020 | | | | [–]

The locales are provided by LANG and other locate variables, so Python will use whatever is set in environment, you can also specify the encoding in one of open() parameters.

masklinn on Jan 14, 2020 | | | [–]

> The locales are provided by LANG and other locate variables

Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.

> you can also specify …

And what I’m saying is this is not a “can also” it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.

takeda on Jan 14, 2020 | | | [–]

Oh I see your point. Looks like they changed the behavior in 3.7 (they added -X UTF-8 option), but being able to set it from the application would be great.

slavik81 on Jan 13, 2020 | | | | [–]

> in python if you open a file (without "b" option, the python will do the translation on the fly and you don't need to worry about it)

Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.

I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.

mark-r on Jan 13, 2020 | | | [–]

Every Python program should be tested with Emoji characters, they're a real torture test.

slavik81 on Jan 14, 2020 | | | [–]

Note that you need to test on every platform, as the default file encoding may vary. I missed that bug in part because it worked correctly on Linux.

mark-r on Jan 14, 2020 | | | [–]

Good point. I do almost all of my Python on Windows where it's much easier to get an error.

WorldMaker on Jan 14, 2020 | | | | [–]

Every program in general should be tested with Emoji characters at this point.

mark-r on Jan 14, 2020 | | | [–]

Not a bad idea, but I think Python is more likely to have hidden bugs that this will uncover. A language that accepts bytes as input and emits the same on output will probably work fine on UTF-8 for example.

WorldMaker on Jan 14, 2020 | | | [–]

That's the Python 2 mentality and a large part of this discussion was that it didn't work in hindsight, that you can't just be "encoding oblivious", but it usually doesn't show up as an obvious problem until you least expect it. Our input and output devices are aren't always homozygous on byte encoding (and quite possibly very rarely are; we have decades and decades of kludges around this), and testing every program with Emoji has become one of my favorite pastimes for finding failure cases.

takeda on Jan 14, 2020 | | | | [–]

It defaults to the system encoding. I don't use Python on Windows, but Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8. Perhaps Python needs to be updated to reflect that?

You can also specify encoding when calling open.

Dylan16807 on Jan 14, 2020 | | | [–]

> Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8.

They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.

They put together a code page for UTF-8 but it's behind a 'beta' warning.

ygra on Jan 15, 2020 | | | [–]

> They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.

> They put together a code page for UTF-8 but it's behind a 'beta' warning.

Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does work sort of for applications that don't do too weird stuff to text, though, and can be a useful workaround in such cases to get UTF-8 support into legacy applications.

But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.

slavik81 on Jan 14, 2020 | | | | [–]

The default file encoding for Windows was changed to UTF-8 in Python 3.6. That particular problem on that particular platform is now a thing of the past.

It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.

cbsmith on Jan 13, 2020 | | | [–]

To be fair... the problem was more in Python 2 where this stuff was often conflated. Python 3 really just brought the problem in to stark relief.

TBH I do think the problem is easier to address in a statically typed world.

speedplane on Jan 15, 2020 | | | [–]

> I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

The entire 2 to 3 transition is an excellent illustration of Python developers failing properly recognize the challenges in transition. What other popular language intentionally broke backwards comparability? It's hard to think of any.

Python set the entire community back 10 years or more by making this drastic mistake.

simias on Jan 13, 2020 | | [–]

It might be my own pro-typed-language bias showing but this migration from byte strings to unicode strings is really where dynamically typed languages really don't shine.

If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you'll have fixed all of the errors in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.

In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...

And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.

monoideism on Jan 14, 2020 | | [–]

> added unicode as an afterthought like Python did

I agree with you on the benefits of static typing, but let's clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard, by almost a year.

Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.

WorldMaker on Jan 14, 2020 | | | [–]

Another useful red letter date for language/tool adoption is the standardization of UTF-8 in 1993. Before UTF-8 there were a lot of tools, especially in the POSIX world, that didn't feel comfortable without an 8-bit safe encoding format.

Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then (before a large influx of users), but a corresponding complaint about UTF-8 is that because it was 8-bit safe, a lot of tools also felt they could kick the can on dealing with it more directly (as a default), and Python 2 seems to be among them. Hindsight has told us a lot about the problems to expect (and exactly why Python 3 did what it felt it had to do), but they probably weren't as clear in 2000. (In further hindsight, imagine if Astral Plane Emoji had been standard and been common around 2000 instead of 2010 how much further we might be in consistent Unicode implementation today. I suppose that makes 2010 another red letter date for Unicode adoption.)

im3w1l on Jan 14, 2020 | | | [–]

And it was much later than 1993 that unicode conclusively defeated latin-1. Something like 2010?

monoideism on Jan 14, 2020 | | | | [–]

> Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then

That's true, but I would argue that given the difficulty and backlash we've seen moving from Python 2 to Python 3, such a move would have risked destroying Python's rapid forward momentum and condemned it to the ash heap of programming language history.

monoideism on Jan 14, 2020 | | | [–]

To add on to this, I'm not agreeing with the backlash from Python 2 to 3. And I wouldn't want it in the ash heap of history - I definitely think there's a definite place for nice, quick, easy dynamic langs like Python, particularly for exploratory programming.

I'm just saying the move to Python 3 turned out to be a huge deal to a lot of people (it surprised me), and for that reason, trying such a big jump at Python 2 would have been risky and could have derailed Python's forward progress at a critical point.

Would the downvoters like to share their reasons for disagreement?

WorldMaker on Jan 14, 2020 | | | [–]

I think the question goes back to the size and scale of users at the 1 to 2 jump versus the 2 to 3 jump. Python didn't really start to hit most of its "forward progress" in terms of both user adoption and being so deeply integrated into systems. There was no Django for Python 1, for one example. As another example, I'm pretty sure Debian and its heavy reliance on Python for so much of its system scripting didn't happen until Python 2, either, but a quick search didn't turn up a reliable date.

It probably would have been a lot less risky with so many fewer daily users, so many fewer huge projects to migrate.

monoideism on Jan 14, 2020 | | | [–]

You may be right. I first used Python on a regular basis in 2002 (after release of Python 2), so I wasn't aware it had so little adoption prior to Python 2. But it definitely was picking up by 2002.

dkarl on Jan 13, 2020 | | [–]

First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.

I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.

mixmastamyk on Jan 13, 2020 | | [–]

Yes, I was really surprised that they avoided upgrading to Python 2.7-level best practices and future statements for as long as they did and tried to hide it from most developers thru custom compatibility layers. Huh? That's step 0, getting except, stdlib imports, and print statements up to date. Folks can deal with that, that's the easy part.

Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.

indygreg2 on Jan 14, 2020 | | [–]

The late start was mostly due to having to retain Python 2.4/2.5 compatibility until May 2015 and it was literally impossible to use some future statements or some Python 3 syntax until 2.6 was required. I have updated the post to reflect this.

mixmastamyk on Jan 14, 2020 | | | [–]

IC, that’s unfortunate. Believe that is the time to cut a legacy branch/release rather than block progress for a decade.

CJefferson on Jan 13, 2020 | | [–]

Interesting you mention http headers. I had a program converted Python 2 -> Python 2 which was crashing occasionally, and it turned out it was because I was being sent a http request which wasn't valid unicode, so decoding failed.

I had to switch back to treating headers as bytes for as long as possible.

It is a stupid client which doesn't send valid ascii for http headers of course.

takeda on Jan 13, 2020 | | [–]

I believe the headers are encoded using ISO-8859-1 not Unicode. That encoding has 1:1 mapping with bytes so wouldn't break this way. Treating them as UTF-8 was the bug.

code_biologist on Jan 13, 2020 | | | [–]

This is exactly the sort of encoding issues that the python 2 to 3 transition has flushed out. People get frustrated with python 3, yet the actual failure was their mishandling of encoding issues -- papered over by python 2.

xorcist on Jan 13, 2020 | | | [–]

But that's not what frustrates people with the transition. It's that they suddenly get encoding issues where there should have been no encoding to begin with!

cbsmith on Jan 13, 2020 | | | [–]

No observed encoding issues.

CJefferson on Jan 13, 2020 | | | [–]

When I treated headers as bytes, there wasn't an "encoding".

What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

cbsmith on Jan 14, 2020 | | | [–]

> When I treated headers as bytes, there wasn't an "encoding".

If you are representing strings as bytes, you are intrinsically using an encoding.

> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.

But yes, this strategy largely avoids encoding issues... until it doesn't.

Dylan16807 on Jan 14, 2020 | | | [–]

> If you are representing strings as bytes, you are intrinsically using an encoding.

It's just binary data that might resemble a string. No encoding necessary.

fiedzia on Jan 14, 2020 | | | [–]

This is false more often than not. Many programs taking user input will treat it as string, assuming specific encoding or compatibility with screen output/some api, at least in some code paths. For example if you print an error message when you can't open some file, you are very likely to assume its encoded in a way terminal can handle, so its no longer "just binary data".

CJefferson on Jan 14, 2020 | | | [–]

Yes, I have to worry about how to make a "best effort" to show it to users, but in all internal code paths it must stay as "just binary data", else I lose information. This is exactly how chrome and Firefox handle headers internally.

cbsmith on Jan 15, 2020 | | | | [–]

It might resemble a particular encoding of a string... and the way you got that string to that particular sequence of bytes is by... encoding it.

Dylan16807 on Jan 15, 2020 | | | [–]

> and the way you got that string to that particular sequence of bytes

No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.

cbsmith on Jan 16, 2020 | | | [–]

In that context, you aren't using strings. You are using bytes. HTML without interpreting it as strings isn't really HTML, nor is it a string. It's just a blob that is passing through.

takeda on Jan 14, 2020 | | | | [–]

> When I treated headers as bytes, there wasn't an "encoding".

oh, actually there was (either us-ascii or more likely iso-8859-1) the bytes are just values 0-255 what these values mean is the encoding. You're confused because the encoding was implicit, rather than explicit.

It would perhaps be clearer to see it if you for example had to chose if you use ASCII or legacy EBCDIC encoding.

CJefferson on Jan 13, 2020 | | | | [–]

I'll admit, I'm not positive what the encoding should be. However there is a bunch of people who do clearly send UTF-8, and I can also promise you there are headers out there which just have binary nonsense in them. See for example https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

If you want to handle all headers, you have to be prepared to just get binary data.

takeda on Jan 14, 2020 | | | [–]

Yes, and using ISO-8859-1 is the way to handle them without issues. You will never get error when decoding it that way. If you are using UTF-8 there are character combinations that are invalid.

dnautics on Jan 13, 2020 | | | [–]

> It is a stupid client which doesn't send valid ascii for http headers of course.

...or a smart malicious actor.

utxaa on Jan 20, 2020 | | [–]

> But you don't need all b"" everywhere.

as a mercurial user i never understood this decision. for instance look at this recent commit: https://www.mercurial-scm.org/repo/hg/rev/b4c82b704180

would anyone disagree with the fact that an error message should be a string?

a source transformer to add b'' all over the place? really?

and i still don't understand why the hg transition had to be more complex than: https://docs.djangoproject.com/en/1.11/topics/python3/

... and of course now this: https://www.mercurial-scm.org/wiki/OxidationPlan

i wonder what does matt mackall think of all these developments?

skywhopper on Jan 14, 2020 | [–]

Why are you so certain about your assertions here about when they did and did not need to use explicit byte strings?