Hacker News

Yeah, coming from C#, Python 2's unicode support was so bad I almost abandoned it immediately as a Chinese speaker (and to make it worse, I use Windows). You literally can't use IDLE for learning/testing properly half of the time due to encoding issues.

And what surprised me most is that every time I mentioned this, there would be lots of people telling me how this is a superior design because you can operate on strings like bytes. I mean, it of course has its upside, but I don't think it's worth it if you care even slightly beyond ASCII.



We had crazy amounts of code handling unicode support and conversion from our ecommerce site to our ERP system (running on Windows using some Windows code page thing). With Python 3 all that went away, you can now just seamlessly parse text from one system to another.
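
As a rough illustration of what that conversion boils down to in Python 3: text stays `str` inside the program and is encoded only at the boundary to the legacy system. This is a sketch only. The thread doesn't say which code page the ERP used, so `cp1252` and the helper names `to_erp_bytes` / `from_erp_bytes` are assumptions.

```python
# Hypothetical: cp1252 stands in for the unnamed Windows code page.
ERP_ENCODING = "cp1252"


def to_erp_bytes(text):
    """Encode text for the legacy system; raises UnicodeEncodeError
    if a character has no representation in the code page."""
    return text.encode(ERP_ENCODING)


def from_erp_bytes(raw):
    """Decode bytes received from the legacy system into a str."""
    return raw.decode(ERP_ENCODING)


order_note = "Preis: 5 €, Größe L"
raw = to_erp_bytes(order_note)           # one byte per character in cp1252
utf8 = order_note.encode("utf-8")        # same text, different bytes

print(from_erp_bytes(raw) == order_note)  # the round trip preserves the text
print(raw == utf8)                        # the two encodings differ on the wire
```

The point of failing loudly in `to_erp_bytes` is that unmappable characters surface at the boundary instead of turning into mojibake further downstream.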

For me, the unicode handling alone was worth the time spent migrating from Python 2. That was a decade ago, so finding that the "python" command still launches a Python 2.7 interpreter in 2023 is just beyond belief. Personally I feel like they should have yanked Python 2 in Jessie (Debian 8) in 2015, more realistically in Stretch in 2017.


The biggest danger in Python 2's unicode handling is that incorrect things somewhat worked (until you got a non-ASCII character, at which point it exploded or produced incorrect behaviour).

I'm sure you could do things well in Python 2 with proper combinations of encode/decode, but it wasn't obvious where you even needed those because with ascii text, things "just worked" transparently. With Python 3 it's very obvious where you need encoding/decoding because bytes != str.
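
A minimal Python 3 sketch of that last point: mixing the two types fails immediately, even for text that happens to be pure ASCII, so the place where an explicit decode is needed is hard to miss. The helper `join_mixed` is made up for the example.

```python
text = "Hänsel"               # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text


def join_mixed(a, b):
    """Try to concatenate two values; report the TypeError instead of raising."""
    try:
        return a + b
    except TypeError as exc:
        return "TypeError: {}".format(exc)


mixed = join_mixed(text, data)         # str + bytes is rejected in Python 3
fixed = text + data.decode("utf-8")    # the explicit decode makes the intent visible

print(mixed)
print(fixed)
print(text == data)   # False: a str never compares equal to bytes
```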


You could do things correctly in Python 2, but as soon as you used any third-party library in your project, chances were it was going to explode underneath you anyway.

In the early 2000s, I maintained a py2 wxPython app whose users had win-1250 as their system encoding; the effort to patch this was unbelievable. The migration to python3-style handling forced everyone to think about these issues, not just the few people for whom things were crashing. Even just popularizing the issue was great. Until then, many maintainers of third-party libraries didn't even understand the problem you wanted to fix in their libs with a "needlessly complicated" patch.


> That was a decade ago, so finding that the "python" command still launches a Python 2.7 interpreter in 2023 is just beyond belief.

The problem is not the end-user invoking the command.

The problem is scripts expecting `#!/usr/bin/env python` to invoke python-2.
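
One defensive pattern for scripts that can't control what `python` resolves to is to check the interpreter version at startup and fail with a readable message. This is only a sketch, not something proposed in the thread, and the name `require_python3` is made up. The code deliberately sticks to syntax that both Python 2 and Python 3 accept, since otherwise the wrong interpreter would die with a SyntaxError before the check runs.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Return an error message if the major version is below 3, else None."""
    if version_info[0] < 3:
        return "This script needs Python 3, but was run with Python %d.%d" % (
            version_info[0], version_info[1])
    return None


message = require_python3()
if message is not None:
    sys.exit(message)   # prints the message to stderr and exits with status 1
```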


> finding that the "python" command still launches a Python 2.7

apt install python-is-python3


I don't think my co-workers would like that :-)


dpkg-reconfigure co-workers?


apt purge co-workers


I've broken a bunch of stuff when I tried to replace python 2 with 3.


> Personally I feel like they should have yanked Python 2 in Jessie (Debian 8) in 2015, more realistically in Stretch in 2017.

For example GnuRadio started supporting Python 3 with GnuRadio 3.8 released in 2019, and then you had to port all your programs using it to this version. So no, in 2017, the ecosystem was not ready.


FWIW, in Linux, this problem does not exist. Everything is UTF-8 and Python 2 would work just fine (and always did).

In order to support Windows better, Python 3 introduced support for UCS-4 (or worse, UTF-16) strings (depending on a compilation setting when Python was compiled) and they had to introduce extra string types to distinguish readable strings from binary strings ("bytes").

These extra types made Python 3 a lot harder to teach (I teach 30 person classes every year).

So it's not all roses now.

In the end, I got used to it, BUT I just gave up asking encode()/decode() questions at the exams. Very few people understand it, or care enough (and I understand why--it's a ridiculous thing to have). You only need it if your OS somehow slept through the introduction of UTF-8, which is backward compatible with ASCII, resilient even if there are transfer errors and can encode all unicode characters.
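
For what it's worth, those three UTF-8 properties are easy to check in Python 3. A small sketch (the sample strings are arbitrary):

```python
# 1. Backward compatible: ASCII text has identical bytes under ASCII and UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# 2. Resilient: losing one byte damages only the character it belonged to.
original = "Grüße aus Köln".encode("utf-8")
idx = original.index(b"\xc3")                  # lead byte of the "ü"
damaged = original[:idx] + original[idx + 1:]  # simulate a dropped byte
recovered = damaged.decode("utf-8", errors="replace")

# 3. Complete: code points outside the Basic Multilingual Plane encode too.
emoji_bytes = "\U0001F600".encode("utf-8")

print(same_bytes)
print(recovered)          # the "ü" becomes U+FFFD, the rest is intact
print(len(emoji_bytes))
```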

Encoding problems used to be really common in UNIX (and before that, in mainframes), but with the introduction of UTF-8, all encoding problems I had vanished and never appeared again.

Even Windows 10 has a UTF-8 mode now and the Windows API functions that end in "A" can be made to use UTF-8.

Now, in a sense, Python 3 has this entire complication for no reason.

That said, Python 3 is ok to use now--and, conceptually, distinguishing byte strings from unicode strings is better (for example so that you don't accidentally print the former to the terminal). It just uses up brain cycles that you could be using for solving your actual problems.


> I just gave up asking encode()/decode() questions at the exams. Very few people understand it, or care enough (and I understand why--it's a ridiculous thing to have).

I get it from the "pass the exam" perspective, since that's one more thing to worry about.

But from my experience in teaching others, doing the conversion between bytes and string implicitly (à la Python 2's way) hinders actual understanding of this very important concept, and it's quite harmful in further study.

Bytes should be considered as a separate, more low-level thing, away from int/float/strings; at the very least, it should be considered as bits/hex numbers. If you want strings, you explicitly encode/decode them in a way, even if everything is UTF-8.

On top of that, "byte string" is just a confusing concept. It might work for English speakers (by "it's a ridiculous thing to have" I assume you mean that, "'english'.encode() is just b'english', why bother?"), but not at all for Chinese speakers, even in UTF-8. There is no b'中文' -- only b'\xe4\xb8\xad\xe6\x96\x87', which has zero meaning on its own.
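
To make that concrete in Python 3 (a small sketch; `compile()` is used only so the invalid literal can be shown without breaking the example itself):

```python
text = "中文"
encoded = text.encode("utf-8")

print(encoded)                     # b'\xe4\xb8\xad\xe6\x96\x87'
print(len(text), len(encoded))     # 2 characters, 6 bytes

# A bytes literal may only contain ASCII characters, so b'中文' is rejected
# by the compiler before any code runs.
try:
    compile("b'中文'", "<example>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False

# Cutting the bytes in the middle of a character leaves undecodable data.
try:
    encoded[:4].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False

print(literal_ok, partial_ok)
```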

And even from an easy-to-use perspective: most people don't even work on bytes often nowadays. A more abstract "string" type is all they need, without worrying about how it works under the hood (and if they do, they need to understand how encode/decode works properly anyway).


> doing the conversion between bytes and string implicitly

There was no conversion. `bytes` and `str` were the same type.

http://docs.python.org/whatsnew/2.6.html#pep-3112-byte-liter... says:

> Python 2.6 adds bytes as a synonym for the str type, and it also supports the b'' notation.

I just checked in Python 2.7:

    >>> bytes is str
    True
    >>> print("Hänsel")
    Hänsel
    >>> "Hänsel"
    'H\xc3\xa4nsel'

I'm working with Germans, Japanese and Poles who use a lot of special characters, including Kanji, umlauts, extra quote characters etc. I need the non-ASCII parts and had no problem with them in Python 2 on Linux (now, C++ libraries that reinvented their own string classes: many problems; C libraries: no problems).

The point is when bytes is str, everything works just fine in Python 2 on Linux with a UTF-8 locale (which is used in all modern Linux distributions). No need to have a distinction between bytes and str.

That's how the rest of the OS works, too. Even a lot of Gtk, Glib and so on (for example the GNOME desktop environment) assume that you are in a UTF-8 locale for file names, for example.

> A more abstract "string" type is all they need, without worrying about how it works under the hood (and if they do, they need to understand how encode/decode works properly anyway).

Ehh, we had students write drivers for measurement apparatuses and they all used Python 2 str (without being prompted to do so). No encode or decode anywhere. Of the students, almost no one who tried Python 3 for that stayed with it (instead they were using Python 2). There was just no upside for this use case.

I agree that, long term, having a distinction str vs bytes makes sense. But then you ARE juggling things that the OS doesn't need--it's basically busywork in Linux.

I'm not trying to minimize your experience--but I don't think it would happen if you tried python2 on Linux today. Not sure it was worth breaking compat for that.


> FWIW, in Linux, this problem does not exist. Everything is UTF-8 and Python 2 would work just fine (and always did).

That's not true at all. I remember all kinds of encoding errors when dealing with the FS, the network or any user input when using Linux.

Unless you're talking specifically about IDLE?
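
One way the FS case can bite, sketched in Python 3: Linux file names are just bytes, so nothing guarantees they are valid UTF-8. The file name below is invented, and the `surrogateescape` handler shown is the mechanism from PEP 383 that Python 3 uses for OS-facing strings on POSIX; here it is applied by hand so the example doesn't depend on the machine's locale.

```python
# A name as a Latin-1 system might have written it: "é" stored as byte 0xE9.
raw_name = b"caf\xe9.txt"


def try_strict(raw):
    """Decode as strict UTF-8, returning None if the bytes are not valid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict = try_strict(raw_name)   # None: 0xE9 followed by "." is not valid UTF-8

# surrogateescape smuggles the undecodable byte through as a lone surrogate,
# so the original bytes can be recovered exactly when talking to the OS again.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict)
print(ascii(escaped))
print(round_trip == raw_name)
```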



