
To be pedantic, the Unicode standard recommends against the use of a BOM in UTF-8-encoded documents rather than declaring it invalid.
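For concreteness, the UTF-8 "BOM" is just U+FEFF encoded in UTF-8, the three-byte sequence EF BB BF; a minimal Python sketch, using the stdlib's `codecs` constants:

```python
import codecs

# U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded in UTF-8 is EF BB BF
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# The "utf-8-sig" codec strips one leading BOM if present
assert b"\xef\xbb\xbfhello".decode("utf-8-sig") == "hello"
# Plain "utf-8" keeps it as a character
assert b"\xef\xbb\xbfhello".decode("utf-8") == "\ufeffhello"
```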


It makes zero sense to specify a byte order for an encoding in which it is irrelevant. It only persists because of a lazy vendor that can't encode Unicode correctly.


It would have been nice if every well-encoded Unicode document started with a BOM and every legacy doc did not, instead of having to guess whether a doc is more likely UTF-8 or Latin-1.


Then concatenating to valid Unicode documents would no longer be valid Unicode. That is bad. And ASCII text would no longer be a valid UTF-8 encoded Unicode document. That is bad. And even when everything has finally switched to UTF-8 every tool ever will still need to handle the BOM. That is bad.
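The concatenation problem can be demonstrated in a few lines of Python: joining two BOM-prefixed files leaves a stray U+FEFF in the middle of the result, which no decoder will strip for you.

```python
import codecs

a = codecs.BOM_UTF8 + "one".encode("utf-8")
b = codecs.BOM_UTF8 + "two".encode("utf-8")

# "utf-8-sig" strips only the *leading* BOM; the second survives
# as an invisible U+FEFF character in the middle of the text.
joined = (a + b).decode("utf-8-sig")
assert joined == "one\ufefftwo"
```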

Guessing between valid UTF-8 and Latin-1 is only ever ambiguous when there are multiple non-ASCII characters in a row and all those sequences are made up of a lead byte with the correct number of trailing bytes. How often is that a problem for you in practice?
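The guessing heuristic described above can be sketched in Python: bytes that happen to form valid UTF-8 multi-byte sequences by accident are rare, so "decodes as UTF-8" is a strong signal, and Latin-1 is the fallback since it accepts any byte sequence. (The function name is mine, not from the thread.)

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic: prefer UTF-8 if the bytes decode cleanly, else Latin-1.

    Ambiguity only arises when non-ASCII bytes coincidentally form
    well-formed UTF-8 lead/trail sequences; Latin-1 never fails, so
    it serves as the catch-all.
    """
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"

# "é" in Latin-1 is the lone byte 0xE9 — an invalid UTF-8 lead byte
assert guess_encoding("café".encode("latin-1")) == "latin-1"
# The same text in UTF-8 (0xC3 0xA9) decodes cleanly
assert guess_encoding("café".encode("utf-8")) == "utf-8"
```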



