Natural language isn’t just English [pdf]

briga · on Dec 25, 2019

Part of the problem seems to be logistical. English natural language training sets are probably just way more common than other languages. About the only language that has more training data is Chinese, and they seem to be doing fine in the NLP department.

Aren't modern transformer networks able to deal with multiple languages? Seems like models powerful enough to do that should be fairly language-agnostic, assuming you can pull together enough training data for that language.

zdragnar · on Dec 25, 2019

My very limited understanding is that some interstitial language data is used to map between languages, but they are often somewhat "polluted" by the idioms or idiosyncrasies of whichever language you first start with to build it up.

I seem to recall, for example, google translate having difficulty with the word "plane" between two non-english languages, accidentally confusing the word with two meanings from English. This was, iirc, because the internal representation of the words were originally generated by an English dataset, and so airplane in one non-english language became to level wood with a planer in the other non-english language (or something to that effect).

Mandarin / Chinese has a comparatively simpler grammar (no verb conjugation) and no phonetic alphabet with arbitrarily varying phonetic rules, it is also chock full of idioms and homophobes. There is a lot that can be done, but the 80 / 20 rule is in full effect.

simias · on Dec 25, 2019

I can definitely attest for the English bias of Google translate, even if you're translating between pairs of languages that don't include English. In particular, as you point out, it's often confused by English homonyms (even if they're not homonyms in either the source or target language).

For instance attempting to translate Portuguese "o báculo" (the staff, the object) into French gives you "le personnel" (the staff, people working together). It's a completely wrong translation that, I'm guessing, is caused by the algorithm not managing to differentiate these two different meanings because they're homonyms in English: https://translate.google.com/#view=home&op=translate&sl=pt&t...

An other thing Google translate is terrible at is handling the various level of addressing. Many languages have the notion of polite vs informal "you" (you/thou) which have mostly disappeared from modern English. Google very often translates a formal/polite form of address into an informal one, or vice versa: https://translate.google.com/#view=home&op=translate&sl=fr&t...

Here the polite French "how are you" is translated with the informal "ты" in Russian instead of "вы". Interestingly if I then translate in the other way around it correctly uses the "tu" form in French: https://translate.google.com/#view=home&op=translate&sl=ru&t...

Veedrac · on Dec 25, 2019

This is simply because Google Translate relies on language-language models trained on language-language pairs. Where training data is not available in quantity, the quality of translations is sufficiently low that going through English is preferred.

This is something that has been solved recently[1] by training ‘massively multilingual’ models. However, such models come with fairly stark compute costs, especially given the number of users Google Translate has, so it will take a while for these advances to roll out. Which ultimately points out how silly this article is: NLP is all the same in relevant matters, and (to first approximation) what works for English works for everything else too, as long as you can get enough training data.

[1] https://ai.googleblog.com/2019/10/exploring-massively-multil...

TazeTSchnitzel · on Dec 25, 2019

> what works for English works for everything else too

…if they're similar to English. Which isn't good enough, there are many important languages that are quite different from English.

Veedrac · on Dec 25, 2019

They're all similar to English. Languages have their syntactic ambiguities in different places, but they're all fundamentally similar. In every case the hard part is in understanding the semantics expressed, and the characters used to express it are a side-issue.

_emacsomancer_ · on Dec 25, 2019

They're similar at the atomic level, so to speak, but not very similar at the levels relevant for translation.

The characters used to express languages are certainly a side-issue, but I don't understand why you would bring that into a discussion of syntactic structure.

For compositional semantics, understanding the syntactic structures involved is crucial. Otherwise you end up with gibberish (i.e. what I find so often with Google Translate).

Veedrac · on Dec 25, 2019

My claim isn't that it's easy to solve NLP; of course a resource-constrained system like Google Translate that inderects many language pairs through English is going to have glaring issues.

My claim is that the challenges generalize between languages. If you can handle English, you can handle any other language, and while the easy parts might at least look different at the surface, the hard parts are all the same.

TazeTSchnitzel · on Dec 25, 2019

This is assuming the problems are similar between languages. But are they? An NLP system tested only on English might be useless on CJK languages, because they do not use spaces, so the system cannot rely on the almost-free segmentation you get from English. Another example is that if you try a heavily-inflected language, you suddenly have vastly more forms of the same word than in English, and your system needs to be robust to that in a way that's unnecessary for English.

Veedrac · on Dec 25, 2019

An NLP model that can't even (implicitly) segment CJK languages is laughable. It's like saying ‘sure, Magnus Carlsen is really good at chess, but in draughts you can take multiple moves in a row.’ If you can handle the ambiguities in natural casual English, you can handle a little inflection.

TazeTSchnitzel · on Dec 25, 2019

Translating via English consistently creates nonsense even in similar languages (English and Swedish!). Though Google Translate can't even get English right, it thinks “cheque” (the monetary instrument) can be translated to the Swedish verb corresponding “check” (to look at) :D

yorwba · on Dec 25, 2019

Another problem affecting Chinese NLP is that text doesn't come pre-segmented with spaces between words and segmentation errors can produce total nonsense.

E.g. 我们很快会谈到这个的。[0] "We'll talk about this soon." Should be segmented like 我们(we)很(very)快(quick)会(will)谈(talk)到(get to)这个(this)的(emphasis)。Literally, "We will very quickly get to talk about this!" But 会谈 is also a noun meaning "negotiation", so the sentence could end up as "We are very quick negotiation gets to this!" which is still kind of comprehensible, but loses the future aspect, so it might as well be a comment about how quickly the negotiation has progressed.

This kind of error affects all downstream tasks. Try doing named-entity recognition when names keep getting partially glued to surrounding words. Very annoying.

[0] from https://tatoeba.org/eng/sentences/show/7781672

duguxu · on Dec 25, 2019

Another classical, interesting but evil example:

工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作

Literal translation of each character one by one:

工(industry)信(information)处(place)女(female)干(do)事(thing)每(every)月(month)经(through)过(pass)下(below)属(obey)科(division)室(room)都(always)要(would)亲(oneself)口(mouth)交(deliver/intersect)代(replace)24(24)口(mouth)交(deliver)换(exchange)机(machine)等(etc)技(ability)术(technique)性(nature/gender)器(device/organ)件(piece)的(of)安(install)装(costume)工(work)作(job)

Correct segmentation:

工信处(ministry of industry and information)女(female)干事(secretary)每月(every month)经过(pass by)下属(subordinate)科室(department)都要(would always)亲口(from her own lips)交代(arrange)24口交换机(24-port switch)等(etc)技术性(technological)器件(device)的(of)安装(installation)工作(work)

Every month when the female secretary of the ministry of industry and information passes by the subordinate department, she would always arrange in person the installation work of the 24-port switch and other technological devices.

Vulgar segmentation:

工信处女(indoor female=virgin) 干事每月经(month through=menstruation) 过下属科室都要亲口交(mouth intersect=oral sex) 代24 口交(oral sex) 换机等技术性器(sex organ=genital) 件的安装工作

Obviously google translation passed the test!

misterman0 · on Dec 25, 2019

>> idioms and homophobes

Did you mean idiots?

the_duke · on Dec 25, 2019

https://en.wikipedia.org/wiki/Homophone

Sharlin · on Dec 25, 2019

I believe the GP was joking about great-GP's typoing of "homophones" as "homophobes" :)

pjmlp · on Dec 25, 2019

Indeed, one of the reasons why I usually don't bother with natural language processing tools, as I want to use my own native language, instead of English or Brazilian Portuguese (if I am lucky enough for it to be supported).

zzo38computer · on Dec 25, 2019

It is correct, natural language isn't just English there is also many other kind too. But, the document is itself written in English, so we can guess that it is English. Still it doesn't help when you want to distinguish the natural language stuff dealing specifically English or in general (and if in general, you must be prepared to deal with stuff that is different from English, because they have their own features which are different from English).

elfexec · on Dec 25, 2019

Pages 4 and 5 ( "Languages of the world" ) of the pdf are duplicates.

I agree that NL isn't just English, but the world of computer science, information processing, AI and even linguistics is heavily english-centric for a wide variety of reasons. And unless something drastic happens ( the end of pax americana and the fracturing of the world order ), everything will get even more english-centric. The internet is spreading all over the world. American culture is spreading all over the world. And so is american/english.

I do wish CS and programming would be less english-centric. It doesn't make sense to me that people in china, russia, europe, india, etc are writing code in english. Why not translate computer languages to their native language? Given the nature of programming languages, it's so simple to do. Wouldn't it be simpler to translate C or python or java to chinese and have chinese students/programmer code in the chinese language ( mandarin )? Rather than having chinese students learn english and then learn to code in english?

tastroder · on Dec 25, 2019

I don't believe that's an inherently bad thing for the development world. As a non native speaker I can recall quite a few situations where my native language has annoyed me in the context of coding to no end. Just to randomly pick two:

Microsoft Excel translated (still might, no idea) their function names and internationalised stuff like floating point numbers. That might be beneficial for some parts of the user base, for me it meant broken spreadsheets, broken data. That might still be a thing today and working fine but if it is, it's just another days point to show that it's an investment and not some easy to add mechanism.

The second one is mostly programming language independent. There's been quite a few times where I tried to onboard developers into legacy code bases developed fully by German language teams (read: German comments, tickets, documentation) and it's a pain. It served no purpose other than catering to the lacking English skills of some team members while essentially blocking the possibility to bring in talented outside help.

I like that English has become the common language of our industry, the alternative just compartmentalizes information without much apparent benefit.

stoicShell · on Dec 25, 2019

Oh my, don't get me started about Excel localization.

It played a big part in my decision 20+ years ago to switch all my computers (any electronics actually) to US/English locale. Years later when I began programming seriously, I switched to a US-ANSI QWERTY layout — never looked back.

The premise is simple and just works:

- remove all friction (cognitive bloat) with foreign knowledge (blogs, docs, books, vids, courses, etc) because I don't have the additional step of translating, or rather guessing which snowflake word some engineer of my country thought was a good idea.

- avoid waiting for translations, which may never come (been there, done that, burned by JP/US video games in the 1990's never making it to the EU, thank you very much... not). Currently this usually means a lag of anything between 3 days and 3 months to get features from US app stores to EU ones. Some features never make it past the English world (Amazon is very bad in this regard, the whole Alexa offer is much lesser outside the US in my experience).

- I'll go one step further thus, and submit that even documentation, while written in local language usually, should at least retain English terms for fixed, known things — do not translate already formalized concepts like "reverse proxy", acronyms like DNS, cultural concepts like Agile or devops. It just makes it much, much harder for everyone to know what we're talking about, and to relate to the English world (makes your country, locale, a silo, which is bad-bad). Ideally, you'd write documentation fully in English, and optionally have a localized translation when it can't be avoided (regulation, compliance comes to mind).

I love that there is a common language in our industry; which one is a distant concern (although I love English more than my mother tongue, so there's that for me).

yorwba · on Dec 25, 2019

> It doesn't make sense to me that people in china, russia, europe, india, etc are writing code in english. Why not translate computer languages to their native language?

It doesn't make sense to them to cut themselves off from the English-based open-source ecosystem just so they can use their native language when they already know enough English to get by.

Many programming languages already support Unicode identifiers and nativist movements exist (e.g. [0]), some people produce interesting mixed-language programs (e.g. stdlib in English, other identifiers in Chinese with Latin prefixes to indicate type: [1]) but it's overall a very small niche, because most people can see the value of being able to share code globally, which means using English.

[0] https://github.com/program-in-chinese/overview/

[1] https://github.com/cflw/cflw_py/blob/master/%E7%A8%8B%E5%BA%...

elfexec · on Dec 25, 2019

> It doesn't make sense to them to cut themselves off from the English-based open-source ecosystem just so they can use their native language when they already know enough English to get by.

Just because they develope their own chinese-based open-source ecosystem doesn't mean they have to cut themselves off from the english-based one. But by using the english-based one, they are cutting themselves off of their potential native chinese-based one.

My point is that it would be easier for chinese people to develope in their own language. I'd consider it silly if I had to learn chinese or greek just to program "hello world".

> because most people can see the value of being able to share code globally, which means using English.

You could share code globally even if the chinese code in mandarin. You can translate code from chinese to english and vice versa. It would be even simpler in assembly.

Also, there would be more code overall if programming didn't have an added hurdle of having to learn english in the first place. If people around the world could code in their native language, we'd have more programmers and more code.

Using your argument, the chinese should write their novels in english too. But everyone would consider that absurd. That's the same for programming. It's an absurd system that exists, but hopefully things will change in the future.

yorwba · on Dec 25, 2019

> Just because they develope their own chinese-based open-source ecosystem doesn't mean they have to cut themselves off from the english-based one. But by using the english-based one, they are cutting themselves off of their potential native chinese-based one.

So does using one language cut one off from the ecosystem of another or does it not? In practice, this depends on the prevalence of bilingualism: few English-speaking developers know Chinese, but most Chinese-speaking developers know English. So using English doesn't cut them off (because they understand it) while using Chinese cuts everyone else off.

> You can translate code from chinese to english and vice versa.

And who do you think is going to do that translation? Someone who understands the program in question and knows both human languages well enough to translate between them. In most cases, that will be the developers. They're saving themselves a lot of hassle by just writing it in English to begin with.

> Also, there would be more code overall if programming didn't have an added hurdle of having to learn english in the first place. If people around the world could code in their native language, we'd have more programmers and more code.

People around the world can code in their native language. I linked to projects that are all about programming in Chinese. That there isn't more Chinese code should tell you something about the priorities that Chinese programmers actually have.

thrower123 · on Dec 25, 2019

The problem with Chinese, in particular, is that despite unicode and additional software for input, the language is terribly unergonomic to use.

stoicShell · on Dec 25, 2019

It's no wonder why the latin alphabet got chosen, the same reason why it became popular in the first place, thousands of years ago. It's just efficient to use 26-ish letters, like numeral systems are easier to manipulate with 8~12 base digits.

zzzcpan · on Dec 25, 2019

> It doesn't make sense to me that people in china, russia, europe, india, etc are writing code in english. Why not translate computer languages to their native language?

It doesn't make sense to you, because people are not writing code in English, they are writing code using a set of characters, only some of those characters and some keywords overlap with the English language, namely latin alphabet and those keywords have different distinct meanings, than what they mean in English. Furthermore, most of the characters, including the latin alphabet, already have a significant part in every education system in the world. People are taught latin alphabet, some keywords and most of the operators you see in programming languages at school in math and physics classes. So, how in the world would it benefit anyone to have programming languages with non-latin alphabets? It can only make things a) harder to learn, because there is no utilizing familiarity with other fields people already studied and b) harder to share code and knowledge, because other people in the world can't read and input your non-latin characters.

By the way, this is the same stupid idea that got people to allow unicode variable names in programming languages.

TazeTSchnitzel · on Dec 25, 2019

Programming languages are in English though. The keywords are pulled from English, the names of functions are clearly made up of English words, the documentation is in English first and often only English, the untranslated error messages are in English, the filenames are in English. Already speaking English is a massive advantage in understanding and working with all of that.

zzzcpan · on Dec 25, 2019

Keywords, names, filenames don't retain their meanings from English, so it's not important where they came from. And error messages use the same few English words, but mostly all those keywords, names, filenames. Documentation available in local languages is the only important part here. But documentation and other reading materials is an issue for people who are already many years in programming and the deeper they go into the field, the more necessary it becomes to learn English.

stoicShell · on Dec 25, 2019

> Already speaking English is a massive advantage in understanding and working with all of that.

I think we're past that, it's become a de facto prerequisite now. You just can't become a programmer without at least written comprehension. You'll learn English anyway as you learn programming; but you'd do yourself a favor to double down — my suggestion: immerse yourself as much as possible, switch all your systems to English, watch only English TV, video, music, books, articles, etc. After 3-6 months, most people reported to me a "huge" leap forward, and within 2 years they can watch most TV un-subtitled (live shows are harder because less scripted, less grammatically formal I suppose, more idiomatic).

In terms of programming and tech in general, a keen understanding of English helps understanding much of the subtext, the intent, the "between the lines" meaning of techs, concepts, namespaces.

The same is true in some martial arts: in some countries, they have you learn only the original (say, Japanese) name of a technique. The problem is that it's hard to remember, it means nothing, it's just a burden; whereas it literally usually means simply what it does: "wrist torsion", "quick stun hit", etc. It may seem nothing but taking the time to translate all of it (at least), understand the language (at best), tells you a lot that you never hear because, well, it's obvious to those who speak it, and oblivious to those who don't. Cultural barriers are most confounding when they are invisible, unbeknownst to us.

TL;DR; English is almost necessary to be autonomous in programming and tech in general, even if you can learn without you just won't be able to navigate such a fast moving landscape and make deep sense of it.

elfexec · on Dec 26, 2019

> It doesn't make sense to you, because people are not writing code in English

What language are they writing in? What language is C, Python, C++, Java, C#, etc written in? Chinese? Greek? French?

> they are writing code using a set of characters, only some of those characters and some keywords overlap with the English language, namely latin alphabet and those keywords have different distinct meanings, than what they mean in English.

Care to name a keyword which has a different meaning in English? Also, every programming language is written in the american/english alphabet, not the latin alphabet. The old latin alphabet don't have the same letters as the english alphabet.

> People are taught latin alphabet, some keywords and most of the operators you see in programming languages at school in math and physics classes.

That's my point. It's simple. Why not just translate the programming language into their own native language. That way, people who don't know the english alphabet or the language can write code?

> So, how in the world would it benefit anyone to have programming languages with non-latin alphabets?

We already write in non-latin alphabet - the american/english alphabet. And you are missing the point. I'm saying that programming languages should be translated into french, swedish, russian, chinese, etc. Native languages, not necessarily just the script. As you noted, a programming language consists of a few keywords and operators. It's wouldn't be that difficult to translate programming language and code into people's native language.

> It can only make things a) harder to learn, because there is no utilizing familiarity with other fields people already studied

It's simply not true. Imagine if every programming language was in chinese or russian or french. Would it make it easier or harder for me as an native english speaker? It would make it more difficult for sure.

> b) harder to share code and knowledge, because other people in the world can't read and input your non-latin characters.

You can easily translate code. After all, you noted that it is just a few keywords, operators, etc.

Using you logic, no other language should be used other than english. All literature should be in english. All music should be in english. Is that what you are advocating? Imagine what we'd lose in knowledge and culture? Imagine what we are losing because billions of people are not able to program because one language dominates programming.

> By the way, this is the same stupid idea that got people to allow unicode variable names in programming languages.

This should tell you that people should program in their own native languages...

ben_w · on Dec 25, 2019

What do you mean by “translate computer languages”?

The reserved keywords, like ‘class’ or ‘if’? That might be easy, though I won’t be surprised if some natural languages will have multiple forms of those keywords depending on numbering or word-gender.

The function and class names in the standard libraries, like UITableViewController or String.toUpperCase()? That’s harder, and would have to be manual to avoid the “water sheep” problem (hydraulic rams). Also: does Chinese have the concept of upper/lower case?

Or do you mean the entire documentation, which is sometimes lacking useful details even when it’s only written in English? That’s not going to be possible automatically, and when it is, software developers will be obsolete because business speak could be translated into code without us.

int_19h · on Dec 25, 2019

Translating programming languages was tried. Indeed, when ALGOL was originally designed, that was one of the reasons why its keywords were standalone "symbols", completely distinct from the letters they consist from. The idea was that a particular representation language could choose different representations for those "symbols" - and one obvious option was to use the translated words.

Here's an example of an Algol-60 program in Russian representation: https://i.imgur.com/9edem3y.png

This was also done for Algol-68, and some Pascal dialects. But the practice has mostly died out once newsgroups and Internet became widely accessible. There's honestly very little upside to translating keywords and identifiers, because learning words is the easiest part of learning any language (grammar is much more complicated). And you don't actually have to learn the original meanings of words, either - you just need to know what they mean in programming context. In fact, not knowing the original meaning can be beneficial sometimes, because it can prevent poor understanding due to false equivalence or poor metaphors.

But there's considerable downside in not being able to understand or reuse snippets from another language...

TL;DR: we tried - it's not worth it. English FTW.

rockinghigh · on Dec 25, 2019

> It doesn't make sense to me that people in china, russia, europe, india, etc are writing code in english.

English is the native language for 15% of European. Most European know more English that you need to learn these programming languages. C has only 32 keywords. You can name variables, structs, or functions and write comments in your native language.

the_decider · on Dec 25, 2019

This blog-post discusses the dangers of blindly applying English-optimized NLP models to other languages: https://primer.ai/blog/Russian-Natural-Language-Processing/

Veedrac · on Dec 25, 2019

If anything this post disagrees with the claim that ‘English isn't generic for language’. It considers a naïve approach that fails for English, and shows that it fails for Russian in a similar way, and then compares it with an improvement also developed for English, which too helps Russian in a similar way.

It is true that when Facebook trained their model “blindly”, this introduced meaningful bugs, so I agree with that claim from the post. It does not, however, give credence to the idea that techniques developed for English don't reliably work for other languages.

Unfortunately the specific comparisons they make are suspect given how different the training sets were.

avmich · on Dec 25, 2019

It could be extremely useful to concentrate efforts on usage and details of one specific language rather than spread them over all different communication forms.

The opposite could also be true.

natalyarostova · on Dec 25, 2019

Methods that work on English are an order of magnitude more important than others. There is a fair methodological point towards saying English=!natural language. But the economics of it simply make it the most important one to get right.

rahulnair23 · on Dec 25, 2019

Fully agree on the economic merits of English here. I think the point is more than methodological though. It is one of equity and access.

Broad swaths of society don't have the same access to the technology. My mother tongue Malayalam in the 1960's redefined its script to be better suited for typewriters. It is still non-trivial to typeset correctly. Like my Chinese colleagues, there are all sorts of constraints on how text is input.

Machine translation beyond the "main" languages aren't too great. To see bias in action just try English->Turkish, a gender neutral language. "o multu. o mutsuz." translates to "he's happy. she is unhappy."

Things built for English don't readily transfer. Its worth raising awareness around this as Emily does.

natalyarostova · on Dec 25, 2019

I think people are aware, it’s just that English is clearly the most important. That’s just reality.

unishark · on Dec 25, 2019

A NLP researcher is supposed to describe how language is used, not insist on how it should be used.

hibbelig · on Dec 25, 2019

Emily's point is that NLP researchers tend to describe only how English is used, not how other languages are used.

unishark · on Dec 26, 2019

That's like criticizing a biologist for only studying nematodes. Research is specialized by its nature. She does say a bit about directing research to other languages. But mostly she complains about researchers not identifying their research as English, including threats to heckle speakers who don't identify it as such, i.e. who don't use the spoken language in the way she prefers.

mamon · on Dec 25, 2019

I would jokingly say that English is halfway between natural languages and programming languages. I mean: English grammar is so simple and all sentences are so well structured that processing English is order of magnitude easier than, say, German.

gbear605 · on Dec 25, 2019

English is in fact not particularly simple, especially not if your only comparison is German. Both English and German can be modeled at approximately the same level of difficulty, with only a few minor differences. There is case assignment, but that’s simple compared to everything else (and English has deal with case too, if not to the same extent). The field of linguistics has been trying to fully model English syntax for the last 70 years and is still turning up new corner cases. And a lot of languages are in fact a lot simpler. To quote one of my bilingual friends after waking up from general anesthesia:

> [I] couldn't put English in order so I just switched to Russian since I knew what syntactic roles I wanted, just not how to order words

Latin, a supposedly more complex language, is another example of a language with much less strict syntax than English. And Google Translate is horrible at Latin.

Mandarin, on the other hand, has a more complex grammar than English, but that doesn’t mean that it is more or less easy for NLP. After all, Google Translate is horrible at Mandarin.

ben_w · on Dec 25, 2019

Yeah right.

Examples of well-recognised sarcastic phrases that would mess with simple sentiment analysis aside, I think the fact that it’s easy to think that English is simple and well-formed has probably held back serious development of chatbots etc. — the hard work of learning the real rules can be substituted with a simple grammatical approximation that gets a lot wrong and can’t be easily improved.

Or, to put it another way: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo” is treated as a string of nouns, not [proper noun] [noun] [proper noun] [noun] [verb] [verb] [proper noun] [noun].