
Part of the problem seems to be logistical. English natural language training sets are probably just way more common than other languages. About the only language that has more training data is Chinese, and they seem to be doing fine in the NLP department.

Aren't modern transformer networks able to deal with multiple languages? Seems like models powerful enough to do that should be fairly language-agnostic, assuming you can pull together enough training data for that language.



My very limited understanding is that some interstitial language data is used to map between languages, but they are often somewhat "polluted" by the idioms or idiosyncrasies of whichever language you first start with to build it up.

I seem to recall, for example, Google Translate having difficulty with the word "plane" between two non-English languages, accidentally conflating the two English meanings. This was, IIRC, because the internal representation of the words was originally derived from an English dataset, and so "airplane" in one non-English language became "to smooth wood with a planer" in the other non-English language (or something to that effect).

Mandarin / Chinese has comparatively simpler grammar (no verb conjugation) and no phonetic alphabet with arbitrarily varying pronunciation rules; it is also chock full of idioms and homophobes. There is a lot that can be done, but the 80 / 20 rule is in full effect.


I can definitely attest to the English bias of Google Translate, even when you're translating between pairs of languages that don't include English. In particular, as you point out, it's often confused by English homonyms (even if they're not homonyms in either the source or target language).

For instance, attempting to translate the Portuguese "o báculo" (the staff, i.e. the object) into French gives you "le personnel" (the staff, i.e. people working together). It's a completely wrong translation that, I'm guessing, is caused by the algorithm failing to differentiate these two meanings because they're homonyms in English: https://translate.google.com/#view=home&op=translate&sl=pt&t...
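A toy sketch of why pivoting through English produces this failure. The two dictionaries below are hypothetical mini-lexicons, not Google's actual data; the point is just that both senses collapse onto the single English token "staff", so the pt → en → fr chain can't recover which sense was meant.

```python
# Hypothetical mini-dictionaries illustrating pivot translation through English.
pt_to_en = {"báculo": "staff"}          # staff = the walking stick
en_to_fr = {"staff": "le personnel"}    # staff = the employees

def pivot_translate(word_pt: str) -> str:
    """Translate Portuguese -> French via an English pivot."""
    return en_to_fr[pt_to_en[word_pt]]

# The wrong sense survives the round trip:
print(pivot_translate("báculo"))  # -> "le personnel"
```

Once the two senses share one English surface form, no amount of cleverness on the en → fr side can tell them apart; the information is already gone.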

Another thing Google Translate is terrible at is handling the various levels of address. Many languages have the notion of a polite vs. informal "you" (you/thou), a distinction that has mostly disappeared from modern English. Google very often translates a formal/polite form of address into an informal one, or vice versa: https://translate.google.com/#view=home&op=translate&sl=fr&t...

Here the polite French "how are you" is translated with the informal "ты" in Russian instead of "вы". Interestingly, if I then translate the other way around, it correctly uses the "tu" form in French: https://translate.google.com/#view=home&op=translate&sl=ru&t...


This is simply because Google Translate relies on models trained on specific language pairs. Where training data for a pair is not available in quantity, the quality of direct translation is low enough that going through English is preferred.

This is something that has been solved recently[1] by training ‘massively multilingual’ models. However, such models come with fairly stark compute costs, especially given the number of users Google Translate has, so it will take a while for these advances to roll out. Which ultimately points out how silly this article is: NLP is all the same in relevant matters, and (to first approximation) what works for English works for everything else too, as long as you can get enough training data.

[1] https://ai.googleblog.com/2019/10/exploring-massively-multil...


> what works for English works for everything else too

…if they're similar to English. Which isn't good enough: there are many important languages that are quite different from English.


They're all similar to English. Languages have their syntactic ambiguities in different places, but they're all fundamentally similar. In every case the hard part is in understanding the semantics expressed, and the characters used to express it are a side-issue.


They're similar at the atomic level, so to speak, but not very similar at the levels relevant for translation.

The characters used to express languages are certainly a side-issue, but I don't understand why you would bring that into a discussion of syntactic structure.

For compositional semantics, understanding the syntactic structures involved is crucial. Otherwise you end up with gibberish (i.e. what I find so often with Google Translate).


My claim isn't that it's easy to solve NLP; of course a resource-constrained system like Google Translate that indirects many language pairs through English is going to have glaring issues.

My claim is that the challenges generalize between languages. If you can handle English, you can handle any other language, and while the easy parts might at least look different at the surface, the hard parts are all the same.


This is assuming the problems are similar between languages. But are they? An NLP system tested only on English might be useless on CJK languages, because they do not use spaces, so the system cannot rely on the almost-free segmentation you get from English. Another example is that if you try a heavily-inflected language, you suddenly have vastly more forms of the same word than in English, and your system needs to be robust to that in a way that's unnecessary for English.
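The "almost-free segmentation" point can be seen in one line: whitespace tokenisation is a workable baseline for English, but applied to Chinese it returns the whole sentence as a single token. (The example sentence is borrowed from the segmentation discussion elsewhere in this thread.)

```python
# Whitespace tokenisation: nearly free for English, useless for Chinese.
english = "We'll talk about this soon."
chinese = "我们很快会谈到这个的。"

print(english.split())  # ["We'll", 'talk', 'about', 'this', 'soon.']
print(chinese.split())  # ['我们很快会谈到这个的。'] -- one giant token, no segmentation
```

Anything downstream of tokenisation (tagging, parsing, NER) inherits this gap: the English pipeline gets word boundaries for free, the Chinese one has to earn them.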


An NLP model that can't even (implicitly) segment CJK languages is laughable. It's like saying ‘sure, Magnus Carlsen is really good at chess, but in draughts you can take multiple moves in a row.’ If you can handle the ambiguities in natural casual English, you can handle a little inflection.


Translating via English consistently creates nonsense even between similar languages (English and Swedish!). Then again, Google Translate can't even get English right: it thinks “cheque” (the monetary instrument) can be translated to the Swedish verb corresponding to “check” (to look at) :D


Another problem affecting Chinese NLP is that text doesn't come pre-segmented with spaces between words, and segmentation errors can produce total nonsense.

E.g. 我们很快会谈到这个的。[0] "We'll talk about this soon." Should be segmented like 我们(we)很(very)快(quick)会(will)谈(talk)到(get to)这个(this)的(emphasis)。Literally, "We will very quickly get to talk about this!" But 会谈 is also a noun meaning "negotiation", so the sentence could end up as "We are very quick negotiation gets to this!" which is still kind of comprehensible, but loses the future aspect, so it might as well be a comment about how quickly the negotiation has progressed.

This kind of error affects all downstream tasks. Try doing named-entity recognition when names keep getting partially glued to surrounding words. Very annoying.

[0] from https://tatoeba.org/eng/sentences/show/7781672
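This failure mode is easy to reproduce with a toy forward-maximum-matching segmenter (a classic baseline algorithm; the mini-dictionary below is hypothetical). Because 会谈 ("negotiation") is itself a dictionary word, greedy longest-match fuses 会 + 谈 and produces exactly the wrong reading described above:

```python
# Toy forward-maximum-matching (FMM) segmenter with a hypothetical mini-dictionary.
DICT = {"我们", "很", "快", "很快", "会", "谈", "会谈", "到", "这个", "的"}

def fmm_segment(text: str, max_len: int = 4) -> list[str]:
    """Greedy left-to-right longest-match segmentation."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in DICT or j == i + 1:  # single-char fallback
                out.append(text[i:j])
                i = j
                break
    return out

print(fmm_segment("我们很快会谈到这个的"))
# -> ['我们', '很快', '会谈', '到', '这个', '的'] -- 会/谈 wrongly fused into 会谈
```

Real segmenters score whole segmentations (or skip explicit segmentation entirely with character-level models) precisely to avoid this kind of locally-greedy mistake, but ambiguous spans like 会谈到 remain genuinely hard.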


Another classical, interesting but evil example:

工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作

Literal translation of each character one by one:

工(industry)信(information)处(place)女(female)干(do)事(thing)每(every)月(month)经(through)过(pass)下(below)属(obey)科(division)室(room)都(always)要(would)亲(oneself)口(mouth)交(deliver/intersect)代(replace)24(24)口(mouth)交(deliver)换(exchange)机(machine)等(etc)技(ability)术(technique)性(nature/gender)器(device/organ)件(piece)的(of)安(install)装(costume)工(work)作(job)

Correct segmentation:

工信处(ministry of industry and information)女(female)干事(secretary)每月(every month)经过(pass by)下属(subordinate)科室(department)都要(would always)亲口(from her own lips)交代(arrange)24口交换机(24-port switch)等(etc)技术性(technological)器件(device)的(of)安装(installation)工作(work)

Every month when the female secretary of the ministry of industry and information passes by the subordinate department, she would always arrange in person the installation work of the 24-port switch and other technological devices.

Vulgar segmentation:

工信 处女(indoor female=virgin) 干事每 月经(month through=menstruation) 过下属科室都要亲 口交(mouth intersect=oral sex) 代24 口交(oral sex) 换机等技术 性器(sex organ=genital) 件的安装工作

Obviously Google Translate passed the test!


>> idioms and homophobes

Did you mean idiots?



I believe the GP was joking about great-GP's typoing of "homophones" as "homophobes" :)



