
There's a really cool pipeline for shepherding Toolbox-cataloged utterances and definitions into Delphin-project symbolic grammars, which basically auto-generates morphological models and lets you do TDD on the grammars. Took a class from Emily Bender where we built a grammar of Chintang with this system, and were able to do simple machine translation via an intermediary language to other grammars built in the class, e.g. for Nuuchahnulth and other small languages. It's a really cool system for machine understanding of languages that will never have enough tagged text for a machine learning system.


When you say "basically auto generate morphological models", how automatic is the process, exactly? I've been asked to help write a stemmer for Berber (which I do not speak) for use on https://tatoeba.org, and I'm wondering whether this Toolbox and/or Delphin (http://www.delph-in.net, right?) would be useful for that.


For morphology (esp. languages with complicated morphologies, like Berber), the tools that most computational linguists reach for are finite state transducers, particularly ones built for use in morphology and phonology. An early one of these was the Xerox xfst/lexc program, which has since been re-implemented in open-source form as Foma (https://fomafst.github.io/). The book on xfst/lexc, https://www.press.uchicago.edu/ucp/books/book/distributed/F/..., is probably still the best place to go for a tutorial. Other FST programs that have been used for morphology and phonology include the Stuttgart FST (sfst, https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkze...) and the Helsinki HFST (http://hfst.github.io/). HFST allows the use of weights, which can be useful for spell correction.
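
To give a flavor of the lexc/xfst division of labor, here's a toy sketch in plain Python (not real foma syntax; the English-plural lexicon and rule are just made up for illustration). A real FST toolkit compiles the lexicon and the ordered rewrite rules into a single composed transducer:

    import re

    # Toy sketch of the lexc/xfst idea: a lexicon pairs analyses with stems,
    # affixation adds morphemes, and ordered rewrite rules handle the phonology.
    LEXICON = {"cat+N": "cat", "fox+N": "fox", "walk+V": "walk"}

    def generate(analysis):
        """'fox+N+Pl' -> 'foxes' (generation direction only, for brevity)."""
        if analysis.endswith("+Pl"):
            form = LEXICON[analysis[:-len("+Pl")]] + "^s"    # ^ marks the morpheme boundary
        else:
            form = LEXICON[analysis]
        form = re.sub(r"([sxz]|ch|sh)\^s$", r"\1^es", form)  # epenthesis rule: fox^s -> fox^es
        return form.replace("^", "")                         # clean up boundaries

    print(generate("fox+N+Pl"))   # foxes
    print(generate("cat+N+Pl"))   # cats

An analyzer is just the same machinery run in the other direction, which is where a real FST compiler earns its keep.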

I've built morph parsers with all of these except HFST, although that's next on my list.

I'm not familiar with Delphin, but a quick glance at their website implies that it's for syntax, not so much for morphology. They mention a Japanese grammar implemented in Delphin, but it uses a separate tool for morphology.

In answer to your other question: the last time I looked, machine learning of morph parsers (or stemmers, which are like morph parsers that throw away the affixal information) was reasonably good for "fusional" morphologies, which most modern Indo-European languages have. I don't think the state-of-the-art ML would work well for Berber, because of its much more complex morphology.
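
Just to make the parser/stemmer distinction concrete (a toy Python sketch with made-up English suffix rules, nothing to do with actual Berber morphology):

    # A morph parser returns the full analysis; a stemmer keeps only the stem.
    SUFFIX_TAGS = [("ing", "+Prog"), ("ed", "+Past"), ("s", "+Pl")]

    def parse(word):
        for suffix, tag in SUFFIX_TAGS:
            if word.endswith(suffix):
                return word[:-len(suffix)] + tag
        return word

    def stem(word):
        return parse(word).split("+")[0]   # throw away the affixal information

    print(parse("walked"))   # walk+Past
    print(stem("walked"))    # walk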


The Delph-in system has a morphological model based on position classes. Morphological items are treated similarly to syntactic items, in that they can hold constraints on, and apply features to, whatever they attach to.
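
Roughly like this, if I remember right (a made-up Python sketch of the position-class idea, not DELPH-IN's actual TDL notation): each slot constrains what it can attach to and contributes its own features, much like a syntax rule.

    # Made-up sketch: an affix slot checks constraints on its input's features
    # and adds features of its own, the way a syntactic rule would.
    def attach(form, feats, slot):
        suffix, required, added = slot
        if any(feats.get(k) != v for k, v in required.items()):
            raise ValueError("slot %r can't attach here" % suffix)
        return form + suffix, {**feats, **added}

    PAST = ("ed", {"cat": "verb", "tense": None}, {"tense": "past"})

    form, feats = attach("walk", {"cat": "verb", "tense": None}, PAST)
    print(form, feats)   # walked {'cat': 'verb', 'tense': 'past'}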


Interesting, I hadn't seen that. Does it handle phonologically conditioned allomorphy or inflection classes? Stem allomorphy conditioned by phonology or by position in the paradigm (like the stem allomorphs of the Spanish verbs 'tener', 'crecer', etc.)?
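
For concreteness, the sort of thing I have in mind (toy Python; the paradigm-cell labels are ad hoc, and I have no idea whether Delphin would express this the same way):

    # Stem allomorphy conditioned by the paradigm cell: 'tener' selects a
    # different stem per cell, and the regular ending attaches to it.
    TENER_STEMS = {"pres.1sg": "teng", "pres.3sg": "tien", "pret.1sg": "tuv", "fut.1sg": "tendr"}
    ENDINGS     = {"pres.1sg": "o",    "pres.3sg": "e",    "pret.1sg": "e",   "fut.1sg": "é"}

    for cell, stem in TENER_STEMS.items():
        print(cell, stem + ENDINGS[cell])   # tengo, tiene, tuve, tendré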


I think the auto-derivation of morphological rules is still an active research project and not public yet, unfortunately. But it's been a couple of years, so I may not be up to date.


Thanks for the clarification. Am I right in assuming that if the research has been published, it would be found somewhere on Emily Bender's page? http://faculty.washington.edu/ebender/


Yeah I actually found a link to the overview page: http://depts.washington.edu/uwcl/aggregation/



