
There's a really cool pipeline for shepherding Toolbox-cataloged utterances and definitions into Delphin-project symbolic grammars, which basically auto-generates morphological models and lets you do TDD on the grammars. Took a class from Emily Bender where we built a grammar of Chintang with this system, and were able to do simple machine translation via an intermediary language to other grammars built in the class, e.g. for Nuuchahnulth and other small languages. It's a really cool system for machine understanding of languages that will never have enough tagged text for a machine learning system.


When you say "basically auto generate morphological models", how automatic is the process, exactly? I've been asked to help write a stemmer for Berber (which I do not speak) for use on https://tatoeba.org, and I'm wondering whether this Toolbox and/or Delphin (http://www.delph-in.net, right?) would be useful for that.


For morphology (esp. languages with complicated morphologies, like Berber), the tools that most computational linguists reach for are finite state transducers, particularly ones built for use in morphology and phonology. An early one of these was the Xerox xfst/lexc program, which has since been re-implemented in open-source form as Foma (https://fomafst.github.io/). The book on xfst/lexc, https://www.press.uchicago.edu/ucp/books/book/distributed/F/..., is probably still the best place to go for a tutorial. Other FST programs that have been used for morphology and phonology include the Stuttgart FST (sfst, https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkze...) and the Helsinki HFST (http://hfst.github.io/). HFST allows the use of weights, which can be useful for spell correction.
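
To give a flavor of the lexc/xfst division of labor, here's a toy sketch in plain Python (not real foma syntax; the English-plural lexicon and rule are just made up for illustration). A real FST toolkit compiles the lexicon and the ordered rewrite rules into a single composed transducer:

    import re

    # Toy sketch of the lexc/xfst idea: a lexicon pairs analyses with stems,
    # affixation adds morphemes, and ordered rewrite rules handle the phonology.
    LEXICON = {"cat+N": "cat", "fox+N": "fox", "walk+V": "walk"}

    def generate(analysis):
        """'fox+N+Pl' -> 'foxes' (generation direction only, for brevity)."""
        if analysis.endswith("+Pl"):
            form = LEXICON[analysis[:-len("+Pl")]] + "^s"    # ^ marks the morpheme boundary
        else:
            form = LEXICON[analysis]
        form = re.sub(r"([sxz]|ch|sh)\^s$", r"\1^es", form)  # epenthesis rule: fox^s -> fox^es
        return form.replace("^", "")                         # clean up boundaries

    print(generate("fox+N+Pl"))   # foxes
    print(generate("cat+N+Pl"))   # cats

An analyzer is just the same machinery run in the other direction, which is where a real FST compiler earns its keep.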

I've built morph parsers with all of these except HFST, although that's next on my list.

I'm not familiar with Delphin, but a quick glance at their website implies that it's for syntax, not so much for morphology. They mention a Japanese grammar implemented in Delphin, but it uses a separate tool for morphology.

In answer to your other question: the last time I looked, machine learning of morph parsers (or stemmers, which are like morph parsers that throw away the affixal information) was reasonably good for "fusional" morphologies, which most modern Indo-European languages have. I don't think the state-of-the-art ML would work well for Berber, because of its much more complex morphology.
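
Just to make the parser/stemmer distinction concrete (a toy Python sketch with made-up English suffix rules, nothing to do with actual Berber morphology):

    # A morph parser returns the full analysis; a stemmer keeps only the stem.
    SUFFIX_TAGS = [("ing", "+Prog"), ("ed", "+Past"), ("s", "+Pl")]

    def parse(word):
        for suffix, tag in SUFFIX_TAGS:
            if word.endswith(suffix):
                return word[:-len(suffix)] + tag
        return word

    def stem(word):
        return parse(word).split("+")[0]   # throw away the affixal information

    print(parse("walked"))   # walk+Past
    print(stem("walked"))    # walk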


The Delph-in system has a morphological model based on position classes. Morphological items are treated similarly to syntactic items, in that they can hold constraints on, and apply features to, whatever they attach to.
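
Roughly like this, if I remember right (a made-up Python sketch of the position-class idea, not DELPH-IN's actual TDL notation): each slot constrains what it can attach to and contributes its own features, much like a syntax rule.

    # Made-up sketch: an affix slot checks constraints on its input's features
    # and adds features of its own, the way a syntactic rule would.
    def attach(form, feats, slot):
        suffix, required, added = slot
        if any(feats.get(k) != v for k, v in required.items()):
            raise ValueError("slot %r can't attach here" % suffix)
        return form + suffix, {**feats, **added}

    PAST = ("ed", {"cat": "verb", "tense": None}, {"tense": "past"})

    form, feats = attach("walk", {"cat": "verb", "tense": None}, PAST)
    print(form, feats)   # walked {'cat': 'verb', 'tense': 'past'}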


Interesting, I hadn't seen that. Does it handle phonologically conditioned allomorphy or inflection classes? Stem allomorphy conditioned by phonology or by position in the paradigm (like the stem allomorphs of the Spanish verbs 'tener', 'crecer', etc.)?
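
For concreteness, the sort of thing I have in mind (toy Python; the paradigm-cell labels are ad hoc, and I have no idea whether Delphin would express this the same way):

    # Stem allomorphy conditioned by the paradigm cell: 'tener' selects a
    # different stem per cell, and the regular ending attaches to it.
    TENER_STEMS = {"pres.1sg": "teng", "pres.3sg": "tien", "pret.1sg": "tuv", "fut.1sg": "tendr"}
    ENDINGS     = {"pres.1sg": "o",    "pres.3sg": "e",    "pret.1sg": "e",   "fut.1sg": "é"}

    for cell, stem in TENER_STEMS.items():
        print(cell, stem + ENDINGS[cell])   # tengo, tiene, tuve, tendré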


I think the auto-derivation of morphological rules is still an active research project and not public yet, unfortunately. But it's been a couple of years, so I may not be up to date.


Thanks for the clarification. Am I right in assuming that if the research has been published, it would be found somewhere on Emily Bender's page? http://faculty.washington.edu/ebender/


Yeah I actually found a link to the overview page: http://depts.washington.edu/uwcl/aggregation/



