We can! At Kyutai, we released a real-time, on-device speech translation demo last week. For now, it works only for French-to-English translation, on an iPhone 16 Pro: https://x.com/neilzegh/status/1887498102455869775
We decided to keep the casing, as it is useful for some applications such as named entity recognition.
Regarding the punctuation, as pointed out in another comment, these tokens might also be useful for some applications (and they are easy to filter out if you don't need them).
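For example, filtering is just a pass over the .vec text file (the first line is "<num_words> <dim>", then one token per line followed by its values). A minimal Python sketch; the file names are placeholders for your own download:

    import string

    punct = set(string.punctuation)

    with open("wiki.en.vec", encoding="utf-8") as src:
        _, dim = src.readline().split()  # header: "<num_words> <dim>"
        kept = [line for line in src
                if not all(ch in punct for ch in line.split(" ", 1)[0])]

    # Rewrite the header with the corrected word count.
    with open("wiki.en.filtered.vec", "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)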
In the Tagalog file, } is near the top but { is over 8,000 lines down. Is there a reason they have such different frequencies? ( and ) are right next to each other.
And yes I realize this is a really odd question :)
Oh true. I tried to clean up Wiki markup for ML years ago and it was a huge pain. Next time I think I'll parse the HTML version and pull out the text from the tags explicitly.
This is a much better way to do it. It's easier, cleaner, and captures the text generated by templates, of which there is a surprising amount (otherwise you get weird artifacts from that).
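If it helps, a minimal sketch of that approach with BeautifulSoup; the article URL and the choice of tags to drop are just illustrative:

    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Fetch the rendered HTML, so template-generated text is already expanded.
    html = requests.get("https://en.wikipedia.org/wiki/Word_embedding").text
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-prose elements before extracting text
    # (e.g. <sup> removes reference markers like [1]).
    for tag in soup(["table", "style", "script", "sup"]):
        tag.decompose()

    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
    text = "\n".join(p for p in paragraphs if p)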
Hi, because we trained these vectors on Wikipedia, we released models corresponding to the 90 largest Wikipedias first (in terms of training data size). More models are on the way, including Irish.
I suspected it was something like this. Unfortunately the Vicipéid is not of very high quality. I just hope Facebook doesn't forget which side its bread is buttered on.
Models are trained independently for each language. So unfortunately, you cannot directly compare words from different languages using these vectors.
If you have a bilingual dictionary, you might try to learn a linear mapping from one language to the other (e.g. see https://arxiv.org/abs/1309.4168 for this approach).
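A rough sketch of that approach, assuming you've already built matrices of dictionary-paired vectors (the shapes and random stand-ins below are just for illustration):

    import numpy as np

    # Rows of X are source-language vectors, rows of Y the corresponding
    # target-language vectors, paired via the bilingual dictionary.
    X = np.random.randn(5000, 300)  # stand-in for e.g. French vectors
    Y = np.random.randn(5000, 300)  # stand-in for e.g. English vectors

    # Least-squares solution of min_W ||XW - Y||_F^2, as in the linked paper.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Map a source vector into the target space, then nearest-neighbor
    # search among target vectors gives you a "translation".
    mapped = X[0] @ W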
The graph algorithm described in the blogpost is more closely related to label propagation (which is more than 10 years old) than to "retrofitting". And the Google paper linked in the blogpost cites the relevant literature correctly.
I probably sounded more accusatory than I should have, and I apologize for that wording.
But I do think this is much more like retrofitting than like label propagation. It's the vectors that are being propagated, as I understand it, not labels.
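Roughly what I mean, as a toy sketch (this is my reading of it, not the blogpost's actual algorithm): the quantities flowing along the edges are continuous vectors, and the observed nodes stay anchored to their original embeddings, which is the retrofitting flavor:

    import numpy as np

    # Toy graph: node -> neighbors. Everything here is made up for illustration.
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    V = np.zeros((4, 3))
    V[0] = [1.0, 0.0, 0.0]  # seed node with a known embedding
    V[3] = [0.0, 1.0, 0.0]  # another seed
    seeds = {0, 3}
    alpha = 0.5  # mixing weight between neighbor average and current value

    for _ in range(50):
        new = V.copy()
        for node, nbrs in neighbors.items():
            if node in seeds:
                continue  # seeds keep their vectors fixed
            new[node] = alpha * V[nbrs].mean(axis=0) + (1 - alpha) * V[node]
        V = new

    # Non-seed nodes end up with vectors interpolated between the seeds.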
We released inference code and weights; you can check out our GitHub here: https://github.com/kyutai-labs/hibiki