We can! At Kyutai, we released a real-time, on-device speech translation demo last week. For now, it works only for French-to-English translation, on an iPhone 16 Pro: https://x.com/neilzegh/status/1887498102455869775
We decided to keep the casing, as it is useful for some applications such as named entity recognition.
Regarding the punctuation, as pointed out in another comment, these tokens might also be useful for some applications (and they are easy to filter out if you don't need them).
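For example, filtering is just a pass over the .vec text file (the first line is "<num_words> <dim>", then one token per line followed by its values). A minimal Python sketch; the file names are placeholders for your own download:

    import string

    punct = set(string.punctuation)

    with open("wiki.en.vec", encoding="utf-8") as src:
        _, dim = src.readline().split()  # header: "<num_words> <dim>"
        kept = [line for line in src
                if not all(ch in punct for ch in line.split(" ", 1)[0])]

    # Rewrite the header with the corrected word count.
    with open("wiki.en.filtered.vec", "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)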
In the Tagalog file, } is near the top but { is over 8,000 lines down. Is there a reason they have such different frequencies? ( and ) are right next to each other.
And yes I realize this is a really odd question :)
Oh true. I tried to clean up Wiki markup for ML years ago and it was a huge pain. Next time I think I'll parse the HTML version and pull out the text from the tags explicitly.
This is a much better way to do it. It's easier, cleaner, and captures the text generated by templates, of which there is a surprising amount (otherwise you get weird artifacts from that).
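If it helps, a minimal sketch of that approach with BeautifulSoup; the article URL and the choice of tags to drop are just illustrative:

    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Fetch the rendered HTML, so template-generated text is already expanded.
    html = requests.get("https://en.wikipedia.org/wiki/Word_embedding").text
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-prose elements before extracting text
    # (e.g. <sup> removes reference markers like [1]).
    for tag in soup(["table", "style", "script", "sup"]):
        tag.decompose()

    paragraphs = [p.get_text(" ", strip=True) for p in soup.select("p")]
    text = "\n".join(p for p in paragraphs if p)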
Hi, because we trained these vectors on Wikipedia, we released models corresponding to the 90 largest Wikipedias first (in terms of training data size). More models are on the way, including Irish.
I suspected it was something like this. Unfortunately the Vicipéid is not of very high quality. I just hope Facebook doesn't forget which side its bread is buttered on.
Models are trained independently for each language. So unfortunately, you cannot directly compare words from different languages using these vectors.
If you have a bilingual dictionary, you might try to learn a linear mapping from one language to the other (e.g. see https://arxiv.org/abs/1309.4168 for this approach).
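A rough sketch of that approach, assuming you've already built matrices of dictionary-paired vectors (the shapes and random stand-ins below are just for illustration):

    import numpy as np

    # Rows of X are source-language vectors, rows of Y the corresponding
    # target-language vectors, paired via the bilingual dictionary.
    X = np.random.randn(5000, 300)  # stand-in for e.g. French vectors
    Y = np.random.randn(5000, 300)  # stand-in for e.g. English vectors

    # Least-squares solution of min_W ||XW - Y||_F^2, as in the linked paper.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Map a source vector into the target space, then nearest-neighbor
    # search among target vectors gives you a "translation".
    mapped = X[0] @ W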
The graph algorithm described in the blogpost is more closely related to label propagation (which is more than 10 years old) than to "retrofitting". And the Google paper linked in the blogpost cites the relevant literature correctly.
I probably sounded more accusatory than I should have, and I apologize for that wording.
But I do think this is much more like retrofitting than like label propagation. It's the vectors that are being propagated, as I understand it, not labels.
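Roughly what I mean, as a toy sketch (this is my reading of it, not the blogpost's actual algorithm): the quantities flowing along the edges are continuous vectors, and the observed nodes stay anchored to their original embeddings, which is the retrofitting flavor:

    import numpy as np

    # Toy graph: node -> neighbors. Everything here is made up for illustration.
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    V = np.zeros((4, 3))
    V[0] = [1.0, 0.0, 0.0]  # seed node with a known embedding
    V[3] = [0.0, 1.0, 0.0]  # another seed
    seeds = {0, 3}
    alpha = 0.5  # mixing weight between neighbor average and current value

    for _ in range(50):
        new = V.copy()
        for node, nbrs in neighbors.items():
            if node in seeds:
                continue  # seeds keep their vectors fixed
            new[node] = alpha * V[nbrs].mean(axis=0) + (1 - alpha) * V[node]
        V = new

    # Non-seed nodes end up with vectors interpolated between the seeds.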
We released inference code and weights; you can check out our GitHub here: https://github.com/kyutai-labs/hibiki