
OK, sounds counter-intuitive, but I'll take your word for it!

It seems odd, since word similarity captured this way usually rests on the idea that word meanings come from local context, which doesn't seem related to these global occurrence counts.

Perhaps it works because two words with similar occurrence counts are more likely to appear near each other than two words where one has a high count and the other a low count? But that wouldn't seem to work for small counts, and anyway the counts are just being added to the base index rather than making similar-count words closer in the embedding space.

Do you have any explanation for why this captures any similarity in meaning?



> rather than making similar-count words closer in the embedding space.

Ah, I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1 at that word's index, which gives absolutely no help with word similarity.
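
A minimal sketch of that base case, with a toy vocabulary and words I made up for illustration:

    # One-hot "embedding" of a single word over a toy vocabulary.
    vocab = ["gravity", "election", "senate", "orbit"]
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        vec = [0] * len(vocab)
        vec[word_to_index[word]] = 1
        return vec

    print(one_hot("gravity"))   # [1, 0, 0, 0]
    print(one_hot("election"))  # [0, 1, 0, 0]
    # Every pair of distinct words has dot product 0, so no word is
    # "closer" to any other - there's no similarity information at all.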

The problem of multiple meanings isn't solved by this approach at all, at least not directly.

Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the other words won't match, so your document vector is still more similar to other political pieces than to physics ones.

Going up the scale, here are a few basic starting points that were (are?) the backbone of many production text AI/ML systems (a rough sketch of all three follows the list).

1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.

2. Bag of words with a count. A little better: now we've got the information that you said "gravity" fifty times, not once. Normalise it so text length doesn't matter and everything fits into 0-1.

3. TF-IDF. It's not very useful to know that you said a common word a lot; most texts do. What we care about is texts that use a word more than you'd expect, so we also take into account how often each word appears in the entire corpus.
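
A rough sketch of all three, assuming a tiny made-up corpus, whitespace tokenisation, and a plain log IDF (all illustrative choices on my part, not anything specific to the parent's setup):

    import math
    from collections import Counter

    docs = [
        "the gravity of the situation in the senate",
        "gravity bends light near a massive star",
        "the senate passed the bill",
    ]
    tokenised = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})

    def bag_of_words(tokens):
        # 1. Presence/absence: 1 if the word occurs, else 0.
        present = set(tokens)
        return [1 if w in present else 0 for w in vocab]

    def normalised_counts(tokens):
        # 2. Counts divided by document length, so everything is in 0-1.
        counts = Counter(tokens)
        return [counts[w] / len(tokens) for w in vocab]

    def tf_idf(tokens):
        # 3. Down-weight words that appear in many documents of the corpus.
        counts = Counter(tokens)
        vec = []
        for w in vocab:
            tf = counts[w] / len(tokens)
            df = sum(1 for doc in tokenised if w in doc)
            vec.append(tf * math.log(len(tokenised) / df))
        return vec

    print(vocab)
    print(bag_of_words(tokenised[0]))
    print([round(x, 2) for x in normalised_counts(tokenised[0])])
    for doc, tokens in zip(docs, tokenised):
        print(doc, "->", [round(x, 2) for x in tf_idf(tokens)])

With this toy corpus, common words like "the" get pushed down, and cosine similarity on the TF-IDF vectors puts the two senate sentences much closer to each other than either is to the physics one, even though "gravity" appears in both of the first two.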

These don't help with word similarity, but given how simple they are, they're shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they caused a problem.



