
OK, sounds counter-intuitive, but I'll take your word for it!

It seems odd, since word similarity captured this way usually rests on the idea that word meanings come from local context, which doesn't seem related to these global occurrence counts.

Perhaps it works because two words with similar occurrence counts are more likely to appear near each other than two words where one has a high count and the other a low count? But that wouldn't seem to work for small counts, and anyway the counts are just being added to the base index rather than making similar-count words closer in the embedding space.

Do you have any explanation for why this captures any similarity in meaning?



> rather than making similar-count words closer in the embedding space.

Ah, I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1 at that word's index, which gives absolutely no help with word similarity.
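
A minimal sketch of that base case, with a toy vocabulary and words I made up for illustration:

    # One-hot "embedding" of a single word over a toy vocabulary.
    vocab = ["gravity", "election", "senate", "orbit"]
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        vec = [0] * len(vocab)
        vec[word_to_index[word]] = 1
        return vec

    print(one_hot("gravity"))   # [1, 0, 0, 0]
    print(one_hot("election"))  # [0, 1, 0, 0]
    # Every pair of distinct words has dot product 0, so no word is
    # "closer" to any other - there's no similarity information at all.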

The problem of multiple meanings isn't solved by this approach at all, at least not directly.

Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the other words won't match, so your document vector is still more similar to other political pieces than to physics ones.

Going up the scale, here are a few basic starting points that were (are?) the backbone of many production text AI/ML systems (a rough sketch of all three follows the list).

1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.

2. Bag of words with a count. A little better: now we've got the information that you said "gravity" fifty times, not once. Normalise it so text length doesn't matter and everything fits into 0-1.

3. TF-IDF. It's not very useful to know that you said a common word a lot; most texts do. What we care about is texts that use a word more than you'd expect, so we also take into account how often each word appears in the entire corpus.
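
A rough sketch of all three, assuming a tiny made-up corpus, whitespace tokenisation, and a plain log IDF (all illustrative choices on my part, not anything specific to the parent's setup):

    import math
    from collections import Counter

    docs = [
        "the gravity of the situation in the senate",
        "gravity bends light near a massive star",
        "the senate passed the bill",
    ]
    tokenised = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenised for w in doc})

    def bag_of_words(tokens):
        # 1. Presence/absence: 1 if the word occurs, else 0.
        present = set(tokens)
        return [1 if w in present else 0 for w in vocab]

    def normalised_counts(tokens):
        # 2. Counts divided by document length, so everything is in 0-1.
        counts = Counter(tokens)
        return [counts[w] / len(tokens) for w in vocab]

    def tf_idf(tokens):
        # 3. Down-weight words that appear in many documents of the corpus.
        counts = Counter(tokens)
        vec = []
        for w in vocab:
            tf = counts[w] / len(tokens)
            df = sum(1 for doc in tokenised if w in doc)
            vec.append(tf * math.log(len(tokenised) / df))
        return vec

    print(vocab)
    print(bag_of_words(tokenised[0]))
    print([round(x, 2) for x in normalised_counts(tokenised[0])])
    for doc, tokens in zip(docs, tokenised):
        print(doc, "->", [round(x, 2) for x in tf_idf(tokens)])

With this toy corpus, common words like "the" get pushed down, and cosine similarity on the TF-IDF vectors puts the two senate sentences much closer to each other than either is to the physics one, even though "gravity" appears in both of the first two.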

These don't help with word similarity, but given how simple they are, they're shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they caused a problem.



