
I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.

I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with the highest values will be for very common words such as “a”, “be”, etc. (https://en.wikipedia.org/wiki/Most_common_words_in_English)

A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
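That TF-IDF-plus-PCA pipeline can be sketched in a few lines of NumPy. (The toy corpus, the whitespace tokenizer, and n = 2 are my own illustrative choices, not from this thread; a real corpus would have the ~50,000-word vocabulary mentioned above, and you'd use a proper TF-IDF/SVD library.)

```python
import numpy as np

# Toy corpus; a real one would have tens of thousands of vocabulary words.
corpus = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell on wall street",
    "wall street stocks rallied",
]

# Build the vocabulary and a raw term-frequency matrix (docs x words).
vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}
tf = np.zeros((len(corpus), len(vocab)))
for d, doc in enumerate(corpus):
    for w in doc.split():
        tf[d, index[w]] += 1

# TF-IDF: down-weight terms that appear in many documents.
df = np.count_nonzero(tf, axis=0)
idf = np.log(len(corpus) / df)
tfidf = tf * idf

# PCA via SVD: project each document onto the top n principal components.
centered = tfidf - tfidf.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
n = 2  # n << vocabulary size
reduced = centered @ Vt[:n].T  # each document is now an n-dimensional vector
```

After the projection, the two cat sentences should still sit closer to each other than to the finance sentences, which is the "keeping similar texts close together" property the comment describes.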



> I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

You can simplify this with a map and only store non-zero values, and it's fine to be inefficient: this is for learning. You can also choose to store more valuable information than just word counts. You can store any "feature" you want - various tags on a post, cohort topics for advertising, bucketed timestamps, etc.

For learning, just storing word counts gives you the mechanics you need for understanding vectors, without actually involving neural networks and weights.
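A minimal sketch of that map-based version, storing only non-zero counts and comparing texts with cosine similarity (the toy sentences and the naive whitespace tokenizer are my own, purely for illustration):

```python
import math
from collections import Counter

def count_vector(text):
    """Sparse word-count 'embedding': a map from word to count.
    Only non-zero dimensions are stored."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = count_vector("the cat sat on the mat")
doc2 = count_vector("the cat lay on the rug")
doc3 = count_vector("stock prices fell sharply today")
```

With these toy inputs, doc1 and doc2 score higher with each other than either does with doc3, which is all the mechanics you need to see before moving on to learned embeddings.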

> I also doubt your claim “and the results aren't totally terrible”.

> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc

(1) the comment suggested filtering out these words, and (2) the results aren't terrible: this is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.

> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.

Wow, that seems a lot more complicated for something that was supposed to be a learning exercise.

[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...


>> I also doubt your claim “and the results aren't totally terrible”.

>> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc

> (1) the comment suggested filtering out these words,

It mentions that as a possible improvement over the version whose results it claims aren’t terrible.



