I've done the same but was very disappointed with the stock sentence-embedding results. You can get an embedding for any arbitrary string, but then the cosine similarity used for nearest-neighbor lookup gives a lot of false positives and negatives.
*There are 2 reasons:*
1. All embeddings from these models occupy a narrow cone of the total embedding space. Check out the cosine similarity of any 2 arbitrary strings. It'll be incredibly high! Even between gibberish and sensible sentences (see the snippet after this list).
2. The datasets these SentenceTransformers are trained on don't include much code, and certainly not intentionally. At least I haven't found a code-focused one yet.
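Here's a quick way to see the narrow-cone effect for yourself (the model name is just an example; swap in whatever stock SentenceTransformer you use):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, not an endorsement
a = model.encode("def load_config(path): return json.load(open(path))")
b = model.encode("florb wizzle quux grontle bap")  # pure gibberish
print(util.cos_sim(a, b))  # typically well above the ~0 you'd expect for unrelated strings
```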
*There are solutions I've tried with mixed results:*
1. Embedding "whitening" decorrelates the embeddings, forcing them to be nearly orthogonal to each other. If you truncate the whitened embeddings, keeping just the dimensions with the top n eigenvalues, you get a sort of semantic compression that improves results (sketch below).
2. Train a super-light neural net on your codebase's embeddings (a few layers, takes seconds to train) to improve nearest-neighbor results. I suspect this helps because it re-biases the space toward distinguishing just among your codebase's embeddings (sketch below).
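A minimal sketch of the whitening transform (the standard SVD-of-covariance recipe; `k` is how many top eigen-directions to keep):

```python
import numpy as np

def whiten(embs, k=None):
    """PCA-whiten an (n, d) embedding matrix; optionally keep only the top-k dims."""
    mu = embs.mean(axis=0, keepdims=True)
    cov = np.cov(embs - mu, rowvar=False)
    u, s, _ = np.linalg.svd(cov)   # eigen-directions, largest eigenvalue first
    w = u / np.sqrt(s)             # rescale each direction to unit variance
    if k is not None:
        w = w[:, :k]               # the "semantic compression": top-k only
    return (embs - mu) @ w, mu, w  # keep mu and w to transform queries too

# queries must go through the same transform before cosine similarity:
# q_white = (q - mu) @ w
```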
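And a minimal sketch of the light neural net, assuming an autoencoder objective (that objective is an assumption for illustration; the bottleneck becomes your new search space):

```python
import torch
import torch.nn as nn

class TinyProjector(nn.Module):
    """A few-layer net; the autoencoder loss is just one plausible objective."""
    def __init__(self, d, k=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, k), nn.Tanh())
        self.dec = nn.Linear(k, d)

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def fit(embs, epochs=200, lr=1e-3):
    x = torch.as_tensor(embs, dtype=torch.float32)
    net = TinyProjector(d=x.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):  # seconds of training on a codebase-sized corpus
        recon, _ = net(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    return net  # then do nearest-neighbor search on net.enc(x) instead of x
```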
*There are solutions from the literature I am working on next that I find conceptually more promising:*
1. Chunk the codebase, and for each chunk ask an LLM to "generate a question to which this code is the answer". Then do the natural-language lookup against the questions, and return the code for the best match (sketch below).
2. You have your code-lookup query. Ask an LLM to "generate a fabricated answer to this question". Then embed its answer, and use that vector to do your lookup (sketch below).
3. Use the AST of the code to further inform the embeddings (sketch below).
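A sketch of idea 1, assuming some `llm(prompt)` callable for whatever local model you run (the prompt wording is illustrative):

```python
def index_chunk(chunk: str, llm) -> tuple[str, str]:
    """Build a (question, code) pair: search the questions, return the code."""
    question = llm("Generate a question to which this code is the answer:\n\n" + chunk)
    return question, chunk

# At query time: embed the user's natural-language query, nearest-neighbor it
# against the embedded generated questions, and return each hit's code chunk.
```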
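Idea 2 is basically HyDE (Hypothetical Document Embeddings); same assumptions as above, plus an `embed` function:

```python
def hyde_query(query: str, llm, embed):
    """Embed a fabricated answer rather than the query itself."""
    fake_answer = llm("Generate a fabricated answer to this question:\n\n" + query)
    return embed(fake_answer)  # use this vector for the nearest-neighbor lookup
```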
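For idea 3, one way to let the AST inform embeddings is to prepend structural facts to each chunk before embedding it. A Python-only sketch using the stdlib `ast` module (exactly how to fold in the AST is still an open question):

```python
import ast

def ast_summary(source: str) -> str:
    """Prefix a chunk with its defined names so they weigh into the embedding."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            facts.append(f"function {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            facts.append(f"class {node.name}")
    return "; ".join(facts) + "\n" + source

# then: embed(ast_summary(chunk)) instead of embed(chunk)
```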
I have this in my project UniteAI [1] and would love it if you cared to collab on improving it (either directly, or via your repo, with UniteAI then taking a dependency on it). I'm actually trying to collab more, so this offer goes out to anyone! I think the way the future of AI ends up owned by us is through these local-first projects and building strong communities.
[1] https://github.com/freckletonj/uniteai