Oh interesting, so you're not using Llama 2, you're using the original. Have you begun to evaluate Llama 2 to determine the differences in performance?
How are you determining which notes (or snippets of notes?) to inject as context? Especially given the small 2048-token context limit with Llama 1.
Quick clarification: we are using Llama 2 7B. We didn't experiment with Llama 1 because we weren't sure of its licensing limitations.
We determine note relevance by cosine similarity between the query embedding and the knowledge base (your note embeddings). We limit the context for Llama 2 to 3 notes (while OpenAI models might comfortably take up to 9). The notes are ranked from most to least similar and truncated to fit the context window limit. For the model we're using, we're still limited to 2048 tokens with Llama 2.
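The ranking step described above can be sketched roughly like this: score each note embedding by cosine similarity against the query embedding, then keep the top k (3 for Llama 2 in this setup). This is a minimal illustration, not the actual implementation; the function names and toy embeddings are made up for the example.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_notes(query_emb, note_embs, k=3):
    """Return indices of the k most similar notes, most similar first."""
    ranked = sorted(
        range(len(note_embs)),
        key=lambda i: cosine_similarity(query_emb, note_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional embeddings for illustration only.
query = [1.0, 0.0, 0.0]
notes = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.1], [0.5, 0.5, 0.0]]
print(top_k_notes(query, notes, k=3))  # → [2, 0, 3]
```

In practice the selected notes would then be concatenated (most similar first) and truncated so the prompt stays under the model's 2048-token limit.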