Hacker News

"With FLASHATTENTION (Dao et al., 2022), there is negligible GPU memory overhead as we increase the sequence length and we observe around 17% speed loss when increasing the sequence length from 4,096 to 16,384 for the 70B model."

"For the 7B/13B models, we use learning rate 2e−5 and a cosine learning rate schedule with 2000 warm-up steps. For the larger 34B/70B models, we find it important to set a smaller learning rate (1e−5) to get monotonically decreasing validation losses."
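The warm-up plus cosine schedule they describe can be sketched as a small function. This is a minimal sketch of a generic linear-warmup/cosine-decay schedule, not the paper's exact implementation; the `min_lr_ratio` floor and `total_steps` value are assumptions for illustration.

```python
import math

def lr_at_step(step, max_lr, total_steps, warmup_steps=2000, min_lr_ratio=0.1):
    """Cosine learning-rate schedule with linear warm-up (illustrative sketch).

    The min_lr_ratio floor is an assumption, not stated in the quote.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to max_lr over the first `warmup_steps` steps.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down toward min_lr_ratio * max_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return max_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)

# 7B/13B: peak LR 2e-5; 34B/70B: peak LR 1e-5 (per the quote above).
print(lr_at_step(0, 2e-5, 100_000))     # 0.0 at the start of warm-up
print(lr_at_step(2000, 2e-5, 100_000))  # peak 2e-5 right after warm-up
```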

"In the training curriculum ablation study, models trained with a fixed context window of 32k from scratch required 3.783 × 10^22 FLOPs and achieved performance metrics like 18.5 F1 on NarrativeQA, 28.6 F1 on Qasper, and 37.9 EM on Quality."

"Continual pretraining from short context models can easily save around 40% FLOPs while imposing almost no loss on performance."
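Back-of-envelope check: combining the two quotes, a ~40% saving on the 3.783e22 FLOPs from-scratch budget works out to roughly 2.3e22 FLOPs. The percentage is the paper's approximate figure, so the result is only a rough estimate.

```python
from_scratch_flops = 3.783e22  # from-scratch 32k training cost, per the ablation quote
savings = 0.40                 # "around 40%" claimed savings (approximate)

continual_flops = from_scratch_flops * (1 - savings)
print(f"{continual_flops:.3e} FLOPs")  # roughly 2.27e22 FLOPs
```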

"Through early experiments at the 7B scale, we identified a key limitation of LLAMA 2’s positional encoding (PE) that prevents the attention module from aggregating information of distant tokens. We adopt a minimal yet necessary modification on the RoPE positional encoding (Su et al., 2022) for long-context modeling – decreasing the rotation angle."
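"Decreasing the rotation angle" amounts to raising the RoPE base: each pair of dimensions rotates by theta_i = position / base**(2i/dim), so a larger base shrinks every angle and lets attention reach more distant tokens before the rotations wrap around. A minimal sketch, assuming the standard RoPE angle formula; the adjusted base of 500000 below is illustrative, not quoted in this excerpt.

```python
def rope_angles(dim, position, base=10000.0):
    """Per-pair RoPE rotation angles at a token position.

    theta_i = position / base**(2i/dim). Raising `base` decreases the
    rotation angle for every i > 0, which is the modification described.
    """
    return [position / base ** (2 * i / dim) for i in range(dim // 2)]

pos, dim = 8192, 128
orig = rope_angles(dim, pos)                    # LLaMA 2 default base 10000
adjusted = rope_angles(dim, pos, base=500000)   # larger base -> smaller angles
assert all(a < o for a, o in zip(adjusted[1:], orig[1:]))
```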

Pretty exciting stuff. Hopefully this gets close to GPT-4 soon!


