
Yes, though empirically I've noticed that if you don't exchange that sequence length gain for much larger batch sizes up front, your overall performance will degrade...
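To make the trade being described concrete, here is a minimal sketch (my own illustration, not from the parent comment, with purely illustrative numbers): if a change frees up sequence length, you can spend it on a larger batch while keeping tokens-per-step constant.

    # Hypothetical helper: keep tokens-per-step fixed when trading
    # sequence length for batch size (or vice versa).
    def rescale_batch(base_batch: int, base_seq_len: int, new_seq_len: int) -> int:
        """Return a batch size that keeps tokens-per-step roughly constant."""
        tokens_per_step = base_batch * base_seq_len
        return tokens_per_step // new_seq_len

    if __name__ == "__main__":
        # e.g. baseline of 32 sequences x 4096 tokens = 131072 tokens per step;
        # dropping to 1024-token sequences allows a batch of 128.
        print(rescale_batch(base_batch=32, base_seq_len=4096, new_seq_len=1024))  # -> 128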


I'd imagine performance will suffer, but how does that relate to overall training cost? I.e., with the same compute budget, which approach would produce better overall performance?

Expanding on this idea, do you think it makes sense to explore what could be called a "bootstrapping phase" or "progressive training" for foundation model training:

- starting with a small number of weights that is increased as training progresses

- arranging the training data with basics first: short sentences, logic, grammar, arithmetic, naive knowledge ("a bear is an animal", etc.), increasing in complexity as training progresses

- increasing the context length, ideally implicitly, driven by increasing sample sizes (a rough sketch of such a staged schedule follows below)
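Here is a rough sketch of what such a staged curriculum might look like. All the stage boundaries and filters are hypothetical placeholders to make the idea concrete; they aren't taken from any published recipe.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Stage:
        max_seq_len: int              # context length grows with each stage
        keep: Callable[[str], bool]   # curriculum filter on raw text samples

    # Hypothetical stages: short, simple text first, full documents last.
    STAGES = [
        # Stage 1: short, simple statements (grammar, "a bear is an animal", arithmetic)
        Stage(max_seq_len=256, keep=lambda s: len(s.split()) <= 32),
        # Stage 2: medium-length passages with more structure
        Stage(max_seq_len=1024, keep=lambda s: len(s.split()) <= 256),
        # Stage 3: full documents at the target context length
        Stage(max_seq_len=4096, keep=lambda s: True),
    ]

    def samples_for_stage(corpus: list[str], stage: Stage) -> list[str]:
        """Select the subset of the corpus that fits the current stage."""
        return [s for s in corpus if stage.keep(s)]

The appeal of tying context length to sample size is that the schedule falls out of the data itself: as longer documents become eligible, the effective context grows without a separate hand-tuned schedule.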




