
I think the big reason why BERT and T5 have fallen out of favor is the lack of zero-shot (or few-shot) ability.

When you have hundreds or thousands of examples, BERT works great. But that is very restricting.



Yes, but you can use an LLM to label data and then train a BERT model, which costs a small fraction of the time and money to run compared with the original LLM.
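A rough sketch of that workflow, to make the idea concrete. Everything here is a stand-in: `llm_label` is a hypothetical keyword rule in place of a real LLM call, and the "student" is a tiny naive Bayes model rather than an actual BERT fine-tune, so the whole thing runs with only the standard library. The shape of the pipeline (label once with the expensive model, then train and serve the cheap one) is the only point.

```python
import math
from collections import Counter, defaultdict


def llm_label(text):
    # Hypothetical stand-in for an expensive LLM call (assumption, not a real API).
    positive_words = ("great", "love", "good")
    return "positive" if any(w in text.lower() for w in positive_words) else "negative"


def train_student(texts, labels):
    # Stand-in for fine-tuning a BERT classifier: naive Bayes over word counts.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        # Laplace smoothing so unseen (word, class) pairs do not zero out the score.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


unlabeled = [
    "I love this phone",
    "great battery and good screen",
    "terrible support and slow shipping",
    "the screen broke after a week",
]

silver_labels = [llm_label(t) for t in unlabeled]  # one-off cost: the "LLM" labels the data
student = train_student(unlabeled, silver_labels)  # cheap model that gets served afterwards
prediction = predict(student, "slow shipping")
```

In a real setup the student would be a fine-tuned encoder and the labels would come from prompting an actual LLM, ideally with some spot-checking of the silver labels before training.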


Shhh, don’t tell everybody the secret. ;-)


Lol isn't everyone doing it? That's how I bootstrapped my BERT fine-tunes.


I would say everybody smart is doing that, but a lot of the dumb money in AI right now is just wrappers on the GPT API. That makes for a flashy demo with no underlying substance or expertise.


Is the encoder style arch better for representing classification tasks at a given compute budget than a causal LM?

Is this because the final representation in BERT-style models is more globally focused, rather than being optimized for next-token prediction?


They are 100% better for classification at a given compute budget. For token classification, for example, they can account for information both before and after a token and use that information to classify it.
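A toy illustration of the before-and-after point, under a big simplifying assumption: attention is replaced by a uniform average over whatever positions a token is allowed to see (real models use learned weights). With a causal mask, changing a later token cannot affect an earlier token's representation; with a bidirectional mask it can.

```python
def mean_pool_visible(values, i, causal):
    # Uniform "attention" (assumption): token i averages every position it may see.
    # Causal: positions 0..i only. Bidirectional: the whole sequence.
    visible = values[: i + 1] if causal else values
    return sum(visible) / len(visible)


tokens_a = [1.0, 2.0, 3.0, 4.0]
tokens_b = [1.0, 2.0, 3.0, 40.0]  # only the last token differs

i = 1  # look at the representation of the second token

causal_a = mean_pool_visible(tokens_a, i, causal=True)
causal_b = mean_pool_visible(tokens_b, i, causal=True)
bidir_a = mean_pool_visible(tokens_a, i, causal=False)
bidir_b = mean_pool_visible(tokens_b, i, causal=False)

print("causal:", causal_a, causal_b)       # identical: later context is invisible
print("bidirectional:", bidir_a, bidir_b)  # differ: later context is used
```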


Yes, no zero-shot. Few-shot is possible for some use cases with SetFit: https://github.com/huggingface/setfit and the very recent FastFit: https://github.com/IBM/fastfit (https://arxiv.org/pdf/2404.12365)


They are there, you just have to look. Tasksource, NuNER, Flan, T0. There’s not a lot, but still at least a few good zero shot models in both architectures.


It's because you need to mess with embeddings or even train new heads on top of a network to use it. LLMs are just tokens in, tokens out: they don't classify with a softmax over classes, they softmax over vocabulary tokens. LLMs are more convenient.
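To spell out the difference, here is a minimal sketch with made-up logits (no real model involved; the class names, vocabulary, and numbers are all assumptions for illustration). A classification head produces one probability per class. An LM produces one probability per vocabulary token, and to classify with it you read off the probabilities of the label words and renormalize.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["negative", "positive"]

# Encoder plus classification head: softmax over the classes themselves.
head_logits = [0.2, 1.7]  # made-up numbers
class_probs = dict(zip(class_names, softmax(head_logits)))

# Causal LM: softmax over the whole vocabulary for the next token.
vocab = ["the", "negative", "positive", "banana", "."]  # toy vocabulary
lm_logits = [0.1, 0.4, 2.0, -1.0, 0.3]                  # made-up numbers
vocab_probs = dict(zip(vocab, softmax(lm_logits)))

# To classify with the LM, keep only the label tokens and renormalize.
label_mass = sum(vocab_probs[c] for c in class_names)
lm_class_probs = {c: vocab_probs[c] / label_mass for c in class_names}

print(class_probs)
print(lm_class_probs, "label tokens cover", round(label_mass, 3), "of the mass")
```

Some probability mass always leaks to non-label tokens in the LM case, which is part of why a dedicated head is the tidier fit for classification even though prompting is more convenient.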



