Hacker News | enricoros's comments

CCP-bench has gotten WAY better on K2.5!

https://big-agi.com/static/kimi-k2.5-less-censored.jpg


Same ask, same session, side by side, no system prompt. K2 Preview refuses, K2.5 gives factual history.

Frankly surprising, and welcome, to see CCP-sensitive topics loosened between model versions.


One person on Discord has called this 'taking the idea of self-consistency forward to ensemble model usage'. I guess this is, technically, what this approach is about :)


Thank you so much - there's much more and much better coming ;)


Yes, the only issue is token usage, which is obviously greater as we are sampling more of the solution space. But it's a compromise to get GPT-4.5 level intelligence with GPT-4.


Probably an even higher jump, as the models each have some amount of unique training data and they fact-check each other toward a more common “truth”, so hallucinations are weeded out.


Same experience. Once you beam you look for it everywhere!


Same. I like using Opus | GPT-4 | Gemini Pro (I don't have Ultra) | Mistral Large.


Interestingly, Mistral Large “wins” sometimes, and at least provides unique results in comparison.


There's a combo box on the right side, and when you click on the "Add Merge" (green) button, the currently active model will be selected.


got it!


TL;DR & DIY: asked gpt-4 this prompt "Cluster the top10 categories of complaints by the users, and describe each category with a few adjectives/nouns in order or importance." as of rn.

Crisp or too critical?

1. Documentation: lacking, inadequate, outdated

2. Code quality: simple, awkward, suboptimal

3. Production readiness: experimental, unreliable, limited

4. Monetization: unclear, risky, potentially detrimental to open-source

5. Community support: misinformation, poor communication, fragmented

6. Ecosystem: competing alternatives, redundancy, unclear positioning

7. Business model: potential rug-pull, VC-funded, uncertain sustainability

8. Developer experience: poor ergonomics, type erasure, confusing

9. Performance: slow, afterthought, poor observability

10. Maintenance: unpatched bugs, slow response to issues, dependency on contributors


Very interesting to follow the chain on the console. Very good at breaking down multi-part questions, way better than Google Assistant - and then it uses G to search. Thx for showing the way.

