enricoros · 2026-01-27T22:01:50

CCP-bench has gotten WAY better on K2.5!

https://big-agi.com/static/kimi-k2.5-less-censored.jpg

enricoros · 2026-01-27T20:33:37

Same ask, same session, side by side, no system prompt. K2 Preview refuses, K2.5 gives factual history.

Frankly, it's surprising and welcome to see the handling of CCP-sensitive topics loosen between model versions.

enricoros · on April 2, 2024

One person on Discord has called this 'taking the idea of self-consistency forward to ensemble model usage'. I guess this is, technically, what this approach is about :)
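For anyone curious what "self-consistency across an ensemble" could look like in its simplest form, here is a rough sketch: ask several models the same question and keep the answer most of them agree on. This is only an illustration of the idea, not how the merge in the app actually works; the model names and replies below are made up, and `majority_answer` is a hypothetical helper.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common normalized answer and its share of the votes."""
    # Normalize lightly so "Paris" and "paris " count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        raise ValueError("no answers to vote on")
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical replies from four different models to the same question.
replies = {
    "model_a": "Paris",
    "model_b": "paris ",
    "model_c": "Lyon",
    "model_d": "Paris",
}

answer, share = majority_answer(replies.values())
print(answer, share)
```

Exact-match voting only works for short, closed-form answers; for free-form text you would need some other way to compare or merge the replies.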

enricoros · on April 2, 2024

Thank you so much - there's much more and much better coming ;)

enricoros · on April 2, 2024

Yes, the only issue is token usage, which is obviously higher since we are sampling more of the solution space. But it's a reasonable trade-off to get GPT-4.5 level intelligence out of GPT-4.

keithc24 · on April 2, 2024

Probably an even bigger jump, since the models each have some unique training data and they fact-check each other toward a more common “truth”, so hallucinations get weeded out.

enricoros · on April 1, 2024

Same experience. Once you beam you look for it everywhere!

enricoros · on April 1, 2024

Same. I like using Opus | GPT-4 | Gemini Pro (I don't have Ultra) | Mistral Large.

keithc24 · on April 2, 2024

Interestingly, Mistral Large “wins” sometimes, and at least provides unique results in comparison.

enricoros · on April 1, 2024

There's a combo box on the right side, and when you click on the "Add Merge" (green) button, the currently active model will be selected.

fredliu · on April 1, 2024

got it!

enricoros · on April 4, 2023

TL;DR & DIY: I asked GPT-4 this prompt just now: "Cluster the top10 categories of complaints by the users, and describe each category with a few adjectives/nouns in order or importance."

Crisp or too critical?

1. Documentation: lacking, inadequate, outdated

2. Code quality: simple, awkward, suboptimal

3. Production readiness: experimental, unreliable, limited

4. Monetization: unclear, risky, potentially detrimental to open-source

5. Community support: misinformation, poor communication, fragmented

6. Ecosystem: competing alternatives, redundancy, unclear positioning

7. Business model: potential rug-pull, VC-funded, uncertain sustainability

8. Developer experience: poor ergonomics, type erasure, confusing

9. Performance: slow, afterthought, poor observability

10. Maintenance: unpatched bugs, slow response to issues, dependency on contributors

enricoros · on March 24, 2023

Very interesting to follow the chain on the console. Very good at breaking down multi-part questions, way better than Google Assistant, and then it uses G to search. Thx for showing the way.