I'd like to point out that on real-world tasks (not benchmarks), Kimi K2 outperforms Gemini. This is telemetry across all @cline users, showing diff edit failure rate. Notice how Kimi has about a 6% failure rate, which is significantly better than Gemini's ~10% failure rate. Remarkably, Kimi even surpassed Claude 4 for most of this week, achieving a sub-4% failure rate!
Paul Gauthier, 18 July at 19:09
Kimi K2 scored 59% on the aider polyglot coding benchmark. Full leaderboard:
In our internal "Hard" diff editing benchmark for cases where a frontier model previously failed a diff edit (prior to our diff algorithm updates), Kimi surpassed Claude 3.5. Will be interesting to see the results from our "Nightmare Difficulty" benchmarks in the next few weeks.