Tag

AI Benchmarks

All articles tagged with #ai benchmarks

Claude Opus 4.7 Debuts as Public-Ready AI with Stronger Coding and Safer Outputs
technology, 1 month ago

Anthropic released Claude Opus 4.7, its most capable publicly available Opus model, highlighting improved coding, visual intelligence, and document analysis. It uses more tokens while keeping the same price as Opus 4.6, and is available via Claude AI, the Claude API, and Microsoft Foundry. Opus 4.7 outperforms many frontier models on several benchmarks, though Claude Mythos remains ahead. According to Anthropic’s model card, safety metrics also show fewer hallucinations and misalignment issues than Opus 4.6.

Claude Opus 4.7 sharpens its edge, but Mythos still leads
technology, 1 month ago

Anthropic released Claude Opus 4.7, a meaningful upgrade with better coding, sharper vision, and a new ability to double-check its own work. The company acknowledges that the model still trails Mythos, which has not been released to the public. In benchmarks, Opus 4.7 outperforms Opus 4.6 and rivals like ChatGPT 5.4 and Google Gemini 3.1 Pro on several tasks, but Mythos Preview remains ahead for now. The update also adds an xhigh (extra high) effort mode for Claude Code, introduces Task Budgets to control long-task reasoning, and raises guardrails as Anthropic tests safeguards ahead of a broader Mythos release.

AI’s 2,500-Question Gauntlet Tests the Real Limits of Machine Intelligence
technology, 2 months ago

Researchers unveiled Humanity’s Last Exam (HLE), a 2,500-question global benchmark spanning math, the humanities, science, and niche disciplines, built to probe AI's true limits beyond older tests. Early models scored very low, and even recent top systems reach only roughly 40–50%, a reminder that high scores on human benchmarks don’t guarantee genuine understanding. Designed as a long-term, transparent gauge, HLE helps policymakers and developers assess capabilities and risks while keeping most questions hidden to prevent memorization. The project brings together international experts, among them Texas A&M’s Dr. Tung Nguyen, and is described in a Nature paper, with details at lastexam.ai.

OpenAI Enhances Codex with GPT-5 Upgrade
technology, 8 months ago

OpenAI has released GPT-5-Codex, an upgraded version of its AI coding agent. The model features dynamic thinking, which lets it spend varying amounts of time on a task, and it improves performance on coding benchmarks and code reviews. It is now available to various ChatGPT users and will be accessible via API in the future, as OpenAI aims to strengthen its position in the crowded AI coding market.

Google Launches Gemini Deep Think AI for Advanced Parallel Reasoning
technology, 9 months ago

Google DeepMind has launched Gemini 2.5 Deep Think, its most advanced multi-agent AI reasoning model. The model explores multiple ideas simultaneously to improve answer quality, outperforms other models on various benchmarks, and integrates tools like code execution and search. It is available to subscribers and is aimed at research and problem-solving, with plans for broader testing.