Tag

AI Benchmarks

All articles tagged with #ai-benchmarks

technology • 1 month ago • 5 min saved

Claude Opus 4.7 Debuts as Public-Ready AI with Stronger Coding and Safer Outputs

Anthropic released Claude Opus 4.7, its most capable publicly available Opus model, highlighting improved coding, visual intelligence, and document analysis. It uses more tokens while keeping the same price as Opus 4.6. It’s available via Claude AI, the Claude API, and Microsoft Foundry. Opus 4.7 outperforms many frontier models on several benchmarks, but Claude Mythos remains ahead. Per Anthropic’s model card, safety metrics also show fewer hallucinations and misalignment issues than Opus 4.6.

via Mashable

#ai-benchmarks #ai-safety #anthropic

technology • 1 month ago • 2 min saved

Claude Opus 4.7 sharpens its edge, but Mythos still leads

Anthropic released Claude Opus 4.7, a meaningful upgrade with better coding, sharper vision, and a new ability to double-check its own work, while acknowledging that it still trails Mythos, which has not been released to the public. In benchmarks, Opus 4.7 outperforms Opus 4.6 and rivals like ChatGPT 5.4 and Google Gemini 3.1 Pro on several tasks, but Mythos Preview remains ahead for now. The update also adds an xhigh (extra high) effort mode for Claude Code, introduces Task Budgets to control long-task reasoning, and raises guardrails as Anthropic tests safeguards ahead of a broader Mythos release.

via Axios

#ai-benchmarks #anthropic #claude-opus-47

technology • 2 months ago • 8 min saved

AI’s 2,500-Question Gauntlet Tests the Real Limits of Machine Intelligence

Researchers unveiled Humanity’s Last Exam (HLE), a 2,500-question global benchmark spanning math, the humanities, science, and niche disciplines to probe AI's true limits beyond older tests. Early models scored very low, and even recent top systems reach only roughly 40–50%, highlighting that high scores on human benchmarks don’t guarantee genuine understanding. Designed as a long-term, transparent gauge, HLE helps policymakers and developers assess capabilities and risks while keeping most questions hidden to prevent memorization. The project involves international experts, among them Texas A&M’s Dr. Tung Nguyen, and is described in a Nature paper, with details at lastexam.ai.

via SciTechDaily

#ai-benchmarks #artificial-intelligence #humanitys-last-exam

technology • 5 months ago • 1 min saved

Google's Gemini 3 Flash Outperforms GPT-5.2 and Launches Globally

Google's Gemini 3 Flash, a more efficient version of its latest AI model, performs comparably to GPT-5.2 in benchmarks and surpasses it in some tests, a sign of advancing AI capabilities and growing competition with OpenAI.

via Engadget

#ai-benchmarks #ai-model-performance #google-ai-updates

technology • 8 months ago • 1 min saved

AMD Launches ROCm 7.0 with Major Performance Boosts to Challenge Nvidia

AMD Ryzen AI Max+ 'Strix Halo' SoCs successfully ran ROCm 7.0 on Ubuntu Linux despite not appearing on the supported GPU list, with benchmarks showing functional performance on AI tasks and graphics workloads.

via Phoronix

#ai-benchmarks #amd-ryzen-ai-max #gpu-support

technology • 8 months ago • 2 min saved

OpenAI Enhances Codex with GPT-5 Upgrade

OpenAI has released GPT-5-Codex, an upgraded version of its AI coding agent. Its dynamic thinking capability lets it spend varying amounts of time on a task, which improves performance on coding benchmarks and code reviews. The model is now available to various ChatGPT users and will be accessible via API in the future, as OpenAI seeks to stay competitive in the crowded AI coding market.

via TechCrunch

#ai-benchmarks #ai-coding #gpt-5-codex

technology • 9 months ago • 3 min saved

Google Launches Gemini Deep Think AI for Advanced Parallel Reasoning

Google DeepMind has launched Gemini 2.5 Deep Think, its most advanced multi-agent AI reasoning model. It explores multiple ideas simultaneously to improve answer quality, outperforms other models on various benchmarks, and integrates tools like code execution and search. The model is available to subscribers and aims to enhance research and problem-solving capabilities, with plans for broader testing.

via TechCrunch

#ai-benchmarks #ai-reasoning-model #google-gemini-deep-think

technology • 11 months ago • 1 min saved

Google unveils enhanced Gemini 2.5 Pro with improved coding and increased query limits

Google has released an upgraded preview of Gemini 2.5 Pro that enhances coding capabilities and overall performance, with improvements in creativity and response formatting. General availability is expected in a few weeks.

via 9to5Google

#ai-benchmarks #ai-model-update #gemini-25-pro