BullshitBench tests whether AI models can detect nonsensical questions or whether they will confidently answer them anyway. The ...
The last time we did comparative tests of AI models from OpenAI and Google at Ars was in late 2023, when Google’s offering was still called Bard. In the roughly two years since, a lot has happened in ...
Google, OpenAI, DeepSeek, et al. are nowhere near achieving AGI (Artificial General Intelligence), according to a new benchmark. The Arc Prize Foundation, a nonprofit that measures AGI progress, has a ...
Google’s new Gemini 3 has become the first major AI model to get a perfect score on a new self-harm safety benchmark, the CARE test. That milestone comes as hundreds of millions of people have come to ...
In a new case study, Hugging Face researchers have demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a Llama 3 model with 3B ...
Claude Code Skills 2.0 adds evals plus benchmark test sets; changes target skill reliability as models update over time.
New ORCA results show Gemini leading in practical math, but no AI matches the consistency of a simple calculator.
Google is following the consumer launch of 2.0 Flash with new preview models that will be available to test in the Gemini app: 2.0 Pro Experimental and 2.0 Flash Thinking Experimental. In December, ...
OpenAI announced Tuesday the launch of two open-weight AI reasoning models with similar capabilities to its o-series. Both are freely available to download from the online developer platform Hugging ...