MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
Large language models (LLMs), artificial intelligence (AI) systems that can process human language and generate text in ...
Researchers debut "Humanity’s Last Exam," a benchmark of 2,500 expert-level questions that current AI models are failing.
OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model’s introduction caps off a 12-day product announcement series that started with the launch of a new ...
Gemini 3.1 Pro is Google's latest LLM, promising to handle more complex forms of work.
Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for ...
Simbian today announced the “AI SOC LLM Leaderboard,” a comprehensive benchmark to measure LLM performance in Security Operations Centers (SOCs). The new benchmark compares LLMs across a diverse range ...
Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation metric. However, these benchmarks often test for general ...
Sarvam AI launches two advanced LLMs, 30B and 105B, outperforming competitors in key benchmarks with a focus on Indian language support.