01-04-2025 |
Scale AI’s SEAL Leaderboards Rank o1-pro and DeepSeek V3 in AI Performance |
AI Tool Benchmarking |
Scale AI’s latest SEAL Leaderboards place o1-pro at the top for puzzle-solving and multi-turn challenges, while DeepSeek V3 ranks 8th in text-only tests. The rankings highlight strengths in cybersecurity and data analysis for these advanced AI models. Experts at Scale AI provide trusted evaluations to show how these tools perform in real-world tasks. Check out the full rankings to see where your favorite AI stands! |
|
26-03-2025 |
Scale AI Joins AWS Marketplace for U.S. Intelligence |
Collaboration |
Scale AI’s Scale GenAI Platform and Scale Donovan are now available on the AWS Marketplace via the Intelligence Community Marketplace (ICMP). This catalog helps U.S. national security customers discover, test, and buy software running on AWS. The tools are also directly accessible on the AWS Marketplace, simplifying purchases for government use. |
|
18-03-2025 |
Scale AI Proposes Bold Steps for U.S. Leadership in Artificial Intelligence |
Company News |
Scale AI has shared a plan with the White House to keep the U.S. ahead in artificial intelligence, focusing on protection, promotion, adoption, and innovation. The proposal highlights stronger export controls, better tech sharing with allies, increased government use of AI, and support for a skilled workforce. It aims to boost economic growth and national security while keeping America competitive globally. |
|
12-03-2025 |
Behind the Scenes of Humanity’s Last Exam: AI Evaluation Insights Unveiled |
Podcast |
Scale AI’s latest fireside chat features Dan Hendrycks from CAIS and Summer Yue from Scale AI, diving into Humanity’s Last Exam insights. This exclusive discussion explores top AI model performance, revealing cutting-edge findings on advanced AI evaluation techniques. Discover what’s next for AI benchmark testing and how it shapes the future of expert-level AI systems in this must-see behind-the-scenes look. |
|
09-03-2025 |
Scale AI and AI Risks Unveil MASK: Testing AI Honesty Under Pressure with 1,000+ Scenarios |
AI Tool Benchmarking |
Scale AI and AI Risks have launched MASK, a groundbreaking benchmark featuring over 1,000 real-world scenarios to evaluate AI honesty under pressure. This initiative aims to assess whether advanced models can resist deception when pushed, addressing a critical aspect of AI alignment challenges. Soon, SEAL rankings based on a private dataset will provide deeper insights into model performance, advancing efforts in trustworthy AI development. Explore how this impacts the future of ethical AI systems and reliability in high-stakes situations. |
|
07-03-2025 |
TIME and Scale AI Launch Interactive Generative AI Experience for Person of the Year |
Case Studies |
TIME teams up with Scale AI to redefine media with TIME AI, the first generative AI journalism tool for the Person of the Year feature. This innovative solution offers multimodal AI content engagement, including text, audio summaries, translations, and conversational chat, all built with custom guardrails for safety and trust. Delivered in just two months, this partnership enhances accessibility and positions TIME as a leader in AI-powered media transformation, captivating audiences worldwide. |
|
05-03-2025 |
Scale Wins Prime DIU Contract for Thunderforge AI Military Planning Program |
AI Innovation Update |
Scale has secured a prime contract from the Defense Innovation Unit (DIU) for Thunderforge, the DoD’s flagship AI initiative enhancing military planning and wargaming. This multimillion-dollar deal leverages cutting-edge artificial intelligence to transform U.S. defense strategies. Backed by Scale’s proven expertise, Thunderforge aims to deliver advanced decision-making tools for the Joint Force. Click to learn how this AI breakthrough is reshaping modern warfare! |
|
01-03-2025 |
Claude 3.7 Sonnet Thinking Takes #1 on Scale AI’s VISTA Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Visual-Language Understanding (VISTA) benchmark, a rigorous test of multimodal AI, awards Anthropic’s Claude 3.7 Sonnet Thinking (February 2025) the #1 position this week. With a score of 48.23% (±0.62), it leads in integrating perception skills like OCR and object recognition with reasoning across 758 tasks. VISTA, a Scale AI innovation, challenges models with rubric-based assessments, and Claude 3.7 Sonnet Thinking’s top ranking this week showcases its superior visual reasoning prowess. |
|
01-03-2025 |
GPT-4o Reigns at #1 on Scale AI’s Chat Tool Use Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Agentic Tool Use (Chat) leaderboard, testing conversational AI’s tool integration with Google Search and Python Interpreter, names OpenAI’s GPT-4o (August 2024) its #1 model this week. With a score of 56.85% (+6.92/-6.92), GPT-4o masters dependent tool calls across 198 examples, redefining chat-based utility. Scale AI’s ToolComp-Chat benchmark mirrors real chatbot scenarios, and GPT-4o’s top spot this week showcases its supremacy in information retrieval and processing. |
|
01-03-2025 |
o1-preview Secures #1 on Scale AI’s Enterprise Tool Use Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Agentic Tool Use (Enterprise) leaderboard, assessing AI’s ability to chain multiple tools in enterprise settings, crowns OpenAI’s o1-preview as #1 this week. With a score of 66.43% (+5.47/-5.47), o1-preview excels in composing 11 tools across 287 complex tasks. Scale AI’s ToolComp-Enterprise benchmark tests practical, real-world utility, and o1-preview’s top ranking this week positions it as the leading model for enterprise-grade tool use. |
|
01-03-2025 |
o1 Claims #1 on Scale AI’s Multichallenge Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s MultiChallenge, a pioneering benchmark evaluating multi-turn conversational AI, names OpenAI’s o1 (December 2024) its #1 model this week. With a score of 44.93% (+3.29/-3.29), o1 excels in instruction retention, inference memory, versioned editing, and self-coherence. MultiChallenge reflects Scale AI’s mission to test real-world conversational capabilities, and o1’s top spot this week solidifies its leadership in navigating complex, human-like interactions. |
|
01-03-2025 |
o1 Takes #1 on Scale AI’s EnigmaEval Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s EnigmaEval, a cutting-edge benchmark of 1,184 complex puzzles from global hunt communities, names OpenAI’s o1 (December 2024) its #1 performer this week. With an accuracy of 5.65% (±0.46), o1 outshines rivals in creative, multi-step reasoning across diverse domains. Built on private datasets to ensure integrity, EnigmaEval showcases Scale AI’s commitment to exposing AI limits, and o1’s top spot this week marks it as the leader in unstructured problem-solving. |
|
01-03-2025 |
Claude 3.7 Sonnet Claims #1 on Scale AI’s Humanity’s Last Exam |
AI Tool Benchmarking |
Humanity’s Last Exam (HLE), a benchmark engineered to test AI at the pinnacle of human knowledge, has crowned Claude 3.7 Sonnet from Anthropic as its #1 performer. Tackling 3,000 expert-crafted, “Google-proof” questions across math, humanities, and science, Claude 3.7 Sonnet outshone all contenders with unparalleled reasoning prowess. Hosted at lastexam.ai, HLE exemplifies Scale AI’s commitment to pushing AI evaluation boundaries, and Claude’s top ranking sets a new standard for frontier models aiming to rival human expertise. |
|
27-02-2025 |
Scale AI and CSIS Launch Critical Foreign Policy Decision Benchmark for LLM Assessment |
AI Tool Benchmarking |
Scale AI, in partnership with CSIS, has introduced the Critical Foreign Policy Decision (CFPD) Benchmark, a groundbreaking tool designed to evaluate large language models (LLMs) based on their national security and foreign policy decision-making capabilities. Announced today, the CFPD Benchmark aims to enhance the understanding of LLMs’ tendencies in critical decision-making scenarios, pushing the boundaries of artificial intelligence applications in global strategy. This collaboration marks a significant step forward in assessing AI’s potential to influence high-stakes policy environments. |
|
26-02-2025 |
Claude 3.7 Sonnet Hits SEAL LLM Leaderboards with Humanity’s Last Exam |
Company News |
Claude 3.7 Sonnet debuts on Scale AI’s SEAL LLM Leaderboards, starting with the Humanity’s Last Exam rankings. Explore the latest AI model performance updates and stay ahead. Check it out now! |
|
24-02-2025 |
MCIT & Scale AI Partner for Qatar’s AI-Driven Digital Transformation |
Company News |
MCIT teams up with Scale AI to boost Qatar’s digital future, leveraging AI solutions like predictive modeling and automation for smarter government services. This strategic partnership aligns with Digital Agenda 2030, enhancing sectors like healthcare and education with cutting-edge artificial intelligence. |
|
19-02-2025 |
Scale AI’s Artificial Intelligence Powers Autonomous Security with DIU and Air Force |
Company News |
Scale AI’s advanced AI software and machine learning enhance national security through a new partnership with the DIU and U.S. Air Force, deploying cloud-based computer vision for autonomous maritime threat detection. Detailed in a February 20, 2025 blog post, this data analytics breakthrough reduces manpower needs while outperforming open-source models. |
|
13-02-2025 |
Scale AI Unveils SEAL Leaderboard with EnigmaEval Benchmark for Puzzle-Solving AI Models |
Service |
Scale AI launches EnigmaEval, a challenging benchmark designed to assess AI models' puzzle-solving skills. With 1,184 puzzles, EnigmaEval reveals the limitations of current models. Check out the leaderboard now! |
|
10-02-2025 |
Scale AI Introduces J2 Method for Vulnerability Testing with 94% Success Rate |
Service |
Scale AI unveils a new approach, J2 (Jailbreaking-to-Jailbreak), which achieves up to a 94% attack success rate in vulnerability testing. Read the groundbreaking paper on this innovative safety testing method. |
|
10-02-2025 |
US AI Safety Institute Partners with Scale AI for Model Testing |
Company News |
The U.S. AI Safety Institute selected Scale AI as its first third-party evaluator to assess AI models, broadening access to voluntary evaluations. This collaboration aims to streamline model testing and enhance public-private cooperation in AI safety. |
|
04-02-2025 |
Canada at VivaTech 2025: Scale AI Opens Call for Canadian Delegation Applications |
Company News |
Scale AI invites Canadian tech companies to join the official delegation for VivaTech 2025 in Paris. Showcase your innovation, connect with global leaders, and expand international market access. Apply by February 24 to be part of this prestigious mission. |
|