01-04-2025 |
Scale AI’s SEAL Leaderboards Rank o1-pro and DeepSeek V3 in AI Performance |
AI Tool Benchmarking |
Scale AI’s latest SEAL Leaderboards place o1-pro at the top for puzzle-solving and multi-turn challenges, while DeepSeek V3 ranks 8th in text-only tests. The rankings highlight strengths in cybersecurity and data analysis for these advanced AI models. Experts at Scale AI provide trusted evaluations to show how these tools perform in real-world tasks. Check out the full rankings to see where your favorite AI stands! |
|
26-03-2025 |
Scale AI Joins AWS Marketplace for U.S. Intelligence |
Collaboration |
Scale AI’s Scale GenAI Platform and Scale Donovan are now available on the AWS Marketplace via the Intelligence Community Marketplace (ICMP). This catalog helps U.S. national security customers discover, test, and buy software running on AWS. The tools are also directly accessible on the AWS Marketplace, simplifying purchases for government use. |
|
18-03-2025 |
Scale AI Proposes Bold Steps for U.S. Leadership in Artificial Intelligence |
Company News |
Scale AI has shared a plan with the White House to keep the U.S. ahead in artificial intelligence, focusing on protection, promotion, adoption, and innovation. The proposal highlights stronger export controls, better tech sharing with allies, increased government use of AI, and support for a skilled workforce. It aims to boost economic growth and national security while keeping America competitive globally. |
|
12-03-2025 |
Behind the Scenes of Humanity’s Last Exam: AI Evaluation Insights Unveiled |
Podcast |
Scale AI’s latest fireside chat features Dan Hendrycks from CAIS and Summer Yue from Scale AI, diving into Humanity’s Last Exam insights. This exclusive discussion explores top AI model performance, revealing cutting-edge findings on advanced AI evaluation techniques. Discover what’s next for AI benchmark testing and how it shapes the future of expert-level AI systems in this must-see behind-the-scenes look. |
|
09-03-2025 |
Scale AI and AI Risks Unveil MASK: Testing AI Honesty Under Pressure with 1,000+ Scenarios |
AI Tool Benchmarking |
Scale AI and AI Risks have launched MASK, a groundbreaking benchmark featuring over 1,000 real-world scenarios to evaluate AI honesty under pressure. This initiative aims to assess whether advanced models can resist deception when pushed, addressing a critical aspect of AI alignment challenges. Soon, SEAL rankings based on a private dataset will provide deeper insights into model performance, advancing efforts in trustworthy AI development. Explore how this impacts the future of ethical AI systems and reliability in high-stakes situations. |
|
07-03-2025 |
TIME and Scale AI Launch Interactive Generative AI Experience for Person of the Year |
Case Studies |
TIME teams up with Scale AI to redefine media with TIME AI, the first generative AI journalism tool for the Person of the Year feature. This innovative solution offers multimodal AI content engagement, including text, audio summaries, translations, and conversational chat, all built with custom guardrails for safety and trust. Delivered in just two months, this partnership enhances accessibility and positions TIME as a leader in AI-powered media transformation, captivating audiences worldwide. |
|
05-03-2025 |
Scale Wins Prime DIU Contract for Thunderforge AI Military Planning Program |
AI Innovation Update |
Scale has secured a prime contract from the Defense Innovation Unit (DIU) for Thunderforge, the DoD’s flagship AI initiative enhancing military planning and wargaming. This multimillion-dollar deal leverages cutting-edge artificial intelligence to transform U.S. defense strategies. Backed by Scale’s proven expertise, Thunderforge aims to deliver advanced decision-making tools for the Joint Force. Click to learn how this AI breakthrough is reshaping modern warfare! |
|
01-03-2025 |
Claude 3.7 Sonnet Thinking Takes #1 on Scale AI’s VISTA Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Visual-Language Understanding (VISTA) benchmark, a rigorous test of multimodal AI, awards Anthropic’s Claude 3.7 Sonnet Thinking (February 2025) the #1 position this week. With a score of 48.23% (±0.62), it leads in integrating perception skills like OCR and object recognition with reasoning across 758 tasks. VISTA, a Scale AI innovation, challenges models with rubric-based assessments, and Claude 3.7 Sonnet Thinking’s top ranking this week showcases its superior visual reasoning prowess. |
|
01-03-2025 |
GPT-4o Reigns at #1 on Scale AI’s Chat Tool Use Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Agentic Tool Use (Chat) leaderboard, testing conversational AI’s tool integration with Google Search and Python Interpreter, names OpenAI’s GPT-4o (August 2024) its #1 model this week. With a score of 56.85% (+6.92/-6.92), GPT-4o masters dependent tool calls across 198 examples, redefining chat-based utility. Scale AI’s ToolComp-Chat benchmark mirrors real chatbot scenarios, and GPT-4o’s top spot this week showcases its supremacy in information retrieval and processing. |
|
01-03-2025 |
o1-preview Secures #1 on Scale AI’s Enterprise Tool Use Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s Agentic Tool Use (Enterprise) leaderboard, assessing AI’s ability to chain multiple tools in enterprise settings, crowns OpenAI’s o1-preview as #1 this week. With a score of 66.43% (+5.47/-5.47), o1-preview excels in composing 11 tools across 287 complex tasks. Scale AI’s ToolComp-Enterprise benchmark tests practical, real-world utility, and o1-preview’s top ranking this week positions it as the leading model for enterprise-grade tool use. |
|
01-03-2025 |
o1 Claims #1 on Scale AI’s Multichallenge Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s MultiChallenge, a pioneering benchmark evaluating multi-turn conversational AI, names OpenAI’s o1 (December 2024) its #1 model this week. With a score of 44.93% (+3.29/-3.29), o1 excels in instruction retention, inference memory, versioned editing, and self-coherence. MultiChallenge reflects Scale AI’s mission to test real-world conversational capabilities, and o1’s top spot this week solidifies its leadership in navigating complex, human-like interactions. |
|
01-03-2025 |
o1 Takes #1 on Scale AI’s EnigmaEval Leaderboard This Week |
AI Tool Benchmarking |
Scale AI’s EnigmaEval, a cutting-edge benchmark of 1,184 complex puzzles from global hunt communities, names OpenAI’s o1 (December 2024) its #1 performer this week. With an accuracy of 5.65% (±0.46), o1 outshines rivals in creative, multi-step reasoning across diverse domains. Built on private datasets to ensure integrity, EnigmaEval showcases Scale AI’s commitment to exposing AI limits, and o1’s top spot this week marks it as the leader in unstructured problem-solving. |
|
01-03-2025 |
Claude 3.7 Sonnet Claims #1 on Scale AI’s Humanity’s Last Exam |
AI Tool Benchmarking |
Humanity’s Last Exam (HLE), a benchmark engineered to test AI at the pinnacle of human knowledge, has crowned Claude 3.7 Sonnet from Anthropic as its #1 performer. Tackling 3,000 expert-crafted, “Google-proof” questions across math, humanities, and science, Claude 3.7 Sonnet outshone all contenders with unparalleled reasoning prowess. Hosted at lastexam.ai, HLE exemplifies Scale AI’s commitment to pushing AI evaluation boundaries, and Claude’s top ranking sets a new standard for frontier models aiming to rival human expertise. |
|
27-02-2025 |
Scale AI and CSIS Launch Critical Foreign Policy Decision Benchmark for LLM Assessment |
AI Tool Benchmarking |
Scale AI, in partnership with CSIS, has introduced the Critical Foreign Policy Decision (CFPD) Benchmark, a groundbreaking tool designed to evaluate large language models (LLMs) based on their national security and foreign policy decision-making capabilities. Announced today, the CFPD Benchmark aims to enhance the understanding of LLMs’ tendencies in critical decision-making scenarios, pushing the boundaries of artificial intelligence applications in global strategy. This collaboration marks a significant step forward in assessing AI’s potential to influence high-stakes policy environments. |
|
26-02-2025 |
Claude 3.7 Sonnet Hits SEAL LLM Leaderboards with Humanity’s Last Exam |
Company News |
Claude 3.7 Sonnet debuts on Scale AI’s SEAL LLM Leaderboards, starting with the Humanity’s Last Exam rankings. Explore the latest AI model performance updates and stay ahead. Check it out now! |
|
24-02-2025 |
MCIT & Scale AI Partner for Qatar’s AI-Driven Digital Transformation |
Company News |
MCIT teams up with Scale AI to boost Qatar’s digital future, leveraging AI solutions like predictive modeling and automation for smarter government services. This strategic partnership aligns with Digital Agenda 2030, enhancing sectors like healthcare and education with cutting-edge artificial intelligence. |
|
19-02-2025 |
Scale AI’s Artificial Intelligence Powers Autonomous Security with DIU and Air Force |
Company News |
Scale AI’s advanced AI software and machine learning enhance national security through a new partnership with the DIU and U.S. Air Force, deploying cloud-based computer vision for autonomous maritime threat detection. Detailed in a February 20, 2025 blog post, this data analytics breakthrough reduces manpower needs while outperforming open-source models. |
|
13-02-2025 |
Scale AI Unveils SEAL Leaderboard with EnigmaEval Benchmark for Puzzle-Solving AI Models |
Service |
Scale AI launches EnigmaEval, a challenging benchmark designed to assess AI models' puzzle-solving skills. With 1,184 puzzles, EnigmaEval reveals the limitations of current models. Check out the leaderboard now! |
|
10-02-2025 |
Scale AI Introduces J2 Method for Vulnerability Testing with 94% Success Rate |
Service |
Scale AI unveils a new approach, J2 (Jailbreaking-to-Jailbreak), which achieves up to a 94% attack success rate in vulnerability testing. Read the groundbreaking paper on this innovative safety testing method. |
|
10-02-2025 |
US AI Safety Institute Partners with Scale AI for Model Testing |
Company News |
The U.S. AI Safety Institute selected Scale AI as its first third-party evaluator to assess AI models, broadening access to voluntary evaluations. This collaboration aims to streamline model testing and enhance public-private cooperation in AI safety. |
|
04-02-2025 |
Canada at VivaTech 2025: Scale AI Opens Call for Canadian Delegation Applications |
Company News |
Scale AI invites Canadian tech companies to join the official delegation for VivaTech 2025 in Paris. Showcase your innovation, connect with global leaders, and expand international market access. Apply by February 24 to be part of this prestigious mission. |
|