The frontier of AI benchmarks: Q&A with an expert

Posted on February 4, 2026

Engineering Management and Systems Engineering Chair and Professor Amaury Lendasse. Photo by Michael Pierce/Missouri S&T.

Jim Sterling, vice provost and founding dean of Kummer College. Photo by Michael Pierce/Missouri S&T.

In this Q&A session, Kummer College Dean Jim Sterling chats with Engineering Management and Systems Engineering Chair and Professor Amaury Lendasse, an expert in machine learning and AI.

Sterling: Why should AI research begin by defining the persona of the AI to be prompted?

Lendasse: Defining a persona is the foundational step because it constrains the response space the model will draw from. When you specify a role such as “Senior Quantum Physicist” or “Distinguished Mathematician,” you implicitly set expectations for tone, depth, vocabulary and the assumptions the model is allowed to make. 

From a benchmarking perspective, persona definition helps ensure the model is operating within the domain-appropriate cognitive frame required by the task. This reduces drift into generic, lowest-common-denominator answers and lowers the risk of confident but misaligned outputs that can occur when the model lacks a clear identity and objective. 

Importantly, a persona should be more than a job title: it should encode the relevant discipline, typical reasoning style and the boundaries of expertise so the model can interpret the prompt in the correct context.
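
As a rough illustration, here is a minimal Python sketch of what persona-constrained prompting can look like with a generic chat-style message format. The `call_model` function is a hypothetical placeholder rather than any particular provider's API, and the persona text is just one possible way to encode discipline, reasoning style and boundaries of expertise.

```python
# A minimal sketch of persona-constrained prompting, assuming a generic
# chat-style message format; `call_model` is a hypothetical placeholder,
# not a real client library.

def build_persona_prompt(question: str) -> list[dict]:
    """Encode discipline, reasoning style and boundaries -- not just a job title."""
    persona = (
        "You are a senior quantum physicist. Reason formally, state your "
        "assumptions explicitly, use standard notation, and say 'outside my "
        "expertise' rather than guessing beyond quantum and condensed-matter topics."
    )
    return [
        {"role": "system", "content": persona},  # sets tone, depth, vocabulary
        {"role": "user", "content": question},   # the actual task
    ]

messages = build_persona_prompt("Explain decoherence to a graduate student.")
# reply = call_model(messages)  # hypothetical: substitute your provider's client
```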

Sterling: What are the most reputable AI benchmarks?

Lendasse: The most reputable benchmarks tend to share three properties: they are widely adopted by the research community, publicly released with clear scoring protocols and designed to be difficult to manipulate. In practice, serious evaluations triangulate across multiple benchmark families, because no single test captures real capability.

  • General knowledge and reasoning: Massive Multitask Language Understanding, or MMLU, remains a well-known historical baseline, but many groups now prefer harder successors such as MMLU-Pro, which was explicitly designed to be more discriminative and reasoning-heavy.
  • Expert scientific reasoning: Graduate-Level Google-Proof Q&A, or GPQA, is widely used to probe high-level science question-and-answer tasks that are intended to resist shallow pattern matching.
  • Coding and software engineering: HumanEval is a classic code-generation benchmark, while Software Engineering Benchmark, or SWE-bench, is viewed as a more “realistic” test because it evaluates whether systems can resolve real GitHub issues via test-based verification. Increasingly, teams also use LiveCodeBench to better control for test-set contamination and track coding performance over time.
  • Multimodal understanding: Massive Multi-discipline Multimodal Understanding, or MMMU, is a flagship multimodal benchmark focused on college-level, multi-discipline reasoning grounded in both text and images; it is frequently cited as a reference point for multimodal progress.
  • Dialogue and preference: Chatbot Arena (LMSYS) is influential because it measures human preference at scale via head-to-head model comparisons and Elo-style ratings.
  • Agentic and tool use: General AI Assistant, or GAIA, targets general AI assistant capabilities that include tool use and web-browsing style behaviors. For web navigation specifically, WebArena and Mind2Web are widely referenced environments and datasets for evaluating agents operating on simulated or real websites. Broader agent evaluation suites like AgentBench aim to test agent behavior across multiple interactive environments.
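
To make "clear scoring protocols" concrete, here is a toy Python sketch of how a multiple-choice benchmark in the MMLU style is typically scored: exact-match accuracy over question-choices-answer items. The two items and the `ask_model` function are invented placeholders, not real benchmark data or a real model client.

```python
# Toy illustration of MMLU-style multiple-choice scoring: exact-match accuracy
# over (question, choices, answer) items. `ask_model` is a hypothetical stand-in
# for a real model call; the items below are invented examples, not MMLU data.

ITEMS = [
    {"question": "Which particle carries the electromagnetic force?",
     "choices": ["A) gluon", "B) photon", "C) neutrino", "D) W boson"],
     "answer": "B"},
    {"question": "What is the time complexity of binary search?",
     "choices": ["A) O(n)", "B) O(n log n)", "C) O(log n)", "D) O(1)"],
     "answer": "C"},
]

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical placeholder: return the model's chosen letter (A-D)."""
    return "B"  # dummy response so the sketch runs end to end

def accuracy(items) -> float:
    correct = sum(ask_model(it["question"], it["choices"]) == it["answer"]
                  for it in items)
    return correct / len(items)

print(f"accuracy = {accuracy(ITEMS):.2f}")  # 0.50 with the dummy responder
```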

Sterling: Can you explain why claims are being made that “math” is being solved by AI? What benchmark shows this?

Lendasse: Those claims are usually shorthand for a very specific observation: AI has moved from “calculator-style” word problems to near-ceiling performance on some standardized competition math tasks, especially when models can reason step-by-step or use tools. Historically, people pointed to GSM8K, which measures multi-step grade-school math word problems. It’s valuable, but it’s no longer a strong separator because modern systems can do very well on it.

The newer evidence behind the “math is solved” narrative comes from competition-style benchmarks like the American Invitational Mathematics Examination, or AIME, where recent reasoning-focused models from OpenAI report very high scores, particularly when allowed to use a Python interpreter for verification and computation. But if we define “solving math” as research-level mathematical reasoning, the key benchmark is FrontierMath, developed by Epoch AI with expert mathematicians. It uses new, unpublished problems across modern math areas and is designed to reduce contamination. In the original FrontierMath report, state-of-the-art systems solved under 2% of problems, explicitly showing that “math” is not solved in the research sense.
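
To illustrate what "using a Python interpreter for verification" means in practice, here is a toy competition-style counting question, invented for this example and not an actual AIME problem, where a closed-form answer is double-checked by brute force.

```python
# Toy example of "tool use for verification": a competition-style counting
# question (invented here, not an actual AIME problem) is answered with a
# closed-form argument and then double-checked by brute force.

# Question: how many positive integers n <= 1000 are divisible by 3 or 5?

# Closed form via inclusion-exclusion: floor(1000/3) + floor(1000/5) - floor(1000/15)
closed_form = 1000 // 3 + 1000 // 5 - 1000 // 15   # 333 + 200 - 66 = 467

# Brute-force verification, the kind of check a reasoning model can run in an
# interpreter before committing to its final answer.
brute_force = sum(1 for n in range(1, 1001) if n % 3 == 0 or n % 5 == 0)

assert closed_form == brute_force == 467
print(closed_form)
```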

Sterling: How are investment markets being best-predicted and returns on investment being optimized by AI?

Lendasse: AI is impacting investment in two main ways: signal discovery and decision optimization. On the signal side, modern models ingest enormous streams of structured and unstructured data: macroeconomic indicators, earnings and guidance, rates and inflation expectations, news flows and alternative data (shipping activity, web traffic, satellite-derived proxies), alongside real-time sentiment extracted from filings and media via natural language processing. The advantage is not “magic prediction,” but the ability to detect weak, non-linear relationships and time-varying regimes that are hard to model with traditional linear factors.

On the optimization side, AI is used to translate those signals into actions: portfolio construction (balancing expected return against risk, drawdown and constraints), dynamic hedging and execution (minimizing slippage and market impact). In more agentic settings, systems run scenario simulations — “If X happens, what does the distribution of outcomes look like?” — and continuously update exposures as new information arrives. This is especially visible in systematic trading and algorithmic execution, where decisions are made at high frequency and disciplined risk controls are essential.
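
As a heavily simplified illustration of one optimization step, here is a textbook minimum-variance weighting computed in closed form. The covariance numbers are invented for illustration; real systems layer on return forecasts, constraints, transaction costs and far more careful estimation.

```python
# A textbook sketch of one portfolio-construction step: closed-form
# minimum-variance weights w = (Sigma^-1 1) / (1^T Sigma^-1 1). The covariance
# matrix below is invented for illustration only.
import numpy as np

sigma = np.array([            # annualized covariance of three hypothetical assets
    [0.040, 0.006, 0.002],
    [0.006, 0.025, 0.004],
    [0.002, 0.004, 0.010],
])

ones = np.ones(sigma.shape[0])
inv_sigma_ones = np.linalg.solve(sigma, ones)      # Sigma^-1 * 1
weights = inv_sigma_ones / inv_sigma_ones.sum()    # normalize to sum to 1

portfolio_var = weights @ sigma @ weights
print("weights:", np.round(weights, 3))
print("portfolio volatility:", round(portfolio_var ** 0.5, 4))
```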

And it’s a bit funny for me to answer this, because forecasting financial time series was the core topic of my Ph.D. I’ve seen firsthand both sides of the story: the genuine power of data-driven models, and the hard reality that markets are nonstationary, adaptive and extremely unforgiving to overfitting.

Sterling: Besides benchmarking, what other criteria show the progress in training of AI?

Lendasse: Benchmarks are useful, but they’re not the only way the field tracks progress. More broadly, people look at predictable scaling behavior, efficiency and generalization. I’ll add a small caveat upfront: this is somewhat outside my core research lane, so I’m describing the mainstream criteria I see used across the community rather than claiming deep specialization in every one of them.

  • First, scaling laws capture the empirical regularity that performance improves in a fairly predictable way as you increase compute, data and model capacity (a curve-fitting sketch follows this list). Even when individual benchmarks saturate, scaling curves can show whether improvements are coming from better training recipes, better data or simply more computation.
  • Second, compute efficiency matters as much as raw capability: how much performance you get per unit of training compute, and how much you can do per unit of inference compute. That includes latency, throughput, memory footprint, and energy — because a model that is “slightly better” but 10 times more expensive may not represent practical progress.
  • Third, the community increasingly tracks inference-time computation — giving a model more time or steps to reason before committing to an answer. This is part of the shift toward reasoning-oriented systems, where “thinking longer” can materially improve correctness on hard tasks.
  • Finally, a key signal is generalization, especially zero-shot (or minimal-shot) behavior: can the model solve a new task it wasn’t explicitly trained on, using only a natural-language description? Strong zero-shot generalization suggests the model has learned reusable abstractions rather than simply memorizing patterns tied to a narrow dataset.
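
To make the scaling-law point concrete, here is a minimal sketch that assumes loss follows a power law in compute, L(C) = a·C^(−b), and recovers the exponent with a straight-line fit in log-log space. The data points are synthetic, invented purely for illustration and not measurements from any real training run.

```python
# A minimal sketch of a scaling-law fit: assume loss follows L(C) = a * C**(-b)
# and recover the exponent b by linear regression in log-log space. The
# (compute, loss) points below are synthetic, invented purely for illustration.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs (synthetic)
loss    = np.array([3.10, 2.60, 2.18, 1.83, 1.54])   # eval loss (synthetic)

# log L = log a - b * log C, so a straight-line fit gives the scaling exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

print(f"fitted exponent b = {b:.3f}")                 # how fast loss falls with compute
print(f"predicted loss at 1e23 FLOPs = {a * 1e23 ** (-b):.2f}")
```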

So, taken together, these criteria aim to measure not just “how high the score is,” but how predictably capability scales, how efficiently it’s delivered and how robustly it transfers to new tasks.

Sterling: What is explainable AI? What recent developments show that its progress is rapid and substantial?

Lendasse: Explainable AI is the discipline, and increasingly the engineering practice, of making an AI system auditable: you can trace what it did, why it did it, what signals drove the decision and whether it stayed within policy. In other words, explainability is moving from “nice-to-have explanations” to operational accountability, especially as we deploy agentic systems that take actions, not just generate text.

That illustrates why progress in explainable artificial intelligence is rapid right now: we’re seeing a shift from post-hoc, after-the-fact explanations to continuous monitoring, logging and policy enforcement, the kind of infrastructure you can actually deploy in enterprises and regulated workflows. This acceleration is also being pushed by governance expectations: National Institute of Standards and Technology, or NIST, explicitly frames transparency, explainability and interpretability as core trustworthiness characteristics, and the European Union’s AI Act emphasizes transparency obligations for high-risk systems.
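
As a minimal sketch of what decision-level traceability can look like in code, here is a wrapper that logs each model call with its inputs, output, a policy-check result and a timestamp to an append-only file. The `call_model` function and the policy rule are hypothetical placeholders, not any specific vendor's tooling.

```python
# A minimal sketch of decision-level traceability: every model call is logged
# with its inputs, output, a policy-check result and a timestamp, so the
# decision can be audited later. `call_model` and the policy rule are
# hypothetical placeholders.
import json, time, uuid

AUDIT_LOG = "decisions.jsonl"

def violates_policy(text: str) -> bool:
    """Hypothetical policy check; real deployments use richer classifiers and rules."""
    return "account number" in text.lower()

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for the underlying model."""
    return f"(model answer to: {prompt})"

def traced_call(prompt: str, user_id: str) -> str:
    output = call_model(prompt)
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "prompt": prompt,
        "output": output,
        "policy_flag": violates_policy(output),
    }
    with open(AUDIT_LOG, "a") as f:           # append-only trail for later audits
        f.write(json.dumps(record) + "\n")
    return output

print(traced_call("Summarize this loan application.", user_id="analyst-42"))
```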

So, the substantial progress isn’t just “better explanations on a slide,” it’s the emergence of end-to-end traceability stacks for real AI deployments, where you can defend decisions, detect misbehavior and demonstrate compliance.
