IBM’s new benchmark changes monthly to avoid teaching to the test

The multi-modal benchmark, LiveXiv, evaluates vision-language models on questions and images they have likely never seen before, providing a potentially more accurate assessment of their capabilities.

AI models are evolving rapidly, and so are the benchmarks that measure their progress. Benchmarks are essentially tests of how well AI models can carry out a task or achieve a target goal, giving people a way to compare models and select the best one for their needs.

But benchmarks have flaws. They may not adequately measure the skills their creators intended, and over time they can drift into irrelevance as their questions become too easy. There is also evidence that they are becoming easier to game, as more models train on data scraped from the web, including the problems in the benchmarks themselves. The practical implication is that models that have climbed prominent leaderboards may prove far less capable in real life.

To get a more accurate picture of AI aptitude, researchers at IBM, the University of Michigan, and Tel Aviv University have created a new, dynamic benchmark for visual document understanding tasks. LiveXiv (pronounced “live-kive”) provides an evolving set of test questions corresponding to images in the latest papers posted on arXiv, a platform for sharing scientific research. Updated automatically once a month, LiveXiv may be the largest, most comprehensive benchmark yet for evaluating vision-language models (VLMs) on their ability to analyze charts, tables, diagrams, and other images.

One downside of updating a benchmark this often is that, while cheating becomes harder, re-evaluating and re-ranking models becomes more costly. To make live benchmarks more feasible, the researchers also devised a statistical shortcut for re-evaluating a full leaderboard. In a paper to be presented at the upcoming International Conference on Learning Representations (ICLR), they show that their method can accurately estimate model performance using 70% less data and computation.

"That evaluation is important, and it's important to do it efficiently," said the study's lead author, Nimrod Shabtay, a researcher at IBM who is also a PhD student at Tel Aviv University.

Benchmarking AI at the pace of scientific progress

Modern AI depends on data. To continue improving their models, companies now routinely crawl the web for new content to feed them. This shift to automated data gathering has increased the odds that AI models have effectively seen the test before taking it. This is known as benchmark contamination.

In the last year or so, a series of dynamic benchmarks have emerged in response. The researchers took inspiration from one of them, a comprehensive benchmark called LiveBench that uses arXiv papers, among other sources, to assess LLMs on their math and linguistic ‘reasoning’ skills. But rather than focus solely on text, the researchers chose to zero in on visual reasoning tasks important to enterprises. To make their benchmark as efficient and as scalable as possible, they also introduced automation into the benchmark creation and model evaluation process.

One pillar of this automated pipeline is Docling, IBM’s open-source toolkit for document conversion. The researchers used Docling to turn raw PDF pages from the web into structured, machine-readable documents. They also harnessed a pair of language models to convert the documents into more than 16,000 questions and answers about the graphs, tables, and charts within.
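As a rough illustration, here is a minimal Python sketch of what the document-conversion step might look like using Docling’s public API. The question-generation prompt, the file name, and the way the prompt is handed to a language model are assumptions for illustration, not the exact pipeline described in the paper.

```python
# Minimal sketch of the document-conversion step with Docling.
# The prompt wording and file name are illustrative assumptions,
# not the paper's exact setup.
from docling.document_converter import DocumentConverter


def paper_to_markdown(pdf_path: str) -> str:
    """Convert a raw arXiv PDF into structured Markdown with Docling."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def build_qa_prompt(doc_markdown: str) -> str:
    """Wrap the converted document in a prompt asking a language model to
    draft questions that require reading its tables, charts, or figures."""
    return (
        "Below is a scientific paper converted to Markdown.\n"
        "Write multiple-choice questions that can only be answered by "
        "reading its tables, charts, or figures.\n\n" + doc_markdown
    )


if __name__ == "__main__":
    markdown = paper_to_markdown("example_arxiv_paper.pdf")  # hypothetical local file
    print(build_qa_prompt(markdown)[:500])
```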

Now in its fifth month — and fifth iteration — the current version of LiveXiv draws from more than 300 arXiv papers across 14 domains.

LiveXiv tests vision-language models on a variety of tasks that involve 'reasoning' over images and their corresponding text.

Quick comparisons

Each time a benchmark is updated, ideally every model on the leaderboard gets re-evaluated. But running dozens of models again on the newest version can quickly become prohibitively expensive. Inspired by the rise of “tiny” benchmarks that can gauge model performance from a few strategically picked questions, the researchers came up with a solution for LiveXiv.

A tiny benchmark infers from a few of an AI model’s responses how it is likely to do on the complete test. Similarly, the LiveXiv method estimates from a subset of re-evaluated models how the rest of the field will perform. In experiments to validate their approach, the researchers found that re-evaluating just five of 17 VLMs was enough to accurately predict the performance of the full leaderboard. They also found that five re-evaluations sufficed regardless of the sample size.
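The paper’s exact estimator isn’t spelled out here, but the general idea can be illustrated with a simplified stand-in: re-run only a handful of models on the new benchmark version, fit a mapping from their previous scores to their new ones, and use that mapping to estimate scores for the models that weren’t re-run. The function and the toy accuracy numbers below are illustrative assumptions, not results from the paper.

```python
# Simplified stand-in for the efficient re-evaluation idea (not the paper's
# exact estimator): re-evaluate a small subset of models on the new benchmark
# version, fit a linear mapping from old scores to new scores, and use it to
# estimate new scores for the models that were not re-run.
import numpy as np


def estimate_new_scores(old_scores: dict[str, float],
                        new_scores_subset: dict[str, float]) -> dict[str, float]:
    """Predict new-version scores for all models from a re-evaluated subset."""
    subset = list(new_scores_subset)
    x = np.array([old_scores[m] for m in subset])
    y = np.array([new_scores_subset[m] for m in subset])
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares linear fit
    return {m: new_scores_subset.get(m, slope * old_scores[m] + intercept)
            for m in old_scores}


# Toy usage with made-up accuracies: five of seven models are re-evaluated,
# and scores for the remaining two are estimated from the fitted mapping.
old = {"A": 0.62, "B": 0.58, "C": 0.55, "D": 0.51, "E": 0.47, "F": 0.44, "G": 0.40}
new_subset = {"A": 0.59, "C": 0.52, "D": 0.49, "F": 0.42, "G": 0.38}
print(estimate_new_scores(old, new_subset))
```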

AI rankings continue to be fiercely contested and closely watched, despite growing suspicions that data contamination is influencing results. The work from IBM, the University of Michigan, and Tel Aviv University adds support for those suspicions. When the researchers compared how high-ranking VLMs did on the dynamic LiveXiv dataset with their results on static visual understanding benchmarks, they saw that some models dropped significantly in the rankings.

| Model | Static datasets (avg. rank) | LiveXiv (avg. rank) | Rank change |
| --- | --- | --- | --- |
| InstructBLIP-7B | 15.33 | 17.00 | -1.67 |
| LLaVA-1.6-Mistral-7B | 13.67 | 16.0 | -2.33 |
| Mantis | 14.5 | 14.5 | 0.0 |
| LLaVA-1.5-7B | 14.33 | 14.5 | -0.17 |
| LLaVA-1.5-13B | 12.67 | 13.0 | -0.33 |
| Idefics2 | 12.0 | 12.0 | 0.0 |
| InternVL2-2B | 9.0 | 10.0 | -1.0 |
| IXC2-4KHD-7B | 6.33 | 10.5 | -4.17 |
| IXC2.5-7B | 5.0 | 9.5 | -4.5 |
| LLaVA-OneVision-7B | 8.0 | 6.5 | 1.5 |
| Phi3v | 9.0 | 6.0 | 3.0 |
| Idefics3 | 8.5 | 6.5 | 2.0 |
| LLaVA-1.6-34B | 9.33 | 6.5 | 2.83 |
| GPT-4o | 2.33 | 4.0 | -1.67 |
| Qwen2-VL | 3.33 | 3.0 | 0.33 |
| InternVL2-8B | 3.33 | 2.0 | 1.33 |
| Claude-Sonnet | 1.0 | 1.0 | 0.0 |

What’s next

The researchers plan to release a public LiveXiv leaderboard soon. They also plan to apply their automated pipeline to other frequently updated data sources to create additional live benchmarks. To test your VLM on LiveXiv, and to peruse the questions (and answers) in version five, check out the project website on Hugging Face.
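For instance, pulling the current question set with the Hugging Face datasets library might look roughly like the sketch below. The repository id and split name are placeholders, so check the project page on Hugging Face for the actual identifiers.

```python
# Sketch of loading LiveXiv questions with the Hugging Face `datasets` library.
# The repository id and split name are placeholders -- consult the project
# page on Hugging Face for the real identifiers and fields.
from datasets import load_dataset

dataset = load_dataset("LiveXiv/LiveXiv", split="test")  # placeholder repo id and split
sample = dataset[0]
print(sample.keys())  # inspect the available fields (e.g., image, question, answer)
```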
