Benchmarks for evaluating models provide insight into each model's capabilities and allow for effective comparison, helping users determine where and how to fine-tune a model, and with what additional data, to enable practical deployment.
Generative AI models have evolved significantly over the years, making it increasingly difficult to determine which model is best for a given purpose. It has become essential to establish reliable evaluation frameworks that can accurately judge model quality.
Moreover, a proper framework helps authorities and other concerned agencies assess a model's safety, accuracy, reliability, and usability.
No single framework is sufficient on its own, nor do these frameworks consider safety as a factor in evaluation.
An evaluation framework should therefore cover several important factors, including the safety, accuracy, reliability, and usability concerns noted above.
Different frameworks use different prompting approaches, such as zero-shot, one-shot, and few-shot prompting, and may report an independent score for each. Generally speaking, few-shot (multi-shot) approaches yield higher benchmark scores than zero-shot approaches.
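As a rough illustration, the sketch below shows how the same benchmark question might be framed as a zero-shot prompt versus a few-shot prompt. The question, worked examples, and prompt template are invented for demonstration; real frameworks each define their own templates.

```python
# Illustrative only: how zero-shot vs. few-shot prompts are typically assembled.
# The question and worked examples here are made up for demonstration.

QUESTION = (
    "Q: A pencil costs 3 dollars and an eraser costs 2 dollars. "
    "What do 2 pencils and 1 eraser cost?\nA:"
)

# Zero-shot: the model sees only the task instruction and the question.
zero_shot_prompt = "Answer the math question.\n\n" + QUESTION

# Few-shot (here 2-shot): worked examples are prepended so the model can
# infer the expected answer format before seeing the real question.
examples = [
    "Q: A book costs 4 dollars. What do 3 books cost?\nA: 12",
    "Q: An apple costs 1 dollar and a banana costs 2 dollars. "
    "What do 2 apples and 2 bananas cost?\nA: 6",
]
few_shot_prompt = (
    "Answer the math question.\n\n" + "\n\n".join(examples) + "\n\n" + QUESTION
)

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```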
Given the large number of large language models and the growing number of benchmarking frameworks, EleutherAI developed an evaluation “harness” that allows developers to automate benchmarking of LLMs using a few-shot approach.
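Below is a minimal sketch of running the harness from Python, assuming the `lm_eval.simple_evaluate` entry point available in recent releases of the lm-evaluation-harness; argument names and task identifiers can differ between versions (older releases expose an equivalent command-line script instead), so consult the repository for specifics.

```python
# Sketch only: evaluate a small Hugging Face model on HellaSwag with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). Assumes the
# simple_evaluate entry point from recent releases of the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any HF model id works here
    tasks=["hellaswag"],           # one or more task names
    num_fewshot=5,                 # few-shot prompting, as discussed above
    batch_size=8,
)

# Per-task metrics (e.g., accuracy, normalized accuracy) keyed by task name.
print(results["results"])
```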
Hugging Face maintains an LLM leaderboard that tracks the benchmark performance of the latest open-source models across numerous frameworks.
Stanford University developed HELM (Holistic Evaluation of Language Models), a comprehensive framework for evaluating foundation models. A leaderboard tracks the performance of the evaluated models.
LLM benchmarks:

| Benchmark | Type | Description | URL |
|---|---|---|---|
| Adversarial NLI (ANLI) | | Robustness, generalization, coherent explanations for inferences, consistency of reasoning across similar examples, and efficiency in terms of resource usage (memory, inference time, and training time) | https://github.com/facebookresearch/anli |
| ARC (Abstraction and Reasoning Corpus) | | Challenges an algorithm to solve a variety of previously unknown tasks based on a few demonstrations, typically three per task | https://github.com/fchollet/ARC |
| Big-Bench Hard | Reasoning | Diverse set of challenging tasks requiring multi-step reasoning | https://github.com/google/BIG-bench |
| CoQA | | Understand a text passage and answer a series of interconnected questions that appear in a conversation | https://stanfordnlp.github.io/coqa/ |
| DROP (Discrete Reasoning Over Paragraphs) | Reasoning | Reading comprehension, scored with token-level F1 (see the sketch after this table) | https://arxiv.org/pdf/1903.00161v2.pdf |
| EleutherAI LM Eval | | Few-shot evaluation and performance on a wide range of tasks with minimal fine-tuning | https://github.com/EleutherAI/lm-evaluation-harness |
| GLUE (General Language Understanding Evaluation) | | Grammar, paraphrasing, text similarity, inference, textual entailment, and resolving pronoun references | https://gluebenchmark.com/ |
| GSM8K (Grade School Math) | Math | Basic arithmetic manipulations (incl. grade-school math problems) | https://github.com/openai/grade-school-math |
| HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) | Reasoning | Commonsense reasoning for everyday tasks | https://rowanzellers.com/hellaswag/ |
| HumanEval | Code | Python code generation | https://github.com/openai/human-eval |
| LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) | | Long-range understanding via prediction of the last word of a passage | https://zenodo.org/record/2630551#.ZFUKS-zML0p |
| LIT (Language Interpretability Tool) | | Platform for evaluation on user-defined metrics, with insights into a model's strengths, weaknesses, and potential biases | https://pair-code.github.io/lit/ |
| LogiQA | | Logical reasoning abilities | https://github.com/lgw863/LogiQA-dataset |
| MATH (Math Word Problem Solving) | Math | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | https://arxiv.org/pdf/2103.03874.pdf |
| MMLU (Massive Multitask Language Understanding) | Capability | Measures multitask language understanding with questions in 57 subjects (incl. STEM, humanities, and others) | https://github.com/hendrycks/test |
| MultiNLI (Multi-Genre Natural Language Inference) | | Understanding relationships between sentences across different genres | https://cims.nyu.edu/~sbowman/multinli/ |
| Natural2Code | Code | Python code generation on a new HumanEval-style held-out dataset that has not leaked onto the web | https://arxiv.org/pdf/2107.03374v2.pdf |
| OpenAI Evals | | Accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness of generated text | https://github.com/openai/evals |
| OpenAI Moderation API | | Filters out harmful or unsafe content | https://platform.openai.com/docs/api-reference/moderations |
| ParlAI | | Accuracy, F1 score, perplexity (how well the model predicts the next word in a sequence), human evaluation of relevance, fluency, and coherence, speed and resource utilization, robustness (performance under noisy inputs, adversarial attacks, or varying data quality), and generalization | https://github.com/facebookresearch/ParlAI |
| SQuAD (Stanford Question Answering Dataset) | | Reading comprehension tasks | https://rajpurkar.github.io/SQuAD-explorer/ |
| SuperGLUE Benchmark (General Language Understanding Evaluation) | | Natural language understanding, reasoning, understanding complex sentences beyond training data, coherent and well-formed natural language generation, dialogue with human beings, commonsense reasoning (everyday scenarios, social norms, and conventions), information retrieval, and reading comprehension | https://super.gluebenchmark.com/ |
| TruthfulQA | | 817 questions that span 38 categories, including health, law, finance, and politics | https://arxiv.org/abs/2109.07958 |
| Winogrande | | Dataset of 44k problems inspired by the original WSC design of 273 expert-crafted pronoun-resolution problems | https://winogrande.allenai.org/ |
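Several of the benchmarks above (e.g., DROP and SQuAD) score free-form answers with a token-overlap F1. Below is a minimal sketch of that metric; the official evaluation scripts additionally normalize answers (lowercasing, stripping punctuation and articles) and take the maximum score over multiple gold answers.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 in the style of SQuAD/DROP answer scoring (simplified)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Count tokens that appear in both prediction and gold answer.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the city of Paris", "Paris"))  # ~0.33
```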
Multimodal (image) benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| MMMU (Massive Multi-discipline Multimodal Understanding) | Multi-discipline college-level reasoning problems | https://mmmu-benchmark.github.io/ |
| VQAv2 (Visual Question Answering) | Open-ended questions about images, scored with the VQA accuracy metric (see the sketch after this table) | https://visualqa.org/index.html |
| TextVQA | Requires models to read and reason about text in images to answer questions about them | https://textvqa.org/ |
| DocVQA | Seeks to inspire a “purpose-driven” point of view in document analysis and recognition | https://www.docvqa.org/ |
| InfographicVQA | Answer questions about a given infographic image | https://www.docvqa.org/datasets/infographicvqa |
| MathVista | A benchmark designed to combine challenges from diverse mathematical and visual tasks | https://mathvista.github.io/ |
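VQAv2 scores an open-ended answer against the answers of ten human annotators, giving full credit when at least three annotators gave the same answer. The sketch below shows that core formula only; the official evaluation code additionally normalizes answers (case, punctuation, number words) and averages the score over subsets of annotators.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Eight of ten annotators answered "2", so the prediction gets full credit.
print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "2", "2", "2", "2"]))  # 1.0
```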
Video benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| VATEX | English video captioning (CIDEr) | https://arxiv.org/pdf/1904.03493.pdf |
| Perception Test MCQA (Multiple Choice Question Answering) | Video question answering | |
Audio benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| CoVoST 2 | Automatic speech translation in 21 languages (BLEU score) | https://arxiv.org/abs/2007.10310 |
| FLEURS | Automatic speech recognition in 62 languages (lower is better; see the sketch after this table) | https://arxiv.org/pdf/2205.12446v1.pdf |
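For FLEURS-style speech recognition, “lower is better” because the score is an error rate, typically the word error rate (WER): the word-level edit distance between the reference transcript and the model's hypothesis, divided by the reference length. The sketch below shows the standard metric definition, not the official FLEURS scoring code, which also applies text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```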