Benchmarks for evaluating models provide insight into each model's capabilities and allow for effective comparison, helping users determine where and how to fine-tune a model, and with what additional data, to enable practical deployment.
Generative AI models have evolved significantly over the years, making it increasingly difficult to determine which model is best for a given purpose. It has become essential to establish reliable evaluation frameworks that can accurately judge model quality.
Moreover, a proper framework helps authorities and other concerned agencies assess a model's safety, accuracy, reliability, and usability.
No single framework is sufficient on its own, nor do these frameworks consider safety as a factor in evaluation.
An evaluation framework should therefore cover several important factors, including the safety, accuracy, reliability, and usability concerns noted above.
Different frameworks use different prompting approaches, such as zero-shot, one-shot, and few-shot prompting, and may report an independent score for each. Generally speaking, few-shot (multi-shot) approaches yield higher benchmark scores than zero-shot approaches.
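As a rough illustration, the sketch below shows how the same benchmark question might be framed as a zero-shot prompt versus a few-shot prompt. The question, worked examples, and prompt template are invented for demonstration; real frameworks each define their own templates.

```python
# Illustrative only: how zero-shot vs. few-shot prompts are typically assembled.
# The question and worked examples here are made up for demonstration.

QUESTION = (
    "Q: A pencil costs 3 dollars and an eraser costs 2 dollars. "
    "What do 2 pencils and 1 eraser cost?\nA:"
)

# Zero-shot: the model sees only the task instruction and the question.
zero_shot_prompt = "Answer the math question.\n\n" + QUESTION

# Few-shot (here 2-shot): worked examples are prepended so the model can
# infer the expected answer format before seeing the real question.
examples = [
    "Q: A book costs 4 dollars. What do 3 books cost?\nA: 12",
    "Q: An apple costs 1 dollar and a banana costs 2 dollars. "
    "What do 2 apples and 2 bananas cost?\nA: 6",
]
few_shot_prompt = (
    "Answer the math question.\n\n" + "\n\n".join(examples) + "\n\n" + QUESTION
)

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```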
Given the large number of large language models and the growing number of benchmarking frameworks, EleutherAI developed an evaluation “harness” that allows developers to automate benchmarking of LLMs using a few-shot approach.
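Below is a minimal sketch of running the harness from Python, assuming the `lm_eval.simple_evaluate` entry point available in recent releases of the lm-evaluation-harness; argument names and task identifiers can differ between versions (older releases expose an equivalent command-line script instead), so consult the repository for specifics.

```python
# Sketch only: evaluate a small Hugging Face model on HellaSwag with the
# EleutherAI lm-evaluation-harness (pip install lm-eval). Assumes the
# simple_evaluate entry point from recent releases of the harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any HF model id works here
    tasks=["hellaswag"],           # one or more task names
    num_fewshot=5,                 # few-shot prompting, as discussed above
    batch_size=8,
)

# Per-task metrics (e.g., accuracy, normalized accuracy) keyed by task name.
print(results["results"])
```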
Hugging Face maintains an LLM leaderboard that tracks the benchmark performance of the latest open-source models across numerous frameworks.
Stanford University developed HELM (Holistic Evaluation of Language Models), a comprehensive framework for evaluating foundation models. A leaderboard tracks the performance of the evaluated models.
LLM benchmarks:

| Benchmark | Type | Description | URL |
|---|---|---|---|
| Adversarial NLI (ANLI) | | Robustness, generalization, coherent explanations for inferences, consistency of reasoning across similar examples, and efficiency in terms of resource usage (memory, inference time, and training time) | https://github.com/facebookresearch/anli |
| ARC (Abstraction and Reasoning Corpus) | | Challenges an algorithm to solve a variety of previously unknown tasks based on a few demonstrations, typically three per task | https://github.com/fchollet/ARC |
| Big-Bench Hard | Reasoning | Diverse set of challenging tasks requiring multi-step reasoning | https://github.com/google/BIG-bench |
| CoQA | | Understand a text passage and answer a series of interconnected questions that appear in a conversation | https://stanfordnlp.github.io/coqa/ |
| DROP (Discrete Reasoning Over Paragraphs) | Reasoning | Reading comprehension, scored with token-level F1 (see the sketch after this table) | https://arxiv.org/pdf/1903.00161v2.pdf |
| EleutherAI LM Eval | | Few-shot evaluation and performance on a wide range of tasks with minimal fine-tuning | https://github.com/EleutherAI/lm-evaluation-harness |
| GLUE (General Language Understanding Evaluation) | | Grammar, paraphrasing, text similarity, inference, textual entailment, and resolving pronoun references | https://gluebenchmark.com/ |
| GSM8K (Grade School Math) | Math | Basic arithmetic manipulations (incl. grade-school math problems) | https://github.com/openai/grade-school-math |
| HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) | Reasoning | Commonsense reasoning for everyday tasks | https://rowanzellers.com/hellaswag/ |
| HumanEval | Code | Python code generation | https://github.com/openai/human-eval |
| LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) | | Long-range understanding via prediction of the last word of a passage | https://zenodo.org/record/2630551#.ZFUKS-zML0p |
| LIT (Language Interpretability Tool) | | Platform for evaluation on user-defined metrics, with insights into a model's strengths, weaknesses, and potential biases | https://pair-code.github.io/lit/ |
| LogiQA | | Logical reasoning abilities | https://github.com/lgw863/LogiQA-dataset |
| MATH (Math Word Problem Solving) | Math | Challenging math problems (incl. algebra, geometry, pre-calculus, and others) | https://arxiv.org/pdf/2103.03874.pdf |
| MMLU (Massive Multitask Language Understanding) | Capability | Measures multitask language understanding with questions in 57 subjects (incl. STEM, humanities, and others) | https://github.com/hendrycks/test |
| MultiNLI (Multi-Genre Natural Language Inference) | | Understanding relationships between sentences across different genres | https://cims.nyu.edu/~sbowman/multinli/ |
| Natural2Code | Code | Python code generation on a new HumanEval-style held-out dataset that has not leaked onto the web | https://arxiv.org/pdf/2107.03374v2.pdf |
| OpenAI Evals | | Accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness of generated text | https://github.com/openai/evals |
| OpenAI Moderation API | | Filters out harmful or unsafe content | https://platform.openai.com/docs/api-reference/moderations |
| ParlAI | | Accuracy, F1 score, perplexity (how well the model predicts the next word in a sequence), human evaluation of relevance, fluency, and coherence, speed and resource utilization, robustness (performance under noisy inputs, adversarial attacks, or varying data quality), and generalization | https://github.com/facebookresearch/ParlAI |
| SQuAD (Stanford Question Answering Dataset) | | Reading comprehension tasks | https://rajpurkar.github.io/SQuAD-explorer/ |
| SuperGLUE Benchmark (General Language Understanding Evaluation) | | Natural language understanding, reasoning, understanding complex sentences beyond training data, coherent and well-formed natural language generation, dialogue with human beings, commonsense reasoning (everyday scenarios, social norms, and conventions), information retrieval, and reading comprehension | https://super.gluebenchmark.com/ |
| TruthfulQA | | 817 questions that span 38 categories, including health, law, finance, and politics | https://arxiv.org/abs/2109.07958 |
| Winogrande | | Dataset of 44k problems inspired by the original WSC design of 273 expert-crafted pronoun-resolution problems | https://winogrande.allenai.org/ |
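Several of the benchmarks above (e.g., DROP and SQuAD) score free-form answers with a token-overlap F1. Below is a minimal sketch of that metric; the official evaluation scripts additionally normalize answers (lowercasing, stripping punctuation and articles) and take the maximum score over multiple gold answers.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 in the style of SQuAD/DROP answer scoring (simplified)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Count tokens that appear in both prediction and gold answer.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in the city of Paris", "Paris"))  # ~0.33
```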
Multimodal (image) benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| MMMU (Massive Multi-discipline Multimodal Understanding) | Multi-discipline college-level reasoning problems | https://mmmu-benchmark.github.io/ |
| VQAv2 (Visual Question Answering) | Open-ended questions about images, scored with the VQA accuracy metric (see the sketch after this table) | https://visualqa.org/index.html |
| TextVQA | Requires models to read and reason about text in images to answer questions about them | https://textvqa.org/ |
| DocVQA | Seeks to inspire a “purpose-driven” point of view in document analysis and recognition | https://www.docvqa.org/ |
| InfographicVQA | Answer questions about a given infographic image | https://www.docvqa.org/datasets/infographicvqa |
| MathVista | A benchmark designed to combine challenges from diverse mathematical and visual tasks | https://mathvista.github.io/ |
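VQAv2 scores an open-ended answer against the answers of ten human annotators, giving full credit when at least three annotators gave the same answer. The sketch below shows that core formula only; the official evaluation code additionally normalizes answers (case, punctuation, number words) and averages the score over subsets of annotators.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Eight of ten annotators answered "2", so the prediction gets full credit.
print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "2", "2", "2", "2"]))  # 1.0
```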
Video benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| VATEX | English video captioning (CIDEr) | https://arxiv.org/pdf/1904.03493.pdf |
| Perception Test MCQA (Multiple Choice Question Answering) | Video question answering | |
Audio benchmarks:

| Benchmark | Description | URL |
|---|---|---|
| CoVoST 2 | Automatic speech translation in 21 languages (BLEU score) | https://arxiv.org/abs/2007.10310 |
| FLEURS | Automatic speech recognition in 62 languages (lower is better; see the sketch after this table) | https://arxiv.org/pdf/2205.12446v1.pdf |
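For FLEURS-style speech recognition, “lower is better” because the score is an error rate, typically the word error rate (WER): the word-level edit distance between the reference transcript and the model's hypothesis, divided by the reference length. The sketch below shows the standard metric definition, not the official FLEURS scoring code, which also applies text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```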