Benchmarking LLMs: what is the best LLM?

In this dynamic landscape of LLMs, with new versions popping up everywhere, I have personally been quite confused: which of them is best? How do we benchmark them? And how does the number of parameters impact a model?

Earlier today I saw a new LLM being published called FreeWilly (Meet FreeWilly, Our Large And Mighty Instruction Fine-Tuned Models — Stability AI), which uses LLaMa2 as a base model. When evaluating LLMs, they need to be tested against various scenarios such as reasoning, understanding linguistic subtleties, and answering complex questions in specialized domains, e.g., law and mathematical problem-solving.

So, what are the obvious benchmarks?

The first one is the Hugging Face Open LLM benchmark (Open LLM Leaderboard – a Hugging Face Space by HuggingFaceH4), which evaluates LLMs with a score from 0 to 100 and is mostly based upon GitHub – EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models. It should be noted that this leaderboard only tests open-source language models, hence GPT is absent from the list. The evaluation includes the following tests:

  • ARC (AI2) – A set of grade-school science questions.
  • HellaSwag – A test of commonsense inference, which is easy for humans yet challenging for models.
  • MMLU – A test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
  • TruthfulQA – A test to measure a model’s propensity to reproduce falsehoods commonly found online.
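The leaderboard combines these per-benchmark accuracies into a single average score. As a rough sketch of that aggregation (the benchmark names match the leaderboard, but the score values below are made-up illustration numbers, not real results):

```python
# Toy sketch of a leaderboard-style average score over the four benchmarks.
# Each individual score is an accuracy in the 0-100 range.

def leaderboard_average(scores: dict[str, float]) -> float:
    """Average the per-benchmark scores into one 0-100 number."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-benchmark accuracies for a single model.
model_scores = {
    "ARC": 60.0,
    "HellaSwag": 80.0,
    "MMLU": 55.0,
    "TruthfulQA": 45.0,
}

print(leaderboard_average(model_scores))  # 60.0
```

The unweighted mean is what makes single-number rankings possible, but it also hides which individual capability a model is strong or weak in.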

The other obvious benchmark comes from GPT4All, a framework that supports running a wide range of different LLMs locally. It has its own benchmark suite, which also uses HellaSwag and ARC but additionally includes:

  • BoolQ – A question-answering dataset for yes/no questions containing 15,942 examples.
  • PIQA – Questions requiring physical commonsense, which pose a challenge to state-of-the-art natural language understanding systems.
  • WinoGrande – The goal is to choose the right option for a given sentence, which requires commonsense reasoning.
  • OBQA, also known as OpenBookQA.
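Most of these are multiple-choice tasks: the model assigns a score (for example a log-likelihood) to each candidate option, the highest-scoring option counts as its answer, and accuracy is the fraction answered correctly. A minimal sketch of that scoring loop, with invented example data:

```python
# Toy accuracy scorer for multiple-choice benchmarks like PIQA or WinoGrande.
# Each example is (option_scores, gold_index): the model's score per option,
# plus the index of the correct option.

def accuracy(examples):
    correct = 0
    for option_scores, gold in examples:
        # The model's answer is the option it scores highest.
        prediction = max(range(len(option_scores)), key=lambda i: option_scores[i])
        if prediction == gold:
            correct += 1
    return correct / len(examples)

# Three fabricated examples: the model gets two of three right.
examples = [
    ([-2.1, -0.5], 1),  # prefers option 1, gold is 1 -> correct
    ([-0.3, -1.7], 0),  # prefers option 0, gold is 0 -> correct
    ([-0.9, -0.2], 0),  # prefers option 1, gold is 0 -> wrong
]
print(accuracy(examples))  # 2 of 3 correct
```

Real harnesses add details such as length normalization of the log-likelihoods, but the core comparison is this simple.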

Then we have AGIEval from Microsoft (GitHub – microsoft/AGIEval), a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., the Chinese College Entrance Exam (Gaokao) and the American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.

We also have the Alpaca Eval Leaderboard (Alpaca Eval), which is an LLM-based automatic evaluation. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions.

Stanford University has also developed a new benchmarking approach for large language models called Holistic Evaluation of Language Models (HELM). They have benchmarked 30 language models across a core set of scenarios and metrics under standardized conditions to highlight their capabilities and risks. This is also intended to be a “live” benchmark system and index. You can view the live results here –> Holistic Evaluation of Language Models (HELM). It should be noted that it does not yet have benchmark results for new LLMs such as LLaMa2, but these are pending approval on GitHub.

What I like about HELM is that they have listed specific categories and the score for each model, but again, newer models like GPT-4 are not included in the test results.

OpenAI has, however, included some benchmarks and results on the webpage where they released GPT-4, showing how GPT-4 compares against other models.

However, here they also state “Evaluated few-shot”, which means that they give a set of examples (five examples) to the language model to provide it with enough context to generate its own response on the next example.

Example evaluation samples:

  • “I loved the movie! It was fantastic.” (Positive sentiment)
  • “This product is terrible. I’m really disappointed.” (Negative sentiment)
  • “The weather is nice today.” (Neutral sentiment)

Model accuracy is then evaluated by comparing its predictions against the true labels in the evaluation set. This assesses the model’s ability to generalize from limited training data to new, unseen examples.
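In practice, few-shot evaluation means packing labeled examples like these into the prompt before the new item. A minimal sketch of such prompt construction; the exact template is an assumption (real evaluations such as OpenAI’s 5-shot setup use their own formats):

```python
# Build a few-shot sentiment-classification prompt from labeled examples.
# The template below is an illustrative assumption, not a standard format.

few_shot_examples = [
    ("I loved the movie! It was fantastic.", "Positive"),
    ("This product is terrible. I'm really disappointed.", "Negative"),
    ("The weather is nice today.", "Neutral"),
]

def build_prompt(examples, new_text):
    lines = ["Classify the sentiment of each sentence."]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nSentiment: {label}")
    # The prompt ends mid-pattern so the model completes the missing label.
    lines.append(f"Sentence: {new_text}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt(few_shot_examples, "The service was quick and friendly.")
print(prompt)
```

The model is never fine-tuned here; it only sees the pattern in context and is expected to continue it, which is exactly what “evaluated few-shot” refers to.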

Luckily, I found one site that compares benchmarks based upon specific scenarios and covers the different recent LLMs: GPT-4 Technical Report | Papers With Code. Here you can see a score for the different LLMs within distinct categories (for instance HumanEval Benchmark (Code Generation) | Papers With Code).

Looking at the current stats, the models with the highest average score in most scenarios are the GPT models (especially GPT-4) from OpenAI. However, might the number of parameters be the factor that allows GPT to get such a high score? And how does it impact the score of a language model? Looking at the test scores for LLaMa2, we can see that the higher the number of parameters, the higher the score in the different benchmarks.

This is why I like the visualization from Google that shows how the complexity and score of an LLM can grow within different areas as the number of parameters increases.

A larger model can excel in grasping the subtleties of human language, enabling it to generate more precise and refined responses. Parameters are the values a model learns during training, which establish the patterns it applies to new data. Consequently, a higher number of parameters often leads to improved performance. Nonetheless, this advantage comes at the cost of increased computational resources required to run the model, so developers must strike a balance between performance and computational efficiency when deciding on a model’s size.
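To get a feel for where those parameter counts come from, a common back-of-the-envelope rule is that a transformer’s weight count scales with roughly 12 · n_layers · d_model², plus the embedding matrix. The configuration values below loosely resemble a 7B-class model and are an assumption for illustration:

```python
# Rough transformer parameter-count estimate from its configuration.
# 12 * n_layers * d_model^2 approximates the attention + feed-forward
# weights per layer; the token embedding matrix is added separately.

def estimate_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    block_params = 12 * n_layers * d_model ** 2  # attention + MLP weights
    embedding_params = vocab_size * d_model      # token embedding matrix
    return block_params + embedding_params

# Hypothetical 7B-class configuration (32 layers, hidden size 4096).
total = estimate_params(n_layers=32, d_model=4096, vocab_size=32000)
print(f"{total:,}")  # roughly 6.6 billion parameters
```

This also makes the cost trade-off concrete: doubling the hidden size quadruples the dominant term, which is why bigger scores come with much bigger memory and compute bills.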

So, given that many of the language models are available as cloud services, how can we run them locally? And if so, how can we improve performance? Right now, there are many options for running an LLM locally, depending on which operating system you are running.

There are so many options available now, and most of them allow you to host your own language model locally. Some also offer a web UI integrated with a cloud-based language model.

Inference, or the model’s ability to generate predictions or responses based on the context and input it has been given, is mostly memory intensive, and we have some options to improve the number of tokens the model can generate in response each second. One of these is vLLM, which uses a memory-management technique called PagedAttention to allocate and share the KV cache efficiently, and which can drastically increase a model’s throughput (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention).
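The core idea behind PagedAttention is borrowed from virtual memory: instead of pre-reserving one big contiguous KV-cache buffer per request (sized for the maximum sequence length), memory is handed out in small fixed-size blocks as the sequence actually grows. The following is a conceptual toy sketch of that allocation pattern, not vLLM’s actual implementation:

```python
# Toy illustration of paged KV-cache allocation, the idea behind vLLM's
# PagedAttention. Real vLLM manages GPU memory blocks shared across
# requests; this sketch only models the block bookkeeping.

BLOCK_SIZE = 16  # tokens stored per block

class PagedKVCache:
    def __init__(self):
        self.block_table = []   # block ids owned by this sequence
        self.next_block_id = 0

    def append_token(self, position: int):
        """Allocate a new block only when the current one fills up."""
        if position % BLOCK_SIZE == 0:
            self.block_table.append(self.next_block_id)
            self.next_block_id += 1

    def blocks_used(self) -> int:
        return len(self.block_table)

cache = PagedKVCache()
for pos in range(40):  # generate a 40-token sequence
    cache.append_token(pos)

# A 40-token sequence only needs ceil(40 / 16) = 3 blocks, instead of a
# buffer pre-sized for, say, a 4096-token maximum context.
print(cache.blocks_used())  # 3
```

Because memory is allocated on demand, many more concurrent requests fit on the same GPU, which is where the throughput gain comes from.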

There are also tools like GPTCache, which can act as a buffer between users and OpenAI ChatGPT (GitHub – zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index.) and which also supports LangChain (vLLM does not support LangChain yet).
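The idea of a semantic cache is that before calling the LLM at all, you check whether a sufficiently similar question has already been answered and reuse that answer. Real GPTCache does this with embeddings and a vector store; the toy sketch below substitutes simple string similarity from the standard library purely for illustration:

```python
# Toy semantic cache: reuse a stored answer when a new question is
# "close enough" to a previously seen one. GPTCache itself uses
# embeddings + a vector store; difflib here is only a stand-in.

import difflib

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.store: dict[str, str] = {}  # question -> cached answer

    def get(self, question: str):
        for cached_q, answer in self.store.items():
            ratio = difflib.SequenceMatcher(
                None, question.lower(), cached_q.lower()
            ).ratio()
            if ratio >= self.threshold:
                return answer  # cache hit: skip the LLM call entirely
        return None  # cache miss: caller falls through to the LLM

    def put(self, question: str, answer: str):
        self.store[question] = answer

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("What is the capital of France ?"))  # near-duplicate hit
print(cache.get("How do I benchmark an LLM?"))       # miss -> None
```

For repetitive workloads this saves both latency and API cost, at the risk of returning a stale or subtly mismatched answer when the similarity threshold is too loose.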
