With generative AI and the explosion of use cases, companies are trying to make the best use of Large Language Models and RAG LLMs in their operations. What’s missing is a reliable LLM comparison that shows the strengths and weaknesses of the best LLMs available.
The comparison of LLM systems is one step toward building an internal tool to process documents and manage company knowledge in an efficient and more organized way. Picking the right LLM depends on the tasks to be performed and can require significant performance, quality, and price tradeoffs. Having all the data, one can reduce uncertainty and make more informed decisions.
This text contains:
- What is LLM?
- What is RAG (Retrieval Augmented Generation)? – RAG definition and why LLMs need them
- Why (and how) to compare LLMs
- List of LLMs in comparison
- Methodology
- Our work
- Conclusion
What is LLM?
An LLM (Large Language Model) is an artificial neural network that processes natural language using machine learning principles and techniques. What makes it Large (and, by that, different from a standard language model) is its size, both in the number of parameters and in the size of the training dataset. For instance, GPT4 is reported to have around 1 trillion parameters (OpenAI has not officially disclosed the figure) and has been trained on a vast body of text from the internet, with the addition of books, press articles, and a multitude of other sources.
With the size and amount of data processed, the Large Language Model shows a much greater understanding of natural language than traditional ML-based solutions do. Also, the model shows far greater flexibility: abilities that previously required fine-tuning or additional training are now achievable through prompt engineering.
What is RAG (Retrieval Augmented Generation)? – RAG definition and why LLMs need them
RAG (Retrieval Augmented Generation) is the process of supporting Large Language Model generation by providing context and forcing the system to deliver an answer by referring to its knowledge base.
Using RAG is a way to reduce the threat of hallucinations and the generation of irrelevant responses. Also, it narrows the operations of the LLM – the system needs to “focus” on the source data rather than roam through all the information stored inside its neural network.
RAGs rely on vectorization: the documents are split into chunks, and each chunk is transformed into a vector. This makes it possible to run a very fast similarity search and to provide the LLM with chunks of information containing the answer, or at least with data relevant to the question. In this particular case, the Tooploox team used the “Danswer” RAG system to provide different LLMs with documents to process.
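The retrieval step can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Danswer's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `retrieve` and `build_prompt` are names made up for this sketch, and the chunks are invented rather than taken from our dataset.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9#]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a prompt that tells the model to answer from the context only."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


# Invented example chunks, not taken from the dataset used in this comparison.
chunks = [
    "The office is open from 8 am to 6 pm on weekdays.",
    "Vacation requests must be submitted two weeks in advance.",
    "The Python guild meets every Thursday.",
]
question = "When does the Python guild meet?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A production system would replace `embed` with a learned embedding model and the linear scan in `retrieve` with a vector index, but the flow stays the same: embed, search, then hand only the best-matching chunks to the LLM.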
Why (and how) to compare LLMs
Apart from technical details such as the context length in tokens, the number of parameters, and the solution architecture, one can rarely spot significant differences between LLMs in their performance at first glance. All systems deliver coherent and plausible responses. On the other hand, even the acclaimed ChatGPT has been known to hallucinate and make up facts in order to deliver a response.
These differences become extremely significant when it comes to delivering these systems in production:
- Commercial projects – if the LLM is going to control a solution that interacts with thousands of users (with customers among them in a great number), it needs to be as resilient and well-performing as possible. Also, even tiny elements, like odd grammar or an unnatural way to communicate, are going to affect the user experience. The greater the role of the LLM in the solution, the more severe the impact.
- Security – on the reverse side of user experience, there is the security aspect of the LLM. This includes mitigating the risks of data leakage and of attacks on the model intended to alter its behavior.
- Compliance – compliance issues can be extremely complicated. This may include the need to ensure adherence to data processing standards or imposing rules on data processing control.
Considering all of that, it may be hard to name the “best” LLM, as all of them have their pros and cons. For instance, an open-source LLM run on on-prem infrastructure may be far more secure at the cost of its performance. The tradeoff needs to be justified, though: the sacrifice shouldn’t outweigh the gain.
That’s why the Tooploox AI team decided to comprehensively map the performance of popular open-source and most popular commercial LLMs in production environments with a real use case.
List of LLMs in comparison
- GPT3.5 – a proprietary model trained and owned by OpenAI. It builds on GPT3, which has 175 billion parameters and a 2048-token context window; the GPT3.5 versions offered via API have a larger window (4,096 tokens or more, depending on the version). It is commercially available via ChatGPT for free or via API, with users paying for the tokens processed.
- GPT4 – GPT4 is a large, multimodal model that works not only with text but also processes images. It was fine-tuned using reinforcement learning that leveraged human feedback. It has a significantly larger token window (8,192 and 32,768 tokens, depending on the version) and is available as a premium service for ChatGPT users.
- GPT4 Turbo – GPT4 Turbo is a performance-optimized, smaller model that aims to provide users with faster responses. As with all GPT-based models, it is a proprietary technology of OpenAI.
- Mistral – developed by the French company Mistral AI, it is an open-source model with 7 billion parameters. It uses a “sliding window” attention mechanism with a window of 4096 tokens.
- Llama13B – this open-source model (Llama 2 in its 13-billion-parameter version, also referred to below as Llama2:13B) was developed by Meta, the company behind Facebook and Instagram.
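The “sliding window” mentioned for Mistral can be illustrated with an attention mask. The sketch below is a simplified illustration with toy sizes (a window of 3 instead of 4096), not Mistral's actual code; the function name is made up, and counting the current token as part of the window is an assumption of this example.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Attention is causal (j <= i) and limited to the most recent `window`
    positions, counting the current token (an assumption of this sketch).
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # 'x' marks positions the token in this row can attend to.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row shows one token; instead of looking at the entire preceding text, it only sees a fixed-size band of recent tokens, which keeps the cost of attention bounded as the input grows.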
Methodology
Delivering reliable results required us to define a set of boundaries and guidelines that applied to all LLMs tested in this comparison. These include:
Standardized dataset
The dataset contained Tooploox documents used in day-to-day work, containing real business information. The dataset was composed entirely of our own texts, with no additions from external sources. What’s more, the dataset was composed of documents written in two languages – Polish and English, depending on the matter they dealt with.
Standardized set of questions
Having the dataset in hand, we had to prepare a set of questions to ask the system. The set had to include probable and relevant questions. For example, when running an internal document-mining LLM solution, a request for a muffin recipe is unlikely. On the other hand, delivering answers using the documents only, not general knowledge, is a huge challenge for LLMs, with a great risk of hallucinations filling the gaps in knowledge.
Building the questions set
The Tooploox team had to prepare, run, and check multiple sets of questions, changing only the LLM that processed the answer. A dull, repetitive, and daunting task. And that’s why our team handed the task over to an AI-based system. The workflow looked as follows:
A ChatGPT3.5-based agent scanned a document with the task of preparing a set of questions based on the content of the document. The system was also tasked with delivering a “perfect answer” to validate future output. A pair containing the document and the perfect answer was later reviewed by ChatGPT4.
The triad of context, perfect answer, and source document was essential for further validation.
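The workflow above can be sketched as a small pipeline. This is a minimal sketch, not our production code: `EvalItem`, `build_question_set`, and the two stand-in functions are hypothetical names, and the real generator and reviewer would be calls to the respective models rather than the trivial placeholders used here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalItem:
    """One test case: the question plus everything needed to validate an answer."""
    source_document: str
    question: str
    perfect_answer: str
    approved: bool = False


def build_question_set(
    documents: list[str],
    generate: Callable[[str], list[tuple[str, str]]],
    review: Callable[[str, str, str], bool],
) -> list[EvalItem]:
    """Generate (question, perfect answer) pairs per document and keep the approved ones."""
    items = []
    for doc in documents:
        for question, answer in generate(doc):
            if review(doc, question, answer):
                items.append(EvalItem(doc, question, answer, approved=True))
    return items


# Placeholders for the two model calls (hypothetical, for illustration only).
def fake_generate(doc: str) -> list[tuple[str, str]]:
    return [(f"What does this document say? ({len(doc)} chars)", doc)]


def fake_review(doc: str, question: str, answer: str) -> bool:
    return bool(answer.strip()) and answer in doc


items = build_question_set(
    ["Invoices are paid within 30 days.", "   "], fake_generate, fake_review
)
print(len(items), "approved item(s)")
```

Keeping the source document next to the question and the perfect answer means every later score can be traced back to the text it was supposed to come from.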
Who watches The Watchmen?
Using AI to generate questions, validate them later, and then use them to rank LLMs may seem a controversial approach at first glance. The key is in understanding the nature of the task being performed – coming up with a set of questions based on a relatively short data chunk (one document) is a relatively easy task for the machine. On the other hand, handing this task to a human team would result in many controversies, as there is always a range of interpretations for human annotators.
And last but not least – the AI team supervised the machines performing the tasks, making us humans the ultimate watchmen of AI.
Standardized set of metrics
So, how do we measure a response delivered in natural language? There are multiple scenarios to consider when measuring the performance of an LLM. One is hallucinating the correct answer but from the wrong context or an irrelevant source. On the other side of the spectrum is picking the right context and fitting fragments of documents, yet providing an inaccurate answer. And finally, delivering an answer irrelevant to the question is also a possible scenario.
Having all that in mind, we picked the following metrics:
- Faithfulness
- Answer relevance
- Context precision
- Context relevance
- Context recall
- Answer semantic similarity
- Answer correctness
More information about each of these metrics can be found below, in the results section, near the charts showing the comparison results. The details about the metrics used can be found in the Ragas documentation. If you need more details regarding the data and the methodology, please contact us directly.
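Several of these metrics boil down to simple ratios once a judge has delivered its verdicts. The sketch below follows our reading of the Ragas definitions of faithfulness and context recall; it is not the library's code. In Ragas the verdicts come from an LLM judge, while here they are passed in as hypothetical booleans and the function names are ours.

```python
def ratio(verdicts: list[bool]) -> float:
    """Share of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def faithfulness(claim_supported_by_context: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    return ratio(claim_supported_by_context)


def context_recall(truth_sentence_found_in_context: list[bool]) -> float:
    """Fraction of ground-truth sentences attributable to the retrieved context."""
    return ratio(truth_sentence_found_in_context)


# Hypothetical verdicts: the answer makes three claims, two backed by the context;
# both sentences of the perfect answer can be found in the context.
f = faithfulness([True, True, False])
r = context_recall([True, True])
print(f"faithfulness={f:.2f}, context_recall={r:.2f}")
```

The hard part is not the arithmetic but producing reliable verdicts, which is why the judge's output was reviewed by a human.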
Our work
Having everything in place and set up, the Tooploox team started the tests of the RAG-backed LLMs in the production environment. The test followed the pattern below:
- One of the prepared questions was given to the system consisting of the RAG and LLM of choice.
- The system delivered the context and the answer.
- The ChatGPT4-based system delivered the score, which was later reviewed by a data scientist.
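The test pattern above can be sketched as a loop. This is an illustrative sketch only: `run_evaluation`, `fake_rag`, and `fake_judge` are hypothetical names, the metric keys are our own labels, and the real system and judge would be calls to the RAG pipeline and to the ChatGPT4-based scorer.

```python
from typing import Callable

METRICS = [
    "faithfulness", "answer_relevance", "context_precision", "context_relevance",
    "context_recall", "answer_similarity", "answer_correctness",
]


def run_evaluation(
    questions: list[dict],
    rag_system: Callable[[str], tuple[list[str], str]],
    judge: Callable[[dict, list[str], str], dict[str, float]],
) -> dict[str, list[float]]:
    """Ask every question, score each response, and keep the scores per metric."""
    scores: dict[str, list[float]] = {m: [] for m in METRICS}
    for item in questions:
        context, answer = rag_system(item["question"])  # question in, context and answer out
        verdict = judge(item, context, answer)           # judge scores the response
        for metric in METRICS:
            scores[metric].append(verdict[metric])
    return scores  # a human reviews these scores afterwards (not modeled here)


# Placeholders so the sketch runs without any model behind it.
def fake_rag(question: str) -> tuple[list[str], str]:
    return (["some retrieved chunk"], f"answer to: {question}")


def fake_judge(item: dict, context: list[str], answer: str) -> dict[str, float]:
    return {metric: 1.0 for metric in METRICS}


questions = [
    {"question": "Q1", "perfect_answer": "A1"},
    {"question": "Q2", "perfect_answer": "A2"},
]
scores = run_evaluation(questions, fake_rag, fake_judge)
print({metric: len(values) for metric, values in scores.items()})
```

Swapping the LLM under test only means passing a different `rag_system`; the questions and the judge stay fixed, which keeps the comparison fair.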
To avoid losing context and precision, the team decided to keep all the metrics separate and show them on separate graphs. Accumulating or combining them in any way could have blurred the picture and misled the reader.
Example:
Question: what is the number of Python programmers working in Tooploox
Answer: There are X C# programmers in Tooploox (derived from the context relevant for C# programmers)
In the example above, the system scores max in context precision, relevancy, and recall. It also scores high in semantic similarity. Yet the answer is useless to the user. If the metrics were aggregated, the overall median score would be heavily misleading.
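A quick calculation shows how aggregation hides the problem. The scores below are invented to mirror the C# example; they are not measured values.

```python
from statistics import mean, median

# Illustrative scores for the C# answer above; the numbers are invented.
scores = {
    "context_precision": 1.0,
    "context_relevance": 1.0,
    "context_recall": 1.0,
    "answer_similarity": 0.9,
    "answer_correctness": 0.0,
}

overall_median = median(scores.values())
overall_mean = mean(scores.values())
worst_metric = min(scores, key=scores.get)

print(f"median={overall_median}, mean={overall_mean:.2f}, worst={worst_metric}")
```

With these numbers the median is a perfect 1.0 and the mean is about 0.78, while answer correctness, the one metric the user cares about here, sits at zero.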
Results
All graphs and results are available below, separated by the type of metric. The “Danswer” mentioned in the legend is the type of RAG used to sort the documents:
Answer Correctness
This metric is about gauging the accuracy of the answer when compared to ground truth. This metric may be a game changer when it comes to evaluating models that generally perform well but tend to confuse single words (for example, “do” and “don’t” make a great difference).
The graph above shows that answer correctness varies by model, with GPT4 Turbo being one of the most unpredictable models – the two most probable outcomes are either highly correct or incorrect answers, with a lower probability of getting a less-but-still-correct answer.
GPT4 delivers the most correct answers: its chance of delivering a highly correct answer is slightly lower than that of GPT4 Turbo, but so is its probability of delivering an incorrect answer.
The performance of Mistral, Llama13B, and GPT3.5 is fully comparable, with the highest probability of delivering slightly incorrect answers. Among these three, Mistral shows significantly better performance, with the highest chance of reaching the performance level of GPT4.
Faithfulness
Faithfulness is the factual consistency of the answer against the given context. The answer is considered faithful if the source material backs it up.
There were no significant differences in the faithfulness distribution, with Mistral performing best in this comparison and GPT4 showing an above-average probability of delivering “unfaithful” information.
Answer relevancy
This metric measures how well the LLM follows the instructions in the prompt. Both incomplete and redundant information reduce the score, so this metric can be considered a measure of how well the answer addresses the question.
Llama2:13B showed significantly better performance than the rest of the technologies. Interestingly enough, there were no significant differences between the rest of the models, including the almighty GPT4 and open-source alternatives. Also, GPT3.5 actually performed better than GPT4.
Answer similarity
With semantic similarity, the tool measures if the delivered answer resembles an ideal answer. This metric also produced fairly comparable results, with a slight preference for GPT3.5. The second best-performing model was Llama13B.
Surprisingly, GPT4 scored worse than one might expect. On the other hand, this particular model provides the user with the most sophisticated answers, and by that, it has fallen prey to overcomplicating its output.
Context precision
This metric measures how relevant the context prepared by the tool is. Basically, with this metric, one can check if the right and relevant context was picked for a particular answer.
The research has shown that it is more probable to get irrelevant context than relevant, with the probability up to two times higher. GPT4 Turbo shows the least precision, while Mistral slightly outperforms the rest of the models.
Interestingly, this metric says more about the RAG used than about the LLM itself, so differences between the models shouldn’t be visible. On the other hand, this metric may show how the LLM can make a difference when the RAG underperforms.
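One common way to make context precision rank-aware, and the formulation we understand Ragas to use, is to average precision@k over the ranks that hold a relevant chunk. The sketch below assumes the relevance verdicts are already available as booleans (in practice a judge produces them); the function name and the example rankings are ours.

```python
def context_precision_at_k(relevant: list[bool]) -> float:
    """Average of precision@k over the ranks that hold a relevant chunk.

    `relevant[i]` says whether the chunk at rank i + 1 is relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this rank
    return total / hits if hits else 0.0


# Same two relevant chunks, ranked first versus ranked last (invented example).
good_ranking = context_precision_at_k([True, True, False, False])
poor_ranking = context_precision_at_k([False, False, True, True])
print(f"good={good_ranking:.2f}, poor={poor_ranking:.2f}")
```

The same set of chunks scores very differently depending on where the relevant ones land, which is why this metric reflects the retrieval and ranking step far more than the LLM.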
Context relevance
This metric calculates the relevance of the context regarding the question. As mentioned above, this metric is more applicable to RAG than LLM.
The experiment has shown low context relevance when using AI models with RAG. Also, there is a slight yet visible performance gap between the open-source and commercial models, in favor of the open-source ones.
Context recall
With this metric, we calculated how well the retrieved context covers the information needed for the answer and how much it can be considered a reliable and truthful source. This metric is more RAG-related.
Context recall is another metric where one couldn’t see a significant difference in model performance. All models, commercial and open source alike, scored nearly equally, reaching the maximum score.
Conclusions
- The test has shown that there is a huge variety in model performance depending on the metric and the tested aspect. This is by far the best answer to the question about “the best LLM available” – it depends on what one is going to use it for.
- Despite the enthusiasm and buzz around Large Language Models, there is a high probability of getting an incorrect or biased answer, as the tests have shown. GPT4 shows the greatest probability of providing the user with a correct answer. On the other hand, in all cases, there were correct answers available in the dataset – the models failed to spot them in the context and source material.
- Despite its problems, GPT4 outperformed the rest of the models. Also, it was the only model that actually said that “it doesn’t know” at any time. The rest of the models hallucinated answers if incapable of providing users with a correct answer.