The current state of Generative AI and LLMs

In preparation for my upcoming session at Experts Live Denmark, I wanted to write a blog post showing the overall state of generative AI and LLMs. There is a lot of development happening in this space right now, so I wanted to give a current overview of the different models, support for multi-modality and integrations, and give a glimpse into what is going on in terms of autonomous agents.

Note that this is my personal reflection, based upon my own experience and the knowledge I have built up over the last few years.

What is Generative AI?

Let’s start with the basics: what is Generative AI? At its core it is the ability to generate new content based upon patterns in an existing dataset and upon instructions. When we enter a prompt into ChatGPT and ask it a question, such as “why did the chicken cross the…..”, it will try to calculate the most likely next word in that specific context.

While one part of the architecture is to calculate the most likely next word, it also needs to understand the semantics, or underlying meaning, of the sentence, and it has to repeat this calculation for each word it generates.
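As a simplified illustration of that next-word calculation (this is only the final sampling step, not how a full transformer works, and the words and scores are made up for the example), the model produces a score for every candidate word and turns those scores into probabilities:

import numpy as np

# Toy vocabulary and made-up scores (logits) for the next word after
# "why did the chicken cross the" - illustrative numbers only.
vocab = ["road", "street", "river", "banana"]
logits = np.array([6.0, 3.5, 1.0, -2.0])

# Softmax turns the scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")

# Greedy decoding simply picks the most likely next word.
print("next word:", vocab[int(np.argmax(probs))])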

While the technology has been around for some time, it only became available to a larger audience with the release of GitHub Copilot (June 2022) and ChatGPT (30/11/2022).

When ChatGPT was released, it used a large language model that could only generate text-based output. Now we are no longer restricted to just text; we can also generate other output such as pictures, speech, code, music and even video, moving to a new approach called multi-modal models.

It is not just the outputs that have changed, but also the inputs. We have now gotten to the point where we can feed other input to the models, such as with GPT-V (V = Vision), which can describe what it sees in a picture, and with Google Gemini 1.5, where we can add a video and ask the model to describe what happens in it. I will get back to that later.

The language models are also becoming bigger.

  • GPT 2: Consisted of 1.5 billion parameters.
  • GPT 3: Consisted of 175 billion parameters.
  • GPT 4: Speculated to be 1.7 trillion parameters.

The larger the language model, the more processing power it requires. Hence, when GPT-4 was released in ChatGPT, it had a much higher latency compared to GPT-3.5. Since the GPT-4 model is speculated to be much larger, it requires more compute power for each response because of its architecture.

OpenAI also provides a range of other models and features on their platform.

  • Codex (a model trained on application code, which was the core of GitHub Copilot; OpenAI has since stated that Copilot has moved to GPT-4)
  • Whisper (speech-to-text), usually used for transcribing audio into text.
  • DALL-E (text-to-image)
  • GPT-V (image-to-text), the ability to upload a picture and ask the model to describe its content.

As seen here, with an example of using GPT-V to detect whether pictures contain a blueberry muffin or a Chihuahua.

Another thing that has changed with the newer models is the context window, meaning the ability to handle more input tokens (how much information you can feed into the model). GPT-3.5 had an upper limit of about 2,500 words, with GPT-4 the limit was raised to roughly 12,000 words, and with the latest GPT-4 Turbo model the limit was once again raised to about 100,000 words.
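Note that the limits are actually defined in tokens rather than words. A minimal sketch of counting tokens with OpenAI's tiktoken library (assuming it is installed with pip install tiktoken):

import tiktoken

text = "Why did the chicken cross the road?"

# Fetch the tokenizer used by the GPT-4 family of models.
encoding = tiktoken.encoding_for_model("gpt-4")

tokens = encoding.encode(text)
print(f"{len(tokens)} tokens for {len(text.split())} words")
# A rough rule of thumb is that one English word is around 1.3 tokens.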

OpenAI also launched different versions of the GPT models, where some were aimed at chat scenarios while the instruct models were aimed at non-chat use. The GPT language models also got support for a new feature called GPT Functions, which allows the LLM to generate a structured output that can be used to trigger a function.

The newer models also support parallel function calling, useful when the model for instance needs to collect weather information for three locations at the same time. Under the hood, function definitions are injected into the system message in a syntax the model has been trained on, which means they count against the model’s context limit and are billed as input tokens.
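To show roughly what this looks like in practice, here is a minimal sketch using the OpenAI Python SDK; get_weather is a hypothetical function of my own, and the model name is just an example:

from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable.

# Describe our own (hypothetical) function so the model knows when and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What is the weather in Oslo, Bergen and Tromsø?"}],
    tools=tools,
)

# With parallel function calling the model can return several tool calls in one response,
# one per city, which our application then executes and feeds back to the model.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)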

This has since evolved into GPT Assistants (and the GPT Builder), where we now have a “virtual agent” with a predefined set of instructions on how it should behave, and a set of predefined functions available to it which it can use to perform an action or get data from another source.
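A minimal sketch of what that looks like with the Assistants API in the OpenAI Python SDK (the API was in beta at the time of writing, so the exact calls may change; the instructions and question are just placeholders):

import time
from openai import OpenAI

client = OpenAI()

# Create an assistant with predefined instructions and a built-in tool.
assistant = client.beta.assistants.create(
    model="gpt-4-turbo",
    name="Report helper",
    instructions="You answer questions briefly and always cite the source document.",
    tools=[{"type": "code_interpreter"}],
)

# Conversations with an assistant happen in threads.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Summarize what a context window is."
)

# Start a run and poll until the assistant has produced an answer.
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

for message in client.beta.threads.messages.list(thread_id=thread.id):
    print(message.role, message.content[0].text.value)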

Generative AI Timeline and Evolution

There has been a lot of development over the last couple of years; here is a timeline of some of the most important releases of models and new features. One thing we also saw after the release of ChatGPT is that all the major cloud providers were racing to build the most complete ecosystem of models, developer tools, infrastructure, and search components.

2021-06-11: GPT-3 Beta
2022-03-15: GPT-3 and Codex Public API
2022-06-21: GitHub Copilot released
2022-11-30: ChatGPT released
2023-01-16: Azure OpenAI released
2023-02-06: Google Bard announced
2023-02-07: Bing Search with GPT released
2023-02-24: Meta introduced LLaMa
2023-03-01: OpenAI releases the Whisper API
2023-03-14: ChatGPT with GPT-4 released
2023-03-15: Midjourney v5 released
2023-03-16: Microsoft 365 Copilot announced
2023-03-21: GPT-4 in Azure OpenAI, Google Bard released
2023-03-23: ChatGPT-plugins announced
2023-03-24: Dolly v1 launched
2023-03-30: BloombergGPT announced
2023-04-12: Dolly 2.0 launched
2023-04-14: AWS CodeWhisperer
2023-04-17: LLaVA released
2023-05-10: Microsoft 365 Copilot – Extended Preview
2023-05-15: OpenAI Plugins in ChatGPT
2023-06-13: ChatGPT Functions released
2023-07-06: ChatGPT v4 API released
2023-07-18: LLaMa2 released
2023-10-16: Fine-tuning available Azure OpenAI
2023-11-01: Microsoft Copilot released
2023-11-21: Anthropic with support for 200k tokens
2023-11-29: AWS releases Q and Bedrock Agents
2023-12-05: AI Alliance announced
2023-12-11: Mixtral 8x7B released from Mistral
2023-12-12: Phi-2 from Microsoft released
2024-02-15: Google releases Gemini 1.5
2024-02-21: Google releases Gemma
2024-02-28: NVIDIA and ServiceNow release StarCoder2

The rise of Open-Source Models

Many of the cloud providers were early to offer LLMs as a managed service: Microsoft has an exclusive partnership with OpenAI, Google had their own model called Bard (since renamed to Gemini), and AWS had only Titan before they went into a partnership with Anthropic and now also provide Claude.

Last year, the AI Alliance was also announced, which aims at building open source-based models.
We also saw more and more open source-based language models, which allow us to run them from any location (given that we have the hardware…)

  • LLaMa from Meta (v1 and later v2)
  • Mixtral from Mistral
  • Gemma from Google
  • Phi and Orca from Microsoft
  • Neural Chat from Intel
  • StarCoder2 from NVIDIA and ServiceNow
  • In addition, many different versions of these base models are available on Hugging Face.

A good indication of which open-source models are currently trending can be found here: Open LLM Leaderboard – a Hugging Face Space by HuggingFaceH4

We also saw new language models released that were trained for specific use cases, such as Code Llama, which was trained on application source code. Meta also released a version aimed only at Python code, giving it a much deeper ability to understand and generate correct Python-based output.

Another thing we saw last year was the use of Mixture of Experts (MoE), where a model consists of multiple smaller expert networks and uses a form of dynamic routing based upon the task, which lowers latency and can even improve accuracy.
In traditional LLMs, all tasks are processed by a single neural network, which is good for general-purpose handling, but for more complex problems it gets hard for one general model to handle everything. Mistral released Mixtral 8x7B, which is a mixture-of-experts network consisting of 8 expert models, and it outperforms many of the larger models, including LLaMa 2 and GPT-3.5. Mixtral is not just better in raw output quality but also in inference speed, which is about six times faster.
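As a toy illustration of the routing idea (a deliberately simplified sketch; real MoE models like Mixtral route per token inside each transformer layer and activate only the top-2 experts):

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_experts, top_k = 16, 8, 2

# Each "expert" is just a small weight matrix in this toy example.
experts = [rng.standard_normal((hidden_dim, hidden_dim)) for _ in range(num_experts)]
# The gating network scores how relevant each expert is for a given input.
gate = rng.standard_normal((hidden_dim, num_experts))

def moe_layer(x):
    scores = x @ gate
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Only the top-k experts are evaluated, which is what keeps inference cheap.
    chosen = np.argsort(weights)[-top_k:]
    return sum(weights[i] * (x @ experts[i]) for i in chosen)

token = rng.standard_normal(hidden_dim)
print(moe_layer(token).shape)  # (16,) - same shape as the input token representation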
NOTE: If you want a better understanding of the different benchmark tests and what they evaluate, you can read more about it here: Benchmarking LLMs and what is the best LLM? – msandbu.org

In addition to all these new open-source language models and new architectures, we now also have a wide selection of tools that we can use to run LLMs on our own machines, such as:

  • Ollama
  • LocalAI
  • LocalGPT
  • PrivateGPT
  • h2oGPT
  • Llama.cpp
  • Quivr
  • GPT4All
  • GPT4free
  • Lmstudio.AI
  • Windows AI Studio
  • NVIDIA AI Workbench

While many of these can run without a high-end GPU, with most of the work offloaded to CPU/RAM, they will work a lot better if you have a modern GPU card.
I even have a blog post on how to use Ollama, which can run both these open-source models and models using LoRA adapters or custom fine-tuned models packaged as GGUF files: Running an LLM on Windows or Mac using Ollama – msandbu.org
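As a quick example, once Ollama is installed and a model has been pulled (for instance with ollama pull llama2), it exposes a local REST API that can be called from any language; a minimal Python sketch:

import json
import requests

# Ollama listens on localhost:11434 by default once the service is running.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Explain what a context window is in one sentence."},
    stream=True,
)

# The API streams the answer back as one JSON object per line.
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)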

Integration frameworks

With all these different language models, and with OpenAI providing an API around their own, many were looking for a way to integrate a language model with their own data (without needing to paste data directly into a prompt), but also to handle the logic for how it should behave and to build their own custom applications around it.

This is where we have seen the rise of integration frameworks such as:

  • Langchain (mostly Python-based) has a wide range of different integrations.
  • Microsoft Semantic Kernel (mostly .NET-based) has tighter integration with the Microsoft ecosystem.
  • LlamaIndex
  • Google Vertex AI Extensions (Currently in private preview)

Most of the integration frameworks provide predefined methods to integrate language models with data, memory and context, and to chain prompts together, as seen in the example below with LlamaIndex.
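A minimal sketch with LlamaIndex (based on the current llama-index package layout; the ./data folder and the question are just placeholders, and an OpenAI API key is assumed since that is the default backend):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local documents (PDF, Word, text, etc.) from a folder.
documents = SimpleDirectoryReader("./data").load_data()

# Build an in-memory vector index over the documents; by default this uses
# OpenAI for embeddings and generation, so OPENAI_API_KEY must be set.
index = VectorStoreIndex.from_documents(documents)

# Ask a question - the framework handles retrieval, prompt construction and the LLM call.
query_engine = index.as_query_engine()
print(query_engine.query("What does our travel policy say about flights?"))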

We also have several Python-based web frameworks, such as Streamlit, Chainlit and Databutton, that can easily be used on top of workflows defined in frameworks like Langchain. These integration frameworks are also often used together with vector stores and search mechanisms to build RAG (Retrieval Augmented Generation) applications.

The rise of RAG

While many like to use language models for creative work, in many cases we would like to use an LLM together with our own data, whether it is for answering questions, helping us rewrite content, creating a summary and so on.

This is where RAG (Retrieval Augmented Generation) comes in, since it allows us to use search to find the most relevant data and provide it to our LLM.

In most organizations there are thousands upon thousands of documents and files stored in various formats and data sources (mostly unstructured). While we could copy and paste content from documents into a prompt (or via the API), we would quickly reach the limit on how many tokens (how much content) the language model supports, and fine-tuning or training our own LLM on this data would not scale and would not give the context we want. Hence the best approach is to use RAG, which, as the name implies, uses a search mechanism to find relevant content and feeds that data to the LLM to generate output.

For instance, Microsoft 365 Copilot is one of the services that uses this approach to make data available to the LLM (which in that case is Azure OpenAI). All data stored in Microsoft 365 is indexed and stored in a vector database called the Semantic Index. When we ask Copilot to do something, it will call the Search API in Microsoft 365 (the Graph Search API) to find the relevant information from the vector store.

These vector databases do not store all the information in the document, just vectors that represent pieces of the content.

For queries, the application asks the embedding model for a representation (an embedding) of just that query. That embedding is then passed to the vector database, which returns the most similar embeddings, ones that have already been run through the model. Those embeddings can then be mapped back to their original content, whether that is a URL for a page or a link to an image.
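A minimal sketch of that flow, using OpenAI embeddings together with FAISS (listed below) as a local vector store; the documents, question and model names are just examples:

import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

documents = [
    "Employees can work remotely up to three days per week.",
    "Travel expenses must be approved by a manager before booking.",
    "The office is closed during the last week of July.",
]

def embed(texts):
    # Turn text into vectors using an OpenAI embedding model.
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data], dtype="float32")

# Index the document embeddings in a local FAISS vector store.
doc_vectors = embed(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# Embed the question, find the most similar chunk, and hand it to the LLM as context.
question = "How many days can I work from home?"
_, ids = index.search(embed([question]), k=1)
context = documents[ids[0][0]]

answer = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": f"Answer using this context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)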

Vector databases are not your typical database and therefore require a separate component to store this data. Fortunately, there are numerous vector database alternatives out there, such as:

  • Azure AI Search
  • FAISS (Local vector store)
  • MongoDB Atlas Vector Search
  • Vespa Search
  • Pinecone
  • OpenSearch
  • PostgreSQL with PGVector
  • Chroma
  • Weaviate


Another issue is building this at enterprise scale, where you have thousands upon thousands of documents that you want to keep available and up to date (for instance when a document or its data source changes).
A second issue is the chunking of the data that is ingested into the vector database (which Rob Kerr describes quite well here: Chunking Text to Vector Embeddings in Generative AI Solutions (robkerr.ai)): you might have content overlap in the chunks of text that are embedded, or a chunk might be embedded with too little surrounding context.
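To make the chunking problem concrete, here is a simple sketch of a fixed-size chunker with overlap (real pipelines usually split on sentences or document structure, so this is only an illustration):

def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into fixed-size character chunks that overlap slightly,
    so sentences at a boundary appear in both neighbouring chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "..."  # the raw text extracted from a document
for i, chunk in enumerate(chunk_text(document)):
    # Each chunk would then be embedded and stored in the vector database.
    print(i, len(chunk))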

Multimodality and extending the LLMs

Language models are now being released more frequently, whether it is new open-source models such as Grok from X, an LLM consisting of 314 billion parameters using MoE (Mixture of Experts) (xai-org/grok-1: Grok open release (github.com)), or new closed-source models such as Claude 3 (Introducing the next generation of Claude \ Anthropic). Their capabilities and knowledge are constantly being extended: better language support, multimodality (the ability to handle different media such as text, pictures, audio, speech and video), reduced hallucination and increased context windows.

Another issue with larger context windows is accuracy: when too much information is put into the context window, many LLMs lose accuracy even though they technically support the larger size.

We now also have models that can generate video output, such as OpenAI Sora, Runway and Open-Sora (an open-source alternative).

We also have LLMs that can handle video as input and describe its content, such as Google Gemini, which I described earlier. For other scenarios we have services like OpenAI Whisper (speech-to-text), OpenAI TTS (text-to-speech) and ElevenLabs, which has a pretty impressive set of features for speech synthesis and other text-to-voice capabilities.

With all the improvements happening to the models in terms of size, accuracy and better understanding of the content fed into the prompt, combined with new agent frameworks such as AutoGen, we will see a new wave of virtual assistants powered by generative AI, with the ability to understand different languages and different types of media, and to converse using natural language.

We will also see more generative AI used in business, whether through personal assistants such as Copilot, Duet AI or Amazon Q, or in business processes through custom applications built on platforms such as Azure OpenAI, AWS Bedrock or Google Vertex AI.

We will also see generative AI models embedded into a multitude of different devices, such as our mobile phones, computers and other smaller devices. This will open up other use cases such as IoT, where we can now easily analyze content without needing to spend much time training custom models to reach acceptable accuracy.

We will also see a wave of new open-source models and commercial offerings that allow us to run generative AI models on our own infrastructure at scale, which can be used for a wide range of use cases, including offline coding assistants.
