The release of ChatGPT put generative AI at regular end-users' fingertips, which set off a domino effect of new generative AI services in the market. All the major cloud providers are now doubling down on building generative AI capabilities.
Microsoft has focused on its Copilot offerings, Google on Duet and Gemini, and AWS on Bedrock and Q, all of which were released as new cloud offerings in the last year. However, there are still many scenarios and use cases where you might need privately hosted generative AI services, meaning generative AI services running within your own datacenter.
While the services from OpenAI, Google, Microsoft, and AWS are only available as cloud services, what can we host in our own datacenter? I did some research and looked into some of the alternatives and offerings available. One thing to be aware of: there are very few commercial "enterprise" offerings that cover the entire generative AI stack the way ChatGPT or Copilot do. If you want the full experience, you need to build something that consists of different components to provide something similar.
Can we actually get access to generative AI and large language models that are on the same level as ChatGPT and OpenAI? The last couple of months have been pretty interesting to follow. The benchmark below looks at LLaMA 2 from Meta, GPT-3.5 (the original ChatGPT) and Mixtral from Mistral.
Both LLaMA 2 and Mixtral are open models, meaning that you can host them anywhere. Based upon the benchmark you can also see that Mixtral scores better than the other two. Mixtral 8x7B was released on December 11th and is a mixture-of-experts model built from eight 7-billion-parameter experts (roughly 47 billion parameters in total, with about 13 billion active per token). The day after, Microsoft launched a new and even smaller LLM called Phi-2 (consisting of 2.7 billion parameters), which has also shown decent performance compared to much bigger language models.
It is important to note that to host a large language model, a rule of thumb is that you need roughly twice as many gigabytes of GPU memory as the model has billions of parameters, since fp16 weights take two bytes per parameter. So for instance, if I wanted to host the Mistral 7B language model on self-hosted infrastructure, I would need at least 14 GB of video memory.
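The rule of thumb above can be sketched as a back-of-the-envelope calculation. Note that this only covers the model weights; real-world usage adds overhead for the KV cache and activations, so treat the result as a lower bound:

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough GPU memory needed just to load the model weights.

    bytes_per_param: 2.0 for fp16/bf16, 4.0 for fp32,
    roughly 0.5-1.0 for 4/8-bit quantized weights.
    """
    return params_billions * bytes_per_param

# Mistral 7B in fp16: roughly 14 GB just for the weights
print(estimate_vram_gb(7))       # → 14.0
# The same model quantized to 4 bits fits in roughly 3.5 GB
print(estimate_vram_gb(7, 0.5))  # → 3.5
```

This is also why quantized variants are popular for self-hosting: halving the bytes per parameter halves the required video memory.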
The last few months have seen rapid development of new open-source LLMs that compete with the bigger closed-source LLMs such as ChatGPT. However, the language models themselves are only one part of the picture. There are other models as well that are pretrained on application code, images, video, and even audio, so they can for instance understand spoken language and translate it in real time.
So let us go into some of the different offerings. While some are focused on generative AI, others are regular ML-based services used for instance for object detection, or frameworks that can be used to host other services.
NOTE: There might be several other components, libraries, and models which I have forgotten to add. If you spot anything obviously missing, please let me know!
Large Language models (Open-source)
- OpenChat (imoneoi/openchat: OpenChat: Advancing Open-source Language Models with Imperfect Data (github.com))
- Intel Neural Chat (Intel/neural-chat-7b-v1-1 · Hugging Face)
You even have models now that support function calling as well (Trelis/Llama-2-7b-chat-hf-function-calling-v2 · Hugging Face)
Another way to keep track of new language models is the Open LLM benchmark, so I recommend that you take a look at it (Open LLM Leaderboard – a Hugging Face Space by HuggingFaceH4)
Running LLMs on local machines
If you want to run LLMs locally on your machine, it of course depends on what kind of hardware specs and OS you are running, but here is a list of some of the most common tools:
- Ollama (for macOS and Linux)
- Windows AI Studio (for Windows)
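Once a tool like Ollama is running, it exposes a local REST API that other applications can call. As a sketch, here is how a request to Ollama's `/api/generate` endpoint can be built in Python (assuming an Ollama server on its default port 11434 with the `mistral` model already pulled via `ollama pull mistral`; the actual network call is commented out):

```python
import json
from urllib import request

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "mistral",
    "prompt": "Explain what a large language model is in one sentence.",
    "stream": False,  # return one JSON response instead of a token stream
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment when an Ollama server is running locally:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

This is what makes the local tools interesting beyond a chat window: anything that can issue an HTTP request can use the self-hosted model.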
Code and Code assistants
While many tend to look at GitHub Copilot or similar options, there are also other options that can be used to provide a code assistant backed by self-hosted language models:
- Code Llama – Python
- TabbyML (Self-hosted on own machine)
- WizardCoder (only Python)
- StarCoder and StarCoderBase
- Continue (VS Code extension)
- SafeCoder (Code Assistant from HuggingFace)
For instance, one way to provide a self-hosted code assistant integrated into VS Code is to use TGI (Text Generation Inference, for hosting a language model such as Code Llama) together with the llm-vscode extension from Hugging Face (llm-vscode – Visual Studio Marketplace)
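TGI exposes a simple `/generate` endpoint that the llm-vscode extension (or any other client) talks to. As a sketch, here is what a completion request looks like, assuming a TGI instance serving a code model on localhost port 8080 (the port is whatever you mapped when starting the container; the network call is commented out):

```python
import json
from urllib import request

# Request body for TGI's /generate endpoint: a prompt plus
# generation parameters such as the number of new tokens.
payload = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 64, "temperature": 0.2},
}

req = request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment with a running TGI instance:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```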
Images (Object detection and Generative AI)
- Stable Diffusion (GenAI library)
- Detectron2 (Object detection)
- DINOv2 (Computer Vision model)
- RCNN ViT (Object detection)
- OpenCLIP (GenAI library)
- OpenCV (Computer vision library)
- LLaVA (GenAI library)
For instance, here is an example with LLaVA that provides the same capability as GPT-Vision: understanding and describing the context of what is going on in a picture (in this case, me entering the front door of my own house).
Audio and Video
- Coqui-TTS (Text-to-speech)
- Whisper (Speech-to-text and translations)
- Meta Seamless Communication (you can try out a demo of it here –> Seamless Expressive Translation Demo (metademolab.com))
- NVIDIA Riva (Speech-to-text)
- Meta Audiobox
- NVIDIA Deepstream
- Stable Video
Private AI tools and platforms
- Kubeflow (ML Platform) (NOTE: many teams are moving over to Flyte or Prefect instead of Kubeflow)
- Flyte and Prefect (ML Platform)
- NVIDIA AI Enterprise (Preview) (ML Platform)
- PyTorch and NVIDIA NeMo (ML framework)
- Ray and KubeRay (ML Platform)
- Dify.ai (LLM Platform)
- vLLM (LLM Platform)
- HPE Ezmeral Unified Analytics (ML Platform)
- VMware Private AI (Preview) (SDDC and GenAI Platform), which also builds on NVIDIA AI Enterprise
- AnyScale (LLM Platform)
- LAMINI (LLM Platform)
- LM Studio (LLM Platform)
- Cnvrg.io (AI Platform)
While there is a good mix of AI platforms and LLM platforms here, which one fits best depends on your use case.