With the evolution of personal Generative AI and the introduction of Small Language Models (SLMs) that can now be run locally on personal devices, there are numerous options for running Generative AI on your own hardware. Tools like Ollama, vLLM, and LMStudio, along with various models such as Codestral, Phi-3, LLaMA3, and Google Gemma, are paving the way for this capability.
Microsoft has also introduced new hardware featuring AI chips (NPUs) in its Copilot+ machines, optimized for running AI workloads locally. This signals a broader industry shift toward local AI processing, with Apple now following suit through the introduction of Apple Intelligence.
Given the rapid advancements over just the past two years, we are only at the beginning of what is to come in this field. This evolution will likely give rise to new roles within the industry, such as the GenAI developer, the GenAI integration developer, and GenAI power users who enhance their workflows with local AI. I believe this will lead to a new user profile that combines 3D VDI with developer tools.
These new roles will build and tune GenAI services, such as functions or agents, and evaluate new language models to see how they perform before putting them into production for different workloads.
Just to showcase what this can look like, I got access to a VDI on Google Cloud via Dizzion (formerly Frame) to test drive a machine with an NVIDIA L4 GPU. The L4 has 24 GB of VRAM, and a good rule of thumb is that you need roughly 2 GB of vGPU memory per billion parameters just to load a model at 16-bit precision. However, a technique called model quantization can reduce the size of large language models (LLMs) by lowering the precision of their weights, allowing them to run on less powerful hardware with an acceptable reduction in capability and accuracy. This reduces the VRAM requirements and allows, for instance, a 22-billion-parameter LLM like Codestral to run in 12 GB of vGPU memory.
The logic is as follows: 22 billion parameters at 16-bit precision would require roughly 44 GB of vGPU memory, but Ollama distributes Codestral with Q4 quantization (about 4 bits per weight), which brings the footprint down to around 12 GB.
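To make the arithmetic concrete, here is a minimal sketch of that estimate (the real footprint also depends on the quantization format, context length, and runtime overhead, so treat it as a lower bound):

```python
# Rough lower bound on the vGPU memory needed just to load a model,
# based on parameter count and bits per weight. Actual usage also
# includes the KV cache and runtime overhead.
def estimated_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # billions of weights * bytes per weight = GB

# Codestral has 22 billion parameters
print(f"FP16 (16-bit): ~{estimated_vram_gb(22, 16):.0f} GB")  # ~44 GB
print(f"Q4   (4-bit):  ~{estimated_vram_gb(22, 4):.0f} GB")   # ~11 GB, in line with the ~12 GB Ollama reports
```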
To give a sense of what different models cost in terms of vGPU memory when running on Ollama (note that these figures only cover loading the model into memory; you also have to account for the vGPU needed during inference):
- Codestral: 12 GB vGPU
- Gemma: 5.0 GB vGPU
- Phi: 1.6 GB vGPU
- LLaMA3: 4.7 GB vGPU
So what can this look like? Within the VDI I installed Ollama, AnythingLLM, and the Continue extension for Visual Studio Code. Continue provides a GitHub Copilot alternative that uses a local LLM. Ollama serves the different models locally through an API, which AnythingLLM and Continue use for inference.
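Under the hood this is just HTTP; here is a minimal sketch of calling the Ollama API directly, assuming Ollama is running on its default port 11434 and the codestral model has already been pulled:

```python
# Minimal example of calling the local Ollama API directly.
# Assumes Ollama is running locally and `ollama pull codestral` has been done.
import json
import urllib.request

payload = {
    "model": "codestral",
    "prompt": "Write a Terraform resource block for a Google Cloud storage bucket.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["response"])
```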
Drawing created by Claude 3.5 Sonnet with the Artifacts feature
AnythingLLM provides a wide range of features, including a vector store that acts as “memory” and lets you bring in your own data. It also supports Ollama as the backend for LLM chats.
And here are the settings for Continue within Visual Studio Code that let it use Codestral, an LLM specialized in code. Continue also provides a chat-based interface, seen on the left, which can be used to help with Terraform code generation.
Even when no processing is happening, the Codestral model remains loaded into memory through Ollama and occupies a large amount of vGPU memory.
You can also see that the GPU load spikes when inference is actually running, for instance when using Continue, since all of the API inference is done locally on the machine.
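To monitor this programmatically rather than through a GUI, a small sketch using NVIDIA's nvidia-ml-py (pynvml) bindings can poll vGPU memory and utilization; it reports the same numbers as nvidia-smi:

```python
# Poll GPU memory and utilization while a model is loaded / generating.
# Requires the nvidia-ml-py package (imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single L4 in this VDI

try:
    for _ in range(10):
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"vGPU memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB, "
              f"GPU utilization: {util.gpu}%")
        time.sleep(2)
finally:
    pynvml.nvmlShutdown()
```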
I also did a simple benchmark in Python to see how many tokens per second this VDI could generate, using this script: MinhNgyuen/llm-benchmark (github.com). As expected, the smaller the model, the faster the inference.
These models are quite different, and numerous factors affect how many tokens per second an LLM generates, such as the context window, the model itself, and the underlying API. Still, this gives an idea of what is possible when running local LLMs.
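If you just want a quick tokens-per-second figure without the full benchmark script, the response from Ollama's generate endpoint includes eval_count and eval_duration fields (the latter in nanoseconds) that can be used for a rough estimate; a minimal sketch:

```python
# Quick-and-dirty tokens/second estimate against a local Ollama instance.
# Uses the eval_count / eval_duration fields Ollama returns
# (eval_duration is reported in nanoseconds).
import json
import urllib.request

def tokens_per_second(model: str, prompt: str) -> float:
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    return result["eval_count"] / (result["eval_duration"] / 1e9)

# Each model must already be pulled with `ollama pull <name>`
for model in ["codestral", "gemma", "phi", "llama3"]:
    print(f"{model}: {tokens_per_second(model, 'Explain what VDI is.'):.1f} tokens/s")
```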
To build these future services and configure integrations between generative AI and internal IT systems, a significant enhancement in computational infrastructure is necessary. The heart of this upgrade lies in the deployment of high-performance GPUs that are essential for the intensive data processing tasks inherent in AI modeling and simulation. By leveraging VDI, organizations can provide their AI developers with remote access to powerful, GPU-enhanced virtual desktops. This setup not only facilitates the agile development and testing of AI models but also ensures that these resources are scalable and cost-effective.
Utilizing VDI environments equipped with GPUs allows developers to experiment and iterate on their AI models without the constraints of physical hardware limitations. This can significantly reduce the time from concept to deployment, a critical factor in staying competitive in the rapidly evolving AI landscape. Additionally, this approach supports better security practices by centralizing data and processing power in controlled data centers, rather than distributing sensitive information across multiple endpoints.