One of the easiest ways to get started with running LLMs locally on your own machine is Ollama. Ollama is an open-source tool that provides a local LLM inference API you can interact with, as well as a CLI you can use to chat with models in real time. You can also bring in your own models, for instance if you have fine-tuned a base LLM with your own data set.
You can download the tool from the Ollama website, and it comes with a set of predefined models that you can download and use from the model library (ollama.com). Ollama also offers OpenAI API compatibility, just like other platforms, making it possible to use existing tooling built for OpenAI against local models via Ollama.
The way to use it is pretty simple: look at the list of available models and run the CLI command to download the one you want.
ollama pull gemma:7b
Once the model is downloaded, you run the model and its local inference API using the command
ollama run gemma:7b
Once you run it, you get an interactive chat interface directly in the CLI.
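The local inference API mentioned above listens on port 11434. As a minimal sketch (assuming the gemma:7b model pulled above is available), you can call Ollama's native /api/generate endpoint from Python:

import requests

# Minimal sketch: call the local Ollama inference API directly,
# assuming the gemma:7b model pulled earlier is available
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma:7b", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])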
If you plan to use the OpenAI API spec, you need to have a model such as Mistral or Llama 2 running. You can then use this curl command to query it directly:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
You can also use the OpenAI Python library. To point it at Ollama, you just need to change the base URL and supply a placeholder API key:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)
print(response.choices[0].message.content)
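The same client can also stream the reply token by token instead of waiting for the full response; a minimal sketch, assuming the same local endpoint and model:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

# Ask for a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Write a short poem about local LLMs."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()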
Ollama can also handle GGUF. GGUF is a binary file format for storing models for inference with GGML and GGML-based executors, designed for fast loading and saving of models and for ease of reading. For instance, you can import GGUF models using a Modelfile: create a file named Modelfile with a FROM instruction pointing to the local filepath of the model you want to import, and then create the model in Ollama.
For instance, here I have downloaded a fine-tuned Mistral model called NorskGPT from Hugging Face (bineric/NorskGPT-Mistral-7b). If I want to run it in Ollama, I just create a Modelfile like this:
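The Modelfile only needs a FROM instruction pointing at the downloaded GGUF file (the filename below is just an example; use the actual name of the file you downloaded):

# Example filename; point FROM at the GGUF file you downloaded
FROM ./norskgpt-mistral-7b.gguf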
Then run the command to import the model into Ollama:
ollama create example -f Modelfile
Then I can run a prompt directly against the LLM. Sorry, the screenshot below is in Norwegian, but the command I am using is ollama run example:latest "PROMPT HERE", followed by Enter.
This will then generate the output shown here.
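Once imported, the model can also be queried through the same OpenAI-compatible API shown earlier; a minimal sketch, assuming the model was created with the name example as above:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

# Query the imported model by the name given to `ollama create`
response = client.chat.completions.create(
    model="example",
    messages=[{"role": "user", "content": "Hva heter hovedstaden i Norge?"}],
)
print(response.choices[0].message.content)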