TL;DR
The Ollama Helm chart (version 1.5.0+) now supports creating custom LLM configurations via Modelfiles during deployment. Here is an example deployment of Llama 3.1 with a 16384-token context size:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    create:
      - name: llama3.1-ctx16384
        template: |
          FROM llama3.1:8b
          PARAMETER num_ctx 16384
    run:
      - llama3.1-ctx16384
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
Long read
The Ollama Helm chart provides a convenient way to deploy self-hosted Large Language Models (LLMs) on Kubernetes. You can deploy models from the Ollama model library using the pull and run settings:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    run:
      - llama3.1:8b
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
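To verify the deployment, you can port-forward the Ollama service and send a test request from a second terminal. This is a quick sanity check, assuming the chart creates a service named ollama listening on port 11434 (the chart's defaults for this release name):
$ kubectl -n ollama port-forward svc/ollama 11434:11434
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'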
Ollama also supports models from Hugging Face:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
    run:
      - hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
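Models pulled from Hugging Face show up in the local model list just like library models. With the same port-forward as above, you can confirm the pull via the tags endpoint:
$ curl http://localhost:11434/api/tags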
In practice, we usually need to change the parameters of deployed models, but the Ollama Helm chart did not support configuring models at deployment time. For example, to change the context size, which is 2048 tokens by default, you previously had to:
- Use the Ollama REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
- Use any Ollama REST API client, such as Open WebUI
- Use any Ollama client library
All the options mentioned above require additional steps after deployment, but we would like to define custom settings during the deployment phase. Fortunately, Ollama supports Modelfiles, which can be used to solve the issue.
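Under the hood this is just a Modelfile plus an ollama create call; a minimal sketch of what such a step looks like (the file name Modelfile and the model name llama3.1-ctx16384 are arbitrary choices here):
# Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 16384

$ ollama create llama3.1-ctx16384 -f Modelfile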
I created an issue in the Ollama Helm chart repository with a proposal for this improvement and implemented the functionality in my personal Helm repository, so you can deploy a custom Ollama model this way:
The example below uses my personal Helm chart at https://olegsmetanin.github.io/helm-charts/; it is not the official Ollama Helm chart!
$ helm repo add olegsmetanin https://olegsmetanin.github.io/helm-charts
$ helm upgrade ollama olegsmetanin/ollama \
--install --create-namespace -n ollama --version 0.1.0 \
--set 'persistentVolume.enabled=true' \
--set 'persistentVolume.size=20Gi' \
--set 'ollama.gpu.enabled=false' \
--set 'ollama.models.pull[0]=llama3.1:8b' \
--set 'ollama.models.create[0].name=llama3.1-ctx16384' \
--set 'ollama.models.create[0].model=FROM llama3.1:8b\\nPARAMETER num_ctx 16384' \
--set 'ollama.models.run[0]=llama3.1-ctx16384' \
--set 'extraEnv[0].name=OLLAMA_KEEP_ALIVE' \
--set 'extraEnv[0].value=24h'
On 2025-02-14 my proposal was implemented by the Ollama Helm chart maintainers in the official repository. Now we can deploy Llama 3.1 with a 16384-token context size like this:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    create:
      - name: llama3.1-ctx16384
        template: |
          FROM llama3.1:8b
          PARAMETER num_ctx 16384
    run:
      - llama3.1-ctx16384
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
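After the rollout, you can confirm that the context size was actually applied to the created model (assuming the chart's default Deployment name ollama); the output should include the num_ctx parameter:
$ kubectl exec -n ollama deploy/ollama -- ollama show llama3.1-ctx16384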
Happy LLM deployment!