TL;DR
The Ollama Helm chart (version 1.5.0+) now supports creating custom LLM configurations via Modelfiles during deployment. Here is an example deployment of Llama 3.1 with a 16384-token context size:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    create:
      - name: llama3.1-ctx16384
        template: |
          FROM llama3.1:8b
          PARAMETER num_ctx 16384
    run:
      - llama3.1-ctx16384
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
Long read
The Ollama Helm chart provides a convenient way to deploy self-hosted Large Language Models (LLMs) on Kubernetes. You can deploy models from the Ollama model library using the pull and run settings:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    run:
      - llama3.1:8b
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
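To verify the deployment, you can port-forward the Ollama service and send a test request from a second terminal. This is a quick sanity check, assuming the chart creates a service named ollama listening on port 11434 (the chart's defaults for this release name):
$ kubectl -n ollama port-forward svc/ollama 11434:11434
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'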
Ollama also supports models from Hugging Face:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
    run:
      - hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
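Models pulled from Hugging Face show up in the local model list just like library models. With the same port-forward as above, you can confirm the pull via the tags endpoint:
$ curl http://localhost:11434/api/tags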
In practice, we usually need to change the parameters of deployed models, but the Ollama Helm chart did not support configuring models at deployment time. For example, to change the context size, which is 2048 tokens by default, you previously had to:
- Use the Ollama REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
- Use any Ollama REST API client, such as Open WebUI
- Use any Ollama client library
All the options mentioned above require additional steps after deployment, but we would like to define custom settings during the deployment phase. Fortunately, Ollama supports Modelfiles, which can be used to solve the issue.
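Under the hood this is just a Modelfile plus an ollama create call; a minimal sketch of what such a step looks like (the file name Modelfile and the model name llama3.1-ctx16384 are arbitrary choices here):
# Modelfile
FROM llama3.1:8b
PARAMETER num_ctx 16384

$ ollama create llama3.1-ctx16384 -f Modelfile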
I created an issue in the Ollama Helm chart repository with a proposal for this improvement and implemented the functionality in my personal Helm repository, so you can deploy a custom Ollama model this way:
The example below uses my personal Helm chart at https://olegsmetanin.github.io/helm-charts/; it is not the official Ollama Helm chart!
$ helm repo add olegsmetanin https://olegsmetanin.github.io/helm-charts
$ helm upgrade ollama olegsmetanin/ollama \
--install --create-namespace -n ollama --version 0.1.0 \
--set 'persistentVolume.enabled=true' \
--set 'persistentVolume.size=20Gi' \
--set 'ollama.gpu.enabled=false' \
--set 'ollama.models.pull[0]=llama3.1:8b' \
--set 'ollama.models.create[0].name=llama3.1-ctx16384' \
--set 'ollama.models.create[0].model=FROM llama3.1:8b\\nPARAMETER num_ctx 16384' \
--set 'ollama.models.run[0]=llama3.1-ctx16384' \
--set 'extraEnv[0].name=OLLAMA_KEEP_ALIVE' \
--set 'extraEnv[0].value=24h'
On 2025-02-14 my proposal was implemented by the Ollama Helm chart maintainers in the official repository. Now we can deploy Llama 3.1 with a 16384-token context size like this:
cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    create:
      - name: llama3.1-ctx16384
        template: |
          FROM llama3.1:8b
          PARAMETER num_ctx 16384
    run:
      - llama3.1-ctx16384
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
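After the rollout, you can confirm that the context size was actually applied to the created model (assuming the chart's default Deployment name ollama); the output should include the num_ctx parameter:
$ kubectl exec -n ollama deploy/ollama -- ollama show llama3.1-ctx16384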
Happy LLM deployment!