Intro
Modern LLMs can be categorized into four levels based on their parameter count and resource requirements:
- Budget-effective LLMs (7-8b parameters): These models, such as qwen2.5-coder:7b (4.7GB), can run on readily available hardware, such as a GPU with more than 8GB of VRAM or even a VM with 32GB of RAM and no dedicated GPU.
- Lower mid-sized LLMs (30b parameters): Such models, for example qwen2.5-coder:32b, deliver better quality compared to budget-oriented models. They can be effectively run on a single GPU with 24GB of VRAM, like an NVIDIA A30.
- Mid-sized LLMs (70b parameters): Models like llama3.3:70b (43GB) offer a balance between performance and resource requirements, but to achieve reasonable processing speeds, these models typically require a single high-performance GPU, such as an H100 with 80GB of VRAM.
- State-of-the-art LLMs (over 200b parameters): These cutting-edge models, exemplified by deepseek-r1:671b (404GB), provide the best quality, but their size requires multiple high-end GPUs like the H200 (e.g., a configuration of 2x8H200 with 512GB of VRAM).
Higher output quality generally correlates with increased costs. Architects must therefore balance quality, cost, and speed when designing solutions. For tasks like text extraction and aggregation, 7-8b parameter models are typically adequate. Coding support benefits from models with 30b+ parameters. Complex reasoning tasks require state-of-the-art models.
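As a rough sizing sketch (a rule of thumb, not an official figure), a 4-bit quantized model takes roughly 0.6GB of disk/VRAM per billion parameters, plus a few extra GB for the KV cache and runtime:
# Assumption: ~0.6 GB per billion parameters for Q4-quantized weights
$ for p in 8 32 70 671; do echo "${p}b params ≈ $(awk "BEGIN{print $p * 0.6}") GB of weights"; done
8b params ≈ 4.8 GB of weights
32b params ≈ 19.2 GB of weights
70b params ≈ 42 GB of weights
671b params ≈ 402.6 GB of weights
This lines up with the model sizes quoted above (4.7GB, 43GB, 404GB).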
The most in-demand self-hosted LLM models at the moment are:
Model Type | Names | Reason |
---|---|---|
Budget-effective | deepseek-r1:8b, llama3.1:8b, qwen2.5:7b, mistral:7b | easy to fine-tune; sufficient for typical agentic usage (text extraction and aggregation) |
Lower mid-sized | deepseek-r1:32b, qwen2.5-coder:32b, gemma2:27b | + suitable for coding and agentic flow code generation |
Mid-sized | deepseek-r1:70b, llama3.3:70b | + better quality for extra budget |
Provider selection
To find cost-effective cloud GPU providers for self-hosted LLMs, I use cloud GPU price comparison services such as GetDeploying and Primeintellect.ai. At first glance, these services offer comprehensive reviews of different GPU vendors and configurations. As of February 2025, the following minimum prices were observed:
Type | vCPUs | Memory | VRAM | Current Market Price |
---|---|---|---|---|
Nvidia A30 | 8 | 31GB | 24GB | $0.22 / hour |
Nvidia A40 | 9 | 50GB | 48GB | $0.39 / hour |
Nvidia H100 | 28 | 180GB | 80GB | $1.90 / hour |
Upon deeper investigation, it became apparent that these comparisons are incomplete: they lacked coverage of GCP Spot H100 instances (starting at approximately $3.92/hour) and neglected to highlight more cost-efficient alternatives.
After a bit of googling, I found Rackspace Spot, which offers spot GPU instances starting at $0.15/hour for A30 GPUs and $1/hour for H100:
Type | Region | vCPUs | Memory | VRAM | Current Market Price |
---|---|---|---|---|---|
Nvidia A30 Virtual Server v2.Extra Large | US West, San Jose, CA | 24 | 128GB | 24GB | $0.15 / hour |
Nvidia H100 Virtual Server v2.Mega Extra Large | US West, San Jose, CA | 48 | 128GB | 80GB | $1.00 / hour |
They lack GPU instances with 40-60GB of VRAM, such as the Nvidia A40. This makes them less suitable for mid-sized distilled LLMs like llama3.3:70b (43GB) or deepseek-r1:70b (43GB); for those models you must resort to H100 instances.
A30-powered cluster provisioning
Rackspace Spot provides managed Kubernetes services in 7 data centers across the world (4x US, UK, AU, HKG), but GPU-powered nodes are available only in US West, San Jose, CA (us-west-sjc-1).
In brief, the managed Kubernetes control plane is free; you pay only for the cluster nodes you order. To order nodes, you define the number of instances and your bid price. The probability of a node being deployed depends on the current market price, your bid, and node availability. More information about node provisioning and the bidding process can be found here.
For the first test we will order one A30-powered node with 24 vCPUs, 128GB of memory, and 24GB of VRAM. It usually takes up to 20 minutes for the node to be deployed.
To access the cluster and deploy applications, we need to install kubectl and helm.
Save the kubeconfig file from the dashboard to a local working directory (e.g. kubeconfig-a30.yaml) and point the KUBECONFIG environment variable to it.
$ export KUBECONFIG=$(pwd)/kubeconfig-a30.yaml
# Test connection
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
prod-instance-XXXXXXXXXXXXX Ready worker 11m v1.29.6
Kubernetes node GPU configuration
Unfortunately, provisioned Rackspace Spot Kubernetes nodes are not ready to serve GPU workloads by default. If you check the node labels, you won’t find any nvidia.com entries:
$ kubectl describe node -A | grep nvidia.com/cuda | uniq
To fix this, you have to install the NVIDIA GPU Operator and the NVIDIA device plugin for Kubernetes. Together they install the NVIDIA drivers and libraries on the Kubernetes nodes via DaemonSets. Installation may take up to 10 minutes. If you skip this step, your cluster will still be fully functional, but GPU capabilities will not be available to the workloads.
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
$ helm upgrade gpu-operator nvidia/gpu-operator \
--install --create-namespace -n gpu-operator --version=v24.9.2 --wait
Wait until the new labels appear:
$ kubectl describe node -A | grep nvidia.com/gpu.present | uniq
nvidia.com/gpu.present=true
Install the NVIDIA device plugin for Kubernetes:
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
$ helm upgrade nvdp nvdp/nvidia-device-plugin \
--install --create-namespace -n nvidia-device-plugin --version 0.17.0 --wait
Wait until the CUDA labels appear:
$ kubectl describe node -A | grep nvidia.com/cuda | uniq
nvidia.com/cuda.driver-version.full=550.144.03
nvidia.com/cuda.driver-version.major=550
nvidia.com/cuda.driver-version.minor=144
nvidia.com/cuda.driver-version.revision=03
nvidia.com/cuda.driver.major=550
nvidia.com/cuda.driver.minor=144
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime-version.full=12.4
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=4
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
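To double-check that the GPU is now exposed as a schedulable resource, you can also look for nvidia.com/gpu in the node’s Capacity and Allocatable sections (a quick verification, assuming a single-GPU node):
$ kubectl describe node | grep "nvidia.com/gpu:" | uniq
  nvidia.com/gpu:     1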
Self-hosted LLM
We will use ollama to run self-hosted LLM models. All available models are listed in the ollama library; there are also options to run models from HuggingFace.
In our case, we will deploy the qwen2.5-coder:32b model, which takes 20GB of disk space:
$ helm repo add ollama-helm https://otwld.github.io/ollama-helm/ && helm repo update
$ cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - qwen2.5-coder:32b
    run:
      - qwen2.5-coder:32b
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
Wait until the container has started:
$ kubectl wait --for=condition=ready pod -n ollama -l app.kubernetes.io/name=ollama
pod/ollama-aaaaaaa-bbbb condition met
# Check logs
$ kubectl logs -n ollama -l app.kubernetes.io/name=ollama --tail=1000
...
# Check if the GPU is found
time=2025-02-15T18:34:48.855Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-15T18:34:49.206Z level=INFO source=types.go:130 msg="inference compute" id=GPU-a3f129d7-5579-ffd5-dcfd-ec4f602c2e66 library=cuda variant=v12 compute=8.0 driver=12.4 name="NVIDIA A30" total="23.5 GiB" available="23.3 GiB"
...
# Downloading of model
time=2025-02-15T18:34:50.363Z level=INFO source=download.go:176 msg="downloading ac3d1ba8aa77 in 20 1 GB part(s)"
...
# Check that qwen2.5-coder:32b fits in VRAM
time=2025-02-15T18:37:16.165Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-ac3d1ba8aa77755dab3806d9024e9c385ea0d5b412d6bdf9157f8a4a7e9fc0d9 gpu=GPU-a3f129d7-5579-ffd5-dcfd-ec4f602c2e66 parallel=4 available=24973148160 required="21.5 GiB"
...
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 32B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.license str = apache-2.0
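As an extra sanity check, you can list the loaded models from inside the pod; ollama ps shows whether the model sits fully in GPU memory (assuming the chart's default deployment name ollama):
$ kubectl exec -n ollama deploy/ollama -- ollama ps
NAME                 ID              SIZE     PROCESSOR    UNTIL
qwen2.5-coder:32b    xxxxxxxxxxxx    21 GB    100% GPU     24 hours from now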
Connect to ollama
We can bind the deployed ollama service to a local port using Kubernetes port-forwarding:
$ kubectl -n ollama port-forward service/ollama 11434:11434
The Ollama API will then be available at http://localhost:11434
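A quick way to verify the connection without any UI is a plain curl request; /api/tags and /api/generate are standard Ollama API endpoints:
# List the models known to the server
$ curl -s http://localhost:11434/api/tags
# Request a short, non-streaming completion
$ curl -s http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "Write a hello world function in Go", "stream": false}'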
Configure Open WebUI
The Open WebUI project will be used as a frontend application for the ollama service. There are multiple ways to install Open WebUI; I prefer to create a Python environment using miniconda and run the application without authentication:
$ conda create -n openwebui python=3.12 -y
$ conda activate openwebui
$ pip install open-webui
$ WEBUI_AUTH=False open-webui serve
Open WebUI will be available at http://localhost:8080. To configure the connection to the ollama API, open http://localhost:8080/admin/settings and create a new connection with the URL http://localhost:11434.
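Alternatively, the connection can be set at startup via the OLLAMA_BASE_URL environment variable supported by Open WebUI, so no manual configuration in the admin settings is needed:
$ WEBUI_AUTH=False OLLAMA_BASE_URL=http://localhost:11434 open-webui serve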
That’s it. We are ready to talk to the LLM.
Simple performance test
Open a new chat in Open WebUI, open the browser DevTools (right-click, Inspect) and switch to the Console tab. Select the qwen2.5-coder:32b model and start a conversation by asking:
How to estimate your speed performance in terms of "tokens per second"?
Ollama will start responding, and you will see a stream of messages in the browser console. The third message from the end contains the generation statistics; the value most interesting to us is response_token/s.
The first question only warms up ollama; ask the same question a second time and record that result.
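If you prefer measuring outside the browser, the same statistics are available from the Ollama API itself: a non-streaming /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds), so tokens per second is eval_count / eval_duration * 1e9. A minimal sketch using curl and jq:
$ curl -s http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:32b", "prompt": "How to estimate your speed performance in terms of \"tokens per second\"?", "stream": false}' | jq '{response_tps: (.eval_count / .eval_duration * 1e9)}'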
Test results
Performance results are shown below. Token throughput (tokens per second) scales roughly inversely with model size: an A30 GPU achieves approximately 20 tps for a 32b model (20GB) and 80 tps for an 8b model (5GB).
GPU | VRAM | Model | Size | Context size (n_ctx_per_seq) | Performance, tps (tokens per sec) |
---|---|---|---|---|---|
Nvidia A30 | 24GB | qwen2.5-coder:32b | 20GB | 2048 | 23.24 |
Nvidia A30 | 24GB | deepseek-r1:32b | 20GB | 131072 | 21.44 |
Nvidia A30 | 24GB | llama3.1:8b | 4.9GB | 2048 | 80.03 |
Nvidia A30 | 24GB | llama3.1:8b | 4.9GB | 16384 (see More deployment options below) | 79.55 |
Nvidia H100 | 80GB | NOT AVAILABLE | - | - | - |
Unfortunately, I was unable to deploy a cluster with H100 GPUs. Rackspace Spot appears to have limited capacity for this configuration: despite a $1.00/hour market price, the bid had to be raised to $5.00/hour, exceeding budget expectations.
Even at this elevated bid, the deployment remained stuck in ‘Node Provisioning in Progress’.
Configure Cline VSCode extension
- Install VS Code and the Cline extension
- Open VS Code
- Click the Cline settings icon
- Select “Ollama” as the API provider
- Enter the configuration: Base URL: http://localhost:11434/ (default value, can be left as is)
More information about Cline configuration is available here.
More deployment options
You can define the model context size when deploying; read more about the parameters in the Modelfile specification:
$ cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 50Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - llama3.1:8b
    create:
      - name: llama3.1-ctx16384
        template: |
          FROM llama3.1:8b
          PARAMETER num_ctx 16384
    run:
      - llama3.1-ctx16384
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
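To verify that the custom context size was applied, inspect the created model inside the pod; ollama show prints the model parameters, including num_ctx (again assuming the default deployment name ollama):
$ kubectl exec -n ollama deploy/ollama -- ollama show llama3.1-ctx16384 | grep num_ctx
    num_ctx    16384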
Mid-sized DeepSeek R1 (70b parameters); note that at 43GB this model will not fit into the A30's 24GB of VRAM, so an H100-class node would be required:
$ cat <<EOF | helm upgrade ollama ollama-helm/ollama --install --create-namespace -n ollama --version 1.5.0 -f -
persistentVolume:
  enabled: true
  size: 70Gi
ollama:
  gpu:
    enabled: true
  models:
    pull:
      - deepseek-r1:70b
    run:
      - deepseek-r1:70b
extraEnv:
  - name: OLLAMA_KEEP_ALIVE
    value: 24h
EOF
Conclusion
Rackspace Spot presents a potentially cost-effective solution for Kubernetes-based, self-hosted LLM deployments. It currently offers suitable configurations for 7-8b and 32b parameter models using A30 GPUs. Unfortunately, H100 capacity is limited, which prevented testing that configuration.
Next steps
In the next tutorials we will deploy a cloud-native machine learning platform and start using the LLM for something useful. Stay tuned!