OpenAI GPT-OSS: How to Run It Locally or Self-Host on Cloud, and Hardware Requirements

GPT-OSS is unusually well-engineered for accessibility: the gpt-oss-20B variant is designed to run on a single consumer GPU (~16 GB VRAM) or on recent high-end laptops using quantized GGUF builds, while gpt-oss-120B, despite its 117B total parameters, ships with a mixture-of-experts design and MXFP4 quantization that let it run on a single H100-class GPU (≈80 GB) or on multi-GPU setups. Whether you are targeting a compact model for local apps or the larger variant for production services, deploying an open-weight GPT-style model raises the same core questions: how do you run GPT-OSS locally, how do you self-host it on the cloud, and what hardware does it actually require?
What are GPT-OSS models and what are their hardware requirements?
What is GPT-OSS?
GPT-OSS is OpenAI’s recently released open-weight family of large language models (two headline variants at the time of release: ~20B and ~120B parameters). They ship with optimization choices (mixture-of-experts, native MXFP4 quantization in OpenAI’s distribution, and alternating dense/sparse attention layers) that let these relatively large parameter counts run in significantly less memory than naïve FP32/FP16 copies would require. The release was explicitly intended to make powerful models broadly runnable and customizable beyond just the hyperscalers.
Key product facts (load-bearing):
- gpt-oss-20B is intended to run on a single consumer GPU with ~16 GB VRAM (and can be used on desktops/laptops with GGUF quantizations).
- gpt-oss-120B (≈117B total parameters, ~5.1B active parameters in OpenAI’s MoE design) is engineered to fit on a single 80 GB H100/A100 when using MXFP4 and a runtime that supports it, or to run on multi-GPU setups.
Hardware factors that determine requirements
- Model size and architecture – MoE and sparse/dense layers can change activation and working memory. (GPT-OSS uses mixture-of-experts style components.)
- Precision & quantization – FP32, FP16, BF16, 8-bit, 4-bit (GPTQ/AWQ/MXFP4). Lower precisions reduce memory but can affect latency and numeric fidelity. OpenAI provides MXFP4 quantized weights for GPT-OSS.
- Context length (sequence length) – longer contexts increase KV-cache and activation memory proportionally; GPT-OSS supports very long contexts (up to 128K tokens), which multiplies memory needs.
- Batch size & concurrency – serving multiple concurrent users multiplies memory for activations and cache. Frameworks like vLLM, DeepSpeed, and Triton try to efficiently batch and share activations across requests.
- Serving framework overhead – different inference servers (vLLM, text-generation-inference, llama.cpp, ONNX Runtime) add different overheads and optimizations.
What “fits” where: rough memory rules
Two concepts matter for hardware planning:
- Total parameter count — an upper bound on model size (117B vs 21B).
- Activated/working set — in MoE or certain precision settings the active memory needed at inference can be much smaller than raw parameter bytes.
Practical rules-of-thumb:
- 16 GB class GPUs/edge laptops → possible for gpt-oss-20b if you use the model’s provided memory-efficient config (or quantize aggressively to 4-bit/NF4/AWQ).
- 80 GB H100 / A100 80GB → single-GPU hosting for gpt-oss-120b in their recommended setup. For production throughput you may still want multiple GPUs for batching, redundancy or lower latency under concurrency.
- Large multi-GPU setups (A100/H100 clusters) → required if you want to run many concurrent users at low latency or perform heavy fine-tuning/training. DeepSpeed/ZeRO and automatic tensor parallelism let you split large models across GPUs.
Short takeaway: for experimentation and lightweight local use, plan on a 16–24 GB GPU (or CPU + heavy quantization). For production single-GPU inference of the big gpt-oss model you’ll target an 80 GB H100, otherwise use multi-GPU partitioning.
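To sanity-check these rules of thumb yourself, a rough back-of-envelope calculation helps: weight memory is roughly parameter count × bits per parameter ÷ 8, and the KV cache grows with layers × KV heads × head dimension × context length × batch size. Here is a minimal sketch in Python; the layer/head/dimension values used for the cache example are illustrative placeholders, not GPT-OSS's published architecture.
# Rough VRAM estimator: weights + KV cache (back-of-envelope only).
def weight_gb(n_params_billion: float, bits_per_param: float) -> float:
    # Weight memory in GB: params * bits / 8 bits-per-byte
    return n_params_billion * bits_per_param / 8

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # K and V caches: 2 * layers * kv_heads * head_dim * tokens * batch * bytes
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Why precision decides "does it fit?": 117B total parameters at various precisions
for bits in (16, 8, 4.25):          # FP16, INT8, ~MXFP4 (4 bits plus scale overhead)
    print(f"117B params @ {bits} bits ≈ {weight_gb(117, bits):6.1f} GB")

# KV cache with *placeholder* architecture numbers (assumptions, not the real config)
print(f"KV cache (8K ctx, batch 4)  ≈ {kv_cache_gb(36, 8, 64, 8192, 4):.1f} GB")
At FP16 the 117B weights alone (~234 GB) would need several GPUs; at roughly 4 bits they drop to ~62 GB, which is how the single-80 GB-GPU target becomes plausible.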
How much computing power is required for GPT-OSS deployment in practice?
Inference vs training: wildly different budgets
- Inference: the dominant cost is GPU memory (VRAM) plus optimized kernels. With optimized runtimes (vLLM, TensorRT, DeepSpeed-Inference) and quantization, inference on gpt-oss-20b is feasible on a 16 GB consumer GPU; the 120B MoE model is engineered to fit an 80 GB H100.
- Fine-tuning / full-scale training: orders of magnitude larger; you'll need many GPUs or specialized training instances (multi-node H100/A100 clusters, a large FLOPs budget, and fast storage I/O). This article focuses mainly on inference/self-hosting and light fine-tuning recipes (QLoRA / LoRA; see the sketch after this list), not multi-week pretraining.
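For the light fine-tuning path mentioned above, a minimal QLoRA-style skeleton with Hugging Face peft might look like the following. This is a sketch under assumptions: the rank/alpha/dropout values are illustrative, target_modules="all-linear" is a generic default rather than a tuned choice for GPT-OSS, and you should verify that the 4-bit bitsandbytes path is compatible with the gpt-oss checkpoints you download.
# Minimal QLoRA-style skeleton (hyperparameters are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "openai/gpt-oss-20b"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Attach low-rank adapters; only the adapter weights are trained.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# ...then train with transformers' Trainer or trl's SFTTrainer on your dataset.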
CPU vs GPU vs specialized accelerators
- CPU only: possible with GGUF/llama.cpp and tiny quantized builds, trading latency for lower cost. Running 20B on CPU without quantization is impractical. Use CPU when privacy or local offline operation is essential and your throughput needs are low.
- GPU: preferred for latency and throughput. Modern ML GPUs (A100/H100/4090/4080) differ widely by HBM/VRAM and inter-GPU fabric. The gpt-oss docs recommend H100-class for the 120B variant.
- TPU / AMD MI300X: supported by some runtimes (vLLM/ROCm builds) and can be cost-effective in certain clouds — check provider docs when choosing hardware.
How to run GPT-OSS locally on a limited budget? (code + step-by-step)
Below are two practical approaches: (A) GPU laptop/desktop with ~16–24 GB VRAM using 4-bit quantization, and (B) CPU/low-GPU (offline) using llama.cpp (GGUF) or small quantized builds. Both are widely used by practitioners when money and power are limited.
Note: these instructions assume you have a working Python environment (Linux recommended for best CUDA support). For Windows, use WSL2 for best compatibility with GPU toolchains.
A. GPU route (recommended for best latency on a budget) — quantize + load with bitsandbytes (4-bit)
This path aims to run openai/gpt-oss-20b on a single consumer GPU (e.g., a 24 GB RTX 4090 or a 16 GB RTX 4080). It uses bitsandbytes 4-bit quantization together with Hugging Face transformers and accelerate device mapping.
Step 1 — Install basics
# Linux + CUDA (example); pick the correct torch CUDA wheel for your driver
python -m pip install -U pip
pip install torch --index-url https://download.pytorch.org/whl/cu121 # pick your CUDA version
pip install -U transformers accelerate bitsandbytes safetensors
(If you use conda, create an env and install the CUDA-compatible torch wheel for your platform.)
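Before downloading tens of gigabytes of weights, it is worth confirming that PyTorch actually sees your GPU:
# quick sanity check: should print True and your GPU name
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no CUDA GPU detected')"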
Step 2 — (Optional) Hugging Face login to download large files
huggingface-cli login
Step 3 — Python example (load quantized 4-bit model)
# save as run_gptoss_4bit.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "openai/gpt-oss-20b"

# 4-bit quantization config: NF4 with double quantization, FP16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # or "fp4", depending on support
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let transformers place weights on GPU and offload to CPU if needed
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

prompt = "Write a concise summary of quantization for LLMs."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
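For interactive use you can stream tokens as they are produced instead of waiting for the full completion; this reuses the model, tokenizer and inputs objects from the script above:
# Optional: stream tokens to stdout as they are generated
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=200)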
Notes & tips
- Use device_map="auto" so transformers handles GPU placement and CPU offload automatically. With a single GPU it will usually put everything on the GPU and offload only what must live on CPU.
- If you run out of VRAM, set offload_folder="./offload" in from_pretrained so offloaded tensors spill to fast local NVMe.
- The Hugging Face + bitsandbytes approach is widely documented; see the transformers 4-bit quantization guide for details.
B. CPU / tiny-budget route (llama.cpp / GGUF)
If you have no GPU or only a very small one, llama.cpp / GGUF builds (and AWQ/GPTQ quantized files) let you run models on CPU with acceptable latency for a single user.
Step 1 — Install llama.cpp / Python bindings
# Download and build (Linux)
git clone --recursive https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make   # older releases; newer llama.cpp versions build with CMake: cmake -B build && cmake --build build --config Release
# Python bindings (optional)
pip install llama-cpp-python
Step 2 — Convert safetensors → GGUF (if conversion scripts are available for gpt-oss)
OpenAI/Hugging Face distribute the weights as safetensors; community converters (or the conversion scripts bundled with llama.cpp) turn them into GGUF. The exact command depends on the current llama.cpp tooling; check the repo README for the conversion script (e.g., convert.py or convert_hf_to_gguf.py). Community threads also discuss conversion details for newly released models.
Step 3 — Run the model with llama.cpp
# basic inference (example; newer builds name the binary llama-cli instead of main)
./llama-cli -m ./gpt-oss-20b.gguf -p "Explain GGUF and quantization in one paragraph." -n 256
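If you installed the optional Python bindings, the same GGUF file can be driven from Python. A minimal sketch follows; the file name matches the example above, and n_ctx/n_threads/n_gpu_layers are values to tune for your machine.
# Minimal llama-cpp-python example (tune n_ctx / n_threads / n_gpu_layers to your hardware)
from llama_cpp import Llama

llm = Llama(
    model_path="./gpt-oss-20b.gguf",  # GGUF file produced/downloaded in Step 2
    n_ctx=4096,        # context window; larger values use more RAM
    n_threads=8,       # roughly the number of physical CPU cores
    n_gpu_layers=0,    # >0 offloads some layers to a small GPU if present
)
out = llm("Explain GGUF and quantization in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])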
Notes & tradeoffs
- CPU runs are much slower. Use this route for testing, privacy, or very low-concurrency local agents.
- Generating long outputs or serving many concurrent users on CPU is not practical; move to a GPU for production.
On-disk quantized builds (GPTQ/AWQ)
If you need to squeeze a large model into a small GPU (e.g., 8–12 GB), community results show that GPTQ/AWQ-style quantization can make some 20B models run on low-VRAM GPUs, but conversion often needs substantial CPU RAM and an intermediate GPU during quantization. Tools: GPTQ-for-LLaMa, AutoGPTQ (archived), AWQ, and QLLM.
Practical tips for limited budget
- Prefer 4-bit quantized checkpoints (GPTQ/AWQ/MXFP4) — often the difference between “runs in 12 GB” and “requires 80 GB”.
- Limit context length for budget inference: long contexts blow up activation cache. If you must store long contexts, consider offloading strategies.
- Use unified memory / NVMe offload carefully — frameworks may offer CPU/NVMe offload (DeepSpeed ZeRO-Offload / ZeRO-Infinity), but this increases latency.
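One concrete way to apply the last two tips with transformers is to cap per-device memory and point from_pretrained at an offload directory. A sketch follows; the GiB limits are examples to adapt to your hardware.
# Cap GPU memory and spill the remainder to CPU RAM / NVMe (limits are examples)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "48GiB"},  # leave headroom for activations and KV cache
    offload_folder="./offload",               # offloaded tensors go to fast local NVMe
)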
How to self-host GPT-OSS on cloud providers (practical guide & cost pointers)?
Which cloud hardware to pick?
- Single-GPU 80 GB H100: good for hosting gpt-oss-120b for small-to-medium traffic. In AWS terms, P5 instances provide H100 hardware; single-GPU variants (announced in 2025) make it cheaper to right-size for inference. Use P5 / ND H100 family depending on provider.
- Multi-GPU (8× H100): for high throughput and redundancy, use p5.48xlarge or a comparable multi-GPU instance/cluster. NVIDIA NVLink/NVSwitch within the same instance reduces inter-GPU communication overhead.
- Alternative clouds: CoreWeave, Lambda Labs, Paperspace, Runpod — often cheaper spot/on-demand GPU rentals for bursty inference. Use them for development before committing to long-term infrastructure.
- Cutting-edge / heavy production: AWS P5 (8× H100 80 GB per instance) for the highest throughput per node and single-GPU 80 GB needs, or for 120B+ models with less splitting. P5 also provides large local NVMe storage.
Example cloud self-hosting workflow (AWS P4/P5)
- Choose an instance (p4d/p5, or a smaller GPU family) based on model memory needs. For gpt-oss-20B, a single GPU with 16–32 GB VRAM is fine; for gpt-oss-120B choose an 80 GB HBM GPU or a multi-GPU instance.
- Prepare AMI / image — use a vendor AMI that bundles CUDA, cuDNN, and optimized PyTorch (or vendor images with NVIDIA drivers).
- Install the serving stack: vLLM, transformers, text-generation-inference (TGI) or NVIDIA Triton containers, or set up DeepSpeed-Inference (see the container example after this list).
- Provision fast local NVMe if you plan to offload large activation states (ZeRO-Infinity). P4/P5 nodes often have local NVMe and very high network bandwidth.
- Security & networking — place inference endpoints behind load balancers, use autoscaling groups for front-ends, and separate concerns (model serving vs request routing).
- Monitoring & SLOs — track GPU utilization, memory, token/sec, latency p95 and errors; use Prometheus + Grafana for metrics.
Sample self-host plan (gpt-oss-20b, production small scale)
Goal: serve ~20 concurrent users, 1–2s response target, cost-sensitive.
- Instance: 1× 24 GB GPU (e.g., AWS G5 with A10G, or an RTX 6000) for the model, plus a small CPU server for the front-end.
- Runtime: vLLM as the model server (continuous batching) + CometAPI gateway.
- Autoscale: use autoscaling group with GPU AMI and an ALB + horizontal autoscaling by CPU/GPU metrics.
- Storage: NVMe local for model caching; object store (S3) for cold model storage.
- Monitoring: Prometheus + Grafana, track GPU utilization, latency, queue length.
- Security: VPC, private subnets, IAM roles for model storage, TLS certs.
Sample self-host plan (gpt-oss-120b, production)
Goal: low latency for many concurrent users / enterprise.
- Instance: 1× H100 80 GB (single-GPU) for baseline; scale horizontally or use multi-GPU p5 instances for throughput. For high throughput, either replicate single-GPU service (data parallel) or shard the model across GPUs using DeepSpeed (tensor/pipeline).
- Runtime: DeepSpeed-Inference with automatic TP or NVIDIA TensorRT (where available). vLLM’s support for MoE/Multi-GPU and tuned kernels may also be useful.
- Kubernetes: use K8s with device plugins and local NVMe; use chaos testing for availability.
- Cost optimization: reserved instances for predictable load; spot instances for batch workloads.
Example: start a vLLM serving container for gpt-oss-20b
# assume vllm is installed and CUDA is set up
vllm serve openai/gpt-oss-20b --port 8000 --tensor-parallel-size 1
Then point your front-end at http://<host>:8000/v1/chat/completions (vLLM exposes an OpenAI-compatible API).
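Because the endpoint is OpenAI-compatible, any OpenAI client library can talk to it, for example the openai Python package; the api_key value below is a placeholder, since vLLM accepts any string unless you configure authentication.
# Query the self-hosted vLLM server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://<host>:8000/v1", api_key="not-needed")  # placeholder key
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize the benefits of continuous batching."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)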
Cost-optimization tips
- Spot/Preemptible VMs are 50–80% cheaper but require checkpointing or fast re-spawn strategies.
- Model quantization reduces instance type needs (e.g., a quantized 120B might be served on fewer GPUs if engines support on-the-fly dequantization).
- Use inference-only optimized instance families (P5/P4/A2 Ultra) with high NVLink/NVSwitch when doing multi-GPU model parallelism; network bandwidth matters for inter-GPU sharding.
How to balance cost, latency and model quality
Quantization: speed vs quality
Aggressive quantization (2–4-bit, AWQ/GPTQ) brings huge memory savings and often only modest quality loss for many tasks. Use AWQ/GPTQ in production only after benchmarking your specific workload; conversion may require large amounts of CPU memory during quantization.
Mixed precision & kernel optimizations
Use FP16/BF16 where supported, and combine with specialized CUDA kernels (FasterTransformer, TensorRT-LLM) for maximum throughput. NVIDIA's stack adds speculative decoding and optimized kernels for many transformer architectures, including optimized GPT-OSS support.
Safety & observability
Open-weight models mean you’re responsible for monitoring misuse, data leakage and drift. Implement request logging, content filters, rate limiting, and human-in-the-loop moderation. OpenAI’s release notes and model card emphasize their internal testing and external evaluations — but self-hosting shifts the safety perimeter to you.
Final thoughts
GPT-OSS moves the needle: models that previously required massive bespoke infra are now more approachable thanks to careful architecture choices and quantized distributions. But deployment remains a discipline: hardware sizing must consider model precision, context length, and the concurrency profile of your app. Use small testbeds (quantized 20B) to measure token/sec and p95 latency, and then multiply to estimate cloud compute and cost for production.
How to Access GPT-OSS API
CometAPI is a unified API platform that aggregates over 500 AI models from leading providers—such as OpenAI’s GPT series, Google’s Gemini, Anthropic’s Claude, Midjourney, Suno, and more—into a single, developer-friendly interface. By offering consistent authentication, request formatting, and response handling, CometAPI dramatically simplifies the integration of AI capabilities into your applications. Whether you’re building chatbots, image generators, music composers, or data‐driven analytics pipelines, CometAPI lets you iterate faster, control costs, and remain vendor-agnostic—all while tapping into the latest breakthroughs across the AI ecosystem.
Developers can access gpt-oss-20B and gpt-oss-120B through CometAPI; the model versions listed are current as of this article's publication date. To begin, explore the models' capabilities in the Playground and consult the API guide for detailed instructions. Before accessing the API, make sure you have logged in to CometAPI and obtained an API key. CometAPI offers pricing below the official rates to help you integrate.
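As an illustration only, a request through CometAPI could look like the sketch below if you use an OpenAI-compatible client; the base URL and model identifier here are assumptions, so confirm both against the CometAPI API guide and model list before use.
# Illustrative CometAPI call; base URL and model name are assumptions (verify in the API guide)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cometapi.com/v1",  # assumed endpoint; check CometAPI docs
    api_key="YOUR_COMETAPI_KEY",
)
resp = client.chat.completions.create(
    model="gpt-oss-120b",  # model identifier may differ; see the CometAPI model list
    messages=[{"role": "user", "content": "Give me a one-line summary of GPT-OSS."}],
)
print(resp.choices[0].message.content)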