QLoRA: Quantized Low-Rank Adaptation

QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.

Prerequisites: This skill assumes familiarity with LoRA. See the lora skill for LoRA fundamentals (LoraConfig, target_modules, training patterns).

Core Innovations
BitsAndBytesConfig Deep Dive
Memory Requirements
Complete Training Example
Inference and Merging
Troubleshooting
Best Practices
References

Core Innovations

QLoRA introduces three techniques that reduce memory usage without sacrificing performance:

4-bit NormalFloat (NF4)

NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.

Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)

The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; the LoRA adapters stay in higher precision (bf16 or fp32) and are the only weights that receive gradient updates.
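The store-in-4-bit, compute-in-16-bit flow can be sketched in plain Python. This is a minimal illustration, not the bitsandbytes implementation: the 16 codebook values are the NF4 levels as listed in the QLoRA paper (reproduced here as an assumption; check them against bitsandbytes before relying on them), and `quantize_block` / `dequantize_block` are hypothetical helper names.

```python
# NF4 codebook: 16 levels in [-1, 1], denser near zero (values assumed from the QLoRA paper).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(weights):
    """Map one block of weights to 4-bit codes plus a single absmax scaling constant."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [
        # index of the codebook level closest to the normalized weight
        min(range(len(NF4_LEVELS)), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate weights for compute (bf16 in practice; Python floats here)."""
    return [NF4_LEVELS[c] * absmax for c in codes]

block = [0.5, -0.25, 0.0, 0.1]
codes, absmax = quantize_block(block)      # 4-bit indices + one scaling constant
restored = dequantize_block(codes, absmax) # approximate weights used in the forward pass
```

Each weight costs 4 bits for its code, plus a share of the per-block scaling constant; the round trip is lossy, which is why the trainable adapters are kept outside the quantized weights.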

NF4 vs FP4:

Quantization	Description	Use Case
`nf4`	NormalFloat 4-bit, optimal for normally distributed weights	Default, recommended
`fp4`	Standard 4-bit float	Legacy, rarely needed

Double Quantization

Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:

First quantization:  weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
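A rough sketch of the second step, assuming a simplified linear 8-bit code in place of the 8-bit float format described in the QLoRA paper. The names `quantize_constants` and `dequantize_constants` are hypothetical. Because the per-block constants are all positive, they are mean-centered before being quantized.

```python
def quantize_constants(absmax_values):
    """Quantize fp32 per-block scaling constants to signed 8-bit codes."""
    mean = sum(absmax_values) / len(absmax_values)
    centered = [a - mean for a in absmax_values]         # center the positive constants around zero
    scale = max(abs(c) for c in centered) or 1.0         # second-level constant, kept in fp32
    codes = [round(c / scale * 127) for c in centered]   # integers in [-127, 127]
    return codes, scale, mean

def dequantize_constants(codes, scale, mean):
    """Recover approximate first-level constants from their 8-bit codes."""
    return [c / 127 * scale + mean for c in codes]

first_level = [0.12, 0.08, 0.10, 0.15]   # example absmax constants, one per weight block
const_codes, scale, mean = quantize_constants(first_level)
recovered = dequantize_constants(const_codes, scale, mean)
```

Only `scale` and `mean` remain in fp32 for the whole group of constants, so the per-parameter overhead of the first-level constants shrinks from 32 bits per block to roughly 8 bits per block.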

This saves approximately 0.37 bits per parameter (the overhead of the scaling constants drops from 0.5 to about 0.127 bits per parameter), which adds up for billion-parameter models:

7B model: ~325 MB savings
70B model: ~3.2 GB savings
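The per-parameter figure and the savings above can be reproduced with a few lines of arithmetic. The block sizes (64 weights per first-level block, 256 constants per second-level block) are taken from the QLoRA paper rather than from this document, and the function names are illustrative.

```python
def constant_bits_per_param(block_size=64, second_block_size=256, double_quant=True):
    """Bits per parameter spent on scaling constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block of weights
    # one 8-bit constant per block, plus one fp32 constant per group of blocks
    return 8 / block_size + 32 / (block_size * second_block_size)

def double_quant_savings_bytes(n_params):
    """Bytes saved by double quantization for a model with n_params parameters."""
    saved = constant_bits_per_param(double_quant=False) - constant_bits_per_param()
    return n_params * saved / 8

saved_bits = constant_bits_per_param(double_quant=False) - constant_bits_per_param()
savings_7b_mb = double_quant_savings_bytes(7e9) / 1e6
savings_70b_gb = double_quant_savings_bytes(70e9) / 1e9
print(f"{saved_bits:.3f} bits/param, 7B: {savings_7b_mb:.0f} MB, 70B: {savings_70b_gb:.2f} GB")
```

The results land within a few MB of the rounded figures listed above; the small difference comes from using 0.373 rather than 0.37 bits per parameter.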

Paged Optimizers

During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:

Normal training: OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully

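bitsandbytes does not enable this on its own: choose a paged optimizer explicitly, for example optim="paged_adamw_8bit" in the training configuration (see the Complete Training Example).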

BitsAndBytesConfig Deep Dive

All Parameters Explained

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,              # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",      # "nf4" (recommended) or "fp4"

    # Double quantization
    bnb_4bit_use_double_quant=True, # Quantize the quantization constants

    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute

    # Optional: storage dtype for the packed 4-bit weights (defaults to torch.uint8)
    bnb_4bit_quant_storage=torch.uint8,     # Storage dtype for quantized weights
)

Compute Dtype Selection

Dtype	Hardware	Notes
`torch.bfloat16`	Ampere+ (RTX 30xx, A100)	Recommended, faster
`torch.float16`	Older GPUs (V100, RTX 20xx)	Use if bf16 not supported
`torch.float32`	Any	Slower, only for debugging

Check bf16 support:

import torch
print(torch.cuda.is_bf16_supported())  # True on Ampere+

Comparison: Quantization Options

# Recommended: NF4 + double quant + bf16
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Fallback when bf16 is unsupported
fp16_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # use when bf16 is unsupported or slower
)

# 8-bit alternative (less compression, sometimes more stable)
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

Memory Requirements

Model Size	Full Fine-tuning	LoRA (16-bit)	QLoRA (4-bit)
7B	~60 GB	~16 GB	~6 GB
13B	~104 GB	~28 GB	~10 GB
34B	~272 GB	~75 GB	~20 GB
70B	~560 GB	~160 GB	~48 GB

Notes:

QLoRA memory includes model + optimizer states + activations
Actual usage varies with batch size, sequence length, and gradient checkpointing
Add ~20% buffer for safe operation
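As a rough cross-check of the table, the weight-only footprint can be estimated from the bits stored per parameter. This is a lower bound under stated assumptions (about 4.127 bits per parameter for NF4 with double quantization, 16 bits for bf16). It ignores adapters, optimizer states, activations, and framework overhead, which account for the rest of the figures above. The function name is illustrative.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Weight-only memory in GB (10^9 bytes) for a model of the given size."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# 4-bit weights plus double-quantized scaling constants (block sizes 64 and 256 are assumed)
NF4_BITS = 4 + 8 / 64 + 32 / (64 * 256)
BF16_BITS = 16

for size in (7, 13, 34, 70):
    print(f"{size}B: bf16 {weight_memory_gb(size, BF16_BITS):.1f} GB, "
          f"nf4 {weight_memory_gb(size, NF4_BITS):.1f} GB")
```

For a 7B model this gives roughly 3.6 GB of NF4 weights versus 14 GB in bf16; the gap between these numbers and the table is the training overhead.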

GPU Recommendations

GPU VRAM	Max Model Size (QLoRA)
8 GB	7B (tight)
16 GB	7-13B
24 GB	13-34B
48 GB	34-70B
80 GB	70B+ comfortably
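To see how the ~20% buffer from the notes interacts with these recommendations, a small lookup over the QLoRA column of the memory table can help. The helper name and the way the buffer is applied are assumptions for illustration; the GB figures are the approximate values from the memory table.

```python
# Approximate QLoRA training footprints in GB, copied from the memory table.
QLORA_GB = {"7B": 6, "13B": 10, "34B": 20, "70B": 48}

def largest_model_that_fits(vram_gb, buffer_pct=20):
    """Return the largest listed model whose estimate plus buffer fits, or None."""
    fitting = [
        name for name, est in QLORA_GB.items()
        if est * (100 + buffer_pct) / 100 <= vram_gb
    ]
    return fitting[-1] if fitting else None  # dict keeps insertion order (smallest to largest)

for vram in (8, 16, 24, 48, 80):
    print(vram, "GB ->", largest_model_that_fits(vram),
          "| without buffer:", largest_model_that_fits(vram, buffer_pct=0))
```

With the buffer applied, a 70B model no longer fits in 48 GB, so the upper end of each range in the table is best read as tight rather than comfortable.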

Complete Training Example

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# 1. Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    dtype="auto",
    attn_implementation="flash_attention_2",  # Optional: faster attention; requires the flash-attn package, remove this line if it is not installed
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 3. Prepare for k-bit training (critical step!)
model = prepare_model_for_kbit_training(model)

# 4. LoRA config (see lora skill for parameter details)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 5. Dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

def format_example(example):
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)

# 6. Training
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_length=512,
    dataset_text_field="text",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()

# 7. Save adapter
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
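
For a sense of what print_trainable_parameters() should report with this configuration, the adapter size can be estimated by hand: each adapted linear layer with shape (in, out) adds r * (in + out) parameters. The sketch below assumes the Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these come from the model's published configuration, not from this document, and the helper name is illustrative.

```python
R = 16            # LoRA rank from lora_config
HIDDEN = 4096     # assumed hidden size
KV_DIM = 8 * 128  # assumed key/value projection width (grouped-query attention)
MLP = 14336       # assumed MLP intermediate size
LAYERS = 32       # assumed number of decoder layers

# (in_features, out_features) for every module listed in target_modules
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r=R):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in SHAPES.values())
total = per_layer * LAYERS
print(f"{total:,} trainable adapter parameters")
```

Under these assumptions the adapters hold about 42M parameters, roughly 0.5% of the 8B base model, which is why the optimizer state stays small even though every linear projection is adapted.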

Inference and Merging

Inference with Quantized Model

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

model_name = "meta-llama/Llama-3.1-8B"

# Load quantized base model
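bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)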
