Key Takeaways
- SmolLM (135M, 360M, and 1.7B): These models exhibit surprisingly competent behavior and can live on a laptop or entry-level gaming rig.
- You can bend the 135M & 360M variants to your will in a few hours on commodity hardware like an RTX 4070. No cloud tithe required.
- Supervised Fine-Tuning (SFT) is the lever to coax instruction-following behavior from a raw base model.
- Small models are unforgiving. They demand carefully formatted prompts, high-signal synthetic data, and (counter-intuitively) gradient checkpointing to survive.
- Evaluate with held-out tasks. Blind faith in a single benchmark like Winogrande is a recipe for delusion.
Why Bother With Tiny Models?
The large-scale, open-weight apex predators like Llama-2 70B and Mistral-7B are undeniably potent, provided you belong to the H100 priesthood or are willing to pay the cloud bill. In the trenches of real-world projects, my needs are often more prosaic. I need an assistant that:
- Boots inside 8–10 GB of VRAM.
- Answers domain-specific questions locally, without phoning home.
- Can be retrained overnight on a desk-side machine I actually own.
Hugging Face’s SmolLM series meets that bar. The 135M and 360M checkpoints are lean yet capable, a consequence of a deep-and-narrow MobileLLM-style architecture. This is about digital sovereignty and tangible utility.
Model | Params | Layers × Dim | Pre-train tokens | Context | Epochs (effective)
---|---|---|---|---|---
SmolLM-135M | 135M | 30 × 576 | 600B | 2048 | ≈ 2.4
SmolLM-360M | 360M | 32 × 960 | 600B | 2048 | ≈ 2.4
SmolLM-1.7B | 1.7B | 24 × 2048 | 1T | 2048 | ≈ 4
Training corpus: Cosmopedia v2 (28B), FineWeb-Edu (220B), Python-Edu (4B).
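If you'd rather verify the table than trust it, the architecture details sit in each checkpoint's config. A quick sketch (SmolLM ships a standard Llama-style config, so the usual field names apply):
from transformers import AutoConfig

# Print the architecture facts straight from the Hub configs.
for model_id in (
    "HuggingFaceTB/SmolLM-135M",
    "HuggingFaceTB/SmolLM-360M",
    "HuggingFaceTB/SmolLM-1.7B",
):
    cfg = AutoConfig.from_pretrained(model_id)
    print(
        f"{model_id}: layers={cfg.num_hidden_layers}, "
        f"hidden={cfg.hidden_size}, heads={cfg.num_attention_heads}, "
        f"context={cfg.max_position_embeddings}"
    )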
Hardware & Tooling
To replicate this, you’ll need:
- GPU: At least 12 GB of VRAM. A desktop RTX 3060 (12 GB) is sufficient; an RTX 4070 would be great.
- The usual suspects: PyTorch ≥ 2.2, Transformers ≥ 4.40, TRL, datasets, accelerate. This is the standard arsenal for modern NLP.
- Optional but recommended: FlashAttention-2 if your GPU has a Compute Capability ≥ 8.0 (Ampere or newer). It materially accelerates the attention mechanism.
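Before writing any training code, I run a ten-second sanity check on the toolchain; it's cheaper than discovering a half-broken environment two hours into a run. A minimal sketch:
# env_check.py - confirm versions and GPU capability before committing to a run
import torch
import transformers
import trl
import datasets
import accelerate

print(f"torch        : {torch.__version__}")
print(f"transformers : {transformers.__version__}")
print(f"trl          : {trl.__version__}")
print(f"datasets     : {datasets.__version__}")
print(f"accelerate   : {accelerate.__version__}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {props.name} ({props.total_memory / 1024**3:.1f} GB, compute capability {major}.{minor})")
else:
    print("No CUDA GPU detected - training on CPU will be painfully slow.")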
Supervised Fine-Tuning (SFT)
This is where we impose discipline on the base model, teaching it to follow instructions. I’m using Dolly-15k, a high-quality instruction set from Databricks that provides a diverse range of tasks. The objective is to shift the model’s behavior from merely predicting the next token to completing a user’s request.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
import torch
import re
import platform
# --- Determine hardware capabilities first ---
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
attn_impl = "eager"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        # For GPUs with Compute Capability >= 8.0 (Ampere, Ada, Hopper)
        # Install flash-attn for a significant speedup.
        # !pip install -q flash-attn
        attn_impl = "flash_attention_2"
        print("Using FlashAttention-2.")
    else:
        print("GPU compute capability less than 8.0. Using eager attention.")
else:
    print("No CUDA GPU found. Using eager attention on CPU.")
print(f"Processor: {platform.processor()} -> BF16 available: {use_bf16}, Attention impl: {attn_impl}")
# --- Model and Tokenizer Setup ---
model_id = "HuggingFaceTB/SmolLM-135M"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # Base models often lack a pad token
tok.padding_side = "left"  # Left padding matters for batched generation with decoder-only models
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation=attn_impl,
    device_map="auto",  # Maps model to GPU/CPU automatically
    # BF16 on Ampere+, FP16 on older GPUs, FP32 as the safe CPU fallback
    torch_dtype=torch.bfloat16 if use_bf16 else (torch.float16 if torch.cuda.is_available() else torch.float32),
)
# A non-negotiable for memory savings. Trades compute for VRAM.
model.gradient_checkpointing_enable()
# --- Dataset Preparation ---
ds_raw = load_dataset("databricks/databricks-dolly-15k", split="train")
def format_dolly(example):
    # Impose a standard conversational format on the raw data.
    instruction = example["instruction"]
    context = example["context"]
    response = example["response"]
    if context and isinstance(context, str) and context.strip():
        # Clean up the noise - Wikipedia citation numbers aren't signal.
        clean_context = re.sub(r'\[\d+\]', '', context).strip()
        formatted = f"User: {instruction}\nContext: {clean_context}\nAssistant: {response}"
    else:
        formatted = f"User: {instruction}\nAssistant: {response}"
    return {"text": formatted}
# Process and split the dataset
ds_processed = ds_raw.map(format_dolly, remove_columns=list(ds_raw.features))
train_val_split = ds_processed.train_test_split(test_size=0.1, seed=42)
ds_train = train_val_split["train"]
ds_val = train_val_split["test"]
print(f"Training on {len(ds_train)} examples, validating on {len(ds_val)} examples.")
# --- Training Configuration ---
config = SFTConfig(
    output_dir="sft_smollm_135m",
    per_device_train_batch_size=4,                    # Keep this low for ~12GB VRAM
    gradient_accumulation_steps=4,                    # Effective batch size = 4 * 4 = 16
    num_train_epochs=3,                               # Dolly is small enough for multiple epochs
    learning_rate=2e-5,
    fp16=torch.cuda.is_available() and not use_bf16,  # FP16 only on GPUs without BF16 support
    bf16=use_bf16,                                    # Use BF16 on Ampere+ for speed and stability
    logging_steps=50,
    save_steps=500,
    eval_steps=50,
    evaluation_strategy="steps",                      # Required if using eval_steps
    max_seq_length=2048,                              # Match the model's context window
    packing=True,                                     # Pack multiple short examples into one sequence for efficiency
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    args=config,
    dataset_text_field="text",  # Explicitly point to our formatted text column
)
# To run the training, uncomment the line below.
# trainer.train()
print("SFT Trainer configured. The machine is ready to impose some discipline.")
Tip: For Dolly-15k, three full epochs is a reasonable starting point. With ~13.5k training examples after the 10% validation split and an effective batch size of 16, one epoch is roughly 850 optimizer steps (fewer in practice, since packing=True merges short examples into full 2048-token sequences). I always recommend a dry run with a small number of steps (e.g., max_steps=100) to validate the entire pipeline before committing to a multi-hour training job.
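Here is a minimal dry-run sketch along those lines, reusing the objects defined above and only shrinking the step budget; the exact max_steps value and the test prompt are arbitrary choices of mine:
# Dry run: same knobs as the real config, but capped at 100 optimizer steps.
dry_config = SFTConfig(
    output_dir="sft_smollm_135m_dryrun",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=100,                  # Enough to surface OOMs, bad formatting, or NaN losses
    learning_rate=2e-5,
    fp16=torch.cuda.is_available() and not use_bf16,
    bf16=use_bf16,
    logging_steps=10,
    eval_steps=50,
    evaluation_strategy="steps",
    max_seq_length=2048,
    packing=True,
)

dry_trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    args=dry_config,
    dataset_text_field="text",
)
# dry_trainer.train()

# After the dry run, eyeball one generation to confirm the User/Assistant format stuck.
prompt = "User: What is gradient checkpointing?\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=80, do_sample=False, pad_token_id=tok.pad_token_id)
print(tok.decode(out[0], skip_special_tokens=True))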
Practical Guidelines
My time in the trenches with these smaller models has yielded a few hard-won heuristics:
- Curate Ruthlessly: Tiny models have no capacity for garbage. They are exquisitely sensitive to data quality. They benefit immensely from high-signal, narrowly-scoped data like Dolly, which is composed of human-curated examples across useful categories (creative writing, classification, information extraction). Noise will kill your performance.
- Batch Size Matters: Even for a 135M model, a sufficiently large global batch size (e.g., 64+ sequences per optimization step) helps keep training stable. Gradient accumulation is your primary tool here, simulating a larger batch size without demanding more VRAM.
- Mind the Template: As of mid-2024, many base models on Hugging Face lack a defined chat_template. You will likely need to inject one manually before using higher-level abstractions like apply_chat_template; failure to do so results in models that don't understand conversational turns (see the sketch after this list).
- Learning Rate Schedule: For SFT, a learning rate between 1e-5 and 5e-5 with a linear decay schedule is a robust starting point. It helps stabilize training as the model converges.
- Quantize for Deployment: Once you are satisfied with the fine-tuned model's performance, quantizing it is the final step to weaponize it for inference. Converting to a format like INT4-GGUF with llama.cpp can shrink the 360M model to fit comfortably within 3 GB of RAM.
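On the template point above, here is a minimal sketch of injecting a chat template by hand. The Jinja string is my own simple User/Assistant format, chosen to mirror the SFT data earlier in this post; it is not an official SmolLM template:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")

# A deliberately simple Jinja template matching the "User: ... / Assistant: ..." SFT format.
tok.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)

messages = [{"role": "user", "content": "Summarize gradient checkpointing in one sentence."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# User: Summarize gradient checkpointing in one sentence.
# Assistant: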
Limitations & Next Steps
Let’s be clear about what these models are not. Don’t bother with Direct Preference Optimization (DPO) on models this small. They lack the parameter capacity to effectively learn from preference pairs without suffering catastrophic forgetting. Dolly provides a solid SFT foundation, but the limitations are real:
- Context Window: The SmolLM series is locked at a 2048-token context. For long-document Q&A, you are still bound to a Retrieval Augmented Generation (RAG) architecture.
- Multilingualism: The pre-training data is overwhelmingly English. Performance in other languages will require dedicated fine-tuning on translated or newly generated instruction sets.
- Complex Reasoning: Performance on multi-step logical reasoning (like GSM8K math problems) will hit a hard ceiling. For these tasks, chasing scale or exploring radically different architectures like Mixture-of-Experts (MoE) ensembles is likely a more productive path.
Conclusion
The SmolLM series is an existence proof that “good enough” language models are now accessible to builders outside of large, well-capitalized labs. With a weekend of tinkering on a consumer GPU, it’s possible to craft a personalized assistant that respects both data privacy and budgetary reality. The Supervised Fine-Tuning workflow is a repeatable recipe for imprinting your intent onto a base model, a process that is not only effective but a potent reminder that value can be created without asking for anyone’s permission.