LLM Fine-Tuning — QLoRA Indian Legal
A first-principles fine-tuning experiment — QLoRA (4-bit quantization + LoRA adapters) applied to Meta's Llama 3.2 1B Instruct on Indian Legal data. Trained end-to-end on the Google Colab free tier at ₹0 cost. Model published on Hugging Face.
🔗 Live Model: huggingface.co/invincibleambuj/llama-3.2-1b-legal-india-qlora
Project Overview
| Metric | Value |
|---|---|
| Base Model | Llama 3.2 1B Instruct (Meta) |
| Method | QLoRA — 4-bit quantization + LoRA rank 16 |
| Dataset | 14,543 Indian Legal QA pairs |
| Trainable Parameters | 11.27M out of 1.247B (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Training Steps | 100 (learning experiment) |
| Training Time | ~1 min 37 sec on T4 |
| Final Loss | 1.578 (Step 100) |
| GPU | Google Colab Tesla T4 (Free) |
| Training Cost | ₹0 |
What is QLoRA?
Standard fine-tuning requires loading the full model (1B+ params) into GPU memory — impossible on free-tier hardware. QLoRA solves this in two steps:
Normal Fine-tuning:
Full model (1B params) + gradients + optimizer states → ~16 GB+ VRAM → ❌ Not possible on free GPU
LoRA:
Freeze base model → Add small adapter layers → Train only adapters
1B params → Only 11M adapter params trained → ✅ Much less VRAM
QLoRA (what we used):
Step 1: Compress base model to 4-bit (Quantization) → ~300MB in memory
Step 2: Add LoRA adapters on top → Train only 11.27M params
Result: 0.90% of model trained → ✅ Runs on FREE T4 GPU
The base model's general knowledge (language, reasoning) is preserved. The adapter layers steer the model's responses toward the Indian Legal domain. The two work together at inference time.
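The base-plus-adapter idea can be sketched in plain Python (a toy illustration with tiny matrices, not the Unsloth implementation — the real model uses d = 2048 and r = 16). One detail worth seeing: because the LoRA "up" matrix B is zero-initialized, the adapter contributes nothing at step 0, so training starts from a model identical to the frozen base.

```python
# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x)

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d, r, alpha = 4, 2, 16
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.1] * d for _ in range(r)]   # LoRA "down" matrix (r x d), trained
B = [[0.0] * r for _ in range(d)]   # LoRA "up" matrix (d x r), zero-initialized

def lora_forward(x):
    base = matvec(W, x)              # frozen base path
    delta = matvec(B, matvec(A, x))  # low-rank adapter path
    scale = alpha / r
    return [b + scale * dlt for b, dlt in zip(base, delta)]

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x))  # B is all zeros, so this equals the base model's output
```

Training updates only A and B; W never changes, which is why only 0.90% of the parameters need gradients.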
| Concept | In This Project |
|---|---|
| Base Model Size | 1.247 Billion parameters |
| After 4-bit Quantization | ~300 MB in GPU memory |
| LoRA Rank (r) | 16 |
| Trainable Parameters | 11,272,192 (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
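The 11,272,192 figure can be reproduced from Llama 3.2 1B's architecture (hidden size 2048, intermediate size 8192, grouped-query attention with 512-dim k/v projections, 16 layers): a LoRA adapter on a d_in × d_out projection adds r × (d_in + d_out) parameters.

```python
# Reproduce the trainable-parameter count: r * (d_in + d_out) per adapted projection.
r = 16
layers = 16
# (d_in, d_out) for each target module in Llama 3.2 1B
modules = {
    "q_proj":    (2048, 2048),
    "k_proj":    (2048, 512),   # grouped-query attention: smaller k/v heads
    "v_proj":    (2048, 512),
    "o_proj":    (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj":   (2048, 8192),
    "down_proj": (8192, 2048),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
frozen = 1_235_814_400
print(total)                           # 11272192 trainable LoRA parameters
print(f"{total / (total + frozen):.2%}")  # 0.90% of all parameters
```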
Environment
| Component | Detail |
|---|---|
| Platform | Google Colab Free Tier |
| GPU | Tesla T4 — 14.5 GB VRAM |
| CUDA Version | 12.8 |
| Python | 3.12 |
| Torch | 2.10.0+cu128 |
| Unsloth | 2026.4.4 |
| TRL | Latest (upgraded from <0.9.0) |
| Training Cost | ₹0 — Free GPU |
Dataset
Sourced from Kaggle — three JSON files of Indian Legal question-answer pairs:
| File | Content | Source Law |
|---|---|---|
| constitution_qa.json | Constitutional QA | Constitution of India |
| ipc_qa.json | Penal Code QA | Indian Penal Code (IPC) |
| crpc_qa.json | Procedure QA | Code of Criminal Procedure (CrPC) |
| Total | 14,543 examples | — |
Data Format Conversion
Raw JSON → Alpaca format → Training prompt:
// Raw Data
{
"question": "What is India according to the Union and its Territory?",
"answer": "India, that is Bharat, shall be a Union of States."
}
// Converted to Alpaca Format
{
"instruction": "What is India according to the Union and its Territory?",
"input": "",
"output": "India, that is Bharat, shall be a Union of States."
}
Final Training Prompt Format:
### Instruction:
What is India according to the Union and its Territory?
### Response:
India, that is Bharat, shall be a Union of States.
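The two conversion steps above can be sketched as small Python helpers (hypothetical function names, not the notebook's exact code):

```python
# Step 1: map a raw {question, answer} record to the Alpaca schema.
def to_alpaca(raw):
    return {
        "instruction": raw["question"],
        "input": "",                    # no extra context in this dataset
        "output": raw["answer"],
    }

# Step 2: render an Alpaca-format example into the final training string.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def format_prompt(example):
    return PROMPT_TEMPLATE.format(**example)

raw = {
    "question": "What is India according to the Union and its Territory?",
    "answer": "India, that is Bharat, shall be a Union of States.",
}
print(format_prompt(to_alpaca(raw)))
```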
Cell-by-Cell Training Walkthrough
Model Loading (4-bit)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3.2-1b-instruct",
max_seq_length = 2048,
load_in_4bit = True, # ← This is the "q" in qLoRA
)
Downloaded Llama 3.2 1B (1.10 GB) and loaded it in 4-bit precision. load_in_4bit=True is the quantization step that makes free GPU training possible.
LoRA Adapter Setup
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = True,
)
LoRA adapters attached to 7 layer types across 16 transformer layers. Result: 11.27M trainable parameters added.
Important: This cell must be re-run after every Runtime restart. If skipped, training fails with "cannot fine-tune quantized model" error.
Training Configuration
| Parameter | Value | What it Means |
|---|---|---|
| per_device_train_batch_size | 2 | 2 examples processed per GPU step |
| gradient_accumulation_steps | 4 | Accumulate 4 steps = effective batch of 8 |
| max_steps | 100 | Total training steps (learning experiment) |
| learning_rate | 2e-4 (0.0002) | How fast adapter weights update |
| optim | adamw_8bit | Memory-efficient 8-bit optimizer |
| fp16 | True (T4 GPU) | Half precision — faster training |
| lr_scheduler_type | linear | Learning rate decreases linearly |
| warmup_steps | 5 | Gradual LR warmup at start |
- Effective Batch Size: 2 (batch) × 4 (accumulation) = 8 examples per weight update
- Examples actually seen: 100 steps × 8 = 800 out of 14,543 (5.5% of full dataset)
- Full epoch would require: ~1,820 steps (14,543 ÷ 8)
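The arithmetic above, spelled out with the numbers from the training config:

```python
import math

batch_size = 2        # per_device_train_batch_size
grad_accum = 4        # gradient_accumulation_steps
max_steps = 100
dataset_size = 14_543

effective_batch = batch_size * grad_accum    # examples per weight update
examples_seen = max_steps * effective_batch  # examples seen in this run
steps_per_epoch = math.ceil(dataset_size / effective_batch)

print(effective_batch)   # 8
print(examples_seen)     # 800
print(steps_per_epoch)   # 1818 (~1,820 in round numbers)
print(f"{examples_seen / dataset_size:.1%}")  # 5.5% of the dataset
```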
Training Results
Loss decreased consistently from 3.47 → 1.57, showing the model was learning the legal domain patterns.
| Step | Training Loss | Trend |
|---|---|---|
| 1 | 3.4753 | — |
| 10 | 2.3187 | ↓ Decreasing fast |
| 25 | 1.9115 | ↓ Continuing |
| 50 | 1.6189 | ↓ Stabilizing |
| 75 | 1.6946 | ~ Fluctuating |
| 100 | 1.5784 | ↓ Final |
Limitation: 100 steps is a learning experiment. Loss is still high — a proper training run needs ~1,820 steps (1 full epoch) on Kaggle's free GPU (30 hrs/week).
Inference Test
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"### Instruction:\nWhat is IPC Section 302?\n\n### Response:\n",
return_tensors="pt"
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
✅ Output: "IPC Section 302 deals with punishment for murder."
The model answered correctly. repetition_penalty=1.3 is needed to prevent answer looping (a side effect of the limited training steps).
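Conceptually, repetition_penalty down-weights tokens that have already been generated before the next token is sampled. A toy sketch of the rule Hugging Face's logits processor applies (positive logits are divided by the penalty, negative logits multiplied — either way the repeated token becomes less likely); the logit values here are made up for illustration:

```python
# Apply a repetition penalty to the logits of already-generated tokens.
#   logit > 0  ->  logit / penalty
#   logit <= 0 ->  logit * penalty

def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.6, -1.3, 0.5]   # toy vocabulary of 3 tokens
generated = [0, 1]          # tokens 0 and 1 were already emitted
# token 0: 2.6 / 1.3 -> 2.0; token 1: -1.3 * 1.3 -> ~-1.69; token 2 untouched
print(apply_repetition_penalty(logits, generated))
```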
Errors Encountered & Fixes
Real engineering — not a clean tutorial. Every error hit during training and the fix applied:
| Error | Root Cause | Fix Applied |
|---|---|---|
| IndexError: Index 1999 out of range | dataset.select(range(2000)) on a 1,000-item dataset | Changed to range(len(dataset)) |
| ImportError: cannot import SFTConfig from trl.trainer | trl<0.9.0 pinned — wrong import path | Import from trl directly |
| ImportError: cannot import SFTConfig from trl | trl version too old — no SFTConfig exists | Upgraded trl, restarted runtime |
| TypeError: unexpected keyword 'tokenizer' | New TRL renamed the argument | Changed to processing_class |
| TypeError: unexpected keyword 'processing_class' | Old TRL version — argument didn't exist yet | Version conflict — upgraded trl |
| ValueError: padding_free=True without packing | Unsloth sets padding_free=True by default | Added padding_free=False in SFTConfig |
| ValueError: cannot fine-tune quantized model | Cell 3 (LoRA setup) not re-run after restart | Re-run Cell 3 after every restart |
Model Save & Publish
# Save locally in Colab
model.save_pretrained("lora_legal_india_final")
tokenizer.save_pretrained("lora_legal_india_final")
# Push to Hugging Face
model.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")
tokenizer.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")
| File Uploaded | Size | What It Is |
|---|---|---|
| adapter_model.safetensors | 45.1 MB | Trained LoRA adapter weights |
| tokenizer.json | 17.2 MB | Tokenizer vocabulary |
| README.md | ~2 KB | Model card with usage info |
Note: Only the LoRA adapter is saved — not the full base model. Anyone loading this model needs (1) the base model unsloth/llama-3.2-1b-instruct and (2) this adapter. Hugging Face handles this automatically.
How to Use
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "invincibleambuj/llama-3.2-1b-legal-india-qlora"
)
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"### Instruction:\nWhat is Article 21 of Indian Constitution?\n\n### Response:\n",
return_tensors="pt"
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
License Compliance (Llama 3.2)
| Requirement | Status |
|---|---|
| Model name must start with "Llama" | ✅ Compliant — llama-3.2-1b-legal-india-qlora |
| "Built with Llama" displayed on model page | ✅ Added to README |
| Attribution notice included | ✅ In README |
| Monthly users > 700M? (requires Meta license) | N/A — Personal project |
Next Steps
- Full training: Change max_steps = 1820 on Kaggle's free GPU (30 hrs/week) — runs 1 complete epoch over all 14,543 examples
- Better evaluation: Test on unseen legal questions to measure actual accuracy
- GGUF export: Convert to GGUF format for running locally on CPU with llama.cpp
- Gradio demo: Build a simple Hugging Face Space with a chat interface