LLM Fine-Tuning — QLoRA Indian Legal
A first-principles fine-tuning experiment — QLoRA (4-bit quantization + LoRA adapters) applied to Meta's Llama 3.2 1B Instruct on Indian Legal data. Trained end-to-end on the Google Colab free tier at ₹0 cost. Model published on Hugging Face.
🔗 Live Model: huggingface.co/invincibleambuj/llama-3.2-1b-legal-india-qlora
Project Overview
| Metric | Value |
|---|---|
| Base Model | Llama 3.2 1B Instruct (Meta) |
| Method | QLoRA — 4-bit quantization + LoRA rank 16 |
| Dataset | 14,543 Indian Legal QA pairs |
| Trainable Parameters | 11.27M out of 1.247B (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Training Steps | 100 (learning experiment) |
| Training Time | ~1 min 37 sec on T4 |
| Final Loss | 1.578 (Step 100) |
| GPU | Google Colab Tesla T4 (Free) |
| Training Cost | ₹0 |
What is QLoRA?
Standard fine-tuning requires loading the full model (1B+ params) into GPU memory — impossible on free-tier hardware. QLoRA solves this in two steps:
Normal Fine-tuning:
Full model (1B params) + gradients + optimizer states → ~16 GB+ VRAM → ❌ Not possible on free GPU
LoRA:
Freeze base model → Add small adapter layers → Train only adapters
1B params → Only 11M adapter params trained → ✅ Much less VRAM
QLoRA (what we used):
Step 1: Compress base model to 4-bit (Quantization) → ~300MB in memory
Step 2: Add LoRA adapters on top → Train only 11.27M params
Result: 0.90% of model trained → ✅ Runs on FREE T4 GPU
The base model's general knowledge (language, reasoning) is preserved. The adapter layers steer the model's responses toward the Indian Legal domain. The two work together at inference time.
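The base-plus-adapter idea can be sketched in plain Python (a toy illustration with tiny matrices, not the Unsloth implementation — the real model uses d = 2048 and r = 16). One detail worth seeing: because the LoRA "up" matrix B is zero-initialized, the adapter contributes nothing at step 0, so training starts from a model identical to the frozen base.

```python
# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x)

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

d, r, alpha = 4, 2, 16
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.1] * d for _ in range(r)]   # LoRA "down" matrix (r x d), trained
B = [[0.0] * r for _ in range(d)]   # LoRA "up" matrix (d x r), zero-initialized

def lora_forward(x):
    base = matvec(W, x)              # frozen base path
    delta = matvec(B, matvec(A, x))  # low-rank adapter path
    scale = alpha / r
    return [b + scale * dlt for b, dlt in zip(base, delta)]

x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(x))  # B is all zeros, so this equals the base model's output
```

Training updates only A and B; W never changes, which is why only 0.90% of the parameters need gradients.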
| Concept | In This Project |
|---|---|
| Base Model Size | 1.247 Billion parameters |
| After 4-bit Quantization | ~300 MB in GPU memory |
| LoRA Rank (r) | 16 |
| Trainable Parameters | 11,272,192 (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
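The 11,272,192 figure can be reproduced from Llama 3.2 1B's architecture (hidden size 2048, intermediate size 8192, grouped-query attention with 512-dim k/v projections, 16 layers): a LoRA adapter on a d_in × d_out projection adds r × (d_in + d_out) parameters.

```python
# Reproduce the trainable-parameter count: r * (d_in + d_out) per adapted projection.
r = 16
layers = 16
# (d_in, d_out) for each target module in Llama 3.2 1B
modules = {
    "q_proj":    (2048, 2048),
    "k_proj":    (2048, 512),   # grouped-query attention: smaller k/v heads
    "v_proj":    (2048, 512),
    "o_proj":    (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj":   (2048, 8192),
    "down_proj": (8192, 2048),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
frozen = 1_235_814_400
print(total)                           # 11272192 trainable LoRA parameters
print(f"{total / (total + frozen):.2%}")  # 0.90% of all parameters
```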
Environment
| Component | Detail |
|---|---|
| Platform | Google Colab Free Tier |
| GPU | Tesla T4 — 14.5 GB VRAM |
| CUDA Version | 12.8 |
| Python | 3.12 |
| Torch | 2.10.0+cu128 |
| Unsloth | 2026.4.4 |
| TRL | Latest (upgraded from <0.9.0) |
| Training Cost | ₹0 — Free GPU |
Dataset
Sourced from Kaggle — three JSON files of Indian Legal question-answer pairs:
| File | Content | Source Law |
|---|---|---|
| constitution_qa.json | Constitutional QA | Constitution of India |
| ipc_qa.json | Penal Code QA | Indian Penal Code (IPC) |
| crpc_qa.json | Procedure QA | Code of Criminal Procedure (CrPC) |
| Total | 14,543 examples | — |
Data Format Conversion
Raw JSON → Alpaca format → Training prompt:
// Raw Data
{
"question": "What is India according to the Union and its Territory?",
"answer": "India, that is Bharat, shall be a Union of States."
}
// Converted to Alpaca Format
{
"instruction": "What is India according to the Union and its Territory?",
"input": "",
"output": "India, that is Bharat, shall be a Union of States."
}
Final Training Prompt Format:
### Instruction:
What is India according to the Union and its Territory?
### Response:
India, that is Bharat, shall be a Union of States.
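The two conversion steps above can be sketched as small Python helpers (hypothetical function names, not the notebook's exact code):

```python
# Step 1: map a raw {question, answer} record to the Alpaca schema.
def to_alpaca(raw):
    return {
        "instruction": raw["question"],
        "input": "",                    # no extra context in this dataset
        "output": raw["answer"],
    }

# Step 2: render an Alpaca-format example into the final training string.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def format_prompt(example):
    return PROMPT_TEMPLATE.format(**example)

raw = {
    "question": "What is India according to the Union and its Territory?",
    "answer": "India, that is Bharat, shall be a Union of States.",
}
print(format_prompt(to_alpaca(raw)))
```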
Cell-by-Cell Training Walkthrough
Model Loading (4-bit)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3.2-1b-instruct",
max_seq_length = 2048,
load_in_4bit = True, # ← This is the "q" in qLoRA
)
Downloaded Llama 3.2 1B (1.10 GB) and loaded it in 4-bit precision. load_in_4bit=True is the quantization step that makes free GPU training possible.
LoRA Adapter Setup
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj",
"o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = True,
)
LoRA adapters attached to 7 layer types across 16 transformer layers. Result: 11.27M trainable parameters added.
Important: This cell must be re-run after every Runtime restart. If skipped, training fails with "cannot fine-tune quantized model" error.
Training Configuration
| Parameter | Value | What it Means |
|---|---|---|
| per_device_train_batch_size | 2 | 2 examples processed per GPU step |
| gradient_accumulation_steps | 4 | Accumulate 4 steps = effective batch of 8 |
| max_steps | 100 | Total training steps (learning experiment) |
| learning_rate | 2e-4 (0.0002) | How fast adapter weights update |
| optim | adamw_8bit | Memory-efficient 8-bit optimizer |
| fp16 | True (T4 GPU) | Half precision — faster training |
| lr_scheduler_type | linear | Learning rate decreases linearly |
| warmup_steps | 5 | Gradual LR warmup at start |
- Effective Batch Size: 2 (batch) × 4 (accumulation) = 8 examples per weight update
- Examples actually seen: 100 steps × 8 = 800 out of 14,543 (5.5% of full dataset)
- Full epoch would require: ~1,820 steps (14,543 ÷ 8)
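The arithmetic above, spelled out with the numbers from the training config:

```python
import math

batch_size = 2        # per_device_train_batch_size
grad_accum = 4        # gradient_accumulation_steps
max_steps = 100
dataset_size = 14_543

effective_batch = batch_size * grad_accum    # examples per weight update
examples_seen = max_steps * effective_batch  # examples seen in this run
steps_per_epoch = math.ceil(dataset_size / effective_batch)

print(effective_batch)   # 8
print(examples_seen)     # 800
print(steps_per_epoch)   # 1818 (~1,820 in round numbers)
print(f"{examples_seen / dataset_size:.1%}")  # 5.5% of the dataset
```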
Training Results
Loss decreased consistently from 3.47 → 1.57, showing the model was learning the legal domain patterns.
| Step | Training Loss | Trend |
|---|---|---|
| 1 | 3.4753 | — |
| 10 | 2.3187 | ↓ Decreasing fast |
| 25 | 1.9115 | ↓ Continuing |
| 50 | 1.6189 | ↓ Stabilizing |
| 75 | 1.6946 | ~ Fluctuating |
| 100 | 1.5784 | ↓ Final |
Limitation: 100 steps is a learning experiment. Loss is still high — a proper training run needs ~1,820 steps (1 full epoch) on Kaggle's free GPU (30 hrs/week).
Inference Test
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"### Instruction:\nWhat is IPC Section 302?\n\n### Response:\n",
return_tensors="pt"
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
✅ Output: "IPC Section 302 deals with punishment for murder."
The model answered correctly. repetition_penalty=1.3 is needed to prevent answer looping (a side effect of the limited training steps).
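Conceptually, repetition_penalty down-weights tokens that have already been generated before the next token is sampled. A toy sketch of the rule Hugging Face's logits processor applies (positive logits are divided by the penalty, negative logits multiplied — either way the repeated token becomes less likely); the logit values here are made up for illustration:

```python
# Apply a repetition penalty to the logits of already-generated tokens.
#   logit > 0  ->  logit / penalty
#   logit <= 0 ->  logit * penalty

def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.6, -1.3, 0.5]   # toy vocabulary of 3 tokens
generated = [0, 1]          # tokens 0 and 1 were already emitted
# token 0: 2.6 / 1.3 -> 2.0; token 1: -1.3 * 1.3 -> ~-1.69; token 2 untouched
print(apply_repetition_penalty(logits, generated))
```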
Errors Encountered & Fixes
Real engineering — not a clean tutorial. Every error hit during training and the fix applied:
| Error | Root Cause | Fix Applied |
|---|---|---|
| IndexError: Index 1999 out of range | dataset.select(range(2000)) on a 1,000-item dataset | Changed to range(len(dataset)) |
| ImportError: cannot import SFTConfig from trl.trainer | trl<0.9.0 pinned — wrong import path | Import from trl directly |
| ImportError: cannot import SFTConfig from trl | trl version too old — no SFTConfig exists | Upgraded trl, restarted runtime |
| TypeError: unexpected keyword 'tokenizer' | New TRL renamed the argument | Changed to processing_class |
| TypeError: unexpected keyword 'processing_class' | Old TRL version — argument didn't exist yet | Version conflict — upgraded trl |
| ValueError: padding_free=True without packing | Unsloth sets padding_free=True by default | Added padding_free=False in SFTConfig |
| ValueError: cannot fine-tune quantized model | Cell 3 (LoRA setup) not re-run after restart | Re-run Cell 3 after every restart |
Model Save & Publish
# Save locally in Colab
model.save_pretrained("lora_legal_india_final")
tokenizer.save_pretrained("lora_legal_india_final")
# Push to Hugging Face
model.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")
tokenizer.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")
| File Uploaded | Size | What It Is |
|---|---|---|
| adapter_model.safetensors | 45.1 MB | Trained LoRA adapter weights |
| tokenizer.json | 17.2 MB | Tokenizer vocabulary |
| README.md | ~2 KB | Model card with usage info |
Note: Only the LoRA adapter is saved — not the full base model. Anyone loading this model needs (1) the base model unsloth/llama-3.2-1b-instruct and (2) this adapter. Hugging Face handles this automatically.
How to Use
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "invincibleambuj/llama-3.2-1b-legal-india-qlora"
)
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"### Instruction:\nWhat is Article 21 of Indian Constitution?\n\n### Response:\n",
return_tensors="pt"
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
License Compliance (Llama 3.2)
| Requirement | Status |
|---|---|
| Model name must start with "Llama" | ✅ Compliant — llama-3.2-1b-legal-india-qlora |
| "Built with Llama" displayed on model page | ✅ Added to README |
| Attribution notice included | ✅ In README |
| Monthly users > 700M? (requires Meta license) | N/A — Personal project |
Next Steps
- Full training: Change max_steps = 1820 on Kaggle's free GPU (30 hrs/week) — runs 1 complete epoch over all 14,543 examples
- Better evaluation: Test on unseen legal questions to measure actual accuracy
- GGUF export: Convert to GGUF format for running locally on CPU with llama.cpp
- Gradio demo: Build a simple Hugging Face Space with a chat interface