
LLM Fine-Tuning — QLoRA Indian Legal

A first-principles fine-tuning experiment — QLoRA (4-bit quantization + LoRA adapters) applied to Meta's Llama 3.2 1B Instruct on Indian Legal data. Trained end-to-end on the Google Colab free tier at ₹0 cost. Model published on Hugging Face.

🔗 Live Model: huggingface.co/invincibleambuj/llama-3.2-1b-legal-india-qlora


Project Overview

| Metric | Value |
|---|---|
| Base Model | Llama 3.2 1B Instruct (Meta) |
| Method | QLoRA — 4-bit quantization + LoRA rank 16 |
| Dataset | 14,543 Indian Legal QA pairs |
| Trainable Parameters | 11.27M out of 1.247B (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Training Steps | 100 (learning experiment) |
| Training Time | ~1 min 37 sec on T4 |
| Final Loss | 1.578 (Step 100) |
| GPU | Google Colab Tesla T4 (Free) |
| Training Cost | ₹0 |

What is QLoRA?

Standard fine-tuning requires loading the full model (1B+ params) into GPU memory — impossible on free-tier hardware. QLoRA solves this in two steps:

Normal Fine-tuning:
Full model (1B params) + gradients + optimizer states → more VRAM than a free T4's ~15 GB → ❌ Not possible on free GPU

LoRA:
Freeze base model → Add small adapter layers → Train only adapters
1B params → Only 11M adapter params trained → ✅ Much less VRAM

QLoRA (what we used):
Step 1: Compress base model to 4-bit (Quantization) → ~300MB in memory
Step 2: Add LoRA adapters on top → Train only 11.27M params
Result: 0.90% of model trained → ✅ Runs on FREE T4 GPU

The base model's general knowledge (language, reasoning) is preserved. The adapter layers steer the model's responses toward the Indian Legal domain. The two work together at inference time.
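The adapter mechanics can be sketched in a few lines of NumPy (an illustrative single layer using this project's hidden size and rank, not the actual Unsloth implementation): the frozen weight W is untouched, while a low-rank update B·A, scaled by alpha/r, is added to its output. B starts at zero, so training begins from the base model's exact behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 2048, 16, 16   # hidden size, LoRA rank, scaling (as in this project)
x = rng.standard_normal(d)

W = rng.standard_normal((d, d))          # frozen base weight (kept in 4-bit in QLoRA)
A = rng.standard_normal((r, d)) * 0.01   # trainable: d -> r (down-projection)
B = np.zeros((d, r))                     # trainable: r -> d, initialised to zero

# LoRA forward pass: base output + scaled low-rank correction
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapter starts as a no-op: output equals the base model's
assert np.allclose(y, W @ x)

# Parameter comparison for this single layer
print(f"frozen: {W.size:,}  trainable: {A.size + B.size:,}")
# prints: frozen: 4,194,304  trainable: 65,536
```

Only A and B receive gradients; W is read-only, which is why the optimizer state stays tiny.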

| Concept | In This Project |
|---|---|
| Base Model Size | 1.247 Billion parameters |
| After 4-bit Quantization | ~300 MB in GPU memory |
| LoRA Rank (r) | 16 |
| Trainable Parameters | 11,272,192 (0.90%) |
| Frozen Parameters | 1,235,814,400 (99.10%) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
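The 11,272,192 figure can be reproduced from Llama 3.2 1B's published architecture (hidden size 2048, 16 transformer layers, MLP intermediate size 8192, grouped-query KV dimension 512; these numbers come from the public model config, not from this run). Each LoRA adapter on a d_in × d_out projection adds r × (d_in + d_out) parameters:

```python
# Llama 3.2 1B dimensions (from the public model config)
hidden, kv_dim, mlp, layers, r = 2048, 512, 8192, 16, 16

# (d_in, d_out) of each target module in one transformer layer
modules = {
    "q_proj":    (hidden, hidden),
    "k_proj":    (hidden, kv_dim),   # grouped-query attention: smaller K/V
    "v_proj":    (hidden, kv_dim),
    "o_proj":    (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj":   (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# LoRA adds an (r x d_in) matrix A and a (d_out x r) matrix B per module
per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers

print(total)  # prints 11272192, matching the trainable-parameter count above
```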

Environment

| Component | Detail |
|---|---|
| Platform | Google Colab Free Tier |
| GPU | Tesla T4 — 14.5 GB VRAM |
| CUDA Version | 12.8 |
| Python | 3.12 |
| Torch | 2.10.0+cu128 |
| Unsloth | 2026.4.4 |
| TRL | Latest (upgraded from <0.9.0) |
| Training Cost | ₹0 — Free GPU |

Dataset

Sourced from Kaggle — three JSON files of Indian Legal question-answer pairs:

| File | Content | Source Law |
|---|---|---|
| constitution_qa.json | Constitutional QA | Constitution of India |
| ipc_qa.json | Penal Code QA | Indian Penal Code (IPC) |
| crpc_qa.json | Procedure QA | Code of Criminal Procedure (CrPC) |
| Total | 14,543 examples | |

Data Format Conversion

Raw JSON → Alpaca format → Training prompt:

// Raw Data
{
  "question": "What is India according to the Union and its Territory?",
  "answer": "India, that is Bharat, shall be a Union of States."
}

// Converted to Alpaca Format
{
  "instruction": "What is India according to the Union and its Territory?",
  "input": "",
  "output": "India, that is Bharat, shall be a Union of States."
}
Final Training Prompt Format:

### Instruction:
What is India according to the Union and its Territory?

### Response:
India, that is Bharat, shall be a Union of States.
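The conversion and templating above can be sketched as follows (field names come from the raw JSON shown earlier; the helper names themselves are illustrative, not from the notebook):

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"

def to_alpaca(raw: dict) -> dict:
    """Map a raw QA pair onto the Alpaca instruction/input/output schema."""
    return {
        "instruction": raw["question"],
        "input": "",                 # no extra context in this dataset
        "output": raw["answer"],
    }

def to_prompt(example: dict) -> str:
    """Render an Alpaca-format example as the training prompt string."""
    return PROMPT_TEMPLATE.format(**example)

raw = {
    "question": "What is India according to the Union and its Territory?",
    "answer": "India, that is Bharat, shall be a Union of States.",
}
print(to_prompt(to_alpaca(raw)))
```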

Cell-by-Cell Training Walkthrough

Model Loading (4-bit)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3.2-1b-instruct",
    max_seq_length = 2048,
    load_in_4bit = True,  # ← this is the "Q" in QLoRA
)

Downloaded Llama 3.2 1B (1.10 GB) and loaded it in 4-bit precision. load_in_4bit=True is the quantization step that makes free GPU training possible.
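For reference, `load_in_4bit=True` corresponds roughly to the following bitsandbytes configuration in plain `transformers` (a sketch only; Unsloth picks its own defaults, which may differ from these):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Roughly what load_in_4bit=True sets up under the hood (assumed defaults)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on the T4
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3.2-1b-instruct",
    quantization_config=bnb_config,
)
```

The weights live in 4-bit, but matmuls are dequantized on the fly to fp16, which is why quality loss is small.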

LoRA Adapter Setup

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj",
                      "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)

LoRA adapters attached to 7 layer types across 16 transformer layers. Result: 11.27M trainable parameters added.

Important: This cell must be re-run after every Runtime restart. If skipped, training fails with "cannot fine-tune quantized model" error.


Training Configuration

| Parameter | Value | What It Means |
|---|---|---|
| per_device_train_batch_size | 2 | 2 examples processed per GPU step |
| gradient_accumulation_steps | 4 | Accumulate 4 steps = effective batch of 8 |
| max_steps | 100 | Total training steps (learning experiment) |
| learning_rate | 2e-4 (0.0002) | How fast adapter weights update |
| optim | adamw_8bit | Memory-efficient 8-bit optimizer |
| fp16 | True (T4 GPU) | Half precision — faster training |
| lr_scheduler_type | linear | Learning rate decreases linearly |
| warmup_steps | 5 | Gradual LR warmup at start |

  • Effective Batch Size: 2 (batch) × 4 (accumulation) = 8 examples per weight update
  • Examples actually seen: 100 steps × 8 = 800 out of 14,543 (5.5% of full dataset)
  • Full epoch would require: ~1,820 steps (14,543 ÷ 8)
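The batch arithmetic above as a quick sanity check:

```python
import math

batch, accum, steps, dataset = 2, 4, 100, 14_543

effective_batch = batch * accum                # examples per weight update
seen = steps * effective_batch                 # examples seen in this short run
steps_per_epoch = math.ceil(dataset / effective_batch)

print(effective_batch)                  # prints 8
print(seen, f"({seen / dataset:.1%})")  # prints 800 (5.5%)
print(steps_per_epoch)                  # prints 1818, i.e. ~1,820 steps per epoch
```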

Training Results

Loss decreased consistently from 3.47 → 1.57, showing the model was learning the legal domain patterns.

| Step | Training Loss | Trend |
|---|---|---|
| 1 | 3.4753 | |
| 10 | 2.3187 | ↓ Decreasing fast |
| 25 | 1.9115 | ↓ Continuing |
| 50 | 1.6189 | ↓ Stabilizing |
| 75 | 1.6946 | ~ Fluctuating |
| 100 | 1.5784 | ↓ Final |

Limitation: 100 steps is a learning experiment. Loss is still high — a proper training run needs ~1,820 steps (1 full epoch) on Kaggle's free GPU (30 hrs/week).


Inference Test

FastLanguageModel.for_inference(model)

inputs = tokenizer(
    "### Instruction:\nWhat is IPC Section 302?\n\n### Response:\n",
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

✅ Output: "IPC Section 302 deals with punishment for murder."

The model answered correctly. repetition_penalty=1.3 is needed to prevent the answer from looping (a side effect of the limited training steps).


Errors Encountered & Fixes

Real engineering — not a clean tutorial. Every error hit during training and the fix applied:

| Error | Root Cause | Fix Applied |
|---|---|---|
| IndexError: Index 1999 out of range | dataset.select(range(2000)) on 1000-item dataset | Changed to range(len(dataset)) |
| ImportError: cannot import SFTConfig from trl.trainer | trl<0.9.0 pinned — wrong import path | Import from trl directly |
| ImportError: cannot import SFTConfig from trl | trl version too old — no SFTConfig exists | Upgraded trl, restarted runtime |
| TypeError: unexpected keyword 'tokenizer' | New TRL renamed the argument | Changed to processing_class |
| TypeError: unexpected keyword 'processing_class' | Old TRL version — argument didn't exist yet | Version conflict — upgraded trl |
| ValueError: padding_free=True without packing | Unsloth sets padding_free=True by default | Added padding_free=False in SFTConfig |
| ValueError: cannot fine-tune quantized model | Cell 3 (LoRA setup) not re-run after restart | Re-run Cell 3 after every restart |

Model Save & Publish

# Save locally in Colab
model.save_pretrained("lora_legal_india_final")
tokenizer.save_pretrained("lora_legal_india_final")

# Push to Hugging Face
model.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")
tokenizer.push_to_hub("invincibleambuj/llama-3.2-1b-legal-india-qlora", token="HF_TOKEN")

| File Uploaded | Size | What It Is |
|---|---|---|
| adapter_model.safetensors | 45.1 MB | Trained LoRA adapter weights |
| tokenizer.json | 17.2 MB | Tokenizer vocabulary |
| README.md | ~2 KB | Model card with usage info |

Note: Only the LoRA adapter is saved — not the full base model. Anyone loading this model needs: (1) base model unsloth/llama-3.2-1b-instruct + (2) this adapter. Hugging Face handles this automatically.
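The 45.1 MB adapter size is consistent with the parameter count: 11,272,192 trainable parameters stored as 32-bit floats (the usual dtype for saved LoRA adapters; an assumption here, not read from the file) come to almost exactly that figure:

```python
params = 11_272_192          # trainable LoRA parameters
bytes_total = params * 4     # 4 bytes per fp32 weight

print(f"{bytes_total / 1e6:.1f} MB")  # prints 45.1 MB, matching adapter_model.safetensors
```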


How to Use

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "invincibleambuj/llama-3.2-1b-legal-india-qlora"
)
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    "### Instruction:\nWhat is Article 21 of Indian Constitution?\n\n### Response:\n",
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200, repetition_penalty=1.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License Compliance (Llama 3.2)

| Requirement | Status |
|---|---|
| Model name must start with "Llama" | ✅ Compliant — llama-3.2-1b-legal-india-qlora |
| "Built with Llama" displayed on model page | ✅ Added to README |
| Attribution notice included | ✅ In README |
| Monthly users > 700M? (requires Meta license) | N/A — Personal project |

Next Steps

  • Full training: Change max_steps = 1820 on Kaggle free GPU (30 hrs/week) — runs 1 complete epoch over all 14,543 examples
  • Better evaluation: Test on unseen legal questions to measure actual accuracy
  • GGUF export: Convert to GGUF format for running locally on CPU with llama.cpp
  • Gradio demo: Build a simple Hugging Face Space with a chat interface