Most fine-tuning tutorials start too late.
They begin with a clean JSONL file, a well-labeled dataset, and a training config. You see the Axolotl YAML, the LoRA hyperparameters, maybe a wandb loss curve. That part is relatively straightforward once you’re there.
What they skip is getting there.
In real projects, data doesn’t come pre-labeled in Alpaca format. It comes from production systems, databases, log files, CRM exports. It’s messy, inconsistently structured, and sitting behind a format that no ML framework wants to read directly. The hardest work in this pipeline wasn’t the training run. It was everything before the training run even became a reasonable idea.
This is a walkthrough of that work: a Go pipeline that starts with a gzip-compressed SQL dump of travel support chat history and ends with a LoRA adapter on a Qwen3-8B base model, thun-ai-concierge-8B-think.
Why fine-tune at all
Before building anything, I wanted to be honest about whether fine-tuning was actually the right tool.
RAG (retrieval-augmented generation) is the default answer for most knowledge-grounding problems, and for good reason. It’s cheaper, faster to iterate on, and easier to update. If the goal is “give the model access to travel information,” RAG handles that well.
But travel support isn’t just an information retrieval problem. It has a behavioral shape to it. Support conversations have a structure: acknowledging the customer’s situation, asking the right clarifying question, proposing a resolution in a specific format, escalating when needed. That’s not just what to say, it’s how to respond.
RAG can’t teach a model how to respond. It can tell a model what facts exist. Fine-tuning on real support interactions teaches it the response pattern itself.
The other factor: we had the data. Years of real customer-support conversations, in production, sitting in a database. That’s a training signal most teams don’t have access to. Not using it felt like a mistake.
The data problem
The source was chat.sql.gz, a gzip-compressed MySQL dump of the chat_messages table.
Not clean JSON. Not a CSV with nice column headers. A raw SQL file with multi-line INSERT statements, escaped special characters, inconsistent column ordering depending on when the export was made, and multiple timestamp formats across different database migrations.
I wrote the extractor in Go for two reasons: the streaming and parsing characteristics, and the single-binary deployment story for distributed processing. More on the deployment part later.
Streaming the SQL
The reader opens the gzip file and streams it through a 4 MB buffer. This matters because the SQL dump is too large to load into memory and we want continuous throughput to the parser without waiting for a full file read.
// reader/reader.go
func NewSQLGZReader(path string) (*SQLGZReader, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	gz, err := gzip.NewReader(f)
	if err != nil {
		f.Close() // don't leak the file handle when the gzip header is invalid
		return nil, err
	}
	return &SQLGZReader{
		file:   f,
		gz:     gz,
		reader: bufio.NewReaderSize(gz, 4*1024*1024), // 4 MB read buffer
	}, nil
}
Parsing multi-line INSERTs
MySQL dumps don’t produce one INSERT per row. By default, mysqldump batches many rows into each INSERT statement, which keeps the dump smaller and makes it faster to reload. The parser has to accumulate lines until it sees a complete INSERT block, then split the values.
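The accumulation step can be sketched roughly as follows. This is a minimal illustration rather than the pipeline's actual parser: `readStatements` and `collectStatements` are hypothetical names, and it assumes a statement ends on a line whose last character is `;` and that newlines inside values are escaped, so a row never spans lines on its own.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readStatements streams r line by line and calls emit once per complete
// statement. Assumption: a statement ends on a line whose last character is
// ";". Unterminated trailing text is dropped at EOF.
func readStatements(r io.Reader, emit func(stmt string)) error {
	br := bufio.NewReaderSize(r, 4*1024*1024) // 4 MB buffer, as in the reader
	var pending strings.Builder
	for {
		line, err := br.ReadString('\n')
		if err != nil && err != io.EOF {
			return err
		}
		trimmed := strings.TrimRight(line, "\r\n")
		// Skip blank lines, and "--" comment lines between statements.
		skip := trimmed == "" || (pending.Len() == 0 && strings.HasPrefix(trimmed, "--"))
		if !skip {
			if pending.Len() > 0 {
				pending.WriteByte('\n')
			}
			pending.WriteString(trimmed)
			if strings.HasSuffix(trimmed, ";") {
				emit(pending.String())
				pending.Reset()
			}
		}
		if err == io.EOF {
			return nil
		}
	}
}

// collectStatements is a convenience wrapper for trying this on a string.
func collectStatements(sql string) []string {
	var out []string
	_ = readStatements(strings.NewReader(sql), func(s string) { out = append(out, s) })
	return out
}

func main() {
	dump := "-- sample dump\nINSERT INTO chat_messages VALUES\n(1,'hi'),\n(2,'hello');\nINSERT INTO chat_messages VALUES (3,'bye');\n"
	for i, stmt := range collectStatements(dump) {
		fmt.Printf("statement %d: %d bytes\n", i, len(stmt))
	}
}
```

Using `bufio.Reader.ReadString` instead of `bufio.Scanner` avoids the scanner's token-size limit, which matters when a single batched INSERT line runs to megabytes.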
The tricky part is column detection. The dump format includes the column list in the INSERT header, INSERT INTO chat_messages (id, chat_id, sender, body, created_at) VALUES ..., but column names varied depending on when the table was created. Older exports used conversation_id instead of chat_id, role instead of sender. The parser auto-detects positions from the column list so the downstream code doesn’t have to care about schema version.
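One way to sketch the column auto-detection: read the column list from the INSERT header, rewrite historical names to their current equivalents, and return a name-to-position map. The function name `detectColumns` and the `columnAliases` table are hypothetical; the only aliases included are the two mentioned above, and the sketch assumes the keyword `VALUES` appears in upper case.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// columnAliases maps older column names to the names used downstream.
var columnAliases = map[string]string{
	"conversation_id": "chat_id",
	"role":            "sender",
}

// detectColumns returns canonical column name -> position, taken from the
// parenthesized column list that precedes VALUES in an INSERT statement.
func detectColumns(stmt string) (map[string]int, error) {
	valuesAt := strings.Index(stmt, "VALUES")
	open := strings.Index(stmt, "(")
	if valuesAt == -1 || open == -1 || open > valuesAt {
		return nil, errors.New("no column list before VALUES")
	}
	end := strings.Index(stmt[open:valuesAt], ")")
	if end == -1 {
		return nil, errors.New("unterminated column list")
	}
	positions := make(map[string]int)
	for i, raw := range strings.Split(stmt[open+1:open+end], ",") {
		// Drop surrounding whitespace and MySQL backtick quoting.
		name := strings.Trim(strings.TrimSpace(raw), "`")
		if canonical, ok := columnAliases[name]; ok {
			name = canonical
		}
		positions[name] = i
	}
	return positions, nil
}

func main() {
	old := "INSERT INTO chat_messages (id, conversation_id, role, body, created_at) VALUES (1, 7, 'user', 'hi', 1673778224);"
	cols, err := detectColumns(old)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chat_id at", cols["chat_id"], "sender at", cols["sender"], "body at", cols["body"])
}
```

Downstream code can then index each parsed row by `cols["chat_id"]` and never see the older names.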
SQL escaping also has to be unwound: \' → ', \\ → \, \n → newline. And timestamps came in at least three formats across the dataset: 2023-01-15 10:23:44, 2023-01-15T10:23:44Z, and Unix epoch integers from an older part of the schema. The parser handles all three.
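A minimal sketch of both steps, with hypothetical function names. It handles only the three escapes listed above and leaves any other backslash sequence untouched, and it assumes timestamps without a zone are UTC, which may not match the real data.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// unescapeSQL unwinds \' , \\ and \n in a single left-to-right pass, so an
// escaped backslash followed by "n" is not mistaken for a newline.
func unescapeSQL(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 >= len(s) {
			sb.WriteByte(s[i])
			continue
		}
		i++
		switch s[i] {
		case 'n':
			sb.WriteByte('\n')
		case '\'', '\\':
			sb.WriteByte(s[i])
		default:
			// Unknown escape: keep it verbatim.
			sb.WriteByte('\\')
			sb.WriteByte(s[i])
		}
	}
	return sb.String()
}

// parseTimestamp tries the three formats seen in the dump, in order.
func parseTimestamp(raw string) (time.Time, error) {
	if t, err := time.Parse("2006-01-02 15:04:05", raw); err == nil {
		return t, nil // zoneless: assumed UTC
	}
	if t, err := time.Parse(time.RFC3339, raw); err == nil {
		return t, nil
	}
	if secs, err := strconv.ParseInt(raw, 10, 64); err == nil {
		return time.Unix(secs, 0).UTC(), nil
	}
	return time.Time{}, fmt.Errorf("unrecognized timestamp %q", raw)
}

func main() {
	fmt.Println(unescapeSQL("can\\'t check in\\nplease help"))
	for _, raw := range []string{"2023-01-15 10:23:44", "2023-01-15T10:23:44Z", "1673778224"} {
		t, err := parseTimestamp(raw)
		fmt.Println(t.Unix(), err)
	}
}
```

All three sample inputs refer to the same instant, so they should normalize to the same Unix value.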
Memory-bounded grouping
Once messages are parsed, they need to be grouped by chat_id to reconstruct conversations. The naive approach (accumulate all messages in a map, then group) doesn’t work at scale: the table has years of history.
The grouper uses a bounded buffer. When the buffer fills past a threshold, it flushes the oldest 50% of chat groups downstream. This keeps memory usage roughly constant regardless of dataset size, at the cost of occasionally splitting a very long-running conversation across two groups. For our use case that was acceptable: conversations longer than a few hours don’t make useful training examples anyway.
// grouper/grouper.go
func (g *Grouper) Add(msg models.Message) []models.Conversation {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.buffer[msg.ChatID] = append(g.buffer[msg.ChatID], msg)
	if len(g.buffer) > g.maxSize {
		return g.flush(0.5) // flush oldest 50%
	}
	return nil
}
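The flush itself isn't shown above. One plausible shape, as a sketch under assumptions: "oldest" is taken to mean the chat groups whose most recent message has the earliest timestamp, and the `Message` and `Conversation` types here are simplified stand-ins for the real models, with `CreatedAt` as Unix seconds.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the pipeline's model types.
type Message struct {
	ChatID    string
	Body      string
	CreatedAt int64 // Unix seconds
}

type Conversation struct {
	ChatID   string
	Messages []Message
}

// flushOldest removes the given fraction of chat groups from buffer, picking
// the groups whose last message is oldest, and returns them as conversations.
func flushOldest(buffer map[string][]Message, fraction float64) []Conversation {
	type candidate struct {
		chatID string
		last   int64
	}
	candidates := make([]candidate, 0, len(buffer))
	for id, msgs := range buffer {
		if len(msgs) == 0 {
			continue
		}
		candidates = append(candidates, candidate{id, msgs[len(msgs)-1].CreatedAt})
	}
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].last != candidates[j].last {
			return candidates[i].last < candidates[j].last
		}
		return candidates[i].chatID < candidates[j].chatID // stable tie-break
	})
	n := int(float64(len(candidates)) * fraction)
	if n < 0 {
		n = 0
	}
	if n > len(candidates) {
		n = len(candidates)
	}
	flushed := make([]Conversation, 0, n)
	for _, c := range candidates[:n] {
		flushed = append(flushed, Conversation{ChatID: c.chatID, Messages: buffer[c.chatID]})
		delete(buffer, c.chatID)
	}
	return flushed
}

func main() {
	buffer := map[string][]Message{
		"chat-1": {{ChatID: "chat-1", Body: "hi", CreatedAt: 100}},
		"chat-2": {{ChatID: "chat-2", Body: "hello", CreatedAt: 200}},
		"chat-3": {{ChatID: "chat-3", Body: "help", CreatedAt: 300}},
		"chat-4": {{ChatID: "chat-4", Body: "thanks", CreatedAt: 400}},
	}
	for _, conv := range flushOldest(buffer, 0.5) {
		fmt.Println("flushed", conv.ChatID)
	}
	fmt.Println("still buffered:", len(buffer))
}
```

Sorting on every flush costs O(n log n) in the number of buffered chats, which is small next to the LLM calls that follow.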
LLM as extraction layer
Parsed and grouped conversations aren’t training examples yet. They’re raw message sequences. To turn them into structured training data, each conversation goes through an LLM call that extracts a {input, output, intent} triplet.
The prompt instructs the model to identify what the customer asked (input), what the best support response would be (output), and what category of support interaction this represents (intent). The LLM is doing the labeling work that would otherwise require a human annotation team.
I used RunPod serverless endpoints during this phase: an OpenAI-compatible API, no idle cost, and per-request scaling.
Lenient JSON parsing
The extraction step has one persistent failure mode: LLMs wrap JSON in prose. A typical response opens with “Here is the extracted training example:” and then puts the object inside a markdown json code fence. The JSON itself is valid, but it’s buried in surrounding text, so a standard json.Unmarshal on the raw response fails.
The client handles this with lenient extraction: scan the response body for the first { and the last }, extract that substring, then unmarshal it. This recovers from prose wrapping and code block formatting without any regex.
// llm/client.go
func extractJSON(body string) string {
	start := strings.Index(body, "{")
	end := strings.LastIndex(body, "}")
	if start == -1 || end == -1 || end <= start {
		return ""
	}
	return body[start : end+1]
}
After extraction, the client validates that the required fields exist (input, output, intent) and rejects examples where any of them is empty.
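The extract-then-validate flow could look something like this. `Example` and `parseExample` are hypothetical names, `extractJSON` is repeated so the sketch compiles on its own, and treating whitespace-only fields as empty is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Example is the triplet the extraction prompt asks for.
type Example struct {
	Input  string `json:"input"`
	Output string `json:"output"`
	Intent string `json:"intent"`
}

// extractJSON returns the substring from the first "{" to the last "}".
func extractJSON(body string) string {
	start := strings.Index(body, "{")
	end := strings.LastIndex(body, "}")
	if start == -1 || end == -1 || end <= start {
		return ""
	}
	return body[start : end+1]
}

// parseExample pulls the JSON object out of an LLM response, decodes it, and
// rejects it when any required field is missing or blank.
func parseExample(body string) (Example, error) {
	raw := extractJSON(body)
	if raw == "" {
		return Example{}, errors.New("no JSON object in response")
	}
	var ex Example
	if err := json.Unmarshal([]byte(raw), &ex); err != nil {
		return Example{}, fmt.Errorf("invalid JSON: %w", err)
	}
	required := []struct{ name, value string }{
		{"input", ex.Input},
		{"output", ex.Output},
		{"intent", ex.Intent},
	}
	for _, f := range required {
		if strings.TrimSpace(f.value) == "" {
			return Example{}, fmt.Errorf("missing or empty field %q", f.name)
		}
	}
	return ex, nil
}

func main() {
	response := "Here is the extracted training example: " +
		"{\"input\": \"Can I move my booking?\", \"output\": \"Sure, what is your booking reference?\", \"intent\": \"booking_modification\"}" +
		" Let me know if you need anything else."
	ex, err := parseExample(response)
	fmt.Println(ex.Intent, err)
}
```

One limitation to keep in mind: if the surrounding prose itself contains braces, the first-to-last span is no longer valid JSON, and the example is rejected at the unmarshal step rather than recovered.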
The dry-run command
Before committing to a full extraction run, I built cmd/count, a dry-run that reads the SQL dump, groups conversations, and reports:
$ make count
Scanned 847,234 messages
Grouped into 91,432 conversations
Eligible for extraction (≥2 messages): 68,817
Estimated LLM calls: 68,817
Estimated tokens (avg 340 tokens/conv): ~23.4M
Estimated cost at $0.15/1M tokens: ~$3.51
Estimated runtime at 8 workers, 2s avg: ~4.8 hours
This is one of those operational things that sounds mundane but changes the decision calculus. Knowing the cost before the run means you can make an informed choice about model quality vs. price, worker count vs. runtime, and whether to run distributed or locally.
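The arithmetic behind that report is simple enough to sketch. `estimateRun` is a hypothetical helper, not the actual cmd/count code, and it assumes one LLM call per eligible conversation and perfectly parallel workers.

```go
package main

import "fmt"

// Estimate holds the figures a dry run reports.
type Estimate struct {
	Calls   int
	Tokens  int
	CostUSD float64
	Hours   float64
}

// estimateRun derives cost and runtime from conversation count, average
// tokens per conversation, price per million tokens, worker count, and
// average seconds per call.
func estimateRun(conversations, avgTokens int, usdPerMillion float64, workers int, secondsPerCall float64) Estimate {
	tokens := conversations * avgTokens
	return Estimate{
		Calls:   conversations,
		Tokens:  tokens,
		CostUSD: float64(tokens) / 1e6 * usdPerMillion,
		Hours:   float64(conversations) * secondsPerCall / float64(workers) / 3600,
	}
}

func main() {
	// Inputs taken from the report above.
	e := estimateRun(68817, 340, 0.15, 8, 2)
	fmt.Printf("Estimated LLM calls: %d\n", e.Calls)
	fmt.Printf("Estimated tokens: ~%.1fM\n", float64(e.Tokens)/1e6)
	fmt.Printf("Estimated cost: ~$%.2f\n", e.CostUSD)
	fmt.Printf("Estimated runtime: ~%.1f hours\n", e.Hours)
}
```

Changing the worker count or per-token price in one place makes it cheap to compare scenarios before paying for any of them.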
Distributed shard processing
The full extraction run was too long for a single machine and a single process. The cmd/shard command splits the grouped conversations into N shard files:
make shard N=100 # produces data/shards/shard_000.jsonl ... shard_099.jsonl
Then ops/fleet.sh distributes work across EC2 instances:
bash ops/fleet.sh --key my-key.pem --machines machines.txt
machines.txt is a list of EC2 instance IPs. The script cross-compiles the cmd/process binary for linux/amd64, copies it to each machine with scp, and starts it. Go’s single-binary compilation made this much simpler than it would be with a Python pipeline: no dependency installation, no virtual environments, no wheel incompatibilities.
Atomic shard claiming
The coordination mechanism is intentionally minimal. No Redis, no SQS, no distributed lock manager. Work claiming uses atomic file rename.
The state/ directory contains one file per shard. A worker claims work by trying to rename shard_042.pending to shard_042.running. On Linux, rename() is atomic: if several workers race for the same shard, only one rename succeeds. The losers get an error and move on to the next shard.
data/shards/state/
├── shard_000.done
├── shard_001.done
├── shard_042.running ← claimed by instance 3
├── shard_043.pending ← available
└── shard_099.failed ← needs retry
When a worker finishes, it renames .running to .done. On failure, .failed. A separate retry sweep can reset .failed back to .pending for a rerun.
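A minimal sketch of the claim and finish steps, assuming the state directory sits on a filesystem with POSIX rename semantics that every worker can see. The function names, `errNoWork`, and the `newStateDir` demo helper are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

var errNoWork = errors.New("no pending shards")

// claimShard tries to rename each *.pending file to *.running and returns the
// first one it wins. A "not exist" error means another worker got there first.
func claimShard(stateDir string) (string, error) {
	pending, err := filepath.Glob(filepath.Join(stateDir, "*.pending"))
	if err != nil {
		return "", err
	}
	sort.Strings(pending)
	for _, src := range pending {
		dst := strings.TrimSuffix(src, ".pending") + ".running"
		err := os.Rename(src, dst)
		if err == nil {
			return dst, nil
		}
		if errors.Is(err, os.ErrNotExist) {
			continue // lost the race for this shard
		}
		return "", err
	}
	return "", errNoWork
}

// finishShard marks a running shard as done or failed.
func finishShard(running string, succeeded bool) error {
	suffix := ".failed"
	if succeeded {
		suffix = ".done"
	}
	return os.Rename(running, strings.TrimSuffix(running, ".running")+suffix)
}

// newStateDir creates a temp directory with n pending shard markers (demo only).
func newStateDir(n int) (string, error) {
	dir, err := os.MkdirTemp("", "shard-state-")
	if err != nil {
		return "", err
	}
	for i := 0; i < n; i++ {
		name := filepath.Join(dir, fmt.Sprintf("shard_%03d.pending", i))
		if err := os.WriteFile(name, nil, 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	dir, err := newStateDir(2)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	for {
		claimed, err := claimShard(dir)
		if err != nil {
			fmt.Println("stopping:", err)
			return
		}
		fmt.Println("claimed", filepath.Base(claimed))
		_ = finishShard(claimed, true)
	}
}
```

The sketch doesn't handle a worker that dies mid-shard: its .running file stays put until something, such as a timeout-based sweep, resets it.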
This design works because the bottleneck is LLM inference time (seconds per example), not filesystem operations. The overhead of a few atomic renames per shard is negligible. And there’s no coordination service to maintain, monitor, or lose a connection to.
We ran ~1,100 shards total across the full dataset.
Intent canonicalization
Raw extraction produces messy intent labels. The LLM tries to categorize each conversation, but without a fixed taxonomy, you get:
- "hotel_cancellation", "cancel_hotel", "cancellation_request": same thing
- "visa_inquiry", "ask_about_visa", "visa_question": same thing
- One-off labels that appeared twice in the dataset and aren’t useful for training
The refiner’s cmd/canonicalize step maps this noise to a fixed set of 31 canonical intent categories. The mapping is defined in a config file:
{
  "hotel_cancellation": "cancellation",
  "cancel_hotel": "cancellation",
  "cancellation_request": "cancellation",
  "visa_inquiry": "visa_support",
  "ask_about_visa": "visa_support"
}
Anything not in the map gets flagged as unknown and reviewed before deciding whether to add it to the taxonomy or discard it.
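The lookup is essentially a map access with a fallback. In this sketch the `canonicalize` function is hypothetical, the mapping contains only the five entries shown above, and lowercasing and trimming the raw label before lookup is an assumption, not something the config format requires.

```go
package main

import (
	"fmt"
	"strings"
)

// intentMap mirrors the excerpt of the mapping config shown above.
var intentMap = map[string]string{
	"hotel_cancellation":   "cancellation",
	"cancel_hotel":         "cancellation",
	"cancellation_request": "cancellation",
	"visa_inquiry":         "visa_support",
	"ask_about_visa":       "visa_support",
}

// canonicalize returns the canonical intent for a raw label. The second
// result is false when the label is unmapped and needs manual review.
func canonicalize(raw string) (string, bool) {
	key := strings.ToLower(strings.TrimSpace(raw))
	if canonical, ok := intentMap[key]; ok {
		return canonical, true
	}
	return "unknown", false
}

func main() {
	unknown := map[string]int{} // raw label -> occurrences, for review
	for _, raw := range []string{"cancel_hotel", " Visa_Inquiry ", "lost_umbrella", "lost_umbrella"} {
		intent, ok := canonicalize(raw)
		if !ok {
			unknown[strings.TrimSpace(raw)]++
		}
		fmt.Printf("%q -> %s\n", raw, intent)
	}
	fmt.Println("labels needing review:", unknown)
}
```

Counting how often each unmapped label occurs gives the review step a natural ordering: frequent unknowns are candidates for the taxonomy, and rare ones are candidates for discarding.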
After canonicalization, examples are grouped by intent and quality-filtered. Quality filtering removes examples where:
- The output is too short (a one-line support response isn’t useful training data)
- The input is ambiguous to the point of being unlabelable
- The LLM extraction confidence was low (detected by checking if the output reads like a valid support response vs. the LLM commenting on its own uncertainty)
Per-intent capping
The intent distribution isn’t uniform. Some categories (itinerary changes, general inquiries) appear far more often than others. If you train on the raw distribution, the model overfits to high-frequency categories and performs poorly on rare but important ones (visa issues, medical emergencies, supplier failures).
cmd/build caps each intent at a maximum example count before building the final dataset. The cap forces a more balanced distribution. This is a simple intervention with a significant effect on model behavior across the full intent range.
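Capping can be as small as a counter per intent. This sketch keeps the first examples it sees for each intent, in input order; that selection rule is an assumption, and shuffling beforehand would turn it into a random sample. `capPerIntent` and the trimmed-down `Example` type are hypothetical.

```go
package main

import "fmt"

// Example is a trimmed-down training record for this illustration.
type Example struct {
	Input  string
	Output string
	Intent string
}

// capPerIntent keeps at most maxPerIntent examples for each intent,
// preserving the input order of the ones it keeps.
func capPerIntent(examples []Example, maxPerIntent int) []Example {
	counts := make(map[string]int)
	kept := make([]Example, 0, len(examples))
	for _, ex := range examples {
		if counts[ex.Intent] >= maxPerIntent {
			continue
		}
		counts[ex.Intent]++
		kept = append(kept, ex)
	}
	return kept
}

func main() {
	examples := []Example{
		{Intent: "itinerary_change"},
		{Intent: "itinerary_change"},
		{Intent: "itinerary_change"},
		{Intent: "visa_support"},
	}
	after := map[string]int{}
	for _, ex := range capPerIntent(examples, 2) {
		after[ex.Intent]++
	}
	fmt.Println(after)
}
```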
The training format
cmd/build outputs Alpaca-format JSONL:
{
  "instruction": "You are a travel support assistant. Help the customer with their inquiry.",
  "input": "I booked a hotel in Maldives for next week but I need to move it to the following week. Is that possible?",
  "output": "Hi, thanks for reaching out. Let me check availability for the week you'd like to move to. Can you share your booking reference? I'll look into rescheduling options and any change fees that might apply.",
  "intent": "booking_modification"
}
The instruction field sets the system framing. input is the customer message (sometimes a multi-turn excerpt). output is the target support response. intent is included as metadata but excluded from the training objective: it’s there for analysis and filtering, not for the model to predict.
cmd/convert produces a second output in SageMaker messages format for flexibility:
{
  "messages": [
    {"role": "system", "content": "You are a travel support assistant. Help the customer with their inquiry."},
    {"role": "user", "content": "I booked a hotel in Maldives for next week..."},
    {"role": "assistant", "content": "Hi, thanks for reaching out..."}
  ]
}
Dataset shape is part of model behavior. The Alpaca format makes the instruction/response structure explicit in a way that maps directly onto how the model will be prompted in production.
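Converting between the two formats is a field-by-field mapping. A rough sketch, with hypothetical type and function names; it drops intent from the messages output, which matches the sample record above but is otherwise an assumption about what cmd/convert does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AlpacaRecord is one line of the Alpaca-format JSONL.
type AlpacaRecord struct {
	Instruction string `json:"instruction"`
	Input       string `json:"input"`
	Output      string `json:"output"`
	Intent      string `json:"intent"`
}

// ChatMessage and ChatRecord model the messages format.
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRecord struct {
	Messages []ChatMessage `json:"messages"`
}

// toMessages maps instruction -> system, input -> user, output -> assistant.
func toMessages(rec AlpacaRecord) ChatRecord {
	return ChatRecord{Messages: []ChatMessage{
		{Role: "system", Content: rec.Instruction},
		{Role: "user", Content: rec.Input},
		{Role: "assistant", Content: rec.Output},
	}}
}

func main() {
	rec := AlpacaRecord{
		Instruction: "You are a travel support assistant. Help the customer with their inquiry.",
		Input:       "I booked a hotel in Maldives for next week...",
		Output:      "Hi, thanks for reaching out...",
		Intent:      "booking_modification",
	}
	line, err := json.Marshal(toMessages(rec))
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(line))
}
```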
Training
By the time data reached the training stage, the interesting engineering work was behind us. The training config itself is relatively conventional.
Setup:
- EC2 g5.xlarge (NVIDIA A10G, 24GB VRAM) with a persistent EBS volume for checkpoint storage
- Axolotl for training orchestration
- LoRA with r=16, alpha=32, applied to all linear layers (lora_target_linear: true), not only the attention projections
- 3 epochs, fp16, linear warmup
Base model: Qwen3-8B. I chose this over Llama 3.1 8B at the time because of its stronger multilingual capability. Our support dataset includes conversations in English, Tamil, and Hindi, and Qwen3’s tokenizer handles South Asian scripts better out of the box.
The axolotl config:
base_model: Qwen/Qwen3-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
datasets:
  - path: data/refined/train.jsonl
    type: alpaca
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
sequence_len: 2048
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 100
fp16: true
Training ran for about 6 hours. The final model is saved as thun-ai-concierge-8B-think, a LoRA adapter that sits on top of the Qwen3-8B base.
Testing used a Gradio interface backed by a HuggingFace InferenceClient pointed at a local merged model. Running queries manually through Gradio was the fastest way to feel whether the model’s response style had actually shifted. Loss curves tell you the model is learning, but they don’t tell you whether the outputs feel like a support agent.
What the model can and can’t do
The fine-tuned model responds in a noticeably different way than the base Qwen3-8B. The base model is helpful but verbose: it explains things in the style of a language model. The fine-tuned model is concise, structured, and action-oriented in the way that good support responses are. It asks clarifying questions when needed. It acknowledges the customer’s situation before jumping to resolution. It doesn’t over-explain.
What it can’t do: it doesn’t have live system access. It can’t look up a specific booking by reference number. It doesn’t know what’s happened since its training cutoff. In production, it needs to be combined with RAG over live booking data and tool calls to handle actual lookups.
Fine-tuning changed the shape of responses. RAG and tool use provide the content. They’re complements, not substitutes.
What I’d do differently
Define the intent taxonomy before extraction, not after. Running the LLM extractor with no predefined taxonomy and then canonicalizing after the fact worked, but it added a full canonicalization round-trip. If the extraction prompt includes the target intent list, you can get cleaner labels on the first pass. The tradeoff is that you need to know your categories before you’ve seen the data, which is hard the first time. The second time around, you have the taxonomy from the first run.
Use a smaller sample for iteration. The pipeline from raw data to training is long. When testing pipeline changes, I was running against the full dataset each time. A 10% sample of shards for iteration cycles would have been much faster without meaningfully changing what I was validating.
Consider DSPy for the extraction prompts. The LLM-based extraction step is essentially a structured output problem with a fixed schema (input, output, intent). DSPy’s BootstrapFewShot could optimize the extraction prompt from labeled examples, the same approach I used later in Sherpa for seller scoring. Optimized extraction prompts would likely produce cleaner triplets and reduce the amount of post-hoc quality filtering needed.
The pipeline matters more than the model.
A LoRA adapter over Qwen3-8B isn’t surprising; that’s table stakes for anyone doing fine-tuning today. What made this useful was the pipeline that produced the training data: streaming SQL parsing, memory-bounded grouping, distributed shard claiming over a filesystem, and an intent canonicalization pass that turned label noise into a trainable taxonomy.
If the extraction is weak, the training stage doesn’t save you. If the intent labels are noisy, the model learns the noise. The training config is the easy part. Getting the data into a shape worth training on is where the time goes, and where the decisions that actually matter get made.