DSPy for production: solving LLM output consistency without prompt engineering

Every seller scored between 7.5 and 8.0. Every single one.

Bad conversations and good conversations, a seller who took three hours to respond to a customer in distress and a seller who resolved a complex hotel cancellation in ten minutes: the model couldn’t tell the difference. Or rather, it could tell the difference, but it wasn’t reflecting that difference in its output. It was hedging.

This is the problem that pushed me toward DSPy. Not curiosity about a new framework. A broken product behavior that manual prompt engineering was failing to fix.

The context: Sherpa and seller scoring

Sherpa is a system we built at PickYourTrail that processes WhatsApp conversations between customers and travel sellers, then scores seller performance across three dimensions: responsiveness, how much negative customer sentiment appeared in the conversation, and the overall conversational trend (did the customer feel better at the end than at the start?).

These scores feed into a seller performance dashboard used to identify coaching opportunities, flag urgent support cases, and track seller improvement over time. For the scores to be useful, they need to discriminate. A system where every seller sits between 7.5 and 8.0 is operationally worthless: it tells you nothing you didn’t already know.

The inference runs on self-hosted models via vLLM on RunPod. We were using Qwen 2.5 7B and testing with Llama 3.1 8B. Both exhibited the same pattern: safe, conservative scores that clustered in the middle of the range regardless of conversation quality.
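One way to put a number on that clustering is to look at how much of the scale a batch of scores actually uses. The following is a minimal sketch rather than anything from Sherpa: the function name `score_spread` and the sample numbers are made up for illustration.

```python
from statistics import pstdev


def score_spread(scores: list[float]) -> dict:
    """Summarize how much of the 0-10 scale a batch of scores actually uses."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "stdev": pstdev(scores),
    }


# Illustrative numbers only: a compressed batch versus a discriminating one.
compressed = score_spread([7.5, 7.8, 8.0, 7.6, 7.9])
spread_out = score_spread([2.0, 9.0, 5.5, 7.5, 3.5])
print(compressed["range"], spread_out["range"])
```

Tracking a number like this over time makes compression visible as a metric instead of an impression.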

Why prompt engineering wasn’t enough

The first response to score compression is always the same: rewrite the prompt. I tried:

Adding explicit score anchors (“A score of 2 means the seller took >4 hours to respond and did not apologize”)
Giving examples of low-scoring conversations inline
Instructing the model more strongly to use the full 0–10 range
Breaking the scoring into separate prompts per dimension

Some of these helped marginally. None of them solved it. The model would produce more varied scores for a session or two, then drift back toward the middle.

The fundamental problem with smaller models (7B–8B) is that they’ve been fine-tuned to be helpful and non-committal. Giving a harsh score feels, to the model, like a risky thing to do. Manual prompt iteration is working against the model’s training distribution, and you can push against it for a while, but it reasserts itself.

What I needed was a way to show the model what correct scoring behavior looks like: not just instructions about what it should do, but worked examples of it being done right. That’s the problem DSPy is designed for.

What DSPy does and why it matters here

DSPy (Declarative Self-improving Python) treats prompt construction as an optimization problem over labeled examples. Instead of writing a prompt by hand and hoping it generalizes, you define what inputs and outputs the task has, provide labeled training examples, and let an optimizer search for prompt structures that produce correct outputs on those examples.

The key component for our use case was BootstrapFewShot. Given a set of training examples (conversations with verified correct scores), it runs the program on them, keeps the runs that pass the metric, and embeds those as few-shot demonstrations in the prompt itself, which pushes the model toward the demonstrated behavior.

First, you define a DSPy signature for the task:

import dspy


class ChatAnalysisSignature(dspy.Signature):
    """Analyze a travel seller's WhatsApp conversation and score their performance."""

    conversation = dspy.InputField(desc="The full WhatsApp conversation transcript")
    seller_id = dspy.InputField(desc="Seller identifier")

    responsiveness_score = dspy.OutputField(
        desc="Score 0-10: response time, acknowledgment speed, follow-through"
    )
    customer_negative_score = dspy.OutputField(
        desc="Score 0-10: customer frustration, complaints, negative sentiment"
    )
    conversational_trend_score = dspy.OutputField(
        desc="Score 0-10: did customer sentiment improve through the conversation"
    )
    overall_seller_score = dspy.OutputField(
        desc="Weighted score 0-10: 40% responsiveness + 30% negative + 30% trend"
    )
    score_rationale = dspy.OutputField(
        desc="Brief reasoning for the scores given"
    )

The signature is a contract. DSPy uses it to structure how the model receives inputs and formats outputs. You’re not writing the actual prompt text; you’re describing the task semantics.

Then you wrap the signature in a module, define a metric, and build the optimizer:

from dspy.teleprompt import BootstrapFewShot


class ChatAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.analyze = dspy.ChainOfThought(ChatAnalysisSignature)

    def forward(self, conversation, seller_id):
        return self.analyze(conversation=conversation, seller_id=seller_id)

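# Collect training examples (load_training_examples is a project helper, not shown)
trainset = load_training_examples(n=20)  # 5 manual + 15 from DB

# Score diversity metric: reward accurate predictions whose three dimension
# scores are spread out rather than bunched together.
# score_accuracy is a project helper (not shown), assumed to return a value in [0, 1].
def score_diversity_metric(example, prediction, trace=None):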
    scores = [
        float(prediction.responsiveness_score),
        float(prediction.customer_negative_score),
        float(prediction.conversational_trend_score),
    ]
    range_score = max(scores) - min(scores)
    accuracy = score_accuracy(example, prediction)
    return 0.7 * accuracy + 0.3 * (range_score / 10.0)

optimizer = BootstrapFewShot(
    metric=score_diversity_metric,
    max_bootstrapped_demos=8,
    max_labeled_demos=5,
)

optimized_analyzer = optimizer.compile(ChatAnalyzer(), trainset=trainset)

The metric is the part that made the difference. A plain accuracy metric only rewards agreement with the labeled scores. Our custom metric also rewards predictions whose three dimension scores span a meaningful range. This explicitly counteracts the model’s tendency toward compression: the optimizer keeps demonstrations that do well on the metric, so the few-shot examples that end up in the prompt are ones where the model committed to differentiated scores.
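The metric calls `score_accuracy`, which isn’t shown above. Here is a minimal sketch of what such a helper could look like; the mean-absolute-error formula, the `passes_for_bootstrap` wrapper, and the 0.75 cutoff are assumptions for illustration, not the Sherpa implementation. The wrapper is there because, in the DSPy releases I’m aware of, BootstrapFewShot treats the metric’s return value as pass/fail when deciding which bootstrapped demonstrations to keep. A raw float therefore passes whenever it is non-zero, unless you also set `metric_threshold` on the optimizer.

```python
from types import SimpleNamespace

SCORE_FIELDS = (
    "responsiveness_score",
    "customer_negative_score",
    "conversational_trend_score",
)


def score_accuracy(example, prediction) -> float:
    """Return 1.0 for a perfect match, falling to 0.0 at 10 points of mean error."""
    errors = [
        abs(float(getattr(example, f)) - float(getattr(prediction, f)))
        for f in SCORE_FIELDS
    ]
    mean_error = sum(errors) / len(errors)
    return max(0.0, 1.0 - mean_error / 10.0)


def passes_for_bootstrap(example, prediction, threshold: float = 0.75) -> bool:
    """Boolean gate: only accurate predictions qualify as demonstrations."""
    return score_accuracy(example, prediction) >= threshold


# Stand-ins for a labeled example and a model prediction (hypothetical values).
label = SimpleNamespace(responsiveness_score=2, customer_negative_score=8,
                        conversational_trend_score=3)
hedged = SimpleNamespace(responsiveness_score="7.5", customer_negative_score="7.5",
                         conversational_trend_score="7.5")
print(score_accuracy(label, hedged), passes_for_bootstrap(label, hedged))
```

In this example the hedged prediction (7.5 everywhere) misses the label by 3.5 points on average, so it falls below the cutoff and would not be kept as a demonstration.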

The vLLM adapter problem

DSPy ships with LiteLLM integration as the default way to connect to LLM backends. For our vLLM deployment on RunPod, LiteLLM kept throwing InternalServerError on valid responses. The vLLM server was returning correctly structured completions, but LiteLLM’s response parsing was rejecting them.

The fix was a custom adapter that talks directly to the vLLM HTTP endpoint, bypassing LiteLLM entirely:

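import requests


class VLLMAdapter(dspy.LM):
    def __init__(self, base_url: str, model: str, api_key: str):
        # Initialize the base class first, then set our own attributes.
        super().__init__(model)
        self.base_url = base_url
        self.model = model
        self.api_key = api_key

    # Which method DSPy invokes on a custom LM depends on the release;
    # confirm that your installed version actually calls basic_request.
    def basic_request(self, prompt: str, **kwargs) -> dict: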
        payload = {
            "model": self.model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": kwargs.get("temperature", 0.7),
            "max_tokens": kwargs.get("max_tokens", 1000),
        }
        response = requests.post(
            f"{self.base_url}/chat/completions",
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()

The adapter is thin: just a direct HTTP POST to the chat completions endpoint. No middleware, no SDK abstractions that might mismatch the response format. If vLLM is returning valid OpenAI-compatible completions (which it is), this works reliably.

This is worth knowing if you’re running DSPy against a self-hosted vLLM instance and hitting mysterious errors. In our case the culprit was LiteLLM’s compatibility layer, not DSPy itself or the vLLM setup.
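The adapter returns the raw JSON body, so something still has to pull the generated text out of it. As a sketch, assuming the standard OpenAI-compatible chat completion shape (a `choices` list whose items carry `message.content`), a defensive extractor might look like this. The function name `extract_completion_text` and the sample payload are hypothetical.

```python
def extract_completion_text(body: dict) -> str:
    """Pull the first completion's text from an OpenAI-style chat response."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError(f"response has no choices: {body!r}")
    content = (choices[0].get("message") or {}).get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content


# Hand-written payload in the chat-completions shape, for illustration only.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "responsiveness_score: 3"}}
    ]
}
print(extract_completion_text(sample))
```

Failing loudly on an unexpected shape makes it easier to tell a malformed server response apart from a client-side parsing problem.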

Storing optimized prompts in MongoDB

After optimization, you have a compiled DSPy module that contains the few-shot examples and configuration that produced good results. The question is where to put it.

The naive answer is a file on disk. The problem is that files on disk don’t survive container restarts in Kubernetes unless you mount a volume. Baking the optimized prompts into the image felt wrong too: it ties optimization to deployment, which is too heavy for something that should run automatically twice a week.

MongoDB was the right fit. The optimized module gets serialized to a dict and stored in the dspy_optimizers collection:

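from datetime import datetime, timezone


def save_optimizer(db, optimizer_type: str, state: dict):
    # A module's dumped state is typically keyed by predictor name, so count
    # the demos inside each predictor's entry rather than at the top level.
    demo_count = sum(
        len(entry.get("demos", []))
        for entry in state.values()
        if isinstance(entry, dict)
    )
    db.dspy_optimizers.replace_one(
        {"type": optimizer_type},
        {
            "type": optimizer_type,
            "state": state,
            "optimized_at": datetime.now(timezone.utc),  # timezone-aware UTC
            "training_examples": demo_count,
        },
        upsert=True,
    )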

def load_optimizer(db, optimizer_type: str) -> dict | None:
    doc = db.dspy_optimizers.find_one({"type": optimizer_type})
    return doc["state"] if doc else None

At service startup, the analyzer checks for a saved optimizer state and loads it:

import logging

logger = logging.getLogger(__name__)


class DSPyChatAnalysisService:
    def __init__(self, db, lm):
        dspy.configure(lm=lm)
        self.analyzer = ChatAnalyzer()

        saved_state = load_optimizer(db, "chat_analyzer_optimizer")
        if saved_state:
            self.analyzer.load_state(saved_state)
            logger.info("Loaded optimized DSPy prompts from MongoDB")
        else:
            logger.info("No optimized prompts found, using default")

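This means optimization and deployment are decoupled. A new optimized prompt set needs no container rebuild and no redeploy: it only has to be written to MongoDB. One caveat: with the startup-time load shown above, a running process picks up the new state the next time it starts. For the next request to pick it up without a restart, the service has to re-check MongoDB while it runs.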
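If you want a running process to pick up new state without a restart, one option is to re-check the `optimized_at` timestamp every so often and reload only when it changes. This is a sketch under assumptions: `StateReloader`, the 300-second interval, and the `fetch`/`apply` callables are hypothetical names, with `fetch` standing in for a `find_one` call on `dspy_optimizers` and `apply` for `analyzer.load_state`.

```python
import time
from typing import Callable, Optional


class StateReloader:
    """Reload optimizer state when the stored optimized_at timestamp changes."""

    def __init__(self, fetch: Callable[[], Optional[dict]],
                 apply: Callable[[dict], None], interval_s: float = 300.0):
        self._fetch = fetch          # returns the stored document, or None
        self._apply = apply          # e.g. analyzer.load_state
        self._interval_s = interval_s
        self._last_check = float("-inf")
        self._loaded_at = None       # optimized_at of the state currently in use

    def maybe_reload(self, now: Optional[float] = None) -> bool:
        """Call at the start of a request; returns True if new state was applied."""
        now = time.monotonic() if now is None else now
        if now - self._last_check < self._interval_s:
            return False             # checked recently, skip the database call
        self._last_check = now
        doc = self._fetch()
        if doc is None or doc["optimized_at"] == self._loaded_at:
            return False
        self._apply(doc["state"])
        self._loaded_at = doc["optimized_at"]
        return True


# In-memory stand-in for the MongoDB document, to show the behavior.
store = {"doc": {"optimized_at": 1, "state": {"v": 1}}}
applied = []
reloader = StateReloader(lambda: store["doc"], applied.append, interval_s=300)
reloader.maybe_reload(now=0)      # first check loads v1
store["doc"] = {"optimized_at": 2, "state": {"v": 2}}
reloader.maybe_reload(now=10)     # inside the interval: no database check
reloader.maybe_reload(now=400)    # interval elapsed: picks up v2
print(applied)
```

The interval bounds how stale a running process can be while keeping the extra database traffic to one small query per interval.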

Automated re-optimization

Once you’ve set up optimization-as-a-service, the natural next step is running it on a schedule. Seller conversations accumulate continuously. New edge cases appear. A model that was well-calibrated three weeks ago on the examples available then might drift as conversation patterns change.

The re-optimization CronJob runs Monday and Thursday at 3 AM IST:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sherpa-dspy-optimize
spec:
  schedule: "30 21 * * 0,3"  # Mon & Thu 3AM IST = Sun & Wed 21:30 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # Job pods must set OnFailure or Never
          containers:
          - name: optimizer
            image: sherpa:latest
            command: ["python", "scripts/optimize_dspy.py", "--optimize"]
            env:
            - name: USE_DSPY
              value: "true"

The optimization script collects 20 diverse training examples (5 manually curated edge cases plus 15 pulled from the database), spread across the score range to ensure coverage of low, mid, and high performers. It then runs BootstrapFewShot, evaluates the compiled module against a held-out validation set, and writes the result to MongoDB if it improves on the current best.
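The "write only if it improves" step is a simple gate. Here is a sketch, assuming the validation metric is a single float where higher is better; `should_promote` and the `min_gain` margin are hypothetical and not taken from the Sherpa script.

```python
from typing import Optional


def should_promote(candidate_score: float, current_best: Optional[float],
                   min_gain: float = 0.01) -> bool:
    """Promote a newly compiled module only if it beats the stored one."""
    if current_best is None:
        return True  # nothing stored yet, so any compiled module is an upgrade
    # Require a small margin so validation-set noise doesn't cause churn.
    return candidate_score >= current_best + min_gain


print(should_promote(0.81, None))   # True
print(should_promote(0.85, 0.80))   # True
print(should_promote(0.803, 0.80))  # False
```

Storing the validation score alongside the state in MongoDB would give the next run a `current_best` to compare against.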

The training data collection matters here as much as the optimization itself. If you sample only from the database, you’ll oversample the common case (mid-range conversations) and undersample the extremes. The 5 manual examples are specifically chosen to be boundary cases: the worst response behavior we’ve seen, the best, and a few genuine edge cases that the model tends to get wrong.
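To keep the database-sourced examples from piling up in the middle of the range, the sampling can be stratified by score band. The sketch below assumes each candidate carries a reference score under a `score` key, and the band boundaries (4.0 and 7.0) are arbitrary choices for illustration; neither is taken from the Sherpa script.

```python
import random


def stratified_sample(candidates: list[dict], per_band: int, seed: int = 0) -> list[dict]:
    """Pick up to per_band examples from each of the low, mid and high bands."""
    bands = {"low": [], "mid": [], "high": []}
    for item in candidates:
        score = item["score"]
        if score < 4.0:
            bands["low"].append(item)
        elif score < 7.0:
            bands["mid"].append(item)
        else:
            bands["high"].append(item)

    rng = random.Random(seed)  # seeded so repeated runs pick the same examples
    picked = []
    for members in bands.values():
        picked.extend(rng.sample(members, min(per_band, len(members))))
    return picked


# Synthetic pool dominated by mid-range conversations.
pool = (
    [{"id": f"low-{i}", "score": 2.0} for i in range(3)]
    + [{"id": f"mid-{i}", "score": 5.5} for i in range(40)]
    + [{"id": f"high-{i}", "score": 8.5} for i in range(6)]
)
sample = stratified_sample(pool, per_band=5)
print(len(sample))  # 3 low + 5 mid + 5 high = 13
```

When a band has fewer candidates than requested, as the low band does here, that shortfall is exactly where manually curated examples earn their place.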

Score validation

One side effect of getting the model to produce more varied scores: it occasionally goes outside the 0–10 range. A model that’s been pushed to discriminate more aggressively sometimes overshoots.

All scores go through validation before storage:

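import math


def _validate_scores(scores: dict) -> dict:
    """Clamp all seller scores to valid 0-10 range."""
    score_fields = [
        "responsiveness_score",
        "customer_negative_score",
        "conversational_trend_score",
        "overall_seller_score",
    ]
    validated = dict(scores)
    for field in score_fields:
        if field in validated and validated[field] is not None:
            try:
                value = float(validated[field])
                # float() accepts "nan" and "inf"; reject them instead of
                # clamping, since min/max would silently turn NaN into 10.0.
                if not math.isfinite(value):
                    raise ValueError("non-finite score")
                validated[field] = round(max(0.0, min(10.0, value)), 2)
            except (ValueError, TypeError):
                # Unparseable values are stored as missing, not as a guess.
                validated[field] = None
    return validated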

This is a minor thing but worth noting: when you optimize a model toward a wider output range, you take on a responsibility to handle the tail. The clamping ensures that an occasionally miscalibrated response doesn’t corrupt the dashboard data.
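Since the weights for `overall_seller_score` are fixed in the signature, the overall score can also be recomputed in code rather than trusted from the model’s arithmetic. This sketch applies the 40/30/30 formula literally as written in the field description. Whether `customer_negative_score` should be inverted first (so that more frustration lowers the overall score) depends on how the labels define it, so treat that as an open assumption. `recompute_overall` is a hypothetical helper.

```python
from typing import Optional

WEIGHTS = {
    "responsiveness_score": 0.4,
    "customer_negative_score": 0.3,
    "conversational_trend_score": 0.3,
}


def recompute_overall(scores: dict) -> Optional[float]:
    """Weighted 0-10 overall score, or None if any component is missing."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        value = scores.get(field)
        if value is None:
            return None
        total += weight * float(value)
    return round(total, 2)


print(recompute_overall({
    "responsiveness_score": 2.0,
    "customer_negative_score": 8.0,
    "conversational_trend_score": 3.0,
}))  # 0.8 + 2.4 + 0.9 = 4.1
```

Running this after clamping keeps the overall score consistent with the three component scores that end up on the dashboard.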

Results

Before DSPy optimization, seller scores on Qwen 2.5 7B clustered between 6.5 and 8.5 across hundreds of conversations. After optimization, the distribution spread meaningfully: individual conversation scores ranged from 1.3 to 9.65, with genuine separation between low and high performers.

More importantly, the scores started matching human judgment. When we spot-checked high-scoring and low-scoring conversations manually, the model’s assessments were consistent with what a human reviewer would say. That’s the actual test: not the score range itself, but whether the range reflects reality.

The re-optimization schedule has kept the system calibrated through three months of live operation without manual intervention. When a new edge case pattern appears in the data, the next optimization cycle incorporates it.

What I’d do differently

BootstrapFewShot requires labeled training examples. Collecting those examples was the most time-consuming part: the 5 manual edge cases alone took a few hours to select and verify. In hindsight, I should have built the annotation workflow earlier in the project, so we had a growing pool of verified examples rather than scrambling to collect 20 decent ones right before the first optimization run.

I’d also test DSPy optimization earlier in the project lifecycle, before score compression became a production issue. The symptoms were visible in the first week of beta testing. I delayed because I assumed the problem was fixable through prompt iteration alone. It wasn’t.

The other thing worth noting: DSPy doesn’t solve the underlying cause of score compression (smaller models being trained toward safe outputs). It works around it by showing the model better examples of what correct behavior looks like. If you need to run on a 7B model in production for cost or latency reasons, DSPy is a practical way to get better calibration. If you can afford a 70B model or a capable closed model, you may not need it: larger models tend to handle nuanced scoring tasks more naturally. Know which situation you’re in before spending time on optimization infrastructure.

The lesson from Sherpa isn’t “use DSPy.” It’s “treat prompt quality as an engineering problem, not a writing problem.” Manual prompt tuning scales poorly, doesn’t self-correct as data changes, and depends on the engineer who wrote it staying involved. Building an optimization loop (examples, metric, schedule, storage) turns it into something the system can manage.

That’s the framing I’d carry into any similar problem.