Multi-agent systems in production: what CrewAI handles and what you have to build yourself

What CrewAI gave us in production, what it did not, and the real lessons from running a hierarchical multi-agent trip-planning system with real users.

Multi-agent demos are very good at making the happy path look inevitable.

The orchestrator delegates, the specialist responds, the output arrives structured and correct. Everything chains cleanly. You close the notebook thinking “this is going to be straightforward.”

Then you deploy it. Real users ask questions in unexpected sequences. An API call returns a timeout. The orchestrator picks up the previous turn’s context wrong and starts asking for information the user already gave. The streaming response looks alive while the backend step is quietly stuck. And at some point, you’re debugging agent routing at 11pm by reading raw Redis event backlogs.

We built the AI trip planning system at PickYourTrail on CrewAI, a hierarchical setup with an orchestrator delegating to 10 specialist agents: discovery, itinerary architect, hotels, activities, costing, transfers, flights, insurance, discounts, passengers. This is an account of what CrewAI actually handles well and what you have to build yourself to make it production-worthy.

What CrewAI gives you

The hierarchical process is the genuine value. Process.hierarchical with a manager_agent gives you a structured delegation pattern where the orchestrator breaks down the user’s intent and routes to the right specialist. That matches the travel planning domain well: the work naturally decomposes into specialist sub-tasks, and an orchestrator that owns the conversation while specialists own their domains produces cleaner agent behavior than a flat multi-agent setup.

self.crew = Crew(
    agents=[
        self._agents["discovery"],
        self._agents["itinerary_architect"],
        self._agents["hotel"],
        # ... 7 more specialists
    ],
    tasks=tasks,
    process=Process.hierarchical,
    manager_agent=self._agents["orchestrator"],
    verbose=True,
    stream=True,
)

CrewAI also handles the agent execution loop: retrying on tool errors, feeding context between tasks, and managing the internal conversation history for each agent. That’s real scaffolding that would take significant effort to build from scratch.

What it doesn’t give you: reliable state transitions, production-grade observability, streaming architecture, or hard behavioral constraints. All of that has to be built on top.

Structured output routing: the next_action field

The first thing we learned in production is that text-based routing doesn’t work reliably. If the orchestrator’s output is a paragraph of natural language and you’re parsing it to decide what happens next, you’ve introduced a fragile string-matching problem into your state machine.

The fix is structured output. The orchestrator’s task uses output_pydantic=OrchestratorOutput, which means CrewAI validates the output shape against a Pydantic model before accepting it:

class OrchestratorOutput(BaseModel):
    text: str = Field(..., description="Conversational response to the user.")
    ui_components: list[dict[str, Any]] = Field(default_factory=list)
    trip_context_update: dict[str, Any] = Field(default_factory=dict)
    next_action: Optional[str] = Field(
        default=None,
        description="Set to 'build_itinerary' only when user confirmed itinerary creation.",
    )
    next_question: Optional[str] = Field(
        default=None,
        description="Picker to show next: pax, trip_type, child_ages, dates, duration.",
    )

    @field_validator("next_action")
    @classmethod
    def valid_next_action(cls, v: str | None) -> str | None:
        allowed = {None, "build_itinerary"}
        if v not in allowed:
            raise ValueError(f"next_action must be one of {allowed}, got '{v}'")
        return v

The next_action field is validated to only allow None or "build_itinerary". If the LLM tries to emit anything else, the validator raises a ValueError, which Pydantic reports as a validation error. CrewAI catches that error and feeds it back to the agent, prompting it to retry with a valid value.

This turns the orchestrator from “speaker of vague next steps” into “producer of explicit state transitions.” The Python layer that processes the output does a simple if output.next_action == "build_itinerary" and routes accordingly. No regex, no parsing, no guessing.

next_question follows the same pattern: it’s constrained to a fixed enum of data collection steps. When the orchestrator needs more information from the user, it signals which picker to show next. Invalid values trigger a retry. The field is intentionally narrow.
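A minimal sketch of what that routing layer might look like, using a plain dataclass as a stand-in for the validated OrchestratorOutput. The route function, the transition names it returns, and ALLOWED_QUESTIONS are illustrative assumptions, not the production code.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the picker names listed in the next_question field description.
ALLOWED_QUESTIONS = {"pax", "trip_type", "child_ages", "dates", "duration"}


@dataclass
class Output:
    """Stand-in for the validated OrchestratorOutput (routing fields only)."""
    text: str
    next_action: Optional[str] = None
    next_question: Optional[str] = None


def route(output: Output) -> str:
    """Map the structured fields to an explicit transition (hypothetical names)."""
    if output.next_action == "build_itinerary":
        return "run_build"
    if output.next_question is not None:
        if output.next_question not in ALLOWED_QUESTIONS:
            raise ValueError(f"unknown picker: {output.next_question!r}")
        return f"show_picker:{output.next_question}"
    return "reply_only"
```

The point of the sketch is that every branch compares against a known constant. Nothing in the routing path inspects the free-form text field.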

Tool lists are hard constraints. Instructions are soft.

This is the most practically important lesson I’d give anyone building multi-agent systems.

If you tell an agent in its backstory “don’t call discovery tools during the build phase,” it might still call them. The model drifts. It finds a reason. It decides the instruction doesn’t apply in this case.

If the tool isn’t in the agent’s tool list, the constraint is absolute.

The itinerary architect and the discovery agent have completely separate tool sets. The discovery agent gets search_regions, get_region_info, get_testimonials. The itinerary architect gets get_top_cities, create_itinerary, get_itinerary_summary, manage_city, initiate_booking. Neither can call the other’s tools, not because we told them not to, but because those tools don’t exist in their context.

The run_build() method in VehoCrew makes this explicit in a code comment:

def run_build(self, inputs: dict[str, Any]) -> dict[str, Any]:
    """Run the Itinerary Architect's build task via agent.execute_task().

    Called by crew_run when the orchestrator signals next_action='build_itinerary'.
    The architect already has planner_build_tools (no search_regions / get_region_info),
    so discovery tools are physically unavailable, enforced by tool list, not text.
    """

The backstory reinforces what the tool list enforces. But if the two conflict, if the agent decides to ignore the backstory instruction, the tool list wins. That’s what you actually want. The tool surface is the real boundary.
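To make the difference between a soft instruction and a hard constraint concrete, here is a small framework-free sketch. The stub tools, the AGENT_TOOLS registry, and call_tool are hypothetical stand-ins; in CrewAI the equivalent boundary is simply which tools you pass to each agent.

```python
from typing import Callable, Dict


# Hypothetical stubs standing in for the real tools named above.
def search_regions(query: str) -> str:
    return f"regions for {query}"


def create_itinerary(city: str) -> str:
    return f"itinerary for {city}"


# Each agent only ever receives its own tools.
AGENT_TOOLS: Dict[str, Dict[str, Callable[[str], str]]] = {
    "discovery": {"search_regions": search_regions},
    "itinerary_architect": {"create_itinerary": create_itinerary},
}


def call_tool(agent: str, tool_name: str, arg: str) -> str:
    """Dispatch a tool call; a tool outside the agent's list cannot be reached."""
    tools = AGENT_TOOLS[agent]
    if tool_name not in tools:
        raise LookupError(f"{agent} has no tool named {tool_name!r}")
    return tools[tool_name](arg)
```

However the architect's prompt is worded, a call to search_regions fails at dispatch, because the lookup happens in code rather than in the model's reasoning.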

Guardrails: validation after the agent, before the next step

CrewAI 1.14.6 introduced first-class guardrails: callbacks invoked on TaskOutput after the agent finishes but before execution continues. Returning (True, output) accepts (optionally mutating) the output. Returning (False, error_msg) signals a content failure that triggers a retry with the error fed back to the agent.

We use this for two things.

Orchestrator guardrail: enforces conversational state correctness. The most important check is next_question enforcement. If the trip context is missing required information (no trip type, no passenger count, no departure date) and the orchestrator didn’t signal which picker to show next, the guardrail returns (False, targeted_error) and the agent gets a second chance with specific instructions:

if required and not result.next_question:
    return False, (
        f"You did not set 'next_question'. Based on the trip context, "
        f"'{required}' is still missing from the user. "
        f"Set next_question to the appropriate value: "
        f"'trip_type' | 'pax' | 'vibe' | 'dates' | 'duration' | 'child_ages'."
    )

The guardrail also silently patches things the LLM gets structurally wrong: it upgrades question components to confirmation_prompt when they match certain patterns, and it strips picker UI components that the Python layer is responsible for injecting. Python owns those concerns; the LLM doesn’t need to get them right for the system to work correctly.

Build guardrail: handles the opposite problem, failing fast on permanent errors. When the Veho API returns “unable to create an itinerary for your current selection,” a CrewAI retry will not help. The guardrail detects terminal error strings and returns (True, mutated_output), accepting the output but marking it status='error'. This prevents a retry loop on an unrecoverable state.
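The accept-with-error pattern for the build guardrail can be sketched as follows. BuildOutput is a hypothetical stand-in for the task output, TERMINAL_ERRORS is an assumed list of markers, and treating empty output as retryable is an assumption added for contrast; only the (bool, value) return convention comes from the description above.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple

# Assumption: substrings that mark an unrecoverable upstream failure.
TERMINAL_ERRORS = ("unable to create an itinerary for your current selection",)


@dataclass(frozen=True)
class BuildOutput:
    """Stand-in for the build task's output; field names are hypothetical."""
    raw: str
    status: str = "ok"


def build_guardrail(output: BuildOutput) -> Tuple[bool, Any]:
    text = output.raw.lower()
    if any(marker in text for marker in TERMINAL_ERRORS):
        # Accept but flag: returning False here would only trigger a useless retry.
        return True, replace(output, status="error")
    if not output.raw.strip():
        # Assumption: empty output is transient, so ask for another attempt.
        return False, "Build produced no output; run the build step again."
    return True, output
```

Downstream code then branches on status rather than on whether the guardrail passed, which keeps permanent failures out of the retry path.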

SSE streaming via Redis pub/sub

Direct response streaming doesn’t work across a multi-agent workflow. The work runs in a Celery task. The specialists execute sequentially, each taking seconds. The user is sitting in a browser expecting to see progress. You can’t hold a single HTTP connection open across all of that reliably.

The architecture that works: publish events to Redis as they happen, deliver them to the client via SSE from a separate endpoint that subscribes to the per-session channel.

def _publish_event(self, event_data: dict[str, Any]) -> None:
    backlog_key = f"{self.redis_channel}:events"
    seq_key = f"{self.redis_channel}:seq"

    # Copy so the caller's dict is not mutated, then stamp time and order.
    event = dict(event_data)
    event["ts_ms"] = int(time.time() * 1000)
    event["seq"] = int(_sync_redis.incr(seq_key))

    payload = json.dumps(event)
    _sync_redis.publish(self.redis_channel, payload)   # live subscribers
    _sync_redis.rpush(backlog_key, payload)             # late joiners
    _sync_redis.expire(backlog_key, 600)
    _sync_redis.expire(seq_key, 600)                    # counter expires with the backlog

Every significant event (agent started, tool called, tool returned, LLM thinking) is published to chat:task:{task_id}. The events also go into a backlog list with a 10-minute TTL, so a client that connects after the fact can replay what it missed.
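On the consuming side, the seq field is what lets a late joiner combine the backlog with the live channel without showing an event twice. A minimal sketch, assuming the SSE endpoint subscribes first and then reads the backlog (so the two sources can overlap), and that payloads arrive in seq order. The function name and the plain iterables standing in for Redis reads are hypothetical.

```python
import json
from itertools import chain
from typing import Dict, Iterable, Iterator


def replay_then_follow(backlog: Iterable[str], live: Iterable[str]) -> Iterator[Dict]:
    """Yield backlog events, then live events, skipping any seq already delivered."""
    last_seq = 0
    for payload in chain(backlog, live):
        event = json.loads(payload)
        if event["seq"] <= last_seq:
            continue  # already delivered from the backlog
        last_seq = event["seq"]
        yield event
```

For example, a backlog holding seq 1 and 2 followed by a live feed that starts at seq 2 yields events 1, 2, 3 exactly once each.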

The SSEEventListener wraps CrewAI’s process-level event bus. It filters events by the task UUIDs registered to the current crew (important under threaded concurrency where multiple crews could share a bus), then translates them to user-facing events:

_ROLE_LABEL: dict[str, str] = {
    "Senior Travel Planner and Coordinator": "Planning your response...",
    "Hotel Specialist": "Finding hotels...",
    "Costing Specialist": "Calculating costs...",
    # ...
}

The user sees “Finding hotels…” in the thinking indicator while the hotel agent is actually running. The backend agent roles map to human-readable labels. Without this layer, the agent activity is invisible to the user: the response appears to arrive all at once after a 20-second wait.
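The filter-then-translate step can be sketched without the framework. The event dict shape (task_id, agent_role), the fallback label, and the function name are assumptions made for illustration; the real listener works on CrewAI's event objects.

```python
from typing import Dict, Optional, Set

ROLE_LABEL: Dict[str, str] = {
    "Hotel Specialist": "Finding hotels...",
    "Costing Specialist": "Calculating costs...",
}
DEFAULT_LABEL = "Working on it..."  # hypothetical fallback for unmapped roles


def to_user_event(event: Dict, registered_task_ids: Set[str]) -> Optional[Dict]:
    """Drop events from other crews, then map the agent role to a user-facing label."""
    if event.get("task_id") not in registered_task_ids:
        return None  # belongs to a different crew sharing the same bus
    label = ROLE_LABEL.get(event.get("agent_role", ""), DEFAULT_LABEL)
    return {"type": "thinking", "label": label}
```

Filtering before translating matters under concurrency: without the task-id check, one user's indicator could show another session's activity.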

Draining the stream: the JSON token problem

CrewAI’s streaming mode emits text_delta events as the LLM generates tokens. When the task uses output_pydantic, the LLM emits JSON, which means the token stream is raw JSON fragments, not readable text. If you emit those directly to the client, the user sees {, then "text", then :, then ", then “Let me check…”.

The _drain_stream method handles this:

def _drain_stream(self, result: Any) -> Any:
    # Non-streaming results pass through untouched.
    if not isinstance(result, CrewStreamingOutput):
        return result

    # Iterating consumes the stream; keep only non-empty text chunks.
    text_chunks: list[StreamChunk] = []
    for chunk in result:
        if chunk.chunk_type == StreamChunkType.TEXT and chunk.content:
            text_chunks.append(chunk)

    final = result.result
    raw = "".join(c.content for c in text_chunks)
    # Assumption: chunks expose the emitting agent's role as `agent_role`.
    agent = getattr(text_chunks[-1], "agent_role", None) if text_chunks else None

    # Guardrail retries produce multiple adjacent JSON objects.
    # Parse each one, keep the last (winning attempt).
    decoder = json.JSONDecoder()
    last_parsed: dict | None = None
    pos = 0
    while pos < len(raw):
        idx = raw.find("{", pos)
        if idx == -1:
            break
        try:
            obj, end = decoder.raw_decode(raw, idx)
            if isinstance(obj, dict):
                last_parsed = obj
            pos = end
        except json.JSONDecodeError:
            # Not a complete object at this brace; resume scanning after it.
            pos = idx + 1

    if last_parsed is not None:
        text_content = last_parsed.get("text", "")
        if text_content:
            self._publish_event({"type": "text_delta", "delta": text_content, "agent": agent})

    # Return the final result even when no JSON object could be parsed.
    return final

The guardrail retry case is particularly subtle. When the guardrail rejects an output with (False, error_msg), the LLM retries, producing a second JSON object appended to the first in the stream buffer. JSONDecoder.raw_decode() parses each one sequentially and keeps the last. The first attempt (the rejected one) gets discarded. The user only sees the text from the winning attempt.
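The retry case is easy to reproduce in isolation with the standard library. The sketch below runs the same scan over a hand-written buffer; the buffer contents and the helper name last_json_object are made up for illustration.

```python
import json
from typing import Optional


def last_json_object(raw: str) -> Optional[dict]:
    """Return the last complete top-level JSON object found in raw, if any."""
    decoder = json.JSONDecoder()
    last: Optional[dict] = None
    pos = 0
    while pos < len(raw):
        idx = raw.find("{", pos)
        if idx == -1:
            break
        try:
            obj, end = decoder.raw_decode(raw, idx)
        except json.JSONDecodeError:
            pos = idx + 1  # incomplete or invalid at this brace; keep scanning
            continue
        if isinstance(obj, dict):
            last = obj
        pos = end
    return last


# First attempt (rejected: empty text) followed directly by the retry.
buffer = '{"text": "", "next_action": null}{"text": "Let me check...", "next_action": null}'
winning = last_json_object(buffer)
```

Here winning["text"] is "Let me check...", the text from the second attempt. A trailing truncated object is skipped, so the last complete attempt still wins.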

What actually broke in production

The failures were rarely dramatic. No catastrophic hallucination, no agent going off the rails in an obvious way. The actual production failures were quieter:

Stale context on session resume. When a user comes back to a partially planned trip, the orchestrator needs to know what’s already been done. Without injecting the current itinerary state into the agent’s backstory at kickoff, it would ask the user to repeat decisions they’d already made: trip type, destination, dates. We added a resume prompt injection that patches the orchestrator’s backstory based on the stored completion_status before the crew runs.

Tool call failures treated as retryable. The Veho API can return “unable to create an itinerary for your current selection”, a permanent failure that no amount of retrying fixes. Early versions of the build flow would hit this and loop. The build_guardrail accepting-with-error pattern stopped the loop.

Picker components injected by the LLM. Picker UI components (date picker, passenger picker, etc.) are Python’s responsibility, because the frontend needs specific structured data to render them. When the LLM tried to emit picker-like components in ui_components, it got the shape wrong. The guardrail strips them; Python injects them based on next_question.

The text field being empty. When the orchestrator is routing to a specialist and there’s nothing particularly conversational to say, the LLM sometimes emits an empty text field. The text_not_empty Pydantic validator and the guardrail’s explicit check on this field catch it and force a retry with a targeted error.
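For the stale-context failure above, the resume injection amounts to turning stored state into explicit prompt text before the crew runs. A rough sketch, assuming completion_status maps step names to booleans and trip_context holds the chosen values; both shapes, the function names, and the wording of the injected block are hypothetical.

```python
from typing import Dict


def resume_block(completion_status: Dict[str, bool], trip_context: Dict[str, str]) -> str:
    """Describe already-completed steps so the orchestrator does not ask again."""
    done = [step for step, finished in completion_status.items() if finished]
    if not done:
        return ""
    lines = ["The user is resuming a trip. Already decided (do not ask again):"]
    for step in done:
        lines.append(f"- {step}: {trip_context.get(step, 'set')}")
    return "\n".join(lines)


def patched_backstory(base: str, completion_status: Dict[str, bool],
                      trip_context: Dict[str, str]) -> str:
    """Append the resume block to the base backstory when there is one."""
    block = resume_block(completion_status, trip_context)
    return f"{base}\n\n{block}" if block else base
```

A fresh session leaves the backstory unchanged, so the same code path serves both new and resumed conversations.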

What I’d add from the start

OpenTelemetry from day one. We added it later and the retrofit was painful. A multi-agent workflow crosses FastAPI, Celery, Redis, SQLAlchemy, and httpx. Without distributed traces, a failure that looks like “the agent returned a wrong answer” may actually be a Celery task that timed out, or an httpx call to the MCP server that silently retried three times. You can’t debug what you can’t see.

A proper session state store earlier. The trip_context dictionary that tracks what the user has told us (destination, dates, passenger count, trip type) started as an ad hoc dict passed around in the task. It needed to be a proper database entity much sooner than we made it one. Agent state needs to survive across HTTP requests, across retries, across session resumes.

Guardrails before going live, not after. We added the orchestrator guardrail in response to real failures: the empty text field and the missing next_question. It would have been better to define the output contract precisely upfront and write the guardrail against it before any users saw the system. The Pydantic schema and the guardrail together are the contract between the LLM and the Python layer. Write the contract first.
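On the session state point, even a very small persistent store is enough to show the shape of the idea: trip context keyed by session, updated by merging each trip_context_update. The sketch uses the standard library's sqlite3 as a stand-in for a real database entity; the table name, class, and merge semantics (later keys overwrite earlier ones) are assumptions.

```python
import json
import sqlite3
from typing import Any, Dict


class TripContextStore:
    """Persist trip context per session so it survives requests and retries."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS trip_context ("
            "session_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
        )

    def load(self, session_id: str) -> Dict[str, Any]:
        row = self.conn.execute(
            "SELECT data FROM trip_context WHERE session_id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else {}

    def apply_update(self, session_id: str, update: Dict[str, Any]) -> Dict[str, Any]:
        # Merge the orchestrator's update into what is already stored.
        context = {**self.load(session_id), **update}
        self.conn.execute(
            "INSERT OR REPLACE INTO trip_context (session_id, data) VALUES (?, ?)",
            (session_id, json.dumps(context)),
        )
        self.conn.commit()
        return context
```

Usage is two calls per turn: load the context before kickoff, then apply the orchestrator's trip_context_update after the output passes validation.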


CrewAI was the right choice for this problem. The hierarchical delegation model fit the domain. The agent execution scaffolding saved significant time. What it didn’t give us, and what took most of the production hardening effort, was everything around it: structured state transitions, tool surface enforcement, streaming architecture, output validation, and the observability to debug it all when something went wrong.

The framework handles the happy path well. Production requires building everything else.