Deliver: Ship It Into How the Work Runs
Build deployed the workflow. Deliver makes the organization actually use it. The difference between “deployed” and “delivered” is whether people’s work changed — and whether you can prove it with a number tied to the constraint Signal named. This stage has five parts: a readiness audit, per-role runbooks, an operational log, a Sprint outcome measurement, and a delivery test. Skip none.
We flip the switch on a Wednesday.
The workflow goes live. The agent — an AI system that holds a goal, runs a sequence of steps toward it, and adjusts based on its own outputs without a human re-prompting between steps; not a chatbot that answers a question and stops — starts pulling tickets from the queue, routing them, logging its decisions. I’m watching the dashboard with Sofia, our VP of Operations, reviewing with me in real time. Left side: everything working the way we designed it. Right side: flags. The agent is handling a category of inputs we didn’t map fully in Source, and it’s doing things we didn’t expect.
I tell her: “Pretty buggy over here. But it’s doing a great job over there.”
She nods. We’re both looking at the same screen.
“Live” is a technical claim. The agent is running. The workflow is in production. The ticket is about to get closed. None of that means the Sprint produced.
The Sprint produces on Monday. The operator whose week was supposed to change sits down, opens their queue, and the eight hours of routing work they used to do is gone. They have something different to do instead. If that happens, the Sprint produced. If they’re still routing manually and sending work to the old inbox, still working the system we were supposed to replace, then we deployed a tool and didn’t deliver anything.
The difference is operational.
This chapter fills in the Deliver row of the Sequence first-pass on your Sprint Planning Canvas.
Deployed is not the same as delivered.
Build ends with a working agent running in production. That is a real milestone. It is also incomplete. The agent is doing the work it was designed to do; the organization around it has not yet adjusted to receive what the agent now produces.
Deliver is the stage whose work operators most often skip. The technology side is built; the organizational side is assumed. Every implementation that succeeded at Deliver did three things before the go-live date: named exactly who was affected and how their daily work would change, trained those people on the new handoffs before they encountered them in production, and established a measurement baseline against the cost Signal identified. Every implementation that failed at Deliver skipped one or more of them and discovered the gap in production, where fixing it was expensive. The Deploy Readiness Audit below operationalizes all three.
The person whose week used to include four hours of manual routing now has those four hours back — but only if their calendar, their queue, and their handoffs have been re-set to reflect that. The colleague who used to send them tickets needs to know to send the tickets somewhere else. The manager who used to see the weekly summary needs the new reporting wired up. The escalation path for the exceptions the agent flags needs to land on a human who has been told it will land on them.
If you skip that work, the deployed system still functions — but the organization continues to behave as if the old workflow is in place. People route around the new system because they were never told to route through it. The agent does its job. The hours don’t come back. The cost Signal quantified doesn’t get recovered. The Sprint produced less than it could have.
I learned this in my own shop before I understood it as a principle. We finished a Sprint, deployed a workflow, wrote the runbook. Two weeks later I watched a teammate do the task manually, the exact task the agent now handles, because she’d never been explicitly told to stop. The workflow was in production. The old inbox was still open. She was routing to both, hedging, not sure which one was “real.” I asked her why she was still doing it the old way when we’d built the tool to do it. But I was the one who never closed the old inbox.
This wasn’t resistance. Nobody told her to stop. That’s Deliver’s job. The technology was done; what wasn’t done was the operational work: closing the old inbox, redirecting the upstream handoff, making the new workflow the only workflow. Until that happens, you have two systems running in parallel, one of them invisible, and people will choose the one they know.
If the old system is still accessible, people will use it. Deliver’s first move is to close the old path — not alongside the new one, instead of it.
The Deploy Readiness Audit.
Before your team flips the switch, you run a pre-flight check. Four binary gates. Every gate must pass. If any answer is “no,” you stop, fix it, and re-run.
How to run the Deploy Readiness Audit
- Gate 1 — Confirm hands-on training — Every person whose work changes must have practiced on real inputs before go-live; a meeting or email doesn’t count.
- Gate 2 — Document the escalation path — Unexpected agent output needs a named human, a channel, and a response window before it hits production.
- Gate 3 — Test against real data — Validating on actual inputs from the last two weeks surfaces gaps that synthetic tests miss.
- Gate 4 — Book the Human Orchestrator review cadence — A recurring, non-negotiable calendar event makes oversight an operating rhythm, not a best intention.
- Resolve every Fail before deploying — Each failed gate becomes a named task; re-running at 4/4 is the deploy condition.
| # | Gate | Pass / Fail |
|---|---|---|
| 1 | Has every person whose work changes been trained on the new flow, with hands-on practice on actual inputs? | Pass / Fail |
| 2 | Is there a documented escalation path for when the workflow produces unexpected output — with named people, channels, and response windows? | Pass / Fail |
| 3 | Has the workflow been tested against real data from the last two weeks, using actual inputs from your operation? | Pass / Fail |
| 4 | Is the Human Orchestrator’s review cadence defined and on the calendar? Recurring. Non-negotiable. | Pass / Fail |
If the result is not 4/4, you’re not ready to deploy. Every “Fail” is a task. Assign it, complete it, re-run the audit.
Gate 4 names the Human Orchestrator: the operator who runs the shipped workflow and supervises its agent team day to day, reviewing output, course-correcting, and closing the loop on what the agents produce, sprint by sprint. This is an operational owner with a weekly cadence, not a chatbot supervisor. The role is covered in depth in the Design chapter.
The Audit is a check on prior work, not new authoring. Every question points to something that should already exist: a trained person, a documented path, a tested input, a scheduled review. If any answer is missing, that item blocks the deploy. Failed deploys trace back to a question nobody asked at this gate.
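If it helps to see the gate logic spelled out, here is a minimal sketch in Python. It assumes nothing beyond the four gates above; the `Gate` class, the `owner` field, and the sample answers are hypothetical, made up for illustration. What it shows is the rule itself: one Fail blocks the deploy, and every Fail comes back as a task.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    number: int
    question: str
    passed: bool
    owner: str = ""  # hypothetical field: who closes the gate if it fails


def run_audit(gates):
    """Return (ready, tasks). Ready only when every gate passed."""
    tasks = [
        f"Gate {g.number}: {g.question} (owner: {g.owner or 'unassigned'})"
        for g in gates
        if not g.passed
    ]
    return len(tasks) == 0, tasks


# Sample answers are invented for illustration.
gates = [
    Gate(1, "Hands-on training on actual inputs for every affected person", True),
    Gate(2, "Escalation path with named people, channels, response windows", True),
    Gate(3, "Workflow tested against real data from the last two weeks", False, "Ops lead"),
    Gate(4, "Human Orchestrator review cadence on the calendar", True),
]

ready, tasks = run_audit(gates)
score = sum(g.passed for g in gates)
print(f"{score}/{len(gates)} gates passed; clear to deploy: {ready}")
for task in tasks:
    print("TODO", task)
```

Whether this lives in a script or a spreadsheet matters less than what it returns: a task list to close, not a score to negotiate.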
Meridian Manufacturing.
At Meridian Manufacturing, Elena and Dave completed hands-on training with the three-agent quoting pipeline before the workflow went live. Elena ran real quotes through the system while Dave reviewed the agent’s historical job matching against specs he knew by heart. The escalation path was documented in the Claude Team workspace: any quote flagged as low confidence routed to Dave for non-standard materials, with Elena reviewing every assembled quote before it went to the customer. The workflow was tested against real RFQs pulled from HubSpot, actual customer requests from the prior two weeks. Elena’s review cadence was booked on the calendar: Monday and Thursday mornings, recurring. Systems verified: HubSpot connected, JobBOSS connected, Customer Notes.xlsx with 112 validated pricing rules loaded into the agent’s context. Four gates passed. Clear to deploy.
Pull up your Hybrid Accountability Chart from Design — the map of who owns what, extended to include the agent roles alongside the human ones (covered in the Design chapter). If you run EOS, this extends your Accountability Chart: same structure, with agent roles added to the seats. List every person whose daily work changes because of this Sprint. For each one, confirm they’ve completed hands-on training — not a meeting, not an email. If anyone hasn’t, that’s your first task before deploy.
The per-role runbook.
How to write the per-role runbook
- Identify affected roles from the Hybrid Accountability Chart — Scope the runbook to every person whose handoffs changed, so no one is surprised or left out.
- For each role, answer the four change questions — Forces explicit articulation of what the person does differently, what the agent now handles, how they review output, and what triggers escalation.
- Write the three-column entry (input / output / escalation) — One concrete sentence each — vague entries like ‘reviews output’ are not runbook entries.
- Apply the vacation test — A colleague who missed the entire Sprint should be able to read the document and know exactly what changed about their job.
- Publish the runbook in the team’s live tool before go-live — A runbook in a folder no one opens is not a runbook; it must be findable where the team already works.
For every role affected by the Sprint, document three things: what they receive, what they produce, and when they escalate.
| Role | Input (what they receive) | Output (what they produce) | Escalation (when and how) |
|---|---|---|---|
For each role, answer four questions:
- What does this person now do differently than before the Sprint?
- What does the agent handle that this person used to handle?
- How does this person review agent output?
- What triggers an escalation?
Deliver training is different from technology training. Most technology training covers how to use the new system. The training that prevents adoption failure covers the handoff: where the human’s work ends and the agent’s work begins. The handoff is the moment of highest failure risk in any human-AI workflow. It’s where the human either completes the output correctly and passes it to the agent, or does it incorrectly and compromises everything the agent produces downstream. Every person whose handoff changes should be able to answer two questions before go-live: what do I do differently now, and what does the agent do that I used to do? If they can’t answer both, the training is incomplete, and the runbook isn’t done.
Sit down with the Hybrid Accountability Chart from Design and work through each role whose handoffs changed. For each role, write three things: what they send to the agent (the trigger or input), what comes back to them (the output and its format), and what they do when something lands outside the expected range (the escalation path). One page per role.
At Meridian, the per-role runbook looked like this:
| Role | Input | Output | Escalation |
|---|---|---|---|
| Dave Kowalski (Sr. Design Engineer) | Specs flagged by the Quote Research Agent for non-standard materials or tight tolerances | Manual pricing for flagged line items; confirmation or correction of agent’s historical job match | Consulted on any quote involving Inconel, titanium, or exotic alloys. Receives Claude Team notifications when the agent flags a match it cannot confirm. |
| Elena Ruiz (VP Ops) | Agent-assembled draft quote in her CRM review queue, with confidence score and historical matches | Approved quote (or edited and approved), updated Customer Notes.xlsx when new exceptions arise | Reviews every assembled quote. Monitors confidence score distribution. Updates pricing rules for new exceptions. Works from Claude Team workspace and JobBOSS. |
| Ty Banfield (Sales) | Approved quote delivered via HubSpot notification | Customer-facing proposal sent same-day for standard work | Reports customer feedback to Elena. Flags any quote that does not match what the customer requested. Submits all RFQs through the standardized intake form — no email forwards. |
What changed per role: Dave used to give Elena verbal labor-hour estimates when she asked. Now he reviews the agent’s historical matching weekly and gets pulled in only for non-standard materials — the work that actually requires thirty-one years of fabrication knowledge. Elena used to build every quote from scratch, fifteen hours a week. Now she reviews agent-generated drafts in fifteen to twenty minutes each, spending the recovered hours on production planning and supplier negotiations. Ty used to wait three to five days for Elena to produce a quote, then forward it to the customer. Now he receives approved quotes same-day for standard work and can see every quote’s status in the CRM pipeline.
Get the runbook into the tool the team already uses, whether that’s the project management system, the shared doc, or the Slack channel. Do it before the workflow goes live.
The runbook test: could someone who was on vacation during the entire Sprint sit down with this document and know exactly what changed about their job? If not, the runbook isn’t done.
Once the runbook passes that test, put it where the team will actually find it — not a folder they’ll never open.
Pick one role from your Hybrid Accountability Chart whose handoffs changed. Write the three-column entry — input, output, escalation — in one sentence each. Be concrete: “Reviews output” is not specific enough. “Opens the quote draft in the shared folder, checks unit costs against the rate card, approves or flags within 4 hours” is.
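For teams that keep the runbook as structured data, a rough sketch of one entry might look like the following. Everything here is an assumption for illustration: the `RunbookEntry` class, the six-word cutoff for "too vague," and the sample role. A word count cannot judge whether an entry is truly concrete; it only catches the obvious two-word placeholders before a human applies the vacation test.

```python
from dataclasses import dataclass

MIN_WORDS = 6  # assumed cutoff, not from the text; "Reviews output" is 2 words


@dataclass
class RunbookEntry:
    role: str
    receives: str   # Input column
    produces: str   # Output column
    escalates: str  # Escalation column

    def gaps(self):
        """Column names that are empty or too short to be concrete."""
        columns = {
            "input": self.receives,
            "output": self.produces,
            "escalation": self.escalates,
        }
        return [name for name, text in columns.items()
                if len(text.split()) < MIN_WORDS]

    def as_row(self):
        """Render the entry as one row of the runbook table."""
        return f"| {self.role} | {self.receives} | {self.produces} | {self.escalates} |"


# Hypothetical role and wording, for illustration only.
vague = RunbookEntry("Quote reviewer", "Gets quotes", "Reviews output", "Escalates if needed")
concrete = RunbookEntry(
    "Quote reviewer",
    "Agent-assembled quote draft lands in the shared folder each morning",
    "Opens the quote draft in the shared folder, checks unit costs against "
    "the rate card, approves or flags within 4 hours",
    "Flags any unit cost missing from the rate card to the named owner the same day",
)

print(vague.gaps())
print(concrete.gaps())
print(concrete.as_row())
```

An entry with an empty `gaps()` list has only cleared the mechanical bar. The real test is still whether someone back from vacation can read it and know what changed.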
What gets logged.
Deliver also surfaces what did not work. The full retrospective belongs to Compound, in the next chapter. This is the operational log that captures, in real time, the gaps between what was designed and what production actually revealed.
How to build and maintain the operational log
- Create the four-column log template before deploy — The log must exist on day one of production — building it post-launch means the first gap goes unrecorded.
- Assign a single named owner — Accountability without a name means the log dies; one person owns it, no committee.
- Run daily check-ins for the first two weeks — The daily cadence surfaces patterns before they compound; week-two review timing aligns with the delivery test window.
- Categorize every entry (Unplanned intervention / Data gap / Edge case / User confusion) — Consistent categories make tallying possible; the category with the most entries identifies the next constraint.
- Tally by category at end of week two and surface the top category as the next Sprint signal — The log is a sensor, not a complaint box — it feeds Design for the next Sprint.
Create the log on day one of deployment. Assign one person to own it — named, accountable. Set a daily check-in for the first two weeks: review the log, categorize new entries, note patterns.
| Date | What happened | Category | What it signals |
|---|---|---|---|
Four categories. Use exactly these:
- Unplanned intervention — a human had to step in where the agent was supposed to handle it.
- Data gap — the workflow needed information that doesn’t exist or wasn’t connected.
- Edge case — an input the design didn’t account for.
- User confusion — a person whose work changed didn’t know what to do.
In Source, I described a client whose AI kept surfacing expired information because nobody audited the knowledge base before indexing it. That’s what Deliver’s logging phase catches — the gaps that Source couldn’t see because they only become visible in production.
Here’s what Meridian’s first two weeks looked like:
| Date | What happened | Category | What it signals |
|---|---|---|---|
| Day 2 | Agent hit an Inconel 718 specification not in Customer Notes.xlsx. It flagged low confidence and escalated to Dave — but also attempted to estimate using 316 stainless as the closest match, producing a price 40% below actual Inconel cost. Elena caught it in review. | Edge case | Agent must produce no price estimate for unknown materials — flag as “manual pricing required” and stop. Guardrail added same day. |
| Day 4 | Ty submitted an RFQ without the customer’s drawing attached. Agent produced an estimate based on the text description alone, which was too vague to be usable. Elena rejected it. | User confusion | Intake form needs a required field for drawings on custom fabrication work. Ty pushed back (“sometimes the customer calls in a description over the phone”); Elena held the line: “If there’s no drawing, you sketch what they described and attach it.” Required field added. |
| Day 7 | Dave sat in on Elena’s review and challenged the agent’s historical matching on a weldment quote. The agent matched on material and dimensions but missed that the tolerances were tighter, which doubled the labor. Side-by-side comparison showed the agent was correct on four of five specs — Dave caught the one tolerance error that mattered. | Unplanned intervention | Tolerance class needs to be a matching criterion, not just material and dimensions. Design task for next Sprint. |
| Day 10 | Ty reported that a customer praised Meridian’s same-day turnaround on a standard bracket quote — first time Meridian had ever delivered a quote that fast. Customer submitted a second RFQ the same afternoon. | — (positive signal) | Speed is producing repeat inbound. Track whether faster turnaround correlates with increased RFQ volume from existing customers. |
At the end of week two, tally entries by category. The category with the most entries is the first thing to fix — and the strongest signal for where the next Sprint should aim.
The operational log isn’t a complaint box. It’s a sensor. Every entry tells Design something about the next Sprint. If the same category keeps appearing, that category is the next constraint.
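The week-two tally is simple enough to do by hand, but a short sketch makes the rule explicit. The entries below are hypothetical, not Meridian's log, and the function name `top_categories` is my own. The sketch counts only the four categories, so a positive signal stays in the log without skewing the tally, and it returns every category tied for first rather than picking one arbitrarily.

```python
from collections import Counter

CATEGORIES = ("Unplanned intervention", "Data gap", "Edge case", "User confusion")


def top_categories(entries):
    """Return (counts, leaders): per-category counts and the categories tied for most entries."""
    counts = Counter(e["category"] for e in entries if e["category"] in CATEGORIES)
    if not counts:
        return counts, []
    best = max(counts.values())
    return counts, sorted(c for c, n in counts.items() if n == best)


# Hypothetical two-week log, for illustration only.
log = [
    {"date": "Day 1", "category": "Edge case"},
    {"date": "Day 2", "category": "User confusion"},
    {"date": "Day 3", "category": "Edge case"},
    {"date": "Day 5", "category": "Data gap"},
    {"date": "Day 8", "category": "Edge case"},
    {"date": "Day 9", "category": "User confusion"},
    {"date": "Day 10", "category": "Positive signal"},  # logged, but not tallied
]

counts, leaders = top_categories(log)
for category in CATEGORIES:
    print(f"{category}: {counts[category]}")
print("Fix first:", leaders)
```

A tie is information too: it usually means two weeks of entries is not enough to name the next constraint, and the log should keep running.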
The log has to exist before the workflow goes live. Build it as part of the deployment.
Create the log template — four columns: Date, What happened, Category, What it signals — in whatever tool your team already uses. Name the person who owns it. Do this before you flip the switch, not after.
The measurement question.
How to measure the Sprint outcome
- Pull the Signal statement and quantified cost — The baseline must come from the original Signal instrument — same metric, same unit — so the comparison is apples-to-apples.
- Measure the same metric in production (result after Sprint) — An imperfect measurement beats a vibe; write the number even when the original Signal quantification was loose.
- Compute the delta — Leadership needs to see what moved, not a narrative — the delta makes the change legible.
- Write the outcome in one sentence — The outcome sentence mirrors the Signal statement and becomes the artifact Compound works against.
- Calculate and report adoption rate (week-two window) — Operational improvement and adoption rate together diagnose whether a low result is a Design problem or a change-management problem.
Deliver’s first responsibility is to measure the Sprint against Signal’s quantified cost. Signal said the constraint costs $150K a year. Deliver shows a workflow that recovers 80% of that. The Sprint produced. The math is legible. The next Sprint earns the right to start.
That’s the clean case. The harder cases are the ones where the measurement itself is fuzzy. The Sprint clearly changed the work: the operator’s week looks different, the bottleneck is gone, and the log shows unplanned interventions dropping week over week. But the dollar figure is hard to pin down because the original Signal quantification was looser than it should have been. That’s a Signal lesson, not a Deliver failure, and it goes into the Compound stage as an input to how the next Sprint gets framed.
The discipline is to write the number down anyway. Even an imperfect measurement beats a vibe. If the constraint cost was estimated at “around 8 hours a week” and Deliver shows the work now takes 30 minutes a week, the recovery is real and the math is good enough to report. Refuse the trap of declaring victory without evidence — and the equally bad trap of letting imperfect evidence become an excuse to claim nothing happened.
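The arithmetic behind both examples fits in a few lines. This is a sketch, not a reporting tool: the function name `delta` is mine, and the only inputs are the numbers already quoted above (8 hours a week down to 30 minutes, and a $150K constraint recovered at 80%).

```python
def delta(cost_at_start, result_after):
    """Absolute and percentage change, both in the Signal metric's own unit."""
    recovered = cost_at_start - result_after
    return recovered, recovered / cost_at_start * 100


# "Around 8 hours a week" at start, 30 minutes a week after the Sprint.
hours_recovered, pct = delta(8.0, 0.5)
print(f"Recovered {hours_recovered} hours/week ({pct:.1f}% of the original cost)")

# Signal named $150K a year; the workflow recovers 80% of it.
annual_recovery = 150_000 * 0.80
print(f"Annual recovery: ${annual_recovery:,.0f}")
```

Both values passed to `delta` have to be in the same unit. Converting 30 minutes to 0.5 hours before subtracting is the whole trick.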
An honest measurement beats a precise one. The leadership team needs to know what changed. They don’t need a three-decimal-place ROI.
The structure is straightforward. Pull out your Signal statement — the sentence with the number. Write the production result next to it. Compute the delta. Then write the Sprint outcome sentence:
| Field | Entry |
|---|---|
| Signal statement | [the one-sentence constraint from Signal] |
| Cost at start | [the quantified cost — dollars, hours, or margin] |
| Result after Sprint | [what actually happened — same metric, same unit] |
| Delta | [the change — quantified] |
| Outcome in one sentence | [what this Sprint produced, stated plainly] |
Meridian example — what “done” looks like:
| Field | Entry |
|---|---|
| Signal statement | Elena Ruiz is the sole quoting bottleneck. Every quote runs through her — $558K/year in lost revenue, misallocated time, and shop floor underutilization. |
| Cost at start | 3.8-day average turnaround; ~15 hrs/week of Elena’s time; capacity capped at 8–10 quotes/week. |
| Result after Sprint | 4.2-hour average turnaround on standard work; ~5 hrs/week of Elena’s time; 18–22 quotes/week. |
| Delta | Turnaround reduced 89%. Elena recovered 10 hrs/week. Quote volume doubled. First-month revenue recovery: $47K. |
| Outcome in one sentence | Three-agent quoting pipeline removed Elena as the bottleneck on standard work, cut turnaround from 3.8 days to 4.2 hours, and produced $47K in new revenue in month one. |
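One detail worth making explicit when a delta crosses units: the turnaround figure moves from days to hours, so the percentage depends on how many hours count as a day. The table does not state the conversion, so the sketch below treats it as a parameter. The three conversions tried here are my assumptions, not Meridian's definition.

```python
def pct_reduction(before_hours, after_hours):
    """Percentage reduction once both values are in hours."""
    return (before_hours - after_hours) / before_hours * 100


BEFORE_DAYS = 3.8  # from the "Cost at start" row
AFTER_HOURS = 4.2  # from the "Result after Sprint" row

# Assumed day lengths: a working day, a long shift, a calendar day.
reductions = {
    hours_per_day: pct_reduction(BEFORE_DAYS * hours_per_day, AFTER_HOURS)
    for hours_per_day in (8, 10, 24)
}

for hours_per_day, pct in reductions.items():
    print(f"{hours_per_day}-hour day: turnaround reduced {pct:.1f}%")
```

Under these assumptions the reduction lands between roughly 86% and 95%, and the 89% reported in the table corresponds to about a ten-hour day. Whichever conversion you use, write it down next to the number so the next Sprint measures the same way.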
The five-row table captures the operational result: the cost Signal named against the result Deliver produced. That’s one of two numbers the Sprint outcome requires. The second is adoption rate: what percentage of the affected roles are using the new workflow as designed, not the old workflow out of habit. A workflow that works technically but is adopted by 60% of the team is producing roughly 40% less value than the Sprint was designed to generate.
Adoption rate is countable at the scale most operators run. The denominator is the named roles from the Hybrid Accountability Chart whose handoffs changed in this Sprint — not enterprise-wide users, not licensed seats, but the specific people whose work changed. The numerator is the named people observably running their work through the designed workflow, visible in the operational log. “Use” means the affected person’s work went through the designed path, not around it. Report the number as of week two after launch, the same window the operational log’s first review cadence closes on.
The two numbers together tell the story. High adoption with modest operational improvement is a Design problem: the constraint was named correctly but the workflow wasn’t built to solve it at the projected cost. Low adoption with strong operational improvement on the work that did flow through is an absorption problem: the design works, but the organization hasn’t adopted it yet. The next Sprint adjusts accordingly — either the workflow design or the change-management approach, depending on which number is low. One number alone is incomplete.
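Here is one way to sketch the two-number read in code. The role names are hypothetical, and the thresholds for "high" and "low" (`adoption_floor`, `recovery_floor`) are assumptions I picked for illustration; the text gives no cutoffs, so set your own. The both-low branch is also my addition, since the two cases above do not cover it.

```python
def adoption_rate(affected_roles, roles_on_new_path):
    """Share of affected roles whose work ran through the designed workflow."""
    affected = set(affected_roles)
    adopted = affected & set(roles_on_new_path)
    return len(adopted) / len(affected)


def diagnose(adoption, recovery, adoption_floor=0.8, recovery_floor=0.5):
    """Map the two numbers to the next Sprint's focus. Floors are assumed values."""
    high_adoption = adoption >= adoption_floor
    strong_recovery = recovery >= recovery_floor
    if high_adoption and strong_recovery:
        return "on track: carry both numbers into Compound"
    if high_adoption:
        return "design problem: adjust the workflow design"
    if strong_recovery:
        return "absorption problem: adjust the change-management approach"
    return "both low: diagnose before framing the next Sprint"


# Hypothetical roles from a Hybrid Accountability Chart.
affected = ["Intake", "Reviewer", "Engineer", "Sales", "Scheduler"]
on_new_path = ["Intake", "Reviewer", "Sales"]  # observed in the operational log

rate = adoption_rate(affected, on_new_path)
print(f"Adoption at week two: {rate:.0%}")
print(diagnose(rate, recovery=0.9))
```

The denominator is the list of named people, which is why intersecting two sets of names works here. At this scale you are counting colleagues, not sampling users.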
At Meridian:
SIGNAL STATEMENT: Elena Ruiz is the sole quoting bottleneck. Every quote runs
through her — $558K/year in lost revenue, misallocated time,
and shop floor underutilization.
COST AT START: Average quote turnaround: 3.8 days. Elena's time on quoting:
~15 hours/week. Throughput limited to 8-10 quotes/week by
Elena's capacity alone.
RESULT AFTER SPRINT: Average quote turnaround on standard work: 4.2 hours. Elena's
time on quoting: ~5 hours/week. Throughput doubled to 18-22
quotes/week. $47K in new revenue in the first month from bids
that would previously have arrived too late.
DELTA: Turnaround reduced by 89%. Elena recovered 10 hours/week.
Quote volume doubled. First-month revenue recovery: $47K.
OUTCOME IN ONE SENTENCE: Three-agent quoting pipeline removed Elena as the
bottleneck on standard work, cut turnaround from 3.8 days to
4.2 hours, and produced $47K in new revenue in month one by
letting Meridian quote fast enough to win jobs they used to lose.
Leading indicators the team tracked weekly: quote volume per week, confidence score distribution across the agent pipeline, and escalation frequency to Dave on non-standard materials. Those leading indicators matter because the outcome sentence is a trailing measurement — it tells you what happened. The leading indicators tell you whether it will keep happening.
That outcome sentence mirrors the Signal statement. It’s the artifact Compound works against. If you can’t produce it, stop and diagnose before declaring the Sprint done.
Go back to your Signal instrument. Copy the Constraint Statement and quantified cost exactly as written. Measure the same metric now — same unit, same timeframe, same source. Calculate the delta. Write the outcome in one sentence. No spin.
The delivery test.
The delivery test is a checklist. Run it before you declare the Sprint complete:
Every unchecked box is a task, not a judgment call. Check them all and the Sprint delivered. Leave any unchecked and you deployed a tool — the organization hasn’t changed yet.
If you can’t fill in the “Result After Sprint” line with a number, go back to the operational log and figure out what you need to measure. The Sprint hasn’t delivered until a number moved.
The change is not the launch. The change is the week.
A well-run Deliver phase changes the work the people in the function do. Their week looks different on Monday than it did on Friday. The hours they used to spend on the work that’s now automated either get reinvested in the strategic work Design identified as the higher-return use of their time, or they come out of the org over time, as the Headcount Math gets rerun against the new operating reality.
That’s the test of whether Deliver delivered: can the people whose work was supposed to change describe, concretely, what they now do that they didn’t do before, and what they no longer do that they used to? The agent running and the dashboard existing are table stakes.
End of the Sprint as a project. Start of it as infrastructure.
Deliver is where the Sprint stops being a project and becomes infrastructure. The workflow is live, the team is trained, the result is measured, and the log of what didn’t work is captured. Depending on the Sprint, the constraint Signal named is either resolved, materially reduced, or honestly described in terms of how much of it the Sprint moved and how much remains.
That artifact set is what the next stage works against: the outcome, the log, the runbook, the updated Hybrid Accountability Chart entry. Compound is where the team turns it into the inputs that accelerate the next Sprint. That’s the next chapter.
Reflection Questions
- Run the Deploy Readiness Audit on your current Sprint before you flip the switch. Which of the four gates — training, escalation path, real-data testing, Human Orchestrator cadence on calendar — is the hardest to confirm as Pass? What specific task would close it?
- Write the per-role runbook entry for the person whose daily work changes most because of this Sprint. Does the runbook pass the vacation test — could someone who missed the entire Sprint sit down with that document and know exactly what changed about their job?
- The chapter describes a teammate who kept routing to the old inbox because nobody explicitly told her to stop. Is there an old inbox, old process, or old system in your Sprint that will stay open after deploy unless someone actively closes it? Who closes it, and when?
- Fill in the Sprint outcome measurement template: Signal statement, cost at start, result after Sprint, delta, outcome in one sentence. If you cannot fill in “result after Sprint” with a real number, what measurement do you need to set up before Deliver begins — not after?