Hightower's AI Harness Engineering Podcast

The Karpathy's AutoResearch: Automating Business Growth

Rick Hightower — Wed, 10 Jun 2026 23:16:28 GMT

Run AI Experiments While You Sleep

The next major advantage in business will not come from working longer hours. It will come from running more experiments while everyone else is offline.

That is the core idea behind autonomous optimization loops: instead of relying on slow, human-led research cycles, companies can now deploy AI agents to test, score, and improve assets continuously. Code, landing pages, ads, cold emails, sales scripts, and content ideas can all become part of an always-on iteration engine.

The old model of R&D was limited by human coordination. Teams met, debated, made a change, waited for results, and repeated the process days or weeks later. The new model compresses that cycle into minutes. An AI agent can form a hypothesis, generate a variation, test it against an objective score, keep the winner, discard the loser, and try again.

This is the power of the “Auto Research” approach popularized by Andrej Karpathy. The system is simple but profound: give the agent clear instructions, give it access to the asset it is allowed to improve, and protect the scoring mechanism so the agent cannot cheat. With those three pieces in place, the system can run like digital natural selection.

The most important requirement is an honest number. If the goal is faster code, measure runtime. If the goal is better outreach, measure positive replies. If the goal is better ads, measure cost per click or conversion rate. Without a clear score, the agent has no reliable direction. With one, it can improve relentlessly.

This is why the best use cases have fast feedback loops, low cost of failure, and enough volume to learn from. Website performance, paid ads, email campaigns, landing pages, and internal tooling are all strong candidates. Long-cycle or subjective goals, like brand perception or six-month churn, are much harder to optimize this way.

The strategic implication is huge: the bottleneck is no longer just compute. The new bottleneck is instruction quality. The companies that win will be the ones that know how to define the right objective, write the right operating instructions, and let agents run thousands of disciplined experiments.

The future of optimization is not one genius having one breakthrough idea. It is a swarm of agents testing thousands of small ideas, keeping what works, and compounding the gains over time.

In other words: your next R&D department may not need more meetings. It may need a better markdown file.

This is now free.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit rickhigh.substack.com/subscribe

What Is Harness Engineering? The Engineering Discipline for Production AI Agents.

Rick Hightower — Mon, 01 Jun 2026 15:34:05 GMT

What Is Harness Engineering? The Engineering Discipline for Production AI Agents.

Discover why the runtime around an LLM can outshine the model itself. And how “harness engineering” is turning AI agents into reliable, production‑grade powerhouses.

Summary: Discover how “harness engineering”, the emerging discipline that treats the runtime surrounding large language models as a first‑class engineering artifact, transforms AI agents from fragile prototypes into reliable, production‑grade systems. This article traces the concept’s roots from a 1947 cockpit study and the 2024 SWE‑agent breakthrough to the recent developments in 2025 and 2026. It explains why the engineered harness (context assembly, tool contracts, memory, observability, recovery, and orchestration) is now recognized as essential as the model itself, and showcases the recent convergence of terminology, open standards, and industry adoption that makes harness engineering the key to building scalable, long‑running AI applications.

(Now open to everyone.)

The model is the easy part

The model is not the hard part. It hasn’t been for a while.

We were building harnesses at Spillwave before the term existed, and before we incorporated. So were a lot of teams I worked with as a consultant, even if none of us called it that. We hit walls. We hit a lot of walls.

We had long-running agentic workflows that ran for hours, produced correct results, and resumed cleanly after interruption; before “long-running agent” was a published term. We had programmatic validation of DAX queries, feedback loops, and LLM-as-judge before the papers showed up. I wrote about drift detection on Medium over a year ago, arguing that drift had to be tracked and retested whenever a model version changed, a prompt changed, or a tool contract shifted. None of these had names yet. We pulled them out of necessity, not literature. We read a blog here or there, but the knowledge was disjointed and lacked cohesion.

Except that this is the part I keep coming back to: they were not really walls. Walls stop you. We were not stopped. We were on the cutting edge, and the cutting edge cuts. We ran into a sliding glass door that we didn’t see until we were already through it. We got cut. We have the battle scars.

The frustrating part, as a consultant, was that the demo worked. Stakeholders could not see why the scaffolding mattered. Why all this overhead? It works in dev. The problem is if it stops working. What happens when drift occurs six weeks later in production, even though it hadn’t occurred yet or been caught yet, and nobody knows how to reproduce the issues in dev. The work felt improvised even when it was not.

Adding a few shots to prompts for a new use case, or more tools, or more subagent decomposition, without drift detection, is like nailing Jell‑O to a wall. Non-deterministic behavior plus “just one more feature” without drift detection is like swimming in a pool of alcohol while covered in paper cuts. That isn’t “the bleeding edge.” It’s deliberate self-destruction.

Naming things is what puts a handle on the door. Once we have shared terminology, the next team does not have to run through the glass. They can see the door, find the handle, and walk through. That is what harness engineering gives the field. It is not validation for the people who were already doing this work. It is a way to talk about getting agentic workflows into production without re-explaining the ground floor every time.

The work itself now has a name. The engineered runtime that wraps a large language model and turns its raw text output into reliable system behavior is the harness, and the discipline of designing, building, and operating it is harness engineering. This article explains where that name came from, what the harness actually is, and why the discipline is 12 months old as a named thing while being roughly 3 years old as a practice.

What a harness is, in one paragraph

The harness is the engineered runtime that wraps a large language model and converts its raw text output into reliable system behavior. Specifically, the harness does six things the model cannot do on its own. It shapes what the model sees on each call (context assembly). It decides what the model is allowed to do (tool contracts and validators). It remembers what happened across calls (memory and durable state). It watches what the model produced (observability spans, drift detection, and evaluation gates).

It recovers when something went wrong (rollback, retry, replay). And it coordinates when multiple models or agents are involved (orchestration and protocol handling). Harness engineering is the discipline of treating the runtime as a first-class engineering artifact, much like SRE treats production infrastructure as code. Many of these are just a given, built into the tools that develop and deploy agents.

A harness is not a try-catch wrapper that prevents the model from failing. It is the engineered environment that enables a capable model to perform larger, longer, and more autonomous work than it could do on its own. A good cockpit does not just keep a pilot from crashing. It lets the pilot fly missions that a worse cockpit could not survive. Hold onto that distinction; it is the one most people get wrong when first encountering the term.

The discipline has a documented birth date

Most engineering disciplines do not. Harness engineering does, and the date is May 2024.

A team at Princeton (Yang, Jimenez, Wettig, Lieret, Yao, Narasimhan, and Press) published a paper called SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. It would later appear at NeurIPS 2024 as arXiv 2405.15793. The paper did something that, in retrospect, looks obvious and at the time looked like a category error.

They held the model fixed. GPT-4 Turbo, no fine-tuning, no prompt tricks. Then they built a small layer between the model and the codebase, called it the Agent-Computer Interface (ACI), and changed only that layer. The ACI had four parts: a cap of fifty results on file searches, a stateful file viewer that showed one hundred lines at a time and remembered position across calls, a linter that ran at edit time and rejected syntactically broken patches before they were applied, and a context window manager that compressed older observations as the trace grew.

That was it. Same model, same weights, same benchmark.

SWE-bench performance increased from 3.8 percent (previous best) to 12.47 percent. A more than three-fold gain, driven entirely by interface design.

The number is striking. The interpretation is more striking. What the SWE-agent team had demonstrated, deliberately and measurably, was that the runtime around the model can matter more than the model itself. Up to that point, the implicit assumption in agent research had been that better agents required better models. The ACI ablation showed that better agents could come from better interfaces, holding the model constant.

That paper is the foundational design document for the discipline. Everything that comes after, the harness patterns, the workflow patterns, the four-protocol stack, the production retrospectives, is a generalization of what SWE-agent demonstrated.

The 1947 cockpit study that the SWE-agent paper drew on

The SWE-agent authors did not invent the principle. They named it explicitly: human-factors engineering. The lineage is older than computing.

In 1947, Paul Fitts and Richard Jones published Analysis of Factors Contributing to 460 Pilot Error Experiences in Operating Aircraft Controls, a study commissioned by the USAF Aero Medical Laboratory after a wave of post-war crashes that the Air Force had been classifying as pilot error. Fitts and Jones interviewed the pilots and looked at the cockpits. What they found was not pilot error. They found that the same controls were laid out differently in different aircraft, that visually identical levers performed wildly different functions, and that under stress, experienced pilots reliably reached for the wrong control because the environment had been designed without regard for how humans actually behave.

Their conclusion reframed the entire field. Stop trying to train better operators. Redesign the environment. The cockpit is the variable.

That conclusion seeded human-factors engineering, was propagated through Don Norman’s The Design of Everyday Things (1988), Atul Gawande’s Checklist Manifesto (2009), and the surgical and ICU checklist literature, which has demonstrably saved lives by changing the environment rather than the operator. The SWE-agent paper put LLMs in the operator chair and applied the same logic. The ACI is a cockpit redesign for an agent.

This matters because it places harness engineering inside an eighty-year tradition with measurable outcomes. The discipline is not a fashion. It is the latest instance of a principle that has held true every time it has been applied: when an operator keeps repeating the same mistake, the environment is the variable.

Mechanical sympathy applied to a new substrate

The cockpit metaphor is one lens. The other, and the one that hits software engineers the hardest, is mechanical sympathy.

The phrase was coined by racing driver Jackie Stewart, who said you cannot drive a car fast unless you understand how it works. Martin Thompson brought it into software engineering around 2011 with the LMAX Disruptor, demonstrating that you could process millions of operations per second on commodity hardware if you wrote code that respected how the underlying machine actually behaves: CPU cache lines, branch prediction, memory hierarchy, false sharing, page faults. Mechanical sympathy is the discipline of writing software that adapts to the substrate on which it runs, rather than fighting it.

Harness engineering is mechanical sympathy applied to a new substrate. The substrate is the LLM, the context memory, and the attention budget. Like every substrate, it has named failure modes that the discipline teaches you to design around.

Mechanical sympathy vs harness engineering:

Mechanical sympathy (software)

* Write software that adapts to how the CPU works

* …how memory works

* …how the disk works

* Work around: cache misses, branch mispredictions, false sharing, page faults

Harness engineering (AI agents)

* Write agents that adapt to how LLMs work

* …how context memory works

* …how the attention budget works

* Work around: context rot, context panic, lost-in-the-middle, U-shaped attention

The four AI-side failure modes are real, named, measurable phenomena that production engineers hit constantly. Context rot is the documented decay in model performance as a context window fills with stale or low-signal tokens. Context panic is the failure mode where an agent under context pressure starts skipping steps and short-circuiting plans. Lost-in-the-middle is the now-replicated finding that information buried in the middle of a long prompt is reliably under-attended relative to the beginning and end. U-shaped attention is the broader generalization. None of these were nameable two years ago. All of them now have remediation patterns the harness can apply: context compression, working-memory discipline, retrieval ordering, structured note-taking, and sub-agent isolation.

This framing tells a tighter story about the lineage. Hardware mechanical sympathy taught software engineers to respect cache lines and memory layout around 2011. The SWE-agent paper in 2024 marked the moment when mechanical sympathy crossed from CPUs to LLMs. SWEs influenced projects such as Claude Code, OpenCode, Gemini CLI, and more. The same insight, applied to a new substrate, more than tripled coding-agent performance without changing the model. Harness engineering is the generalization of that insight to production agent systems. Three generations of the same idea, applied at progressively higher abstractions: CPUs, then agent-computer interfaces, now full agent runtimes.

How the practice became a named discipline: the late-2025 to early-2026 convergence

The practice of building harnesses is older than the name. Production teams shipping agentic systems through 2024 and 2025 were already building tool layers, context-assembly pipelines, validators, memory tiers, observability spans, and recovery loops around their models. Anyone who tried to ship Claude Code, OpenAI Codex, or Cursor on real codebases knew the model alone was not enough. The work had no shared vocabulary. Every team called it something different (the wrapper, the agent loop, the orchestration layer, the runtime), and each thought its version was custom.

Anthropic seeded the vocabulary first. Through the second half of 2025, while most teams still called the layer around the model “the wrapper” or “the agent loop,” Anthropic was already publishing engineering writing that used the term “harness” as a term of art. Effective Context Engineering for AI Agents (28 September 2025) named context engineering a discrete engineering concern with its own patterns, separate from prompt engineering, and a key part of harness engineering. Effective Harnesses for Long-Running Agents (25 November 2025) went further, naming the harness itself a discrete artifact with a discrete set of design problems. By the time the rest of the industry caught up with the term in early 2026, Anthropic had been publishing about harnesses in print for months.

The credit Anthropic deserves for this discipline goes well beyond the vocabulary. The Model Context Protocol, the Agent Skills Open Standard, and the subagent design pattern. This subagent pattern was introduced by Claude Code (also in the Claude Agent SDK, derived from Claude Code) and later adopted by Codex, OpenCode, Gemini CLI, and LangChain Deep Agents. These are the load-bearing primitives that the rest of the harness ecosystem now builds on. Anthropic also chose to open these primitives rather than keep them proprietary: MCP was donated to the Linux Foundation in December 2025, and Agent Skills was released as an open standard with a public specification at agentskills.io.

The Claude Certified Architect: Foundations exam, released by Anthropic in 2025, tests practitioners on patterns that map almost one-to-one onto what this article calls harness engineering. If the certification were named today and vendor-agnostic, rather than in 2025, “Certified Harness Engineering Foundations” would not be an unreasonable title. The exam is, in substance, a certification in harness engineering.

In February 2026, Mitchell Hashimoto, co-founder of HashiCorp, wrote a blog post about his personal AI-adoption journey, in which he used the phrase “harness engineering” to describe the systematic practice of fixing agent mistakes by improving the harness rather than the prompt. Anthropic had the word; Hashimoto turned it into the name of a discipline. The framing landed.

On February 11, 2026, OpenAI followed with a formal definition in their post about building a million-line production codebase entirely with Codex agents. They described their primary engineering challenge not as model capability but as designing the environments, feedback loops, and control systems around the model. That post is what made the term institutional, in the sense that two of the three frontier labs were now using the same vocabulary in their public engineering writing.

From February to March 2026, Martin Fowler’s site, LangChain, and Cobus Greyling wrote follow-up essays that distilled the discipline into formulas a working engineer could quote. Birgitta Boeckeler, writing on Fowler’s site, framed the harness as the tooling and practices used to keep AI agents in check, naming three concerns specifically: context engineering (what the model sees), architectural constraints (what the model is allowed to do), and error garbage collection (continually pruning bad artifacts and drift before they propagate). LangChain compressed the whole picture into one formula:

Agent = Model + Harness

The model provides raw intelligence. The harness manages memory, tools, retries, human approvals, and observability so the model can focus on reasoning.

Then, on 23 March 2026, Anthropic published Harness Design for Long-Running Application Development, the most complete published reference design for the discipline to date. It is not a short-form essay; it is a full reference architecture covering context assembly, memory tiers, evaluation gates, recovery loops, and the operational patterns long-running agents need. If you read one document on harness design, that is the one. The post effectively closed the convergence window: the lab that first used the word “harness” in print also published the reference design that the rest of the industry now points to.

By April 2026, the term was in working use across major AI engineering teams, vendor blogs, and production retrospectives. The Hashimoto post and the OpenAI post were the moment the practice and the name converged. Anthropic’s September 2025, November 2025, and March 2026 posts established the discipline’s working vocabulary and reference architecture. The discipline is 12 months old as a named entity and roughly 3 years old as a practice.

What lives in the harness, and what lives in the model

Useful agent design depends on knowing exactly which dimensions of an agent are model concerns and which are harness concerns.

The cleanest working model I have found is six-dimensional:

Agent = Perception + Brain + Memory + Planning + Action + Collaboration

* Perception is how the agent receives and preprocesses inputs (text, images, structured data, tool responses).

* Brain is the reasoning engine, often a family of models routed by the harness (a fast model for extraction, a stronger model for orchestration, a frontier model for high-stakes decisions).

* Memory is its own engineering discipline, with short-term, working, and long-term layers, distinct from the reasoning engine.

* Planning is either a ReAct loop (reason and act at each step) or a plan-and-execute approach (decompose upfront, execute steps in parallel where possible).

* Action is increasingly code-as-action: the agent writes a short script that calls multiple tools, handles retries in code, and returns a single clean output, rather than streaming individual tool calls through a loop.

* Collaboration is now a protocol-level concern, governed by four open standards that operate at different layers. MCP (Model Context Protocol) is the vertical interface between an agent and its tools. A2A (Agent-to-Agent Protocol) is the horizontal interface between agents. Then, a quasi-standard is delegating tasks to subagents within a process to keep the main orchestrator agent’s context clean. AG-UI is the frontend interface between an agent and its human user. The Agent Skills Open Standard is the capability acquisition interface.

Two of those six dimensions (Perception and Brain) are largely shaped by the model. The other four (Memory, Planning, Action, Collaboration) are largely shaped by the harness. That ratio, four to two, is the answer to “where does the engineering effort actually go?” It goes into the harness.

What a harness is not

The single most common misreading of the term comes from the failure-prevention framing. The harness is not a try/catch block around the model. It is not a guardrail in the moral-panic sense. It is not a wrapper whose purpose is to keep the model from saying something embarrassing.

The harness is what enables the model to do work it could not do alone. The cockpit metaphor is exact. A pilot in a 1944 fighter and a pilot in a modern fly-by-wire fighter have similar reflexes; the difference in what they can accomplish is overwhelmingly in the cockpit, the avionics, and the airframe. Same operator, different envelope. Harness engineering is what builds the new envelope.

The capability framing matters because it changes what you optimize for. If you treat the harness as a failure-prevention measure, you measure it by fewer bad outcomes. If you treat it as capability enabling, you measure it by larger, longer, more autonomous work successfully completed. Production teams that have made the shift in framing report that the second metric is the one that actually moves the business.

Why the timing is not a coincidence

Three forces converged in the last twelve months on top of the SWE-agent foundation, and none of them coordinated.

First, the terminology crystallized in February 2026. Hashimoto’s essay, Birgitta Boeckeler’s three-pillar definition on Martin Fowler’s site, LangChain’s “Anatomy of an Agent Harness,” and Anthropic’s harness design paper appeared within weeks of one another. They converged because the problem they were each solving had the same shape. AGENTS.md, an open convention that emerged from this work, has been adopted by more than 60,000 projects in under a year.

Second, the frameworks stabilized. Claude Agent SDK, LangChain Deep Agents, and the rebuilt OpenAI Agents SDK all reached general availability within the last six months. The four-protocol stack underneath them stabilized at the same time: MCP was donated to the Agentic AI Foundation under the Linux Foundation in December 2025; A2A reached version 1.0 in early 2026 with backing from more than 150 organizations; AG-UI achieved native support in Amazon Bedrock AgentCore and Microsoft Agent Framework; the Agent Skills Open Standard reached implementations across Claude Code, OpenAI Codex CLI, Cursor, GitHub Copilot, Goose, and Gemini CLI.

Third, the regulatory clock arrived on the same calendar. EU AI Act enforcement deadlines hit in 2026. NIST’s AI Risk Management Framework is now a de facto standard for U.S. federal contractors. The FDA published draft guidance for AI in regulated settings. Compliance teams are now asking engineering teams to demonstrate exactly the auditability that a harness provides.

Naming, frameworks, protocols, and regulation. All four arrived atop a four-year design lineage. That is why “harness engineering” stuck, and “agent loop” did not.

What does this mean if you are building agents today

You are doing harness engineering, whether you call it that or not. The cost of doing it without shared vocabulary is real. Teams reinvent the same patterns. They miss the same failure modes. They rebuild the same infrastructure three times.

The vocabulary is now stable enough to use. The reference designs are public (Anthropic’s three harness posts, the SWE-agent ACI ablation, LangChain’s Deep Agents harness commentary). The pattern catalogs are converging across vendors, with no coordination, on the same shapes. The open protocols underneath the harness layer (MCP, A2A, AG-UI, Agent Skills) are stable enough to build on.

If you have been calling it the wrapper, the loop, or the runtime, this is the moment to switch. The harness is the right name. The discipline is harness engineering. And the principle, the one Fitts and Jones articulated in 1947, and the SWE-agent team applied to LLMs in 2024, is the same in every generation: when the operator keeps making the same mistake, the environment is the variable.

Coda

Twelve months ago, building production agents felt like running through a sliding glass door we did not see. We got cut. We learned. The discipline now has names for what was behind the glass: context rot, drift, the harness, the cockpit, the four protocols, and naming, which is what puts a handle on the door. The next team does not have to bleed to get through. They can see the door, find the handle, and walk through. That is what the cockpit does for the pilot. That is what the harness does for the agent. And that is what a named discipline does for the field.

References

* Yang, Jimenez, Wettig, Lieret, Yao, Narasimhan, Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024; arXiv 2405.15793.

* Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.

* Fitts and Jones. Analysis of Factors Contributing to 460 Pilot Error Experiences in Operating Aircraft Controls. USAF Aero Medical Laboratory, 1947.

* Don Norman. The Design of Everyday Things, Revised and Expanded Edition. Basic Books, 1988/2013.

* Atul Gawande. The Checklist Manifesto. Metropolitan Books, 2009.

* Martin Thompson et al. LMAX Disruptor (mechanical sympathy in high-performance Java systems), circa 2011.

* Anthropic. Effective Context Engineering for AI Agents. September 2025.

* Anthropic. Effective Harnesses for Long-Running Agents. November 2025.

* Anthropic. Harness Design for Long-Running Application Development. March 2026.

* Mitchell Hashimoto. Personal blog, February 2026 (coined harness engineering as a discipline name).

* OpenAI. Building a million-line production codebase with Codex agents. February 11, 2026.

* Birgitta Boeckeler / Martin Fowler’s site. Three-pillar harness definition (context engineering, architectural constraints, error garbage collection), February 2026.

* LangChain. Anatomy of an Agent Harness and Improving Deep Agents with Harness Engineering, 2026.

* Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024.

* Du et al. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. EMNLP Findings 2025; arXiv 2510.05381.

About the Author — Claude Certified Architect

Rick Hightower is a former Senior Distinguished Engineer at a Fortune 100 company, focusing on delivering ML / AI insights to front-line applications, and a practitioner building multi-agent production systems. Follow him on SubStack and Medium for more hands-on agent engineering content. You can also book him to speak and train your team: Check out Rick Hightower’s SpeakerHub.

Rick Hightower helps companies become AI-first through practical mentoring, executive and team training, and custom AI solution development. He is a former Senior Distinguished Engineer at a Fortune 100 company, where he focused on bringing ML and AI insights into real front-line business applications.

Subscribe to Rick’s newsletter to see videos and guides.

Rick is a Claude Certified Architect, AI systems practitioner, and builder of production multi-agent systems. He is currently working on authoring a book on Harness Engineering with Manning publishing. He created Skilz, a universal agent skill installer supporting 30+ coding agents including Claude Code, Gemini, Copilot, and Cursor, and co-founded one of the largest agentic skill marketplaces.

Today, Rick and the Spillwave team works with leaders and teams who want to move beyond AI experiments and build real AI capability inside their companies. He helps organizations adopt AI safely, train their people, redesign workflows, and build practical AI systems that create measurable business value.

Ready to make your company AI-first? Connect with Rick on LinkedIn, Substack or Medium, book him to speak or train your team, or visit Spillwave to explore mentoring, training, and custom AI solutions for your organization.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit rickhigh.substack.com/subscribe

Video: Claude Agent SDK: You Already Wrote the Agent Loop. The Claude Agent SDK Deletes It.

Rick Hightower — Mon, 01 Jun 2026 02:37:35 GMT

You have written the tool-use loop five times. It stopped feeling like a rite of passage and started feeling like plumbing you keep reinventing. Every developer who has called an LLM with tools has hand-written the same tool-use while loop. The Claude Agent SDK is the layer that deletes it, and naming the difference between the two SDKs is the whole point of getting started.

In this video: You will learn what the Claude Agent SDK is, and the one distinction that makes it click: the difference between an SDK that gives you the model and an SDK that gives you the model with the tool loop already built. We cover why the agent loop is plumbing you keep reinventing, when to reach for the SDK versus the Claude Code CLI, how to install it without losing ten minutes to a version error, and how to run a first agent that reads a real file and fixes a real bug.

Hightower’s AI Harness Engineering is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Now free.

You know the loop. You call the model, it asks to use a tool, you run the tool, you stuff the result back into the next request, and you do it again, and again, until the model finally stops asking. The first time you wrote it, the loop felt like a rite of passage. The fifth time, parsing tool calls, threading results, and guessing when to stop, it just felt like plumbing you keep reinventing.

The Claude Agent SDK is the layer that makes that plumbing disappear. You hand Claude a prompt and a set of tools, and Claude runs the loop: it reads files, runs commands, edits code, and decides when it is done. You configure the agent. Claude executes it.

This article sets up that distinction, gets the SDK installed, and ends with a working agent that reads a real repository and fixes a real bug. That repository is the spine of a whole series. We build one agent, a code-maintenance agent that works on a small Python project called buggy-shop, and grow it part by part: streaming, permissions, sessions, custom tools, subagents, hooks, observability, and a hardened production deployment. Nothing gets thrown away. Each capability is one the previous step made you wish you had.

Two SDKs, and why the difference is the whole point

If you have called Claude before, you have probably used the Anthropic Client SDK. It gives you direct API access: you send a prompt, you get a response, and when the model wants a tool, you implement the execution yourself. That is the loop you have written.

The Agent SDK gives you the same model with the tool loop already built. The contrast is easiest to see side by side. Python then TypeScript versions.

# Client SDK: you implement the tool loop response = client.messages.create(...) while response.stop_reason == “tool_use”: result = your_tool_executor(response.tool_use) response = client.messages.create(tool_result=result, **params) # Agent SDK: Claude handles tools autonomously async for message in query(prompt=”Fix the bug in auth.py”): print(message)

// Client SDK: you implement the tool loop let response = await client.messages.create({ ...params }); while (response.stop_reason === “tool_use”) { const result = yourToolExecutor(response.tool_use); response = await client.messages.create({ tool_result: result, ...params }); } // Agent SDK: Claude handles tools autonomously for await (const message of query({ prompt: “Fix the bug in auth.ts” })) { console.log(message); }

The bottom half of each snippet is the entire point. There is no executor to write, no stop_reason to check, and no result threading. You iterate over messages as Claude works, and the loop runs itself. The SDK ships with built-in tools for reading files, running shell commands, editing code, and searching, so your agent can start doing real work without you wiring up a single tool by hand.

Look at the diagram and notice where the work lives. On the Client SDK path, the red box, the executor, is yours: you write the code that runs the tool and threads the result. On the Agent SDK path, that box is inside the library. You never see it. You configure the agent, and the loop runs itself.

Thanks for reading Hightower’s AI Harness Engineering! This post is public so feel free to share it. It helps a lot.

SDK or CLI? You will probably use both

There is a second tool worth placing on the map: the Claude Code CLI. It runs the same engine as the Agent SDK, just behind a different interface. The CLI is for interactive work at your terminal. The SDK is for code: production applications, CI pipelines, and anything you want to run without a human typing.

A rough guide to which one to reach for follows. The CLI fits interactive and throwaway work. The SDK fits anything programmatic or automated.

* Interactive development: CLI

* One-off tasks: CLI

* Custom applications: SDK

* CI/CD pipelines: SDK

* Production automation: SDK

Most teams end up using both: the CLI for daily hands-on development, and the SDK for production. Because they share an engine, the mental model and the workflows carry directly from one to the other. What you learn building agents here applies the moment you drop into the CLI, and the reverse is also true. The decision is not “which tool do I commit to.” It is “is a human typing right now.”

Install it

Pick your language. Both packages install in one line.

pip install claude-agent-sdk

npm install @anthropic-ai/claude-agent-sdk

A couple of notes will save you a confusing first ten minutes. The TypeScript package bundles a native Claude Code binary as an optional dependency, so you do not install Claude Code separately. This series targets Opus 4.7, the model string claude-opus-4-7, which requires a recent version of the Agent SDK.

Gotcha: if your first run throws a thinking.type.enabled API error, you are on an SDK version older than 4.7 support. Upgrade the package and the error clears.

One more thing before code. For production agents, authenticate with an Anthropic API key, not a claude.ai login. Anthropic does not permit third-party products built on the Agent SDK to use claude.ai login or rate limits unless separately approved. Set up key-based authentication from the start and you will not have to rework it later.

Hello, agent

The smallest useful program is not “hello world.” It is an agent that reads a file, finds a bug, and fixes it, because that is the shape of everything that follows.

# Python import asyncio from claude_agent_sdk import query, ClaudeAgentOptions async def main(): async for message in query( prompt=”Find and fix the bug in auth.py”, options=ClaudeAgentOptions(allowed_tools=[”Read”, “Edit”, “Bash”]), ): print(message) # Claude reads the file, finds the bug, edits it asyncio.run(main())

//TypeScript import { query } from “@anthropic-ai/claude-agent-sdk”; for await (const message of query({ prompt: “Find and fix the bug in auth.ts”, options: { allowedTools: [”Read”, “Edit”, “Bash”] }, })) { console.log(message); // Claude reads the file, finds the bug, edits it }

Read what is actually happening. You give query() a prompt and a list of allowed tools. Read lets Claude open files, Edit lets it change them, and Bash lets it run commands. Then you iterate. Each turn of the loop yields a message: Claude’s reasoning, a tool it is calling, or a result coming back. You did not parse a tool call or decide when to stop. Claude did, and it stopped when the bug was fixed.

The sequence diagram shows the shape of every agent you will build. Each round trip is one turn. Claude asks for a tool, the SDK runs it, the result feeds back, and Claude decides what to do next. When Claude produces a message with no tool calls, the loop ends. You never scheduled any of it.

That print(message) is deliberately raw. It dumps every message object, which is exactly what you want the first time, because seeing the loop’s actual output is how the loop stops being abstract. The next step is to pick those messages apart by type and turn the firehose into something a human would want to watch.

Where this goes from here

Here is the arc, so you can see why the order is what it is. The next step takes apart the agent loop itself: the turn-by-turn cycle, the built-in tools, and the day-one safety valves (maxTurns, maxBudgetUsd, and the effort control) that keep a runaway agent from becoming a runtime problem instead of just a billing one. From there the work climbs through streaming and live UX, permissions and approvals, sessions and rewinding, on-disk project context, custom tools and MCP and subagents, hooks, structured output with full observability, and finally a locked-down production deployment. The final step packages everything you have built into a reusable plugin.

Every step adds one layer the previous step left you reaching for. That is the design.

Do this today

Five minutes turns this article from reading into muscle memory. Do these in order:

* Install the SDK in your language of choice: pip install claude-agent-sdk or npm install @anthropic-ai/claude-agent-sdk.

* Confirm your version supports Opus 4.7. You need a recent version of the Agent SDK. If a first run throws a thinking.type.enabled error, upgrade the package.

* Set up an Anthropic API key, not a claude.ai login, so production authentication is right from the start.

* Run the hello agent against a file with an obvious bug, and watch every raw message stream by.

* Read the loop. Notice that you never parsed a tool call or decided when to stop. Claude did both.

The takeaway

The agent loop was never the interesting part of your work. It was the tax you paid to get to the interesting part. The Claude Agent SDK collects that tax once, inside the library, and hands you back the thing you actually wanted: a model that can act, with you deciding what it is allowed to do.

So do the five-minute version of this whole article. Install the SDK, point the hello agent at a file with an obvious bug, and watch the messages stream by. Once you have seen the loop run itself, you will never want to write while stop_reason == "tool_use" again. Good. That is the loop you came here to delete.

Join Rick Hightower’s subscriber chat

Available in the Substack app and on web

Join chat

This is Part 1 of “Building with the Claude Agent SDK,” a 14-part guide to building production-ready AI agents.

About the Author — Claude Certified Architect

Rick Hightower is a former Senior Distinguished Engineer at a Fortune 100 company, focusing on delivering ML / AI insights to front-line applications, and a practitioner building multi-agent production systems. Rick is a Claude Certified Architect. Follow him on Medium for more hands-on agent engineering content. You can also book him to speak and train your team: Check out Rick Hightower’s SpeakerHub.

Rick created Skilz, the universal agent skill installer that supports 30+ coding agents, including Claude Code, Gemini, Copilot, and Cursor, and co-founded the world’s largest agentic skill marketplace. Connect with Rick Hightower on LinkedIn or Medium. Check out SpillWave, your source for AI expertise.

Rick has been actively developing generative AI systems, agents, and agentic workflows for years. He is the author of numerous agentic frameworks and developer tools and brings deep practical expertise to teams adopting AI. He enjoys writing about himself in the 3rd person.

Rick also wrote a Claude Certified Architect (CCA) series of articles that have a lot of useful information on writing agentic AI systems. Many ideas captured in the CCA and the exam prep Rick wrote echo what you see in this article. If you want to improve your ability to create well-behaved AI agents, studying for the CCA Exam is a good place to start.

CCA Exam Prep on Agentic Development

* Claude Certified Architect: The Complete Guide to Passing the CCA Foundations Exam

* CCA Exam Prep: Mastering the Code Generation with Claude Code Scenario

* CCA Exam Prep: Mastering the Multi-Agent Research System Scenario

* CCA Exam Prep: Structured Data Extraction

* CCA: Master the Developer Productivity Scenario

* Claude Certified Architect: Master the CI/CD Scenario

* CCA Exam Prep: Mastering the Customer Support Resolution Agent Scenario

* Get the complete reading list for CCA-F exam prep articles from this Claude Certified Architect Exam Prep list.

Rick also wrote a series on harness engineering and how to improve agentic systems using harness engineering for feedback loops and adversarial agents. These articles also go hand in hand with this article.

Harness Engineering Articles

* The $9 Disaster: What Anthropic’s Harness Design Paper Teaches Us About Building Autonomous AI

* Harness Engineering vs Context Engineering: The Model is the CPU, the Harness is the OS

* LangChain Deep Agents: Harness and Context Engineering: Memory, Skills, and Security

* Beyond the AI Coding Hangover: How Harness Engineering Prevents the Next Outage

* LangChain’s Harness Engineering: From Top 30 to Top 5 on Terminal Bench 2.0

* Anthropic’s Harness Engineering: Two Agents, One Feature List, Zero Context Overflow

* OpenAI’s Harness Engineering Experiment: Zero Manually-Written Code

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit rickhigh.substack.com/subscribe

Claude Code Day by Day Series Day 4

Rick Hightower — Sun, 31 May 2026 18:11:59 GMT

Why Your AI Coding Sessions Fall Apart at Hour Three (and the Workflow That Fixes It)

Claude Code already ships a complete project-management system. Most developers never find it, so they bolt on tooling that the tool was designed to make unnecessary.

In this article: Working engineers love Claude Code for writing functions and quietly resent it for building features. The fix is not a plugin or a hand-rolled TODO.md. It is spec-driven development with Claude Code, built from four native layers: plan mode, the interview-to-spec pattern, the live task list, and a durable todos.json mirror. You will learn when to use each layer, how to wire them together, and how to hand Claude a spec and walk away.

You ask Claude Code to “implement OAuth.” It starts strong. Then, three hours later, the code is half-finished, the conversation has wandered through six detours, and you cannot remember which acceptance criteria still fail. So you do what everyone does. You reach for an overlay: a project-management plugin, a discipline framework, or a TODO.md you keep nagging the model to update.

Here is the uncomfortable truth. None of that is necessary. Claude Code already ships the planning, the written record, and the live progress tracker. Your sessions fall apart not because the tool lacks project management, but because nobody told you which layers it already includes. Spec-driven development with Claude Code is not a feature you install. It is a workflow you assemble from primitives that are already on disk.

The four layers, and the one mistake everyone makes

Project tracking in Claude Code has four layers. Each does exactly one job. The mistake almost everyone makes is forcing a single layer to do all four, then concluding the tool is weak. It is not weak. It is layered.

Plan mode is read-only research that ends in a proposed plan. It lives in the session and evaporates when the session ends. It answers “what should we do?”

SPEC.md is the agreed approach, committed to the repo at docs/specs/.md. It is durable but static. It answers “what did we agree to?”

The task list is the live working checklist. It runs in-session and, crucially, survives context compaction. It answers “where are we right now?”

.claude/todos.json plus TODO.md is the durable mirror, machine-readable and human-readable, kept current by a small skill and hook. It answers “where are we across sessions?”

Plan mode is fast but ephemeral. A spec is durable but frozen. The task list is live but trapped in the session. The JSON mirror outlives every session but goes stale the moment you stop syncing it. You want all four, each pulling its own weight. The discipline is refusing to ask any one of them to be the others.

The interview-to-spec workflow

This is the highest-leverage pattern in the whole Claude Code planning toolkit, and it comes straight from Anthropic’s official best-practices documentation. For any feature bigger than an afternoon, this is where you start.

The prompt template, almost verbatim:

I want to build [brief description]. Interview me in detail using the AskUserQuestion tool. Ask about technical implementation, UI/UX, edge cases, concerns, and tradeoffs. Don't ask obvious questions, dig into the hard parts I might not have considered. Keep interviewing until we've covered everything, then write a complete spec to docs/specs/.md.

Paste that into a fresh session with whatever rough brief you have. The AskUserQuestion tool gives Claude a structured way to interview you. Not “what’s the deadline?” but the hard questions you have not thought through. Should the OAuth refresh flow be silent or surfaced to the user? Do you need multiple identity providers per account, or just one? When the access token expires mid-request, do you retry once or fail loudly?

Those are the questions a senior engineer asks in a design review. You get them now, before the first line of code, instead of in a postmortem.

Ten to twenty minutes later, Claude writes a complete spec to disk. Open it. Read it. Fix anything wrong. Commit it. The spec is the deliverable of the planning phase, full stop.

Then comes the step most people skip, and it matters far more than it sounds. Start a fresh session to execute the spec. Use /clear, or open a new terminal. The interview session is full of “we considered X but rejected it because Y” detours. Those detours were essential for writing the spec and are pure noise for executing it. A fresh session that reads the committed spec from disk is sharper, faster, and produces noticeably better code. The spec is a clean handoff between research-Claude and build-Claude, even though both are the same model.

Where the file goes: SPEC.md at the repo root for a one-off feature, or docs/specs/.md once you have done this more than twice. Commit them. Six months later, the spec is the single best answer to “why does the code look like this?” For a small change, skip the spec entirely and use plan mode. Not everything needs ceremony.

A spec from this workflow typically has: problem statement, goals and non-goals, proposed approach, considered alternatives with reasons for rejection, acceptance criteria, and open questions. You do not have to request that format; Claude defaults close to it. If your team has a template, point at it.

Plan mode: the lightweight version

Not every change earns a full interview round. For the rest, plan mode does the same job in miniature. There are three ways in:

* Shift+Tab cycles the mode until the status bar shows plan.

* Prefixing a single prompt with /plan runs one prompt in plan mode without changing the session mode.

* claude --permission-mode plan starts the whole session in plan mode.

In plan mode, Claude reads files and runs exploration commands but does not touch your source. When the plan is ready, it presents the plan with explicit options: approve and start in auto mode, approve and accept edits, approve and review each edit manually, or keep planning.

Two power moves are worth memorizing.

Ctrl+G edits the plan in your editor before you approve it. The plan is plain markdown. Open it, strike the parts you disagree with, add what is missing, save, then accept. This single keystroke is what turns plan mode from “Claude proposed something” into “we agreed on something.” Most people have never used it.

Saving the plan to the repo before approving makes it outlive the session: Write this plan to docs/plans/auth-migration.md, then exit plan mode and start implementing. That is the poor-man’s spec. No interview, no design review, just a written record of what you decided. Plenty of features deserve nothing more.

So when do you reach for which? Plan mode is for “I know roughly what I want, propose it.” Interview-to-spec is for “I have a fuzzy goal, help me think it through.” The deciding question is whether the design space is mostly resolved or mostly open.

The built-in task list: your live working checklist

Here is the answer to “where is the list of todos that stays current as we work?” Claude Code has had it built in for months. Most people have never opened it.

Since Claude Code v2.1.142 (TypeScript SDK 0.3.142), sessions use structured Task tools: TaskCreate, TaskUpdate, TaskGet, and TaskList. They maintain a live task list during any non-trivial work. Earlier versions used TodoWrite, which did the same job with a less flexible shape. The behavior is identical: Claude spots tasks as it works, creates them, marks them in-progress when it starts each one, and completes them as it finishes.

It activates on its own when a request needs three or more distinct actions, when you hand Claude a list of items, or when the operation is non-trivial enough to benefit from tracking. For anything bigger than a quick edit, it simply appears.

How do you see the list? The interactive-mode docs mention a Ctrl+T toggle, but in practice it is unreliable across terminals; on macOS that chord is bound to other things in iTerm2 and Apple Terminal. The reliable move is to ask Claude in plain English:

* “Show me all tasks.”

* “What’s left on the task list?”

* “Mark task 3 as done.”

* “Add a task for updating the changelog.”

Claude reads with TaskList, updates with TaskUpdate, creates with TaskCreate. You never learn the tool names. Describing what you want works fine, and the answer comes back as a clean rendered list.

Why does this beat a TODO.md for in-session work? The list survives /compact, where file references get summarized away. Claude updates it as part of doing the work, not as a separate ceremony. And asking for it gives you a current snapshot, not a stale file.

To share one list across sessions, set CLAUDE_CODE_TASK_LIST_ID when launching Claude:

CLAUDE_CODE_TASK_LIST_ID=oauth-migration claude

Same env var, same named list under ~/.claude/tasks/, different session. That is how you carry a task list across /clear boundaries, across worktrees, or across days. Without it, each session gets a fresh list.

The durable mirror: todos.json plus TODO.md

The in-session task list is excellent for “where are we right now.” It is useless for “what should I read Monday morning?” For that you want a durable file in the repo, and the pattern that works keeps two files, one for machines and one for humans.

.claude/todos.json is the machine-readable mirror of the task list. Same shape as the Task tools’ internal state: easy to read programmatically, easy to diff, easy to merge when two collaborators touch it. It lives in .claude/ so it travels with the repo without cluttering the root.

TODO.md at the repo root is the human-readable rendering, generated from the JSON. This is what you actually cat with coffee: pretty markdown, organized by workstream, with each item’s status legible at a glance.

Why two files? Because the audiences want opposite things. A machine wants stable IDs, structured statuses, and dependency relationships. A human wants headers, checkboxes, dates, and permission to ignore everything already done. One file serving both means doing one job badly.

The todos.json shape maps cleanly to the Task tools’ internal state:

{ “version”: 1, “updated_at”: “2026-05-20T14:32:00Z”, “todos”: [ { “id”: “task-001”, “subject”: “Implement JWT refresh endpoint”, “description”: “POST /auth/refresh accepts a refresh token, returns a new access token. 401 on invalid/expired.”, “status”: “in_progress”, “active_form”: “Implementing JWT refresh endpoint”, “workstream”: “auth-migration”, “blocks”: [], “blocked_by”: [], “created_at”: “2026-05-20T10:14:00Z”, “completed_at”: null } ] }

The fields are the same ones the Task tools use internally, plus a workstream tag for grouping and timestamps so you can see what moved when. Every task moves through a small, predictable lifecycle, and that lifecycle is the same whether it lives in the session or in the JSON.

The matching TODO.md rendered from that JSON is plain markdown grouped by status, with an emoji per state and the task ID in parentheses. TODO.md is what you read. todos.json is what tooling reads and writes. Neither does the other’s job.

A skill plus a hook keeps everything in sync

So who keeps these two files in agreement with the live task list? The right answer is a skill that does it on demand, plus a hook that triggers the skill automatically. Build them in that order: skill first, hook second, because the skill is useful on its own before you ever wire the hook.

The skill is a regular project skill at .claude/skills/sync-todos/SKILL.md. Its job, in plain terms: call TaskList for the current in-session state, read .claude/todos.json if it exists, merge the two so in-session tasks win for anything in both and durable-only pending or blocked tasks are preserved as backlog, write the merged result back with a fresh updated_at, then regenerate TODO.md grouped by workstream and status. The skill’s description field is what Claude reads to decide when to invoke it autonomously, so write it with the keywords you would naturally type.

If you do not want to hand-write the YAML, the official skill-creator does it for you. Install it with /plugin install skill-creator@claude-plugins-official, then ask it to build a sync-todos skill that reads the task list with TaskList, mirrors it to .claude/todos.json, and renders TODO.md from the JSON. You get a working skill on disk in about ninety seconds.

To make the sync automatic, wire a hook. TaskCreated and TaskCompleted are real hook events that fire on every task creation and completion, passing the task ID, subject, and description as JSON on stdin. But the simpler and usually better answer is a Stop hook that runs the sync at the end of every turn:

{ “hooks”: { “Stop”: [ { “hooks”: [ { “type”: “prompt”, “prompt”: “Use the sync-todos skill to update .claude/todos.json and TODO.md from the current task list. Do not modify any other files.” } ] } ] } }

After every turn finishes, the hook fires, Claude runs sync-todos, and both files get rewritten. You never remember to sync. The Stop hook is one config block, and it catches everything that changed in the turn, including manual TaskUpdate calls that the lifecycle events would miss. The TaskCreated and TaskCompleted hooks are precise and skip turns that touch no tasks. Pick one. Do not run both, or you sync twice for nothing.

Two production notes. The skill needs Write and Edit permissions for .claude/todos.json and TODO.md; if Write sits in ask, the hook stalls waiting for approval, so pre-approve writes to those two paths. And commit both files to version control, since the whole point of a durable mirror is that the team can read it. Do not commit the in-session task list itself; that is working state, not a project deliverable.

The same skill can carry a second mode that renders TODO-public.md: no task IDs, no internal-only items, plain-language status, for stakeholders who will never read JSON. That gives you three views of one source of truth: todos.json for tooling, TODO.md for the dev team, TODO-public.md for outsiders. One skill, one hook, three views, zero manual sync.

Connecting a spec to autonomous execution

This is where the spec workflow earns its keep. With Auto Mode and /goal, you hand Claude a spec and walk away. Here is the full flow for a real feature.

Interview-to-spec. Run the interview prompt. Land docs/specs/oauth-migration.md on disk. Commit it.

Fresh session. /clear or a new terminal. Reset the context.

Set the goal.

/goal implement the spec in docs/specs/oauth-migration.md until all acceptance criteria hold and all tests pass

/goal sets a completion condition. After every turn, a small fast model checks whether the condition is met. If not, Claude starts another turn instead of returning control. If yes, the goal clears and Claude reports done.

Switch to Auto Mode. Press Shift+Tab until the status bar shows auto, or set defaultMode: "auto" in user settings if you have decided that is your default. Auto Mode approves tool calls within a turn using a classifier; /goal decides when to stop. The combination is the closest thing to autonomous execution without writing your own orchestration.

Walk away. Get coffee. Check on it from your phone via Remote Control. Claude works through the spec, updates the task list as it goes, and stops when the condition holds. A worked run finishes with a clear report: Goal achieved (2h 14m, 47 turns), followed by each acceptance criterion verified and the test count, 132 passing, 0 failing.

You wrote one prompt to start the interview and one to start execution. Everything else happened while you were elsewhere. This is the workflow that project-management overlays were always trying to give you, except every layer is native: the spec is a committed markdown file, the goal is a one-line command, the autonomy is built in, and the progress lives in version control.

Tracking progress across weeks and teammates

For multi-week or shared work, a few habits matter. Commit the spec; it is the source of truth for “what are we building” and the answer to “why does this work the way it does” months later. Commit both .claude/todos.json and TODO.md; with the hook wired, pull requests that touch one will touch both.

If your team already lives on a tracker, Linear, GitHub Issues, or Jira, use the relevant MCP server instead of .claude/todos.json. Same idea, different storage: Claude reads the tracker, proposes work, moves issues to in-progress, and posts results back. Solo projects do fine with the JSON-plus-skill setup; multi-person teams probably want everything centralized in the tracker they already have.

For visibility across parallel work, agent-view (a research preview as of Q1 2026, run with claude agents) shows every Claude Code session on your machine: running, blocked on you, or done. It is the cross-session dashboard for anyone juggling more than one job at a time.

Do this today

Pick a project that has been sitting half-finished and do exactly this:

* Run the interview prompt against the next feature on the list. Use the AskUserQuestion template verbatim. Let Claude interview you for fifteen minutes.

* Read the spec it writes to docs/specs/.md. Edit what is wrong. Commit it.

* Start a fresh session with /clear. Set /goal against the committed spec. Switch to Auto Mode with Shift+Tab. Walk away.

* Before you start the next feature, not after, build the sync-todos skill (use skill-creator if you do not want to hand-write YAML) and wire the Stop hook.

That is the whole workflow. Everything else in this article is what you reach for when this alone is not enough.

The point is to stop bolting things on

The reason your AI coding sessions fall apart at hour three is not a missing plugin. It is a missing mental model. Claude Code already gives you four layers, each doing one job well: plan mode for “what should we do,” a spec for “what did we agree to,” the live task list for “where are we now,” and the todos.json mirror for “where are we across sessions.” Wire them together and you get a written record, a live tracker, and a durable backlog without integrating anything.

Spec-driven development with Claude Code is not ceremony. It is the difference between asking a tool to write code and asking a collaborator to build a feature. The plan goes into version control. The work goes into version control. The progress tracker stays in sync on its own. The autonomy is a flag.

Stop coding blind. Write the spec, commit it, and let a fresh session build against it. The next time you come back to a project after a week away, the answer to “what was I doing?” will be sitting right there in the repo, waiting for you.

This is Part 4 of “Claude Code, Day-to-Day,” a 19-part guide to mastering Claude Code for working engineers.

This is part 4 of a larger series. This is a list of no paywall links for my subscribers.

Claude Code Series Friend Links

* Claude Code OS: https://medium.com/@richardhightower/your-ai-coding-agent-forgets-things-manage-the-context-window-to-10x-results-a7aa214712ea?sk=6ae9cebadb34a65dd59c919f3d94ddc1

* Claude Code First Hour: https://medium.com/@richardhightower/why-your-first-hour-with-claude-code-decides-everything-d2d3054a44ba?sk=24932592424fbdeebed7c8d3dbee3e24

* Claude Code Daily Ritual: https://medium.com/@richardhightower/the-ten-minute-ritual-that-decides-whether-claude-code-actually-helps-you-792fb80eb3e5?sk=bb5eaff3838a062e2d801fff43aaf19d

* Claude Code Task List: https://medium.com/@richardhightower/claude-code-spec-driven-development-why-your-ai-coding-sessions-fall-apart-at-hour-three-e7145128bfc0?sk=9ed945697db3db481146249fba494566

* Claude Code Permissions: https://medium.com/@richardhightower/claude-code-permissions-you-are-keyboard-mashing-y-here-is-how-to-stop-3dab0f743a66?sk=b19fb684d7d799183f3a87ede80a49f6

* Claude Code Subagents: https://medium.com/@richardhightower/claude-code-subagents-the-claude-code-feature-you-skip-every-day-and-why-it-quietly-wrecks-your-6ecde4db6d75?sk=52bd5726ea7463bd729be36775eb3e7c

* Claude Code Memory: https://medium.com/@richardhightower/claude-code-memory-why-you-keep-explaining-the-same-thing-to-claude-and-the-five-layers-that-fix-2bffcf182186?sk=14e4157e635da22c17ebe55a320b0061

* Claude Agent Skills: https://medium.com/@richardhightower/claude-codes-agent-skills-stop-retyping-the-same-prompt-a-practical-guide-to-claude-code-skills-b6081e5afbe9?sk=93f68bfb43c461e4506a81220a081f02

* Claude Hooks: https://medium.com/@richardhightower/claude-codes-hooks-stop-hoping-claude-remembers-your-standards-make-them-enforce-themselves-c8ed2b4bf38f?sk=52d9f444bb8e86b9c443e23292edcb60

* Claude MCP: https://medium.com/@richardhightower/claude-code-mcp-your-ai-says-the-code-works-can-it-prove-it-c46a52e84ea6?sk=7df99797dc56384e0d1f4c0437c2539a

* Claude Code Platforms:

* Claude Verification Loop: https://medium.com/@richardhightower/claude-code-verification-how-to-verify-ai-written-code-so-the-bugs-get-caught-before-they-ship-3036ff01bec0?sk=577f7fc8b02955158a3789a2d5b0557f

* Claude Autonomous work /batch, /loop, /goal: https://medium.com/@richardhightower/claude-code-the-autonomous-commands-that-finish-work-while-you-sleep-goal-loop-batch-etc-7acb82bf46b1?sk=6cfdf3f4acaaf66c68f1b00b6cfe00d8

* Claude Code Plugins: https://medium.com/@richardhightower/claude-code-plugins-your-claude-code-setup-is-trapped-in-one-repo-plugins-set-it-free-6adaa30fba4d?sk=496ceb3bf0f343c413a26031dbee46de

* Claude Worktrees and Agent Teams: https://medium.com/@richardhightower/claude-code-agent-teams-and-worktrees-one-claude-is-not-enough-running-parallel-sessions-without-b5d97ffc0d23?sk=f7f0df8f97be987f609203dcb3a0a17f

* Claude Code Personas: https://medium.com/@richardhightower/claude-code-casual-pro-elite-the-three-working-personas-of-claude-code-mastery-cbd55d6cfbbc?sk=2d541b8d70dc04af8d52a73ef3ddec38

* Claude Code Pitfalls: https://medium.com/@richardhightower/claude-code-pitfalls-claude-code-wont-do-what-you-told-it-a-troubleshooting-catalog-579b93d46115?sk=beca8bc447a99ee35ae8e9c156311bd3

* Claude Code Going Further: https://medium.com/@richardhightower/claude-code-advanced-six-frontiers-of-advanced-claude-code-where-daily-use-stops-being-the-edge-a2b65e5d1f94?sk=60ed209d3532efe3055aebc00233dab3

* Claude Code Quick Reference: https://medium.com/@richardhightower/claude-code-cheat-sheet-every-command-shortcut-and-config-template-that-actually-matters-88fbe108ff1c?sk=e1e91c931e97805d8f51bd5718e6ade8

About the Author — Claude Certified Architect

Rick Hightower is a former Senior Distinguished Engineer at a Fortune 100 company, focusing on delivering ML / AI insights to front-line applications, and a practitioner building multi-agent production systems. Rick is a Claude Certified Architect. Follow him on Medium for more hands-on agent engineering content. You can also book him to speak and train your team: Check out Rick Hightower’s SpeakerHub.

CCA Exam Prep on Agentic Development

* Claude Certified Architect: The Complete Guide to Passing the CCA Foundations Exam

* CCA Exam Prep: Mastering the Code Generation with Claude Code Scenario

* CCA Exam Prep: Mastering the Multi-Agent Research System Scenario

* CCA Exam Prep: Structured Data Extraction

* CCA: Master the Developer Productivity Scenario

* Claude Certified Architect: Master the CI/CD Scenario

* CCA Exam Prep: Mastering the Customer Support Resolution Agent Scenario

* Get the complete reading list for CCA-F exam prep articles from this Claude Certified Architect Exam Prep list.

Harness Engineering Articles

* The $9 Disaster: What Anthropic’s Harness Design Paper Teaches Us About Building Autonomous AI

* Harness Engineering vs Context Engineering: The Model is the CPU, the Harness is the OS

* LangChain Deep Agents: Harness and Context Engineering: Memory, Skills, and Security

* Beyond the AI Coding Hangover: How Harness Engineering Prevents the Next Outage

* LangChain’s Harness Engineering: From Top 30 to Top 5 on Terminal Bench 2.0

* Anthropic’s Harness Engineering: Two Agents, One Feature List, Zero Context Overflow

* OpenAI’s Harness Engineering Experiment: Zero Manually-Written Code

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit rickhigh.substack.com/subscribe