Harness Engineering: The Missing Layer Between AI Coding and Safe Automation for Financial Institutions
Harness Engineering isn’t about letting autonomous agents run the pipeline today. It’s about building the structure now that makes safe automation possible later.

Where Most Enterprise Engineering Teams Actually Are
Let’s start with reality.
At large financial institutions, most developers are using GitHub Copilot, Cursor, or Claude to write code faster. Some teams are using those tools locally to scaffold services, generate boilerplate, and draft unit tests. A smaller group is experimenting with agentic tools that can plan and execute multi-step coding tasks inside the IDE.
That’s where most enterprise teams are in March 2026.
Autonomous AI agents committing to repositories, running pipelines, or making deployment decisions on their own are not operating at scale. Regulation, change management, and plain institutional risk tolerance make that a 2027 or 2028 conversation at the earliest.
But a real problem is already here.
Developers are producing much more code than before. That code is moving through the same testing gates, the same security scans, and the same approval workflows that were designed for a slower pace. Test suites built for teams opening 10 PRs a week are now supporting teams opening 40. Security findings multiply. Review queues grow. Release managers, AppSec engineers, testers, and compliance teams absorb the extra load.
That is the gap Harness Engineering is meant to close.
What a Harness Actually Is
The term comes from OpenAI’s February 2026 write-up, Harness engineering: leveraging Codex in an agent-first world, about building and shipping an internal beta product with “0 lines of manually-written code.” Their setup was extreme: a small team running highly automated agent workflows at scale. Most corporate engineering organizations are nowhere near that.
But the core idea translates.
A harness is the set of constraints, context, and feedback loops built around AI-assisted software delivery so that the people downstream (testers, security engineers, release managers, and compliance teams) are not paying for every AI mistake.
The long-term advantage will not go to the organization with the biggest model. It will go to the organization with the best operating environment around that model. Less depends on raw model intelligence than on the quality of the system wrapped around it.
That system does not appear all at once. It develops in three stages: Foundation, Augmentation, and Automation.
Why this matters in a regulated environment
For U.S. financial institutions, AI adoption is not happening in a regulatory vacuum. Banks are largely expected to apply existing controls to new AI use cases: model risk management, third-party risk management, and explainability where decisions affect consumers. Harness Engineering matters because it turns AI-assisted work into something easier to govern: documented inputs, bounded workflows, traceable decisions, and clear review points.
The Three Stages of Harness Maturity
Most teams will move through the same progression.
Stage 1: Foundation
Foundation is where teams create the structure AI needs in order to be useful without creating chaos. Repository context starts to take shape. Core standards are written down. Specs begin to exist before implementation. Prompt patterns, pre-commit checks, and workflow discipline become shared practice instead of individual habit. At this stage, the goal is not to automate anything. It is to make AI-assisted development legible, repeatable, and easier to govern.
In banking, that foundation also supports model-risk discipline. Teams need to define intended use, constrain outputs, validate high-impact behavior, and keep humans accountable for decisions. A good harness makes those controls part of the engineering workflow instead of a separate after-the-fact exercise.
Stage 2: Augmentation
Once that structure exists, teams start using AI to strengthen the delivery system itself. AI no longer just helps developers write code faster. It helps generate test fixes, draft deployment documentation, explain security findings, summarize risk, and connect code changes back to standards and specs. Humans are still clearly in charge, but their workflows are now being amplified by AI at the points where delivery friction is highest.
Stage 3: Automation
Only after foundation and augmentation are in place does automation become safe. Agents begin acting inside the delivery process within defined authority boundaries. Some fixes can be made automatically. Some tests can update themselves. Some deployments can proceed without human intervention. Some rollback decisions can be triggered by policy and telemetry. Humans still govern the system, but they no longer have to manually execute every step inside it.
This sequence matters.
Automation only works when augmentation has already made workflows structured and machine-readable. Augmentation only works when foundation has already created reliable context and standards. Teams that try to skip ahead usually discover they are missing the very inputs automation depends on.
The Knowledge Base: The Layer Under Everything Else
Before testing, deployment, security, or compliance, there is a foundational layer underneath all of them: a continuously maintained knowledge base inside the repository.
OpenAI’s team identified this as the first and most important part of their harness in the same write-up above. For enterprise teams, it is probably the highest-leverage place to start.
This is also where Spec-Driven Development becomes practical rather than theoretical. For a useful overview of SDD, see Thoughtworks’ December 2025 article, Spec-driven development: Unpacking one of 2025’s key new AI-assisted engineering practices.
What the repository needs to know
When an AI coding tool has no meaningful context, it produces code that looks plausible but does not fit the system around it. It breaks local patterns, ignores service boundaries, and misses domain-specific constraints that experienced engineers carry in their heads.
That knowledge has to live in the repository itself: architecture decisions, service boundaries, data contracts, performance constraints, compliance requirements, and approved patterns. Not in Confluence. Not in Slack. Not in a senior engineer’s memory. In version-controlled files that live next to the code they describe.
OpenAI’s team tried the obvious shortcut first: one large instruction file containing every rule. It turned into a graveyard of stale rules. When everything is marked important, nothing is easy to use.
The better approach is modular, navigable documentation. Separate files for architecture, service boundaries, data contracts, deployment constraints, and operating standards. An agent working on a specific task should be able to find the right context without ingesting the whole repository.
That is the practical value of Spec-Driven Development. Specs are not just prompts for generating code. They are living artifacts that capture intent, constraints, and design choices for both humans and AI.
Two kinds of repository knowledge
A useful distinction comes from Birgitta Böckeler’s article on martinfowler.com, Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl: the memory bank versus specs.
The memory bank is the always-on layer. It includes architecture, module boundaries, data classification rules, approved patterns, anti-patterns, performance expectations, and regulatory constraints. It tells the AI what kind of system it is working in.
Specs are task-specific. They describe what a feature should do before code is generated: inputs, outputs, edge cases, constraints, and acceptance criteria. The AI uses the spec to generate the code, and the spec stays in the repository as the source of truth for future changes.
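As a sketch, a task spec in this style could be a short markdown file; the section names and content below are illustrative, not a standard:

```markdown
# Spec: limit-increase-request endpoint (illustrative example)

## Intent
Let a customer request a credit limit increase through the servicing API.

## Inputs
- customer_id (UUID, required)
- requested_limit (decimal, must exceed the current limit)

## Outputs
- 202 Accepted with a request_id, or a structured validation error

## Constraints
- No PII in logs (see the security standards in the memory bank)
- Decision reasons must be recorded for adverse-action traceability

## Acceptance criteria
- Requests at or below the current limit are rejected with a clear reason
- Every request emits an audit event that links back to this spec
```

The spec stays in the repository after implementation, so a future change starts from recorded intent rather than reverse-engineered behavior.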
Together, those two layers answer the two questions every AI system is implicitly asking:
What kind of world am I operating in?
What exactly am I supposed to build?
What this looks like in a bank
Large financial institutions already have most of this knowledge. It just is not written down in a form AI can use.
Approved cipher suites, PII handling rules, data residency requirements, service ownership, communication patterns, and change classifications already exist somewhere in policy documents, architecture boards, or the heads of senior engineers. The work is translating that into short, precise documents that live in the repo.
A simple starting structure might look like this:

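For illustration, assuming documentation lives under /docs (every filename here is hypothetical):

```
/docs
  /memory-bank
    architecture-overview.md
    security-standards.md
    performance-profile.md
    data-contracts.md
  /specs
    payment-limits.md        # one spec per feature
  /architecture
    adr-0001-event-sourcing.md
```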
These documents do not need to be long. A 300-word security standards file with two concrete examples is more useful than a 40-page policy PDF. The goal is not completeness. The goal is enough clarity that AI-generated code looks like it belongs in this system rather than in a generic example application.
The three stages applied to the knowledge base
Stage 1: Foundation
Start with the memory bank documents that would help in almost every AI coding session: architecture overview, security standards, and performance profile. Keep them short and commit them to the repo. Teams begin writing lightweight specs for selected features so intent exists before implementation. The goal is simple: create a dependable base of repository context that reduces avoidable AI mistakes.
Stage 2: Augmentation
Once those documents exist, they start shaping the workflow. Specs become expected before code generation. The memory bank has clear owners and update expectations. Specs are reviewed alongside code, so teams ask not only whether the code works, but whether it matches the spec. When code changes affect documented behavior, AI can draft documentation updates for human approval. At this stage, the knowledge base becomes load-bearing. The pipeline references it. SAST rules draw from it. Compliance evidence traces back to it.
Stage 3: Automation
Background agents scan for drift between documentation and implementation: data models that no longer match contracts, services that have grown beyond their documented boundaries, specs that no longer match behavior. They open PRs to fix inconsistencies and keep the knowledge base honest. At this stage, the spec is not just documentation. It becomes an operating input for automation.
The knowledge base makes every other domain better. Better context improves testing. Better standards improve security. Better mappings improve compliance. Better performance profiles reduce inefficient AI-generated code.
Build this layer first.
Testing
The problem now
A developer uses Cursor or Copilot to build a feature in two hours instead of eight. They open a PR. The existing test suite runs. Some tests fail, either because the AI introduced a bug or because it changed an interface the tests did not expect.
Now the developer has to debug failures in tests they did not write, for code they did not fully write themselves. That is slow, frustrating, and increasingly common.
There is another pattern too: AI-generated code usually handles the happy path well and misses edge cases. Coverage can look fine even when production risk goes up.
Stage 1: Foundation
Define shared prompt templates for test-aware coding. For example: implement a feature based on the spec at /docs/specs/[feature].md and generate unit tests that cover the happy path, null or empty inputs, and the two most likely error conditions. Referencing the spec keeps the prompt short and makes the spec the source of truth.
Add a pre-commit hook that checks for test files when new logic appears. If a developer adds a file with meaningful business logic and no related test file, catch it before CI.
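A minimal sketch of such a hook in Python, assuming a src/-and-tests/ layout where src/pkg/mod.py is covered by tests/pkg/test_mod.py; the naming convention and paths are assumptions, not a prescription from the article:

```python
"""Pre-commit check: new source files should arrive with a companion test."""
import subprocess
from pathlib import Path


def staged_new_files() -> list[str]:
    """New files staged for this commit (git diff filter A = added)."""
    try:
        out = subprocess.run(
            ["git", "diff", "--cached", "--name-only", "--diff-filter=A"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # fail open when not inside a git repository
    return [line for line in out.splitlines() if line]


def expected_test_path(src_file: str) -> Path:
    """Map src/pkg/mod.py to tests/pkg/test_mod.py (assumed convention)."""
    rel = Path(src_file).relative_to("src")
    return Path("tests") / rel.parent / f"test_{rel.name}"


def missing_tests(new_files: list[str]) -> list[str]:
    """Source files among new_files whose expected test file does not exist."""
    return [
        f for f in new_files
        if f.startswith("src/") and f.endswith(".py")
        and not Path(f).name.startswith("test_")
        and not expected_test_path(f).exists()
    ]


def main() -> int:
    missing = missing_tests(staged_new_files())
    for f in missing:
        print(f"No test file found for new source file: {f}")
    return 1 if missing else 0

# Wire this up as a pre-commit hook, e.g.: python scripts/check_new_tests.py
```

The point is not this exact script; it is that the convention is checked by a machine before CI, not remembered by a reviewer.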
Improve test failure output so AI can help fix it. Include which business rule the test validates, which spec it maps to, and what changed since it last passed. That gives developers structured input they can feed back into their coding tools.
Stage 2: Augmentation
Testing becomes one of the clearest examples of AI strengthening the delivery process. CI surfaces test failures alongside AI-generated remediation suggestions tied back to the spec. Coverage gaps in AI-generated code trigger a second pass to generate missing edge-case tests before merge. Flaky tests are classified automatically so infrastructure noise does not drown out actual regressions.
Stage 3: Automation
Test suites become self-healing within defined limits. If AI-generated code changes an interface and the spec confirms the behavior change was intentional, assertions can be updated automatically. Agents generate baseline coverage for every new service. Humans shift from writing every test by hand to reviewing testing strategy and approving coverage standards.
Deployment
The problem now
Release and change management processes have not changed, but the volume of code moving through them has. Release managers review more changes. Approval queues grow. And the documentation tied to those changes often gets worse, because developers who used AI to generate code cannot always explain every implementation detail clearly after the fact.
Stage 1: Foundation
Generate deployment documentation from the spec. The spec already explains what changed, why it changed, what contracts were affected, and how success should be measured. Before opening a PR, developers use that material to produce a plain-language change summary, affected services, rollback plan, and test coverage note.
Update PR templates to capture AI usage and spec references. Two questions go a long way: Was AI used in this change? Which spec does this implement? That gives reviewers context and creates a traceable link between intent, implementation, and approval.
Write deployment runbooks in structured form. Numbered steps, rollback triggers, and explicit commands are more useful now and machine-readable later.
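A sketch of what a structured runbook can look like, in YAML; the field names and values are illustrative, and any consistent schema works:

```yaml
# Deployment runbook (illustrative schema and values)
service: payments-api
change_ref: PR-1234              # hypothetical change identifier
spec: docs/specs/payment-limits.md
steps:
  - id: 1
    action: "kubectl apply -f deploy/payments-api.yaml"
    verify: "p95 latency under 300 ms for 10 minutes"
  - id: 2
    action: "enable feature flag payment_limits_v2 at 5% of traffic"
    verify: "error rate unchanged against baseline"
rollback:
  trigger: "SLO breach or error rate above 1% for 5 minutes"
  command: "kubectl rollout undo deployment/payments-api"
```

Numbered steps and explicit triggers are readable for a human today and parseable for an agent later.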
Stage 2: Augmentation
This is where AI starts improving the release process instead of just feeding more work into it. Deployment documentation is generated automatically from PR metadata and linked specs, then reviewed by humans instead of written from scratch. Pipeline gates validate against policy-as-code rules mapped to the standards in the memory bank. AI-generated risk ratings help reviewers focus on blast radius and service impact rather than reconstructing change intent manually.
Stage 3: Automation
Low-risk services deploy automatically within approved policy bounds. Medium-risk services use automated canary validation and roll back on SLO breaches. High-risk changes still require human sign-off, but that sign-off focuses on judgment, not mechanical process. Humans own policy. Agents handle volume.
Security
The problem now
One AppSec engineer supporting 50 developers was already a bottleneck. Give those developers AI tools that let them produce more code faster, and the finding volume rises with it. Worse, models repeat common patterns from training data, including insecure ones. AppSec teams end up triaging the same classes of issues again and again.
Stage 1: Foundation
Deploy IDE-integrated SAST tied to internal standards. Tools such as Semgrep, Snyk, or SonarLint are most useful when configured against rules derived from the repository’s own security documents: approved cipher suites, parameterized query requirements, PII handling expectations, and forbidden patterns. If AI generates something unsafe, the developer should see it in the editor immediately, with a link back to the relevant standard.
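For example, a Semgrep rule derived from an internal "parameterized queries only" standard might look like the following; the rule id, message, and document path are illustrative:

```yaml
rules:
  - id: sql-string-concatenation
    languages: [python]
    severity: ERROR
    message: >
      Build SQL with parameterized queries, not string concatenation.
      See docs/memory-bank/security-standards.md.
    patterns:
      - pattern-either:
          - pattern: $CURSOR.execute("..." + $X)
          - pattern: $CURSOR.execute(f"...")
```

The message links the finding back to the same standard the AI was given as context, so detection and prevention draw from one source.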
Add a secrets-scanning pre-commit hook. This is fast to implement and removes an entire category of preventable incidents.
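With the pre-commit framework, a secrets scanner such as gitleaks can be wired in with a few lines; pin rev to a current release for your environment:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0   # pin to a current release
    hooks:
      - id: gitleaks
```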
Keep security standards short and linkable. The same document that gives AI context during code generation should also help developers understand and fix issues when a scanner fires.
Stage 2: Augmentation
Security is another place where augmentation should be obvious. Findings appear at the PR level with AI-generated fix suggestions attached, tied both to the vulnerability type and the internal rule it violates. Findings from IDE tools, pre-commit hooks, and CI scans are correlated into one view instead of creating duplicate triage work. AppSec teams spend less time reviewing repeated mistakes and more time improving the rules that prevent them.
This also matters for third-party risk. Many banks adopt AI through vendors, whether that is a coding assistant, model provider, or cloud platform. The harness gives the institution clearer boundaries around access, review, logging, and evidence retention, which makes vendor risk easier to manage in practice.
Stage 3: Automation
Well-understood vulnerability classes below a defined severity threshold can move into automated remediation loops. The security standards document becomes a live policy layer that agents check before generating code, not just after a scanner catches an issue. Runtime signals from DAST and production environments feed back into the knowledge base so the system gets smarter over time.
Compliance
The problem now
Developers are making real technical decisions inside AI chat sessions: architectural tradeoffs, security choices, implementation approaches. Almost none of that reasoning is captured. So when an auditor asks why something was implemented a certain way, the answer may only exist in a tool window that was closed weeks ago.
Meanwhile, compliance evidence is still assembled after the fact. That gap between what was built and what the documentation says was built is a recurring source of audit pain.
Stage 1: Foundation
Treat the spec as the decision log. A spec written before implementation already records intent, constraints, and chosen approach. Commit it alongside the code, and you no longer need to reconstruct reasoning later. For broader architectural decisions, use a short ADR in /docs/architecture/.
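A lightweight ADR in the common Context/Decision/Consequences form is enough; the example below is invented for illustration:

```markdown
# ADR-0007: Use the outbox pattern for audit events

Status: Accepted
Spec: docs/specs/audit-events.md

## Context
Audit events must not be lost even if the message broker is down,
and examiners expect a complete trail.

## Decision
Write audit events to an outbox table in the same transaction as the
business change; a relay process publishes them to the broker.

## Consequences
+ No lost events; evidence is reconstructible from the database.
- One more table and one more relay process to operate.
```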
That traceability matters even more when AI affects consumer-facing decisions. CFPB guidance makes clear that lenders using AI or other complex models still need to provide specific and accurate reasons for adverse actions. If a team cannot reconstruct why a system behaved the way it did, it is not ready for regulated decisioning.
Use AI to draft compliance artifacts from the spec and change metadata. Change tickets, data-flow descriptions, impact assessments, and risk summaries can all start from information that already exists. AI drafts them. Humans review and approve them.
Enforce structured metadata in PR titles and descriptions. A title that maps a change to a control identifier creates a traceable link between implementation and compliance requirements. When those control mappings live in the memory bank, developers are more likely to keep them in view throughout delivery.
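A sketch of enforcing that convention in CI, assuming titles like `SOX-404: tighten approval flow` where the prefix is an internal control identifier; the pattern itself is an assumption to adapt:

```python
import re

# Hypothetical convention: PR titles start with a control identifier,
# e.g. "SOX-404: tighten approval flow". Adjust the pattern to your controls.
CONTROL_PREFIX = re.compile(r"^[A-Z]{2,10}-\d+: .+")


def has_control_mapping(pr_title: str) -> bool:
    """Return True if the PR title carries a control identifier prefix."""
    return bool(CONTROL_PREFIX.match(pr_title))
```

A CI step that fails the build when `has_control_mapping` returns False turns the convention into a gate instead of a request.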
Stage 2: Augmentation
Compliance work becomes less retrospective and less manual. Artifacts are generated automatically from the linked spec and PR metadata, then reviewed rather than authored. Control identifiers map directly into a live evidence matrix. Drift between documented architecture and deployed reality becomes visible through pipeline data rather than during audit preparation.
Stage 3: Automation
Every pipeline action emits a structured compliance event when it happens. Reporting is generated from that event stream. Pre-audit evidence packages are assembled automatically and reviewed by humans before submission. The question “Why was this implemented this way?” now has a default answer, traceable from event to PR to spec.
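A sketch of what "emitting a structured compliance event" could mean in practice; every field name here is illustrative, and a real schema would be owned by compliance:

```python
import json
from datetime import datetime, timezone


def compliance_event(action: str, pr: str, spec: str, actor: str) -> str:
    """Build one structured compliance event as a JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,   # e.g. "deploy", "test-pass", "approval"
        "pr": pr,           # traceable back to the change
        "spec": spec,       # ...and from the change to recorded intent
        "actor": actor,     # human or agent identity
    }
    return json.dumps(event)
```

Appending these lines to an event stream at the moment each pipeline action happens is what makes reporting a query rather than a reconstruction.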
Optimization
The problem now
Optimization is still mostly reactive. Something breaks or slows down, someone investigates, a fix gets made, and the learning often disappears into Slack threads or individual memory.
AI-generated code adds a specific risk here: it is often correct enough to pass but inefficient enough to hurt later. A model may generate a data-access pattern that works functionally but causes N+1 queries or unnecessary compute because it was never given the performance constraints.
Stage 1: Foundation
Put performance profiles in the memory bank. For each service, capture expected data volumes, latency SLOs, known bottlenecks, and patterns that have caused incidents before. When a developer prompts an AI tool to build in that service, that context should come with it.
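A performance profile in the memory bank can be very small; the service name, numbers, and patterns below are invented for illustration:

```yaml
# docs/memory-bank/performance-profile.md (excerpt, illustrative values)
service: statement-service
expected_volumes:
  statements_per_day: 2000000
  peak_rps: 450
slos:
  p95_latency_ms: 300
  error_rate_pct: 0.1
known_bottlenecks:
  - "statement rendering is CPU-bound; batch it, do not fan out per page"
anti_patterns:
  - "per-row queries against the transactions table (N+1); use set-based reads"
```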
Use retrospectives to improve the harness, not just document lessons learned. After an incident, ask what rule, test, checklist item, or prompt template would have caught the issue earlier. Then add that back into the system. Over time, the knowledge base starts reflecting organizational failure patterns instead of generic best practice.
Add AI-specific checks to the definition of done. Was the change generated with the right performance context? Does the PR link to a spec? Was secrets scanning run locally? Is there a clear decision record for architectural choices? This is how the knowledge base becomes a working habit rather than shelfware.
Stage 2: Augmentation
Performance work becomes more proactive. Profiles are automatically included in AI context for the relevant service. Retrospectives generate candidate updates to standards, templates, and linter rules for human approval. Cost anomalies trigger structured investigations, with AI summarizing likely contributing changes from recent deployment history and connecting them back to the performance profile.
Stage 3: Automation
Observability data such as latency, error rates, and infrastructure cost becomes part of agent context alongside the repository knowledge base. Scheduled cleanup agents identify stale specs, deprecated feature flags, orphaned infrastructure, and outdated documentation, then either remediate directly or route items for review. The system stays current because drift is actively managed.
The Payoff
The point here is not to rush toward autonomous agents. It is to build the progression that makes them safe and useful when the time comes.
Foundation creates structure.
Augmentation reduces human bottlenecks.
Automation becomes possible because the first two stages made the system legible.
That is why the knowledge base matters so much. Every domain improves when AI has better context.
Better context produces better tests.
Better standards produce better security scanning.
Better mappings produce better audit trails.
Better performance profiles produce less inefficient code.
Everything else is downstream of that.
Teams that wait for tooling to mature before adding structure will have a harder transition later. Retrofitting context into a codebase is always more painful than building it in early. Organizations will struggle with automation not because they lack access to advanced models, but because they have nothing reliable for those models to reason against.
The good news is that this does not start with a major platform purchase or a specialist AI team.
Foundation is within reach of almost any DevOps organization. Start with the knowledge base. Three documents are enough: an architecture overview, a security standard, and a performance profile. Keep them short. Commit them. Then pick the domain where pain is most visible and start augmenting the workflow around it.
The harness compounds.
Build it before you need it.