AI-Engineering: How We Built a 750k-Line Framework Entirely With AI Agents

We did not write a single line of code by hand. 750,000+ lines later, here is the engineering discipline - spec-driven development, the AI harness, fork-free design - that lets three people ship at the pace of twenty.

Piotr Karwatka

June 5, 2026

When we started Open Mercato in September, we made one rule that sounded reckless at the time: we would not write a single line of code by hand. Not one. Every line that ships in the framework is generated by an AI coding agent. Eight months and 750,000+ lines of code later, that rule still holds - and the quality has not deteriorated the way most people assume it would.

This is not a story about prompting tricks or a clever wrapper around a chatbot. It is a story about what happens when you treat AI-assisted development as an engineering discipline instead of a party trick. We call the result an AI-Engineering Foundation Framework, and "built with AI, designed for AI" is not a marketing line - it is a description of how the codebase is physically organized.

I have been building enterprise software for over 20 years. What follows is the honest version of what we learned: what works, what we abandoned, and why the most valuable part of our project is the part you cannot see in the product.

The bet we placed

Open Mercato is a framework for building business applications. Our promise is "start with 80% done." Most frameworks - think Django - give you the foundation: modules, routing, design patterns, database migrations, encryption, an admin to manage your data, an API layer. Open Mercato has all of that, and you can absolutely run it empty and build from zero like you would on Django.

But almost nobody does. People choose Open Mercato for the other 80%: a CRM, a product catalog, a WMS (currently in build), resource management, a workflow engine. Real business modules, ready to run.

The bet underneath all of this is about the moment we are living in. For years, teams reached for SaaS products - a CRM like HubSpot, some tool for tickets, another for inventory. Those tools were almost always slightly wrong for the business. They had features you never used but paid for anyway, and they could not bend to the 20% that actually made your operation different. Our wager is that in the age of AI, the calculus flips. You can have all the features a mature SaaS gives you, and because of how Open Mercato is built, AI can bend the last 20% to fit your business instead of forcing your business to fit the software.

That bet only pays off if AI can reliably extend the system without breaking it. Which is the entire reason the engineering discipline exists.

Three products hiding in one repository

Very early on it became obvious that to use AI to the maximum, we were not building one product. We were building three.

The first product is Open Mercato itself - the business application. We started with the CRM. These are the concrete, useful features a company actually runs on.

The second product is the framework. From literally day zero we did not just build a nice CRM - we built it so the CRM reused mechanisms that belong in a framework: an event system, database migrations, design patterns like Command, and a long list of reusable primitives. The discipline here is that no module is allowed to depend on another module directly. One module cannot reach into another and pull things out of it. That constraint is enforced, and it is what makes the whole thing composable later.
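A rule like "no module imports another module" can be checked mechanically. The following is a minimal sketch of such a check, not Open Mercato's actual enforcement: the `modules/<name>/` layout, the file paths, and the `findCrossModuleImports` helper are all assumptions made for illustration, and the import edges are supplied by hand rather than parsed from source.

```typescript
// Hypothetical layout: every business module lives under modules/<name>/.
// Illustration only; not the project's real enforcement mechanism.
interface ImportEdge {
  fromFile: string; // file that contains the import statement
  toFile: string;   // resolved path of the imported file
}

// Returns the module name for a path such as "modules/crm/api/list.ts",
// or null when the file sits outside modules/ (shared framework code).
function moduleOf(path: string): string | null {
  const match = /^modules\/([^/]+)\//.exec(path);
  return match?.[1] ?? null;
}

// An edge is a violation when both ends are inside modules and the modules differ.
function findCrossModuleImports(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter((edge) => {
    const from = moduleOf(edge.fromFile);
    const to = moduleOf(edge.toFile);
    return from !== null && to !== null && from !== to;
  });
}

const violations = findCrossModuleImports([
  { fromFile: "modules/crm/api/list.ts", toFile: "modules/crm/data/entities.ts" },
  { fromFile: "modules/crm/api/list.ts", toFile: "framework/events.ts" },
  { fromFile: "modules/tickets/ui/card.tsx", toFile: "modules/crm/data/entities.ts" },
]);

for (const v of violations) {
  console.log(`cross-module import: ${v.fromFile} -> ${v.toFile}`);
}
```

Only the third edge is reported: imports inside one module and imports of shared framework code pass, while the hypothetical tickets module reaching into the CRM does not.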

The third product is the most interesting, and it is genuinely new: the AI harness. This is a large, deliberately structured set of AI skills, an AGENTS.md graph, and a CLI infrastructure that together tell coding agents how to work inside our codebase. I now spend most of my time on this layer, optimizing it and pushing developer experience as high as it will go. If the framework is the body, the harness is the nervous system that lets an AI operate the body without hurting it.

Most teams using AI to code are building only the first product and wondering why quality erodes over time. The harness is the answer to that erosion.

What a "harness" actually is

AGENTS.md is shorthand. It is not one file - it is a whole structure, a graph of AGENTS.md files, plus the guarantees that coding agents actually follow that graph the way we intend. On top of it sits a CLI layer: you can scaffold a new project, generate a migration, run the standard checks, all through commands the agent is expected to use.
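To make "a graph of AGENTS.md files" concrete, here is a small sketch of one way an agent could decide which files apply to the code it is about to edit. It assumes guidance is inherited by directory nesting, broadest file first. That rule, the `applicableAgentFiles` helper, and the example paths are illustrative assumptions, not a description of how the Open Mercato graph is actually wired.

```typescript
// Assumption: an AGENTS.md file applies to everything below its own directory,
// and broader guidance is read before more specific guidance.
function directoryOf(path: string): string {
  return path.split("/").slice(0, -1).join("/");
}

function applicableAgentFiles(agentFiles: string[], targetFile: string): string[] {
  const targetDir = directoryOf(targetFile);
  return agentFiles
    .filter((file) => {
      const dir = directoryOf(file);
      // The root file (dir === "") applies everywhere.
      return dir === "" || targetDir === dir || targetDir.startsWith(dir + "/");
    })
    .sort((a, b) => a.split("/").length - b.split("/").length);
}

// Hypothetical repository layout.
const agentFiles = [
  "packages/ui/AGENTS.md",
  "packages/core/modules/crm/AGENTS.md",
  "AGENTS.md",
  "packages/core/AGENTS.md",
];

const chain = applicableAgentFiles(agentFiles, "packages/core/modules/crm/api/people.ts");
console.log(chain);
```

For the hypothetical CRM file, the chain is the root file, then `packages/core/AGENTS.md`, then the CRM module's own file. The UI package's guidance is left out because the target is not under that directory.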

The point of the harness is repeatability. When you issue a command - say, "build me a module for managing service tickets" - we want the same process to run every time, regardless of who is asking or which model is answering. Break that request into its atomic pieces and a disciplined sequence appears:

First you write a spec. We have a skill for that. Then you plan - a skill that splits the spec into tasks, into atomic commits. Then you implement. Each stage is engineered to be repeatable, and each stage carries our standards inside it. Spec planning includes risk analysis and backward-compatibility analysis as first-class steps, not afterthoughts. Implementation always runs unit tests, always writes unit tests, and always runs integration tests that click through the app automatically. Something that touches security gets more security tests. The requirements differ by feature, but the shape of the process does not.
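The "same shape, different requirements" idea can be written down as data. This sketch is an illustration only: the `FeatureRequest` and `StagePlan` types and the step names are hypothetical, and placing risk and backward-compatibility analysis in the spec stage is one reading of the paragraph above.

```typescript
type Stage = "spec" | "plan" | "implement";

interface FeatureRequest {
  title: string;
  touchesSecurity: boolean;
}

interface StagePlan {
  stage: Stage;
  steps: string[];
}

// The stage sequence is fixed; only the steps inside a stage vary with the request.
function buildPipeline(request: FeatureRequest): StagePlan[] {
  const implementSteps = ["write unit tests", "run unit tests", "run integration tests"];
  if (request.touchesSecurity) {
    implementSteps.push("run additional security tests");
  }
  return [
    { stage: "spec", steps: ["write spec", "risk analysis", "backward-compatibility analysis"] },
    { stage: "plan", steps: ["split spec into tasks", "map tasks to atomic commits"] },
    { stage: "implement", steps: implementSteps },
  ];
}

const ticketPipeline = buildPipeline({ title: "service tickets", touchesSecurity: false });
const authPipeline = buildPipeline({ title: "API key rotation", touchesSecurity: true });

console.log(ticketPipeline.map((s) => s.stage).join(" -> "));
console.log(authPipeline.map((s) => s.stage).join(" -> "));
```

Both requests print the same stage order. The security-sensitive one simply carries an extra step inside the implement stage.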

That is what we mean by a harness: a guarantee that the entire software-delivery pipeline is the same - or as close to the same as possible - every single time. The payoff is twofold. First, software quality: we have been building this for a while now, it is a serious amount of code, and it does not noticeably rot. That is largely thanks to the harness. Second, a developer who comes to Open Mercato and says "build me a service-ticket module" should get the same result in every new version of Open Mercato. We treat that like benchmarking, and we evaluate it - we have a tool, built by core team member Patryk Lewczuk, that checks exactly this kind of reproducibility. It is not in the public repo yet, but it runs.

The developer flow, end to end

Here is how the work actually happens, and how it evolved. None of this was obvious on day one. It changed as our own understanding grew and as the coding agents themselves got dramatically better.

For a long time I did not use skills at all. I used a very simple AGENTS.md that described the architecture - naming conventions, design patterns, "when you build an API, always remember authorization" - plain, simple rules. And I worked interactively. I would prompt, the agent would produce something, I would test it, hand back what was wrong, it would fix it, I would ask for unit tests. Ping-pong. Interactive ping-pong.

I will be honest about why this is seductive: it feels like playing a game. There is a little hit of dopamine in every round, and it pulls you in completely. But it is not actually an efficient way to work. It is worse than micromanagement, and you end up wasting more time on corrections than you spent on the initial development. The fixing costs more than the building.

Spec-driven development

Spec-driven development turned out to be, at least right now, the best method we have found. There is a line I keep repeating to clients in presentations: you still cannot delegate thinking. That remains true. Spec-driven development pulls the thinking forward, to step zero, where it belongs. You have to write the spec, and writing the spec is where the thinking happens.

There is a quiet advantage to doing this inside our setup. We run a monorepo, so every package, every NPM module, lives in one place. When a coding agent cannot find something in the harness or the docs, it searches the code - and because it has all of the code, it really can find anything. The specs live in the repo too. So when the agent writes a spec inside the repo, it writes it extremely precisely: it uses the real conventions, it links to actual examples, it grounds itself in code that exists.

Refinement: ask the agent what is wrong

The second move that pays off enormously is running a refinement pass on the spec. The agent produces a spec, and then you ask it a deliberately philosophical question: What is missing here? What could go wrong? Self-critique.

Because these agents try hard to follow instructions, when you tell one that something is probably wrong, it will go and find something that is wrong. And the spec improves. Two or three passes of this and the spec is genuinely better than anything I would have written alone.

Planning: split into atomic, verifiable tasks

This is the stage people skip most often, and skipping it is a real mistake, because planning is what produces the best results.

You have a spec. Until recently I would just say "implement this spec," and sometimes, in a second run with a fresh context, I would ask "is this fully implemented?" Fresh context matters - you do not want the agent holding its old context, because then it is far too forgiving of its own work. New context, or even a different model entirely (Claude versus GPT), to verify what is missing. That works well.

But long specs break this. Three thousand lines of spec text is a long spec, and if it includes ASCII art - mockups, diagrams - it gets longer and starts to fall apart. So the move that works best is to tell the agent: split this spec into phases, each phase in its own MD file, so you are not dragging the entire spec into context every time. And split it into atomic commits - tasks - and make each one verifiable. Ideally each task ends with a unit test, and if it touches the UI you test it with Playwright or a browser agent. (We want to move to a browser agent generally - it is far more efficient because it works off the accessibility tree instead of taking screenshots.) The principle is simple: some verification for every task.
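"Some verification for every task" is easy to lint before the loop starts. The sketch below assumes a hypothetical plan format (`SpecTask`, `phaseFile`, `verifications`) and is not the structure our planning skill actually emits. It flags tasks with no verification at all, and UI tasks that lack a Playwright or browser-agent check.

```typescript
type Verification =
  | { kind: "unit-test"; testFile: string }
  | { kind: "ui-test"; tool: "playwright" | "browser-agent"; scenario: string };

// Hypothetical shape of one atomic task in a phase file.
interface SpecTask {
  id: string;
  phaseFile: string; // e.g. the Markdown file of the phase this task belongs to
  summary: string;
  touchesUi: boolean;
  verifications: Verification[];
}

function findPlanProblems(tasks: SpecTask[]): string[] {
  const problems: string[] = [];
  for (const task of tasks) {
    if (task.verifications.length === 0) {
      problems.push(`${task.id}: no verification step`);
    } else if (task.touchesUi && !task.verifications.some((v) => v.kind === "ui-test")) {
      problems.push(`${task.id}: touches the UI but has no UI test`);
    }
  }
  return problems;
}

const planProblems = findPlanProblems([
  {
    id: "T1", phaseFile: "phase-1.md", summary: "add ticket entity", touchesUi: false,
    verifications: [{ kind: "unit-test", testFile: "ticket.entity.test.ts" }],
  },
  { id: "T2", phaseFile: "phase-1.md", summary: "add list endpoint", touchesUi: false, verifications: [] },
  {
    id: "T3", phaseFile: "phase-2.md", summary: "add ticket list page", touchesUi: true,
    verifications: [{ kind: "unit-test", testFile: "ticket.list.test.ts" }],
  },
]);

console.log(planProblems);
```

Here T1 passes, T2 is rejected because it has no verification, and T3 is rejected because a UI change is covered only by a unit test.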

If you spend the time to plan at that granular level, you can then run the whole thing in a loop.

The loop, and the discipline of not looking

You can build the loop in different ways. I generally run it inside the agent itself: in Claude, for instance, you can say "you are a task coordinator, spawn subagents for tasks that do not collide, or just run them in order until you are done." My longest single loop ran for four days. That was a large framework component getting implemented end to end.
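In practice the coordinator is a prompt, not a program, but the scheduling rule it follows can be sketched in code. This example assumes that "collide" means two tasks touching the same file, which is a simplification. `LoopTask` and `batchNonCollidingTasks` are hypothetical names, and a real coordinator would hand each batch to subagents instead of printing it.

```typescript
interface LoopTask {
  id: string;
  files: string[]; // files the task is expected to modify
}

// Walk the plan in order. A task joins the current batch unless it touches a
// file already claimed by that batch; then the batch is closed and a new one starts.
// Tasks inside a batch could run in parallel; batches run one after another.
function batchNonCollidingTasks(tasks: LoopTask[]): LoopTask[][] {
  const batches: LoopTask[][] = [];
  let current: LoopTask[] = [];
  let claimed = new Set<string>();

  for (const task of tasks) {
    if (task.files.some((file) => claimed.has(file))) {
      batches.push(current);
      current = [];
      claimed = new Set<string>();
    }
    current.push(task);
    for (const file of task.files) claimed.add(file);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const batches = batchNonCollidingTasks([
  { id: "T1", files: ["entities.ts"] },
  { id: "T2", files: ["events.ts"] },
  { id: "T3", files: ["entities.ts", "api.ts"] },
  { id: "T4", files: ["page.tsx"] },
]);

console.log(batches.map((batch) => batch.map((t) => t.id)));
// prints [ [ 'T1', 'T2' ], [ 'T3', 'T4' ] ]
```

Because a colliding task always lands in a later batch than the task it collides with, the original plan order is preserved for any two tasks that share a file.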

Here is the hardest part, and it is a discipline more than a technique. When a coding loop is running, the worst thing you can do is look. In the middle of the run the agent usually does something stupid. You watch it and think, why are you doing that? And the instinct is to jump in: "hey, do not do that, there is no bug there, do not look over there." But the moment you interrupt, you break its train of thought and it gets lost. The strange truth is that the agent often does foolish-looking things along the way and still arrives at a good result. So you have to let it cook.

I will not pretend this is easy for everyone. My own natural style was always iterative - build an MVP, fix it, build the next version - and the urge to peek mid-run is strong. But in a loop, peeking is exactly what poisons the outcome.

Why we deliberately test on "dumber" models

Most people run their newest, most expensive model and call it a day. On a top-tier model, almost everything works - the model is so good that when something is off, it quietly fixes it and moves on. That is convenient, and it hides problems.

The real craft is optimizing the harness for cheaper, smaller, "dumber" (in quotes - they are not actually dumb) models. If it works on those, it works on the expensive ones for certain, and it is far more efficient. I have been testing Open Mercato on genuinely small models - GPT-OSS-20B, for example, which runs on a modest laptop with 32GB of RAM. A model like that does not reason broadly. It reads the harness very literally, exactly as written.

That literalness is gold for testing. A small model exposed a pile of bugs for us. The CLI syntax for generating a database migration had changed, and a strong model never flinched - it would run the wrong command, notice, search, and self-correct so fast that nobody saw the gap. The small model just hung on it. The expensive model's competence was masking a real defect in our harness. The cheap model could not paper over it, so it surfaced the truth. We are also starting - early days - to optimize the harness for open-source coding models like Qwen and Google's Gemma, which are getting remarkably good.
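The migration-command bug is the kind of drift a dumb, literal check can also catch. The sketch below scans harness text for inline commands and compares them with the subcommands a CLI currently supports. The `examplecli` binary, its subcommands, and the sample text are invented for illustration. They are not the real Open Mercato CLI or harness.

```typescript
// Collect the subcommand of every inline-code command that starts with `binary`.
function extractSubcommands(doc: string, binary: string): string[] {
  const found: string[] = [];
  const inlineCode = /`([^`]+)`/g;
  let match: RegExpExecArray | null;
  while ((match = inlineCode.exec(doc)) !== null) {
    const [name, subcommand] = (match[1] ?? "").trim().split(/\s+/);
    if (name === binary && subcommand !== undefined) {
      found.push(subcommand);
    }
  }
  return found;
}

// Subcommands the docs mention but the CLI no longer (or never) supported.
function findStaleCommands(doc: string, binary: string, supported: Set<string>): string[] {
  return extractSubcommands(doc, binary).filter((cmd) => !supported.has(cmd));
}

// Hypothetical harness excerpt and hypothetical CLI surface.
const harnessText = [
  "Run `examplecli db:generate` to create a migration.",
  "Apply it with `examplecli db:migrate --yes`.",
  "Use `npm test` before committing.",
].join("\n");

const supportedSubcommands = new Set(["db:new-migration", "db:migrate"]);
const stale = findStaleCommands(harnessText, "examplecli", supportedSubcommands);

console.log(stale); // prints [ 'db:generate' ]
```

A frontier model would route around the stale `db:generate` instruction on its own. A check like this, or a small model that reads the harness literally, makes the gap visible.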

The lesson generalizes: if your AI workflow only works on frontier models, you do not have a robust workflow. You have a model compensating for a workflow that is not actually there.

A tech stack chosen for what LLMs know

We picked our stack for one reason above all others: so agents would be good at it. The key criterion was popularity. The more mainstream the technology, the more training data behind it, the more reliably an agent can write it.

So the frontend is React, with Next.js as the framework. I know there are "better," more enterprise-grade frameworks out there, but Next.js is the absolute standard, so Next.js it is. Postgres for the database. Redis. Meilisearch for search. Docker for containers. There are two deliberate exceptions to the "most default option" rule: MikroORM instead of Prisma, and Awilix for dependency injection. They are not the trendiest defaults, but they are well established - Medusa.js uses them, and since we are investors in Medusa.js and I am close to that codebase and admire how it is built, some of that influence shows up here.

Two other things mattered as much as popularity: coherence and a one-command setup. Coherence first: we did not want a Python backend, a separate Next.js frontend server, and a Go service somewhere else. The moment you split languages, standing the project up gets harder and you immediately hit the skills question - "I know React but I have never touched Python." (For the record, that is no longer much of a barrier with AI, but the simplicity still matters.) Then the one-command setup: I care enormously that a single command gives you a working project. Everything in our stack is chosen to keep that true.

People always ask about scaling, especially the bigger clients, and Next.js raises the question. There are several answers. Because modules cannot depend on each other directly, you can split Open Mercato into separate apps and scale them independently - run the products module on one server, another module elsewhere. Postgres scales the usual ways (sharding, primary-replica replication), Redis gives us queues, and none of this is exotic. What people see in the repo is the admin panel, but you are not stuck with it: you can build a frontend app with Open Mercato as the admin, or go fully headless and let a completely separate frontend talk to the backend over the API - every single data method is exposed there. We even have a recent YouTube tutorial showing a landing page where leads drop straight into the Open Mercato CRM.

Fork-free by design

If you have ever worked with Magento, Django, or Sylius, you know the oldest trap in open source: someone forks the core, makes their changes on top, and loses upgradeability forever. I spent a large part of my career in Magento, and the rule was always the same - do not modify the core. Break that rule and updates become a nightmare.

So we put heavy emphasis on extensibility, with a lot of mechanisms to avoid ever touching the core. You can override routes - entire URLs - from your own modules, though that is the last resort because you do start to lose some upgradeability there. Before you get to that, there is an injection system: you can inject your own React components in place of others. Your CRM customer card does not look the way you want? Drop a component from your own module into that slot. You can extend the database with custom fields - not a code change at all, more like a dictionary, with real support for different data types. There are events. There are many such mechanisms.
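The component-injection idea boils down to a named slot with a default that another module can replace without editing the core. Here is a framework-free sketch of that pattern. Renderers return strings instead of React elements so the example stays self-contained, and the `SlotRegistry` class and the `crm.customer.card` slot name are hypothetical, not Open Mercato's real injection API.

```typescript
type Renderer<P> = (props: P) => string;

// Core code registers defaults; extension modules register overrides.
// Rendering prefers the override, so the core file is never edited.
class SlotRegistry<P> {
  private defaults = new Map<string, Renderer<P>>();
  private overrides = new Map<string, Renderer<P>>();

  registerDefault(slot: string, renderer: Renderer<P>): void {
    this.defaults.set(slot, renderer);
  }

  override(slot: string, renderer: Renderer<P>): void {
    this.overrides.set(slot, renderer);
  }

  render(slot: string, props: P): string {
    const renderer = this.overrides.get(slot) ?? this.defaults.get(slot);
    if (!renderer) throw new Error(`Unknown slot: ${slot}`);
    return renderer(props);
  }
}

interface Customer {
  name: string;
  tier: string;
}

const registry = new SlotRegistry<Customer>();
const acme: Customer = { name: "Acme", tier: "gold" };

// "Core" default card.
registry.registerDefault("crm.customer.card", (c) => `Customer: ${c.name}`);
const defaultCard = registry.render("crm.customer.card", acme);

// A custom module drops its own card into the same slot.
registry.override("crm.customer.card", (c) => `${c.name} [${c.tier}]`);
const customCard = registry.render("crm.customer.card", acme);

console.log(defaultCard); // prints "Customer: Acme"
console.log(customCard);  // prints "Acme [gold]"
```

The design point is that the override lives entirely in the extending module. Removing that module restores the default, and a core upgrade can change the default without conflicting with the customization.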

And the harness is built to respect all of this. When you scaffold a standalone app - create-mercato-app <name> gives you an almost empty repo with one TSX file and a package.json that pulls all of Open Mercato in as an NPM package - the AGENTS.md harness is deliberately different from the monorepo version. In the monorepo, the harness is about where everything lives in the core. In standalone mode, it is about the extension mechanisms: here is the event system, here is how you override things. So when you ask the standalone agent to build something, it tries hard not to modify the core and instead reaches for those mechanisms to get the result. One more principle sits underneath: you can use our features, but you never have to. If our CRM genuinely does not fit - which would be unusual, it is fairly general - you can ignore it and build your own, while still keeping everything else.

I tested this myself, deliberately trying to use Open Mercato in my own company without adopting our CRM, just pulling a few pieces of our business logic in. It worked - and it also showed me exactly where the standalone harness still needs work, specifically when the agent does not have access to the full monorepo to fall back on. That is the part I am still actively improving.

The economics: what this actually replaces

The obvious question is how far three people can get this way. We are three full-time - me, Tomek, Patryk Lewczuk, and Maciek Greń on the build - plus a core contributor team of another five or six of the most engaged open-source contributors, then a wider circle. We have 25 partners whose employees contribute, and on GitHub I have counted over 100 contributors with at least one contribution each. A typical release sees around 17 contributors, which is not nothing, and coordinating all of that is its own challenge - but we have it reasonably well in hand.

I ran the reproduction-cost exercise out of curiosity (using ChatGPT, which of course tells me the truth and then compliments me at the end). At our current scale it came out to millions of zloty in equivalent hourly billing. Translate that into people and it is something like 20 developers working for a year - call it two to four teams for a year depending on seniority. And that estimate undercounts the overhead: at that headcount you pay an enormous tax on communication, coordination, and simply bootstrapping the thing. Even to do it the traditional way you would still need a core team for the modules, the framework, all of it.

I do not actually know whether those are good numbers or not. But they are real, and they were produced by three full-time people and an open-source community, not by 40 hires.

What I would tell you to take away

If there is a single idea worth carrying out of all this, it is that AI-assisted development becomes serious the moment you stop treating the model as the product and start treating the process as the product. The model will keep getting better on its own. Your leverage is in the harness around it: the repeatable spec-plan-implement pipeline, the verification baked into every task, the discipline to plan granularly and then not interrupt the loop, the willingness to test on a weak model precisely because it refuses to hide your mistakes.

You still cannot delegate thinking. But you can build a system that makes sure the thinking you did at step zero is faithfully carried all the way to production - by an agent, every time, without the quality quietly rotting underneath you. That system is the real product. The CRM is just the proof that it works.

Open Mercato is open source. The harness, the skills, and the CLI are all in the repo - go and play with them.