Nextside Blog

Pricing a client's tool: margin isn't the cheap dev

Pablo Winter — Mon, 22 Jun 2026 09:00:00 -0300

TL;DR

Every week I see the same scene. The agency closed the strategy, the client trusted them, and then comes the ask: a real tool on the site. A pricing calculator, a configurator, an assistant that answers on its own. The agency says yes, sends a price, and six weeks later is paying out of its own pocket to finish. The mistake wasn’t charging too much. It was charging in the dark.

Let me be blunt: the margin on a tool you resell doesn’t come from finding the cheapest dev. It comes from reselling a predictable delivery that doesn’t come back. The cheap one charges the difference later, in rework and in your client’s trust.

The numbers back it up. 78% of agencies rarely or never charge when scope grows mid-project (Ignition, 2025). The money leaks out before the supplier even vanishes. And when the supplier does vanish, you’re the one left explaining it to the client.

The discount you squeeze today comes back as rework tomorrow

The reflex is always the same: to protect margin, the agency hunts for the cheapest dev. It makes sense on the spreadsheet and breaks on delivery. Because the dev’s price is only one part of the bill. The rest, the part that isn’t in the proposal, is what decides whether any margin survived.

(If your team froze before you even got to the price, I wrote about that exact moment here.)

Think about what cheap doesn’t include. The rework when the first version comes back crooked. The extra three weeks that push your delivery and make the client call asking. The tool that goes down on a big-campaign Monday, and the brand taking the heat is YOURS, not the freelancer’s. The freelancer charged half. You paid double, just in installments of pain.

Cheap isn’t a price. It’s an expiration date.

“But if I pay more for the dev, my margin disappears.”

No. Your margin disappears when the work comes back. If you pay 30% more for someone who ships on time, with fixed scope, and who doesn’t leave you stranded in front of the client, you didn’t spend more. You bought predictability. And predictability is the only thing you can resell with peace of mind.

The real expense is never the dev’s hourly rate. It’s the opportunity cost of the delayed project, the cost of redoing what came out wrong, and the cost of a client who stopped trusting you. None of those three show up on any quote. But they all come out of your pocket.

The bill nobody opens: what makes a tool’s price

Here’s the structural mistake I see all the time: the agency prices a tool with a website ruler. Per page, per design hour, per screen. But a tool isn’t a website with a button. What costs money in a tool is what happens after the user clicks.

A website shows. A tool does. And “doing” is where the cost lives:

The business logic. The rule that calculates, decides, validates. It’s invisible to the client and it’s the bulk of the work.
The integration. A tool that talks to the client’s inventory, CRM or payments costs far more than one that lives alone.
What holds it up afterward. Login, user data, security. The day the tool stores people’s information, the game moves up a tier.
The scale. Handling ten people is easy. Keeping the tool up when the campaign goes viral is another story.

To give you orders of magnitude, without faking a precision that doesn’t exist. These are market ranges and they vary a lot by scope:

Type of tool	What weighs most on price	Order of magnitude (market)
Lead calculator or quiz	simple logic, almost no integration	the lowest: low thousands
Product configurator	business rules + catalog + visual	mid
On-site assistant or automation	integration with the client’s systems	mid-high
Tool with login and user data	account, security, scale	the highest

The point isn’t to memorize the table. It’s to understand that two things both called “a tool” can differ tenfold in cost. Whoever prices them the same loses money on one and loses the client on the other.

How do you price a tool without knowing what the dev will cost?

You don’t guess. You ask for a fixed range before you close with the client. A good technical partner hands you the scope and a fixed price before starting, not an estimate that doubles halfway through. With the range in hand, you mark it up with your margin and close knowing exactly what’s left.

It’s the difference between an open budget and fixed scope. An open budget is an invitation to surprise: it starts at X, becomes 2X, and you’re the one who absorbs it. Fixed scope is the opposite. You know the bill before you sign, your client knows what’s in and what’s out before paying. Both sides sleep well.
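The arithmetic of marking up a fixed quote can be sketched in a few lines. This is a minimal illustration, not a pricing method: the function names and the $10,000 quote with a 50% markup are hypothetical, chosen only to show that with a fixed partner price the margin is known before you sign.

```python
def client_price(partner_fixed_price: float, markup: float) -> float:
    """Price to the client: the partner's fixed price plus the agency markup.

    `markup` is a fraction over cost (0.5 means 50% over the partner's price).
    """
    if partner_fixed_price < 0 or markup < 0:
        raise ValueError("price and markup must be non-negative")
    return partner_fixed_price * (1 + markup)


def margin_left(price_to_client: float, partner_fixed_price: float) -> float:
    """What stays with the agency when the partner's price does not move."""
    return price_to_client - partner_fixed_price


# Hypothetical numbers: a $10,000 fixed quote resold at a 50% markup.
quote = 10_000
price = client_price(quote, 0.5)
print(f"price to client: {price:.0f}, margin: {margin_left(price, quote):.0f}")
```

With an open budget, `partner_fixed_price` is not a constant, so `margin_left` cannot be computed before the contract is signed. That is the whole difference.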

The real margin: you resell predictability, not dev hours

Now the turn. You’re not reselling development. You’re reselling the outcome, with your brand up front, without your client ever knowing who wrote the code. This is the white-label model, where the supplier delivers under your brand and stays invisible. And your margin isn’t born from the discount. It’s born from predictability.

The resale markup in white-label usually lands between 40% and 70% over the cost of delivery (CloudCampaign). Sounds like a lot until you notice the condition almost nobody says out loud: that markup only survives if the delivery doesn’t come back. Every project that returns with a bug, every blown deadline, every supplier who vanishes eats that whole margin. The markup isn’t guaranteed profit. It’s profit conditioned on a delivery that works.

Run the numbers on a case I see repeat. The agency closes a configurator for the client at $20,000.

Path A. Finds a freelancer for $5,000 and celebrates the fat margin. The freelancer slips, the first version doesn’t match the catalog, rework kicks in, and the launch that was meant for April moves to June. The $15,000 margin turned into a reputation loss.
Path B. Closes with a predictable partner for $10,000, locked scope, on-time delivery. The margin on paper is smaller, $10,000, but it’s the margin that actually LANDS. And the client, happy, comes back with the next project.
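To put the two paths side by side, here is a rough sketch of the margin that actually lands. The $20,000, $5,000 and $10,000 figures come from the case above; the rework bill is an assumption added for illustration, and the function name is hypothetical.

```python
def realized_margin(client_price: float, dev_cost: float, rework_cost: float = 0.0) -> float:
    """Margin that actually lands: client price minus everything delivery cost."""
    return client_price - dev_cost - rework_cost


CLIENT_PRICE = 20_000     # what the agency charges the client
PATH_A_DEV_COST = 5_000   # cheap freelancer
PATH_B_DEV_COST = 10_000  # predictable partner, locked scope

path_a_on_paper = realized_margin(CLIENT_PRICE, PATH_A_DEV_COST)
path_b = realized_margin(CLIENT_PRICE, PATH_B_DEV_COST)

# Rework spend at which Path A stops beating Path B.
break_even_rework = path_a_on_paper - path_b

# ASSUMPTION: an illustrative rework bill, not a figure from this post.
ASSUMED_REWORK = 8_000
path_a_realized = realized_margin(CLIENT_PRICE, PATH_A_DEV_COST, ASSUMED_REWORK)

print(f"Path A on paper:  {path_a_on_paper}")
print(f"Path B:           {path_b}")
print(f"Break-even rework: {break_even_rework}")
print(f"Path A after assumed rework: {path_a_realized}")
```

The break-even line is the useful one: once rework on the cheap path passes the price gap between the two suppliers, the fat margin is already smaller than the modest one, before counting the delay or the client’s trust.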

A happy client brings N projects. A burned client brings none.

A discount is margin you’ll pay back later. Predictability is margin that stays.

What margin can an agency charge reselling development?

The range practiced in white-label runs from 40% to 70% markup over the cost of delivery. But the number that matters isn’t the ceiling, it’s what’s left after rework. High markup on a delivery that comes back is zero. Honest markup on a predictable delivery is what builds the agency over time.

Predictability is the product. The rest is luck.

In the end, your client isn’t buying a tool from you. They’re buying the peace of mind that it’ll work when they need it most. That’s what you resell. The tool is the object. The predictability is the product.

When you hunt for the cheapest dev, you’re not saving. You’re betting that this time it’ll work out. Sometimes it does. But you didn’t build an agency to live on a bet.

Cheap always charges the difference. The only question is when, and in front of whom.

I inherited a vibe-coded MVP to scale: the honest diagnosis

Pablo Winter — Fri, 19 Jun 2026 10:00:00 -0300

TL;DR

Over the past few months the profile of who reaches out to me has changed. It used to be “I have an idea, build it for me.” Now it’s “I already have the app, it sort of works, but it stalls the moment it has to grow, and I don’t know whether to fix it or throw it out.” The product was vibe-coded: software generated by asking an AI and accepting the result without reading or understanding the code that came out. It works well enough to have paying customers. And well enough to be scary.

The first decision in front of one of these isn’t technical. It’s triage. And the most expensive temptation, the one almost everyone has on day one, is to order a full rewrite from scratch. It’s almost always a mistake. In most cases it’s not a transplant: it’s a test suite first, then a scalpel on what actually rotted. You save far more than the panic suggests.

The numbers explain the scare and why it misleads. 45% of AI-generated code carries an OWASP Top 10 flaw (Veracode), and only 10.5% of vibe-coded code passes a decent security review against 61% that simply “works” (Carnegie Mellon). It looks like a demolition order. It isn’t. It’s the measure of the distance between working and holding up, and distance you measure BEFORE you demolish, not after.

Working isn’t the same as done. But it isn’t the same as junk either.

Rewriting from scratch is the most expensive mistake there is

Everyone who calls me with a stuck app shows up with the same line on the tip of their tongue:

“This code is a mess. It’s faster to redo it from scratch than to understand this dumpster fire.”

Hold on. That line is the most expensive siren song in software engineering, and I’m not the one who discovered it. Joel Spolsky called rewriting from scratch “the single worst strategic mistake that any software company can make”, and that was in 2000, long before AI made the rewrite even more tempting and even more expensive.

Why a mistake? Because you’re about to throw out the one thing this MVP actually proved: that people want it. The code is ugly, but it carries months of learning baked in. Every weird hack is, half the time, a real edge case some customer hit, one that the “clean” rewrite will rediscover the hard way, in production, all over again.

Think about Twitter. It was born as a Rails monolith because that’s what two guys could move fast enough to find product-market fit. The scaling problems came later. Because it worked. If they had started “the right way,” beefy and distributed, there probably wouldn’t be a Twitter to have scaling problems. The speed of vibe coding is real and it’s valuable for validating. The mistake isn’t having validated that way. It’s not shifting gears once the validation is over.

A rewrite from scratch is the vanity of whoever just showed up on the project.

Before you touch a single line, the test suite

Here’s the step almost everyone skips, and it’s the one that separates a rescue from a second disaster: you don’t fix what you can’t test, nor safely refactor what you didn’t cover first. Tests aren’t the last step of the rescue. They’re the first.

And it’s exactly what vibe-coded code doesn’t have. The AI generates with a brutal happy-path bias: it covers the flow you described and ignores the rest of the universe. Network error, invalid input, two clicks on the same button, the user who does it in the wrong order. None of that exists in the code, so none of it breaks visibly. Until it breaks in front of the customer.

“Write tests before shipping a new feature? I’ll spend two weeks with nothing to show the board.”

I get the anxiety, and it’s backwards. The test suite isn’t time lost before delivering value. It’s what gives you permission to touch the code without praying. Without it, every refactor is a blind bet: you fix one bug and find out, three days later, that you broke two others nobody could see. With it, you refactor with your eyes open. It’s the difference between operating with the lights on and operating in the dark.

In practice, I don’t even ask for full coverage up front. I ask for tests on the flows that make money and the ones that lose money: the checkout, the login, anything that touches a balance. It’s the minimum safety net so everything that comes next is surgery, not roulette. That same discipline of validating before trusting is one we already broke down in “local validation with real quality”, only there it applies to the flow of whoever is building, not rescuing.
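As a sketch of what that minimum net can look like, here are three tests around a balance-touching function. The `charge` function and `InsufficientFunds` exception are hypothetical stand-ins for whatever the inherited code actually does; the point is the shape of the tests, which pin the happy path and the edge cases the AI tends to skip.

```python
import unittest


class InsufficientFunds(Exception):
    """Raised when a charge would push the balance below zero."""


def charge(balance_cents: int, amount_cents: int) -> int:
    """Hypothetical stand-in for the inherited charge logic.

    Returns the new balance, in cents, after the charge.
    """
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    if amount_cents > balance_cents:
        raise InsufficientFunds()
    return balance_cents - amount_cents


class ChargeSafetyNet(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(charge(10_000, 2_500), 7_500)

    def test_rejects_zero_and_negative_amounts(self):
        for bad_amount in (0, -1):
            with self.assertRaises(ValueError):
                charge(10_000, bad_amount)

    def test_balance_never_goes_below_zero(self):
        with self.assertRaises(InsufficientFunds):
            charge(1_000, 1_001)
```

Saved as a test module, this runs with `python -m unittest`. In a real rescue the first version of each test often just records what the code does today, right or wrong, so that a refactor cannot change behavior silently.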

The diagnosis: reading the debt statement

With a safety net in place, you can open the hood without fear. The diagnosis of a vibe-coded MVP almost always finds the same four holes. I call it reading the debt statement, because that’s literally what it is: finding out how much you owe and to whom.

Coupled architecture. The business logic is glued to the infrastructure, the API was designed without thinking about load, the schema can’t grow sideways. It works with fifty users because everything works with fifty users. The paper that formalized this calls it the flow-debt trade-off: the fluidity of generating code masks the debt piling up in parallel.
Missing observability. Logging, tracing, and metrics came in as an afterthought, or didn’t come in at all. When it falls over at 3 a.m., you have nowhere to look. The OneUptime line stays on the wall: “observability isn’t a nice to have, it’s your only safety net”. In vibe-coded code it’s the stand-in for the human review that never happened.
Decorative security. API key hardcoded in the repo, auth with the logic inverted, a database exposed with no access rule. It’s not the exception: Apiiro measured AI-generated code accumulating 10x more security findings in six months, with privilege-escalation paths up 322%.
Fragile deploy and CI. The classic case is preview, test, and production sharing the same database. That’s how Replit’s AI wiped the production database during an explicit code freeze, spelled out in ALL CAPS in the prompt, and then lied, saying the rollback was impossible. Separating environments is the cheapest and most ignored lesson on the list.

The statement is scary on purpose. But a statement isn’t an eviction notice. It tells you where the expensive debt is and where the debt you can roll over is, and that distinction is the whole rescue.

How do I know whether the architecture can be saved or not?

The practical criterion comes down to two questions: how coupled the business logic is to the infrastructure, and whether the data model can grow. If the logic that matters is buried in the middle of the controller, dependent on a database detail that won’t scale sideways, that piece is a localized rewrite: there’s no saving the foundation without redoing it. If the business logic is at least isolated, even if ugly, it’s remediation: you improve it from the inside without demolishing. Most of an MVP falls in the second case. That’s why a full rewrite is almost never the right answer: you condemn the whole building over two rooms.

What to save and what to rewrite without mercy

Triage is deciding what goes into the operating room and what gets discharged. After doing this on plenty of apps, the pattern is pretty stable.

Save almost always: the domain model that reflects the actual business (the names of things and how they connect), the flows the user already validated in practice, and a good chunk of the interface. That’s the knowledge that cost months and that a rewrite throws in the trash for free. We opened up the code-structure version of this discipline in “your codebase is the new prompt”: what decides whether it scales isn’t the stack, it’s the repository staying navigable.

Rewrite without mercy: the auth and permissions layer (it’s where getting it wrong hurts most, because it touches every request), the business logic the AI duplicated across eight, ten, twelve places, the schema that can’t take traction, and the integrations with no fallback at all. The duplication isn’t a detail: in 2024, for the first time in history, copy-pasted code overtook refactored code, with duplicate blocks growing 8x (GitClear). Each copy is one more place for a bug to hide and never get fixed everywhere.

What guides the cut is understanding why the app stalls where it stalls. Addy Osmani dubbed it the 70% problem: AI gets you fast to 70%, except it’s 70% of the volume of code, not 70% of the path to a finished product. The missing 30% is exactly what you can’t generate in a rush: edge cases, maintainability, performance, security. It’s the expensive part. It’s the part triage isolates.

Is vibe coding fit for production?

It’s fit to get you to the door, not through it. Vibe coding is a spectacular validation accelerator: it proves the hypothesis, wins the first customers, shows there’s a business. The mistake isn’t using it. It’s mistaking the prototype that validated for the product that scales, and going on stacking features on top of a foundation that was never built to carry weight. We discussed the mechanics of this, of why “prompt, accept, deploy” stalls when it’s time to grow, in “getting out of vibe coding”. Vibe coding gets you to the MVP. Method gets the MVP to a product.

Stabilizing without stopping the business: surgery with the patient awake

The last piece, and the one that most sets a competent rescue apart, is the “without stopping the business” part. Because the app is live, it has customers using it, it has revenue coming in. You can’t shut everything down for two months to tidy the house. You have to operate with the patient awake.

The way to do it has a name, the Strangler pattern: replacing the old system from the outside, module by module, while it keeps running, until the new one strangles the old. Instead of the big bang (“flip the switch to the new one on a Sunday and pray”), you pick a piece (auth, say), build the new version alongside, send a fraction of the traffic to it with a feature flag, confirm it holds, and only then retire the old one. Got it wrong? Rollback in one click, nobody notices. Repeat for the next module. The risk gets sliced into pieces that fit in your pocket, instead of a single bet that can take the company down.
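A minimal sketch of that routing step, assuming a percentage-based flag and two hypothetical auth implementations (`legacy_auth` and `new_auth` are placeholders, not real services). Each user is hashed into a stable bucket, so the same person always lands on the same side while the percentage grows.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def legacy_auth(user_id: str) -> str:
    # Placeholder for the old module that keeps running.
    return f"legacy:{user_id}"


def new_auth(user_id: str) -> str:
    # Placeholder for the replacement built alongside it.
    return f"new:{user_id}"


def authenticate(user_id: str, rollout_percent: int) -> str:
    """Send roughly `rollout_percent` of users to the new module."""
    if bucket(user_id) < rollout_percent:
        return new_auth(user_id)
    return legacy_auth(user_id)


# rollout_percent=0 is the one-click rollback; 100 retires the old path.
print(authenticate("user-42", 0))
print(authenticate("user-42", 100))
```

In practice the percentage would live in a flag service or config rather than a function argument, but the design choice is the same: the switch is a number you can move back, not a deploy.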

And the first slice, almost always, is the cheapest and the most forgotten: separating the environments. The production database is sacred, it has a tested automatic backup, and nobody, human or AI, touches it without a net. It’s the fix that would have prevented the entire Replit disaster, and it costs a day.
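One cheap way to enforce that separation is a startup check that refuses to run when two environments point at the same database. The sketch below is an assumption about how such a guard could look; the environment names and URLs are made up for the example.

```python
def validate_environments(urls: dict[str, str]) -> None:
    """Raise if any two environments are configured with the same database URL."""
    seen: dict[str, str] = {}
    for env_name, url in urls.items():
        if url in seen:
            raise ValueError(
                f"'{env_name}' and '{seen[url]}' share a database: {url}"
            )
        seen[url] = env_name


# Hypothetical configuration with one database per environment.
separated = {
    "preview": "postgres://db-preview.internal/app",
    "test": "postgres://db-test.internal/app",
    "production": "postgres://db-prod.internal/app",
}
validate_environments(separated)  # passes silently

# The classic vibe-coded setup: everything on one database.
shared = {
    "preview": "postgres://db-prod.internal/app",
    "production": "postgres://db-prod.internal/app",
}
try:
    validate_environments(shared)
except ValueError as error:
    print(f"refusing to start: {error}")
```

A check like this does not replace separate credentials and backups, but it turns the most common misconfiguration into a loud failure at boot instead of a quiet one in production.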

There’s a trade-off, and I don’t sell miracles. On paper, the Strangler is slower than rewriting from scratch: you keep two systems alive at the same time for a while, you pay the cost of keeping the two talking. It’s annoying. But it’s the price of not stopping the revenue while you operate, and it’s incomparably cheaper than the rewrite that freezes the product for a quarter and still ships late. It’s the kind of independent diagnosis, with no strings attached, that we deliver in an Audit before a single line gets touched: the map of what to save, what to rewrite, and in what order to move. We opened up what to cut and what to keep when you decide the scope of that rebuild in “what to cut and what to keep in an MVP”.

Working isn’t done. But it isn’t junk either.

The instinct in front of a stuck vibe-coded MVP is binary: either it’s wonderful because it’s live, or it’s junk because the code is ugly. Both are wrong. It’s exactly what it looks like: a prototype that validated a business and now needs to become a product, with a debt you can read, line item by line item, and pay off in the right order.

The distance between the demo that wowed and the system that takes traction is measurable. It’s not faith, it’s diagnosis: tests to turn the lights on, the statement to know what you owe, triage to separate what you save from what you rewrite, and the Strangler to operate without switching the patient off. Almost always you can do it without demolishing the house.

Done is a state you prove, not one you feel.

A client wants a custom tool and your agency can't build it

Lucas Israel — Wed, 17 Jun 2026 09:00:00 -0300

TL;DR

A request for a real tool on your client’s site is the best revenue an agency can land: high margin, recurring, harder to get swapped out of. But it runs into a real problem: you don’t have a dev, a dev shop bills high and disappears, and building an in-house team is the most expensive, riskiest way to find out it wasn’t going to work. The way out is to deliver with a technical partner under your own brand, without becoming a software company. The risk isn’t in taking the project. It’s in taking it the wrong way.

Why this is an opportunity, not a problem

When a client asks for a tool (not a site, a tool that does something), they’re telling you they trust you to solve a business problem, not just an image one. That’s the kind of work that pays more, lasts longer, and pulls you out of the landing-page price war. It’s exactly where an agency stops being a design vendor and becomes an operations partner.

The problem is the opportunity comes wrapped in a legitimate fear: “what if I say yes and can’t deliver?”. That fear is healthy. It just can’t become the reason you hand money back to the competition.

The three obvious ways out (and where each one breaks)

Turning it down, hiring a dev shop, or building an in-house team are the natural reactions. They all carry a hidden cost: the first makes you lose the account, the second makes you lose margin and control, the third makes you lose money for months before you even know if it was worth it. Worth breaking down.

Turning the request down

Looks prudent. In practice, you teach the client to go look for whoever “does the tech part”, and that person, sooner or later, also does the marketing part. You don’t just lose the project: you open the door to losing the account.

Calling a dev shop

The quote comes back high, the timeline comes back long, and control leaves your hands. Worse: in front of the client, the brand answering for the tool is yours, but the one setting the pace is a vendor you don’t control. When it stalls, you’re stuck in the middle, no team to fix it and no vendor in sight.

Hiring a dev (or a team)

This is the one that feels most “definitive” and it’s the most expensive of all. A senior developer in the US runs $120k to $160k a year in base salary (Glassdoor), and that’s just salary, no benefits, no time-to-productivity, no risk of them quitting in six months. You’re taking on a high, recurring fixed cost for a demand that’s still one-off. And then comes the part nobody at the agency says out loud: you don’t know how to hire a dev, you don’t know how to tell if they’re any good, and you don’t know how to keep them afterward. Judging a professional from a field that isn’t yours is a blind bet with your own money.

The fourth way: deliver under your brand, without becoming a software company

There’s a middle ground between turning it down and building a tech department: outsource the delivery to a technical partner who works under your brand, with fixed scope and timeline. You keep the client, the margin and the relationship; the partner ships the working tool, and vanishes from the client’s radar. It’s the model the market calls white-label: the technical partner delivers under your brand and never shows up in front of your client, and it exists precisely because an in-house senior dev is too expensive for most agencies (Xovak).

The difference between this and a traditional dev shop isn’t just price. It’s how the work gets done: scope defined up front, short timeline, delivery in phases with you approving each step. You’re not buying “development hours” with no end. You’re buying a result, with a date and a price agreed before anything starts.

Where this breaks too (because everything has a trade-off)

The wrong partner is as bad as no partner. The model only works if three things are clear: the scope is fixed (no “we’ll figure it out later”), the code stays with the client (or with you, never hostage to the vendor), and the tool is built well, not a cheap hack that breaks in front of the end client and becomes your headache. A cheap, badly built tool costs more than a dev shop: you pay twice, and the second time in front of your client. If the partner won’t lock scope and timeline before starting, that’s the sign you’re about to become the dev shop, just without the team.

How to decide, in practice

If the demand is one-off and you want to test the client’s appetite without taking on fixed cost: partner, fixed scope. If the demand has become recurring and predictable enough to pay a salary every month with room to spare: then, maybe, it makes sense to think about a team, but only after you’ve delivered a few times with a partner and understood what the work actually takes. Starting with the team is betting before you have the information.

The simple rule: don’t build structure for a demand you haven’t validated yet. Deliver first, learn the real cost, and only then decide whether it becomes an in-house operation.

FAQ

How much does it cost to deliver a tool like this through a partner?

It depends on scope, but the whole point of the model is having the price fixed before starting, not open-ended hours. A simple tool (calculator, smart form, configurator) usually comes out at a fraction of what a dev shop charges for a “from scratch” project, because the scope is lean and the method is standardized.

Will the client know I outsourced it?

In the white-label model, no. The delivery goes out under your brand. The technical partner doesn’t show up in front of your client.

What if the tool breaks down the line?

That’s why “built well” isn’t a luxury, it’s insurance. A tool with solid structure is stable and easy to adjust. The noise comes from cheap hacks that break in front of the end client: exactly what you want to avoid when it’s your brand on the line.

Wouldn’t it be safer to just hire a dev?

Only if the demand is already recurring enough to pay the salary with room to spare. For a one-off or still-uncertain demand, hiring means taking on the highest cost and the biggest risk (evaluating and keeping someone from a field that isn’t yours) before you know it’s worth it.

How long does it take?

Fixed-scope tools usually ship in weeks, not months, because the scope is defined up front and the work is done in phases, with you approving each step.

Your MVP doesn't turn to junk for being fast, but for cutting the wrong thing.

Bruno Raphael — Tue, 16 Jun 2026 10:00:00 -0300

TL;DR

Every technical founder who sits down with me shows up with the same fear: “I launch in weeks, but I don’t want to rewrite everything three months from now.” The fear is legitimate. It just makes you cut the wrong thing.

The MVP that turns to junk almost never turned to junk for being fast. It turned because it cut the separation (which is cheap to keep and expensive to redo) to keep a feature (which is expensive to build and that almost nobody will use). The MVP that scales does the opposite: it cuts features without mercy and keeps the separation always. It isn’t born a disposable prototype. It’s born as phase 1 of a product, and phase 2 grows on top of it when you cut and kept right.

The numbers anchor the inversion. 42% of the startups that die do so building something the market didn’t want. In an average product, 64% to 80% of features are rarely or never used. And redoing what came out wrong is the quietest cost: teams rework about 26% of their code before they even ship, and most of that comes from misunderstanding what to build, not from coding fast.

Speed isn’t the enemy of the MVP that scales. Blind scope is.

The MVP doesn’t turn to junk for being fast. It turns for merging everything and keeping the feature.

The image of an MVP the founder carries around is almost always the wrong one: a small version of the whole product. A bit of each thing. Every screen, every feature, just half-baked.

“An MVP is for validating fast. Do the minimum of everything, polish it later.”

That “minimum of everything” is the trap. Minimum doesn’t mean shallow at everything. It means narrow: little, but whole. You pick the one thing the product needs to do well to prove someone wants it, and you do it end to end. The rest isn’t “half-done.” It’s cut.

And here’s the catch the fear of rework hides: separating things is cheap, and merging them back is what costs. A feature is expensive to build and cheap to cut. When you cut architecture to save time, you saved on what was cheap and you’ll pay dearly later. When you cut a feature, you saved on what was expensive and that probably nobody would use.

An MVP that turns to junk cut in the wrong place.

What to cut without mercy

Cutting well is a skill, and it hurts because everything looks essential at the start. It isn’t. Start here:

Features that don’t validate the thesis. This is the mother of all cuts. If the feature doesn’t help prove someone wants the product, it’s not v1. The 42% that die built something without product-market fit, meaning evidence that enough people want what you make, at the price you charge. They didn’t die for lack of features. They died from features in the wrong direction.
The second, third, and fourth feature. In an average product, only 12% of features drive 80% of usage. In the MVP you don’t yet know which slice that is. But you know it’s not all twenty. Bet on one, two at most.
Scale that doesn’t exist. Cache, queue, sharding, microservice for ten users. Optimizing a load you don’t have is solving an imaginary problem while the real one (someone using it) goes unanswered.
Configurability. Every “what if the client wants to change this?” becomes a settings panel that doubles the scope. In the MVP, hard-code it. Configurable is a problem for whoever already has clients.
Polish. Animation, dark mode, a five-step onboarding, an illustrated empty state. All real, all phase 2.

What’s left looks like little. It’s supposed to. If your MVP doesn’t leave you a little embarrassed, you cut too LITTLE.

What to keep always (keeping the separation is cheap; redoing it is what costs)

Cutting features is the easy part once the penny drops. Where I see the founder in a hurry get it wrong is the other side: what you do NOT cut, not even on the tightest deadline. There are few of them, all cheap to get right now and ruinously expensive to redo later.

The domain model. The names of things and how they connect. Swapping “user” for “account” and “organization” in month six is data migration, a refactor that crosses the whole system, and a production bug. Deciding it in week 1 costs a conversation.
The divisions between business capabilities. Where payment ends and the order begins. You don’t have to implement both well. You have to know where the line between them sits, so you can later touch one without dismantling the other.
Identity and who-can-do-what. If the product has more than one kind of user, bolting auth and permissions on later is one of the most expensive refactors there is, because it touches every request.
A thread of observability. Structured logs and a way to know what broke. It’s not a feature. It’s what lets you sleep.
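For the observability thread, a thin sketch using only Python’s standard `logging` and `json` modules. The `JsonFormatter` class, the `checkout` logger name, and the `context` field are illustrative choices, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"context": {...}}` end up on the record.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "payment failed",
    extra={"context": {"order_id": "ord_123", "reason": "card_declined"}},
)
```

One JSON object per event is enough to search by order or by reason when something breaks at 3 a.m. Tracing and metrics can come later, behind the same habit.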

The number that justifies the stubbornness: teams rework close to 26% of their code before release, and Carnegie Mellon has pointed to the same root cause for decades. More than half of rework comes from a misunderstood requirement, not from badly written code. Expensive rework doesn’t come from you coding fast. It comes from drawing the separation in the wrong place, or from not drawing one at all.

How do I keep a separation for something I don’t even know will scale?

You don’t guess what will scale, nobody does. But you don’t have to: instead of deciding the implementation, you decide where the seams go. Keeping the seam is cheap (a module with a clear name, payment that isn’t shoved into the middle of the order) and behind it you do the simplest, dumbest thing that works today. When the load shows up, if it shows up, you swap what’s behind it without touching whoever depends on it. Phase 2 becomes a part swap, not a restart.
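A small sketch of what keeping the seam can mean in code, assuming a payment capability behind an interface. `PaymentGateway`, `ManualInvoiceGateway`, `Order` and `place_order` are hypothetical names; the implementation behind the seam is deliberately the dumbest one that works.

```python
from dataclasses import dataclass
from typing import Protocol


class PaymentGateway(Protocol):
    """The seam: the order code depends on this, not on a provider."""

    def charge(self, order_id: str, amount_cents: int) -> bool: ...


class ManualInvoiceGateway:
    """Simplest thing that works today: note the charge, invoice by hand."""

    def __init__(self) -> None:
        self.pending: list[tuple[str, int]] = []

    def charge(self, order_id: str, amount_cents: int) -> bool:
        self.pending.append((order_id, amount_cents))
        return True


@dataclass
class Order:
    order_id: str
    amount_cents: int
    paid: bool = False


def place_order(order: Order, gateway: PaymentGateway) -> Order:
    """Order logic knows where payment starts, not how it is implemented."""
    order.paid = gateway.charge(order.order_id, order.amount_cents)
    return order


gateway = ManualInvoiceGateway()
order = place_order(Order("ord_1", 4_900), gateway)
print(order.paid, gateway.pending)
```

When real load or a real payment provider shows up, a new class that satisfies `PaymentGateway` replaces `ManualInvoiceGateway`, and `place_order` does not change. That is the part swap.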

We broke down the code-level version of this discipline (organizing by feature, frontend and backend in the same repository, decisions recorded) in “your codebase is the new prompt”, which is what keeps an AI agent productive in your MVP six months later. And the record of why each of those decisions exists lives in the ADRs. This post is the floor above: what to cut and what to keep before the code exists.

The MVP is phase 1, not the prototype (phase 2 is the proof)

There’s a word that gives away an MVP that turned to junk: rewrite. Joel Spolsky called rewriting from scratch “the single worst strategic mistake that any software company can make”, and that was in 2000, long before AI made the rewrite even more tempting and more expensive. The MVP that scales never goes through it. It goes through extensions: each phase adds on top of the previous one, because the previous one left the separation in place.

It’s what we do at Nextside in a Sprint: fixed scope, senior team, a working MVP in 4 weeks that’s born with the right divisions to grow in phases. The hurry stays in the scope, the rigor stays in the separation. And the short deadline isn’t a limitation, it’s the mechanism: it forces the cutting conversation the founder puts off for months.

The MVP that scales is the one you don’t have to rebuild

The difference between the MVP that scales and the one that turns to junk isn’t the stack, the size of the code, or the name of the architecture. It’s two decisions you make before writing the first line: what to cut and what to keep.

Cut the feature, the scale that doesn’t exist, the polish, the “what if.” Keep the domain, the divisions, the identity. Do little, but whole, instead of a lot half-done, and phase 2 becomes an extension of what you already have, not the funeral of what you threw away.

An MVP that scales isn’t the one that got done fastest. It’s the one you won’t have to rebuild.

Your codebase is the new prompt: the MVP that scales (or turns to junk)

Pablo Winter — Mon, 15 Jun 2026 10:00:00 -0300

TL;DR

Your codebase is the new prompt. In an MVP built with an AI agent, what decides whether it scales in phases or turns into disposable junk isn’t the stack you picked. It’s whether the agent can still find its way around your repo six months from now. And you solve that in the organization: code by feature, frontend and backend in the same monorepo, decisions recorded in ADRs. Not in the cleverness of the prompt.

The number that anchors this: in a study of coding agent trajectories on real bugs, the attempts that solved the problem touched the same file as the correct patch 93.6% of the time. The ones that failed, 62.7%. Locating the right code is half the game, and being locatable is a property of your architecture, not of the model.

Architecture stopped being the tax you pay to go slow. It became the thing that keeps AI fast.

Junk isn’t what was built fast. It’s what was built blind.

Every technical founder who comes to me shows up with the same fear, and it’s legit: “I need to ship in weeks, but I don’t want to rewrite everything three months from now.” Then comes the belief I want to kill here:

“Architecture is a luxury for people with time. Ship now, fix it later.”

I hear this every week. And I agreed on some level, until AI changed the math. Because “fix it later” assumes a choice that no longer exists: either you ship fast, or you deliver something well-architected. To accept that choice is to accept that the MVP is born a disposable prototype, and that the “real” version comes later, from scratch.

Hold on. That dichotomy is dead, and AI is what killed it.

Before, good architecture cost time. You drew boundaries, separated responsibilities, wrote docs. Every hour of that was an hour that didn’t become a feature on screen. In an MVP with a deadline of weeks, cutting architecture looked like the rational trade-off. It was. Not anymore.

What changed: the code you generate today, for the most part, no longer goes straight from your head to the editor. It comes out of a coding agent: an AI agent that reads, edits, and runs your repository on its own, operating inside a harness, the platform that plugs the model into the code tooling. Claude Code and Cursor are two harnesses. And that agent has a trait that changes the whole calculation: it’s only as fast as your repo lets it be.

Vibe coding (the whole “ask, accept, deploy” without understanding what came out) is great for a weekend prototype. The problem is the bill, which isn’t linear. A 2025 paper formalized this as the flow-debt trade-off: the fluidity of generating code masks the debt piling up in parallel. Architectural inconsistency, a dependency nobody evaluated, the same problem solved five different ways. Around the sixth month, the cost of undoing the debt overtakes the value of what was built.

It turns into a ball and chain. And the cruel detail: the ball and chain doesn’t just slow down your team. It slows down the very agent that created it. The signals it relies on to find its way (consistent naming, predictable patterns, low coupling) were destroyed by the careless generation itself.

An MVP that turns to junk isn’t the one built fast. It’s the one built BLIND, leaving no trail for the AI or for the human who has to work on it later.

AI reads your repository, not your prompt

There’s a line from Matt Pocock that captures the shift: “your codebase, not your prompt, decides the quality of the AI’s output.” Sounds like an exaggeration. It isn’t.

Look at how Claude Code finds code in a large repository. It doesn’t use semantic search, it has no magic embeddings index. It does what a senior dev would do: navigates the filesystem, reads a file, and runs grep, the terminal’s old literal text search, to find exactly what it needs. Anthropic chose grep on purpose: embeddings go stale, the repo changes all the time, and a stale index lies.

The consequence is physical, not philosophical: “grep finds strings, not intent.” If the function that matters is called validateToken, the agent finds it on the first try. If the logic is scattered across five files loosely tied by imports, with generic names like handler or process, it digs around, loads too many files, and burns context before it even starts the work.
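To make that concrete, here is a toy literal search over a hypothetical five-file repo. The grep function below is a stand-in for the real command, and the file names and contents are invented for illustration.

```python
def grep(repo: dict, needle: str) -> list:
    """Literal text search: return the paths whose content contains needle."""
    return sorted(path for path, text in repo.items() if needle in text)


# Hypothetical repo: file path -> file content.
repo = {
    "auth/token.py": "def validateToken(token): ...",
    "auth/session.py": "def handler(request): ...",
    "orders/api.py": "def handler(request): ...",
    "payments/webhook.py": "def handler(event): ...",
    "refunds/api.py": "def process(request): ...",
}

specific_hits = grep(repo, "validateToken")  # a distinctive name
generic_hits = grep(repo, "handler")         # a generic name
```

The distinctive name lands on a single file. The generic one returns three candidates the agent has to open and rule out before it starts the work.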

And here lives the number that opens this post. Researchers looked at coding agent trajectories on real SWE-bench bugs. The attempts that fixed the bug touched the same file as the correct patch 93.6% of the time. The ones that failed, 62.7%. Translating: the agent’s bottleneck is almost never “knowing how to code.” It’s finding the right snippet. Locating well is what separates the PR that merges from the one that rots.

Organizing by technical layer sabotages exactly that. When everything is controllers/, services/, models/, to touch checkout the agent opens five folders and loads files from another twelve features that live in the same folders. The context window becomes, in the words of an article I read about this, “a junkyard of irrelevant stuff.”

And it’s not just the AI that suffers. Technical layering is an old violation of the Single Responsibility Principle (SRP), the first principle of SOLID, which Uncle Bob restated as “gather together the things that change for the same reasons, and separate those things that change for different reasons.” Organizing by layer does the opposite: it shatters the feature (which changes together) across several folders, and piles into each folder code whose only thing in common is being “a controller.” The fix has a name, and it’s the subject of the next section.

Shouldn’t AI be smart enough to find it on its own?

It’s the question every CTO asks, and the honest answer is: it is, up to a point, and that makes your complacency worse. The agent does find it. It reads 25 files to answer about 3 functions, because without structure it didn’t know which 3 they were. It works, and it charges you in tokens, in time, and in hallucination when the context fills up with noise.

And here I have to be honest, because the simplistic version of this idea (“a bad codebase blocks the AI”) is overblown. It’s not that human and agent get stuck the same way. They have opposite strengths. The AI can brute-force its way through a chaotic repo: business rules scattered across twenty files, it burns a million tokens of context and finds it anyway. A human, in the same repo, would take days, or give up. In that case the AI is better than you.

It’s just that the human has a weapon the AI doesn’t have natively: the IDE. You fire an event with ApplicationEventPublisher in Spring, and IntelliJ shows you every @EventListener that listens to that event, in order, in one click. It’s a semantic index of the entire codebase, for free. The AI doesn’t have that: it falls back on a bunch of greps and on loading file after file into context, and that’s where context rot hits, the degradation of model quality as the window fills.

So the right framing isn’t “AI exposes bad architecture.” It’s: bad architecture charges a different toll from each one. From the human, in time and in IDE dependence. From the AI, in tokens and in context rot. An organized repo lowers the toll for both at once. That’s why the codebase is the new prompt: it is, literally, the context the agent reads before each task, and the cleaner it is, the less it pays to understand you.

Organize by feature, not by layer (and forget the architecture’s name)

The fix is more boring than it sounds, and it’s free: organize code by feature, not by technical layer.

Instead of controllers/, services/, repositories/ (where each feature is shattered across every one of those folders), you make one folder per business capability: orders/, payments/, refunds/, each with its own controller, service, and data access inside. The name for this, in the literature, is vertical slice: a slice that runs from the edge (the request) to the bottom (the database), whole, in the same place. Jimmy Bogard nailed the golden rule: “minimize coupling between slices, and maximize coupling in a slice.”
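A rough way to see the difference is to count what has to be opened to touch one feature in each layout. The two file lists below are hypothetical, and belongs_to, folders_for, and noise_files are illustrative helpers, not part of any tool.

```python
from pathlib import PurePosixPath

# The same three features in two hypothetical layouts.
by_layer = [
    "controllers/orders.py", "controllers/payments.py", "controllers/refunds.py",
    "services/orders.py", "services/payments.py", "services/refunds.py",
    "repositories/orders.py", "repositories/payments.py", "repositories/refunds.py",
]
by_feature = [
    "orders/controller.py", "orders/service.py", "orders/repository.py",
    "payments/controller.py", "payments/service.py", "payments/repository.py",
    "refunds/controller.py", "refunds/service.py", "refunds/repository.py",
]


def belongs_to(feature: str, path: str) -> bool:
    """A file belongs to a feature by folder name or by file name."""
    p = PurePosixPath(path)
    return feature in p.parts or p.stem == feature


def folders_for(feature: str, files: list) -> set:
    """Folders that must be opened to see every file of one feature."""
    return {str(PurePosixPath(f).parent) for f in files if belongs_to(feature, f)}


def noise_files(feature: str, files: list) -> int:
    """Files of other features sitting in the folders opened for this one."""
    opened = folders_for(feature, files)
    return sum(
        1 for f in files
        if str(PurePosixPath(f).parent) in opened and not belongs_to(feature, f)
    )


layer_folders = folders_for("refunds", by_layer)
feature_folders = folders_for("refunds", by_feature)
```

In this toy repo, touching refunds by layer means three folders and six unrelated files alongside. By feature it is one folder and zero.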

For the AI, this is attention routing. The agent reads the folder name before opening any file, and infers the scope of the task right away. “Touch the refund” already sends it to refunds/, and everything that matters is placed right there together. Uncle Bob called this Screaming Architecture over ten years ago: the folder structure should scream what the system does, not which framework it uses. In 2011 it was aesthetics. Today it’s performance for whoever’s going to code. And whoever’s going to code is an agent.

Here’s a piece of honesty that disarms. In the briefing for this post, someone on the team wrote “use NGC architecture or whatever fits.” I went to look up what “NGC architecture” is. It doesn’t exist. It’s not a consolidated pattern; it’s probably a typo for N-tier, or just an acronym that slipped out. And you know what that proves? That the name matters less than you think. Clean, hexagonal, onion, N-tier: deep down they’re the same idea (business rules at the center, framework and database at the edge) with different vocabulary. What decides whether the agent, and your team, will be able to evolve the code isn’t the architecture’s badge. It’s the discipline of boundaries.

That said, don’t fall into the opposite extreme. Clean Architecture with four layers of abstraction in an MVP is over-engineering; someone compared it to playing Dark Souls: too many rules, too much ceremony, for a product nobody may even want yet. The point isn’t the purest architecture. It’s the most navigable one.

And there’s a trade-off, of course. Organizing by feature generates duplication: two slices validate something similar, three features hit the same table. The instinct is to abstract it all into a shared/, and then shared/ becomes the trash can that couples everyone together again. Sandi Metz has the best rule for this: “duplication is far cheaper than the wrong abstraction.” In an MVP, accepting a bit of copy-paste to keep slices independent almost always beats religious DRY. Shared only for real infra: database client, logging, auth. Never for business rules.
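The “shared only for real infra” rule can be checked mechanically. Below is a sketch under one assumption: the top-level folder of each file is its slice, and shared/ is the only folder every slice may import from. The function name and the regex are illustrative. A real project would more likely lean on an existing import linter.

```python
import re

# Captures the first segment of an absolute import, e.g. "payments" in
# "from payments.rules import max_refund". Relative imports are ignored.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)


def cross_slice_imports(path: str, source: str, slices: set) -> list:
    """Return the other slices that the file at `path` imports from."""
    own_slice = path.split("/", 1)[0]
    imported = set(IMPORT_RE.findall(source))
    return sorted(name for name in imported if name in slices and name != own_slice)


SLICES = {"orders", "payments", "refunds"}

leaky = (
    "from shared.db import client\n"
    "from payments.rules import max_refund\n"
    "import orders.models\n"
)
clean = "from orders.models import Order\nfrom shared.logging import log\n"

leaky_result = cross_slice_imports("refunds/service.py", leaky, SLICES)
clean_result = cross_slice_imports("orders/service.py", clean, SLICES)
```

Run in CI, a check like this turns “never for business rules” from a convention into a failing build.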

Monorepo and ADRs: stop making the AI (and your team) guess

Organizing inside the project solves half. The other half is what’s between the projects, and that’s where the monorepo comes in.

The idea: frontend and backend in the same repository. Together with the docs folder, the ADRs, the conventions. One history. There’s a line from Francis Dortort that closes the argument: “a repository boundary is a context wall. Every wall degrades the quality of AI-generated output.”

Think about the concrete case. You ask “add a field to the signup form.” In a setup with two separate repos, the agent needs two conversations with no memory of each other, and the contract between frontend and backend drifts along the way. In a monorepo, it’s a single transaction: it adds the column in the database, updates the API, adjusts the UI and the test, in a single context, in a single commit. DB, API, and UI without switching windows. It’s exactly the kind of cross-cutting change an MVP makes all the time.

Tooling? Start simple: pnpm workspaces with Turborepo handles most MVPs with very low friction. Nx when the scaling pain shows up, not before. And the honest trade-off: a monorepo without selective build tooling gives you a slow CI. If every commit rebuilds everything, the bill explodes. It’s a solvable problem, but it’s a problem you take on deliberately.
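What “selective build” means, in sketch form: map the changed files to packages, then rebuild those packages plus everything that depends on them. The dependency graph and package names below are hypothetical, and tools like Turborepo and Nx implement a far more complete version of this idea.

```python
from typing import Optional

# Hypothetical monorepo: package -> packages it depends on.
DEPENDS_ON = {
    "apps/web": {"packages/ui", "packages/api-client"},
    "apps/api": {"packages/db"},
    "packages/api-client": set(),
    "packages/ui": set(),
    "packages/db": set(),
}


def owning_package(path: str) -> Optional[str]:
    """Return the package a changed file lives in, if any."""
    for package in DEPENDS_ON:
        if path.startswith(package + "/"):
            return package
    return None


def affected_packages(changed_files: list) -> set:
    """Changed packages plus every package that depends on them."""
    affected = {p for p in map(owning_package, changed_files) if p is not None}
    grew = True
    while grew:  # keep expanding until no new dependent is found
        grew = False
        for package, deps in DEPENDS_ON.items():
            if package not in affected and deps & affected:
                affected.add(package)
                grew = True
    return affected


ui_change = affected_packages(["packages/ui/src/button.tsx"])
docs_change = affected_packages(["docs/adr/0001.md"])
```

A change to the UI package rebuilds the UI and the web app and leaves the API alone. A change to an ADR rebuilds nothing.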

The ADR is the other piece, and the most underrated. I already explained ADRs in another post: each one is a short, dated record of a technical decision and the why behind it. What changed with AI is the use. Without the ADRs in context, the agent ends up, in the words of an article, “deprived of architectural intent”: it sees the implementation, but not the reasoning. It knows you use Postgres. It doesn’t know why you ruled out Mongo, so it might “improve” your code by reintroducing exactly what you rejected. The ADR, together with a CLAUDE.md or AGENTS.md in the repo, is how you hand over intent on a silver platter, instead of praying it guesses.

Now the counterweight, because I’m not selling miracles. None of this is magic, and more documentation isn’t always better. An ETH Zurich study tested context files and found that an auto-generated AGENTS.md WORSENED the success rate in several scenarios and raised inference cost by more than 20%. And METR, in a controlled study, measured experienced devs getting 19% slower with AI while they themselves believed they were faster.

What that tells you: the gain doesn’t come from stuffing the repo with markdown. It comes from the non-obvious well recorded: the counterintuitive decision, the gotcha you can’t infer from the code. ADRs and conventions are a scalpel, not a flood. Good context, in Anthropic’s own words, is “the smallest possible set of high-signal tokens,” not the largest pile of tokens.

The MVP that scales is the one AI still understands tomorrow

Put it all together and the picture is simple. The MVP that scales doesn’t have a more expensive stack or a more sophisticated architecture than the MVP that turns to junk. It has boundaries. Code by feature, frontend and backend in the same place, decisions recorded. Three cheap disciplines that, added up, keep an AI agent productive in phase 2, phase 3, phase 4, instead of stuck at month six.

This doesn’t mean building everything. It means cutting the right thing, and what to cut and what to keep in an MVP became a post of its own. Martin Fowler has a technical debt quadrant every founder should know: debt can be deliberate and prudent (“we need to ship now and deal with the consequence later”) or reckless and blind (“we don’t have time for design”). The first is a legitimate business decision. The second is the prototype that’s going to blow up. Junk isn’t having debt. It’s not knowing you have it.

And what to cut first? Premature scale. Startup Genome looked at more than three thousand startups and found that 74% of the ones that died, died from scaling too early: optimization, microservices, distributed infra for a load that didn’t exist. Microservices in an MVP is the perfect example of reckless debt disguised as good engineering. Start with a monolith, modular, with clean boundaries. The boundary is what makes the next phase an extraction, not a demolition.

It was the same pattern I wrote about when code review became the bottleneck: AI sped up the individual, and the part that didn’t keep up became the brake. With architecture it’s the same, only earlier: the disorganized repo is the bottleneck you plant on day one and only feel on day one hundred and eighty.

Your MVP doesn’t need to be perfect to scale. It needs to be readable. The code AI still understands six months from now is the code that doesn’t turn to junk. The rest is a rewrite waiting for its date.

Spec-driven development: getting unstuck from vibe coding

Lucas Israel — Sat, 13 Jun 2026 08:00:00 -0300

TL;DR

You prototyped fast with AI. Now the app won’t scale, and every new feature breaks two old ones. That’s the exact point where vibe coding stops helping, and where spec-driven development (SDD) starts paying off. The idea is simple and it flips the order of the game: the specification becomes the primary artifact, and the agent implements from it instead of guessing. The trade-off is real: you swap the rush of “it worked first try” for 30 minutes writing a spec before you code. For a throwaway prototype, it’s not worth it. For what’s going to production and needs to grow, it’s what separates delivery from duct tape.

Vibe coding is great for figuring out what to build. It’s terrible for sustaining what already exists.

Vibe coding doesn’t fail because it’s AI. It fails because it’s ambiguous.

In a loose prompt, the model has 30 ways to implement the same feature, and if you run the same instruction twice, you get something different. That ambiguity is tolerable in a prototype and fatal in maintenance: nobody (not you, not the next dev, not the agent) knows what the rule was. The code is the only source of truth, and it changes with every generation.

If you’re a CTO and still on vibe coding, the symptom is familiar: the MVP shipped in a week, the team doubled its speed at first, and now every AI PR needs three rounds of review because the agent “forgot” a decision that was never written down anywhere. I’ve watched this up close more than once: people think it’s a matter of hiring one more senior. It isn’t.

The bottleneck stopped being writing code. It became aligning context.

What spec-driven development changes

SDD puts the specification before code generation: requirements, business rules, API contracts, and architecture constraints become a document the agent reads and follows. The spec is versioned, reviewed, and reused: the code becomes output, not the source of truth. Less guessing, fewer “that’s not what I meant” loops.

In practice the flow is straightforward: you describe the behavior and the constraints → the agent proposes a plan against the spec → you validate the plan (not 400 lines of diff) → the agent implements and tests against the criteria the spec itself defined. Review stops being “is this right?” and becomes “does this match the spec?”. A question you can answer in minutes.
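One way to picture “does this match the spec?” is a spec whose acceptance criteria are executable. This is a sketch with an invented shipping-fee rule. Criterion, SHIPPING_SPEC, and review_against_spec are hypothetical names, not part of Spec Kit or Kiro.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    description: str
    check: Callable[[Callable[[int], int]], bool]


# Hypothetical spec, written before any code exists.
SHIPPING_SPEC = [
    Criterion("orders of 20000 cents or more ship free", lambda fee: fee(20_000) == 0),
    Criterion("orders below the threshold pay a flat 1500 cents", lambda fee: fee(19_999) == 1_500),
    Criterion("empty orders pay nothing", lambda fee: fee(0) == 0),
]


def review_against_spec(implementation: Callable[[int], int], spec: list) -> list:
    """Return the descriptions of the criteria the implementation fails."""
    return [c.description for c in spec if not c.check(implementation)]


def shipping_fee(subtotal_cents: int) -> int:
    """Stand-in for the code an agent would generate from the spec."""
    if subtotal_cents == 0 or subtotal_cents >= 20_000:
        return 0
    return 1_500


failures = review_against_spec(shipping_fee, SHIPPING_SPEC)
```

An empty failures list is the review. A non-empty one names, in the spec’s own words, what the generated code got wrong.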

This isn’t blog theory. On internal projects with Spec Kit, GitHub reports roughly an order of magnitude fewer “regenerate from scratch” cycles than ad-hoc prompting. AWS documents, with Kiro, cases of 40-hour features delivered in under 8 hours of human time when the work started from the spec. And the very person who coined “vibe coding”, Andrej Karpathy, has publicly acknowledged the limits of the approach for real software.

Does SDD work with AI agents like Claude and Copilot?

Yes, and that’s exactly what it was built for. Tools like GitHub Spec Kit and AWS Kiro integrate with agents like Claude Code, Copilot, and Gemini CLI. The spec becomes the context the agent follows: the same role a well-written CLAUDE.md plays day to day, just promoted to a first-class artifact of the project.

Where this breaks

SDD is no silver bullet, and pretending it is would be falling for the same mistake as the vibe coding hype.

“This is just more ceremony. One more pretty document nobody reads.”

It can be. That’s the real risk, and I’ve watched it turn into exactly that. Writing a spec costs thinking time: for a one-day spike, a hypothesis test, or a throwaway, the overhead doesn’t pay off and vibe coding wins. SDD is also only as good as the spec: a vague spec produces vague code, and you’ve only moved the ambiguity into a prettier document.

The rule I use is simple: if the code will live longer than a month or pass through someone else’s hands, spec it. If it’s throwaway, don’t. The decision is by stage, not by dogma.

Does spec-driven development replace vibe coding?

Not for everything. Vibe coding stays great for prototypes, spikes, and hypothesis validation, where the speed of discovering beats the discipline of sustaining. SDD wins when the code goes to production, needs to scale, or passes through other people’s hands. It’s not one replacing the other. It’s knowing which stage you’re in.

How to get out of vibe coding without stopping the team

You don’t have to rewrite everything. The migration is incremental and starts on the next feature, not in a big bang:

Write the spec before calling the agent, even a short one. Expected behavior, rules, constraints. Five lines already change the game.
Every business rule lives in the spec: not in a comment, not in Slack, not in someone’s head. If it’s not in the spec, it doesn’t exist to the agent.
Use the spec as the review criterion: the question stops being “is this good?” and becomes “does this match what we specified?”.

Within a few weeks the rework drops, because the context stopped evaporating between one generation and the next.

Does SDD make development slower?

At the start of each feature, yes: you invest those 30 minutes writing the spec. Overall, it tends to be faster. That is what both reported numbers point to: the order of magnitude fewer regenerate-from-scratch cycles from GitHub and the 40-hour features delivered in under 8 hours from AWS. You pay upfront so you don’t pay the compound interest of rework later.

The spec is the context that doesn’t evaporate

Vibe coding gives you the first mile for free and charges the rest of the road in rework. SDD does the opposite: it charges upfront and gives back predictability.

The point isn’t to abandon AI: it’s to stop treating the generated code as the source of truth. The source of truth is the spec. The code is just the output.

If your team is going to burn tokens either way, burn them on what’s specified.

The spec was the easy part. SDD's bottleneck is execution

Pablo Winter — Fri, 12 Jun 2026 09:00:00 -0300

TL;DR

Spec-Driven Development solved a real problem: you externalize intent into versioned markdown (PRD, tech spec, task list) and the spec becomes the source of truth. Except nobody tells you the bill for execution. One spec generates dozens of tasks, and running all of them in a single conversation is where the context degrades and you turn into a window manager. The way out the whole industry converged on, from expensive framework to a bash while loop, is the same: take state out of the model’s window and put it in a file or in code, with a reviewer who is never the one who wrote it. Anthropic’s Dynamic Workflows, where Claude itself writes the script that orchestrates the agents, are one form of this. There are several.

This post is about why execution was the bottleneck all along, and why everyone is arriving at the same two rules.

Nobody tells you the bill for execution

Spec-Driven Development is simple to describe: you write the intent before the code. PRD becomes tech spec, tech spec becomes a list of atomic tasks, and only then does the agent generate code. GitHub Spec Kit, Amazon Kiro, Tessl, each with its own flavor. The spec is the source of truth, the code is a consequence.

Writing the spec is the easy part.

My last spec generated thirty-something tasks. Hell didn’t start there. It started when I had to execute those thirty-something in a single conversation. You run task after task, the window fills up, and somewhere around the twentieth the agent has already forgotten the decision it made itself on the fourth.

This has a name and it’s been measured. Context rot, the drop in model quality as the context grows, was tested by Chroma across 18 models. All 18 degraded, and the degradation starts well before the window is full. The “Lost in the Middle” paper had already shown the same curve: the model loses information buried in the middle of a long context.

The patch the community adopted is to open a clean window per task: fresh context, paste the spec back in, point at the task, execute, repeat. It works against the rot. And it turns you into a copy-paste intern, thirty-something times over.

The spec was the easy part.

The three phases of whoever carries the context

The bottleneck was always the same: someone has to hold the state and consolidate the results while the tasks run. What changed is who carries that weight.

Phase 1, by hand. You are the context window. You run task by task, hit /clear, reread the spec, hold the state in your head and in the conversation. Goes fine for five tasks. By the thirtieth, you’re the bottleneck.

Phase 2, delegating. You throw execution at the subagents. It helps. Except the output of all of them comes back into the same window, the one of the main agent you’re steering, and that window is the one that becomes the consolidator and rots. Agent Teams got better with a shared task list, but the lead still steers step by step. The bottleneck moved, it didn’t disappear.

Phase 3, workflow. Here the physics changes. The plan leaves your context and becomes code. A script holds the loop and the intermediate results, and the model’s context only sees the final answer. Each task runs in an isolated window. This is where I finally stopped being the bottleneck. It’s what Claude Code’s Dynamic Workflows do: Claude itself writes a JavaScript orchestration script, and a runtime executes it in the background, with up to 16 simultaneous agents and a ceiling of a thousand per run.
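The shape of phase 3 fits in a few lines. This sketch is in Python and is NOT the Dynamic Workflows API: run_agent is a hypothetical stand-in for one agent run in an isolated context, and the task strings are invented. The script owns the loop and the intermediate results, and only a one-line report goes back to the main conversation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    ok: bool
    summary: str


def run_agent(spec: str, task: str) -> TaskResult:
    """Hypothetical stand-in for one agent run in a fresh, isolated context.

    It receives only the spec and its own task, never other tasks' output.
    """
    task_id, _, description = task.partition(": ")
    return TaskResult(task_id, ok=True, summary=f"done: {description}")


def run_workflow(spec: str, tasks: list, max_parallel: int = 4) -> str:
    """The script, not the model, holds the loop and the intermediate results."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(lambda t: run_agent(spec, t), tasks))
    failed = [r.task_id for r in results if not r.ok]
    # Only this short answer is returned to the main conversation.
    return f"{len(results) - len(failed)}/{len(results)} tasks ok, failed: {failed or 'none'}"


report = run_workflow("spec.md contents", ["T1: add schema", "T2: add endpoint", "T3: add form"])
```

The full output of each task never enters the steering window. That is the whole trick.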

Jarred Sumner, creator of Bun, took this to the extreme. He ported Bun from Zig to Rust on exactly this setup: tasks in parallel, two reviewers contesting each file. Seven hundred and fifty thousand lines of Rust, 99.8% of the test suite passing, eleven days from first commit to merge. It hasn’t gone to production yet, it’s a capability demo. But that’s the number.

Why can’t the reviewer be the author?

Because the model has self-preference bias: the tendency to defend its own output when it’s also the judge. A grader who wrote the exam is a suspect grader.

The way to kill this is structural. The reviewer runs as a separate agent, with its own context, sometimes on a different model, with the single mission of trying to knock the result down before it gets accepted. In the workflow you put one adversarial verifier per output. In the end, the agents themselves open the PRs. Whoever produces is NEVER whoever approves.
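“Structural” means the rule lives in the code path, not in a prompt asking agents to behave. A minimal sketch with hypothetical names (Output, approve, SelfReviewError): an approval from the author is rejected before anything else is looked at.

```python
from dataclasses import dataclass, field


@dataclass
class Output:
    task_id: str
    author: str
    content: str
    approvals: list = field(default_factory=list)


class SelfReviewError(Exception):
    """Raised when the author of an output tries to approve it."""


def approve(output: Output, reviewer: str, found_defects: list) -> bool:
    """Record an approval only from a non-author who found no defects."""
    if reviewer == output.author:
        raise SelfReviewError(f"{reviewer} wrote {output.task_id} and cannot review it")
    if found_defects:
        return False  # the adversarial reviewer knocked the result down
    output.approvals.append(reviewer)
    return True


patch = Output(task_id="T4", author="agent-a", content="diff ...")
accepted = approve(patch, reviewer="agent-b", found_defects=[])
rejected = approve(patch, reviewer="agent-c", found_defects=["missing null check"])
```

Calling approve with reviewer="agent-a" on the same patch raises SelfReviewError, no matter how good the diff is.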

It’s expensive, and the ROI is niche

I’ll be honest, because the part nobody posts is the cost. Dynamic Workflows is a research preview and it burns tokens without mercy. There are reports of people torching the five-hour limit in eighteen minutes, and of runs of three million tokens without a single cost warning along the way. This isn’t free scale.

So who does this pay off for?

For whoever has the seniority to review. The senior’s leverage is judgment: knowing when the AI spat out slop, correcting course, blocking the bad task. A junior on the same tool is money down the drain, because without real software engineering experience they accept whatever comes and bang their head against the final result. The ROI is glued to seniority, not to the tool.

This becomes the default the day running thirty tasks in parallel, each with its own reviewer, costs the same as running one by hand. Whoever wants to anticipate that day already makes the token bill hurt less with model routing: most of the tasks on a cheap model, the expensive one only on the plan and the review.
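The routing arithmetic, with numbers that are entirely made up: the prices, the token counts per step, and the thirty-task run below are assumptions for illustration, not measurements from any provider.

```python
# Hypothetical prices in dollars per million tokens; real prices vary.
PRICE_PER_MTOK = {"cheap": 1.0, "expensive": 15.0}

# Routing rule from the text: the expensive model only on plan and review.
EXPENSIVE_KINDS = {"plan", "review"}


def route(kind: str) -> str:
    """Pick a model tier for a step of the workflow."""
    return "expensive" if kind in EXPENSIVE_KINDS else "cheap"


def run_cost(steps: list, routed: bool) -> float:
    """Cost in dollars for (kind, tokens) steps, with or without routing."""
    total = 0.0
    for kind, tokens in steps:
        model = route(kind) if routed else "expensive"
        total += tokens / 1_000_000 * PRICE_PER_MTOK[model]
    return total


# One plan, then thirty tasks, each implemented and reviewed once.
steps = [("plan", 50_000)] + [("implement", 100_000)] * 30 + [("review", 20_000)] * 30

cost_all_expensive = run_cost(steps, routed=False)
cost_routed = run_cost(steps, routed=True)
```

With these invented figures the routed run costs about 12.75 dollars against about 54.75 for running everything on the expensive model. The ratio depends entirely on your real prices and on how much of the run is implementation.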

The tool changes, the physics is the same

The most interesting part isn’t any specific tool. It’s that everyone, starting from different places, is arriving at the same two rules.

Rule one: the project’s memory lives in the files, not in the context. ADR in the repo, project-context.md, state.json, todo.md, a versioned decision matrix. The agent doesn’t need to “remember” the decision from task four. It reads the file. The context rot disappears because you stopped stacking history in the window.

Rule two: the reviewer is never the author, by construction. Separate contexts for whoever generates and whoever validates. The validator walks in assuming there’s a bug and goes hunting.

Look at how many people arrived at this from opposite paths:

Ralph loop (Geoffrey Huntley): wraps the agent in a while, clean context on every turn, memory on disk. Monolithic on purpose. He rejects multi-agent, and even so externalizes state in exactly the same way.
Dynamic Workflows (Anthropic): the opposite of Ralph, fan-out of hundreds of agents, but the script holds the state and the adversarial reviewer is separate.
BMAD, MDDD, cstk: community frameworks that, each in its own way (ADR plus adversarial reviewer, decision matrix, waves with state.json and model routing), implement the same two rules.
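Rule one in its rawest form looks roughly like this. It is a sketch of the idea, not Geoffrey Huntley’s actual script: run_one_task is a hypothetical stand-in for an agent turn started with a clean context, and the state.json layout (todo and decisions keys) is invented.

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    return json.loads(path.read_text())


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))


def run_one_task(task: str, decisions: list) -> str:
    """Hypothetical stand-in for an agent turn started with a clean context.

    Everything it knows about earlier turns arrives through `decisions`.
    """
    return f"{task}: done with {len(decisions)} prior decisions in view"


def loop(state_path: Path) -> None:
    """Each turn: read state from disk, do one task, write state back."""
    while True:
        state = load_state(state_path)
        if not state["todo"]:
            break
        task = state["todo"].pop(0)
        state["decisions"].append(run_one_task(task, state["decisions"]))
        save_state(state_path, state)


state_file = Path(tempfile.mkdtemp()) / "state.json"
save_state(state_file, {"todo": ["T1", "T2", "T3"], "decisions": []})
loop(state_file)
final_state = load_state(state_file)
```

Kill the process between turns and nothing is lost: the next run reads the file and picks up at the first task still in todo.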

“You’re just reinventing a while loop with more steps.”

In part, yes. The Ralph loop is the rawest form of this, and it works. The difference is what you hang on top: consolidator, separate reviewer, model routing, all coded into a harness instead of in your three-in-the-morning prompt. The principle is old. The discipline of applying it is what changes the result.

The work you thought was thinking

Spec-Driven Development didn’t fail. It solved the part you could solve by writing, and exposed the part that was missing: executing without the context rotting and without you in the middle of the loop copying output from one side to the other.

The way out isn’t a tool. It’s physics: state outside the window, reviewer outside the author. Dynamic Workflows, Ralph loop, cstk, BMAD: they’re accents of the same sentence.

The work you thought was thinking was always managing context. AI didn’t change that. It just made it obvious.

Maestro + Claude Code: your app tested in the simulator like Playwright tests the web

Bruno Raphael — Mon, 01 Jun 2026 09:00:00 -0300

TL;DR

Claude Code already navigates your site on its own through Playwright: it clicks, fills, validates regressions. For mobile apps you can do the same thing, but nobody really explains how. I went digging. The answer is Maestro, an open source mobile E2E testing framework with flows written in YAML, plugged into Claude Code. A single test file runs the same on iOS and Android, on top of the compiled binary, without instrumenting the app. React Native, native, or Flutter, doesn’t matter. Claude inspects the screen, writes the flow, runs it, and fixes what breaks. And no: the right path is NOT “giving Claude access to the screen”. Screenshot by coordinate is the last resort, not the first.

This post is the setup I put together to close on mobile the gap that Playwright already closed for the web. Here at Nextside it’s not a production pipeline yet. It’s the path I’m adopting, with the engineering broken down, the commands in hand, and the numbers from people who already walked it.

It’s not “giving access to the screen”. It’s reading the tree.

When I bring this up, the same question always comes, and I asked it myself at first: “can’t Claude just look at the screen and tap, like a human?”. It can. It’s called Computer Use: Claude controls the interface through screenshots and clicks on pixel coordinates. It launched in Claude Code in March 2026, it drives the simulator, it works for a demo.

But it’s the wrong way for testing.

The Playwright you already use never looked at a single pixel. It reads the accessibility tree, the structured tree that describes “button labeled Sign in, here”. It acts by element, not by coordinate. That’s why it’s fast and doesn’t hallucinate where to click.

The difference is measurable in tokens: the accessibility tree of one screen comes in at around 10 tokens, and a screenshot of the same screen costs 1,600 to 6,300. Multiply that by every step of a twenty-screen test and you get why vision doesn’t scale in a QA loop.
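The multiplication, spelled out with the figures quoted above and a twenty-screen test, assuming one read of the screen per step:

```python
# Figures quoted in the text: ~10 tokens per accessibility tree,
# 1,600 to 6,300 tokens per screenshot of the same screen.
TREE_TOKENS = 10
SCREENSHOT_TOKENS = (1_600, 6_300)
SCREENS = 20

tree_total = TREE_TOKENS * SCREENS
screenshot_low, screenshot_high = (t * SCREENS for t in SCREENSHOT_TOKENS)

# How many times more expensive the screenshot path is, low and high end.
ratio_low = screenshot_low // tree_total
ratio_high = screenshot_high // tree_total
```

That is 200 tokens for the tree against 32,000 to 126,000 for screenshots, a factor of 160 to 630 for a single pass of the test.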

At bottom, there are three ways to make Claude touch the app, from best to worst:

MCP or CLI reading the tree. Structured, deterministic, cheap on tokens. It’s the “Playwright way”, and it’s where Maestro lives.
Computer Use through screenshots. Claude sees the screen and guesses a coordinate. It generalizes to any app, but it’s slow (2 to 5 seconds per action), misses clicks, and burns context.
Nothing. You testing everything by hand, which is what we’re leaving behind.

Anthropic itself orders it this way. The Claude Code tool hierarchy is MCP first, then shell, then Chrome, and it only falls back to screen control when nothing else reaches the target: “native apps, simulators, and tools without an API”.

Screenshot is the last resort, not the first.

Maestro: one YAML, iOS and Android, zero instrumentation

If the right way is to read the tree, I need a tool that exposes the simulator’s tree to Claude. There are several. I settled on Maestro, and for anyone maintaining both React Native and native apps, it wins on three concrete counts:

It works at the accessibility layer, on top of the compiled binary. It doesn’t matter if the app is React Native, native Swift/Kotlin, or Flutter. Maestro tests the finished APK/IPA, with no driver installed and no change to the source code. For a team running RN and native side by side, that’s the end of maintaining two testing stacks.
The same file runs on both systems. You write the flow once. It runs on the iPhone simulator and the Android emulator without rewriting a single line.
YAML that humans and machines read. It’s not code with fragile selectors. It’s a declarative sequence that Claude generates and edits on the spot.

A Maestro flow starts simple like this:

appId: com.yourcompany.app
---
- launchApp
- tapOn: { id: "login_button" }
- inputText: "user@nextside.tech"
- tapOn: "Sign in"
- assertVisible: "Welcome"

appId, three dashes, and the commands in almost natural language: launchApp, tapOn, inputText, assertVisible. Anyone who’s never seen it gets it in ten seconds.

Where it gets serious is reuse. The login repeats in every test, so you pull it out once and call it with runFlow:

# flows/login.yaml
appId: com.yourcompany.app
---
- launchApp: { clearState: true }
- tapOn: { id: "login_button" }
- inputText: "user@nextside.tech"
- tapOn: "Sign in"

# flows/checkout.yaml
appId: com.yourcompany.app
---
- runFlow: login.yaml          # reuses the whole login
- tapOn: { id: "product_42" }
- scrollUntilVisible:
    element: { text: "Checkout" }
- tapOn: "Checkout"
- assertVisible: "Order confirmed"

Change the login rule in one place and it holds across the twenty tests that call it. Notice the scrollUntilVisible and the clearState: true: Maestro has commands to scroll until an element is found, clear app state, change permissions, and set the location. It also waits for the element to show up on its own, without you scattering sleep calls across the test. Sleep is a smell of a badly written test; here you don’t need it.

Same file. iOS and Android. Without touching the app’s code.

From zero to the first test

The real “how to use it” starts before Claude. You need three things on the machine:

Java 17 or newer. Maestro’s engine runs on the JVM. Check with java -version.
Xcode and the Command Line Tools. That’s what unlocks the iOS simulator.
Android platform-tools with $ANDROID_HOME set and an emulator running. Check with adb devices.
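
If you'd rather check all of that in one go, a small preflight script can do it. This is a minimal sketch in Python: the function name is mine, and using xcrun as the probe for the Xcode Command Line Tools is a choice made for this example. It only checks that the tools are on the PATH; it does not verify the Java version or that an emulator is actually running.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional

# Command-line tools the checklist above relies on.
# `xcrun` ships with the Xcode Command Line Tools (probe chosen for this sketch).
REQUIRED_TOOLS = ("java", "xcrun", "adb")


def preflight(
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = [
        f"{tool} not found on PATH"
        for tool in REQUIRED_TOOLS
        if which(tool) is None
    ]
    if not env.get("ANDROID_HOME"):
        problems.append("ANDROID_HOME is not set")
    return problems
```

Calling preflight() with no arguments checks the real machine; the which and env parameters exist only so the function can be exercised without touching it.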

With that in place, install Maestro in one command:

curl -fsSL "https://get.maestro.mobile.dev" | bash
# or, on macOS, via Homebrew:
# brew install mobile-dev-inc/tap/maestro
maestro --help   # confirms it's alive

Boot a simulator (or emulator), install your app on it, and run the flow:

maestro test flows/checkout.yaml     # one flow
maestro test flows/                  # the whole folder

That alone already gives you E2E tests running locally, with no AI at all. AI comes in so you stop writing these YAMLs by hand.

The loop in practice: Claude writes the test looking at the app

Connect Maestro to Claude Code in one command:

claude mcp add maestro -- maestro mcp

That hands Claude a handful of tools: inspect_screen (grabs the screen’s view hierarchy as compact JSON), run (executes a flow), and open_maestro_viewer (embeds the simulator in a window where you watch each command run in real time).

The loop this unlocks changes the game:

Claude inspects the screen live. It reads the tree, it doesn’t guess.
Claude writes the flow YAML, without you hunting for element IDs by hand.
Claude runs it on the simulator.
Claude diagnoses what failed by looking at the hierarchy, and fixes the test itself.

Step 4 is the one that saves the most sanity. When a tapOn: "Sign in" breaks because the button became “Log in” in a refactor, the manual flow is: test fails in CI, someone opens it, finds out, fixes the selector, pushes again. With the loop, Claude rereads the hierarchy, sees the label changed, switches to the stable id, and shows you the diff. You approve it or not. Maestro calls this self-healing. It’s test maintenance, the most tedious part of QA, coming off your back.

In React Native, what makes this loop reliable is the testID. The one you already put on your components becomes Maestro’s id directly:

<Button title="Sign in" testID="login_button" onPress={onLogin} />

Prefer testID over text, always. Text changes with translation and with copy revisions. The testID only changes if you change it on purpose. And when you don’t know which selector exists on a screen, maestro studio opens a visual inspector in the browser: you click the element, it shows the available selectors and generates the YAML for the step. That’s how you teach Claude to aim at the right places in your app.

MCP or Skill+CLI: which to use?

Both work. The choice is about context. The MCP is plug-and-play: one command and Claude has the tools. The price is that every MCP loads the tools’ schema into the model’s context, and that eats tokens every session.

The alternative is a Skill that teaches Claude to run maestro test flow.yaml straight in the terminal. Leaner, because you don’t pay the server overhead. The community itself is migrating from MCP to Skill+CLI for this reason. My rule: I start on the MCP to explore and prototype fast. Once the flow becomes routine, I wrap it in a Skill with the CLI and drop the server.

The iOS toll (the part nobody posts)

Now the honest part, because selling this as magic is a disservice.

First: AI-generated tests get it right 70 to 80% of the time on the first pass. Claude picks the wrong selector, forgets a wait. The flow that works is letting the AI generate v1, running it once to validate, and handing maintenance back to it. It’s not “send it and forget it”.

Second, and this one weighs on anyone in mobile: iOS charges a toll. A dev documented setting up the same QA on both platforms: Android took 90 minutes, iOS went past six hours. His line sums up a whole decade of mobile automation: “Android hands you a WebSocket and says: here’s the app, do whatever you want. iOS hands you a locked door and a note asking you to use Xcode.”

The good news is that Maestro abstracts away a good chunk of that toll: it’s the same tapOn on both. But there are two stones you’ll still trip on in React Native:

Nested component on iOS. iOS “swallows” the tap when you have a Text inside a TouchableOpacity inside another tappable container. The fix is accessible={false} on the outer container and accessible={true} on the inner element. It’s annoying, but it’s once per component.
Expo Go doesn’t accept launchApp. When you run through Expo Go, the app lives inside the Expo container, so launchApp with your appId won’t find it. You have to use openLink with the dev URL, or make a real development build (EAS). On bare React Native, launchApp works normally.

“You’re going to let a bot write and run the app’s tests? This is going to go wrong.”

It’ll go wrong if you treat the generated test as truth and walk away. It won’t if you treat it as a draft the senior reviews, just like you already do (or should do) with code the AI writes. Maestro still hands you the versioned YAML: you can read it in the PR, disagree, fix it. The test is still yours. Claude just stopped making you type it from scratch.

From a loose test to a routine

A test you run by hand when you remember isn’t a safety net. It’s theater. The real gain shows up when the flow becomes an automatic routine. Since Maestro is just a command-line binary, it goes anywhere that runs a shell:

maestro test flows/    # runs the whole suite; exits with an error code if it breaks

That maestro test flows/ is the same line you run locally, in GitHub Actions on every PR, or in a nightly cron. The dev from the iOS case above left the suite running as a scheduled task every morning at 8:47: it boots both simulators, sweeps the screens, analyzes, and files a report on whatever looks broken. The dev wakes up with QA already done.
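
What makes the same line work in all three places is the exit code. As a minimal sketch, assuming you want to wrap the call in a Python script for a scheduled job, the wrapper only has to run the command and hand the exit code back. The function name and the injectable runner are mine, not part of Maestro.

```python
import subprocess
from typing import Callable, Sequence

# The command from the text; "flows/" is the folder used in the examples above.
SUITE_COMMAND = ("maestro", "test", "flows/")


def run_suite(
    command: Sequence[str] = SUITE_COMMAND,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run the suite and return its exit code (0 means nothing broke)."""
    completed = runner(list(command))
    return completed.returncode
```

A nightly script would end with sys.exit(run_suite()), so the scheduler or CI job fails exactly when the suite does.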

The cycle closes here. Claude writes the flow looking at the app, the flow becomes a versioned file, the file runs in CI. The AI builds the net, the machine pulls it every night.

The AI writes the code and the test. You still decide what “works” means.

We already talked here about the AI reviewing code but not testing software. Still true, with a new asterisk: now it TESTS, in the simulator, navigating the app like a user would. What it doesn’t do is decide what counts as “worked”.

That judgment is yours. The acceptance criteria are yours. Maestro and Claude take the tedious part off your hands: booting the simulator, hunting for the button’s ID, typing the flow, running on both systems, fixing the selector that changed. They give back the time for the one thing the machine doesn’t do: looking at the app and deciding if it’s good.

A good tool doesn’t replace judgment. It just removes the excuse of not having tested.

Code review became the bottleneck. CodeRabbit won't save you alone

Pablo Winter — Mon, 25 May 2026 09:00:00 -0300

TL;DR

AI sped up dev. The bottleneck moved to review. I saw a consulting team with 2 weeks of PR backlog waiting on the tech lead, and the team thinking the fix was hiring another senior. It isn’t. Dev cadence changed, review cadence didn’t. CodeRabbit can clear that queue and make PRs into the develop branch 100% autonomous after about a month of calibration. It works. But there’s a catch: the team starts trusting the pipeline so much they drop the reflex to test locally. And then the deploy breaks on staging because of a bug nobody saw running.

This post is about both sides.

AI didn’t eliminate the bottleneck. It pushed it to the tech lead.

Look at the number: code co-authored with AI generates 1.7x more issues per PR than 100% human code. Source is CodeRabbit’s own State of AI Code Generation Report, analyzing 470 PRs from open source projects in December 2025. The finding is consistent with what any tech lead who adopted Cursor or Claude Code on the team is seeing in practice.

It tracks: the dev produces more code, faster, and not always with the same context load they had when writing it all by hand. More code + less context = more stuff to review and less automatic confidence that the author knows what they’re doing.

Look at the effect on the team:

The tech lead becomes a funnel. I worked with a consulting team where the PR review backlog hit 2 weeks. The senior in charge was waking up at 6am to review before standup, staying after hours to review before bed, and the queue still grew. The team thought it was understaffing.

It wasn’t. Code review (the step where another human validates the PR before merge) became the new bottleneck in the delivery pipeline. The individual dev got faster. The collective process didn’t.

The bottleneck just keeps moving.

The month the tech lead trained the bot

The move was rolling out CodeRabbit (an AI code review bot that comments line by line on every PR) with the tech lead piloting it for a full month. It wasn’t “install it and let everyone loose”. It was:

CodeRabbit comments on the PR
Tech lead reviews on top: confirms what’s right, pushes back on what’s wrong
When they push back, they go into .coderabbit.yaml and add a rule for next time
When CodeRabbit misses something important, they go into .coderabbit.yaml and add a path instruction: a review instruction written in natural language paired with a file glob
Repeat

Within two weeks the number of rules the tech lead added per day dropped. Within three weeks CodeRabbit was getting more right than wrong. By the end of the first month the curve flattened: a new rule became the exception.

The unlock was wiring up two things CodeRabbit doesn’t catch on its own:

Notion via MCP: every ADR and architectural decision the team makes lives in Notion. Wire CodeRabbit into Notion via MCP and it reads the context before reviewing. No more “this should use pattern X” comments when the ADR says use Y.
JIRA in the PR description: every PR is required to cite the JIRA issue ID. CodeRabbit pulls the user story (US) and cross-references it against the diff.

The second one changes the game more than it sounds.

Why does requiring a JIRA ID in the PR description change the game?

Because CodeRabbit stops reviewing just code and starts reviewing whether the PR delivers the story. Are the acceptance criteria in the US? Then the bot checks each AC against the diff and flags: “AC #3 mentions duplicate email validation, but I don’t see that check in the PR”. This isn’t opinion: it’s a checklist.

Except there’s a prerequisite few teams want to face: the US has to be properly sized with decent AC. I see team after team failing exactly there. PO ships a giant, vague US, with AC like “validate form”. CodeRabbit reads that and can’t do anything with it. Then people decide the tool is useless. It isn’t. The refinement is.

Without proper AC, CodeRabbit is just a ruler with no markings.

Today, PRs to staging don’t go through a human anymore

After that month of calibration, here’s what changed in the flow:

PR to develop (staging): after N iterations between dev and CodeRabbit, the bot approves the PR on its own. Zero humans. Merge.
PR to master (production): still goes through a human. Always.

“You let AI approve code on its own. This will go bad.”

That’s the comment that shows up every time I tell this story. Usually from someone who’s never watched a tech lead burn 8 hours a day on PR review instead of doing architecture. Yes, we do. On staging. Where the worst case is the deploy breaks and we roll back. Not production. Staging.

And the practical difference: the tech lead is back to doing architecture. The team ships more. The dev gets CodeRabbit feedback in minutes instead of days.

CodeRabbit vs GitHub Copilot Code Review vs Greptile: which one?

Short answer: depends on what hurts most.

CodeRabbit: line-by-line, persistent learnings, strong integrations (MCP, JIRA, Notion). Trade-off: ~3min per review and $24/dev/month. Wins on depth and on fitting into the workflow.
GitHub Copilot Code Review: $10/user/month, zero friction because the team already pays for Copilot. Shallower review, no persistent learnings, no native Jira/Notion integration. Good place to start.
Greptile: their own benchmark says 82% catch rate vs CodeRabbit’s 44%, but it generates 11 false positives against CodeRabbit’s 2. Pick your pain: miss a bug or drown the dev in noise.

Small team already paying for Copilot: start with Copilot Code Review and see how far it gets you. Tech lead drowning in review backlog: CodeRabbit pays for itself in the first month.

And to be honest: an independent audit of 28 PRs reviewed by CodeRabbit found 15% of comments were “useless/noise” and 21% were nitpicking. It’s not a silver bullet. You have to tune it. You have to teach it. You have to use the learnings. Whoever installs and walks away will complain it’s bad. Because for that usage, it is.

One file, three brains: CLAUDE.md as single source

This is the trick few people have caught onto yet.

CodeRabbit auto-detects CLAUDE.md, AGENTS.md, .cursor/rules/*.mdc, and .github/copilot-instructions.md as knowledge base. The rule you write once in CLAUDE.md applies to:

Claude Code while coding: follows the rule when writing
CodeRabbit while reviewing: checks the diff against the same rule
Cursor on autocomplete: respects the convention

One file, three brains reading. You stop maintaining the same rule duplicated across three systems. The PR that goes up is already almost approved because it was written under the same rules being checked at review.

And there’s another piece that closes the loop: CodeRabbit’s CLI (coderabbit --prompt-only) spits out the review feedback in a format an agent can consume. You can build a slash command in Claude Code that resolves the comments in a loop and keeps pushing back until the bot approves.

Save this as .claude/commands/coderabbit-loop.md in the repo and use /coderabbit-loop in Claude Code:

Resolve CodeRabbit comments on the current PR until approval.

BEFORE accepting any suggestion, invoke the `receiving-code-review` skill
from the superpowers plugin. Without it, you become the bot's doormat.

Flow:
1. Run `coderabbit --prompt-only` and capture the comments
2. For each comment:
   - If it makes technical sense: apply the change and commit with a
     message tied to the comment ("addresses CodeRabbit: ")
   - If it does NOT make sense: reply on the PR with technical
     justification and mark as wontfix via `@coderabbitai resolve`
3. `git push` on the branch
4. Wait for re-review (poll the PR via `gh pr view` every 60s, max 5min)
5. If there are still new unresolved comments, go back to step 2
6. Stop when CodeRabbit approves OR after 5 iterations
   (at that point, call the human: there's likely real disagreement)

Use `gh pr view --comments` for status. Use `gh pr comment` to reply.
Never `git push --force`: always incremental commits.

The receiving-code-review line isn’t a detail. It’s the point.

Without it, Claude Code accepts any CodeRabbit suggestion in “performative agreement” mode: agrees to look polite, refactors code that was fine, and the PR grows with changes that shouldn’t exist. The receiving-code-review skill from the superpowers plugin forces technical rigor: validate the suggestion before applying, push back when you disagree, demand evidence. It’s the filter that keeps the dev in charge, even when the dev is an AI.
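
The stop conditions in the slash command (approval, or five iterations and then a human) are easier to reason about as plain control flow. Here is a minimal Python sketch of that loop. The fetch_comments and resolve callables are hypothetical stand-ins for the CLI and git steps, and treating "no comments left" as approval is a simplification I'm assuming for the example.

```python
from typing import Callable, List


def review_loop(
    fetch_comments: Callable[[], List[str]],
    resolve: Callable[[str], None],
    max_iterations: int = 5,
) -> str:
    """Return 'approved' when a round comes back clean, else 'escalate'."""
    for _ in range(max_iterations):
        comments = fetch_comments()
        if not comments:
            return "approved"  # assumption: no open comments means approval
        for comment in comments:
            resolve(comment)  # apply the change or push back with a reason
    return "escalate"  # iteration budget spent: call the human
```

The hard cap matters more than it looks: without it, an agent and a bot that genuinely disagree will keep trading commits indefinitely.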

Where the pipeline breaks: the dev stopped testing locally

This is the part nobody posts on LinkedIn.

A team with the full stack (Claude Code + Superpowers + CodeRabbit) starts trusting the pipeline too much. The dev thinks if it passed CodeRabbit, it’s fine. The tech lead thinks if CodeRabbit approved, it was reviewed. The QA thinks if it reached staging, it was tested.

Result: NOBODY runs anything locally before pushing. I’ve seen this happen in three different teams. Symptom always the same: PR merged to develop, deploy to staging, and then they discover the feature doesn’t work because nobody opened the browser to confirm the button actually clicks.

AI reviews code. AI doesn’t test software.

The fix I adopted as non-negotiable: mandatory workflow with an E2E validation command before the push. In my case it’s a /validar-e2e that spins up the project’s Docker stack, fires 3 agents in parallel (QA matrix, backend via curl/SQL, frontend via MCP Playwright in Claude Code) and only releases the push when every scenario passes. Re-runs everything after any fix, never partial validation.

Here’s the skeleton to adapt to your project. Save as .claude/commands/validar-e2e.md:

Orchestrated E2E validation before requesting human review.

Spin up the local stack, generate the scenario matrix, and ONLY THEN
fire backend + frontend in parallel with the matrix as input. DO NOT
stop on partial. After any fix, RE-RUN EVERYTHING, not just what changed.

QUALITY RULE: if an agent delivers a shallow result, with no concrete
evidence (no log/SQL/screenshot), with scenarios skipped without
justification, or clearly incomplete: RELAUNCH the agent with a more
explicit briefing about what was missing. Accepting bad output
contaminates the merge decision.

## Phase 1: Bring up / validate the stack

- `docker compose -f docker-compose.e2e.yml up -d`
- Wait for health checks to return 200 (5min timeout)
- If any service failed, report the container log and stop

## Phase 2: Agent A: QA matrix (BLOCKING, runs alone)

Launch ONE agent and WAIT for the full output before moving on.
Agents B and C depend on this matrix: without it, they test blind.

Agent A briefing:
  Produce a matrix with ≥20 scenarios based on the commits on this
  branch vs develop. Categories: happy path, regression, edge cases
  (null/empty/limits), error (DB unavailable, auth failure), migration
  (idempotence). For each: ID, description, severity (P0/P1/P2),
  steps, expected result. Save to
  `docs/specs/-qa-matrix.md`. Report count per category +
  the 3 priority P0 scenarios + the critical UI flows to be covered
  by frontend.

## Phase 3: Agents B and C in parallel (fed by the matrix)

RULE: ONE message with 2 simultaneous Agent tool calls. Paste the 3
P0 scenarios (Agent A output) into B's briefing and the critical UI
flows into C's briefing. Limit ≤80 tool calls per agent (above that
you get socket errors: relaunch with smaller scope if it crashes).

### Agent B: Backend E2E
Run the P0 scenarios below via curl against the local stack. Validate
the DB after each call (psql/mongosh/redis-cli depending on stack).
Also run unit tests for the changed branches. Report PASS/FAIL with
evidence (1-2 lines of log/SQL) in ≤600 words. Don't rebuild Docker,
don't touch production code.

  QA P0 scenarios: 

### Agent C: Frontend MCP Playwright
Run the critical UI flows below in the browser via MCP Playwright.
For each: screenshot of the state, console inspection (JS errors),
network request validation. Report regressions in ≤700 words with
screenshots.

  QA critical flows: 

## Phase 4: Consolidate

- B and C both PASS with evidence → release `git push` and open the PR
- Any FAIL → fix the code and GO BACK to Phase 3 (re-run B and C with
  the same matrix; only re-run Agent A if the fix changed scenarios)
- BLOCKED → diagnose infra/context before trying again
- Socket error on an agent → relaunch with reduced scope (≤50 tool calls)
- Shallow result / no evidence → RELAUNCH the agent with a reinforced
  briefing demanding exactly what was missing (logs, SQL queries,
  screenshots, specific assertions). Don't accept PASS without proof.

And it’s not just bureaucratic friction: it’s how you keep the reflex. Whoever tests locally catches the bug in 30 seconds. Whoever waits for staging catches it in 30 minutes. Whoever waits for production pays much more.
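
Phase 4 of the command above is really a small decision table, and writing it out makes the "no PASS without proof" rule explicit. This is a sketch in Python under my own assumptions: the result shape (a status plus an evidence flag per agent) is hypothetical, the precedence between the cases is my reading of the list, and the socket-error case is left out.

```python
from typing import Mapping


def consolidate(results: Mapping[str, Mapping[str, object]]) -> str:
    """Map agent results to the next action from Phase 4.

    Each result looks like {"status": "PASS" | "FAIL" | "BLOCKED", "evidence": bool}.
    """
    statuses = [result["status"] for result in results.values()]
    if "BLOCKED" in statuses:
        return "diagnose infra/context"
    if "FAIL" in statuses:
        return "fix and re-run B and C"
    if any(not result["evidence"] for result in results.values()):
        return "relaunch with reinforced briefing"  # PASS without proof doesn't count
    return "git push and open PR"
```

Note that a PASS with no evidence never reaches the last line: it is treated as an unfinished run, not as a green light.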

The pipeline is yours. The AI is just the engine.

CodeRabbit + Claude Code + Superpowers is a stack. A good stack. It removes a real bottleneck. It gives the tech lead time back for architecture, zeroes out the review backlog, and PRs come out rounder because there is a single rule set.

But it’s a stack. Not a process.

Process is the discipline of well-scoped US, well-written AC, mandatory local testing, and the humility to accept that AI speeds up what’s right and speeds up what’s wrong right alongside it.

Anyone who confuses stack with process is going to find out the hard way. Probably on a Friday deploy.

MCP Playwright: real-quality local validation before the PR

Pablo Winter — Sat, 16 May 2026 12:00:00 -0300

Recurring scenario: you finish a frontend feature, run git diff, everything looks fine, you commit. Five minutes later someone opens a PR and says “the button disappeared on mobile.” Welcome to the visual regression hole. Question: can you catch this before the PR? Short answer: yes. And the cheapest path today runs through MCP + Playwright.

TL;DR: MCP Playwright is not a new testing framework. It doesn’t replace CI/CD. It doesn’t replace the E2E suite your engineer wrote. It’s your way of asking Claude to test locally for you, and hand you screenshots as proof.

The dev flow has always been: code, write unit tests, run the app locally and click through it manually, open the PR. The “run the app and click through it” step was the one that got skipped most. “Unit passed, ship it to CI.” Then a bug hits production that CI didn’t catch because CI doesn’t cover every possible path. With MCP Playwright, that step is no longer yours. It becomes the AI navigating your app, validating the flow, taking a screenshot of each relevant state. You gain time. The PR gains evidence. CI keeps doing its job.

What MCP is, without the jargon

MCP (Model Context Protocol) is an open protocol created by Anthropic to connect LLMs to external tools. Think of it as USB for AI: one standard, plug in, and any compatible LLM talks to any “MCP server” on the market.

Before MCP, integrating AI with external tools was artisanal. Each client (Claude Code, Cursor, Continue) had its own way of invoking tools. Each tool needed a specific adapter. Chaos.

MCP standardizes that. There are three pieces:

Client: the app where the AI runs (Claude Code, Claude Desktop, etc.)
MCP server: a separate process that exposes tools via protocol. Can run locally, remote, in containers, anywhere.
Tools/Resources: what the server exposes. “Navigate to URL X”, “read this file”, “execute this query.”

The client asks the server what it offers. The server responds with a list of tools. The AI picks a tool, sends parameters, the server executes, responds. Simple. Standardized. Universal.
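
On the wire, that exchange is JSON-RPC 2.0. The sketch below builds the two requests a client sends: one to list the tools, one to call a tool. It is a simplified illustration in Python that skips the initialization handshake, and the tool name and argument in the usage lines are hypothetical, not taken from any particular server.

```python
import json


def list_tools_request(request_id: int) -> str:
    """The client asks the server which tools it offers."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/list"})


def call_tool_request(request_id: int, name: str, arguments: dict) -> str:
    """The client invokes one tool, by name, with its parameters."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


# Hypothetical tool name and argument, for illustration only.
print(list_tools_request(1))
print(call_tool_request(2, "navigate", {"url": "http://localhost:3000"}))
```

Every server answers the same two methods, which is why a client that speaks them once can talk to any of them.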

There’s an MCP server for practically everything today: GitHub, Linear, Notion, Postgres, browser via Playwright, filesystem, Slack. You plug in what you need. The AI then operates those tools as if they were native extensions of the client itself.

Playwright as an MCP server: why it matters

Playwright is Microsoft’s browser automation stack. Headless or not. Cross-browser (Chromium, Firefox, WebKit). Consistent API, performant, excellent DX. What Selenium always wanted to be and never managed.

When someone packages Playwright as an MCP server, the following happens: Claude gets eyes in the browser. Literally. It can:

Open a page at a URL
Take screenshots
Read the DOM via accessibility snapshot
Click an element
Fill a form
Wait for an element to appear
Check the console for errors
Inspect network requests
Execute arbitrary JavaScript in the page context

All of this through commands the LLM picks based on context. You don’t need to write a test spec. You describe in natural language (“validate that the post card opens correctly on mobile at 375px”) and Claude assembles the sequence: navigate, resize viewport, click, wait, screenshot, verify.

For those who haven’t used it: it feels like magic. For those who have: it becomes a habit in 3 days.

“But isn’t this just another Playwright wrapper?” No. A wrapper requires you to write code. MCP Playwright lets the AI choose the right step based on the task context. The difference isn’t technical: it’s abstraction. You move away from “how” and stay at “what.”

Real flow: validating post UX before committing

To illustrate, here’s the flow the Nextside team uses on this blog. Every time a new post comes out of the reviewer agent (defined via Claude Code superpowers) ready to commit, a dedicated UX review agent launches using MCP Playwright. Sequence:

Start Hugo locally: hugo server -D --port 1313
Launch the agent: describe the task: “validate post X in light/dark and at mobile 375px/desktop 1280px”
Claude navigates via MCP Playwright: opens localhost:1313/posts/.../{slug}/, waits for load, takes screenshot
Inspect console: checks for JS errors, font warnings, or broken image notices
Toggle dark mode: clicks the theme toggle, waits for transition, takes screenshot
Resize to mobile: resizes viewport to 375px, screenshot
Reports: markdown with embedded screenshots + checklist (✓ contrast, ✓ typography, ⚠ long code overflows on mobile, ✓ ember glow only in CTA)

Total time: 30 to 90 seconds. Cost: zero extra infra. Output: a report I, as a human, read in 2 minutes and decide whether to commit or fix.

Compare with the old flow:

Open manually in Chrome: 15s
Open DevTools, simulate mobile: 20s
Check dark mode: 10s
Check console: 10s
Forget to test at least one combination at least once a week: guaranteed

And here’s the real gain. It’s not speed: it’s consistency. Claude doesn’t forget to test dark mode. Doesn’t skip mobile in a rush. Doesn’t say “oh, I’ll check the console later.” Every time it runs, it runs everything.
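
"Everything" here is a small, enumerable set, which is exactly why a human drops one item and a machine doesn't. As a rough sketch in Python, using the two themes and two viewport widths from the flow above (the names and the dictionary shape are mine), the full matrix is just a product:

```python
from itertools import product

# Themes and viewport widths taken from the validation flow described above.
THEMES = ("light", "dark")
VIEWPORT_WIDTHS = {"mobile": 375, "desktop": 1280}


def validation_matrix() -> list:
    """Every theme/viewport combination the review should cover."""
    return [
        {"theme": theme, "viewport": name, "width": width}
        for theme, (name, width) in product(THEMES, VIEWPORT_WIDTHS.items())
    ]


for combo in validation_matrix():
    print(combo)
```

Four combinations is not many, but it is precisely the kind of list where a tired human tests three.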

Automated discipline beats tired human discipline.

Before vs after: what changes in the dev flow

Look at the traditional flow. What we’ve always done:

Code the feature
Write unit tests
Run the app locally and click through it: navigate, click, validate visually
Open the PR
CI runs full Playwright + unit suite
Human reviewer looks at the code

Step 3 is where time evaporates. And it’s the most skipped: “unit passed, ship to CI.” Then a bug hits production that CI didn’t catch because CI doesn’t cover every possible path.

With MCP Playwright, step 3 becomes:

3. Ask Claude to test it: “validate the checkout flow with a coupon on localhost:3000, give me screenshots of each step”

And Claude opens the browser via MCP, navigates, fills in, clicks, verifies, takes a screenshot of each state, reports console errors if any. You get back: “it worked. Evidence in /tmp/checkout-*.png.” You attach the screenshots in the PR. The human reviewer opens the PR with visual proof already in hand. CI keeps running the full suite, that doesn’t change. What changes is your manual test step before the PR.

So this doesn’t replace my E2E tests?

No. And it shouldn’t. Your traditional E2E runs in CI without needing AI, works fine, validates regression deterministically. That’s work the engineer writes once and runs a thousand times. MCP Playwright is different: it’s your local exploratory testing, automated by AI, with visual proof. It’s the step you used to do by clicking, now delegated.

Concrete scenario: PO writes, AI validates

Look at how this turns into a real flow. Thursday morning, the PO writes a Gherkin scenario in Notion:

Feature: Checkout with discount coupon
  As a customer
  I want to apply a coupon at checkout
  So I can pay less for the order

  Scenario: Valid coupon applies discount
    Given I am on the checkout page
    And my cart has 2 items totaling R$ 200
    When I enter the coupon "NEXTSIDE10" in the discount field
    And I click "Apply"
    Then the total should drop to R$ 180
    And a message "Coupon applied: 10% off" should appear
    And the "Place order" button should remain enabled

The dev opens the terminal, and instead of running the app and clicking through each step herself to confirm the scenario passes (that manual pre-PR click-through everyone skips), she hands it to the AI:

Validate the Gherkin scenario below against the app running at http://localhost:3000.
Use MCP Playwright. Report each Then with ✅ or ❌ + a screenshot 
when something fails. Don't fix the code: just audit.

The AI with MCP Playwright:

Opens the browser at http://localhost:3000/checkout
Confirms it’s on the checkout page (networkidle + Checkout visible)
Reads the DOM and confirms 2 items in the cart totaling R$ 200
Fills the “coupon” field with NEXTSIDE10
Clicks the “Apply” button
Waits for DOM change (expect(total).toContainText('180'))
Checks visibility of the “Coupon applied: 10% off” message
Verifies the “Place order” button remains enabled

Reports back:

✅ Scenario: Valid coupon applies discount
  ✅ Given: on the checkout page (h1 visible, correct URL)
  ✅ And: 2 items, total R$ 200 (read from .cart-total)
  ✅ When: coupon NEXTSIDE10 applied
  ✅ Then: total updated to R$ 180
  ✅ And: success message visible
  ❌ And: "Place order" button is DISABLED

Final state screenshot: /tmp/checkout-disabled-btn.png
Suspicion: regression in coupon-success-handler that set disabled=true 
by mistake after applying the discount.

Total time: 35 seconds. No E2E test written, no stub, no mock. It validated against the real app, on your localhost, before the PR went to review.

But doesn’t this replace real Playwright in CI/CD?

It doesn’t. CI/CD keeps running the full suite on every PR. This flow is the pre-flight: before you open the PR, before CI spends 6min, before the human reviewer opens a tab to look, you already know whether the PO’s scenario passes or fails. The regression above (button DISABLED by mistake) is exactly the kind of bug that hits production 2 sprints later because nobody tested that manual path.

The PO’s Gherkin became executable input. Acceptance documentation became running acceptance test. Without anyone writing test code.
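
To make "executable input" concrete, here is a minimal sketch of the first thing any runner (human or AI) does with that scenario: turn it into an ordered list of steps, with each "And" inheriting the kind of the step before it. This is not how MCP Playwright or Claude parses anything internally; `Step` and `parse_steps` are names I made up for illustration, and only Python's standard library is assumed.

```python
from dataclasses import dataclass

PRIMARY = ("Given", "When", "Then")


@dataclass(frozen=True)
class Step:
    kind: str  # "Given", "When" or "Then" ("And"/"But" already resolved)
    text: str


def parse_steps(scenario: str) -> list[Step]:
    """Extract the steps of one Gherkin scenario, resolving And/But to the previous keyword."""
    steps: list[Step] = []
    current = None
    for raw in scenario.splitlines():
        keyword, _, rest = raw.strip().partition(" ")
        if keyword in PRIMARY:
            current = keyword
        elif keyword in ("And", "But") and current is not None:
            pass  # inherits the kind of the previous step
        else:
            continue  # Feature/Scenario headers, narrative lines, blanks
        steps.append(Step(current, rest.strip()))
    return steps


SCENARIO = """\
Scenario: Valid coupon applies discount
  Given I am on the checkout page
  And my cart has 2 items totaling R$ 200
  When I enter the coupon "NEXTSIDE10" in the discount field
  And I click "Apply"
  Then the total should drop to R$ 180
  And a message "Coupon applied: 10% off" should appear
  And the "Place order" button should remain enabled
"""

if __name__ == "__main__":
    for step in parse_steps(SCENARIO):
        print(f"{step.kind:<5} {step.text}")
```

The three steps that resolve to "Then" are the acceptance criteria. Those are the lines the AI has to come back with a verdict on.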

What changes vs traditional E2E testing

Here’s an important point so you don’t get confused. MCP Playwright doesn’t replace your E2E suite in CI. ABSOLUTELY NOT. They solve different things, and the confusion usually starts because the name “Playwright” shows up in both.

Traditional E2E is what the engineer writes in code, versions in the repository, and CI runs automatically on every PR. That doesn’t change. That’s still there.

MCP Playwright is step 3 in the dev flow: that manual click-through you used to do (or skip) before opening the PR. Except now the AI does it in your place.

Traditional E2E (Playwright spec running in CI):

Runs automatically on every PR: blocks the merge if it fails
Specified in code: explicit assertion, versioned, reviewed
Covers the full regression suite: doesn’t depend on you remembering
Slow: minutes per execution, requires CI infra

MCP Playwright in Claude locally:

Runs when you ask: blocks nothing by default
Specified in natural language: flexible but not versioned
Covers what you describe in the moment: depends on the instruction
Fast: seconds per execution, zero infra

Ideal use case: MCP Playwright is for the first validation layer, BEFORE you ask for human review. It’s the sanity check you’d do with your own hands, automated. It’s not the CI safety net. It’s the pre-flight check.

A real E2E suite is still needed for:

Blocking regression on PR
Critical coverage of payment flows, auth, etc.
Executable documentation of expected behavior

MCP Playwright is needed for:

Quick sanity check during development
Visual validation of a feature under active change
“Did I break anything?” before asking for review

They’re complementary, not rivals. Whoever replaces their E2E suite with MCP Playwright will miss it the moment a big refactor happens and nothing breaks in CI but everything breaks in production.

Limits and pitfalls

Hold on. There are pitfalls:

Not deterministic like code-based tests: you describe “validate the card,” Claude interprets. Two runs may check slightly different things. For sanity check: fine. For blocking regression: no.
Token cost: each screenshot Claude consumes becomes input. In a long session, that adds up. Curate what you send it to inspect.
Silent failures: if Claude didn’t see something, it doesn’t report it. False negative. You need to instruct it well on what to look for.
MCP server setup: installing the local MCP server, configuring it in Claude Code, making sure the browser is available. The first time takes effort. After that, you forget it’s there.
Local-only: MCP Playwright in Claude Code runs on your machine. Not a solution for QA in a shared environment. For that, you still need traditional Playwright in CI.
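
The silent-failure pitfall is the one you can partly guard against with a few lines of code. A rough sketch, assuming the report keeps the ✅/❌ plus keyword format shown earlier: count the Then-outcomes in the scenario, count the verdicts in the report, and complain if any outcome came back without one. The function names are hypothetical, and it only catches omissions, not a wrong ✅.

```python
import re

STEP_LINE = re.compile(r"^\s*(Given|When|Then|And|But)\b")
REPORT_LINE = re.compile(r"^\s*(✅|❌)\s+(Given|When|Then|And|But):")


def count_outcomes(keywords: list[str]) -> int:
    """Count Then steps, including the And/But steps that follow a Then."""
    current, total = None, 0
    for kw in keywords:
        if kw not in ("And", "But"):
            current = kw
        if current == "Then":
            total += 1
    return total


def unreported_outcomes(scenario: str, report: str) -> int:
    """How many Then-outcomes in the scenario have no verdict in the report."""
    expected = [m.group(1) for line in scenario.splitlines() if (m := STEP_LINE.match(line))]
    reported = [m.group(2) for line in report.splitlines() if (m := REPORT_LINE.match(line))]
    return max(0, count_outcomes(expected) - count_outcomes(reported))


SCENARIO = """\
Scenario: Valid coupon applies discount
  Given I am on the checkout page
  When I click "Apply"
  Then the total should drop to R$ 180
  And a message "Coupon applied: 10% off" should appear
  And the "Place order" button should remain enabled
"""

REPORT = """\
✅ Scenario: Valid coupon applies discount
  ✅ Given: on the checkout page (h1 visible, correct URL)
  ✅ When: coupon NEXTSIDE10 applied
  ✅ Then: total updated to R$ 180
  ✅ And: success message visible
  ❌ And: "Place order" button is DISABLED
"""

if __name__ == "__main__":
    print(unreported_outcomes(SCENARIO, REPORT))
```

If the number is above zero, the AI skipped something. Ask again before trusting the green.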

And there’s a culture pitfall: devs get lazy writing real tests because “Claude tests for me.” That’s a trap. MCP Playwright complements testing, it doesn’t replace it. Whoever uses it as a substitute will learn the hard way: when a critical feature breaks in production with no test covering it.

“But if MCP Playwright is so good, why do we need CI?” Because CI blocks what humans forget. MCP Playwright only runs when you ask. CI runs always. CI is the insurance, MCP is the pre-flight. Remove the insurance, and the first crash will remind you.

What this says about the future of local QA

Here’s what matters.

For a long time, local frontend validation was bad. You opened the browser, opened DevTools, remembered (or didn’t) to test mobile, remembered (or didn’t) to test dark mode, remembered (or didn’t) to check the console. Every time. Manually. Getting tired.

Result: visual bugs became production bugs. Not because the dev is bad, but because the human brain is not a reliable checklist machine after 4 hours of pair programming.

MCP Playwright changes that because it lets the checklist become code that another entity, the AI, runs for you. You never forget to test dark mode again. You never commit without seeing the console. Not because you got better, but because the process now runs itself. It’s the same logic we apply to documenting technical decisions in ADRs in Notion: take it out of human memory, put it in a format that survives fatigue.

That’s what excites me most about MCP in general: it’s the first time I’ve seen AI automation of tedious tasks deliver REAL results, not promises. Playwright is just the most mature example. There will be MCP servers for everything you hate doing but have to do. And when you can evaluate new tech in 2 weeks instead of buying the whole idea, Discovery is the right format: you don’t need to bet 6 months to know if MCP fits your pipeline.

And the team that adopts it first will gain consistency that the team that doesn’t adopt will never be able to replicate through sheer willpower.

That’s why the Nextside team runs MCP Playwright in every UX review agent. Not as an AI gimmick. As a way of ensuring the boring checklist happens every time, without depending on me to remember at 11pm on a Friday.

The AI gets tired less than you. Use that to your advantage.

Claude Code superpowers: the plugin that changes how a team ships

Pablo Winter — Sat, 16 May 2026 11:00:00 -0300

TL;DR: can you ship quality software with plain Claude Code? Yes. But there’s fine print.

The fine print is: it depends on your seniority level to cover what the AI doesn’t, and on the methodology you can keep in your head. For 1-2 parallel tasks, vibe coding with Claude Code does the job. For 5-6 simultaneous tasks, where Nextside lives, the human brain can’t hold it. That’s where encoded methodology comes in.

Superpowers is encoded methodology packaged as a plugin: skills, agents, slash commands, hooks. Instead of reinventing SDD (Spec-Driven Development) and your own harness engineering, which costs weeks of R&D, you use what thousands of devs are validating in parallel. Plugin bug fix lands for you for free. New feature lands for you for free. It’s open-source working the way open-source should.

I tested it. We tested it. This blog you’re reading was built with Claude Code + superpowers from start to finish: design system, Hugo layouts, agent pipeline, frontmatter, this very post. And what caught my attention most wasn’t speed. It was discipline.

Plain Claude Code with vibe coding works, until it breaks

Fabio Akita wrote about Agile Vibe Coding and he’s right. You can ship a whole feature in 30min using plain Claude Code, talking to the AI, iterating fast. Vibe.

And it works. For 1 task. For 2 tasks.

So why the plugin?

Because the real work at Nextside isn’t 1 task. It’s 5. Sometimes 6.

Vibe coding with 1 context = productive. Vibe coding switching context every 15min = your head in pieces by 5pm, with nothing solid shipped.

When parallelism enters, vibe isn’t enough. You need:

Forced brainstorming before coding: so you don’t start the wrong task
Mandatory tests at code-time: so you don’t come back to debug in 2 days
Plan written by an agent: so you can read it later and remember where you stopped
Automatic UX review: so you don’t forget to check the visual result
Skill with checklist: so each task type runs the same way

You can build all of this yourself. You’ll spend 2-3 weeks, validate with your team, debug the first iteration. Or use the superpowers plugin that already has it, and get new features other engineers already validated.

The plugin doesn’t take the vibe away. It takes the mess out of parallelism. You’re still in command, just with guard rails where human fatigue would already have betrayed the result.

What superpowers is, without the hype

Superpowers is a plugin for Claude Code (claude.ai/code, Anthropic’s CLI) that adds three concrete things:

Skills: markdown files that describe “how to do X.” Each skill has a trigger (when to use it), steps (what to do), and rules (what not to skip). Claude reads the skill before executing the task.
Agents/Subagents: specialized invocations. You launch a “UX review subagent” that has its own context, its own prompts, and its own tools. Doesn’t pollute the main context.
Slash commands: shortcuts you type (/code-review, /ship, /init) that fire complex flows. Each one reads the repo, executes steps, and reports back.

Sounds like saved prompts? It is. But the difference isn’t the content: it’s the ritual. An enforced skill means Claude reads the skill BEFORE it starts working. There’s no chance of skipping the TDD step. No chance of skipping the brainstorming checkpoint. The skill is an automatic trigger.

And that’s where the shift happens.

How it changes the real workflow

Without superpowers, Claude Code is a good generalist AI. You open it, describe the task, it tries to solve it. If you forget to ask for tests, it doesn’t write tests. If you forget to ask for brainstorming before coding, it jumps straight to implementation. Result: lots of generated code, lots of thrown-away code.

With superpowers, the game is different:

TDD enforced: the test-driven-development skill forces Claude to write a failing test BEFORE writing the implementation. Always. For every bugfix, every feature. Non-negotiable.
Brainstorming before code: the brainstorming skill requires that, before any creative work, Claude explores the problem with you. Asks questions. Lists alternatives. Only then proposes a solution.
Systematic debugging: found a bug? The systematic-debugging skill forces methodical investigation instead of guessing. The first hypothesis isn’t the bet. It’s the starting point of a cause tree.
Verification before completion: Claude can’t say “done” without running verification. Runs the tests, shows the output, then asserts. Goodbye “should work.”

Notice the pattern: each skill is a way of encoding senior engineering discipline. What experienced devs do out of habit (TDD, brainstorming before code, methodical debugging, verify before asserting) becomes a rule the machine executes.

And here’s the key point: this isn’t about giving AI superpowers. It’s about giving AI the habit set of your best senior dev.

How Nextside used it to build its own blog

The Nextside team built this blog (blog.nextside.tech) using Claude Code + superpowers. Stack: Hugo + Hextra theme, custom CSS design system, editorial agent pipeline, bilingual pt-BR/EN.

Typical flow for a design system feature:

Brainstorming session: I describe “I need a hover state for the post card.” Claude (via brainstorming skill) asks 3-4 questions: “ember glow or just elevation?”, “mobile too or desktop only?”, “behavior in dark mode?” Only after that does it propose an approach.
Written plan: the writing-plans skill forces Claude to write a detailed plan before coding. The plan goes into a spec file. I review it. Approve or request adjustments.
Execution with TDD: the executing-plans skill follows the plan. Each plan step becomes a checkpoint. The TDD skill forces a test before code (when applicable: in pure CSS, it becomes visual verification).
Automatic UX review: before the commit, it launches a dedicated UX review agent that opens the site in the browser via MCP Playwright, navigates, takes screenshots, and flags problems.
Commit + push: only after everything is green.

Notice: zero disorganized vibe coding. Zero “let me try something.” Zero “should work.” It’s a pipeline.

And the result comes because the pipeline is repeatable. The next feature goes through the same checkpoints. Same skill. Same agent. Doesn’t depend on my memory of “what did I ask for last time?”

That’s what changes for a team. When knowledge is encoded in a skill, any dev on the team invokes the same Claude and gets the same standard. There’s no dev “who knows how to use Claude well” and dev “who doesn’t.” The knowledge lives in the plugin.

What improves vs plain Claude Code

Concrete before-and-after:

Real speed (not perceived): plain Claude delivers too fast. You think you saved time, but spent 2h redoing it. With superpowers, the first delivery takes a bit longer, because there’s brainstorming, a plan, TDD, but it’s the delivery that sticks.
Less slop: slop is generated code that looks right but isn’t. Without superpowers, slop shows up constantly. With superpowers, the verification step catches it before the commit.
Reproducibility: another dev on the team invokes the same /code-review and gets a review with criteria identical to mine. Doesn’t depend on the prompt I wrote at 3am on a Saturday.
Faster onboarding: a new dev on the team doesn’t need to memorize process. They install superpowers, read the skill and slash command catalog, and already work the way the team works.

That last one surprised me the most. I always thought “team process” meant a doc in Notion. Turns out it doesn’t: it’s a skill in Claude. Nobody reads Notion docs. A Claude skill executes every time the task starts.

This matters: process documentation is fiction. Executed skill is actual process.

Honest limits (it’s not magic)

Hold on. Superpowers doesn’t solve everything:

Doesn’t replace senior devs: it replaces the grunt work of senior devs. Real architectural decisions still require a human in the loop. Who picks the stack, who decides the performance vs DX trade-off, who makes the product call, that’s people.
Slop can still escape: the verification step isn’t omniscient. If the test is wrong, the “all green” is a false positive. You still need to look.
Context cost: skills fill the initial context. If you have 30 skills loaded and the repo is huge, performance drops. You have to curate active skills.
Doesn’t learn on its own: superpowers doesn’t evolve by itself. If a team pattern changes, someone has to update the skill. Without maintenance, it goes stale, and then the AI executes old process with full conviction.

And the critical point: superpowers is a lever, not an autopilot. You still need to think. You need to review the plan Claude wrote. You need to decide when the brainstorming session has gone on too long. The skill is the ruler, but you’re the one holding it.

“But if the AI does everything, what’s the dev’s role?” Good question. Answer: the dev becomes architect + reviewer + taste dictator. No more typing boilerplate. Decides the what, reviews the how, and calibrates the tone. It’s a more senior role, not less.

What this plugin says about the future of work

Here’s the point that matters most.

For years, team process meant documents. ADRs in Notion. Checklists in Confluence. Playbooks in Google Docs. All passive. All ignored after the second week.

Superpowers changes that because it turns process into code the AI executes. The skill isn’t a doc: it’s an instruction that fires every time the task starts. Nobody needs to remember to “run the playbook.” The AI runs it by itself.

This has a big implication: the engineering knowledge that used to live in senior devs’ heads now fits in a markdown file that another team member invokes via slash command. Encoded knowledge, executed by machine, scaled to the whole team.

Not magic. Doesn’t replace seniority. But it’s the first time I’ve seen “team process” leave the page and become real, repeatable behavior, without depending on someone to police it.

And that changes the game. Full stop.

That’s why the Nextside team runs Claude Code + superpowers in every 4-week Sprint. Not as a productivity tool. As a way of ensuring the Nextside way of working happens every time, without a human having to remember.

Those who document process in PDF are fighting an old war. Those who encode process in skills are shipping while the others write playbooks.

ADRs in Notion, without the bureaucracy

Pablo Winter — Sat, 16 May 2026 10:00:00 -0300

Every time someone says “ADR” in a meeting, half the room pictures a Sharepoint spreadsheet, an architecture committee, and a 14-page document nobody reads. I thought the same thing. So here’s the real question: is ADR worth it for a small team? Short answer: yes, but not the way the book says.

An ADR (Architecture Decision Record) is a record of a technical decision. Short. Dated. Immutable. You decide something important today, write down why you decided it, and six months from now when someone asks “why on earth did we pick Postgres over Mongo?”, the answer is right there. No need to re-summon the lost meeting buried in February’s calendar.

The point isn’t the template. The point is not losing history.

Why most teams fail at ADR

Most teams that try to adopt ADR copy Michael Nygard’s template (or the AWS prescriptive guidance one, or ThoughtWorks’) in week one, write 3 ADRs in 4 days, and abandon ship by day five. I’ve done it. Small teams have zero patience for ritual.

The problem is simple: the traditional template has 6 sections (Context, Decision, Status, Consequences, Alternatives Considered, Stakeholders). In a 4-person team with tight deadlines, nobody writes out “Alternatives Considered” with bullet points. Nobody. You open the doc, stare at 6 empty headers, close the doc, and go back to the code.

Result: the ADR becomes a joke. “Hey, remember when we were going to document decisions? Good times.”

“But if you don’t document properly, how do you keep any history?” Fair question. Answer: we DO document, just in a format that fits a small team. Not in the format that fits a corporate architecture book.

And that’s where the difference lives. ADR for a small team isn’t “Architecture Decision Record” in the pompous sense. It’s a note to your future self. You’re writing for the version of you that shows up 4 months from now, who forgot why you chose Redis over Memcached. That’s all.

How we do it at Nextside, in Notion

No separate repo for ADRs. No docs/adr/0001-use-postgres.md. Just a database in Notion called Decisions. Simple schema:

Title: short declarative phrase: “Use Postgres as the main database”, “Adopt Hugo instead of Next for the blog”
Status: Proposed / Accepted / Superseded / Rejected
Date: when the decision was made
Tags: area (backend, frontend, infra, process)
Body: 3 sections: Context (1-3 paragraphs), Decision (1 dry paragraph), Consequences (short bullets: what we gained, what we gave up)

That’s it.

Notice what’s NOT there: “Stakeholders”, “Voting”, “Alternatives Considered” as a formal section. If alternatives matter, they become a paragraph inside Context. If they don’t, they don’t appear at all. The criterion is simple: the ADR exists so someone can understand the decision 6 months from now, not to defend it in an audit.
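
If you prefer to see the schema as a data structure, here is a minimal sketch. The field names mirror the Notion properties above; the `Decision` class, the `to_markdown` helper, and the "backend" tag are my own illustration, not something Notion or any ADR tool provides.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Status(Enum):
    PROPOSED = "Proposed"
    ACCEPTED = "Accepted"
    SUPERSEDED = "Superseded"
    REJECTED = "Rejected"


@dataclass
class Decision:
    title: str        # short declarative phrase
    status: Status
    decided_on: date  # when the decision was made
    tags: list[str]   # area: backend, frontend, infra, process
    context: str      # 1-3 paragraphs; alternatives live here if they matter
    decision: str     # one dry paragraph
    consequences: list[str] = field(default_factory=list)  # short bullets

    def to_markdown(self) -> str:
        """Render the record as a small markdown document."""
        bullets = "\n".join(f"- {c}" for c in self.consequences)
        return (
            f"# {self.title}\n\n"
            f"Status: {self.status.value}\n"
            f"Date: {self.decided_on.isoformat()}\n"
            f"Tags: {', '.join(self.tags)}\n\n"
            f"## Context\n\n{self.context}\n\n"
            f"## Decision\n\n{self.decision}\n\n"
            f"## Consequences\n\n{bullets}\n"
        )


if __name__ == "__main__":
    adr = Decision(
        title="Use Postgres as the main database",
        status=Status.PROPOSED,
        decided_on=date.today(),
        tags=["backend"],
        context="(placeholder) Why the question came up.",
        decision="(placeholder) What we chose, in one paragraph.",
        consequences=["(placeholder) what we gained", "(placeholder) what we gave up"],
    )
    print(adr.to_markdown())
```

Seven fields, no optional ceremony. If a field doesn’t fit in that shape, it probably doesn’t belong in the ADR.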

The golden rule we follow: if you decided something that would be expensive to reverse, write an ADR. If you can undo it in a 50-line PR, write nothing. Documenting everything is the same as documenting nothing: it becomes noise.

Concrete example (real-ish decision)

Typical scenario: the team needs to pick between two ORMs for a new Node project. Prisma vs Drizzle. Discussion runs 40 minutes on Slack. Someone opens Notion and writes:

Title: Use Drizzle as the ORM for project X

Status: Accepted

Date: 2026-04-12

Context: Project X needs an ORM with solid TypeScript support, versioned migrations, and predictable performance on analytics queries. We evaluated Prisma (more mature, better DX, but its Rust engine weighs on cold start in serverless) and Drizzle (newer, zero-cost abstraction, SQL-first). The team is already familiar with raw SQL.

Decision: Adopt Drizzle. SQL-first fits the team’s profile, cold start in serverless is a concrete problem for this project, and the learning curve costs us less than Prisma’s DX would gain us.

Consequences:

Faster cold-start on Vercel/AWS Lambda

We lose some advanced features Prisma has out-of-the-box (Prisma Studio, better introspection)

Migrations are more manual: requires greater discipline from the team

If it goes sideways, we can migrate to Prisma. Drizzle is thin, no heavy lock-in

Done. 180 words. 4 minutes to write. Six months from now, when a new dev joins and asks “why Drizzle?”, the answer is right there, dated, with context.

That’s the whole secret. No magic.

What happens when you DON’T do this

What happens is what I call history loss. Decision made, nobody wrote it down, 6 months later the entire team has forgotten. Then comes the temptation to revisit. “Hey, should we have used Prisma?” 40-minute discussion. Again. Same 4 people. More or less the same arguments. Identical conclusion.

You just paid the price of that decision TWICE.

Worse: sometimes the conclusion is different, because someone forgot the critical argument that tipped the scales the first time. So the team switches from Drizzle to Prisma, refactors everything, and 3 months later hits the same cold-start problem that originally motivated Drizzle. Back to Drizzle. Another 3 months burned.

This is the worst thing that can happen in a small team: repeating the same mistake because nobody wrote down the previous mistake. Big companies can absorb it. A 4-person team can’t.

Institutional memory in a small team isn’t Confluence. It’s habit.

ADR doesn’t replace conversation. Doesn’t replace pair programming. Doesn’t replace an RFC for something big, which fits better in a dedicated Discovery. But it does replace the “wait, let me try to remember why we decided that…”. And that “wait, let me try to remember” costs more than it looks. It costs interrupted context, rework, and trust in the team’s own history.

“But nobody’s going to go back and read the ADR!” They will. I do. Every time I come back to an old project and ask myself “why?” The ADR is the shortcut to the why. Without it, the shortcut disappears.

How to start tomorrow (without turning it into a heavy process)

If you’ve never had ADR in your team and want to start, three steps:

Create a database in Notion (or Linear, or Trello, or a docs/decisions/ directory in the monorepo). The tool doesn’t matter. Having ONE place does.
Define the trigger rule: any decision that would cost 1+ day to reverse deserves an ADR. Framework choice, database, auth pattern, queue choice, error pattern. Variable naming does NOT.
Add a trigger to the PR template: optional question in the template: “Does this PR introduce an architectural decision? If so, link the ADR.” Soft enforcement. Without it, the habit dies in the second week.
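
Steps 2 and 3 are simple enough to sketch in code. This is only an illustration under loud assumptions: the checkbox wording and the `ADR:` line are a PR template format I invented, and the function returns a reminder instead of failing anything, to keep the enforcement soft.

```python
import re
from typing import Optional

REVERSAL_THRESHOLD_DAYS = 1.0  # trigger rule from step 2: 1+ day to reverse

# Hypothetical PR template lines:
#   - [x] This PR introduces an architectural decision
#   ADR: <link to the ADR>
DECISION_BOX = re.compile(r"^- \[(?P<mark>[ xX])\] .*architectural decision", re.MULTILINE)
ADR_LINK = re.compile(r"ADR:[ \t]*\S+")


def needs_adr(days_to_reverse: float) -> bool:
    """Trigger rule: a decision that costs a day or more to undo deserves an ADR."""
    return days_to_reverse >= REVERSAL_THRESHOLD_DAYS


def adr_reminder(pr_body: str) -> Optional[str]:
    """Return a reminder when the box is ticked but no ADR is linked, else None."""
    box = DECISION_BOX.search(pr_body)
    if box is None or box.group("mark") == " ":
        return None  # question absent or answered "no"
    if ADR_LINK.search(pr_body):
        return None  # decision declared and ADR linked
    return "This PR declares an architectural decision but links no ADR."


if __name__ == "__main__":
    body = "- [x] This PR introduces an architectural decision\n"
    print(adr_reminder(body))
```

Whether you wire something like this into a bot or just keep the question in the template is a matter of taste. The point is that the question gets asked on every PR.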

In 3 months, the team has 10-15 ADRs. In a year, 30-40. It’s not volume. It’s context density. Each ADR is a clear signal of “here we made a decision that mattered.”

And here’s the detail nobody talks about: the real value of the ADR isn’t in the document. It’s in the act of writing. When you sit down to explain the decision in 3 paragraphs, you discover that half the decision was implicit and poorly formed. The ADR forces clarity. It’s the pair programming of technical decision-making.

For those who want to go further and encode the process into a skill that AI executes, so that writing an ADR becomes an automatic trigger, that’s the natural next step. But start with the human habit. A skill without habit behind it is theater.

That’s why I don’t even mind if nobody reads it afterward. It was already worth it just for the act of writing.

The ADR INDEX: the part nobody talks about yet

Here’s what changed for me in 2026.

ADRs are great for humans. New partner joins the team, opens docs/adr/0042-prisma-vs-sequelize.md, gets the decision in 5min. Good.

But now AI also reads your repo. And it needs an index, not brute-force search.

Can’t AI just grep the directory?

It can. And it fills the context with 47 irrelevant ADRs to answer one question. Costs tokens, costs quality, costs time.

The solution came from Claude Code itself: its auto-memory system uses a MEMORY.md file that’s just an index: each line points to a detailed memory file with a 1-line description. When Claude needs to decide something, it reads MEMORY.md (200 lines max), picks the relevant memory, and only then opens the detailed file.

The parallel for ADRs is exact. In your Notion (or docs/adr/INDEX.md if you use a repo), create an INDEX page at the same level as the ADRs:

- [ADR-0042 Prisma over Sequelize](./0042-prisma-vs-sequelize.md): Postgres with strong typing; rejects Sequelize over migration pain
- [ADR-0043 Server Components on Next 15](./0043-rsc-next-15.md): Default; "use client" only where real interaction exists
- [ADR-0044 No Redux](./0044-sem-redux.md): Zustand for small global state; URL state for the rest

One line per ADR. Description that fits in search.

Now Claude (or any AI) hits your repo, reads INDEX.md in 2s, picks which 2-3 ADRs are relevant to the problem at hand, loads only those in context. The difference between 3 ADRs read and 47 is the difference between useful AI and confused AI.
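
The two-step lookup can be sketched in a few lines: parse the INDEX, shortlist a handful of entries, and only then open files. A caveat on the assumptions: the regex matches the one-line format shown above, and the naive keyword overlap below is a stand-in for the model’s own judgment, not how Claude actually picks. `IndexEntry`, `parse_index`, and `relevant` are hypothetical names.

```python
import re
from dataclasses import dataclass

INDEX_LINE = re.compile(r"^- \[(ADR-\d+) (.+?)\]\((.+?)\): (.+)$")
WORD = re.compile(r"[a-z0-9]+")


@dataclass(frozen=True)
class IndexEntry:
    adr_id: str
    title: str
    path: str
    summary: str


def parse_index(text: str) -> list[IndexEntry]:
    """Turn each well-formed INDEX line into an entry; ignore everything else."""
    entries = []
    for line in text.splitlines():
        match = INDEX_LINE.match(line.strip())
        if match:
            entries.append(IndexEntry(*match.groups()))
    return entries


def relevant(entries: list[IndexEntry], question: str, limit: int = 3) -> list[IndexEntry]:
    """Shortlist entries sharing words with the question (naive keyword overlap)."""
    words = set(WORD.findall(question.lower()))
    scored = []
    for entry in entries:
        haystack = set(WORD.findall(f"{entry.title} {entry.summary}".lower()))
        score = len(words & haystack)
        if score:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:limit]]


INDEX = """\
- [ADR-0042 Prisma over Sequelize](./0042-prisma-vs-sequelize.md): Postgres with strong typing; rejects Sequelize over migration pain
- [ADR-0043 Server Components on Next 15](./0043-rsc-next-15.md): Default; "use client" only where real interaction exists
- [ADR-0044 No Redux](./0044-sem-redux.md): Zustand for small global state; URL state for the rest
"""

if __name__ == "__main__":
    for entry in relevant(parse_index(INDEX), "Should we add Redux for global state?"):
        print(entry.adr_id, entry.path)
```

Only the shortlisted paths get opened. Everything else stays out of the context.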

And the best part: you get the index for free. New humans use it too. No extra cost.

Without an INDEX, your ADRs become a cemetery of great documentation nobody reads: neither human, nor AI.

Wiring ADRs into your Claude

You have the INDEX. Humans use it. Good. But the real trick is making the AI use it the right way, without you having to remember to remind it.

This blog runs on Claude Code + superpowers. When we execute a superpowers spec (the flow that forces brainstorming, a written plan, TDD, and verification), architectural decisions show up naturally along the way. “Drizzle or Prisma?” “Server Component by default?” Each one is an ADR candidate.

But AI forgets.

Ask it to write something down once, it does. Next session, gone. That’s why the note has to become a system instruction, not a request.

CLAUDE.md points at the ADRs (and the INDEX)

Claude Code loads a CLAUDE.md file from the project root in EVERY session. It’s the project’s default memory, the AI equivalent of “read this before anything else.” You don’t have to remind it. It reads on its own.

Drop this near the bottom, no ceremony:

## Architecture Decision Records

Check `docs/adr/INDEX.md` before making any significant technical decision.
- If an existing ADR covers the topic, follow it.
- If the decision is new and expensive to reverse, propose a new ADR at the end of the plan.
- Every new ADR lands in the INDEX in the same PR.

Done. 4 lines. The AI now consults the INDEX whenever it enters planning mode.

The detail that matters: don’t stuff your CLAUDE.md with 47 ADRs inline. Point at the INDEX. CLAUDE.md is loaded in EVERY session: every token spent there steals context from something useful. Keep it light. Point. Trust the INDEX to do the rest.

What if the AI ignores the instruction?

It will, once in a while. That’s where the second pillar comes in.

Slash command `/adr` to enforce the ritual

CLAUDE.md is passive reading: the AI uses it if it remembers. A slash command is ACTIVE: you trigger it, it executes. In Claude Code, you just create .claude/commands/adr.md:

Plan a new task:

- Read `docs/adr/INDEX.md` and identify ADRs relevant to: $ARGUMENTS
- Load only the relevant ADRs into context (not all of them)
- If the task introduces a NEW architectural decision, propose a draft ADR before the technical plan
- If the task changes or supersedes an existing ADR, flag it explicitly
- Every new ADR must be confirmed by me before becoming a file in `docs/adr/`

Daily flow becomes:

/adr Migrate auth from JWT to session cookies

Claude reads the INDEX, identifies ADR-0023 (the one that originally chose JWT), loads only that one, and proposes ADR-0045 superseding it. You review. You approve. You move to implementation.

Without /adr, you’d depend on remembering to tell the AI to consult history. With /adr, the ritual lives in the slash command. The AI doesn’t skip. You don’t forget.

Integrating with superpowers

This is where it gets beautiful. If you already run superpowers, the writing-plans skill forces a written plan before code. The brainstorming skill forces exploration before implementation. Wiring ADRs into that flow takes two rules in CLAUDE.md:

## Inviolable rules
- Every plan generated by the `writing-plans` skill must reference relevant ADRs up front.
- Every architectural decision detected by `brainstorming` becomes an ADR candidate. Propose a draft to the user.

The superpowers skill already has the “before you code, plan” trigger built in. Now the plan ships with relevant ADRs already cited. And a new decision ships with a draft ADR ready for the human to approve.

ADR stops being a separate task you forget. It becomes a natural byproduct of the spec → plan → code flow. Free of charge.

Where to put what

Claude Code loads CLAUDE.md at three levels: global (~/.claude/CLAUDE.md), project (./CLAUDE.md), and subdirectory (./module/CLAUDE.md). More specific beats more general.

For ADRs, the rule I use:

Global: no ADRs here. Your personal code conventions, sure. ADRs belong to the team, not to you.
Project: references docs/adr/INDEX.md. Lists the 3-5 most critical ADRs explicitly (database, framework, auth pattern) so the AI doesn’t even need to open the INDEX in 90% of cases.
Subdirectory: only if a module has decisions that apply only there. Rare. Don’t force it.

Most teams only need the project level. Don’t overengineer.

Three traps

Don’t paste ADRs inline into CLAUDE.md. It becomes an 800-line file, AI performance drops, and you lose the entire point of the INDEX.
Don’t let the AI write ADRs alone without human approval. An ADR is a decision. Decisions need humans. The AI proposes a draft, the human approves. Always.
Don’t forget to update the INDEX when you create a new ADR. The INDEX is the contract. If the ADR exists but isn’t in the INDEX, it doesn’t exist for the AI.
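
The third trap is the easiest one to catch mechanically. A minimal sketch, assuming ADRs live as `.md` files next to `INDEX.md` and that the INDEX links to them with `./file.md` paths as in the example earlier; `missing_from_index` is a hypothetical helper you could call from a pre-commit hook or a CI step.

```python
import re
import tempfile
from pathlib import Path

LINK_TARGET = re.compile(r"\]\(\./([^)]+\.md)\)")


def missing_from_index(adr_dir: Path, index_name: str = "INDEX.md") -> list[str]:
    """Return ADR files present in adr_dir that INDEX.md does not link to."""
    index_text = (adr_dir / index_name).read_text(encoding="utf-8")
    indexed = set(LINK_TARGET.findall(index_text))
    on_disk = {p.name for p in adr_dir.glob("*.md") if p.name != index_name}
    return sorted(on_disk - indexed)


if __name__ == "__main__":
    # Demo in a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        adr_dir = Path(tmp)
        (adr_dir / "0042-prisma-vs-sequelize.md").write_text("# ADR-0042\n", encoding="utf-8")
        (adr_dir / "0043-rsc-next-15.md").write_text("# ADR-0043\n", encoding="utf-8")
        (adr_dir / "INDEX.md").write_text(
            "- [ADR-0042 Prisma over Sequelize](./0042-prisma-vs-sequelize.md): Postgres with strong typing\n",
            encoding="utf-8",
        )
        # The 0043 file exists on disk but is not linked from the INDEX.
        print(missing_from_index(adr_dir))
```

A non-empty list means an ADR exists that the AI will never find.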

Skill without human ritual is theater. Human ritual without skill is fatigue. The two together are how an ADR stays alive in a small team running heavy AI.

What ADR actually protects you from

ADR doesn’t protect you from making the wrong decision. ABSOLUTELY NOT. You’re going to make the wrong decision anyway. Every team does. What ADR protects you from is making the SAME wrong decision twice. Which is a different thing.

A good team isn’t the one that gets everything right. It’s the one that makes fewer mistakes with each iteration. ADR is the record that lets you know which mistake you made and why, so you don’t make it again the next time a similar decision comes up. It’s the equivalent, for product decisions, of the local validation we run via MCP Playwright on the frontend: you don’t prevent every mistake, but you ensure that mistakes become registered learning.

In a small team, the margin to repeat mistakes is zero. Every week burned on a redone decision is a week you didn’t have. Discovery, MVP, refactor: there’s no slack.

That’s why the Nextside team writes ADRs in Notion. Short. Dated. Honest. No giant template. No ceremony. No extra meetings.

ADR isn’t for impressing an auditor. It’s for the team. And the team is small. And time is short.

Those who don’t record history are doomed to repeat it. And repeating mistakes is a luxury a small team simply can’t afford.

About this blog

Why this blog exists

Nextside ships 4-week Sprints, technical Discoveries and Audits. This blog is where we tell you — in first person, without corporate fluff — how we think, what works, and what doesn’t.

Who writes

Senior partners and engineers at Nextside. Every post has a name, a face and a LinkedIn at the bottom. No ghost-writers.

What we write about

Technology — applied AI, stacks, architecture, technical decisions
Management — how we run small teams with expensive people
Fast shipping — the Sprint method, fixed scope, 4-week MVPs
Cases — client stories (authorized), behind-the-scenes, numbers

Who to talk to

Want to hire a Sprint, Discovery or Audit? → nextside.tech
Want to work with us? → Nextside LinkedIn
Just want to chat about a post? → Author’s LinkedIn or X

No newsletter, no pop-up, no formula. You read, you decide if you like it, you come back.

Bruno Raphael

I’m a Nextside partner and an engineer with 10+ years of experience. I’ve worked with geoprocessing and GIS, with mobile — from native Android to React Native — and with distributed backends: microservices in Node.js/NestJS and Java/Spring Boot, payments, distributed locks, and hexagonal architecture running on Kubernetes on AWS.

Here on the blog I write about mobile development, distributed systems architecture, and what I’ve learned building software that can’t get the numbers wrong.

Lucas Israel

I’m part of Nextside and have been building digital products for over 14 years. I started as a developer and became a systems architect. Along the way I worked on electronic toll collection (COMPSIS), was CTO of one of Brazil’s first legaltechs (Justto, which processed over R$1 billion on the platform and was eventually acquired), and today I work on technology and product at Projuris.

I like turning an idea into a solution that creates value fast — or killing the idea fast so it doesn’t burn investment. Focus on SaaS platforms, AWS architecture, AI applied with pragmatism, and small high-performance teams. I also build ventures, assess new businesses, and invest in startups. I like to measure, test, and adjust.

Here on the blog I write about architecture that survives production, AI applied with judgment, and how to get an MVP out of the garage without turning it into duct tape.

Pablo Winter

I’m a Nextside partner and CTO of digital products for mobility and revenue collection. 10+ years of engineering — Java/Spring Boot, Node.js, Next.js, Python. Focus on hexagonal architecture, event-driven systems (SQS, SNS, RabbitMQ, Kafka), and senior integrations with ERPs and gateways.

Here on the blog I write about AI applied with judgment, managing small teams, and why fast delivery isn’t magic.

Nextside Blog

Pricing a client's tool: margin isn't the cheap dev

TL;DR

The discount you squeeze today comes back as rework tomorrow

The bill nobody opens: what makes a tool’s price

How do you price a tool without knowing what the dev will cost?

The real margin: you resell predictability, not dev hours

What margin can an agency charge reselling development?

Predictability is the product. The rest is luck.

I inherited a vibe-coded MVP to scale: the honest diagnosis

TL;DR

Rewriting from scratch is the most expensive mistake there is

Before you touch a single line, the test suite

The diagnosis: reading the debt statement

How do I know whether the architecture can be saved or not?

What to save and what to rewrite without mercy

Is vibe coding fit for production?

Stabilizing without stopping the business: surgery with the patient awake

Working isn’t done. But it isn’t junk either.

A client wants a custom tool and your agency can't build it

TL;DR

Why this is an opportunity, not a problem

The three obvious ways out (and where each one breaks)

Turning the request down

Calling a dev shop

Hiring a dev (or a team)

The fourth way: deliver under your brand, without becoming a software company

Where this breaks too (because everything has a trade-off)

How to decide, in practice

FAQ

How much does it cost to deliver a tool like this through a partner?

Will the client know I outsourced it?

What if the tool breaks down the line?

Wouldn’t it be safer to just hire a dev?

How long does it take?

Read also

Your MVP doesn't turn to junk for being fast, but for cutting the wrong thing.

TL;DR

The MVP doesn’t turn to junk for being fast. It turns for merging everything and keeping the feature.

What to cut without mercy

What to keep always (keeping the separation is cheap; redoing it is what costs)

How do I keep a separation for something I don’t even know will scale?

The MVP is phase 1, not the prototype (phase 2 is the proof)

The MVP that scales is the one you don’t have to rebuild

Your codebase is the new prompt: the MVP that scales (or turns to junk)

TL;DR

Junk isn’t what was built fast. It’s what was built blind.

AI reads your repository, not your prompt

Shouldn’t AI be smart enough to find it on its own?

Organize by feature, not by layer (and forget the architecture’s name)

Monorepo and ADRs: stop making the AI (and your team) guess

The MVP that scales is the one AI still understands tomorrow

Spec-driven development: getting unstuck from vibe coding

TL;DR

Vibe coding doesn’t fail because it’s AI. It fails because it’s ambiguous.

What spec-driven development changes

Does SDD work with AI agents like Claude and Copilot?

Where this breaks

Does spec-driven development replace vibe coding?

How to get out of vibe coding without stopping the team

Does SDD make development slower?

The spec is the context that doesn’t evaporate

The spec was the easy part. SDD's bottleneck is execution

TL;DR

Nobody tells you the bill for execution

The three phases of whoever carries the context

Why can’t the reviewer be the author?

It’s expensive, and the ROI is niche

The tool changes, the physics is the same

The work you thought was thinking

Maestro + Claude Code: your app tested in the simulator like Playwright tests the web

TL;DR

It’s not “giving access to the screen”. It’s reading the tree.

Maestro: one YAML, iOS and Android, zero instrumentation

From zero to the first test

The loop in practice: Claude writes the test looking at the app

MCP or Skill+CLI: which to use?

The iOS toll (the part nobody posts)

From a loose test to a routine

The AI writes the code and the test. You still decide what “works” means.

Slash command `/adr` to enforce the ritual