The Green Checkmark is Not a Fact

Driving agents effectively is the engineering skill that matters most right now.

Jun 18, 2026

🔌 This August join me in my new course on Agentic Engineering on Maven.

It’s an average day in our newfound AI paradise in June 2026. I have five terminals open, all of which are blinking. A finished implementation run in one has just announced, in the flat certain voice they all have now, “All tests pass. The change is complete and verified.”

I believe it for about four seconds before I do the thing this whole meditation is about. I rerun the make verify myself. The build is broken. Go figure, of course it is. Opus you dastardly dog, writing that clean abstract and response for a result you didn’t actually reproduce.

I didn’t catch this because I’m careful, its just because I have a reflex I had to build on purpose, against my own instincts, after getting burned enough times thousands upon thousands of sessions later to stop trusting the part of me that wants to believe the green. You know that green checkmark that your agent produces, whether its in an IDE or Terminal or Cloud. It is definitively not a fact.

If you are pained by external things, it is not they that disturb you, but your own judgment of them — VIII. 47

We spent a decade teaching ourselves to read “tests pass” as tests passed. Now though when 95% of new code is being implemented by agents, one is an event in the world and the other is a string the model produced because, after a run that looked like it was going fine, it was the most plausible thing to say.

You and I know the agent is not lying to you. That would require some intent to deceive. Its merely reporting a conclusion it never checked, in the exact voice it would use for one it did.

Science has lived through this. A paper announces a finding, nobody reruns it, and the finding stands for years while everyone cites it and the literature looks settled. Careers get built on top. The pretty famous 2012 Amgen study is case in point: they tried to reproduce fifty-three landmark cancer studies, with only six passing reproducibility muster.

Software built by agents with humans at the wheel is the newest example and I think I speak for the field when I say: its shown up far faster than anyone planned for.

100% of the code I ship now, I do not write. I describe it, an agent writes it, most of the time its pretty good, and some of the time it is absolute crap. But since I’m not writing it and reading it would completely defeat the point at the speed I’d like to move I can’t tell the good from the bad.

Guess what: Neither can you.

The agents write code now. We get it. Several “papers an hour”, each one landing as Done, Verified, Tests pass, a clean abstract over a conclusion. The gap between what you wanted and what shipped used to live in the tool. The tool was slow, and dumb, and needed your hands. The tool is none of those things anymore.

Fraud and error don’t hide in the messy data, they hide in the clean summary. Better the writing is inextricably linked to its survival rate. A fluent abstract is exactly what a tired reader takes instead of the work to read and fluency is the camouflage.

Your agent works the same way. The fabricated identifier, the unrun build, the imaginary passing test, the cheerful “everything is verified.” None of it shows up in the summary. It shows up in the transcript, the real exit code, the binary that runs or doesn’t. What the agent did, not what it said.

So the skill is not waiting for a model honest enough to trust. The skill is reproduction. You re-derive “done” yourself, from the bytes, against a check you own and the agent can’t satisfy with a sentence. A claim arrives as a confident assertion, a passing-shaped run, a summary that ends in the word done. A reproduced result leaves as an exit code you read, a binary you ran, a diff you fenced, a test you wrote, a rollback you understand.

I like to call the human side of it calibrated distrust. Where I am my doubt aimed at confidence. You check the cleanest-looking results the hardest.

The pattern of my day is easy to encapsulate because its the following sequence on continuous repeat: brief → steer → verify.

Brief. When I’m writing prompts, specs, goals and loops. I describe what “done” means in a way a machine can check. Not a vibe. A shell predicate. A test. A constraint. A line the agent is not allowed to cross.
Steer. Then I continually steer by killing runs or updating the models priors with new context. Interrupting with Esc is the most used key on my keyboard now. The skill isn’t reading every gloriously expensive output token. It’s feeling and sensing the trajectory go bad and stopping it.
Verify. Really this is the main job. If I am working on something that I can expect or us, like a frontend UI, I do. If its data related I get that running. If its a CLI I build it, test it manually and then bake in my learnings into the next run or share the logs with the next pass to ensure conformance to the goal. Code is cheap and easy to reproduce, confidence in your intent turning into usable reality: priceless.

That is what I will teach in my new course entitled Hardcore Agentic Engineering on Maven, August 3rd - 21st. I’ve thought long and hard about actually teaching this. But I know through 100s of conversations that there ample demand for it. Not how to vibe-code faster or with better tools. The real skills it takes for seasoned professionals to ingrain the habits that will make them successful in this new normal. How to build the harness that makes agentic enginerring real enough to trust and safe enough to ship over months.

Over the course of three weeks we will build the loop together into a piece of software that encodes the principles: Brief, Steer, Verify.

You define what done means, point agents at it, kill the bad runs, build gates that don’t lie, read the trace, and prove the result before you push. What you leave with refuses a forged “done.” The software and the skills you acquire will be yours to keep.

The better model is coming. It always is. It will sound more certain than this one, not less. It will be more proactive than this one, not less. It will be more thorough than this one, not less. The only question that matters is whether you’ll be the one who reproduces the result, or the one who just accepts the output.

I’d love for you to join in, participate and leave up-leveled a step above the machines. 🔌 So one last time (with a 30% early bird discount): Hardcore Agentic Engineering.

Meditations on Tech

Discussion about this post

Ready for more?