
From solve_captcha to Symbiosis: What My 2014 DSL Was Trying to Tell Me

I built a JSON-driven web scraper in 2014 called Krake.io. You wrote a recipe describing which buttons to click, which fields to fill, which DOM elements to extract, and a Rails app would walk the recipe against a real browser and hand you back rows of data. There was a Sinatra worker, a Redis queue, a small army of Chrome instances. It paid the bills for a few years.

The DSL — a small JSON vocabulary, the kind of thing people in the trade call a domain-specific language — had eleven action types: click, insert, scroll_bottom, wait, trigger_change, a few others, and one weird one that I added near the end and never quite knew how to explain. It was called solve_captcha. The implementation was three lines long: poll the DOM in a one-second loop until the CAPTCHA element disappeared, then continue the recipe. The semantics were stranger than the code. The bot would just sit there. A human, watching the screen, would walk over and click the squiggly letters, and the recipe would resume.
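As a rough illustration of that wait loop (not the original Krake.io code), here is a minimal Ruby sketch. The `captcha_present` callable and the injectable `sleeper` are assumptions made for the example; a real engine would query the browser instead.

```ruby
# Hypothetical stand-in for the 2014 solve_captcha action.
# `captcha_present` is any callable that returns true while the CAPTCHA
# element is still in the DOM. `sleeper` is injectable so the loop can be
# exercised without real waiting.
def solve_captcha(captcha_present, interval: 1.0, sleeper: ->(s) { sleep(s) })
  polls = 0
  while captcha_present.call
    sleeper.call(interval) # the bot just sits here until a human clears it
    polls += 1
  end
  polls # number of polls before the recipe could resume
end

# Usage: simulate a CAPTCHA that a human clears after three checks.
checks = [true, true, true, false]
polls = solve_captcha(-> { checks.shift }, sleeper: ->(_s) {})
puts "recipe resumed after #{polls} polls"
```

Note what is missing: nothing in the loop lets the human say anything other than "done".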

I didn’t realize at the time that solve_captcha was the single most important action in the DSL. It was the only one that acknowledged that the human existed at all. Every other action assumed the bot was alone in front of the browser. solve_captcha was the only place where the contract was “a person and a bot are working on this together, the bot just doesn’t know how to talk to the person.” Twelve years later I’m finally building the thing that action always wanted to be.

The cost of pretending the human isn’t there

Every browser-automation tool I’ve touched since 2014 has bet the same way: remove the human. Selenium, Puppeteer, Playwright Codegen — you record a session, replay it later, and if anything changes you debug the recording. Modern agentic tools — Browser Use, Stagehand, Skyvern, Multi-On, Open Operator — bet harder. They want the LLM to be the human. Spin up a headless Chrome, log in with stored credentials, complete the task unsupervised, ping you on Slack when it’s done. The pitch is “your hands are free.”

Anyone running these tools at any kind of scale knows the dirty secret. The hands aren’t free. They’re just hidden behind retries and timeouts. The 2FA prompt still fires; somebody has to type the code. The CAPTCHA still shows up; somebody has to click the boats. LinkedIn’s anti-bot system still fingerprints the session; somebody has to log back in. The site rearranges its DOM next Tuesday; the recipe breaks silently and somebody (usually you, at 9pm) has to figure out which selector drifted. The human is in the loop. We just pretend otherwise.

A small personal example. Earlier this year I tried Open Claw, one of the newer agentic tools that integrate with WhatsApp through its mobile web view. The pitch was that an LLM would handle some of my outbound messages automatically. The execution was a black box. My contacts started messaging me to ask why Open Claw’s own sync-error prompts were periodically being posted into their threads as if I’d typed them — the tool’s internal error toasts, leaked out of whatever boundary should have contained them, surfacing into people’s chats at hours I wasn’t around. There was no surface inside the tool for me to approve or veto a specific send before it went out, no clean log of what it was doing or when. Turning it off was just as opaque — I ended up asking one of my Claude sessions to walk me through how to stop it. Inside a day I was done. The technical failure mode was the generic one of any unobservable agent: internal errors leaking past the wrong boundary, no rollback, no rehearsal, no audit. The felt failure mode was that I had outsourced my voice to a black box and my contacts paid for the gap. The fact that I had to enlist a second AI to extract myself from the first one is, in retrospect, the punchline that organises the rest of this post. human_intervention, recipe_progress events, and a chat surface that shows you the draft before the send aren’t in krake_browser’s design just for safety. They’re there because anything less is a usability bug masquerading as automation.

The funny thing is that nobody actually wants the human removed from real operational work. The work I do most days is talking to retail partners about whether their cacao shipments arrived, drafting check-ins, nudging an Apps Script when it stalled, walking a contributor through their first Edgar submission. The work isn’t labor I want offloaded; it’s judgment I want augmented. The right shape for the tool isn’t a bot pretending to be me. It’s a copilot sitting next to me at the same browser, doing the rote parts and asking when the next decision needs my eyes.

What I’m building

The project is called krake_browser and the design is deliberately one component longer than feels strictly necessary. There is a Chromium instance running on my laptop with a persistent user-data directory. It is logged into WhatsApp Web, Instagram, Facebook, the FDA establishment-registration portal, and a handful of other operational surfaces. The sessions persist because the profile lives on disk and the browser process rarely shuts down. When my MacBook reboots, the launcher script brings the same Chromium back up against the same profile and within thirty seconds I am back in WhatsApp without scanning a single QR code.
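A launcher along those lines could be sketched as below. The `--user-data-dir` and `--remote-debugging-port` switches are real Chromium flags; the binary path, profile path, and port are placeholders, and the actual launcher script may look quite different.

```ruby
# Hypothetical launcher sketch: build the Chromium command line.
# --user-data-dir pins the persistent profile (cookies, logged-in sessions);
# --remote-debugging-port opens the CDP endpoint an outside tool attaches to.
def chromium_command(binary:, profile_dir:, debug_port: 9222)
  [
    binary,
    "--user-data-dir=#{profile_dir}",
    "--remote-debugging-port=#{debug_port}"
  ]
end

cmd = chromium_command(
  binary: "/Applications/Chromium.app/Contents/MacOS/Chromium", # placeholder path
  profile_dir: "/Users/me/.krake_browser/profile"               # placeholder path
)
puts cmd.join(" ")
# A real launcher would hand this array to Process.spawn(*cmd).
```

Reusing the same profile directory on every launch is what lets the logins survive a reboot.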

A Sinatra app (descended from the same Sinatra worker that ran Krake.io in 2014) attaches to that Chromium over the Chrome DevTools Protocol, or CDP: the wire format Chrome exposes so outside tools can drive it, and the same wire Puppeteer and Playwright use under the hood when they drive Chromium. The Sinatra app exposes itself to LLMs through an MCP server. MCP is the Model Context Protocol, an open standard that a growing list of LLM clients (Claude, Cursor, Kimi, Codex) use to call external tools and read external data. When I am chatting with one of those LLMs and I say “send Kirsten a check-in on WhatsApp,” the LLM calls list_recipes() against the MCP server, finds partner_followups/check_in, then calls run_recipe(name, vars) with Kirsten’s details. The Sinatra app walks the recipe step by step in the live browser. I am watching the whole time.
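To make the list_recipes / run_recipe handshake concrete, here is a small Ruby sketch with no MCP or browser attached. The in-memory library, the selectors, and the variable names are all made up for illustration; only the two tool names, the recipe name, and the {{VARIABLE}} placeholder syntax come from the design.

```ruby
require "json"

# Hypothetical in-memory recipe library, keyed by the name the LLM sees.
# Selectors and variable names are placeholders, not real recipe content.
def recipe_library
  {
    "partner_followups/check_in" => {
      "actions" => [
        { "action" => "click", "selector" => "span[title='{{CONTACT}}']" },
        { "action" => "insert", "selector" => "div[contenteditable]",
          "value" => "Hey {{FIRST_NAME}}, quick check-in?" }
      ]
    }
  }
end

def list_recipes
  recipe_library.keys.sort
end

# Replace {{NAME}} placeholders in every string of a nested structure.
# A missing variable raises KeyError rather than sending a half-filled step.
def substitute(node, vars)
  case node
  when Hash then node.transform_values { |v| substitute(v, vars) }
  when Array then node.map { |v| substitute(v, vars) }
  when String then node.gsub(/\{\{(\w+)\}\}/) { vars.fetch(Regexp.last_match(1)) }
  else node
  end
end

def run_recipe(name, vars)
  recipe = recipe_library.fetch(name)
  substitute(recipe, vars)["actions"] # a real engine would execute these steps
end

steps = run_recipe("partner_followups/check_in",
                   { "CONTACT" => "Kirsten", "FIRST_NAME" => "Kirsten" })
puts JSON.pretty_generate(steps)
```

Failing loudly on a missing variable is a design choice for the sketch; it keeps a half-substituted message from ever reaching the browser.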

The recipe DSL is the original Krake DSL, near-verbatim. click, insert, scroll_bottom, wait, the rest of the eleven actions, the {{VARIABLE}} substitution, the columns[] extractor — all preserved. Recipes from 2014 that I have lying around in dead repos should run with at most cosmetic changes. The only new action is one that solve_captcha always wanted to grow into:

{
  "action": "human_intervention",
  "prompt": "Drafted message to Kirsten:\n\n> Hey Kirsten, quick check-in re. inventory — got 5 min?\n\nReview in the WhatsApp tab and hit Continue to send.",
  "screenshot": true,
  "ack_required": true,
  "timeout_seconds": 600
}

When the recipe hits one of these, the Sinatra app pauses, captures a screenshot of the tab, and fires an intervention_required event back through MCP. The LLM surfaces the prompt and the screenshot to me in chat. I read it, possibly edit the drafted message directly in the WhatsApp text box, and tell the LLM whether to continue. The LLM calls ack_intervention(token, "continue") and the recipe resumes from where it stopped. solve_captcha was this pattern with the ack channel missing: the recipe would resume when the CAPTCHA DOM disappeared, with no way for me to say “no, abort” or “edit this first” or “skip this step.”
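The pause-and-ack bookkeeping might look something like the sketch below. The class name, the token format, the set of decisions beyond "continue", and the timeout handling are assumptions for illustration; only ack_intervention(token, decision) and the timeout_seconds field come from the design described above.

```ruby
require "securerandom"

# Hypothetical bookkeeping for paused human_intervention steps.
class InterventionDesk
  DECISIONS = %w[continue abort skip].freeze # assumed vocabulary

  # `clock` is injectable so timeouts can be exercised without waiting.
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @pending = {}
  end

  # Called when the executor reaches a human_intervention step.
  # Returns the token that would travel with the intervention_required event.
  def open_intervention(step)
    token = SecureRandom.hex(8)
    timeout = step.fetch("timeout_seconds", 600)
    @pending[token] = { prompt: step["prompt"], deadline: @clock.call + timeout }
    token
  end

  # Called when the LLM relays the human's decision. Tokens are single-use.
  def ack_intervention(token, decision)
    raise ArgumentError, "unknown decision: #{decision}" unless DECISIONS.include?(decision)

    entry = @pending.delete(token)
    raise KeyError, "unknown or already-used token" if entry.nil?
    return :timed_out if @clock.call > entry[:deadline]

    decision.to_sym
  end
end

desk = InterventionDesk.new
token = desk.open_intervention({ "prompt" => "Review the draft, then Continue.",
                                 "timeout_seconds" => 600 })
puts desk.ack_intervention(token, "continue") # the recipe would resume here
```

The explicit decision argument is the whole difference from solve_captcha: the human can now answer with something other than "done".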

The repos are KrakeIO/krake_browser for the engine, KrakeIO/krake_recipes for the public recipe library (WhatsApp, Instagram, LinkedIn, Facebook, FDA), and TrueSightDAO/tdg_recipes for the DAO-specific ones that I use to run partner check-ins and Edgar submissions. All three are public as of this post. What’s in them today is the design: a README, an architecture doc, the recipe DSL spec, the JSON schema, and a handful of sample recipes. The engine itself — the Sinatra app, the CDP attach, the MCP server, the recipe executor — is the next stretch of work and you can watch it land in the commit log.

Iron Man, not Jarvis

[Image: Tony Stark in the Iron Man suit. Caption: You never see him sitting in his mansion watching Jarvis fly the suit on his behalf, except in the kind of extreme circumstance the writers explicitly mark as broken.]

If you’ve watched the Iron Man movies, the shape of this design will already be familiar. The default is Tony in the suit. Jarvis is the layer doing the targeting computation, the diagnostic readout, the route planning, the rapid threat assessment, the “sir, you have a missile lock at ten o’clock” — but the man is always the one in the cockpit. The whole point of the suit is to augment the man’s capability, not to substitute for him. Marvel never even tried to write Jarvis as Iron Man’s replacement; the suit-without-Tony plot would have failed for the same reason that headless agentic browsers fail in practice. The agency lives in the human; the AI is the substrate that makes the agency go further.

krake_browser is the same shape, just for the destinations you operate from a keyboard. You stay in the seat. The recipes are the targeting computer, the route planner, the threat-assessment readout. The browser is the suit. The LLM is Jarvis — helpful, fast, indispensable for the rote work, and pointedly not the one whose name is on the helmet.

An old paradox in a new browser

The division of labor I keep coming back to (LLM does the rote driving, human does the judgment and the gnarly DOM recognition) has a name that’s older than I am. In 1988 Hans Moravec made the observation that high-level reasoning takes very little computation while low-level sensorimotor perception takes enormous amounts. Chess and theorem-proving were tractable for early AI; walking across a room and picking up a coffee cup were not. The intuition felt upside-down because, in Moravec’s telling, evolution spent on the order of a billion years optimizing the sensorimotor stack and perhaps a hundred thousand years on abstract reasoning. The expensive things, in compute, turned out to be the cheap things in human experience.

Operating a browser is more sensorimotor than people give it credit for. Recognizing “the Connect button, even though they moved it under the three-dot menu last week” is closer to recognizing a friend in a crowd than to solving an equation. LLMs have gotten remarkably good at it — Browser Use and Stagehand do impressive work with intent-grounded clicks — but they pay an LLM call’s worth of latency and dollars for each one, and they still get fooled by visual decoys, mid-DOM A/B tests, the occasional rage-clickable popup. The human sitting at the same browser doesn’t. The perception is free. What the human is worse at, comparatively, is the rote, structured part — typing the same check-in message into the same WhatsApp tab for the eighth time that day. So the recipe is Moravec’s paradox shaped into a tool. The strict DSL handles the repetition; human_intervention hands the wheel back for the parts that need a person.

Guideposts, not selectors

The strict-DOM-selector model is the part of the 2014 design that hasn’t aged well. Krake.io recipes assume that button[aria-label='Connect'] means the same thing today that it meant when you wrote the recipe. LinkedIn alone disproves this assumption on a monthly cadence. The recipes I’ve carried over from old projects break silently. By the time I notice, the operational data they were collecting has had a two-week hole.

The next layer up — what I’ve started calling guideposts — treats each action as a goal rather than a contract. Each step gets three new optional fields alongside the strict selector: an intent (“Open connect-with-note modal”), a natural-language hint (“Click Connect; if hidden, it’s under the three-dot ‘More’ button”), and an expected_state the executor can check after the action (“Textarea for personal note is visible”). The executor tries the strict selector first because it’s fast, cheap, deterministic, and works most of the time. When it fails or the expected state doesn’t match the actual DOM, the executor escalates: feed the intent and the hint and a trimmed accessibility tree to an LLM, ask it to find the right element. Strict and flexible coexist in the same recipe. You only pay the cost of LLM grounding when the cheap path actually breaks.
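The escalation order can be sketched in a few lines of Ruby. Everything here is a stand-in: `act` performs the action against a selector and reports success, `state_ok` checks expected_state, and `ground` represents the LLM call that turns an intent and a hint into a selector. None of these names come from the engine.

```ruby
# Hypothetical guidepost executor: strict selector first, LLM grounding
# only when the cheap path fails, and a hand-off to the human after that.
def run_step(step, act:, state_ok:, ground:)
  # Cheap, deterministic path.
  return :strict if act.call(step["selector"]) && state_ok.call(step["expected_state"])

  # Escalate: ask the (stubbed) LLM to re-ground the step from intent and hint.
  grounded = ground.call(step["intent"], step["hint"])
  return :grounded if grounded && act.call(grounded) && state_ok.call(step["expected_state"])

  :needs_human # nothing worked; this is where a human_intervention would fire
end

# Usage with a fake DOM in which the strict selector has drifted.
dom = ["button.more-menu-connect"] # made-up selector standing in for the moved button
step = {
  "selector" => "button[aria-label='Connect']",
  "intent" => "Open connect-with-note modal",
  "hint" => "Click Connect; if hidden, it's under the three-dot 'More' button",
  "expected_state" => "Textarea for personal note is visible"
}
result = run_step(step,
                  act: ->(sel) { dom.include?(sel) },
                  state_ok: ->(_expected) { true },
                  ground: ->(_intent, _hint) { "button.more-menu-connect" })
puts result
```

Because `ground` is only called after the strict path fails, a healthy recipe never pays for an LLM call.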

The actually-interesting part

None of the above is novel. Stagehand and Browser Use have been doing intent-grounded browser actions for over a year now. What I haven’t seen anyone do, and what I think is the interesting part, is closing the loop on selector drift through the same human who is sitting next to the browser anyway.

The mechanism is an observation mode bounded to recipe execution. Default state: the engine is not watching me. When a recipe hits a step it cannot execute (selector missing, expected state wrong), the executor fires a human_intervention with “I lost the Connect button.” While I am completing that step manually, the engine flips on a CDP event subscription. Every click I make, every string I type, every navigation I trigger gets recorded into a buffer along with the stable selector path the engine can infer from the event target. When I finish my recovery and hit Continue, the engine flips the observation off again. The LLM then diffs my observed clicks against the broken recipe and opens a pull request against the recipe’s home repo. The new selector goes into a learned_selectors[] chain alongside the old one; the old one stays as a fallback because sites sometimes A/B-test their DOMs for weeks.
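The off-by-default property is the part worth pinning down in code. The sketch below is a hypothetical buffer, not the engine's implementation: the class and method names are invented, and the CDP subscription is replaced by direct `record` calls.

```ruby
# Hypothetical observation buffer: events are kept only between
# start (step failed, human is recovering) and stop (human hit Continue).
class ObservationBuffer
  def initialize
    @recording = false
    @events = []
  end

  def start
    @events = []
    @recording = true
  end

  # Stops recording and returns a copy of what was captured.
  def stop
    @recording = false
    @events.dup
  end

  # Would be called for every browser event; a no-op unless recording.
  def record(type, selector)
    @events << { type: type, selector: selector } if @recording
  end
end

buffer = ObservationBuffer.new
buffer.record(:click, "a.home")                    # ignored: not recovering
buffer.start                                       # a recipe step just failed
buffer.record(:click, "button.more")               # made-up selectors
buffer.record(:click, "button.more-menu-connect")
captured = buffer.stop                             # human hit Continue
buffer.record(:click, "a.profile")                 # ignored again
puts captured.map { |e| e[:selector] }.inspect
```

Only the two clicks made during the recovery window end up in `captured`; that bounded list is what the LLM would diff against the broken recipe.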

I review the PR like I’d review any other contributor’s PR. I merge. The next scheduled git pull propagates the fix to my engine. The next time the recipe runs, the new selector wins. The cycle is human-in-the-loop because I’m the one doing the manual recovery, but the LLM is doing the work of noticing what I did and writing it down in a place future runs will inherit.
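One way the merged fix could behave at run time is sketched below. The field name learned_selectors comes from the design; the ordering (newest learned selector first, original selector last as the fallback) and the helper names are assumptions.

```ruby
# Hypothetical helpers for the learned_selectors[] chain.

# What a merged PR would effectively do to a step: put the newly learned
# selector at the front, keep earlier ones, and leave the original untouched.
def learn_selector(step, new_selector)
  learned = ([new_selector] + step.fetch("learned_selectors", [])).uniq
  step.merge("learned_selectors" => learned)
end

# Order the executor would try: learned selectors first, original last.
def selector_chain(step)
  step.fetch("learned_selectors", []) + [step["selector"]]
end

# `present` is a stand-in for "does this selector match in the live DOM?"
def resolve_selector(step, present)
  selector_chain(step).find { |sel| present.call(sel) }
end

step = { "action" => "click", "selector" => "button[aria-label='Connect']" }
step = learn_selector(step, "button.more-menu-connect") # made-up selector

new_dom = ->(sel) { sel == "button.more-menu-connect" }
old_dom = ->(sel) { sel == "button[aria-label='Connect']" }
puts resolve_selector(step, new_dom) # the learned selector wins
puts resolve_selector(step, old_dom) # the original still works as a fallback
```

Keeping the original selector in the chain is what covers the case where a site serves the old and new layouts side by side for a while.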

This is the only piece of the system that tries to dissolve a paradox rather than route around it. Polanyi’s old observation — that we know more than we can tell — lands hard on browser automation. When LinkedIn moves the Connect button I can find its new location instantly; I couldn’t write down how I knew where to look. That recognition is tacit knowledge in exactly the sense Polanyi meant it. The teach loop converts the tacit into the explicit: I demonstrate by doing, the engine observes, and the new selector lands in learned_selectors[] as a thing the executor can reuse on its own next time. Tacit knowledge, externalized one PR at a time.

The thing I like about this is that the recipe gets more reliable over time through normal use, not less. Selector drift becomes self-healing through symbiosis. Every other browser-automation tool I’ve seen degrades silently as sites change. This one repairs itself by leveraging the work I was doing anyway. It also means that the recipes don’t need to be written perfectly — a good-enough recipe with weak selectors becomes a great recipe after the first few real runs, because every break is a teaching moment.

Why 2014 me would be confused

2014 me would have built krake_browser by hard-coding a CAPTCHA detector, then a 2FA detector, then a login-wall detector, then a soft-block detector, in an ever-expanding list of special-cased “the bot can’t do this” cases. The DSL would have accreted twenty more action types, each one solving a specific anti-bot pattern. The thing would have been brittle and would have shipped slower with every release because the surface area kept growing.

2026 me has the LLM as a copilot at both ends of the loop. At runtime it picks the recipe, supplies the variables, surfaces my interventions, and acks them on my behalf when the recipe and the conversation both say it’s clear. At maintenance time it watches me recover from breaks and writes the fix back into the recipe library. The DSL doesn’t need to anticipate every site change because the LLM can re-ground at runtime and the human can teach it at recovery time. The action vocabulary stays small and stable; the intelligence sits in the LLM, the human, and the feedback loop between them.

I will say, in the same tone I used in the post from last week about why the TrueSight stack is so deliberately thin: the boring move here is the right one. Most of krake_browser is a hosted Chrome, a Sinatra app, a CDP attach, and a JSON DSL from 2014. The new part is two action types and a recovery-mode flag. The intelligence comes from the people and the LLMs around the system, not from anything novel inside it. That ordering is the whole point. solve_captcha was the first action that knew the truth. I just spent twelve years catching up to it.

The same primitive, two political configurations

It’s worth saying out loud what the teach loop’s technical primitive actually is: observe a user’s interactions with a browser, infer patterns, use them to make automation more reliable. Per Reuters, Meta started capturing employee mouse movements and keystrokes last month, and that capture rests on the same primitive. The technical similarity is real. The configurations are opposite in every way that matters.

Meta’s version: capture is on by default, broad in scope, centralized, and the value accrues to Meta’s models. The employee is the sensor; the company is the beneficiary. The trust model is the same one every workplace-surveillance system has run on since punch cards: you are being measured for your employer’s purposes, and your participation is a condition of employment.

The krake_browser version: capture is off by default, narrow in scope (bounded to the few seconds after a recipe step fails and the human is recovering), local (the data never leaves the laptop), reviewed (every captured selector ships as a pull request the operator merges or rejects), and the value accrues to the operator: their recipe gets better; their downstream operational work gets easier. The operator is the sensor and the beneficiary; the LLM is the scribe in the middle.

I don’t think the technical primitive is the problem. The same observation channel that lets Meta monitor its employees can let me repair my own recipes — depending on who owns the laptop, who reviews the inferences, and where the value ends up. The interesting question is whether tools built around this primitive default to the Meta configuration because that’s where the money is, or to a configuration where the human at the keyboard is the one being made more capable. krake_browser is a small bet on the second.

Where this is going

The three repos are scaffolded. The READMEs, the architecture doc, the recipe schema, and the first four sample recipes (one for WhatsApp, one for LinkedIn, one for Instagram, one for the FDA portal) are in. The engine itself (the Sinatra app, the CDP attach, the MCP server, the recipe executor that walks actions[] and pauses on human_intervention) is what I’m building next. The validation criterion is the smallest thing I could plausibly demo: one WhatsApp message, drafted by an LLM, approved by me in chat, sent through my own logged-in WhatsApp Web. Once that gate passes, the guidepost layer and the teach-by-narration loop become engine v0.2. After that works on real partner check-ins for a few weeks, there will probably be a follow-up post with a recorded screencast and a more honest assessment of what worked and what was thinkier-than-it-needed-to-be.

The direction beyond v0.2 I want most for my own work is form comprehension. Most operational forms — FDA facility registration, partner onboarding paperwork, the eight different shipping-partner intake pages I’ve filled out this year — are mostly an exercise in reading a lot of text just to type in three answers I already know. A v0.3 llm_fill_form action would let the recipe say “here’s the form, here’s the context, fill it in.” The LLM reads the page, pulls from the recipe’s context block plus optional external lookups, fills every field, and attaches a confidence score and a one-line reasoning per field. The human_intervention that follows surfaces only the low-confidence fields for explicit review; everything above the threshold is pre-approved. I read three fields instead of fifty. The same human-approval-at-the-end pattern that makes human_intervention safe makes this safe too — with the explicit guardrail that the LLM has to show its work per field, so the spot-check surface stays reliable enough to skip the full read. Without that guardrail you slide right back into the Open Claw failure mode this post opened with.
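The review-surface idea reduces to a partition over per-field results. The sketch below assumes a result shape (field, value, confidence, reasoning) and a 0.9 threshold; both are illustrative, since llm_fill_form does not exist yet, and the sample values are made up.

```ruby
# Hypothetical triage for llm_fill_form output: fields below the confidence
# threshold go to the human; the rest are treated as pre-approved.
def triage_fills(fills, threshold: 0.9)
  needs_review, pre_approved = fills.partition { |f| f[:confidence] < threshold }
  { needs_review: needs_review, pre_approved: pre_approved }
end

# Made-up sample data; every field carries its own one-line reasoning.
fills = [
  { field: "facility_name", value: "Example Cacao Co.", confidence: 0.98,
    reasoning: "Matches the recipe context block." },
  { field: "registration_type", value: "Renewal", confidence: 0.55,
    reasoning: "Page wording was ambiguous." }
]

result = triage_fills(fills)
result[:needs_review].each do |f|
  puts "#{f[:field]}: #{f[:value]} (#{f[:reasoning]})"
end
```

The human_intervention prompt would then list only the `needs_review` entries, with the reasoning attached so the spot check has something to check against.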

I am writing this now, before the engine actually works, mostly because the design feels coherent to me in a way it didn’t a week ago and I want to capture it while the framing is fresh. There’s a real risk that the implementation reveals something I haven’t thought of and the design changes. That’s fine. The shape of the post will hold up even if specific decisions don’t.

If you’ve been running browser-automation tools that pretend you aren’t in the loop, and you’re tired of pretending: this is what it looks like to design for the truth instead. solve_captcha, all grown up.