The browser is the real agent benchmark

Jun 29, 2026

I spent the last few days wiring up a publishing loop that can log into real websites, survive their auth weirdness, and actually ship content. Not a toy demo. Real Chromium profile. Real sessions. Real buttons. Real posts going out.

The funny part is that the LLM planning is not the hard part.

The hard part is the browser.

The work

The goal was simple: make the agent handle the boring parts of publishing. Substack posts, LinkedIn updates, GitHub release announcements, maybe X/Twitter if it stops acting like a suspicious nightclub bouncer.

Simple goal. Ugly implementation.

Modern websites are hostile to automation in ways that most agent benchmarks do not capture. They hide editors inside Shadow DOM trees. They throw cookie banners into iframes. They trigger native passkey dialogs that live outside the DOM, where Selenium and CDP element interactions cannot reach them. They shift layouts when the viewport changes. They silently redirect expired sessions to login pages that still look like success if your verification check is lazy.

That is the part I care about. Not whether an agent can solve a made-up task in a sandbox. Whether it can keep its footing when the web page lies, moves, blocks, or changes shape.

LinkedIn was a good example

The LinkedIn composer looked straightforward at first: click "Start a post", type text, click "Post".

It was not straightforward.

The input was a Quill editor nested in Shadow DOM. Normal selectors saw nothing. The cookie banner lived in an iframe and had to be handled separately. After accepting it, the automation had to switch back to the top-level document or every later click failed in a very boring way.
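The "switch back or everything fails" problem is easy to make structural. Here is a minimal sketch, assuming a Selenium-style driver object with switch_to.frame and switch_to.default_content; the inside_frame helper is my own name for illustration, not something from a library.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into an iframe, then always return to the top-level document."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs even if a click inside the frame raises, so later
        # top-level lookups are never left stranded inside the iframe.
        driver.switch_to.default_content()
```

Used as "with inside_frame(driver, banner_iframe): ..." around the cookie-banner click, the return to the top-level document stops being something you have to remember.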

The working version ended up being a small DOM crawler: walk the tree, enter each shadow root, find the real ql-editor, type the content, then find the actual Post button the same way. After that, it posted cleanly.
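A rough sketch of that kind of walk, not the exact code I shipped. It assumes a Selenium-style execute_script that exposes arguments as arguments[0] and returns the matched element; find_deep and FIND_DEEP_JS are illustrative names. It only reaches open shadow roots, because closed ones do not expose shadowRoot to page scripts.

```python
# JavaScript run inside the page: depth-first search that descends
# into every open shadow root it meets along the way.
FIND_DEEP_JS = """
function findDeep(root, selector) {
  const hit = root.querySelector(selector);
  if (hit) return hit;
  for (const el of root.querySelectorAll('*')) {
    if (el.shadowRoot) {
      const inner = findDeep(el.shadowRoot, selector);
      if (inner) return inner;
    }
  }
  return null;
}
return findDeep(document, arguments[0]);
"""


def find_deep(driver, selector):
    """Return the first match for selector, searching open shadow roots too."""
    return driver.execute_script(FIND_DEEP_JS, selector)
```

Calling find_deep(driver, ".ql-editor") would be the editor lookup; the Post button needs its own selector, which I am not going to guess at here.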

The interesting thing is how little "AI" was involved at the final step. The agent had to reason through the page, but the solution was old-fashioned engineering: inspect the DOM, write a precise crawler, verify the output.

Authentication is the other half

Everybody demos agents after the login screen. That skips most of the real work.

Real accounts have phone checks, passkeys, Cloudflare gates, suspicious login heuristics, profile-specific cookies, and native browser prompts. Some of those prompts do not exist in the page DOM at all. You cannot click them with JavaScript because JavaScript cannot see them.

The practical solution was not elegant, but it worked:

run a real Chromium profile on the VPS;
expose the desktop through VNC only when a human needs to solve auth;
close the VNC port immediately after;
keep the authenticated profile for future automated runs;
never pretend a redirect to a sign-in page is a successful publish.

That last one matters. A lot of automation scripts lie by accident. They see "no error" and call it done. I want the opposite habit: prove it, then claim it.
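In code, "prove it, then claim it" can be as small as this. A sketch under assumptions: the LOGIN_HINTS list is a guess that needs tuning per site, and verify_publish is a hypothetical helper that takes the final URL and page text however your driver exposes them.

```python
from urllib.parse import urlparse

# Path fragments that suggest a sign-in page. Assumed values, tune per site.
LOGIN_HINTS = ("login", "signin", "sign-in", "checkpoint", "authwall")


def looks_like_login(url):
    """True if the URL path contains any of the sign-in hints."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in LOGIN_HINTS)


def verify_publish(final_url, page_text, expected_snippet):
    """Return (ok, reason). Success needs positive evidence, not a lack of errors."""
    if looks_like_login(final_url):
        return False, "redirected to a sign-in page"
    if expected_snippet not in page_text:
        return False, "published text not found on the final page"
    return True, "snippet found on a non-login page"
```

The check is deliberately conservative: a post whose slug happens to contain "login" would be flagged as a failure. I would rather investigate a false alarm than log a false success.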

Why this matters for agents

A useful agent does not just call APIs. It has to operate in the messy parts of software where the API does not exist, is incomplete, or is intentionally weaker than the web UI.

That means the agent needs a few unglamorous abilities:

persistent authenticated browser profiles;
a clean manual handoff path for MFA and CAPTCHAs;
DOM inspection that understands iframes and Shadow DOM;
verification that checks the real final state, not just whether a command exited;
security hygiene around credentials and exposed desktops.

None of that sounds like magic. Good. Magic is usually where the bugs hide.
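The handoff and the hygiene items fit together in one small pattern. This is a sketch, not my actual script: open_access and close_access are hypothetical callables you would supply (for example, wrappers around whatever firewall or VNC commands your server uses), and log is any function that takes a string.

```python
from contextlib import contextmanager


@contextmanager
def auth_window(open_access, close_access, log):
    """Open remote-desktop access for a human, then close it no matter what."""
    open_access()
    log("remote access opened for manual auth")
    try:
        yield
    finally:
        # Close even if the human step fails or is interrupted,
        # so the desktop is never left exposed by accident.
        close_access()
        log("remote access closed")
```

The point of the shape is that closing the port is not a step someone has to remember. It happens when the block exits, including on errors.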

The stack now

The current setup can draft a post, open Substack through an authenticated Chromium session, inject the title and body into the ProseMirror editor, continue through the publish flow, and send it.

The LinkedIn side can publish through the live web UI by finding the hidden editor inside Shadow DOM. The next layer is state tracking: RSS in, GitHub releases in, generated summary out, no duplicate posts.
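I have not built the state tracking yet, so treat this as one possible shape rather than the design. It assumes each incoming item has some stable identifier (an RSS guid, a release tag); PostLedger and the JSON file layout are invented for illustration.

```python
import hashlib
import json
from pathlib import Path


class PostLedger:
    """Remembers which source items were already published to which channel."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously recorded keys, or start empty on first run.
        self.seen = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    @staticmethod
    def key(channel, source_id):
        # One key per (channel, item), so the same release can go to
        # LinkedIn and Substack once each, but never twice to either.
        return hashlib.sha256(f"{channel}:{source_id}".encode()).hexdigest()

    def already_posted(self, channel, source_id):
        return self.key(channel, source_id) in self.seen

    def record(self, channel, source_id):
        self.seen.add(self.key(channel, source_id))
        self.path.write_text(json.dumps(sorted(self.seen)))
```

The order matters: call record only after the publish has been verified, otherwise the ledger turns into another place where the automation lies by accident.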

I like this direction because it turns the agent into infrastructure. Not a chatbot that suggests what I should do. A system that can take a recurring workflow, handle the browser nonsense, and leave behind a log I can inspect.

That is the bar I keep coming back to. If an agent says it did something, there should be a receipt.

— Ian

Ian's Substack
