And Lo,
for the Earth was empty of Form,
and void.
And Darkness was all over the Face of the Deep.
And We said: 'Look at that fucker Dance.'
― David Foster Wallace, Infinite Jest
Early Bukowski is so weird and pretentious and sad.
Far from the image of Bukowski recognized in certain b-tier art consumer circles today: the tired but ultimately complacent and entertainingly vulgar, foul-mouthed, grotesque, post-sexual old fuck.
It is a frustrated and resentful young Buk, aggressively naive, lobbing trite and tired old observations about consumerism and superficiality at a society he sees himself as somehow not belonging to, if not outright superior to. Criticizing, unsubtly justifying himself and his obvious misery instead of trying to merely 'convey' and 'express'.
We see a man utterly, yet understandably, incapable of conceiving of a reader who is genuinely curious to read about who he is, what he sees and says and does, to get to know him as opposed to his thoughts and opinions and, strangely, words.
A sort of like, archetypal adolescent lamentation, representative of types that he would, himself, eventually come to describe as: 'writing about life as if they had a real angle on it.'
He, to put it rather mildly, bores. If his trick is being such a disastrous mess that you cannot, for the life of you, look away, then clearly he isn't there yet, as he writes.
The aspiration to come out on top is still there, and it's so familiarly tedious and dull, another ambitious striver making ostensibly pointed remarks on the sourness of the proverbial grapes. He loathes and resents us for having and enjoying and taking for granted all these things which he will, eventually, come to confess a profound pain of deprivation and unmet need for. The deathly-tired old toad now brooding, ruminating, reflecting on all those things he never had, will never have. The deep and festering wounds of rejection, isolation and pain ever wide open, never healed, nor closed or scarred. A Bukowski who, in due time, wearing his colossal gash on the proverbial sleeve, would gather small and rowdy crowds in small rooms translucent with smoke and loud with jeering and swear words, crowds that never missed a good laugh at a bodily-function type joke but would often seem to fail to perceive the endings of his rather more serious pieces and, like, forget to applaud, or respond, the way that failing to add the appropriate inflection at the end of a sentence can leave your interlocutor like, hanging, waiting for you to go on.
Because, still, at that point, we see a man, who, like so many of us, still hopes and strives, kicks back and wants, grasps for it, frustrated and sad.
A while back I was looking into HTTP GETs timing out over the loopback interface. Big server, plenty of headroom, no network in the path — just curl talking to nginx on the same box. Should have taken under a second. Was hitting my 10s timeout instead. Not every request. Just some of them.
First thing you do is check the error log. Nothing. Access log? Everything looks normal. syslog? Nothing of interest. dmesg? Clean. No OOM, no kernel weirdness, the box wasn't even sweating.
So I did the most basic thing imaginable. Wrote a small bash loop that fired GETs at the server forever and printed how many requests it had done since the last timeout. Then I sat there and watched logs scroll by in another terminal, waiting to see if anything suspicious happened in syslog at the moment a timeout occurred.
If I have ever felt like more of a monkey, I'm not sure when that was.
And then it occurred to me that there was a way to match curl's timeouts to nginx's access log lines. The server was already logging the User-Agent. So instead of just printing the iteration number to my terminal, I started sending it as the User-Agent of each request:
i=0
while true; do
    i=$((i+1))
    # -m 10: give up after 10s; the iteration number rides along as the User-Agent
    curl -m 10 --user-agent "iter:$i" -s -o /dev/null \
        -w "%{http_code} %{time_total} iter:$i\n" \
        http://127.0.0.1/
done
Now when curl reported a timeout I had a unique iteration number I could grep for in the access log. I waited for one, took the iteration from curl's output, grepped the access log.
Status: 200.
Ok. Probably typo'd the grep. Try again. Wait for another timeout. Grep.
200.
Every single timed-out request was logged as 200 by nginx. Every. Single. One. The server believed it had served a request that the client never received.
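The grep-and-compare step can be scripted. A minimal Python sketch, assuming the loop's output was saved line by line and that the access log puts the status right after the quoted request line, with the User-Agent somewhere later on the same line. The function names and the log layout are my assumptions for illustration, not anything nginx guarantees.

```python
import re

ITER_RE = re.compile(r"iter:(\d+)")
# Assumed access-log layout: ... "GET / HTTP/1.1" 200 ... "iter:42"
STATUS_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3})')


def timed_out_iterations(curl_lines, timeout_s=10.0):
    """Iterations where curl got no status (000) or ran into the -m limit."""
    out = []
    for line in curl_lines:
        parts = line.split()
        m = ITER_RE.search(line)
        if not m or len(parts) < 2:
            continue
        try:
            elapsed = float(parts[1])
        except ValueError:
            continue
        if parts[0] == "000" or elapsed >= timeout_s:
            out.append(int(m.group(1)))
    return out


def logged_status(access_lines):
    """Map iteration number -> status the server logged for that request."""
    statuses = {}
    for line in access_lines:
        it, st = ITER_RE.search(line), STATUS_RE.search(line)
        if it and st:
            statuses[int(it.group(1))] = int(st.group(1))
    return statuses


def disagreements(curl_lines, access_lines):
    """Requests the client gave up on, paired with what the server logged."""
    statuses = logged_status(access_lines)
    return [(i, statuses.get(i)) for i in timed_out_iterations(curl_lines)]
```

Every pair that comes back with a 200 in it is a request where the two sides disagree about what happened.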
So then I stopped and stared into the abyss of the terminal.
Once you stop staring, the explanation isn't even that exotic. nginx logs the request when it considers it complete, and "complete" for nginx means "I finished writing the response into the kernel send buffer". It does not mean the client acknowledged anything. It does not mean the connection closed cleanly. There is no FIN-ACK in this picture. From nginx's point of view the transaction ended when the last writev / sendfile / SSL_write returned and there was nothing more to send. Status: 200. bytes_sent: whatever. Log line written. Move on.
Whatever happens on the wire after that is somebody else's problem. And on loopback, apparently, somebody else had a problem.
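You can watch the same gap between "written" and "received" in a few lines. This is only an analogy: a local Unix socket pair standing in for the TCP connection, with a function name I made up. The point it shows is just that a successful write says the kernel took the bytes, nothing more.

```python
import socket


def write_without_reader(payload: bytes) -> int:
    """Send payload to a peer that never calls recv(); return bytes accepted."""
    writer, reader = socket.socketpair()
    try:
        # send() returns once the kernel has copied the bytes into its buffer.
        # The peer has read nothing, and the writer has no way to tell.
        return writer.send(payload)
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    accepted = write_without_reader(b"HTTP/1.1 200 OK\r\n\r\nok\n")
    print(f"kernel accepted {accepted} bytes; the reader consumed none")
```

From the writer's side this is indistinguishable from a delivered response, which is exactly the position nginx is in when it writes the log line.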
I think a lot of people read an access log the way they'd read the receipt at a restaurant — status 200, you got the food. But it's more like reading the kitchen ticket. The kitchen says the food went out. Whether it ever made it to your table is a separate question, and the access log is not going to answer it for you.
Once I understood this I stopped chasing nginx and went to look at the actual TCP state, which is where the answer was always going to be. Send buffer was full, nothing was being acked from the other end of loopback (yes, loopback can do this — that's a separate rabbit hole), eventually curl gave up. nginx never noticed because nginx had already moved on.
I'd love to tell you the next chapter and what was actually wrong on the kernel side, but that's a different post and honestly I'm still not 100% sure. The point of this one is simpler: stop trusting the access log.
If your access log shows 200 and your client says it timed out, your client is right. Always. The access log is telling you what the server intended; it is not telling you what happened to the bytes.
A few things I do differently now:
- I keep $bytes_sent and $request_length in my log format and sanity-check them when I have weird timeouts. If $bytes_sent accounts for the whole response (headers plus the full content length), the server believes it did its job. If it's short, that's a clue.
The lesson here is not really about nginx. nginx is doing exactly what it's documented to do. The lesson is that every log line is triggered by some specific event in the code, and if you don't know which event, you don't know what the line means. Otherwise you end up where I was, staring at a terminal at 11pm wondering why a server that says it succeeded is also the one that's making your client time out.
Ask me how I know.
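For the byte-count sanity check, a small Python sketch. It assumes a key=value log format that I am inventing here: body= carrying nginx's $body_bytes_sent and cl= carrying $sent_http_content_length. Both variables exist in nginx, but neither the field names nor the format are from my actual config.

```python
import re

FIELD_RE = re.compile(r"(\w+)=(\S+)")


def short_writes(log_lines):
    """Return (line, sent, expected) for responses whose body was cut short."""
    short = []
    for line in log_lines:
        fields = dict(FIELD_RE.findall(line))
        try:
            sent = int(fields["body"])
            expected = int(fields["cl"])
        except (KeyError, ValueError):
            continue  # field missing, or no Content-Length (logged as "-")
        if sent < expected:
            short.append((line, sent, expected))
    return short
```

An empty result does not prove the client got the bytes. It only says the server believes it wrote all of them, which is the whole problem.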
Hot take: HTTP/3 / QUIC is one of the cleanest case studies of dev<>ops tension I can think of, and almost nobody frames it that way. Everyone wants to talk about head-of-line blocking and 0-RTT. Fine. But the actual organizational consequence of QUIC is that the transport layer just walked out of the kernel and into your application binary, and that means it walked out of the ops team's hands and into the developers' laps.
For three decades the deal was: developers write the app, ops owns the network. TCP lived in the kernel. Congestion control, retransmits, backoff, RTO — all of that was somebody else's problem. You sysctl-tuned a few knobs, you upgraded the kernel for a new congestion controller, and the same TCP stack served every process on the box. One place to observe it. One place to fix it. ss, tcpdump, netstat, the /proc/net/* counters, nstat — an entire ecosystem of tools that assume the kernel is the source of truth about your connections.
QUIC throws that contract out. QUIC runs over UDP, and the entire reliable-transport state machine (streams, flow control, congestion control, loss detection, retransmits, ACKs, the works) lives in userspace, inside whatever library the application happens to link against. nginx-quic, ngtcp2, quiche, msquic, lsquic, quic-go, Chromium's implementation. They are not the same code. They do not behave the same under loss. They do not expose the same counters. There is no ss for QUIC. The kernel sees UDP datagrams and shrugs.
That is a transfer of responsibility, dressed up as a protocol upgrade.
Concretely: when something is wrong at the transport layer in a QUIC deployment, the ops-side answer ("let me check the kernel, the NIC, the sysctls") gets you basically nowhere. The behavior you care about is now a function of which library your developers chose, which version of it they pinned, what congestion controller it implements, and what it decided to log. If the dev team picks one library and the CDN in front picks another, those are two different transports that happen to interoperate on the wire. The properties you used to get for free — one TCP stack, kernel-tuned, observable with standard tools — are gone, and getting them back is a development task, not an ops task.
This is also why a lot of organizations quietly hate operating QUIC even when they love serving it. The thing that made TCP boring — that it was somebody else's code, running in somebody else's address space, observable with somebody else's tools — is exactly the thing QUIC gives up. In exchange you get to ship transport-level changes without waiting for a kernel release, which is genuinely great if you are Google and terrible if you are a four-person ops team trying to figure out why p99 went sideways on Tuesday.
None of this is an argument against QUIC. It's an argument that QUIC is not just a protocol change, it's an org-chart change, and the people who have to debug it at 3am are usually not the ones who got to vote on it.
If your nginx alerting is built on $upstream_response_time because you read somewhere that it's "how long the request took to be served", you are quietly underreporting a class of incident. Specifically: any incident where requests pile up behind proxy_cache_lock.
The thing I had to learn the slow way: time spent waiting on the cache lock is counted in $request_time but not in $upstream_response_time. They are not the same metric and they don't differ by a constant. Under contention they can drift apart by seconds, and the variable that "feels" like the right one to alert on is the variable that is hiding the problem.
proxy_cache_lock exists to coalesce parallel cache-fill requests. If ten clients ask for the same uncached object at the same time, you don't want ten parallel fetches against your origin — you want one fetch and nine followers waiting for the result. The followers, while they wait, are blocked. That wait is real wall time. It's user-visible. It's in $request_time. It is not in $upstream_response_time, because $upstream_response_time measures only the upstream call: connect plus send plus read. Time the request spent pinned to a lock isn't part of that, and nginx is not going to retroactively add it for your convenience.
Single-worker nginx, cache-locked location, a lua origin that sleeps before responding. Same cache key for every request so the lock is shared.
daemon off;
master_process on;
worker_processes 1;
error_log stderr;

events {}

http {
    log_format cachetest
        '$time_iso8601 conn=$connection req#$connection_requests '
        '"$request" $status cache=$upstream_cache_status '
        'rt=$request_time urt=$upstream_response_time';

    proxy_cache_path /mnt/disk1/cd levels=1:2 keys_zone=normal:10m
        max_size=10m inactive=1h use_temp_path=off;

    upstream origin { server 127.0.0.1:8081; }

    # Caching proxy under test
    server {
        listen 8080;
        access_log /dev/stderr cachetest;

        location / {
            proxy_pass http://origin;
            proxy_cache normal;
            proxy_cache_key $uri;
            proxy_cache_valid 200 1h;
            proxy_cache_lock on;
            proxy_cache_lock_timeout 2s;
            proxy_cache_lock_age 1h;
        }
    }

    # Slow origin (content_by_lua_block needs the lua module)
    server {
        listen 127.0.0.1:8081;

        location / {
            content_by_lua_block {
                ngx.sleep(3)
                ngx.say("ok")
            }
        }
    }
}
Drive it with five parallel curl calls to the same URL.
Origin sleeps 3s, proxy_cache_lock_timeout 2s. The leader fetches upstream. The four followers sit on the lock for 2s, the lock times out (the leader hasn't finished — it'll finish at 3s), and each follower then goes upstream itself.
conn=2 cache=MISS rt=3.010 urt=3.006 # leader
conn=4 cache=MISS rt=5.010 urt=3.006 # follower
conn=6 cache=MISS rt=5.010 urt=3.006
conn=7 cache=MISS rt=5.010 urt=3.006
conn=8 cache=MISS rt=5.010 urt=3.006
For each follower, rt - urt comes to 2.0s, give or take a few milliseconds. That's the lock wait. urt reflects only the upstream call, identical for all five.
If your dashboard is graphing urt p99, this incident is invisible. Every request "took 3 seconds upstream", which is technically true and operationally useless. The clients waited 5.
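Computing that gap from the log is a one-liner per request. A rough Python sketch against the cachetest format above, assuming rt= and urt= are the last two fields. The function names are mine, and the handling of "-" and of comma- or colon-separated values is there because $upstream_response_time can carry either.

```python
import re

RT_RE = re.compile(r"\brt=([\d.]+)")
URT_RE = re.compile(r"\burt=([-\d.,: ]+)")


def upstream_total(raw):
    """Sum the upstream times in a raw $upstream_response_time value."""
    parts = [p.strip() for p in re.split(r"[,:]", raw)]
    return sum(float(p) for p in parts if p and p != "-")


def non_upstream_time(line):
    """Seconds of $request_time not explained by the upstream call(s)."""
    rt, urt = RT_RE.search(line), URT_RE.search(line)
    if not rt or not urt:
        return None
    return round(float(rt.group(1)) - upstream_total(urt.group(1)), 3)
```

Graph that difference next to urt and the followers stop hiding: the leader sits near zero while they sit around two seconds.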
Now make the origin uncacheable. Same 3s sleep, but the response sets Cache-Control: no-store. Push the lock timeout high enough that it doesn't fire (proxy_cache_lock_timeout 60s). The cache never populates, so each upstream call doesn't satisfy any followers; the lock just keeps getting passed down the line.
content_by_lua_block {
ngx.header["Cache-Control"] = "no-store"
ngx.sleep(3)
ngx.say("ok")
}
rt=3 urt=3 # leader
rt=6 urt=3 # waited 3s
rt=9 urt=3 # waited 6s
rt=12 urt=3
rt=15 urt=3
urt stays flat at 3s. rt climbs by the cumulative lock wait. If you're alerting on the upstream metric, the box looks fine: every upstream call takes 3 seconds, which is what it always takes. Meanwhile the fifth client waited fifteen.
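Both runs fit a very small model. This is a toy in Python, not a description of how nginx schedules anything: it assumes all clients arrive at the same instant, that a follower waits either for the lock holder or for the lock timeout (whichever ends first), and that an uncacheable response hands the lock to the next client in line. The function name and parameters are made up for the sketch.

```python
def simulate(n_clients, origin_s, lock_timeout_s, cacheable):
    """Return a list of (rt, urt) per client; urt is None for a cache hit."""
    results = []
    for k in range(n_clients):
        if k == 0:
            wait = 0.0  # the leader takes the lock immediately
        elif cacheable:
            # Followers wait for the leader's fill or for the lock timeout.
            wait = min(origin_s, lock_timeout_s)
        else:
            # Nothing is ever cached, so the lock is passed down the line.
            wait = min(k * origin_s, lock_timeout_s)
        # A follower is served from cache only if the fill beat the timeout.
        hit = cacheable and k > 0 and lock_timeout_s >= origin_s
        urt = None if hit else origin_s
        rt = wait + (0.0 if hit else origin_s)
        results.append((rt, urt))
    return results


if __name__ == "__main__":
    print(simulate(5, 3, 2, cacheable=True))    # first run: lock times out
    print(simulate(5, 3, 60, cacheable=False))  # second run: lock handed down
```

In both cases the second element never moves, which is the point: the upstream number is constant by construction and all the variation lives in the wait.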
The reason this surprises people, I think, is that "upstream response time" sounds like the longest-running, most-distal thing nginx waits on, and intuitively the longest-running thing should be the dominant component of latency. But nginx is bookkeeping, not narrating. $upstream_response_time is "time spent talking to the upstream server", measured from the perspective of the connect/send/read calls. Time spent waiting on internal coordination — locks, queues, anything that isn't the upstream socket — is not in there. It's a measurement of one specific subsystem, and the name oversells it.
The fix is boring: alert on $request_time if you want to know what the client experienced. Keep $upstream_response_time in the log so you can compute rt - urt and see which component is moving. A high rt with a flat urt across many parallel requests for the same key is the tell-tale of cache-lock contention, and there isn't really anything else that produces that exact shape.
Same lesson as the access-log post, slightly different shape: every metric is the answer to a specific question. If you don't know which question, you don't know what you're looking at.
nginx can be incredibly unintuitive in some of its behaviors. Variables will often mean something either subtly or entirely different from what you think they mean, directives will do something other than what you expected them to, function names will suggest behaviors that the function does not exhibit at all.
But I want to reflect on something deeper than the myriad peculiarities of any one caching proxy server or another, and that's our inherent lack of intuition about parallelism.
Code agents will report that they wrote a thorough test. They will use words like "comprehensive" and "exercises the happy path and edge cases". They are guessing. Coverage data is ground truth.
For an nginx fork I work on, I have a Claude Code skill called /coverage-audit. It rebuilds with gcov instrumentation, wipes stale .gcda, runs a named pytest target, then reads the .gcov output for the source files the test actually touched. For each uncovered line it categorizes:
- *alloc returning NULL
- ngx_array_push failing in postconfig
- #if (NGX_DEBUG)
- #if 0
The first five categories are not worth chasing. Trying to test them with mocks gets you a coverage number and zero confidence. The last three are real gaps: write a test that hits them.
The skill emits something like:
foo.c: 90.62% of 192 lines
Testable:
- foo.c:175 — `return NGX_DECLINED` when feature off
→ add a test with `feature off;` in the server block.
- foo.c:93-94, 134-135 — `not_found = 1; return NGX_OK`
→ hit by plain-HTTP listener; add a non-SSL test.
Not worth chasing:
- Allocation failure: foo.c:393, 408, 422, 435
- Defensive guard: foo.c:241, 374, 454
Verdict: test exercises legitimate input combinations cleanly;
gaps are real branches plus fault-injection paths.
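The core of the audit is small enough to sketch. This is not the skill itself, just a minimal Python illustration of reading a .gcov file: lines whose count column is ##### (or =====) were never executed, and a crude heuristic, entirely my own assumption, buckets a line as an allocation-failure path when the line just before it looks like a NULL check.

```python
import re

# gcov line layout: "<count>:<lineno>:<source>", count is "-", a number,
# "#####" (never executed) or "=====" (unexecuted exceptional path).
GCOV_LINE = re.compile(r"^\s*([^:\s]+):\s*(\d+):(.*)$")


def uncovered(gcov_text):
    """Return (lineno, source, previous_source) for each unexecuted line."""
    rows, prev_src = [], ""
    for raw in gcov_text.splitlines():
        m = GCOV_LINE.match(raw)
        if not m:
            continue
        count, lineno, src = m.group(1), int(m.group(2)), m.group(3).strip()
        if lineno == 0:
            continue  # metadata rows such as "0:Source:foo.c"
        if count in ("#####", "====="):
            rows.append((lineno, src, prev_src))
        if src:
            prev_src = src
    return rows


def classify(row):
    """Very rough bucket: allocation-failure path vs. something testable."""
    _, _, prev_src = row
    if "NULL" in prev_src or "alloc" in prev_src:
        return "allocation failure"
    return "testable"
```

A real audit needs more context than one preceding line, which is why the categorising is left to the agent reading the source. The parsing is the mechanical part.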
Run the audit in a separate Claude session from the one writing the test. The auditor needs a different context — .gcov files, the source — and shouldn't share the implementer's commitment to its own work. Two sessions, one window each. The audit drives a punch list. The implementer works through it.
The second discipline is reviewing the commit graph, not just the files. Agents pile changes into whichever commit they currently care about. Earlier today my implementer squashed three test commits into one — fine — but it also dragged a Dockerfile change and a CI image-tag bump into the test commit, because those had been bundled with one of the tests originally. The right home for them was the earlier base-image commit. The fix:
$ git diff <squashed>~ <squashed> -- Dockerfile ci.yml > /tmp/base.patch
$ git diff <squashed>~ <squashed> -- tests/ > /tmp/tests.patch
$ git reset --hard <base-image-commit>~
$ git cherry-pick --no-commit <base-image-commit>
$ git apply --index /tmp/base.patch
$ git commit -F new-base-image-msg
$ git apply --index /tmp/tests.patch
$ git commit -F new-tests-msg
$ git cherry-pick <core-wiring-commit>
An agent will do this mechanically once you say "the Dockerfile hunks belong in commit X". It won't do it on its own. Where changes belong is a human judgment; mechanically rewriting the graph is the agent's strength.
The pattern: agents produce, gcov audits, humans curate the graph.
This site runs a JA4 module in nginx. If you hit /ja34 it reflects back the fingerprint nginx computed for your TLS/QUIC handshake (ciphers, extensions, ALPN, the lot), along with a pile of $ssl_* variables. JA4 is a fingerprint of how a client says hello, and it's only useful if it's correct. "Looks plausible" is not correct. A fingerprint nobody can independently check is just a string the server made up.
So the question that actually matters: is the JA4 my nginx emits the same JA4 that a known-good implementation would compute from the exact same bytes on the wire? You need a second opinion from something that didn't write the first one. Wireshark has had a JA4 implementation for a while, and it derives it from the raw ClientHello, completely independently of nginx. If nginx and Wireshark agree about the same handshake, I believe the number. That's the whole experiment: capture the handshake, decrypt it, read off Wireshark's JA4, compare.
The wrinkle is that this is HTTP/3. The handshake I want to inspect is a TLS 1.3 ClientHello riding inside QUIC CRYPTO frames, and QUIC encrypts its handshake. (See the earlier post about QUIC dragging the transport into userspace — this is the same bill coming due: there's no plaintext handshake to sniff anymore.) To see it I need the TLS secrets. curl will write them if you set SSLKEYLOGFILE, and tshark will read them via -o tls.keylog_file:… and use them to derive the QUIC keys. Standard stuff.
$ export SSLKEYLOGFILE=~/sslkeylogfile.log
$ tshark -i wlp44s0 -o tls.keylog_file:$SSLKEYLOGFILE \
-f 'udp port 443' -Y quic -V
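The keylog itself is a plain text file, one "LABEL client_random secret" triple per line, so checking what it holds for a given connection is easy to script. A minimal Python sketch, with function names of my own choosing and the required-label set assumed to be the usual four TLS 1.3 traffic secrets:

```python
REQUIRED = {
    "CLIENT_HANDSHAKE_TRAFFIC_SECRET",
    "SERVER_HANDSHAKE_TRAFFIC_SECRET",
    "CLIENT_TRAFFIC_SECRET_0",
    "SERVER_TRAFFIC_SECRET_0",
}


def labels_by_client_random(keylog_text):
    """Map client random (lowercase hex) -> set of labels logged for it."""
    found = {}
    for line in keylog_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        parts = line.split()
        if len(parts) != 3:
            continue  # not a LABEL / client_random / secret triple
        label, client_random, _secret = parts
        found.setdefault(client_random.lower(), set()).add(label)
    return found


def missing_labels(keylog_text, client_random):
    """Labels tshark would want for this connection that the file lacks."""
    present = labels_by_client_random(keylog_text).get(client_random.lower(), set())
    return REQUIRED - present
```

An empty result means the material is on disk for that client random. Whether the dissector can actually read the file is a separate question.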
The QUIC Initial packets decrypted fine — they always will, their keys are derived deterministically from the Destination Connection ID, no secrets needed. But every Handshake and 1-RTT packet after that gave me:
[Expert Info (Warning/Decryption): Failed to create decryption
context: Secrets are not available]
"Secrets are not available." Fine. Except I'd just cat'd the keylog and the secrets were sitting right there — CLIENT_HANDSHAKE_TRAFFIC_SECRET, SERVER_HANDSHAKE_TRAFFIC_SECRET, CLIENT_TRAFFIC_SECRET_0, the works — keyed by the same client random as the ClientHello in the capture. The secrets were not unavailable. They were on disk, in the file I'd pointed tshark at, correct.
My first theory was a race. Live capture is dissected as packets arrive; curl writes each secret only once it derives it during the handshake. Maybe tshark looked for the handshake key, didn't find it yet, cached the miss, and never went back once curl finished writing. Plausible. The standard fix for that is to not decrypt live: capture to a file, let the handshake fully complete and the keylog fully populate, then dissect the file.
$ tshark -i wlp44s0 -f 'host 162.19.246.242 and udp port 443' \
-w /tmp/quic.pcapng
# ...curl --http3-only https://pwnrzclb.net/ja34 in another shell...
$ tshark -r /tmp/quic.pcapng -o tls.keylog_file:$SSLKEYLOGFILE -V
Same failure. Identical. So it was never a race — the secrets were complete and on disk before this second tshark ever started, and it still claimed they weren't available. When the obvious explanation is wrong, stop theorising about the application and go ask the kernel what actually happened.
$ sudo dmesg | grep -i apparmor | grep tshark
apparmor="DENIED" operation="open" profile="tshark"
name="/home/.../sslkeylogfile.log" requested_mask="r" denied_mask="r"
apparmor="DENIED" operation="open" profile="tshark"
name="/home/.../.config/wireshark/preferences" requested_mask="r"
There it is. This box is on a recent Ubuntu, and recent Ubuntu ships an enforcing AppArmor profile for tshark (Canonical added it in 2024, part of the unprivileged-userns-restrictions push). The profile confines tshark to a tight allowlist — its own binary, the wireshark data dir, a tmp area, /proc/self/fd — and reading an arbitrary file under $HOME is not on the list. The open() on my keylog returned EACCES before tshark ever read a byte of it.
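Those audit lines are regular enough to pull apart mechanically, which beats squinting at dmesg. A small Python sketch, assuming the key="value" layout shown above. The function name and the choice of fields are mine.

```python
import re

KV_RE = re.compile(r'(\w+)="([^"]*)"')


def denials(log_text, profile=None):
    """Return (profile, operation, name, denied_mask) per AppArmor denial."""
    out = []
    for line in log_text.splitlines():
        fields = dict(KV_RE.findall(line))
        if fields.get("apparmor") != "DENIED":
            continue
        if profile is not None and fields.get("profile") != profile:
            continue
        out.append((
            fields.get("profile"),
            fields.get("operation"),
            fields.get("name"),
            fields.get("denied_mask"),
        ))
    return out
```

Feed it the dmesg output and filter on profile="tshark": every path that comes back is a file the tool was told to read and quietly could not.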
And here's the part that belongs in this blog specifically: tshark took a permission-denied on the keylog and reported it to me as "secrets are not available". Those are not the same thing. One means "the file you gave me cannot be opened by this process"; the other means "the keys for this connection aren't in the material I loaded". tshark collapsed the first into the second, and in doing so sent me chasing a race condition that never existed. Same disease as the access log that says 200 on a timeout, same disease as $upstream_response_time sitting flat while the client waits: the tool reports the answer to a question you didn't ask and lets you assume it answered yours. The "Can't open your preferences file" warning tshark prints on startup was the same denial, and I'd been mentally filing it under "harmless noise" for months.
The fix is a local override. The shipped profile pulls one in with an "include if exists" directive, exactly so you don't have to edit the packaged file:
# /etc/apparmor.d/local/tshark
owner @{HOME}/sslkeylogfile.log r,
owner @{HOME}/.config/wireshark/{,**} r,
$ sudo apparmor_parser --replace /etc/apparmor.d/tshark
(If you go this route, also note that capturing without sudo needs you in the wireshark group, and a shell that predates your being added to it won't have the group — sg wireshark -c '…' or a fresh login. That one at least fails honestly.)
With the override in place, re-dissect. Now the 1-RTT stream decrypts, and there in plaintext is the request that kicked the whole thing off:
HTTP3 ... HEADERS: GET /ja34
Header: :path: /ja34
TLSv1.3 ... Client Hello
[JA4: q13d0311h3_55b375c5d22e_a11bc413b5d6]
And what nginx had independently reported for that exact connection at /ja34:
ja4: q13d0311h3_55b375c5d22e_a11bc413b5d6
Byte-for-byte identical. That's the whole point of the exercise. Decoded, q13d0311h3 says: QUIC transport, TLS 1.3, SNI is a domain name, 3 cipher suites, 11 extensions, first ALPN h3; then 55b375c5d22e is the truncated hash of the sorted cipher list and a11bc413b5d6 the truncated hash of the sorted extension list followed by the signature algorithms in the order they were sent. Two implementations that share no code looked at the same ClientHello and produced the same twelve hex digits in each hash. The implementation is correct, and now I can say that instead of hoping it.
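The readable prefix decodes positionally, so the breakdown above can be checked by slicing. A minimal Python sketch. The function name and the dictionary keys are mine, and it only handles the ten-character first section as described here.

```python
TRANSPORT = {"q": "QUIC", "t": "TCP"}
TLS_VERSION = {"13": "1.3", "12": "1.2", "11": "1.1", "10": "1.0"}
SNI = {"d": "domain", "i": "ip"}


def decode_ja4_prefix(ja4):
    """Split the first section of a JA4 fingerprint into its fields."""
    prefix = ja4.split("_")[0]
    if len(prefix) != 10:
        raise ValueError(f"expected a 10-character prefix, got {prefix!r}")
    return {
        "transport": TRANSPORT.get(prefix[0], prefix[0]),
        "tls_version": TLS_VERSION.get(prefix[1:3], prefix[1:3]),
        "sni": SNI.get(prefix[3], prefix[3]),
        "cipher_count": int(prefix[4:6]),
        "extension_count": int(prefix[6:8]),
        "first_alpn": prefix[8:10],
    }


if __name__ == "__main__":
    print(decode_ja4_prefix("q13d0311h3_55b375c5d22e_a11bc413b5d6"))
```

The two hashes after the prefix are opaque by design. The only way to check those is the way this post did it: a second implementation looking at the same bytes.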
One genuinely useful discrepancy fell out of it. The ja3_string nginx emits and the one Wireshark emits do not match — not because either is wrong, but because nginx hands back the extension list sorted (0-10-13-16-...-57) while Wireshark preserves wire order (57-10-22-23-...-16). Classic JA3 is order-sensitive by design, so a sorted "JA3" will hash differently from a canonical one. JA4 sorts ciphers and extensions as part of its spec, which is exactly why the JA4 agreed while the JA3 didn't. The lesson rhymes with every other post here: two fingerprints "matching" is only meaningful once you know precisely what each side canonicalises before it hashes. Compare the wrong normal forms and you'll either miss a real difference or invent one that isn't there.
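The effect is easy to reproduce in miniature. The Python sketch below hashes only an extension list, not a full JA3 string, and the list is a made-up example rather than my real ClientHello. It is meant to show the canonicalisation point and nothing else.

```python
import hashlib


def order_sensitive(extensions):
    """Hash the list exactly as given, the way classic JA3 treats wire order."""
    joined = "-".join(str(e) for e in extensions)
    return hashlib.md5(joined.encode()).hexdigest()


def order_insensitive(extensions):
    """Sort first, then hash, the way JA4 canonicalises before hashing."""
    return order_sensitive(sorted(extensions))


if __name__ == "__main__":
    wire = [57, 10, 22, 23, 16]   # hypothetical wire order
    resorted = sorted(wire)       # what a sorting implementation would emit
    print(order_sensitive(wire) == order_sensitive(resorted))      # False
    print(order_insensitive(wire) == order_insensitive(resorted))  # True
```

Same extensions, same client, two different answers, and neither side is wrong about its own input.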
The fingerprint was always right. The keylog was always complete. The only thing that was ever actually broken was a tool describing a permission error as a cryptographic one — and me believing the description instead of checking it.
Claude Code keeps its state in ~/.claude. Agent definitions, skills, per-agent persistent memory, settings and, in the same directory, the runtime churn: session transcripts, shell history, live credentials. If you only ever work on one machine, this is fine and you can stop reading.
I work on three. A work VM where most of the real work happens, the host workstation it runs on, and a secondary laptop. And here's the thing nobody warns you about: this is the dotfiles problem, which we collectively solved fifteen years ago with a git repo and some symlinks, except now the files learn things. An agent on the VM spends an afternoon figuring out some gnarly corner of a system — which API endpoint lies, which service needs a flag nobody documents — and writes it into its persistent memory. The same agent on the host workstation knows none of it. It's the same agent. Same definition, same name, same job. It just never got the memo, because the memo lives in a directory on a different machine. Basic shit like sharing knowledge between agents in different environments turns out to be a genuinely hard problem, and configs fork underneath you the whole time.
The concrete trigger, for me: a dashboard-analysis agent that had only ever lived on the work VM. I wanted it on the host, because the host is where I can talk to it — voice input, no SSH hop in the middle. The agent definition was a file. Copy a file, easy. But the agent without its memory is a new hire with the old hire's badge. Everything it had learned about the dashboards — which panels lie, which datasource is the slow one — was in its memory directory, and its memory directory was about to fork into two divergent truths the moment I copied it.
The naive fix is sitting right there: git init ~/.claude, push it somewhere, pull everywhere. It fails for two reasons, one obvious in hindsight and one obvious immediately.
The hindsight one: ~/.claude is not a config directory, it's a live runtime directory that happens to contain config. Sessions append, history churns, caches come and go. Version that and every git status is a wall of noise, and noise is how mistakes happen — you stop reading the wall, and one git add --all while a credentials file sits in the tree and you've pushed live tokens to a remote. Not a hypothetical failure mode. The blast radius of one lazy command.
The immediate one: the machines aren't equal and shouldn't be. The work VM carries work agents, work skills, work credentials. None of that has any business on a personal laptop. A naive "sync everything" doesn't just sync state, it syncs liability — it dumps work secrets onto whichever machine pulls next. Different machines deserve different subsets, and a bare git repo has no opinion about subsets.
So, the actual attempts, in the order I made them.
One: on the primary machine, the repo IS the directory. On the work VM, a dotfiles-style git repo was checked out as ~/.claude, with a strict gitignore holding back the runtime churn and git-crypt on the secrets. This works great. It also only defines one machine's truth — it's a single-player solution wearing a multiplayer haircut.
Two: an install.sh that symlinks everything everywhere. The classic dotfiles move. Clone the repo on a new machine, run the script, every artifact gets a symlink. I wrote it, looked at it, and realized that running it on the laptop would deposit work credentials and a dozen work-only agents onto a personal machine in one keystroke. The script got a refusal guard and a policy written in its place: setting up a machine requires per-machine reasoning about what that machine should carry, and a blind script is structurally incapable of reasoning. Disabling your own installer feels like defeat. It isn't. It's the cheapest correct decision in this whole story.
Three: install-safe.sh, the universally-safe subset. If the full installer is banned, what's the largest set of links that is correct on every machine, no judgment required? Turns out: small but real. The repo-steward agent, the editor config. The script is idempotent and non-destructive — it never clobbers; if a real file already occupies a target path it prints SKIP and moves on, and resolving that is explicitly a human's job. Boring by design. Boring is the feature.
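The never-clobber rule is short enough to show. This is a Python sketch of the behaviour described, not install-safe.sh itself. The function name and the return strings other than SKIP are my own.

```python
import os
from pathlib import Path


def link_safely(source: Path, target: Path) -> str:
    """Symlink target -> source without ever replacing what is already there."""
    if target.is_symlink():
        # Already a link: fine if it points where we want, otherwise hands off.
        return "OK" if os.readlink(target) == str(source) else "SKIP"
    if target.exists():
        # A real file or directory occupies the path; a human decides.
        return "SKIP"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.symlink_to(source)
    return "LINKED"
```

Running it twice is a no-op the second time, and a hand-copied file at the target path survives untouched. That is the property that later turns around and bites me.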
Four: the memory-dir symlink. This is the actual win. Take the agent's persistent-memory directory in the live ~/.claude on the host, and make it a symlink into the repo clone. Now when the agent learns something at runtime, it's writing into the repo working tree. The repo becomes a sync bus: runtime learnings become commits, the other machine pulls, and the same agent over there wakes up knowing things it learned somewhere else. The agent's own instructions tell it to sync at session boundaries — and the sync itself is delegated to a repo-steward agent that owns the git mechanics: fetch, rebase, push, and a check that the encrypted blobs are actually ciphertext before anything leaves the machine. The agent that learns is not the agent that pushes. Separation of duties, but for robots.
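The ciphertext check is the piece I would least want to get wrong, and it is also the simplest. A Python sketch under two assumptions: that git-crypt output starts with its usual header bytes (a NUL, the string GITCRYPT, another NUL), and that the bytes being checked are the committed blobs (for instance from git show), not the working-tree files, which are decrypted on checkout. The function names are mine.

```python
GITCRYPT_MAGIC = b"\x00GITCRYPT\x00"


def is_gitcrypt_ciphertext(blob: bytes) -> bool:
    """True if the blob carries the git-crypt header."""
    return blob.startswith(GITCRYPT_MAGIC)


def plaintext_leaks(blobs: dict) -> list:
    """Paths that were supposed to be encrypted but are not.

    blobs maps path -> committed bytes for every path expected to be secret.
    """
    return sorted(
        path for path, data in blobs.items()
        if not is_gitcrypt_ciphertext(data)
    )
```

If that list is non-empty, nothing gets pushed. The steward refuses and a human looks at why the filter did not run.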
Five: git-crypt for everything secret-shaped. Secrets in the repo are ciphertext at rest, key carried out-of-band — a machine without the key can clone all day and hold nothing but noise. This got applied with more paranoia than strictly necessary; even archived voice-dictation transcripts are encrypted, on the theory that anything I said out loud near a microphone is not something I want grep-able on a forge.
Six: the machine-profiles design doc. The endgame: classify every artifact in the repo — every agent, skill, memory dir, script, secret — as work, personal, or both, and drive installation from per-profile manifests keyed on a MACHINE_TYPE variable. And the rule I care most about: when MACHINE_TYPE is unset, fail closed. Refuse to install anything, rather than fall back to "everything", because "everything" is exactly the original sin. Status: designed, written down, argued with, not implemented. The document is real. The machinery is vaporware.
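Since the machinery is vaporware, here is roughly the shape I have in mind, as a Python sketch. Everything in it is hypothetical: the function name, the manifest layout, and the artifact names used to exercise it. The one part that is not negotiable is the first branch.

```python
VALID_TYPES = ("work", "personal")


def artifacts_for(env, manifest):
    """Pick the artifacts a machine may install, or refuse outright.

    env is a mapping like os.environ; manifest maps artifact name to
    "work", "personal" or "both".
    """
    machine_type = env.get("MACHINE_TYPE", "").strip()
    if not machine_type:
        # Fail closed: unset never means "everything".
        raise RuntimeError("MACHINE_TYPE is unset; refusing to install anything")
    if machine_type not in VALID_TYPES:
        raise RuntimeError(f"unknown MACHINE_TYPE {machine_type!r}")
    return sorted(
        name for name, cls in manifest.items()
        if cls in (machine_type, "both")
    )
```

Called with os.environ and the classification table, it either returns the subset for this machine or raises. There is no code path that returns the whole table by default.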
What actually worked: the memory symlink as sync bus — agent knowledge now genuinely flows between machines, which was the whole point. Disabling the blind installer before it hurt me. The idempotent safe subset. git-crypt. And — underrated — writing the work/personal classification down before building anything, because the table turned out to be the hard part and the installer is just the table made executable.
What still bites. The hand-copied agent files from the pre-symlink era are now actively in the way: the safe installer wants to lay a symlink where a real copied file sits, and — correctly — refuses to clobber it, prints SKIP, and leaves the migration to me. My own past shortcuts are blocking their own replacement, and the guard that protects me from the installer also protects the stale copies from being fixed. The manifest engine remains a design doc. And the genuinely hard problem is still open: two machines writing to the same agent's memory concurrently is a distributed-writes merge problem, and my current "solution" is discipline — sync at session boundaries, rebase, don't run the same agent in two places at once — which is to say, it's not a mechanism, it's a promise I make to myself. Promises scale notoriously well.
Lessons, for whatever they're worth:
The files learn things now. The least we can do is teach them to commute.