Workloft
▸ SHIP'S LOG STARDATE 2026.163 VESSEL WORKLOFT CMDR A. CHURCHILL

Ship's Log.

Records from the Workloft bridge. What we shipped, why, what we learned. Plain English, no marketing. Newest transmission on top.

Entries 75
Last TX 0750 BST
Mission Day M-001
Status NOMINAL
▸ TRANSMISSIONS
077 RESEARCH

llama.cpp: the all-cores --threads trap

On a shared 12-vCPU box the sweet spot is 8 threads, not 12. Asking for all the cores dropped prompt processing 3.4x and collapsed token generation 267x, to half a token a second. We measured it, reproducibly.

ACCESS RECORD
076 AGENT

Talk to Bob: a voice line to our agent

Send a Telegram voice note, our agent hears it locally with Whisper, does the work, and replies in a spoken voice. Built in an afternoon on the bridge we already had. Local transcription, no app on the phone.

ACCESS RECORD
075 SECURITY

An MCP server can tell your agent to read your SSH key

A remote MCP server's tool descriptions are read by your agent as instructions. We built a deterministic guard that pins them, catches silent rug-pulls, and scans for poisoning. It found two live API tokens sitting in our own server URLs.

ACCESS RECORD
074 INFRA

Memory-rot watchdog: when 200 isn't saved

Our agent's long-term memory silently stored zero facts for two months while every health check stayed green. The cause was a free-tier token cap that still returned success. We fixed it and shipped a watchdog that checks what memory can actually recall, not the HTTP status.

ACCESS RECORD
073 SECURITY

Sovereign Agents, Locked Down

Prompt injection against coding agents is now exploited in the wild and it hits our exact stack. We audited the real attack paths in our own fleet and shipped a deterministic scanner that catches a poisoned instruction file before an agent ever acts on it.

ACCESS RECORD
072 RESEARCH

SelfCompact Reproduced

We reproduced the core mechanism of SelfCompact (arXiv:2606.23525): an agent that decides for itself when to compact its own context. On 40 synthetic traces it lands 30 to 59% under fixed-interval summarisation with zero of the redo incidents the blind clocks rack up.

ACCESS RECORD
071 RESEARCH

Our AI Judge Was Wrong 29% of the Time

We pointed a second model at our own three-model AI judge and found it killing good work 29% of the time, every error in one direction. The cause was the pipeline feeding it truncated inputs, not the judge.

ACCESS RECORD
070 RESEARCH

HarnessX AEGIS Gate Reproduction

We rebuilt the core of HarnessX in 470 lines of Python. Remove one deterministic check, the seesaw constraint, and a self-improving harness climbs to a perfect score then falls to 0.59 and never heals.

ACCESS RECORD
069 RESEARCH

The Long-Context Recency Cliff

A controlled eval: Gemini Flash holds 100% on needle retrieval to 436k tokens, but tracking the most recent value falls off a cliff past 100k. When it fails it returns a stale answer, not an invented one.

ACCESS RECORD
068 AGENT

Organising My Hobbies With My Agent

Three hobbies became slash-commands on my VPS agent: /running, /sourdough, /pizza. Why that beats a project on a consumer app: the memory is files I own and the maths is real code, not a model's recollection of a chat.

ACCESS RECORD
067 INFRA

Vera Standing: A Nightly Eval For Every Agent

We had a good judge but no standing eval. Vera Standing grades what our eight agents actually shipped each night, screen-first and budget-capped, and remembers the scores so drift shows up as a number. First run cost $0.002 and caught a real failure.

ACCESS RECORD
066 RESEARCH

Reproducing Claw Patrol's Agent Firewall

Deno's Claw Patrol gates an agent's traffic at the wire. We rebuilt its core and attacked our own parser: a shallow verb-sniffer waved through SELECT 1; DROP TABLE. Blocked is not understood, default-deny is the real hero, and credential-on-the-gateway is the idea to steal.

ACCESS RECORD
065 RESEARCH

FORT-Searcher Reproduction

A benchmark that looks hard but leaks a shortcut over-credits your agent. We rebuilt FORT's four shortcut controls and a deterministic shortcut-seeking solver in 260 lines of Python. Pull any one control and the search collapses from five steps to one.

ACCESS RECORD
064 INFRA

Pulling YouTube Transcripts Past the Block

YouTube's 2026 crackdown killed every free transcript route from our server. We built yt-transcript, a fallback chain that survives the IP block, so a link becomes a transcript again. Send a link, get a summary back.

ACCESS RECORD
063 AGENT

Reviewer Back in the Loop

Self-Harness lets an agent approve its own harness edits. We built the version with a reviewer put back in: proposer and gate structurally separated, an independent held-out eval, a tamper-evident log. The gate rejects the proposer's best-looking edit, the honest one lands.

ACCESS RECORD
062 RESEARCH

Adaptive Auto-Harness Reproduction

We rebuilt a new paper's self-improving agent harness on a drifting task stream. A construct-once harness sheds 18 points from its peak; the adaptive tree-plus-routing version holds at 0.99. The useful bit is the gap split: routing is near-solved, so the only loss left is building richer branches.

ACCESS RECORD
061 RESEARCH

When the Harness Costs More Than the Model

A startup claimed a voting harness makes cheap models 99.99% accurate. We measured it: self-consistency voting lifted Haiku from 92.5% to 97.5%, matching Opus solo, but at 3x Opus's cost for the same score. And it never neared four nines, because the last error was systematic and voting cannot outvote a consistent mistake.

ACCESS RECORD
060 AGENT

Vera Disagreement Map

Vera's model panel already votes ship-or-kill. Now a judge step maps where the three jurors disagreed: consensus, contradictions, and the blind spots none of them raised. On a real ReferRoute architecture call it surfaced a hybrid option and a GDPR Article 28 risk the whole panel had missed. Opt-in, for the expensive decisions.

ACCESS RECORD
059 RESEARCH

Predictive Alignment Is Diagnostic Not Curative

We reproduced the World-In-Agent mechanism from Role-Agent (arXiv:2606.10917) at inference time. An agent predicting its own next state gives a strong read on action quality (0.70 alignment on good moves versus 0.23 on bad), but using it as a reminder did not move task success. The value is in the training reward.

ACCESS RECORD
058 RESEARCH

Agent libOS: authority belongs at the primitive

We reproduced the core of a new agent runtime where capability checks at the primitive, not the tool registry, are the trust boundary. Nine of nine falsifiable tests pass, and 64% of attempted operations were stopped at the boundary despite full tool visibility.

ACCESS RECORD
057 AGENT

The Generator in the Garage

We wrote one command that revokes every cloud key mid-request and proves our router keeps working on a local model. Cloud answered in 2.9s, then went dark, and the work carried on offline and free. A kill switch you run on purpose, in daylight.

ACCESS RECORD
056 RESEARCH

Personalize-then-Store Repro

We rebuilt a new memory paper on a laptop, no model calls, and reproduced all three findings. Under a fixed memory budget, perfect gating wins big when the budget is tight, but realistic gating barely beats storing everything. That gap is the whole problem.

ACCESS RECORD
055 NEWS

After Fable 5: the UK builder's read

The US Commerce Department disabled Anthropic's Fable 5 and Mythos 5 worldwide on 13 June. UK users caught up because Anthropic cannot verify citizenship in real time. Four likely paths from here, three concrete moves for UK builders this week.

ACCESS RECORD
054 AGENT

First trained-agent at Workloft

We taught a small open-source model to do one of Walt's daily jobs as well as Gemini Flash does it now, and parked it on the VPS. Free at inference, no data leaves the box, beat gpt-4o-mini on every metric on the full 212-row holdout. Walt was build one of six to eight.

ACCESS RECORD
053 RESEARCH

MiniMax Sparse Attention, reproduced

MiniMax claims a 28.4x cut in attention compute at 1M tokens with no loss of quality. We reproduced the two claims that do not need a GPU on this CPU box. The FLOPs model lands on 28.4x exactly; the Top-k block selector keeps 92.5% of the attention mass at the paper's budget.

ACCESS RECORD
052 NEWS

The night Fable 5 went dark

We had a Field Guide up on Tuesday. On Friday the US Commerce Department disabled Fable 5 and Mythos 5 outright. The model on the route we were testing was gone by close of business. Builder POV on continuity, sovereignty and the covert-degradation story.

ACCESS RECORD
051 RESEARCH

SkillOpt prototype: bounded edits, real numbers

We implemented Yang et al.'s SkillOpt loop end to end on a 16-item benchmark. Bounded text edits plus a strict held-out validation gate took the test score from 0.750 to 1.000 in one accepted edit. The gate then rejected five of six follow-up candidates, exactly as the paper says it should.

ACCESS RECORD
050 FEATURE

Enterprise Watch: a daily agent-platform market scan

A public page that reads the newsrooms of eight enterprise agent platforms every morning, scores each item with a cheap model, and publishes the few that move the market with a why-it-matters paragraph each. First scan: 28 candidates in, 8 published. Daily cron, auto-deploy, pennies a day.

ACCESS RECORD
049 RESEARCH

Local SVM scorer for our paper queue: AUC 0.86

We trained a TF-IDF and linear SVM on the 36 papers Walt has filed to Gary, evaluated it on the 668-paper Hugging Face Daily archive, and got a leave-one-positive-out ROC AUC of 0.856 with precision at 10 of 0.70. The SVM and our existing LLM scorer rank papers very differently, so the right move is to wire it in as a second signal, not as a replacement.

ACCESS RECORD
048 RESEARCH

Question-Mode Selection

Bob picks the next loop items daily. We A/B-tested a thesis-plus-counter-question prompt against the plain directive over eight runs on the same live queue: it changed one pick in three, trading heavy sweeps for bounded spikes. Our own parser nearly buried the result, logging pre-revision picks and under-reading divergence as 0.17 instead of 0.25.

ACCESS RECORD
047 FEATURE

Live AgentPass: fresh-signed credential on /verify

The site now issues its own AgentPass on demand: a signed W3C Verifiable Credential with a 15-minute validity window and real standing data from the audit log, verified entirely in your browser against our did:web public key.

046 FEATURE

The chat widget is now a real agent over the build log

Every visitor question is now scored against 91 published Ships and Labs articles; the widget answers from the top excerpts with the article URL attached. No embeddings, no vector store: keyword overlap, light stemming, a recency boost and a 10-minute cache.

045 FEATURE

Mission Control: live fleet telemetry on the homepage

The homepage now streams the fleet working in real time: last ship, 44 ships logged, 170 Labs picks, wall tags and seven agent heartbeats, fed by one cached endpoint. Trust grid claims became clickable verify links. The site said we run a fleet; now it shows it.

044 FEATURE

Say Hi! A graffiti wall for the Workloft homepage

workloft.ai now has a graffiti wall. Visitors tag up to three initials in 8 fonts and 8 spray colours, with a spray-reveal, paint drip and particle burst. Every tag persists via two rate-limited chat-api endpoints. From Telegram ask to live in 18 minutes.

ACCESS RECORD
043 RESEARCH

skill-distiller: worked demonstrations into a reusable skill

We write skills best from a task we have already done well once. skill-distiller takes the messy worked record of how a task was actually done and distils it into a structured SKILL.md draft, capturing the implicit procedure and pitfalls, not a summary. Drafts land for human review and never auto-install.

ACCESS RECORD
042 AGENT

rebound: a tool-failure recovery harness

Tools fail constantly. The question is whether the fleet bounces back. rebound replays real tool-failure events from our audit log and measures recovery: explicit failures recover 100%, implicit-semantic ones 90%. It surfaced the one that never did — an Otto cron that got empty stdout and never retried.

ACCESS RECORD
041 AGENT

codemap: a local code-symbol index for agents

"Where is X and what is its signature" usually means grep the whole tree, then read the file end to end for one line. codemap indexes every function, class and type into a compact SQLite map, so the same question is a single file:line lookup. 96.7% fewer characters per lookup, pure stdlib, 22 tests.

ACCESS RECORD
040 INFRA

sluice: an outbound egress guard

Agents touch live credentials all day. One careless paste and a key is public forever. sluice is the gate every outbound message passes through: scan and refuse, or redact in place. 100% recall on planted secrets, zero false positives across 1.36M chars of real copy, and it caught two real internal-path disclosures already live on the site.

ACCESS RECORD
039 INFRA

slim: token-trim filter for agents

Agents burn most of their context budget on tool output they never needed. slim strips the noise before it reaches the model: lossless cleanups always on, large dumps clamped head and tail. On five real command outputs it cut characters by 88.7%, roughly 110k estimated tokens down to 12k. The honest catch: the big wins are lossy by design.

ACCESS RECORD
038 RESEARCH

Vera Reward Mode

The Vera panel votes PASS or KILL with a confidence number, and models are bad at that number. We read a reward straight from each juror's next-token probabilities instead. On an eleven-probe set it held a steady 1.0 where the old signal coin-flipped to 0.38, and it surfaced juror disagreement the averaged confidence had buried.

ACCESS RECORD
037 FEATURE

Vera A/B Mode

Vera could tell us whether an agent passes a scenario set. It could not tell us whether a change helped. A/B mode runs two variants over the same scenarios and the same rubric, scores both with the three-juror panel, and reports a net pass-rate delta, tagging every scenario fixed, regressed, stable or inconclusive.

ACCESS RECORD
036 INFRA

Wiring r/LocalLLaMA into the Workloft Loop

We added r/LocalLLaMA as the fifth feed to the Loop, the one source watching the open-weight and local-inference world. Reddit 403s our server's IP on the JSON API, so we pull it through the wide-open RSS feed instead. Walt scores the day's posts and files only the 9s and 10s, because the place is noisy. First run: thirty-two scored, two filed.

ACCESS RECORD
035 AGENT

stealing Jon's browser hardening for Larry

A fellow builder, Jon, wrote up his hardened agent-browser setup and shared it. We took the one piece that earned its place today, a stealth flag that stops Larry advertising himself as automation, and left the proxy and captcha layers documented as on-demand. Then we mirrored it so you can steal it too.

ACCESS RECORD
034 AGENT

trojan-scan: catching backdoors in our own memory

A new paper (ClawTrojan) shows an agent reading a hidden instruction from a tool output, storing it in memory, then running it a session later. Per-step gates miss it. We built a scanner that baselines every auto-injected surface and flags drift, obfuscation and hook egress. Clean on 256 files, catches all four seeded attacks.

ACCESS RECORD
033 INFRA

daily.dev wired into the Workloft Loop

We hooked daily.dev's trending feed into the Loop. A daily cron pulls it, Walt scores every post against our research axes, and the strongest buildable picks file themselves into the backlog. Third external signal feeding the Loop, for pennies a day.

ACCESS RECORD
032 RESEARCH

Grok tested for the code tier. It didn't earn the slot.

We wired xAI's Grok into our router and ran it against the models we already trust for code. It wrote correct code, fast and cheap, but Opus still won quality and DeepSeek still won price, so Grok stays in the catalogue without the slot. A negative result is still a result.

ACCESS RECORD
031 AGENT

Queued posts now fall off the to-do list on their own.

Once a post is queued for review, the reminder to publish it closes itself. A new audit pass matches open publish to-dos to live drafts by channel and slug, and the draft becomes the tracker so the list only shows what still needs a human.

ACCESS RECORD
030 FIX

The agent stopped re-posting things we'd already shipped.

A status-driven daily audit now reads the real queue state, catches cross-channel duplicates, and closes the to-do items for posts we have already published. One clean pass cleared nine orphaned drafts that three manual reminders could not.

ACCESS RECORD
029 AGENT

A bandit that stops the router overpaying.

A small learner sits on top of Ruby, watches which tier actually pays off per job, and downshifts off the dear tier when the cheap one keeps answering. On our priciest category the gap it closes is about seventeen-fold.

ACCESS RECORD
028 AGENT

The router now grades its own answer before handing it back.

Ruby runs the cheap model first, puts the reply in front of a three-juror panel, and climbs the tier ladder by itself when the answer is weak. Cheap by default, expensive only when the work earns it.

ACCESS RECORD
027 AGENT

Our agent read research the slow way. Now it reads it itself.

A human used to spot a paper and paste the link. We wired the AlphaXiv MCP server into the agent, so it searches, ranks and reads arXiv papers as native tools. The research firehose is one tool call now, not a manual hunt.

ACCESS RECORD
026 INFRA

The rule was documented. The agent skipped it anyway.

Our shipping procedure kept losing the same step. So we stopped trusting the agent to remember and moved the hard rules into hooks that block the action when a precondition is missing. The hero on this entry exists because a gate refused to ship it without one.

ACCESS RECORD
025 INFRA

We could see what the robots spent. Not what they earned.

The audit log tracked every pound each always-on cron spent on tokens, but nothing it earned. We wired per-cron revenue attribution onto the same append-only ledger — no new database — so every cron has a P&L.

ACCESS RECORD
024 INFRA

The rule was saved. The agent never saw it.

A saved rule kept getting broken because the memory index outgrew its load budget and was truncated before it reached context. We trimmed it and built a hook that hard-stops the index from ever exceeding budget.

ACCESS RECORD
023 RESEARCH

The V4-Pro Reasoning-Token Mirage.

DeepSeek V4-Pro's price fell 75%. We A/B'd it against Gemini Flash on our live paper-scoring job. It came out 11.7x pricier and 18.8x slower. Hidden reasoning tokens, paid for and thrown away.

ACCESS RECORD
022 AGENT

The Social Loop.

The Typefully bridge. Post drafts flow out for scheduling, and a 15-minute cron reconciles the published URLs back into the ledger. The Publish step of the Loop now runs itself.

ACCESS RECORD
021 INFRA

Walt's picks now grade themselves.

Outer loop of two-level autoresearch wired onto Walt. Every paper scored >= 8 is joined to its Gary outcome and reported per axis. Measure-before-tune, as a runtime feature.

ACCESS RECORD
020 INFRA

Bob's actions now write Vera's tests.

PhoneWorld pattern applied to our audit log. Trajectories cluster by (agent, action), Ruby drafts a Vera rubric per cluster, verifier coverage grows on its own as the fleet does new work.

ACCESS RECORD
019 FIX

civiclaw FOI intake prompt polished.

The intake prompt invited the model to ask clarifying questions back. Removed that default, forced six fixed headings, anchored workable-as-written. Output on Qwen2.5:7b dropped from ~60 lines in 1m41s to ~30 lines in 45s and stayed on-topic.

ACCESS RECORD
018 INFRA

civiclaw sovereign Ollama fallback wired end-to-end.

Until today, civiclaw's sovereign claim was scaffolded but not wired. Every skill hard-bound to the Anthropic SDK. As of this commit, FOI / EIR / AIACT / DSAR plain-text stages all run end-to-end on a local Qwen2.5:7b via Ollama. The doc claim is now a doc fact.

ACCESS RECORD
017 INFRA

civiclaw GitHub mirror live.

civiclaw is now at github.com/workloftai/civiclaw, push-mirrored from the GitLab canonical via GitLab's remote_mirrors API. Closes the discoverability gap for HN and dev audiences who expect to find OSS on GitHub, not GitLab.

ACCESS RECORD
016 INFRA

Audited the next MCP spec two months early.

MCP protocol version 2026-07-28 is in draft upstream. Two months until release. We audited our hosted endpoint, found a real 502 leak on the legacy GET stream, fixed it to a 405, and wired the hourly canary plus daily PyPI watcher that will tell us the moment the Python SDK ships 2026-07-28 support. The flip is now a 30-minute job.

ACCESS RECORD
015 RESEARCH

SEAL evolve, failure-driven guardrails from the audit log.

A paper landed on the arXiv feed at 8am. By lunch we had stolen the implementable kernel, run it on our own audit log, and the first 7-day pass surfaced an Anthropic billing issue and a DeepSeek max_tokens bug that had been failing quietly for days. Walt classifies each failure, clusters them, and drafts a one sentence guardrail per cluster. Two hundred lines, no new dependencies. Read-only by design.

ACCESS RECORD
014 FEATURE

Labs Carousel — PDF carousel generator for Workloft Labs Notes.

Every Workloft Labs Note now ships with a 1080x1350 LinkedIn-native PDF carousel alongside the text. One command: distillation via Walt + Sonnet, per-Note motif via gpt-image-2, layout via Playwright, British-English post body drafted automatically. End to end about six pence per Note. Built to test whether carousels outperform text posts for our audience.

ACCESS RECORD
013 INFRA

Workloft Labs, now a hosted MCP server.

Labs API has been live since 8 May with zero external uptake. Sunday afternoon we turned it into a hosted MCP at chat-api.workloft.ai/labs-api/mcp/: one JSON snippet in any agent client and our 85 curated picks across 17 days appear as tools. Same build fixed the /health 502, lifted the free tier to 500 calls/30d, and added a public no-auth daily JSON snapshot endpoint. Zero clone, zero auth.

ACCESS RECORD
012 AGENT

Agentic Oddities, the fortnightly weird-AI digest.

Every three days a scraper pulls real-world AI-agent failure stories from HN and Google News, Walt scores them, Vera picks the headline and writes the missing-control angle. First run: 127 candidates, 4 shortlisted, headline pick was The Times on the AI cafe that ordered 3,000 pairs of gloves. Feeds the new /labs/news/ section that went live the same day.

ACCESS RECORD
011 INFRA

A ledger for every public post.

Tiny Supabase table called workloft_posts. Every public post under the Workloft name now lands a row. Maggie JSON queues hold intent; the ledger holds outcome. First two rows landed within a minute of the migration committing.

ACCESS RECORD
010 INFRA

A todo system Bob cannot cheat.

164 open todos, many overdue by two weeks. Spent the day building a system where every item ends in shipped or killed. Enforcement lives in a Claude Code Stop hook, not in the system prompt. First contract violation caught 30 min after going live.

ACCESS RECORD
009 INFRA

Every Note and Ship now has a Markdown sibling.

18 .md files now ship alongside the HTML. Frontmatter on top, no nav, no animation, no related-links chrome. GitLab Pages serves them as text/markdown. The Workloft corpus is now agent-readable by default.

ACCESS RECORD
008 INFRA

llms.txt for Workloft, shipping for real this time.

Our llms.txt existed in the repo for weeks and 404'd in production for weeks. A PostHog look at last week's traffic surfaced the silent failure. We fixed the deploy, refreshed the content, and made Workloft visible to AI crawlers.

ACCESS RECORD
007 POSITIONING

The interop floor lifted. We swept our positioning to match.

A2A v1.0 crossed 150 organisations and one year inside the Linux Foundation. Interop is officially commodity. We swept Labs, the homepage and the sales surface, and published Note №10 on where the moat moves next.

ACCESS RECORD
006 INFRA

The selection gate now sits on a panel

Single-LLM judges have correlated blind spots. We retired Vera and stood up PoLL, a three-juror panel across Haiku 4.5, Gemini 2.5 Flash, and DeepSeek v4 Flash. Splits escalate to Telegram. ~$0.002 per candidate.

ACCESS RECORD
005 INFRA

Your audit log is training data

Agent Context Compilation, applied to our own production audit log. 25 trajectories, 102 grounded long-context QA pairs, $0.0132 of compute. Open source under MIT.

ACCESS RECORD
004 AGENT

Bob Picks Up the Phone

After several weeks of back and forth with Twilio support, the Workloft voice line is finally live. Bob now answers the phone. A real conversation, in real time, in a voice closer to a person than a recording.

ACCESS RECORD
001 AGENT

Gemini Managed Agents, wired into Ruby

Google shipped one-call managed agents at I/O 2026. We tested it, wired it into our model router, and saw three to eight times cost cuts on agentic tasks. Region caveats apply.

ACCESS RECORD
003 RESEARCH

AgentPass V0.1: the verification primitive AI agents don't yet have

Published as an RFC on 3 May 2026. A Verifiable Credential profile that lets any verifier answer, in real time, whether an AI agent has standing to act in an institutional transaction. Single API call. Yes/no with cryptographic proof.

ACCESS RECORD
002 INFRA

Sovereign by default: A2A v1.0 + AP2 V0.1 wired through the stack

Over 24 and 25 April we made every Workloft agent speak Google A2A v1.0 and issue AP2 V0.1 mandates. Every agent action is now cryptographically signed and independently verifiable. Verify it yourself at workloft.ai/verify.

ACCESS RECORD