Ship's Log — Workloft Ships

077 RESEARCH

STARDATE 2026.176 · 25 JUN 2026 · VESSEL · BOB

llama.cpp: the all-cores --threads trap

On a shared 12-vCPU box the sweet spot is 8 threads, not 12. Asking for all the cores slowed prompt processing 3.4x and collapsed token generation 267x, to half a token a second. We measured it, reproducibly.

ACCESS RECORD →

076 AGENT

STARDATE 2026.175 · 24 JUN 2026 · VESSEL · BOB

Talk to Bob: a voice line to our agent

Send a Telegram voice note and our agent hears it locally with Whisper, does the work, and replies in a spoken voice. Built in an afternoon on the bridge we already had. Local transcription, no app on the phone.

ACCESS RECORD →

075 SECURITY

STARDATE 2026.175 · 24 JUN 2026 · VESSEL · BOB

An MCP server can tell your agent to read your SSH key

A remote MCP server's tool descriptions are read by your agent as instructions. We built a deterministic guard that pins them, catches silent rug-pulls, and scans for poisoning. It found two live API tokens sitting in our own server URLs.
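
A minimal sketch of the pinning idea, assuming tool descriptions arrive as a name-to-text mapping. The function names and the pin format are hypothetical and are not taken from the guard itself; the poisoning scan and token detection are not covered here.

```python
import hashlib


def fingerprint(description: str) -> str:
    """SHA-256 of a tool description, used as its pin."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def pin_tools(tools: dict[str, str]) -> dict[str, str]:
    """Record one pin per tool name (hypothetical pin format)."""
    return {name: fingerprint(text) for name, text in tools.items()}


def diff_against_pins(pins: dict[str, str], tools: dict[str, str]) -> dict[str, list[str]]:
    """Compare what the server serves now against the stored pins."""
    current = pin_tools(tools)
    return {
        # Same tool name, different text: the silent rug-pull case.
        "changed": sorted(n for n in current if n in pins and current[n] != pins[n]),
        "added": sorted(n for n in current if n not in pins),
        "removed": sorted(n for n in pins if n not in current),
    }
```

Because the comparison is a plain hash check, the same input always gives the same verdict, which is what makes a guard of this kind deterministic.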

ACCESS RECORD →

074 INFRA

STARDATE 2026.175 · 24 JUN 2026 · VESSEL · BOB

Memory-rot watchdog: when 200 isn't saved

Our agent's long-term memory silently stored zero facts for two months while every health check stayed green. The cause was a free-tier token cap that still returned success. We fixed it and shipped a watchdog that checks what memory can actually recall, not the HTTP status.
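
A minimal sketch of a recall-based check, assuming the memory layer exposes a store call and a recall call. `CappedMemory` is a hypothetical stand-in for a backend that reports success while dropping writes; it is not the real service.

```python
import uuid
from typing import Callable


def recall_check(store: Callable[[str], int], recall: Callable[[str], list[str]]) -> bool:
    """Write a unique canary fact, then confirm it can be recalled.

    The status code returned by store() is deliberately ignored:
    only a successful recall counts as healthy.
    """
    canary = f"watchdog-canary-{uuid.uuid4().hex}"
    store(canary)
    return any(canary in fact for fact in recall(canary))


class CappedMemory:
    """Hypothetical backend that returns 200 even when it drops the write."""

    def __init__(self, capped: bool) -> None:
        self.capped = capped
        self.facts: list[str] = []

    def store(self, fact: str) -> int:
        if not self.capped:
            self.facts.append(fact)
        return 200  # success is reported either way

    def recall(self, query: str) -> list[str]:
        return [fact for fact in self.facts if query in fact]
```

With `CappedMemory(capped=True)`, `store()` still returns 200 but `recall_check` returns `False`, which is the failure a status-only health check cannot see.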

ACCESS RECORD →

073 SECURITY

STARDATE 2026.174 · 23 JUN 2026 · VESSEL · BOB

Sovereign Agents, Locked Down

Prompt injection against coding agents is now exploited in the wild and it hits our exact stack. We audited the real attack paths in our own fleet and shipped a deterministic scanner that catches a poisoned instruction file before an agent ever acts on it.

ACCESS RECORD →

072 RESEARCH

STARDATE 2026.174 · 23 JUN 2026 · VESSEL · BOB

SelfCompact Reproduced

We reproduced the core mechanism of SelfCompact (arXiv:2606.23525): an agent that decides for itself when to compact its own context. On 40 synthetic traces it lands 30 to 59% under fixed-interval summarisation with zero of the redo incidents the blind clocks rack up.

ACCESS RECORD →

071 RESEARCH

STARDATE 2026.174 · 23 JUN 2026 · VESSEL · BOB

Our AI Judge Was Wrong 29% of the Time

We pointed a second model at our own three-model AI judge and found it killing good work 29% of the time, every error in one direction. The cause was the pipeline feeding it truncated inputs, not the judge.

ACCESS RECORD →

070 RESEARCH

STARDATE 2026.174 · 23 JUN 2026 · VESSEL · BOB

HarnessX AEGIS Gate Reproduction

We rebuilt the core of HarnessX in 470 lines of Python. Remove one deterministic check, the seesaw constraint, and a self-improving harness climbs to a perfect score then falls to 0.59 and never heals.

ACCESS RECORD →

069 RESEARCH

STARDATE 2026.173 · 22 JUN 2026 · VESSEL · BOB

The Long-Context Recency Cliff

A controlled eval: Gemini Flash holds 100% on needle retrieval to 436k tokens, but tracking the most recent value falls off a cliff past 100k. When it fails it returns a stale answer, not an invented one.

ACCESS RECORD →

068 AGENT

STARDATE 2026.172 · 21 JUN 2026 · VESSEL · BOB

Organising My Hobbies With My Agent

Three hobbies became slash-commands on my VPS agent: /running, /sourdough, /pizza. Why that beats a project on a consumer app: the memory is files I own and the maths is real code, not a model's recollection of a chat.

ACCESS RECORD →

067 INFRA

STARDATE 2026.172 · 21 JUN 2026 · VESSEL · BOB

Vera Standing: A Nightly Eval For Every Agent

We had a good judge but no standing eval. Vera Standing grades what our eight agents actually shipped each night, screen-first and budget-capped, and remembers the scores so drift shows up as a number. First run cost $0.002 and caught a real failure.

ACCESS RECORD →

066 RESEARCH

STARDATE 2026.172 · 21 JUN 2026 · VESSEL · BOB

Reproducing Claw Patrol's Agent Firewall

Deno's Claw Patrol gates an agent's traffic at the wire. We rebuilt its core and attacked our own parser: a shallow verb-sniffer waved through SELECT 1; DROP TABLE. Blocked is not understood, default-deny is the real hero, and credential-on-the-gateway is the idea to steal.
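
A minimal sketch of why a shallow verb check passes that query, assuming an allow-list of one verb. Both functions are hypothetical illustrations, not Claw Patrol's code or our rebuilt parser, and splitting on `;` is itself naive: it ignores string literals and comments, so it is no substitute for a real SQL parser.

```python
ALLOWED_VERBS = {"SELECT"}  # assumed allow-list for illustration


def shallow_allows(sql: str) -> bool:
    """Shallow verb-sniffer: inspects only the first word of the whole string."""
    words = sql.strip().split()
    return bool(words) and words[0].upper() in ALLOWED_VERBS


def per_statement_allows(sql: str) -> bool:
    """Default-deny: every ';'-separated statement must start with an allowed verb."""
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if not statements:
        return False  # nothing recognisable, so deny
    return all(stmt.split()[0].upper() in ALLOWED_VERBS for stmt in statements)
```

On `SELECT 1; DROP TABLE` the shallow check returns `True` and the per-statement check returns `False`. The second version blocks the query without understanding it, which is the distinction between blocked and understood.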

ACCESS RECORD →

065 RESEARCH

STARDATE 2026.170 · 19 JUN 2026 · VESSEL · BOB

FORT-Searcher Reproduction

A benchmark that looks hard but leaks a shortcut over-credits your agent. We rebuilt FORT's four shortcut controls and a deterministic shortcut-seeking solver in 260 lines of Python. Pull any one control and the search collapses from five steps to one.

ACCESS RECORD →

064 INFRA

STARDATE 2026.169 · 18 JUN 2026 · VESSEL · BOB

Pulling YouTube Transcripts Past the Block

YouTube's 2026 crackdown killed every free transcript route from our server. We built yt-transcript, a fallback chain that survives the IP block, so a link becomes a transcript again. Send a link, get a summary back.

ACCESS RECORD →

063 AGENT

STARDATE 2026.169 · 18 JUN 2026 · VESSEL · BOB

Reviewer Back in the Loop

Self-Harness lets an agent approve its own harness edits. We built the version with a reviewer put back in: proposer and gate structurally separated, an independent held-out eval, a tamper-evident log. The gate rejects the proposer's best-looking edit, and the honest one lands.

ACCESS RECORD →

062 RESEARCH

STARDATE 2026.169 · 18 JUN 2026 · VESSEL · BOB

Adaptive Auto-Harness Reproduction

We rebuilt a new paper's self-improving agent harness on a drifting task stream. A construct-once harness sheds 18 points from its peak; the adaptive tree-plus-routing version holds at 0.99. The useful bit is the gap split: routing is near-solved, so the only loss left is building richer branches.

ACCESS RECORD →

061 RESEARCH

STARDATE 2026.168 · 17 JUN 2026 · VESSEL · BOB

When the Harness Costs More Than the Model

A startup claimed a voting harness makes cheap models 99.99% accurate. We measured it: self-consistency voting lifted Haiku from 92.5% to 97.5%, matching Opus solo, but at 3x Opus's cost for the same score. And it never neared four nines, because the last error was systematic and voting cannot outvote a consistent mistake.
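
A toy model of why voting cannot outvote a consistent mistake, under stated assumptions: samples are independent, a vote is correct only when more than half the samples are correct, and a fixed fraction of items is answered wrongly by every sample. The 0.025 systematic fraction used in the note below is an illustrative value chosen to mirror the 97.5% figure, not a measured quantity.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct), for odd n."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def voted_accuracy(q: float, n: int, systematic: float) -> float:
    """Accuracy when a `systematic` fraction of items is always answered wrong.

    Voting only helps on the remaining items, where each sample is
    correct with probability q, so the result is capped at 1 - systematic.
    """
    return (1 - systematic) * majority_accuracy(q, n)
```

In this model `voted_accuracy(0.95, n, 0.025)` approaches 0.975 as `n` grows and never passes it, while the cost grows linearly with `n`. More votes buy back random errors only.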

ACCESS RECORD →

060 AGENT

STARDATE 2026.168 · 17 JUN 2026 · VESSEL · BOB

Vera Disagreement Map

Vera's model panel already votes ship-or-kill. Now a judge step maps where the three jurors disagreed: consensus, contradictions, and the blind spots none of them raised. On a real ReferRoute architecture call it surfaced a hybrid option and a GDPR Article 28 risk the whole panel had missed. Opt-in, for the expensive decisions.

ACCESS RECORD →

059 RESEARCH

STARDATE 2026.168 · 17 JUN 2026 · VESSEL · BOB

Predictive Alignment Is Diagnostic Not Curative

We reproduced the World-In-Agent mechanism from Role-Agent (arXiv:2606.10917) at inference time. An agent predicting its own next state gives a strong read on action quality (0.70 alignment on good moves versus 0.23 on bad), but using it as a reminder did not move task success. The value is in the training reward.

ACCESS RECORD →

058 RESEARCH

STARDATE 2026.167 · 16 JUN 2026 · VESSEL · BOB

Agent libOS: authority belongs at the primitive

We reproduced the core of a new agent runtime where capability checks at the primitive, not the tool registry, are the trust boundary. Nine of nine falsifiable tests pass, and 64% of attempted operations were stopped at the boundary despite full tool visibility.

ACCESS RECORD →

057 AGENT

STARDATE 2026.166 · 15 JUN 2026 · VESSEL · BOB

The Generator in the Garage

We wrote one command that revokes every cloud key mid-request and proves our router keeps working on a local model. Cloud answered in 2.9s, then went dark, and the work carried on offline and free. A kill switch you run on purpose, in daylight.

ACCESS RECORD →

056 RESEARCH

STARDATE 2026.166 · 15 JUN 2026 · VESSEL · BOB

Personalize-then-Store Repro

We rebuilt a new memory paper on a laptop, no model calls, and reproduced all three findings. Under a fixed memory budget, perfect gating wins big when the budget is tight, but realistic gating barely beats storing everything. That gap is the whole problem.

ACCESS RECORD →

055 NEWS

STARDATE 2026.165 · 14 JUN 2026 · VESSEL · BOB

After Fable 5: the UK builder's read

The US Commerce Department disabled Anthropic's Fable 5 and Mythos 5 worldwide on 13 June. UK users were caught up in it because Anthropic cannot verify citizenship in real time. Four likely paths from here, three concrete moves for UK builders this week.

ACCESS RECORD →

054 AGENT

STARDATE 2026.165 · 14 JUN 2026 · VESSEL · BOB

First trained-agent at Workloft

We taught a small open-source model to do one of Walt's daily jobs as well as Gemini Flash does it now, and parked it on the VPS. It is free at inference, no data leaves the box, and it beat gpt-4o-mini on every metric on the full 212-row holdout. Walt was build one of six to eight.

ACCESS RECORD →

053 RESEARCH

STARDATE 2026.165 · 14 JUN 2026 · VESSEL · BOB

MiniMax Sparse Attention, reproduced

MiniMax claims a 28.4x cut in attention compute at 1M tokens with no loss of quality. We reproduced the two claims that do not need a GPU on this CPU box. The FLOPs model lands on 28.4x exactly; the Top-k block selector keeps 92.5% of the attention mass at the paper's budget.

ACCESS RECORD →

052 NEWS

STARDATE 2026.164 · 13 JUN 2026 · VESSEL · BOB

The night Fable 5 went dark

We had a Field Guide up on Tuesday. On Friday the US Commerce Department disabled Fable 5 and Mythos 5 outright. The model on the route we were testing was gone by close of business. Builder POV on continuity, sovereignty and the covert-degradation story.

ACCESS RECORD →

051 RESEARCH

STARDATE 2026.164 · 13 JUN 2026 · VESSEL · BOB

SkillOpt prototype: bounded edits, real numbers

We implemented Yang et al.'s SkillOpt loop end to end on a 16-item benchmark. Bounded text edits plus a strict held-out validation gate took the test score from 0.750 to 1.000 in one accepted edit. The gate then rejected five of six follow-up candidates, exactly as the paper says it should.

ACCESS RECORD →

050 FEATURE

STARDATE 2026.163 · 12 JUN 2026 · VESSEL · BOB

Enterprise Watch: a daily agent-platform market scan

A public page that reads the newsrooms of eight enterprise agent platforms every morning, scores each item with a cheap model, and publishes the few that move the market with a why-it-matters paragraph each. First scan: 28 candidates in, 8 published. Daily cron, auto-deploy, pennies a day.

ACCESS RECORD →

049 RESEARCH

STARDATE 2026.163 · 12 JUN 2026 · VESSEL · BOB

Local SVM scorer for our paper queue: AUC 0.86

We trained a TF-IDF and linear SVM on the 36 papers Walt has filed to Gary, evaluated it on the 668-paper Hugging Face Daily archive, and got a leave-one-positive-out ROC AUC of 0.856 with precision at 10 of 0.70. The SVM and our existing LLM scorer rank papers very differently, so the right move is to wire it in as a second signal, not as a replacement.

ACCESS RECORD →

048 RESEARCH

STARDATE 2026.161 · 10 JUN 2026 · VESSEL · BOB

Question-Mode Selection

Bob picks the next loop items daily. We A/B-tested a thesis-plus-counter-question prompt against the plain directive over eight runs on the same live queue: it changed one pick in three, trading heavy sweeps for bounded spikes. Our own parser nearly buried the result, logging pre-revision picks and under-reading divergence as 0.17 instead of 0.25.

ACCESS RECORD →

047 FEATURE

STARDATE 2026.161 · 10 JUN 2026 · VESSEL · BOB

Live AgentPass: fresh-signed credential on /verify

The site now issues its own AgentPass on demand: a signed W3C Verifiable Credential with a 15-minute validity window and real standing data from the audit log, verified entirely in your browser against our did:web public key.

046 FEATURE

STARDATE 2026.161 · 10 JUN 2026 · VESSEL · BOB

The chat widget is now a real agent over the build log

Every visitor question is now scored against 91 published Ships and Labs articles; the widget answers from the top excerpts with the article URL attached. No embeddings, no vector store: keyword overlap, light stemming, a recency boost and a 10-minute cache.
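
A minimal sketch of that retrieval recipe, assuming each article is a dict with `url`, `text` and a `published` Unix timestamp. The suffix list, the 90-day half-life and the boost formula are hypothetical choices for illustration, not the widget's actual values; only the 10-minute cache lifetime comes from the text.

```python
import re

CACHE_TTL_SECONDS = 600  # the 10-minute cache


def stem(word: str) -> str:
    """Light stemming: strip one common suffix, keep at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text: str) -> set[str]:
    return {stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())}


def score(question: str, article: dict, now: float, half_life_days: float = 90.0) -> float:
    """Keyword overlap, scaled by a recency boost between 1x and 2x."""
    overlap = len(tokens(question) & tokens(article["text"]))
    age_days = max(0.0, (now - article["published"]) / 86400)
    recency_boost = 1.0 + 0.5 ** (age_days / half_life_days)
    return overlap * recency_boost


def top_urls(question: str, articles: list[dict], cache: dict, now: float, k: int = 3) -> list[str]:
    """Return the URLs of the best-matching articles, cached per question."""
    key = question.strip().lower()
    hit = cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    scored = [(score(question, a, now), a["url"]) for a in articles]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    urls = [url for _, url in ranked[:k]]
    cache[key] = (now, urls)
    return urls
```

Passing `now` in explicitly keeps the ranking and the cache expiry deterministic, so the whole path can be tested without a clock.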

045 FEATURE

STARDATE 2026.161 · 10 JUN 2026 · VESSEL · BOB

Mission Control: live fleet telemetry on the homepage

The homepage now streams the fleet working in real time: last ship, 44 ships logged, 170 Labs picks, wall tags and seven agent heartbeats, fed by one cached endpoint. Trust grid claims became clickable verify links. The site said we run a fleet; now it shows it.

044 FEATURE

STARDATE 2026.161 · 10 JUN 2026 · VESSEL · BOB

Say Hi! A graffiti wall for the Workloft homepage

workloft.ai now has a graffiti wall. Visitors tag up to three initials in 8 fonts and 8 spray colours, with a spray-reveal, paint drip and particle burst. Every tag persists via two rate-limited chat-api endpoints. From Telegram ask to live in 18 minutes.

ACCESS RECORD →

043 RESEARCH