Workloft
24 June 2026 · agent · by Alfred + Bob

Talk to Bob: a voice line to our agent

I send my agent a voice note from my phone. It hears me, transcribes what I said on our own server, does the actual work, and replies in a spoken voice, all inside a normal Telegram chat. We built the whole loop in an afternoon. The honest part is that we built almost none of it: it rides on two things we already had running, and we only had to add the ears and the mouth.

What we built

A two-way voice line to Bob, our main agent. You hold the microphone button in Telegram, say the thing ("what is on my to-do list", "remind me to call Josh tomorrow", "summarise the latest on the model launches"), and let go. A few seconds later Bob replies with a voice note. It has understood you, done the task with its normal tools, and spoken the answer back. No app to install, no new account, no wake word, no setup on the phone at all.

How it works, in plain terms

Telegram carries the audio both ways. It is the same chat we already use to give Bob jobs, so the recording, the playback and the delivery to any device are all handled for us, for free, even on a patchy signal.
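To make the Telegram leg a little more concrete, here is a minimal sketch. It assumes the bridge talks to the Telegram Bot API, which this post does not actually confirm. In that API an incoming voice note arrives as a message carrying a voice object with a file_id, and the audio is fetched from a path returned by the getFile method. The helper names extract_voice and file_download_url are made up for illustration.

```python
from typing import Optional, Tuple


def extract_voice(update: dict) -> Optional[Tuple[int, str]]:
    """Return (chat_id, file_id) if the update is a voice note, else None.

    Assumes the update is a Bot API style dict; hypothetical helper.
    """
    message = update.get("message") or {}
    voice = message.get("voice")
    chat = message.get("chat") or {}
    if not voice or "id" not in chat:
        return None
    return chat["id"], voice["file_id"]


def file_download_url(token: str, file_path: str) -> str:
    """Build the download URL for a file_path returned by getFile."""
    return f"https://api.telegram.org/file/bot{token}/{file_path}"
```

Typed messages fall through as None, so the same handler can keep passing them to the agent as before.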

When a voice note lands, the server turns it into text with Whisper, an open speech-to-text model that we run on our own machine. The point worth stressing: your voice is understood locally. Once the audio reaches our server it is transcribed there and is never passed on to an outside speech service. Bob then reads that text and works the task exactly as it would a typed message. Its reply is turned back into speech by a text-to-speech voice (the British one, naturally), packaged as a Telegram voice note, and sent back up the same chat.
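The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not our actual bridge code: the four callables (transcribe, run_agent, synthesize, send_voice) are hypothetical stand-ins for Whisper, Bob, the text-to-speech voice and the Telegram send, and the fallback for an empty transcript is our own addition for the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceBridge:
    """Wires the two new ends (hearing, speaking) around an existing agent.

    All four callables are hypothetical stand-ins:
      transcribe: audio bytes -> text (e.g. a locally run Whisper model)
      run_agent:  text -> reply text (the agent, unchanged)
      synthesize: reply text -> audio bytes (a text-to-speech voice)
      send_voice: (chat_id, audio bytes) -> None (delivery via Telegram)
    """
    transcribe: Callable[[bytes], str]
    run_agent: Callable[[str], str]
    synthesize: Callable[[str], bytes]
    send_voice: Callable[[int, bytes], None]

    def handle_voice_note(self, chat_id: int, audio: bytes) -> str:
        text = self.transcribe(audio).strip()
        if not text:
            # Assumed fallback: nothing intelligible heard, skip the agent.
            reply = "Sorry, I did not catch that."
        else:
            # From here the agent sees the same thing as a typed message.
            reply = self.run_agent(text)
        # Speak the reply and send it back up the same chat.
        self.send_voice(chat_id, self.synthesize(reply))
        return reply
```

Because each end is just a function passed in, the agent in the middle does not need to know whether a request was spoken or typed, and either end can be swapped without touching it.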

Why it took an afternoon and not a fortnight

Because the hard parts were already standing. Telegram does all the fiddly human-facing work: pressing record, the waveform, playing the reply, surviving a dropped connection, on every phone and laptop, with zero code from us. Tailscale, a private network that links our machines as if they shared one wifi, means the phone reaches the server with no public address to expose and nothing to lock down by hand. The agent in the middle has existed for months. All we added were the two small ends, hearing and speaking, and wired them into the bridge. Most of the value came from plumbing we had already paid for.

What is still off

What is now in the stack