SCRAPPYLABS  /  FIELD GUIDE

Put your agent in the meeting.
Stay the puppet master.

A field guide for building a meeting co-pilot that actually participates — not a delayed transcript, not a post-call summary. Your agent listens live, speaks when you say so, and answers when addressed. You keep the strings.

2026-05-26  ·  v1.0  ·  By ScrappyLabs  ·  Stack-agnostic  ·  3 reference paths

01 The gap nobody fills

There are dozens of AI meeting tools. All of them do the same thing — transcribe your call, summarize it after, maybe extract action items. Useful, but after the fact.

Otter · Fireflies · Granola · Read.ai · Fathom · tl;dv · Zoom AI Companion · MS Copilot · Gemini Notes

The gap is the during: an agent that's actually in the meeting, listening in real time, that you can puppet as the conversation unfolds. Make it answer the question that just got asked. Deliver the pricing slide verbally. Push back politely on the wrong assumption. Take the meeting while you go grab coffee.

The pieces have existed for two years.
Nobody's shipped a clean recipe. So here's one.

02 What "puppet master mode" actually means

Three operating modes. You toggle between them mid-call:

Mode A: Listen
Agent silently transcribes and contextualizes. You're talking. It's reading the room.
Controlled by: you (silent)

Mode B: Puppet (the win)
You type or whisper a prompt. The agent speaks it in your voice (or its own). Every utterance, your call.
Controlled by: you (every word)

Mode C: Auto-answer
Agent responds when @-mentioned, addressed by name, or a wake phrase fires. Inside the scope you set.
Controlled by: agent (in your guardrails)

The win is Puppet mode. Auto-answer is a bonus; transcription is table stakes. If your tool stops at "we transcribe your meetings", you're playing the same game as everyone else.
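The three modes above boil down to one gating question: should the agent speak right now? Here is a minimal sketch of that gate. The names `Mode`, `is_addressed`, and `should_speak` are hypothetical, and letting an operator prompt through in Auto-answer mode is an assumption, not something the guide specifies.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    LISTEN = "listen"       # Mode A: transcribe only, never speak
    PUPPET = "puppet"       # Mode B: speak only what the operator sends
    AUTO_ANSWER = "auto"    # Mode C: may also reply when addressed


def is_addressed(utterance: str, bot_name: str) -> bool:
    """True if the utterance contains the bot's name or an @-mention of it."""
    words = [w.strip(".,!?:;\"'").lower() for w in utterance.split()]
    name = bot_name.lower()
    return name in words or f"@{name}" in words


def should_speak(mode: Mode, operator_prompt: Optional[str], heard: str,
                 bot_name: str = "buddy") -> bool:
    """Decide whether the agent may speak, given the mode and what just happened."""
    has_prompt = bool(operator_prompt and operator_prompt.strip())
    if mode is Mode.LISTEN:
        return False
    if mode is Mode.PUPPET:
        return has_prompt
    # Assumption: in Auto-answer the operator can still puppet the agent.
    return has_prompt or is_addressed(heard, bot_name)


print(should_speak(Mode.AUTO_ANSWER, None, "Buddy, what's our SLA?"))
```

Keeping the gate as a pure function makes it easy to test each mode without any audio running.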

03 The five-box architecture

Every meeting bot is these five boxes plus you. Most existing tools fuse them into one vertically integrated product, so you can't swap pieces. That's the trap.

01 · Audio In · capture the meeting audio
02 · ASR · speech → text, streaming
03 · Brain · LLM + rolling transcript context
04 · TTS · text → voice
05 · Audio Out · inject back into the meeting
CONTROL CHANNEL · you steering, in real time

Trade-offs happen per box. Below: real options for each.
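One way to keep the five boxes swappable is to treat each as a plain callable and wire them together in one place. The sketch below is illustrative only: the `Pipeline` class, its field names, and the stub lambdas are hypothetical stand-ins, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    capture: Callable[[], bytes]          # Box 1: audio in
    transcribe: Callable[[bytes], str]    # Box 2: ASR
    think: Callable[[str, str], str]      # Box 3: brain (transcript, operator prompt)
    synthesize: Callable[[str], bytes]    # Box 4: TTS
    inject: Callable[[bytes], None]       # Box 5: audio out

    def step(self, operator_prompt: str) -> str:
        """Run one pass through all five boxes and return what was said."""
        audio = self.capture()
        transcript = self.transcribe(audio)
        reply = self.think(transcript, operator_prompt)
        self.inject(self.synthesize(reply))
        return reply


# Stub boxes so the wiring can be exercised without audio hardware.
played: List[bytes] = []
demo = Pipeline(
    capture=lambda: b"fake-pcm",
    transcribe=lambda audio: "what is the price?",
    think=lambda transcript, prompt: f"{prompt} (heard: {transcript})",
    synthesize=lambda text: text.encode("utf-8"),
    inject=played.append,
)
print(demo.step("Say forty-nine dollars"))
```

Swapping a local ASR for a cloud one then means replacing a single field, which is the property the vertically integrated tools give up.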

04 Pick a path per box

Box 1 — Audio in (capture)

| Path | How | Cost | Notes |
|---|---|---|---|
| Local sink | PipeWire (Linux), BlackHole (macOS), VB-Cable (Windows) | $0 | Loopback the meeting app's output |
| Tab capture | Chromium getDisplayMedia({audio:true}) via CDP | $0 | Works headless, no OS audio plumbing |
| Platform API | Zoom RTMS, Meet Media API | varies | Cleanest but platform-locked + permission-gated |

Box 2 — ASR (speech to text)

| Path | Tool | Cost | Latency |
|---|---|---|---|
| Local Whisper | whisper.cpp, faster-whisper, WhisperX | $0 + GPU | 200–800ms |
| Local Q3-ASR | Qwen3-ASR-Flash (multilingual, fast) | $0 + GPU | 200–500ms |
| Cloud stream | Deepgram Nova-3, AssemblyAI, Speechmatics | $0.004–0.02/min | 100–300ms |
| Cloud batch | OpenAI Whisper API, Google STT | $0.006/min | 1–3s |

For Puppet mode, ASR latency doesn't really matter because you're typing. For Auto-answer, budget under 500ms end-to-end; miss it and the reply lands late, so the agent talks over people.
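The Auto-answer budget is easiest to reason about as a sum of stages. The sketch below reads the 500ms figure as covering the whole chain from speech to first audible reply, which is one interpretation of "end-to-end". The stage names and the example numbers are made up for illustration; measure your own.

```python
AUTO_ANSWER_BUDGET_MS = 500  # the guide's end-to-end target for Auto-answer


def total_latency_ms(stages: dict) -> int:
    """Sum the per-stage latencies (all values in milliseconds)."""
    return sum(stages.values())


def overage_ms(stages: dict, budget_ms: int = AUTO_ANSWER_BUDGET_MS) -> int:
    """How far over budget the chain is; 0 means it fits."""
    return max(0, total_latency_ms(stages) - budget_ms)


# Hypothetical measurements: streaming ASR, LLM time-to-first-token,
# TTS time-to-first-audio.
example = {"asr": 200, "llm_first_token": 250, "tts_first_audio": 150}
print(total_latency_ms(example), overage_ms(example))  # 600 100
```

In this made-up case the chain misses the budget by 100ms even though the ASR stage alone sits comfortably inside the cloud-stream range, which is why the budget has to be tracked across boxes rather than per box.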

Box 3 — Brain (the LLM)

| Path | Tool | Cost | Notes |
|---|---|---|---|
| Local LLM | Ollama / vLLM / llama.cpp + Qwen3, Llama 3.3, Mistral, DeepSeek | $0 + GPU | Tool-calling capable models only |
| API | Claude, GPT-4o, Gemini | $0.50–15/M tokens | Best quality, fastest iteration |
| Hybrid | Local for routine, API for hard | varies | Route by question complexity |

The brain needs persistent context — the rolling transcript IS the system prompt. Most failures come from not feeding it the last 2–5 minutes before each generation. Don't be clever; just paste the transcript.
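A minimal sketch of "just paste the transcript": keep a time-windowed buffer and render it into the system prompt before every generation. `RollingTranscript` and its method names are hypothetical, and the prompt wording is a placeholder.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingTranscript:
    window_seconds: float = 300.0  # 5 minutes, the top of the guide's 2–5 minute range
    _lines: deque = field(default_factory=deque)

    def add(self, timestamp: float, speaker: str, text: str) -> None:
        """Append one transcribed line and drop anything older than the window."""
        self._lines.append((timestamp, speaker, text))
        self._trim(timestamp)

    def _trim(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._lines and self._lines[0][0] < cutoff:
            self._lines.popleft()

    def as_system_prompt(self, now: float) -> str:
        """Render the lines still inside the window as the system prompt."""
        self._trim(now)
        body = "\n".join(f"{speaker}: {text}" for _, speaker, text in self._lines)
        return "You are in a live meeting. Recent transcript:\n" + body


transcript = RollingTranscript(window_seconds=120)
transcript.add(0, "Ana", "Let's start with pricing.")
transcript.add(100, "Ben", "What's the per-seat cost?")
print(transcript.as_system_prompt(now=150))
```

Timestamps are passed in explicitly rather than read from the clock so the window logic can be tested deterministically.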

Box 4 — TTS (text to voice)

| Path | Tool | Cost | Notes |
|---|---|---|---|
| Local neural | Piper, Kokoro, XTTS-v2, F5-TTS, Qwen3-TTS | $0 + GPU/CPU | Kokoro + Piper run on CPU |
| Cloud premium | ElevenLabs, Cartesia Sonic, PlayHT | $0.10–0.30/1K chars | Best naturalness, voice clones |
| Cloud commodity | OpenAI TTS, Google TTS, Azure | $0.015–0.03/1K chars | Good enough, cheap |

Voice cloning matters more than you think. In your own voice, people forget it's a bot within 30 seconds. With a generic stock voice, every utterance breaks the spell.

Box 5 — Audio out (inject)

| Path | How | Notes |
|---|---|---|
| Virtual mic | PipeWire module-null-sink as the meeting's microphone | Linux default |
| Aggregate device | macOS aggregate of real mic + BlackHole | macOS default |
| Browser inject | MediaStreamAudioSourceNode in Chromium via CDP | No OS plumbing |
| PSTN dial-in | Telnyx, Twilio, Vonage: bot dials the phone bridge | Universal but $0.005–0.01/min |

05 Two ways to be in the meeting

Two postures the bot can take. Same five boxes underneath — different identity model.

Posture 1 — Separate participant

The bot has its own seat, name tag, video tile (avatar or static image). Joins via its own browser profile or account.

✓ Clearly identified as the AI  ·  ✓ Can stay after you leave  ·  ✗ Needs its own account  ·  ✗ Some hosts auto-eject unknowns

Best for: external — sales, discovery, anything where consent + transparency matter.

Posture 2 — Co-pilot (your mic)

The bot's audio is mixed into your mic feed. From the meeting's view, it's all coming from you.

✓ No extra participant  ·  ✓ Works when bots are banned  ·  ✗ Needs a voice clone of you  ·  ✗ Harder to leave alone

Best for: internal — backup brain on standups, technical reviews, anywhere a third name on screen is weird.

06 The control channel (the puppet strings)

How do you drive it in real-time without breaking eye contact and staring at a terminal? This is the part most tutorials skip.

| Channel | Setup | Best for |
|---|---|---|
| Terminal | bot say "the price is forty-nine dollars" in a tmux pane | Solo / dev workflow |
| Hotkey + voice | Push-to-talk hotkey → ASR → bot speaks the transcript | Hands-on-keyboard, eyes-on-meeting |
| Phone DM | Type into Telegram/Slack DM → bot speaks it | Phone-as-puppet, invisible to camera |
| Mention trigger | "Buddy, what's our SLA?" → RAG-backed reply | Auto-answer mode |

Most people land on terminal + phone in practice. The phone is the killer channel: it's invisible to the camera, and you can use it while making eye contact.
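Whichever channel you pick, it helps if the terminal and the phone DM feed the same tiny command parser. This sketch assumes a two-verb grammar (`say` and `mode`) modeled on the guide's `bot say "..."` example. The `mode` verb, the `Command` type, and `parse_command` are hypothetical.

```python
from dataclasses import dataclass

ACTIONS = {"say", "mode"}  # assumption: only these two verbs exist


@dataclass
class Command:
    action: str
    argument: str


def parse_command(line: str) -> Command:
    """Parse 'bot say "..."', 'say ...', or 'mode puppet' into a Command."""
    text = line.strip()
    if text.lower().startswith("bot "):
        text = text[4:].lstrip()  # the 'bot' prefix is optional
    action, _, argument = text.partition(" ")
    action = action.lower()
    argument = argument.strip()
    # Drop one pair of surrounding double quotes, as typed in a shell.
    if len(argument) >= 2 and argument[0] == argument[-1] == '"':
        argument = argument[1:-1]
    if action not in ACTIONS or not argument:
        raise ValueError(f"unrecognized command: {line!r}")
    return Command(action, argument)


print(parse_command('bot say "the price is forty-nine dollars"'))
```

The parser deliberately avoids shell-style tokenizing, so a phone message containing an apostrophe (for example `say what's our SLA`) passes through intact.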

07 Three reference stacks

Pick the path that matches what you already have. All three produce the same outcome.

PATH A · Full Local
$0/min · needs hardware
Audio: PipeWire null-sink
ASR: Whisper.cpp / Q3-ASR
Brain: Qwen3 / Llama 3.3
TTS: Piper / XTTS / Q3-TTS
Out: PipeWire null-source
Control: tmux + phone over LAN/Tailscale
Hardware floor: one 24GB GPU runs the full stack (Whisper-large + Qwen3-30B Q4 + Kokoro/Piper).

PATH B · Hybrid (RECOMMENDED)
~$0.05–0.15/min · no GPU needed
Audio: BlackHole (Mac) / PipeWire
ASR: Deepgram Nova-3 stream
Brain: Claude Sonnet 4.6 API
TTS: Cartesia Sonic-2 / ElevenLabs
Out: Aggregate device → meeting
Control: tmux + phone
Best quality-per-effort ratio for anyone without a GPU. This is what we'd build first if we were starting over.

PATH C · All Cloud
~$0.10–0.20/min · rented VM
Audio: Headless Chromium tab capture
ASR: AssemblyAI Universal-Streaming
Brain: GPT-4o
TTS: OpenAI TTS
Out: Chromium MediaStream
Control: Web puppet panel (you build it)
Runs on a rented VM, zero local install. Useful when you're building this as a hosted product for others.
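The per-minute figures for the cloud-backed paths can be sanity-checked by adding up the metered boxes. The sketch below uses prices picked from inside the ranges in the per-box tables. The usage numbers (tokens and spoken characters per meeting minute) are assumptions for illustration, and the `Prices` and `PerMinuteUsage` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    asr_per_min: float              # dollars per audio minute
    llm_per_million_tokens: float   # dollars per million tokens
    tts_per_1k_chars: float         # dollars per 1,000 characters


@dataclass
class PerMinuteUsage:
    llm_tokens: int   # assumed tokens sent + received per meeting minute
    tts_chars: int    # assumed characters the agent speaks per meeting minute


def cost_per_minute(prices: Prices, usage: PerMinuteUsage) -> float:
    """Sum the metered boxes; local audio in/out is treated as free."""
    asr = prices.asr_per_min
    llm = prices.llm_per_million_tokens * usage.llm_tokens / 1_000_000
    tts = prices.tts_per_1k_chars * usage.tts_chars / 1_000
    return asr + llm + tts


# Illustrative hybrid setup: mid-range streaming ASR, mid-range API model,
# premium TTS at the low end of its range.
hybrid = Prices(asr_per_min=0.01, llm_per_million_tokens=3.0, tts_per_1k_chars=0.10)
usage = PerMinuteUsage(llm_tokens=4000, tts_chars=400)
print(f"${cost_per_minute(hybrid, usage):.3f}/min")  # $0.062/min
```

With these made-up usage numbers TTS is the largest line item, so how much the agent talks moves the bill more than which ASR you pick.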

08 Build in this order (the only one that works)

Resist the urge to start with the brain. Audio plumbing is where attempts die.

  1. Audio loopback proven first. Virtual mic the meeting app sees + a virtual sink that captures meeting audio. Test by playing a .wav into the virtual mic and confirming the other participant hears it. Do not move on until this is rock solid.
  2. TTS into the meeting. Pipe TTS output to the virtual mic. You now have a "type → others hear it" loop. This alone is a useful tool.
  3. ASR off the meeting audio. Capture the sink, feed to ASR, see live transcript in your terminal. Now you can read what's happening.
  4. Brain + rolling transcript context. Wrap an LLM, give it the last 2–5 minutes of transcript as context, expose a say(prompt) command.
  5. Puppet mode. Add one control channel (terminal, phone, or Slack — pick one). Make it work well before adding more.
  6. Auto-answer (optional, last). Wake-word or @-mention detection → auto-trigger. Add this last — it's the most likely to embarrass you.
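Step 1 above needs a known .wav to push into the virtual mic. If you don't have one handy, the standard library can generate a test tone. The function name, default tone, and sample rate below are arbitrary choices, not requirements of any meeting app.

```python
import math
import struct
import wave


def write_test_tone(path: str, seconds: float = 2.0, freq_hz: float = 440.0,
                    sample_rate: int = 16000) -> int:
    """Write a mono 16-bit PCM sine tone to `path`; return the frame count."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # 30% of full scale so the tone is audible without being harsh.
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    print(write_test_tone("test_tone.wav"), "frames written")
```

Play the file into the virtual mic with whatever player your platform provides, and have a second participant confirm they hear a steady tone before moving to step 2.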

09 What it looks like

A 60–90 second demo: Buddy joins a meeting, gets puppeted by Brian, switches to auto-answer, and exits on command.

DEMO VIDEO — COMING SOON

Recorded with Buddy on Google Meet

10 Pitfalls we hit so you don't have to

Echo loops

If the bot's TTS reaches your real mic, the bot will transcribe itself and respond to itself. Mute your real mic in puppet mode, or route the bot's audio to a sink the mic physically can't pick up.
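Audio routing is the real fix, but a software backstop can catch leaks: remember what the bot recently said and drop transcript lines that closely match it. This is a supplementary technique, not something the guide prescribes. The `EchoGuard` class is hypothetical, and the 0.8 similarity threshold is a guess to tune against your own ASR output.

```python
from collections import deque
from difflib import SequenceMatcher


class EchoGuard:
    """Flags transcript lines that look like the bot's own recent speech."""

    def __init__(self, max_remembered: int = 20, threshold: float = 0.8):
        self._spoken = deque(maxlen=max_remembered)
        self._threshold = threshold

    @staticmethod
    def _normalize(text: str) -> str:
        # Lowercase, turn punctuation into spaces, collapse whitespace.
        cleaned = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in text)
        return " ".join(cleaned.split())

    def remember_spoken(self, text: str) -> None:
        """Call this with every string handed to TTS."""
        self._spoken.append(self._normalize(text))

    def is_echo(self, transcript_line: str) -> bool:
        """True if the ASR line closely matches something the bot just said."""
        heard = self._normalize(transcript_line)
        if not heard:
            return False
        return any(
            SequenceMatcher(None, heard, said).ratio() >= self._threshold
            for said in self._spoken
        )


guard = EchoGuard()
guard.remember_spoken("The price is forty-nine dollars.")
print(guard.is_echo("the price is forty nine dollars"))
```

Lines flagged as echo should be kept out of both the rolling transcript and any Auto-answer trigger, otherwise the bot can still end up replying to itself.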

Same-domain Meet "adaptive audio"

Google Meet has an adaptive-audio feature that suppresses one mic when it detects two participants on the same Workspace domain in the same room. Use a different domain for the bot's account than yours. (Cost us a day.)

PipeWire linger nodes

pactl unload-module doesn't always free a virtual node — if you created it with object.linger=true, you also have to pw-cli destroy <global-id>. Otherwise you accumulate duplicate sinks across runs.

iOS WebSocket to Tailscale CGNAT

If you're building a mobile puppet controller: URLSessionWebSocketTask silently refuses ws://100.x.x.x Tailscale addresses with no error. Use Starscream or NWConnection.

Headless browsers reveal themselves

Meeting platforms increasingly detect headless Chromium and shadow-ban the bot. Run a real browser (windowed, even if you never look at it) for any external meeting.

TTS voice mismatch in Posture 2

If you augment your own mic with a generic TTS voice, every utterance is jarring. Clone your voice (XTTS, F5, ElevenLabs) before going live in co-pilot mode.

11 What's NOT in this guide

Build it yourself, or ship faster.

Everything above is free. If you want help shipping it — voice cloning at scale, hosted brain, meeting-platform update channel, the boring parts — that's what we do.