myVoxtral/docs/superpowers/specs/2026-04-07-myvoxtral-design.md
2026-04-07 19:15:49 +02:00

MyVoxtral — macOS Realtime Transcription App

Overview

A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.

Target

  • macOS 14+
  • SwiftUI
  • No third-party dependencies

API Protocol

Connection

  • WebSocket to wss://api.mistral.ai/v1/audio/transcriptions/realtime
  • Auth: Bearer token in WebSocket upgrade headers
  • Model: voxtral-mini-transcribe-realtime-2602

Audio Format

  • PCM 16-bit signed little-endian, mono, 16kHz
  • 480ms chunks (15,360 bytes raw: 16,000 samples/s × 0.48 s × 2 bytes per sample)
  • Sent as base64-encoded strings in JSON messages
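
The chunk size follows directly from the format parameters. The snippet below is a minimal sketch of that arithmetic and of the base64 step; the constant names are illustrative and not part of the spec.

```swift
import Foundation

// Format parameters taken from the Audio Format list above.
let sampleRate = 16_000          // samples per second
let bytesPerSample = 2           // 16-bit PCM
let channels = 1                 // mono
let chunkMilliseconds = 480

// Raw bytes per chunk: 16,000 × 0.48 × 2 × 1 = 15,360.
let chunkBytes = sampleRate * chunkMilliseconds / 1000 * bytesPerSample * channels

// Base64 turns every 3 bytes into 4 characters, so the JSON payload is larger
// than the raw chunk.
let silence = Data(count: chunkBytes)
let encoded = silence.base64EncodedString()

print(chunkBytes)      // 15360
print(encoded.count)   // 20480
```

Each input_audio.append message therefore carries roughly 20 KB of text for 480 ms of audio.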

Messages Sent (Client → Server)

{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
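
One possible way to model these four messages in VoxtralMessages.swift is an Encodable enum. This is a sketch only: the type names (ClientMessage, SessionConfig, AudioFormat) and the jsonString helper are hypothetical, and sorted keys are used purely to make the output deterministic.

```swift
import Foundation

// Hypothetical types mirroring the client messages listed above.
struct AudioFormat: Codable, Equatable {
    var encoding = "pcm_s16le"
    var sampleRate = 16_000
    enum CodingKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }
}

struct SessionConfig: Codable, Equatable {
    var audioFormat = AudioFormat()
    var targetStreamingDelayMs = 240
    enum CodingKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
}

enum ClientMessage: Encodable {
    case append(Data)                 // raw PCM bytes, base64-encoded on the wire
    case flush
    case end
    case sessionUpdate(SessionConfig)

    enum CodingKeys: String, CodingKey { case type, audio, session }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .append(let pcm):
            try container.encode("input_audio.append", forKey: .type)
            try container.encode(pcm.base64EncodedString(), forKey: .audio)
        case .flush:
            try container.encode("input_audio.flush", forKey: .type)
        case .end:
            try container.encode("input_audio.end", forKey: .type)
        case .sessionUpdate(let config):
            try container.encode("session.update", forKey: .type)
            try container.encode(config, forKey: .session)
        }
    }
}

// Serializes a message to the JSON text that would be sent over the socket.
func jsonString(for message: ClientMessage) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(message), as: UTF8.self)
}
```

Keeping the wire names in CodingKeys lets the rest of the app use Swift-style property names.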

Events Received (Server → Client)

Event type                 Key fields                                    Description
session.created            .session                                      Connection established
transcription.text.delta   .text (string)                                Incremental transcribed text
transcription.segment      .text, .start, .end, .speaker_id              Segment-level transcription
transcription.language     .audio_language                               Detected language
transcription.done         .model, .text, .usage, .language, .segments   Session complete
error                      .error.message.detail                         Error details
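
Dispatching on the type field could look like the sketch below. It assumes the key fields in the table are top-level JSON keys, which the spec's dot notation suggests but does not state outright. ServerEvent, Envelope, and parseServerEvent are hypothetical names, and only a subset of events is modeled; anything else (including error) is passed through as .other.

```swift
import Foundation

// Hypothetical event model covering part of the table above.
enum ServerEvent: Equatable {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(text: String)
    case other(type: String)   // unmodeled or unknown event types
}

// Minimal envelope: decode `type` first, plus the optional fields we care about.
struct Envelope: Decodable {
    let type: String
    let text: String?
    let audioLanguage: String?
    enum CodingKeys: String, CodingKey {
        case type, text
        case audioLanguage = "audio_language"
    }
}

func parseServerEvent(_ json: String) throws -> ServerEvent {
    let envelope = try JSONDecoder().decode(Envelope.self, from: Data(json.utf8))
    switch envelope.type {
    case "session.created":
        return .sessionCreated
    case "transcription.text.delta":
        return .textDelta(envelope.text ?? "")
    case "transcription.language":
        return .language(envelope.audioLanguage ?? "")
    case "transcription.done":
        return .done(text: envelope.text ?? "")
    default:
        return .other(type: envelope.type)
    }
}
```

Falling back to .other keeps the client tolerant of event types added to the API later.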

Architecture

Four layers, data flows one direction:

Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor

1. AudioCapture

  • AVAudioEngine with input node tap
  • Converts captured audio to PCM 16-bit LE mono @ 16kHz
  • Yields Data chunks every ~480ms
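
The conversion and chunking steps can be sketched independently of AVAudioEngine. The example below assumes the tap already delivers mono Float32 samples at 16kHz (resampling from the hardware format is omitted); pcm16LittleEndian and PCMChunker are hypothetical names.

```swift
import Foundation

// Convert Float32 samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
func pcm16LittleEndian(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))
        let value = Int16(clamped * Float(Int16.max))
        let bits = UInt16(bitPattern: value)
        data.append(UInt8(bits & 0xFF))   // low byte first (little-endian)
        data.append(UInt8(bits >> 8))     // then high byte
    }
    return data
}

// Accumulates PCM bytes and emits fixed-size chunks (15,360 bytes = 480 ms).
struct PCMChunker {
    let chunkSize: Int
    private var buffer = Data()

    init(chunkSize: Int = 15_360) {
        self.chunkSize = chunkSize
    }

    // Adds new bytes and returns every complete chunk now available.
    mutating func append(_ data: Data) -> [Data] {
        buffer.append(data)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer.prefix(chunkSize)))
            buffer = Data(buffer.dropFirst(chunkSize))
        }
        return chunks
    }

    // Returns whatever is left (possibly empty), for use when recording stops.
    mutating func drain() -> Data {
        defer { buffer = Data() }
        return buffer
    }
}
```

Because tap buffers rarely line up with 480 ms boundaries, the chunker carries the remainder over to the next callback.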

2. VoxtralWebSocketClient

  • Uses URLSessionWebSocketTask (no dependencies)
  • Connects with Bearer token in headers
  • Sends JSON messages with base64-encoded audio
  • Sends flush/end control messages on stop
  • Parses incoming JSON by type field
  • Exposes text deltas via AsyncStream or callback
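
A minimal sketch of the connection setup, assuming the Bearer token goes in a standard Authorization header on the upgrade request. The spec does not say how the model name is passed, so that part is left out; makeUpgradeRequest and openSocket are hypothetical helpers.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on non-Apple platforms
#endif

// Endpoint from the Connection section of this spec.
let realtimeEndpoint = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!

// Builds the WebSocket upgrade request carrying the API key.
func makeUpgradeRequest(apiKey: String) -> URLRequest {
    var request = URLRequest(url: realtimeEndpoint)
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}

#if canImport(Darwin)
// On macOS the request is handed to URLSession, which performs the upgrade.
func openSocket(apiKey: String) -> URLSessionWebSocketTask {
    let task = URLSession.shared.webSocketTask(with: makeUpgradeRequest(apiKey: apiKey))
    task.resume()
    return task
}
#endif
```

JSON messages would then go out via the task's send method as string frames, and incoming frames would be read with its receive method.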

3. TranscriptionManager (ObservableObject)

  • Owns AudioCapture and VoxtralWebSocketClient
  • State machine: .idle → .recording → .idle (or .error(String))
  • Accumulates text from deltas into current session buffer
  • On stop: appends timestamped session to log file
  • Output mode: .textBox or .cursorInjection (togglable)
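
The state machine and delta buffer can be sketched as a plain value type, leaving out the ObservableObject wrapper and the audio and network ownership. RecordingState and TranscriptionSession are hypothetical names.

```swift
// States named in the spec: idle, recording, or an error with a message.
enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

struct TranscriptionSession {
    private(set) var state: RecordingState = .idle
    private(set) var buffer = ""

    // .idle (or .error) → .recording, starting with an empty buffer.
    mutating func start() {
        guard state != .recording else { return }
        buffer = ""
        state = .recording
    }

    // Deltas are only accumulated while recording.
    mutating func receive(delta: String) {
        guard state == .recording else { return }
        buffer += delta
    }

    // .recording → .idle; returns the finished text so the caller can log it.
    mutating func stop() -> String? {
        guard state == .recording else { return nil }
        state = .idle
        return buffer
    }

    mutating func fail(_ message: String) {
        state = .error(message)
    }
}
```

Returning the text from stop() gives the manager a single place to hand the session to the logger.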

4. UI Components

MenuBarExtra

  • Mic icon in menu bar, changes color when recording
  • Dropdown: Start/Stop, Settings, Quit

TranscriptionWindow

  • Small floating NSPanel (.floating window level)
  • Shows accumulated transcription text
  • Copy button
  • Opens when recording starts (in textBox mode)

SettingsView

  • API key text field (stored in UserDefaults)
  • Global shortcut picker (configurable key combo)
  • Output mode toggle (text box vs cursor injection)
  • Latency slider (240ms to 2400ms, maps to target_streaming_delay_ms)

Cursor Injection

  • CGEvent to simulate keystrokes for each character of received text delta
  • Requires Accessibility permission
  • Check via AXIsProcessTrusted() on first use; prompt if missing
  • Falls back to text box mode if permission denied
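
A sketch of per-character injection, assuming each character is posted as a key-down/key-up pair whose Unicode payload is set explicitly (virtual key 0 is only a placeholder). utf16Keystrokes and injectAtCursor are hypothetical names; the macOS-only part is guarded so the helper can be exercised elsewhere.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
#endif

// Split text into per-character UTF-16 unit arrays, the form CGEvent expects.
func utf16Keystrokes(for text: String) -> [[UInt16]] {
    text.map { Array(String($0).utf16) }
}

#if canImport(ApplicationServices)
// Returns false when Accessibility permission is missing, so the caller can
// fall back to text box mode.
func injectAtCursor(_ text: String) -> Bool {
    guard AXIsProcessTrusted() else { return false }
    let source = CGEventSource(stateID: .combinedSessionState)
    for units in utf16Keystrokes(for: text) {
        for keyDown in [true, false] {
            guard let event = CGEvent(keyboardEventSource: source,
                                      virtualKey: 0,
                                      keyDown: keyDown) else { continue }
            // Override the key's character with the delta's actual text.
            event.keyboardSetUnicodeString(stringLength: units.count, unicodeString: units)
            event.post(tap: .cghidEventTap)
        }
    }
    return true
}
#endif
```

Setting the Unicode string avoids mapping characters to layout-specific key codes, which matters for accented letters and emoji.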

Global Shortcut

  • NSEvent.addGlobalMonitorForEvents(matching: .keyDown) for the configured combo (key events are only delivered to global monitors when the app is trusted for Accessibility)
  • Default: unset (user must configure)
  • Stored in UserDefaults as modifier flags + key code
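
One way to represent the stored combo is a small value type holding the key code and the device-independent modifier flags as raw integers. The Shortcut type, its dictionary keys, and installMonitor are hypothetical; the AppKit part is guarded because it only exists on macOS.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

// Hypothetical storage model: key code plus modifier flags as raw integers.
struct Shortcut: Equatable {
    var keyCode: UInt16
    var modifiers: UInt

    init(keyCode: UInt16, modifiers: UInt) {
        self.keyCode = keyCode
        self.modifiers = modifiers
    }

    // Property-list-friendly form, suitable for UserDefaults.
    var dictionary: [String: Int] {
        ["keyCode": Int(keyCode), "modifiers": Int(modifiers)]
    }

    // Rebuilds a shortcut from stored values; nil if anything is missing or out of range.
    init?(dictionary: [String: Int]) {
        guard let code = dictionary["keyCode"],
              let mods = dictionary["modifiers"],
              let keyCode = UInt16(exactly: code),
              let modifiers = UInt(exactly: mods) else { return nil }
        self.init(keyCode: keyCode, modifiers: modifiers)
    }

    func matches(keyCode: UInt16, modifiers: UInt) -> Bool {
        self.keyCode == keyCode && self.modifiers == modifiers
    }
}

#if canImport(AppKit)
// Installs a global key-down monitor that fires when the combo matches.
func installMonitor(for shortcut: Shortcut, onTrigger: @escaping () -> Void) -> Any? {
    NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { event in
        let flags = event.modifierFlags.intersection(.deviceIndependentFlagsMask).rawValue
        if shortcut.matches(keyCode: event.keyCode, modifiers: flags) {
            onTrigger()
        }
    }
}
#endif
```

Masking with deviceIndependentFlagsMask before comparing keeps the match independent of which physical modifier key was pressed.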

Transcription Log

  • Append-only text file at ~/Library/Application Support/MyVoxtral/transcription.log
  • Format: [ISO-8601 timestamp]\n<transcribed text>\n---\n
  • Written on session end (stop recording or transcription.done)
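
The entry format and append step can be sketched with Foundation alone. formatLogEntry and appendToLog are hypothetical names, and the timestamp uses ISO8601DateFormatter's default UTC output, which the spec does not pin down further.

```swift
import Foundation

// Builds one entry in the format "[ISO-8601 timestamp]\n<text>\n---\n".
func formatLogEntry(text: String, at date: Date) -> String {
    let formatter = ISO8601DateFormatter()   // defaults to UTC, e.g. 2026-04-07T17:15:49Z
    return "[\(formatter.string(from: date))]\n\(text)\n---\n"
}

// Appends an entry to the log, creating the directory and file if needed.
func appendToLog(_ entry: String, at url: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: url.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    if !fileManager.fileExists(atPath: url.path) {
        fileManager.createFile(atPath: url.path, contents: nil)
    }
    let handle = try FileHandle(forWritingTo: url)
    defer { try? handle.close() }
    try handle.seekToEnd()                   // append-only: never overwrite
    try handle.write(contentsOf: Data(entry.utf8))
}
```

In the app, the URL would point at the transcription.log path given above inside Application Support.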

Error Handling

Condition                          Behavior
No API key set                     Open Settings automatically on first launch
WebSocket disconnect               Auto-retry once, then show error in menu bar
Mic permission denied              System alert → System Settings > Privacy
Accessibility permission missing   Check via AXIsProcessTrusted(), prompt (e.g. AXIsProcessTrustedWithOptions), fall back to text box
Invalid API key (401)              Show error in dropdown, stop recording
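
The "auto-retry once" rule for WebSocket disconnects could be tracked with a small counter, sketched below. ReconnectPolicy and its method names are hypothetical, and resetting the counter after a successful reconnect is an assumption the spec does not state.

```swift
// Decides what to do on a disconnect: retry up to maxRetries times, then surface an error.
struct ReconnectPolicy {
    enum Action: Equatable {
        case retry
        case showError
    }

    let maxRetries: Int
    private(set) var attempts = 0

    init(maxRetries: Int = 1) {
        self.maxRetries = maxRetries
    }

    mutating func onDisconnect() -> Action {
        guard attempts < maxRetries else { return .showError }
        attempts += 1
        return .retry
    }

    // Assumption: a successful connection makes the next disconnect eligible for a retry again.
    mutating func onConnected() {
        attempts = 0
    }
}
```

A 401 response would bypass this policy entirely, since retrying with the same invalid key cannot succeed.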

File Structure

MyVoxtral/
├── MyVoxtralApp.swift              # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift  # Orchestrator, state machine
│   └── AppSettings.swift           # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift          # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift # WebSocket protocol
│   └── VoxtralMessages.swift       # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift   # Floating text panel
│   ├── SettingsView.swift          # Preferences window
│   └── MenuBarView.swift           # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift        # CGEvent keystroke sim
│   ├── GlobalShortcut.swift        # Configurable hotkey
│   └── TranscriptionLogger.swift   # Append to log file
└── Resources/
    └── Assets.xcassets              # Menu bar icons