MyVoxtral — macOS Realtime Transcription App
Overview
A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.
Target
- macOS 14+
- SwiftUI
- No third-party dependencies
API Protocol
Connection
- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
- Auth: Bearer token in WebSocket upgrade headers
- Model: `voxtral-mini-transcribe-realtime-2602`
Audio Format
- PCM 16-bit signed little-endian, mono, 16kHz
- 480ms chunks (~15,360 bytes raw)
- Sent as base64-encoded strings in JSON messages
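The chunk size quoted above follows directly from the audio format. A small sketch of the arithmetic, which also shows how large each base64 payload becomes (the constant names are illustrative, not taken from the project):

```swift
import Foundation

// Audio parameters from the spec above; the names are illustrative.
let sampleRate = 16_000        // samples per second
let bytesPerSample = 2         // 16-bit PCM
let channels = 1               // mono
let chunkMilliseconds = 480

// 16,000 samples/s x 2 bytes x 1 channel x 0.48 s
let chunkBytes = sampleRate * bytesPerSample * channels * chunkMilliseconds / 1000

// Base64 encodes every 3 input bytes as 4 output characters.
let silentChunk = Data(count: chunkBytes)
let encodedChunk = silentChunk.base64EncodedString()

print(chunkBytes)           // 15360
print(encodedChunk.count)   // 20480
```

Each 480ms chunk therefore travels as roughly 20 KB of JSON text, about a third larger than the raw PCM.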
Messages Sent (Client → Server)
{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
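One way to produce these messages is a single `Encodable` enum with a hand-written `encode(to:)`. This is a minimal sketch: the type name `ClientMessage` and its case names are hypothetical, and only the JSON keys and values come from the messages listed above.

```swift
import Foundation

// Hypothetical Swift mirror of the client messages listed above.
enum ClientMessage: Encodable {
    case append(audio: Data)
    case flush
    case end
    case sessionUpdate(sampleRate: Int, delayMs: Int)

    private enum TopKeys: String, CodingKey { case type, audio, session }
    private enum SessionKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
    private enum FormatKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }

    func encode(to encoder: Encoder) throws {
        var top = encoder.container(keyedBy: TopKeys.self)
        switch self {
        case .append(let audio):
            try top.encode("input_audio.append", forKey: .type)
            // Audio is sent as a base64 string, as the spec requires.
            try top.encode(audio.base64EncodedString(), forKey: .audio)
        case .flush:
            try top.encode("input_audio.flush", forKey: .type)
        case .end:
            try top.encode("input_audio.end", forKey: .type)
        case .sessionUpdate(let sampleRate, let delayMs):
            try top.encode("session.update", forKey: .type)
            var session = top.nestedContainer(keyedBy: SessionKeys.self, forKey: .session)
            try session.encode(delayMs, forKey: .targetStreamingDelayMs)
            var format = session.nestedContainer(keyedBy: FormatKeys.self, forKey: .audioFormat)
            try format.encode("pcm_s16le", forKey: .encoding)
            try format.encode(sampleRate, forKey: .sampleRate)
        }
    }
}

// Helper for the usage example; sorted keys make the output deterministic.
func jsonString(_ message: ClientMessage) -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try! encoder.encode(message), as: UTF8.self)
}

print(jsonString(.flush))   // {"type":"input_audio.flush"}
```

Keeping the wire strings inside one `encode(to:)` means a protocol change touches a single function rather than several structs.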
Events Received (Server → Client)
| Event type | Key fields | Description |
|---|---|---|
| `session.created` | `.session` | Connection established |
| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
| `transcription.language` | `.audio_language` | Detected language |
| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
| `error` | `.error.message.detail` | Error details |
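Incoming events can be dispatched on their `type` string. The sketch below models only the fields this app would plausibly need. The `ServerEvent` enum and `parseEvent` function are hypothetical names, and the sketch assumes `.error.message.detail` is a string, which the table does not confirm.

```swift
import Foundation

// Hypothetical decoded form of the events in the table above.
enum ServerEvent {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(fullText: String)
    case error(String)
    case other(type: String)   // events this client does not act on
}

struct MissingTypeError: Error {}

func parseEvent(_ data: Data) throws -> ServerEvent {
    guard let object = try JSONSerialization.jsonObject(with: data) as? [String: Any],
          let type = object["type"] as? String else {
        throw MissingTypeError()
    }
    switch type {
    case "session.created":
        return .sessionCreated
    case "transcription.text.delta":
        return .textDelta(object["text"] as? String ?? "")
    case "transcription.language":
        return .language(object["audio_language"] as? String ?? "")
    case "transcription.done":
        return .done(fullText: object["text"] as? String ?? "")
    case "error":
        // Follows the .error.message.detail path from the table;
        // assumes the detail is a string, falling back otherwise.
        let error = object["error"] as? [String: Any]
        let message = error?["message"] as? [String: Any]
        return .error(message?["detail"] as? String ?? "unknown error")
    default:
        return .other(type: type)
    }
}
```

Unknown event types fall through to `.other` rather than throwing, so a server-side addition would not break an otherwise healthy session.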
Architecture
Four layers, data flows one direction:
Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
1. AudioCapture
- `AVAudioEngine` with input node tap
- Converts captured audio to PCM 16-bit LE mono @ 16kHz
- Yields `Data` chunks every ~480ms
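The engine tap itself needs real audio hardware, but the two pure steps behind it (sample conversion and fixed-size chunking) can be sketched in isolation. This assumes the tap delivers Float32 samples already resampled to 16kHz mono; `pcm16LE(from:)` and `PCMChunker` are hypothetical helpers, not part of the spec.

```swift
import Foundation

/// Converts Float32 samples in [-1, 1] to 16-bit signed little-endian PCM.
func pcm16LE(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

/// Accumulates PCM bytes and hands back fixed-size chunks.
struct PCMChunker {
    let chunkSize: Int
    private var buffer: [UInt8] = []

    // 15,360 bytes is 480ms of 16kHz 16-bit mono audio.
    init(chunkSize: Int = 15_360) {
        self.chunkSize = chunkSize
    }

    /// Adds bytes and returns every complete chunk now available.
    mutating func append(_ data: Data) -> [Data] {
        buffer.append(contentsOf: data)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer[..<chunkSize]))
            buffer.removeFirst(chunkSize)
        }
        return chunks
    }

    /// Returns the remainder (shorter than one chunk), e.g. when recording stops.
    mutating func drain() -> Data {
        let rest = Data(buffer)
        buffer.removeAll()
        return rest
    }
}
```

In the tap callback, each converted buffer would be passed to `append(_:)` and every returned chunk forwarded to the WebSocket client; `drain()` covers the final partial chunk on stop.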
2. VoxtralWebSocketClient
- Uses `URLSessionWebSocketTask` (no dependencies)
- Connects with Bearer token in headers
- Sends JSON messages with base64-encoded audio
- Sends flush/end control messages on stop
- Parses incoming JSON by `type` field
- Exposes text deltas via AsyncStream or callback
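The connection step might begin with a request like the one below. It is a sketch only: `makeRealtimeRequest` is a hypothetical helper, and because the spec does not say how the model name is transmitted, the sketch leaves it out.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Builds the WebSocket upgrade request with the Bearer token header.
func makeRealtimeRequest(apiKey: String) -> URLRequest {
    // Endpoint taken from the Connection section.
    let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!
    var request = URLRequest(url: url)
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}

let exampleRequest = makeRealtimeRequest(apiKey: "YOUR_API_KEY")
print(exampleRequest.value(forHTTPHeaderField: "Authorization") ?? "")
```

On macOS the request would then be passed to `URLSession.shared.webSocketTask(with:)`, followed by `resume()`, with `send(_:)` and `receive()` carrying the JSON messages as `.string` frames.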
3. TranscriptionManager (ObservableObject)
- Owns AudioCapture and VoxtralWebSocketClient
- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
- Accumulates text from deltas into current session buffer
- On stop: appends timestamped session to log file
- Output mode: `.textBox` or `.cursorInjection` (togglable)
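The state transitions and buffer handling can be sketched as a plain value type, which the real `ObservableObject` could wrap and publish. `SessionState` and its method names are hypothetical; only the states and output modes come from the list above.

```swift
import Foundation

enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

enum OutputMode {
    case textBox
    case cursorInjection
}

/// Hypothetical core of TranscriptionManager, without audio or networking.
struct SessionState {
    private(set) var state: RecordingState = .idle
    private(set) var buffer = ""
    var outputMode: OutputMode = .textBox

    /// .idle (or .error) → .recording, with a fresh session buffer.
    mutating func start() {
        guard state != .recording else { return }
        buffer = ""
        state = .recording
    }

    /// Accumulates a text delta; ignored unless recording.
    mutating func receive(delta: String) {
        guard state == .recording else { return }
        buffer += delta
    }

    /// .recording → .idle; returns the session text to be logged.
    @discardableResult
    mutating func stop() -> String? {
        guard state == .recording else { return nil }
        state = .idle
        return buffer
    }

    /// Any state → .error(message).
    mutating func fail(_ message: String) {
        state = .error(message)
    }
}
```

Returning the buffer from `stop()` gives the caller one place to hand the finished session to the logger.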
4. UI Components
MenuBarExtra
- Mic icon in menu bar, changes color when recording
- Dropdown: Start/Stop, Settings, Quit
TranscriptionWindow
- Small floating `NSPanel` (`.floating` window level)
- Shows accumulated transcription text
- Copy button
- Opens when recording starts (in textBox mode)
SettingsView
- API key text field (stored in UserDefaults)
- Global shortcut picker (configurable key combo)
- Output mode toggle (text box vs cursor injection)
- Latency slider (240ms–2400ms, maps to `target_streaming_delay_ms`)
Cursor Injection
- `CGEvent` to simulate keystrokes for each character of received text delta
- Requires Accessibility permission
- Check via `AXIsProcessTrusted()` on first use; prompt if missing
- Falls back to text box mode if permission denied
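A common way to type arbitrary text with `CGEvent` is to attach the character's UTF-16 units to a synthetic key event instead of looking up a key code per character. The sketch below takes that approach as an assumption about how `CursorInjector` might work; `keystrokeUnits(for:)` and `injectAtCursor(_:)` are hypothetical names, and the injection path compiles only on macOS.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
#endif

/// Splits text into one UTF-16 unit array per user-visible character.
func keystrokeUnits(for text: String) -> [[UInt16]] {
    text.map { Array(String($0).utf16) }
}

#if canImport(ApplicationServices)
/// Posts a key-down/key-up pair per character at the current cursor.
/// Returns false when the process lacks Accessibility permission, so the
/// caller can fall back to text box mode.
@discardableResult
func injectAtCursor(_ text: String) -> Bool {
    guard AXIsProcessTrusted() else { return false }
    let source = CGEventSource(stateID: .hidSystemState)
    for units in keystrokeUnits(for: text) {
        for isKeyDown in [true, false] {
            // Virtual key 0 is a placeholder; the Unicode string set below
            // determines the character that is actually typed.
            guard let event = CGEvent(keyboardEventSource: source,
                                      virtualKey: 0,
                                      keyDown: isKeyDown) else { continue }
            event.keyboardSetUnicodeString(stringLength: units.count,
                                           unicodeString: units)
            event.post(tap: .cghidEventTap)
        }
    }
    return true
}
#endif
```

Splitting by `Character` keeps surrogate pairs and combined emoji together in one event rather than posting half a character.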
Global Shortcut
- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown)` for the configured combo
- Default: unset (user must configure)
- Stored in UserDefaults as modifier flags + key code
Transcription Log
- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
- Format: `[ISO-8601 timestamp]\n<transcribed text>\n---\n`
- Written on session end (stop recording or `transcription.done`)
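The entry format and the append step could look like this. It is a minimal sketch: `logEntry(text:at:)` and `appendEntry(_:to:)` are hypothetical helpers, and the timestamp uses `ISO8601DateFormatter` defaults (UTC, second precision), which the spec does not pin down.

```swift
import Foundation

/// Formats one entry as "[ISO-8601 timestamp]\n<text>\n---\n".
func logEntry(text: String, at date: Date = Date()) -> String {
    let formatter = ISO8601DateFormatter()   // default options give UTC, e.g. 1970-01-01T00:00:00Z
    return "[\(formatter.string(from: date))]\n\(text)\n---\n"
}

/// Appends an entry to the file at `url`, creating the file and its
/// parent folder if they do not exist yet.
func appendEntry(_ entry: String, to url: URL) throws {
    let manager = FileManager.default
    try manager.createDirectory(at: url.deletingLastPathComponent(),
                                withIntermediateDirectories: true)
    if !manager.fileExists(atPath: url.path) {
        manager.createFile(atPath: url.path, contents: nil)
    }
    let handle = try FileHandle(forWritingTo: url)
    defer { try? handle.close() }
    try handle.seekToEnd()
    try handle.write(contentsOf: Data(entry.utf8))
}
```

Seeking to the end before each write keeps the file append-only even if another session has written to it since it was last opened.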
Error Handling
| Condition | Behavior |
|---|---|
| No API key set | Open Settings automatically on first launch |
| WebSocket disconnect | Auto-retry once, then show error in menu bar |
| Mic permission denied | System alert → System Settings > Privacy |
| Accessibility permission missing | Detect via `AXIsProcessTrusted()`, prompt the user, fall back to text box |
| Invalid API key (401) | Show error in dropdown, stop recording |
File Structure
MyVoxtral/
├── MyVoxtralApp.swift # App entry, MenuBarExtra
├── Models/
│ ├── TranscriptionManager.swift # Orchestrator, state machine
│ └── AppSettings.swift # UserDefaults wrapper
├── Audio/
│ └── AudioCapture.swift # AVAudioEngine mic capture
├── Network/
│ ├── VoxtralWebSocketClient.swift # WebSocket protocol
│ └── VoxtralMessages.swift # JSON message types (Codable)
├── Views/
│ ├── TranscriptionWindow.swift # Floating text panel
│ ├── SettingsView.swift # Preferences window
│ └── MenuBarView.swift # Dropdown menu content
├── Utilities/
│ ├── CursorInjector.swift # CGEvent keystroke sim
│ ├── GlobalShortcut.swift # Configurable hotkey
│ └── TranscriptionLogger.swift # Append to log file
└── Resources/
└── Assets.xcassets # Menu bar icons