# MyVoxtral — macOS Realtime Transcription App

## Overview

A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes either to a floating text window or directly to the active cursor position via accessibility APIs.

## Target

- macOS 14+
- SwiftUI
- No third-party dependencies

## API Protocol

### Connection

- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
- Auth: Bearer token in the WebSocket upgrade headers
- Model: `voxtral-mini-transcribe-realtime-2602`

### Audio Format

- PCM 16-bit signed little-endian, mono, 16 kHz
- 480 ms chunks (~15,360 bytes raw)
- Sent as base64-encoded strings in JSON messages

### Messages Sent (Client → Server)

```json
{"type": "input_audio.append", "audio": "<base64-encoded chunk>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
```

### Events Received (Server → Client)

| Event type | Key fields | Description |
|---|---|---|
| `session.created` | `.session` | Connection established |
| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
| `transcription.language` | `.audio_language` | Detected language |
| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
| `error` | `.error.message.detail` | Error details |

## Architecture

Four layers; data flows in one direction:

```
Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
```

### 1. AudioCapture

- `AVAudioEngine` with an input node tap
- Converts captured audio to PCM 16-bit LE mono @ 16 kHz
- Yields `Data` chunks every ~480 ms
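The client → server messages described above map naturally onto `Codable` types, and the chunk size follows directly from the audio format. A minimal sketch, assuming illustrative type and function names (`AudioAppendMessage`, `appendMessageJSON` are not part of the spec):

```swift
import Foundation

// Client → server message for streaming audio, discriminated by "type".
// Only the JSON shape comes from the protocol spec; names are illustrative.
struct AudioAppendMessage: Encodable {
    let type = "input_audio.append"
    let audio: String  // base64-encoded PCM s16le chunk
}

// Encode one raw PCM chunk as an input_audio.append JSON message.
func appendMessageJSON(for pcmChunk: Data) throws -> String {
    let message = AudioAppendMessage(audio: pcmChunk.base64EncodedString())
    let data = try JSONEncoder().encode(message)
    return String(data: data, encoding: .utf8)!
}

// One 480 ms chunk of 16 kHz mono s16le audio:
// 16_000 samples/s × 2 bytes/sample × 0.48 s = 15_360 bytes,
// matching the "~15,360 bytes raw" figure above.
let chunkBytes = 16_000 * 2 * 48 / 100
```

The flush and end control messages carry only a `type` field, so they can be sent as literal JSON strings or a one-property struct.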
### 2. VoxtralWebSocketClient

- Uses `URLSessionWebSocketTask` (no dependencies)
- Connects with the Bearer token in the headers
- Sends JSON messages with base64-encoded audio
- Sends flush/end control messages on stop
- Parses incoming JSON by the `type` field
- Exposes text deltas via an `AsyncStream` or a callback

### 3. TranscriptionManager (ObservableObject)

- Owns AudioCapture and VoxtralWebSocketClient
- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
- Accumulates text from deltas into the current session buffer
- On stop: appends the timestamped session to the log file
- Output mode: `.textBox` or `.cursorInjection` (togglable)

### 4. UI Components

#### MenuBarExtra

- Mic icon in the menu bar; changes color while recording
- Dropdown: Start/Stop, Settings, Quit

#### TranscriptionWindow

- Small floating `NSPanel` (`.floating` window level)
- Shows the accumulated transcription text
- Copy button
- Opens when recording starts (in text box mode)

#### SettingsView

- API key text field (stored in UserDefaults)
- Global shortcut picker (configurable key combo)
- Output mode toggle (text box vs. cursor injection)
- Latency slider (240 ms–2400 ms; maps to `target_streaming_delay_ms`)

## Cursor Injection

- Uses `CGEvent` to simulate keystrokes for each character of a received text delta
- Requires the Accessibility permission
- Checked via `AXIsProcessTrusted()` on first use; prompts if missing
- Falls back to text box mode if permission is denied

## Global Shortcut

- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown)` for the configured combo
- Default: unset (the user must configure one)
- Stored in UserDefaults as modifier flags + key code

## Transcription Log

- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
- Format: `[ISO-8601 timestamp]\n<session text>\n---\n`
- Written on session end (stop recording or `transcription.done`)

## Error Handling

| Condition | Behavior |
|---|---|
| No API key set | Open Settings automatically on first launch |
| WebSocket disconnect | Auto-retry once, then show an error in the menu bar |
| Mic permission denied | System alert → System Settings > Privacy |
| Accessibility permission missing | Prompt via `AXIsProcessTrusted()`; fall back to text box |
| Invalid API key (401) | Show error in dropdown, stop recording |

## File Structure

```
MyVoxtral/
├── MyVoxtralApp.swift                 # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift     # Orchestrator, state machine
│   └── AppSettings.swift              # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift             # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift   # WebSocket protocol
│   └── VoxtralMessages.swift          # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift      # Floating text panel
│   ├── SettingsView.swift             # Preferences window
│   └── MenuBarView.swift              # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift           # CGEvent keystroke sim
│   ├── GlobalShortcut.swift           # Configurable hotkey
│   └── TranscriptionLogger.swift      # Append to log file
└── Resources/
    └── Assets.xcassets                # Menu bar icons
```
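The server → client events listed in the API Protocol table can be dispatched by decoding the `type` field first, as VoxtralMessages.swift would. A minimal sketch, assuming illustrative names (`ServerEvent`, `TranscriptionEvent`, `parseEvent` are not from the spec) and modeling only the fields the app consumes:

```swift
import Foundation

// Loose envelope for incoming events; unknown fields are simply ignored.
// Names are illustrative; only the JSON field names come from the spec.
struct ServerEvent: Decodable {
    let type: String
    let text: String?            // transcription.text.delta / transcription.done
    let audio_language: String?  // transcription.language
}

// App-facing event after dispatching on "type".
enum TranscriptionEvent {
    case delta(String)
    case language(String)
    case done(String)
    case other(String)
}

// Decode one incoming WebSocket text frame into an app event.
func parseEvent(_ json: Data) throws -> TranscriptionEvent {
    let event = try JSONDecoder().decode(ServerEvent.self, from: json)
    switch event.type {
    case "transcription.text.delta":
        return .delta(event.text ?? "")
    case "transcription.language":
        return .language(event.audio_language ?? "")
    case "transcription.done":
        return .done(event.text ?? "")
    default:
        return .other(event.type)
    }
}
```

Decoding into one permissive envelope keeps the client tolerant of event types it does not handle (e.g. `transcription.segment`), which fall through to `.other` rather than failing the decode.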