chore: scaffold MyVoxtral macOS menu bar app

2026-04-07 19:15:49 +02:00 · 2026-04-07 19:15:49 +02:00 · 4b1cae1b5f
commit 4b1cae1b5f
4424 changed files with 9392 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-07-myvoxtral-design.md
+++ b/docs/superpowers/specs/2026-04-07-myvoxtral-design.md
@@ -0,0 +1,148 @@
+# MyVoxtral — macOS Realtime Transcription App
+
+## Overview
+
+A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.
+
+## Target
+
+- macOS 14+
+- SwiftUI
+- No third-party dependencies
+
+## API Protocol
+
+### Connection
+
+- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
+- Auth: Bearer token in WebSocket upgrade headers
+- Model: `voxtral-mini-transcribe-realtime-2602`
+
+### Audio Format
+
+- PCM 16-bit signed little-endian, mono, 16kHz
+- 480ms chunks (7,680 samples, 15,360 bytes raw)
+- Sent as base64-encoded strings in JSON messages
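+
+The chunk size follows directly from the format parameters above. A minimal sketch of the arithmetic, assuming a silent placeholder chunk; the constant names are illustrative and not part of the API:
+
+```swift
+import Foundation
+
+// Parameters from the Audio Format list above.
+let sampleRate = 16_000        // Hz
+let bytesPerSample = 2         // 16-bit PCM
+let channels = 1               // mono
+let chunkMilliseconds = 480
+
+// 16,000 samples/s * 0.48 s = 7,680 samples; * 2 bytes = 15,360 bytes.
+let samplesPerChunk = sampleRate * chunkMilliseconds / 1000
+let rawBytesPerChunk = samplesPerChunk * bytesPerSample * channels
+
+// Base64 emits 4 characters for every 3 input bytes, so the JSON payload
+// is about a third larger than the raw audio.
+let chunk = Data(count: rawBytesPerChunk)   // zero-filled (silent) placeholder
+let encoded = chunk.base64EncodedString()
+
+print(samplesPerChunk, rawBytesPerChunk, encoded.count)
+// prints "7680 15360 20480"
+```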
+
+### Messages Sent (Client → Server)
+
+```json
+{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
+{"type": "input_audio.flush"}
+{"type": "input_audio.end"}
+{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
+```
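+
+One way to produce the first three messages in Swift is sketched below. `ClientMessage` and `jsonString()` are hypothetical names; only the `type` strings and the `audio` field come from the JSON above. `session.update` is left out because its nested payload would need a richer value type.
+
+```swift
+import Foundation
+
+// Hypothetical type; the wire format is taken from the JSON messages above.
+enum ClientMessage {
+    case append(Data)   // raw PCM bytes, base64-encoded on the wire
+    case flush
+    case end
+
+    // Serializes the message to a single JSON text frame.
+    func jsonString() throws -> String {
+        let object: [String: String]
+        switch self {
+        case .append(let pcm):
+            object = ["type": "input_audio.append", "audio": pcm.base64EncodedString()]
+        case .flush:
+            object = ["type": "input_audio.flush"]
+        case .end:
+            object = ["type": "input_audio.end"]
+        }
+        // sortedKeys keeps output stable; base64 may contain "/", so skip escaping it.
+        let data = try JSONSerialization.data(
+            withJSONObject: object,
+            options: [.sortedKeys, .withoutEscapingSlashes]
+        )
+        return String(decoding: data, as: UTF8.self)
+    }
+}
+```
+
+For example, `try ClientMessage.flush.jsonString()` yields `{"type":"input_audio.flush"}`.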
+
+### Events Received (Server → Client)
+
+| Event type | Key fields | Description |
+|---|---|---|
+| `session.created` | `.session` | Connection established |
+| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
+| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
+| `transcription.language` | `.audio_language` | Detected language |
+| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
+| `error` | `.error.message.detail` | Error details |
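+
+Dispatching on the `type` field could look like the sketch below. `ServerEvent`, `Envelope`, and `parseServerEvent` are hypothetical names, and only the `.text` field is decoded; the remaining fields in the table are ignored here for brevity.
+
+```swift
+import Foundation
+
+// Hypothetical event model; the "type" strings come from the table above.
+enum ServerEvent: Equatable {
+    case sessionCreated
+    case textDelta(String)
+    case done(text: String)
+    case other(type: String)   // segment, language, error, and unknown events
+}
+
+// Decodes only the fields this sketch needs; unknown keys are ignored.
+private struct Envelope: Decodable {
+    let type: String
+    let text: String?
+}
+
+func parseServerEvent(_ json: String) throws -> ServerEvent {
+    let envelope = try JSONDecoder().decode(Envelope.self, from: Data(json.utf8))
+    switch (envelope.type, envelope.text) {
+    case ("session.created", _):
+        return .sessionCreated
+    case ("transcription.text.delta", let text?):
+        return .textDelta(text)
+    case ("transcription.done", let text?):
+        return .done(text: text)
+    default:
+        return .other(type: envelope.type)
+    }
+}
+```
+
+Routing unrecognized types to `.other` keeps the client tolerant of events it does not handle yet.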
+
+## Architecture
+
+Four layers; data flows in one direction:
+
+```
+Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
+```
+
+### 1. AudioCapture
+
+- `AVAudioEngine` with input node tap
+- Converts captured audio to PCM 16-bit LE mono @ 16kHz
+- Yields `Data` chunks every ~480ms
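+
+The tap delivers buffers of arbitrary size, so something has to regroup the converted bytes into 480ms chunks. A minimal sketch of that step, assuming the audio has already been converted to 16-bit mono PCM; `PCMChunker` is a hypothetical helper, not part of AVFoundation:
+
+```swift
+import Foundation
+
+// Hypothetical helper: buffers converted PCM bytes and emits fixed-size chunks.
+struct PCMChunker {
+    let chunkSize: Int            // 15,360 bytes = 480 ms at 16 kHz, 16-bit mono
+    private var pending = Data()
+
+    init(chunkSize: Int = 15_360) { self.chunkSize = chunkSize }
+
+    // Appends newly captured bytes and returns every complete chunk now available.
+    mutating func append(_ bytes: Data) -> [Data] {
+        pending.append(bytes)
+        var chunks: [Data] = []
+        while pending.count >= chunkSize {
+            chunks.append(Data(pending.prefix(chunkSize)))
+            // Copy the remainder so its indices start at zero again.
+            pending = Data(pending.dropFirst(chunkSize))
+        }
+        return chunks
+    }
+
+    // Returns the leftover bytes (possibly empty) when recording stops.
+    mutating func drain() -> Data {
+        defer { pending = Data() }
+        return pending
+    }
+}
+```
+
+Calling `drain()` on stop lets the final partial chunk be sent before the flush/end messages.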
+
+### 2. VoxtralWebSocketClient
+
+- Uses `URLSessionWebSocketTask` (no dependencies)
+- Connects with Bearer token in headers
+- Sends JSON messages with base64-encoded audio
+- Sends flush/end control messages on stop
+- Parses incoming JSON by `type` field
+- Exposes text deltas via `AsyncStream` or a callback
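+
+A rough sketch of the connection step using only Foundation. The function names are hypothetical, and how the model name is passed to the server is not specified in this document, so it is omitted here.
+
+```swift
+import Foundation
+
+// Builds the upgrade request for the endpoint listed under "Connection".
+func makeRealtimeRequest(apiKey: String) -> URLRequest {
+    let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!
+    var request = URLRequest(url: url)
+    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
+    return request
+}
+
+// Hypothetical round trip: open the socket, send one text frame, read one reply.
+func sendAndReceiveOnce(apiKey: String, json: String) async throws -> String? {
+    let task = URLSession.shared.webSocketTask(with: makeRealtimeRequest(apiKey: apiKey))
+    task.resume()
+    try await task.send(.string(json))
+    let message = try await task.receive()
+    task.cancel(with: .normalClosure, reason: nil)
+    // The protocol above uses JSON text frames; binary frames are ignored here.
+    if case .string(let text) = message { return text }
+    return nil
+}
+```
+
+A real client would keep the task open and call `receive()` in a loop, forwarding each parsed event to the stream or callback.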
+
+### 3. TranscriptionManager (ObservableObject)
+
+- Owns AudioCapture and VoxtralWebSocketClient
+- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
+- Accumulates text from deltas into current session buffer
+- On stop: appends timestamped session to log file
+- Output mode: `.textBox` or `.cursorInjection` (togglable)
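+
+The state machine can be written as a pure transition function, which keeps it easy to test apart from audio and networking. This is a sketch with hypothetical names; allowing a restart directly from `.error` is an assumption, since the transition list above does not say how the error state is left.
+
+```swift
+// States from the list above.
+enum RecordingState: Equatable {
+    case idle
+    case recording
+    case error(String)
+}
+
+// Hypothetical events that drive the transitions.
+enum RecordingEvent {
+    case start
+    case stop
+    case failure(String)
+}
+
+func nextState(from state: RecordingState, on event: RecordingEvent) -> RecordingState {
+    switch (state, event) {
+    case (.idle, .start), (.error, .start):
+        return .recording              // assumption: start also clears an error
+    case (.recording, .stop):
+        return .idle
+    case (_, .failure(let message)):
+        return .error(message)
+    default:
+        return state                   // ignore events that do not apply
+    }
+}
+```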
+
+### 4. UI Components
+
+#### MenuBarExtra
+- Mic icon in menu bar, changes color when recording
+- Dropdown: Start/Stop, Settings, Quit
+
+#### TranscriptionWindow
+- Small floating `NSPanel` (`.floating` window level)
+- Shows accumulated transcription text
+- Copy button
+- Opens when recording starts (in textBox mode)
+
+#### SettingsView
+- API key text field (stored in UserDefaults)
+- Global shortcut picker (configurable key combo)
+- Output mode toggle (text box vs cursor injection)
+- Latency slider (240ms–2400ms, maps to `target_streaming_delay_ms`)
+
+## Cursor Injection
+
+- Uses `CGEvent` to simulate keystrokes for each character of a received text delta
+- Requires Accessibility permission
+- Check via `AXIsProcessTrusted()` on first use; if missing, prompt with `AXIsProcessTrustedWithOptions` and the `kAXTrustedCheckOptionPrompt` option
+- Falls back to text box mode if permission denied
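+
+A minimal sketch of the injection path described above. `injectText` is a hypothetical name; the key code `0` is a placeholder because the Unicode string attached to the event determines what is typed.
+
+```swift
+import ApplicationServices
+
+// Types `text` at the current cursor position by posting synthetic key events.
+// Returns false without posting anything when the process is not trusted,
+// so the caller can fall back to text box mode.
+func injectText(_ text: String) -> Bool {
+    guard AXIsProcessTrusted() else { return false }
+    for character in text {
+        let units = Array(String(character).utf16)
+        for keyDown in [true, false] {
+            guard let event = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: keyDown) else {
+                return false
+            }
+            // Attach the character's UTF-16 units so it is typed regardless of key code.
+            event.keyboardSetUnicodeString(stringLength: units.count, unicodeString: units)
+            event.post(tap: .cghidEventTap)
+        }
+    }
+    return true
+}
+```
+
+Posting one key-down/key-up pair per character keeps multi-unit characters such as emoji intact, at the cost of more events per delta.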
+
+## Global Shortcut
+
+- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown, handler:)` for the configured combo (key events are only delivered once the app is trusted for accessibility)
+- Default: unset (user must configure)
+- Stored in UserDefaults as modifier flags + key code
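+
+Matching and persistence might look like the sketch below. `Shortcut`, `installMonitor`, and the two UserDefaults keys are hypothetical; the section only states that modifier flags and a key code are stored.
+
+```swift
+import AppKit
+
+struct Shortcut: Equatable {
+    let keyCode: UInt16
+    let modifiers: NSEvent.ModifierFlags
+
+    func matches(keyCode: UInt16, modifiers: NSEvent.ModifierFlags) -> Bool {
+        // Keep only the device-independent modifier bits before comparing.
+        let relevant = modifiers.intersection(.deviceIndependentFlagsMask)
+        return keyCode == self.keyCode && relevant == self.modifiers
+    }
+
+    func save(to defaults: UserDefaults) {
+        defaults.set(Int(keyCode), forKey: "shortcutKeyCode")          // hypothetical key
+        defaults.set(Int(modifiers.rawValue), forKey: "shortcutModifiers")
+    }
+
+    // Returns nil when no shortcut has been configured (the default).
+    static func load(from defaults: UserDefaults) -> Shortcut? {
+        guard defaults.object(forKey: "shortcutKeyCode") != nil else { return nil }
+        let code = UInt16(truncatingIfNeeded: defaults.integer(forKey: "shortcutKeyCode"))
+        let raw = UInt(bitPattern: defaults.integer(forKey: "shortcutModifiers"))
+        return Shortcut(keyCode: code, modifiers: NSEvent.ModifierFlags(rawValue: raw))
+    }
+}
+
+// Installs the global monitor; keep the returned token to remove it later.
+func installMonitor(for shortcut: Shortcut, onTrigger: @escaping () -> Void) -> Any? {
+    NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { event in
+        if shortcut.matches(keyCode: event.keyCode, modifiers: event.modifierFlags) {
+            onTrigger()
+        }
+    }
+}
+```
+
+A global monitor does not see events sent to the app's own windows, so a local monitor may also be needed while Settings is frontmost.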
+
+## Transcription Log
+
+- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
+- Format: `[ISO-8601 timestamp]\n<transcribed text>\n---\n`
+- Written on session end (stop recording or `transcription.done`)
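+
+The entry format and append step can be sketched as follows. Function names are hypothetical, and the timestamp uses `ISO8601DateFormatter` defaults (UTC, second precision), which is an assumption because the spec only says "ISO-8601".
+
+```swift
+import Foundation
+
+// Formats one entry as "[ISO-8601 timestamp]\n<text>\n---\n".
+func logEntry(text: String, date: Date) -> String {
+    let timestamp = ISO8601DateFormatter().string(from: date)
+    return "[\(timestamp)]\n\(text)\n---\n"
+}
+
+// Appends the entry, creating the directory and file on first use.
+func appendToLog(_ entry: String, at url: URL) throws {
+    try FileManager.default.createDirectory(
+        at: url.deletingLastPathComponent(),
+        withIntermediateDirectories: true
+    )
+    guard FileManager.default.fileExists(atPath: url.path) else {
+        try entry.write(to: url, atomically: true, encoding: .utf8)
+        return
+    }
+    let handle = try FileHandle(forWritingTo: url)
+    defer { try? handle.close() }
+    _ = try handle.seekToEnd()
+    try handle.write(contentsOf: Data(entry.utf8))
+}
+```
+
+For example, `logEntry(text: "hello", date: Date(timeIntervalSince1970: 0))` returns `"[1970-01-01T00:00:00Z]\nhello\n---\n"`.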
+
+## Error Handling
+
+| Condition | Behavior |
+|---|---|
+| No API key set | Open Settings automatically on first launch |
+| WebSocket disconnect | Auto-retry once, then show error in menu bar |
+| Mic permission denied | System alert → System Settings > Privacy & Security > Microphone |
+| Accessibility permission missing | Check via `AXIsProcessTrusted()`, prompt via `AXIsProcessTrustedWithOptions`, fall back to text box |
+| Invalid API key (401) | Show error in dropdown, stop recording |
+
+## File Structure
+
+```
+MyVoxtral/
+├── MyVoxtralApp.swift              # App entry, MenuBarExtra
+├── Models/
+│   ├── TranscriptionManager.swift  # Orchestrator, state machine
+│   └── AppSettings.swift           # UserDefaults wrapper
+├── Audio/
+│   └── AudioCapture.swift          # AVAudioEngine mic capture
+├── Network/
+│   ├── VoxtralWebSocketClient.swift # WebSocket protocol
+│   └── VoxtralMessages.swift       # JSON message types (Codable)
+├── Views/
+│   ├── TranscriptionWindow.swift   # Floating text panel
+│   ├── SettingsView.swift          # Preferences window
+│   └── MenuBarView.swift           # Dropdown menu content
+├── Utilities/
+│   ├── CursorInjector.swift        # CGEvent keystroke sim
+│   ├── GlobalShortcut.swift        # Configurable hotkey
+│   └── TranscriptionLogger.swift   # Append to log file
+└── Resources/
+    └── Assets.xcassets              # Menu bar icons
+```