myVoxtral/docs/superpowers/specs/2026-04-07-myvoxtral-design.md
2026-04-07 19:15:49 +02:00

# MyVoxtral — macOS Realtime Transcription App
## Overview
A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.
## Target
- macOS 14+
- SwiftUI
- No third-party dependencies
## API Protocol
### Connection
- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
- Auth: Bearer token in WebSocket upgrade headers
- Model: `voxtral-mini-transcribe-realtime-2602`
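
A minimal sketch of the connection setup described above, assuming the Bearer token goes in a standard `Authorization` header. The helper name `makeRealtimeRequest` is hypothetical. How the model name is passed to the server is not stated in this spec, so the sketch leaves it out.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on non-Apple platforms
#endif

// Endpoint taken from the Connection section.
let realtimeEndpoint = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!

/// Builds the WebSocket upgrade request carrying the API key as a Bearer token.
/// `makeRealtimeRequest` is an illustrative name, not part of any SDK.
func makeRealtimeRequest(apiKey: String) -> URLRequest {
    var request = URLRequest(url: realtimeEndpoint)
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}
```

On macOS the resulting request can be handed to `URLSession`'s `webSocketTask(with:)` to obtain the `URLSessionWebSocketTask` used by the client layer.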
### Audio Format
- PCM 16-bit signed little-endian, mono, 16kHz
- 480ms chunks (~15,360 bytes raw)
- Sent as base64-encoded strings in JSON messages
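
The chunk size quoted above follows directly from the audio parameters. The arithmetic below is a worked check, including the size of the base64 payload that ends up in each JSON message; the constant names are illustrative.

```swift
import Foundation

// Audio parameters taken from the format described above.
let sampleRate = 16_000          // samples per second
let bytesPerSample = 2           // 16-bit signed PCM
let channels = 1                 // mono
let chunkMilliseconds = 480

// Raw PCM bytes in one chunk: 16,000 * 480 / 1000 samples, 2 bytes each.
let samplesPerChunk = sampleRate * chunkMilliseconds / 1000
let bytesPerChunk = samplesPerChunk * bytesPerSample * channels

// Base64 expands every 3 input bytes to 4 output characters (with padding).
let base64Length = (bytesPerChunk + 2) / 3 * 4

// Cross-check the estimate against Foundation's encoder using a silent chunk.
let silentChunk = Data(count: bytesPerChunk)
let encodedLength = silentChunk.base64EncodedString().count

print(samplesPerChunk, bytesPerChunk, base64Length, encodedLength)
// prints "7680 15360 20480 20480"
```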
### Messages Sent (Client → Server)
```json
{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
```
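
One way to produce these four messages from Swift is a small `Encodable` enum. This is a sketch only: the type name `ClientMessage` and the helper `jsonString(for:)` are hypothetical, and the JSON keys are copied from the examples above.

```swift
import Foundation

/// Hypothetical Swift mirror of the four client messages listed above.
enum ClientMessage: Encodable {
    case appendAudio(Data)
    case flush
    case end
    case sessionUpdate(sampleRate: Int, targetStreamingDelayMs: Int)

    private enum CodingKeys: String, CodingKey { case type, audio, session }
    private enum SessionKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
    private enum FormatKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .appendAudio(let pcm):
            try container.encode("input_audio.append", forKey: .type)
            // Raw PCM bytes travel as a base64 string.
            try container.encode(pcm.base64EncodedString(), forKey: .audio)
        case .flush:
            try container.encode("input_audio.flush", forKey: .type)
        case .end:
            try container.encode("input_audio.end", forKey: .type)
        case .sessionUpdate(let sampleRate, let delay):
            try container.encode("session.update", forKey: .type)
            var session = container.nestedContainer(keyedBy: SessionKeys.self, forKey: .session)
            try session.encode(delay, forKey: .targetStreamingDelayMs)
            var format = session.nestedContainer(keyedBy: FormatKeys.self, forKey: .audioFormat)
            try format.encode("pcm_s16le", forKey: .encoding)
            try format.encode(sampleRate, forKey: .sampleRate)
        }
    }
}

/// Serializes a message to the JSON text sent over the socket.
/// Keys are sorted only to make the output deterministic.
func jsonString(for message: ClientMessage) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(message), as: UTF8.self)
}
```

For example, `try jsonString(for: .flush)` yields `{"type":"input_audio.flush"}`.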
### Events Received (Server → Client)
| Event type | Key fields | Description |
|---|---|---|
| `session.created` | `.session` | Connection established |
| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
| `transcription.language` | `.audio_language` | Detected language |
| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
| `error` | `.error.message.detail` | Error details |
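
Dispatching on the `type` field can be sketched with a `Decodable` enum. The field names come from the table above. The assumption that each decoded field is a plain string, and the names `ServerEvent` and `parseEvent`, are illustrative. Event types this sketch does not model, such as `transcription.segment`, fall through to `.other`.

```swift
import Foundation

/// Hypothetical model of the server events the app acts on.
enum ServerEvent: Equatable {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(text: String)
    case error(String)
    case other(type: String)   // e.g. transcription.segment, not modeled here
}

extension ServerEvent: Decodable {
    private enum Keys: String, CodingKey {
        case type, text, error, message, detail
        case audioLanguage = "audio_language"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: Keys.self)
        let type = try container.decode(String.self, forKey: .type)
        switch type {
        case "session.created":
            self = .sessionCreated
        case "transcription.text.delta":
            self = .textDelta(try container.decode(String.self, forKey: .text))
        case "transcription.language":
            self = .language(try container.decode(String.self, forKey: .audioLanguage))
        case "transcription.done":
            self = .done(text: try container.decode(String.self, forKey: .text))
        case "error":
            // Follows the `.error.message.detail` path from the table.
            let error = try container.nestedContainer(keyedBy: Keys.self, forKey: .error)
            let message = try error.nestedContainer(keyedBy: Keys.self, forKey: .message)
            self = .error(try message.decode(String.self, forKey: .detail))
        default:
            self = .other(type: type)
        }
    }
}

/// Parses one text frame received from the socket.
func parseEvent(_ json: String) throws -> ServerEvent {
    try JSONDecoder().decode(ServerEvent.self, from: Data(json.utf8))
}
```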
## Architecture
Four layers; data flows in one direction:
```
Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
```
### 1. AudioCapture
- `AVAudioEngine` with input node tap
- Converts captured audio to PCM 16-bit LE mono @ 16kHz
- Yields `Data` chunks every ~480ms
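
The sample conversion and chunking steps can be sketched in plain Swift, independent of `AVAudioEngine`. The sketch assumes the tap delivers normalized `Float` samples that have already been resampled to 16 kHz mono; resampling itself is not shown. `pcm16LittleEndian(from:)` and `PCMChunker` are hypothetical names.

```swift
import Foundation

/// Converts normalized Float samples (-1.0...1.0) to 16-bit signed little-endian PCM.
func pcm16LittleEndian(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        // Treat NaN as silence and clamp out-of-range values before scaling.
        let safe = sample.isNaN ? 0 : max(-1.0, min(1.0, sample))
        let value = Int16((safe * Float(Int16.max)).rounded())
        let bits = UInt16(bitPattern: value)
        data.append(UInt8(truncatingIfNeeded: bits))        // low byte first
        data.append(UInt8(truncatingIfNeeded: bits >> 8))   // then high byte
    }
    return data
}

/// Accumulates PCM bytes and hands back fixed-size chunks (15,360 bytes = 480 ms).
struct PCMChunker {
    let chunkSize: Int
    private var buffer = Data()

    init(chunkSize: Int = 15_360) { self.chunkSize = chunkSize }

    /// Appends new bytes and returns every complete chunk now available.
    mutating func append(_ data: Data) -> [Data] {
        buffer.append(data)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer.prefix(chunkSize)))
            buffer = Data(buffer.dropFirst(chunkSize))
        }
        return chunks
    }

    /// Returns whatever is left (a partial chunk) and empties the buffer.
    mutating func drain() -> Data {
        defer { buffer = Data() }
        return buffer
    }
}
```

Calling `drain()` on stop lets the final partial chunk be sent before the flush and end messages.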
### 2. VoxtralWebSocketClient
- Uses `URLSessionWebSocketTask` (no dependencies)
- Connects with Bearer token in headers
- Sends JSON messages with base64-encoded audio
- Sends flush/end control messages on stop
- Parses incoming JSON by `type` field
- Exposes text deltas via AsyncStream or callback
### 3. TranscriptionManager (ObservableObject)
- Owns AudioCapture and VoxtralWebSocketClient
- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
- Accumulates text from deltas into current session buffer
- On stop: appends timestamped session to log file
- Output mode: `.textBox` or `.cursorInjection` (togglable)
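
The state machine and delta accumulation can be sketched as a plain value type. The `ObservableObject` wrapper, audio capture, and socket ownership are left out, and the names `RecordingState` and `TranscriptionSession` are hypothetical.

```swift
/// States named in the spec: idle, recording, or an error with a message.
enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

enum OutputMode { case textBox, cursorInjection }

/// Hypothetical core of the manager: state transitions plus the session buffer.
struct TranscriptionSession {
    private(set) var state: RecordingState = .idle
    private(set) var buffer = ""
    var outputMode: OutputMode = .textBox

    /// idle (or error) -> recording; starts with an empty buffer.
    mutating func start() {
        guard state != .recording else { return }
        buffer = ""
        state = .recording
    }

    /// Appends an incremental text delta; ignored unless recording.
    mutating func receive(delta: String) {
        guard state == .recording else { return }
        buffer += delta
    }

    /// recording -> idle; returns the accumulated text so the caller can log it.
    mutating func stop() -> String? {
        guard state == .recording else { return nil }
        state = .idle
        return buffer
    }

    /// Any state -> error.
    mutating func fail(_ message: String) {
        state = .error(message)
    }
}
```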
### 4. UI Components
#### MenuBarExtra
- Mic icon in menu bar, changes color when recording
- Dropdown: Start/Stop, Settings, Quit
#### TranscriptionWindow
- Small floating `NSPanel` (`.floating` window level)
- Shows accumulated transcription text
- Copy button
- Opens when recording starts (in textBox mode)
#### SettingsView
- API key text field (stored in UserDefaults)
- Global shortcut picker (configurable key combo)
- Output mode toggle (text box vs cursor injection)
- Latency slider (240 ms to 2400 ms, maps to `target_streaming_delay_ms`)
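
Because the slider value is sent to the server as `target_streaming_delay_ms`, it helps to clamp it to the slider's bounds in one place. A minimal sketch, with the hypothetical helper name `targetStreamingDelayMs(fromSlider:)`:

```swift
// Bounds of the latency slider, in milliseconds.
let delayRange = 240...2400

/// Rounds a slider value to whole milliseconds and clamps it to the allowed range.
/// Non-finite input falls back to the lower bound.
func targetStreamingDelayMs(fromSlider value: Double) -> Int {
    guard value.isFinite else { return delayRange.lowerBound }
    let clamped = min(max(value, Double(delayRange.lowerBound)), Double(delayRange.upperBound))
    return Int(clamped.rounded())
}
```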
## Cursor Injection
- `CGEvent` to simulate keystrokes for each character of received text delta
- Requires Accessibility permission
- Check via `AXIsProcessTrusted()` on first use; prompt if missing
- Falls back to text box mode if permission denied
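
A sketch of per-character injection, assuming each character is delivered as one key-down/key-up pair whose Unicode payload is set with `keyboardSetUnicodeString`. The function names are hypothetical. The `CGEvent` part compiles only on platforms that ship ApplicationServices, and returning `false` signals the caller to fall back to text box mode.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
#endif

/// Splits text into one UTF-16 payload per user-visible character, so that
/// surrogate pairs (e.g. emoji) are never cut in half.
func keystrokePayloads(for text: String) -> [[UInt16]] {
    text.map { Array(String($0).utf16) }
}

#if canImport(ApplicationServices)
/// Types `text` at the current cursor position.
/// Returns false when Accessibility permission is missing or an event cannot be created.
func injectAtCursor(_ text: String) -> Bool {
    guard AXIsProcessTrusted() else { return false }
    for units in keystrokePayloads(for: text) {
        for keyDown in [true, false] {
            // Virtual key 0 is a placeholder; the Unicode payload decides what is typed.
            guard let event = CGEvent(keyboardEventSource: nil, virtualKey: 0, keyDown: keyDown) else {
                return false
            }
            event.keyboardSetUnicodeString(stringLength: units.count, unicodeString: units)
            event.post(tap: .cghidEventTap)
        }
    }
    return true
}
#endif
```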
## Global Shortcut
- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown)` for the configured combo
- Default: unset (user must configure)
- Stored in UserDefaults as modifier flags + key code
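
One possible storage shape is a small `Codable` struct holding the two raw values. Encoding it as JSON `Data` under a single key is an assumption of this sketch, as are the names `StoredShortcut` and `"globalShortcut"`. The mask value is assumed to match `NSEvent.ModifierFlags.deviceIndependentFlagsMask`.

```swift
import Foundation

/// Hypothetical persisted form of the shortcut: raw modifier flags plus key code.
struct StoredShortcut: Codable, Equatable {
    var modifierFlags: UInt   // raw value of the event's modifier flags
    var keyCode: UInt16       // hardware key code of the event

    /// Keeps only device-independent modifier bits when comparing.
    static let deviceIndependentMask: UInt = 0xFFFF_0000

    /// True when a key-down event with these raw values matches the stored combo.
    func matches(modifierFlags eventFlags: UInt, keyCode eventKeyCode: UInt16) -> Bool {
        let mask = StoredShortcut.deviceIndependentMask
        return (eventFlags & mask) == (modifierFlags & mask) && eventKeyCode == keyCode
    }

    /// Saves the shortcut as JSON data under `key`.
    func save(to defaults: UserDefaults, key: String = "globalShortcut") throws {
        defaults.set(try JSONEncoder().encode(self), forKey: key)
    }

    /// Loads the shortcut, or returns nil when none is configured (the default).
    static func load(from defaults: UserDefaults, key: String = "globalShortcut") -> StoredShortcut? {
        guard let data = defaults.data(forKey: key) else { return nil }
        return try? JSONDecoder().decode(StoredShortcut.self, from: data)
    }
}
```

Inside the global monitor's handler, the event's `modifierFlags.rawValue` and `keyCode` would be passed to `matches(modifierFlags:keyCode:)`.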
## Transcription Log
- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
- Format: `[ISO-8601 timestamp]\n<transcribed text>\n---\n`
- Written on session end (stop recording or `transcription.done`)
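
The entry format and the append step can be sketched with Foundation alone. The function names are hypothetical, and the timestamp uses `ISO8601DateFormatter` with its default options (UTC, second precision), which is one reasonable reading of "ISO-8601 timestamp".

```swift
import Foundation

/// Formats one log entry as "[timestamp]\n<text>\n---\n".
func logEntry(text: String, date: Date = Date()) -> String {
    let formatter = ISO8601DateFormatter()
    return "[\(formatter.string(from: date))]\n\(text)\n---\n"
}

/// Default location: ~/Library/Application Support/MyVoxtral/transcription.log on macOS.
func defaultLogURL() -> URL {
    FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("MyVoxtral")
        .appendingPathComponent("transcription.log")
}

/// Appends an entry to the log, creating the directory and file on first use.
func appendToLog(_ entry: String, at url: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: url.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    guard fileManager.fileExists(atPath: url.path) else {
        try Data(entry.utf8).write(to: url)
        return
    }
    let handle = try FileHandle(forWritingTo: url)
    defer { try? handle.close() }
    try handle.seekToEnd()
    try handle.write(contentsOf: Data(entry.utf8))
}
```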
## Error Handling
| Condition | Behavior |
|---|---|
| No API key set | Open Settings automatically on first launch |
| WebSocket disconnect | Auto-retry once, then show error in menu bar |
| Mic permission denied | System alert → System Settings > Privacy |
| Accessibility permission missing | Detect via `AXIsProcessTrusted()`, prompt the user, fall back to text box |
| Invalid API key (401) | Show error in dropdown, stop recording |
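
The two network rows of this table can be captured in a small policy type: a disconnect earns one reconnect attempt, while a 401 stops recording immediately. All names below are hypothetical.

```swift
/// Failure kinds distinguished by the table above.
enum ConnectionFailure { case disconnected, unauthorized }

/// What the manager should do next.
enum RecoveryAction: Equatable {
    case reconnect
    case showErrorInMenuBar
    case showErrorAndStopRecording
}

/// Hypothetical retry policy: retry a dropped socket once, never retry a bad key.
struct ReconnectPolicy {
    let maxRetries: Int
    private(set) var retriesUsed = 0

    init(maxRetries: Int = 1) { self.maxRetries = maxRetries }

    mutating func action(for failure: ConnectionFailure) -> RecoveryAction {
        switch failure {
        case .unauthorized:
            // Retrying with the same invalid key cannot succeed.
            return .showErrorAndStopRecording
        case .disconnected:
            guard retriesUsed < maxRetries else { return .showErrorInMenuBar }
            retriesUsed += 1
            return .reconnect
        }
    }

    /// Call after a successful connection so later drops get a fresh retry.
    mutating func reset() { retriesUsed = 0 }
}
```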
## File Structure
```
MyVoxtral/
├── MyVoxtralApp.swift                  # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift      # Orchestrator, state machine
│   └── AppSettings.swift               # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift              # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift    # WebSocket protocol
│   └── VoxtralMessages.swift           # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift       # Floating text panel
│   ├── SettingsView.swift              # Preferences window
│   └── MenuBarView.swift               # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift            # CGEvent keystroke sim
│   ├── GlobalShortcut.swift            # Configurable hotkey
│   └── TranscriptionLogger.swift       # Append to log file
└── Resources/
    └── Assets.xcassets                 # Menu bar icons
```