chore: scaffold MyVoxtral macOS menu bar app
commit 4b1cae1b5f
4424 changed files with 9392 additions and 0 deletions
148
docs/superpowers/specs/2026-04-07-myvoxtral-design.md
Normal file
@@ -0,0 +1,148 @@
# MyVoxtral — macOS Realtime Transcription App

## Overview

A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.

## Target

- macOS 14+
- SwiftUI
- No third-party dependencies

## API Protocol

### Connection

- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
- Auth: Bearer token in WebSocket upgrade headers
- Model: `voxtral-mini-transcribe-realtime-2602`
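
As a rough illustration of the connection details above, the upgrade request could be built as follows. The function name `makeRealtimeRequest` is hypothetical, and because this spec does not say how the model identifier is conveyed to the server, the sketch leaves it out rather than guessing.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms
#endif

/// Builds the WebSocket upgrade request carrying the Bearer token.
/// `makeRealtimeRequest` is an illustrative name, not part of the spec.
func makeRealtimeRequest(apiKey: String) -> URLRequest {
    // Endpoint taken verbatim from the Connection section.
    let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!
    var request = URLRequest(url: url)
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}
```

On macOS the resulting request can be passed to `URLSession.webSocketTask(with:)` to obtain a `URLSessionWebSocketTask`, which is started with `resume()`.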

### Audio Format

- PCM 16-bit signed little-endian, mono, 16 kHz
- 480 ms chunks (15,360 bytes raw: 16,000 samples/s × 2 bytes × 0.48 s)
- Sent as base64-encoded strings in JSON messages
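
The chunk size follows directly from the format parameters. The short sketch below works through the arithmetic and also shows how large each chunk becomes after base64 encoding; the constant names are chosen here for illustration only.

```swift
import Foundation

let sampleRate = 16_000        // samples per second
let bytesPerSample = 2         // 16-bit signed PCM
let channels = 1               // mono
let chunkMilliseconds = 480

// Raw bytes in one chunk of audio.
let chunkBytes = sampleRate * bytesPerSample * channels * chunkMilliseconds / 1000

// base64 emits 4 characters for every 3 input bytes (rounded up, with padding).
let base64Length = ((chunkBytes + 2) / 3) * 4

// Cross-check the formula against Foundation's encoder using a silent chunk.
let encoded = Data(count: chunkBytes).base64EncodedString()
print(chunkBytes, base64Length, encoded.count)
```

Each JSON `input_audio.append` message therefore carries roughly a third more bytes than the raw audio it wraps.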

### Messages Sent (Client → Server)

```json
{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
```
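
One possible way to model these four messages in `VoxtralMessages.swift` is a single `Encodable` enum. This is a sketch only: the type names (`ClientMessage`, `SessionConfig`, `AudioFormat`) and the helper `jsonString(for:)` are assumptions, while the JSON keys and `type` strings are taken from the block above.

```swift
import Foundation

struct AudioFormat: Encodable {
    let encoding: String
    let sampleRate: Int
    enum CodingKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }
}

struct SessionConfig: Encodable {
    let audioFormat: AudioFormat
    let targetStreamingDelayMs: Int
    enum CodingKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
}

/// Client-to-server messages (hypothetical type name).
enum ClientMessage: Encodable {
    case append(audio: Data)
    case flush
    case end
    case sessionUpdate(SessionConfig)

    private enum Keys: String, CodingKey { case type, audio, session }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: Keys.self)
        switch self {
        case .append(let audio):
            try container.encode("input_audio.append", forKey: .type)
            // Raw PCM bytes are base64-encoded before being embedded in JSON.
            try container.encode(audio.base64EncodedString(), forKey: .audio)
        case .flush:
            try container.encode("input_audio.flush", forKey: .type)
        case .end:
            try container.encode("input_audio.end", forKey: .type)
        case .sessionUpdate(let session):
            try container.encode("session.update", forKey: .type)
            try container.encode(session, forKey: .session)
        }
    }
}

/// Serializes a message to a JSON string; sorted keys keep the output stable.
func jsonString(for message: ClientMessage) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(message), as: UTF8.self)
}
```

The string returned by `jsonString(for:)` can be sent as a text frame, for example via `URLSessionWebSocketTask.send(.string(...))`.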

### Events Received (Server → Client)

| Event type | Key fields | Description |
|---|---|---|
| `session.created` | `.session` | Connection established |
| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
| `transcription.language` | `.audio_language` | Detected language |
| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
| `error` | `.error.message.detail` | Error details |
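
Dispatching on the `type` field might look like the sketch below. The `ServerEvent` enum and `parseServerEvent` function are illustrative names, only a subset of fields is extracted, and the assumption that `.error.message.detail` is a string nested in two objects is read off the table rather than confirmed against the API.

```swift
import Foundation

/// Events the app reacts to; other types (for example `transcription.segment`)
/// are passed through by name only in this sketch.
enum ServerEvent: Equatable {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(text: String)
    case error(String)
    case other(type: String)
}

/// Parses one incoming JSON text frame; returns nil if it is not a JSON
/// object with a string `type` field.
func parseServerEvent(_ json: String) -> ServerEvent? {
    guard let object = try? JSONSerialization.jsonObject(with: Data(json.utf8)),
          let dict = object as? [String: Any],
          let type = dict["type"] as? String else { return nil }

    switch type {
    case "session.created":
        return .sessionCreated
    case "transcription.text.delta":
        return .textDelta(dict["text"] as? String ?? "")
    case "transcription.language":
        return .language(dict["audio_language"] as? String ?? "")
    case "transcription.done":
        return .done(text: dict["text"] as? String ?? "")
    case "error":
        // Assumed payload shape: {"error": {"message": {"detail": "..."}}}
        let error = dict["error"] as? [String: Any]
        let message = error?["message"] as? [String: Any]
        return .error(message?["detail"] as? String ?? "unknown error")
    default:
        return .other(type: type)
    }
}
```

Keeping an `other` case means unknown or newly added event types are ignored gracefully instead of failing the whole stream.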

## Architecture

Four layers, data flows one direction:

```
Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
```

### 1. AudioCapture

- `AVAudioEngine` with input node tap
- Converts captured audio to PCM 16-bit LE mono @ 16 kHz
- Yields `Data` chunks every ~480 ms
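
The packing and chunking steps can be sketched independently of `AVAudioEngine`. The example below assumes the samples have already been downmixed to mono and resampled to 16 kHz (for instance with `AVAudioConverter`, not shown); `pcm16LittleEndian(from:)` and `ChunkAccumulator` are hypothetical names.

```swift
import Foundation

/// Converts normalized Float32 samples (-1.0...1.0) to 16-bit signed
/// little-endian PCM. Out-of-range values are clamped; NaN becomes silence.
func pcm16LittleEndian(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let safe = sample.isNaN ? 0 : max(-1.0, min(1.0, sample))
        let value = Int16((safe * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

/// Buffers PCM bytes and emits fixed-size chunks (15,360 bytes = 480 ms).
struct ChunkAccumulator {
    let chunkSize: Int
    private var buffer = Data()

    init(chunkSize: Int = 15_360) {
        self.chunkSize = chunkSize
    }

    /// Appends new bytes and returns every complete chunk now available.
    mutating func append(_ data: Data) -> [Data] {
        buffer.append(data)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer.prefix(chunkSize)))
            buffer.removeFirst(chunkSize)
        }
        return chunks
    }
}
```

Because tap callbacks rarely deliver exactly 480 ms of audio, accumulating bytes and slicing off complete chunks keeps the message cadence steady regardless of the tap's buffer size.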

### 2. VoxtralWebSocketClient

- Uses `URLSessionWebSocketTask` (no dependencies)
- Connects with Bearer token in headers
- Sends JSON messages with base64-encoded audio
- Sends flush/end control messages on stop
- Parses incoming JSON by `type` field
- Exposes text deltas via `AsyncStream` or callback

### 3. TranscriptionManager (ObservableObject)

- Owns AudioCapture and VoxtralWebSocketClient
- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
- Accumulates text from deltas into current session buffer
- On stop: appends timestamped session to log file
- Output mode: `.textBox` or `.cursorInjection` (togglable)
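
The state machine and delta buffer can be sketched as a plain value type, separate from the `ObservableObject` wrapper. The names `RecordingState` and `TranscriptionSession` are illustrative, and allowing a restart directly from `.error` is an assumption the spec does not spell out.

```swift
enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

/// Holds the current state and the text accumulated during one session.
struct TranscriptionSession {
    private(set) var state: RecordingState = .idle
    private(set) var buffer = ""

    /// `.idle` or `.error` → `.recording`; clears the previous buffer.
    mutating func start() {
        guard state != .recording else { return }
        buffer = ""
        state = .recording
    }

    /// Deltas are only accepted while recording.
    mutating func append(delta: String) {
        guard state == .recording else { return }
        buffer += delta
    }

    /// `.recording` → `.idle`; returns the finished text for logging.
    @discardableResult
    mutating func stop() -> String? {
        guard state == .recording else { return nil }
        state = .idle
        return buffer
    }

    /// Any state → `.error(message)`.
    mutating func fail(_ message: String) {
        state = .error(message)
    }
}
```

In the app, `TranscriptionManager` could hold such a value in a `@Published` property so that SwiftUI views update whenever the state or buffer changes.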

### 4. UI Components

#### MenuBarExtra
- Mic icon in menu bar, changes color when recording
- Dropdown: Start/Stop, Settings, Quit

#### TranscriptionWindow
- Small floating `NSPanel` (`.floating` window level)
- Shows accumulated transcription text
- Copy button
- Opens when recording starts (in textBox mode)

#### SettingsView
- API key text field (stored in UserDefaults)
- Global shortcut picker (configurable key combo)
- Output mode toggle (text box vs cursor injection)
- Latency slider (240ms–2400ms, maps to `target_streaming_delay_ms`)

## Cursor Injection

- Uses `CGEvent` to simulate a keystroke for each character of a received text delta
- Requires Accessibility permission
- Check via `AXIsProcessTrusted()` on first use; prompt if missing
- Falls back to text box mode if permission denied
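
A minimal sketch of per-character injection is shown below, guarded so that the Core Graphics part only compiles on Apple platforms. The function names `keystrokeUnits(for:)` and `typeText(_:)` are hypothetical, and using virtual key 0 with an attached Unicode string is one common approach, not necessarily the one this app will adopt.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
#endif

/// UTF-16 code units for one character; emoji and other characters outside
/// the Basic Multilingual Plane need two units.
func keystrokeUnits(for character: Character) -> [UInt16] {
    Array(String(character).utf16)
}

#if canImport(ApplicationServices)
/// Types `text` at the current cursor position via synthetic key events.
/// Returns false when Accessibility permission is missing or an event
/// cannot be created, so the caller can fall back to text box mode.
func typeText(_ text: String) -> Bool {
    guard AXIsProcessTrusted() else { return false }
    for character in text {
        let units = keystrokeUnits(for: character)
        // Post a key-down followed by a key-up carrying the same string.
        for keyDown in [true, false] {
            guard let event = CGEvent(keyboardEventSource: nil,
                                      virtualKey: 0,
                                      keyDown: keyDown) else { return false }
            event.keyboardSetUnicodeString(stringLength: units.count,
                                           unicodeString: units)
            event.post(tap: .cghidEventTap)
        }
    }
    return true
}
#endif
```

Attaching the Unicode string to the event avoids mapping characters to layout-specific key codes, which would otherwise differ between keyboard layouts.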

## Global Shortcut

- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown)` for the configured combo
- Default: unset (user must configure)
- Stored in UserDefaults as modifier flags + key code
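
Storing and matching the combo could be sketched as follows. The `Shortcut` type, its UserDefaults key strings, and `installMonitor` are assumptions made for illustration; the mask value is the raw value of `NSEvent.ModifierFlags.deviceIndependentFlagsMask`. Note that global monitors for key events only fire when the app is trusted for Accessibility, and they do not see events sent to the app's own windows.

```swift
import Foundation
#if canImport(AppKit)
import AppKit
#endif

struct Shortcut: Equatable {
    let keyCode: UInt16        // same meaning as NSEvent.keyCode
    let modifierFlags: UInt    // raw value of NSEvent.ModifierFlags

    // Hypothetical UserDefaults keys; the spec does not name them.
    static let keyCodeKey = "shortcut.keyCode"
    static let modifiersKey = "shortcut.modifierFlags"
    // Raw value of NSEvent.ModifierFlags.deviceIndependentFlagsMask.
    static let deviceIndependentMask: UInt = 0xFFFF_0000

    func save(to defaults: UserDefaults) {
        defaults.set(Int(keyCode), forKey: Self.keyCodeKey)
        defaults.set(Int(clamping: modifierFlags), forKey: Self.modifiersKey)
    }

    /// Returns nil when no shortcut has been configured (the default).
    static func load(from defaults: UserDefaults) -> Shortcut? {
        guard defaults.object(forKey: keyCodeKey) != nil,
              defaults.object(forKey: modifiersKey) != nil,
              let keyCode = UInt16(exactly: defaults.integer(forKey: keyCodeKey)),
              let flags = UInt(exactly: defaults.integer(forKey: modifiersKey))
        else { return nil }
        return Shortcut(keyCode: keyCode, modifierFlags: flags)
    }

    /// Compares key code and modifiers, ignoring device-dependent flag bits.
    func matches(keyCode: UInt16, modifierFlags: UInt) -> Bool {
        keyCode == self.keyCode &&
            (modifierFlags & Self.deviceIndependentMask) ==
            (self.modifierFlags & Self.deviceIndependentMask)
    }
}

#if canImport(AppKit)
/// Installs a global key-down monitor; keep the returned token to remove it later.
func installMonitor(for shortcut: Shortcut, action: @escaping () -> Void) -> Any? {
    NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { event in
        if shortcut.matches(keyCode: event.keyCode,
                            modifierFlags: event.modifierFlags.rawValue) {
            action()
        }
    }
}
#endif
```

The token returned by `installMonitor` should be passed to `NSEvent.removeMonitor(_:)` before a new shortcut is installed, so the old combo stops triggering.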

## Transcription Log

- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
- Format: `[ISO-8601 timestamp]\n<transcribed text>\n---\n`
- Written on session end (stop recording or `transcription.done`)
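
The entry format and the append step could be sketched with Foundation alone. `logEntry(text:date:)` and `appendToLog(_:at:)` are hypothetical names; the caller is assumed to supply the log file URL described above.

```swift
import Foundation

/// Formats one session as `[ISO-8601 timestamp]\n<text>\n---\n`.
func logEntry(text: String, date: Date = Date()) -> String {
    let timestamp = ISO8601DateFormatter().string(from: date)
    return "[\(timestamp)]\n\(text)\n---\n"
}

/// Appends an entry to the log file, creating the directory and file if needed.
func appendToLog(_ entry: String, at url: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: url.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    if !fileManager.fileExists(atPath: url.path) {
        try Data().write(to: url)
    }
    let handle = try FileHandle(forWritingTo: url)
    defer { try? handle.close() }
    try handle.seekToEnd()                       // never overwrite earlier sessions
    try handle.write(contentsOf: Data(entry.utf8))
}
```

The Application Support directory can be resolved with `FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)` rather than hard-coding the `~/Library` path.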

## Error Handling

| Condition | Behavior |
|---|---|
| No API key set | Open Settings automatically on first launch |
| WebSocket disconnect | Auto-retry once, then show error in menu bar |
| Mic permission denied | Show an alert pointing to System Settings > Privacy & Security > Microphone |
| Accessibility permission missing | Check via `AXIsProcessTrusted()`, prompt the user, fall back to text box |
| Invalid API key (401) | Show error in dropdown, stop recording |
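
The "auto-retry once" rule for disconnects can be captured in a small counter. `ReconnectPolicy` is a hypothetical helper, and resetting the counter after a successful reconnect is an assumption the table does not state.

```swift
/// Tracks reconnect attempts for one recording session.
struct ReconnectPolicy {
    let maxRetries: Int
    private(set) var attempts = 0

    init(maxRetries: Int = 1) {
        self.maxRetries = maxRetries
    }

    /// Returns true if another reconnect should be attempted, and counts it.
    mutating func shouldRetryAfterDisconnect() -> Bool {
        guard attempts < maxRetries else { return false }
        attempts += 1
        return true
    }

    /// Call after a successful reconnect or when a new session starts.
    mutating func reset() {
        attempts = 0
    }
}
```

When `shouldRetryAfterDisconnect()` returns false, the manager would move to its error state and surface the message in the menu bar.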

## File Structure

```
MyVoxtral/
├── MyVoxtralApp.swift # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift # Orchestrator, state machine
│   └── AppSettings.swift # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift # WebSocket protocol
│   └── VoxtralMessages.swift # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift # Floating text panel
│   ├── SettingsView.swift # Preferences window
│   └── MenuBarView.swift # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift # CGEvent keystroke sim
│   ├── GlobalShortcut.swift # Configurable hotkey
│   └── TranscriptionLogger.swift # Append to log file
└── Resources/
    └── Assets.xcassets # Menu bar icons
```