myVoxtral/docs/superpowers/specs/2026-04-07-myvoxtral-design.md

# MyVoxtral — macOS Realtime Transcription App

## Overview

A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes to either a floating text window or directly to the active cursor position via accessibility APIs.

## Target

- macOS 14+
- SwiftUI
- No third-party dependencies

## API Protocol

### Connection

- WebSocket to `wss://api.mistral.ai/v1/audio/transcriptions/realtime`
- Auth: Bearer token in WebSocket upgrade headers
- Model: `voxtral-mini-transcribe-realtime-2602`

### Audio Format

- PCM 16-bit signed little-endian, mono, 16 kHz
- 480 ms chunks (7,680 samples, 15,360 bytes raw)
- Sent as base64-encoded strings in JSON messages
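
The chunk size follows directly from the format above. A minimal sketch of the arithmetic, using a hypothetical `ChunkMath` namespace that is not part of the planned file structure:

```swift
import Foundation

/// Chunk-size arithmetic for the audio format in this spec.
/// `ChunkMath` is an illustrative name only.
enum ChunkMath {
    static let sampleRate = 16_000        // Hz
    static let bytesPerSample = 2         // 16-bit signed PCM
    static let channels = 1               // mono
    static let chunkMilliseconds = 480

    static let samplesPerChunk = sampleRate * chunkMilliseconds / 1000
    static let bytesPerChunk = samplesPerChunk * bytesPerSample * channels

    /// Base64 emits 4 characters for every 3 input bytes, rounded up.
    static let base64Length = (bytesPerChunk + 2) / 3 * 4

    /// Encodes one raw PCM chunk the way it would travel inside a JSON message.
    static func encode(_ chunk: Data) -> String {
        chunk.base64EncodedString()
    }
}
```

Base64 grows each 15,360-byte chunk to 20,480 characters before any JSON framing is added.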

### Messages Sent (Client → Server)

```json
{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
```
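
One way to model these four messages in `VoxtralMessages.swift` is a single `Encodable` enum. This is a sketch only: the type names `ClientMessage`, `SessionConfig`, and `AudioFormatConfig` are assumptions, and the clamp to 240–2400 ms mirrors the latency slider range given under SettingsView rather than a documented server limit.

```swift
import Foundation

struct AudioFormatConfig: Encodable {
    let encoding = "pcm_s16le"
    let sampleRate = 16_000
    enum CodingKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }
}

struct SessionConfig: Encodable {
    let audioFormat = AudioFormatConfig()
    let targetStreamingDelayMs: Int

    init(targetStreamingDelayMs: Int) {
        // Assumption: clamp to the slider range from SettingsView.
        self.targetStreamingDelayMs = min(max(targetStreamingDelayMs, 240), 2400)
    }

    enum CodingKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
}

enum ClientMessage: Encodable {
    case sessionUpdate(SessionConfig)
    case append(Data)   // raw PCM; base64 is applied during encoding
    case flush
    case end

    private enum CodingKeys: String, CodingKey { case type, audio, session }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .sessionUpdate(let config):
            try container.encode("session.update", forKey: .type)
            try container.encode(config, forKey: .session)
        case .append(let pcm):
            try container.encode("input_audio.append", forKey: .type)
            try container.encode(pcm.base64EncodedString(), forKey: .audio)
        case .flush:
            try container.encode("input_audio.flush", forKey: .type)
        case .end:
            try container.encode("input_audio.end", forKey: .type)
        }
    }

    /// JSON text for one WebSocket frame. Keys are sorted only to make output stable.
    func jsonString() throws -> String {
        let encoder = JSONEncoder()
        encoder.outputFormatting = [.sortedKeys]
        return String(decoding: try encoder.encode(self), as: UTF8.self)
    }
}
```

Keeping the base64 step inside `encode(to:)` means callers hand over raw PCM `Data` and never deal with the wire encoding.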

### Events Received (Server → Client)

| Event type | Key fields | Description |
|---|---|---|
| `session.created` | `.session` | Connection established |
| `transcription.text.delta` | `.text` (string) | Incremental transcribed text |
| `transcription.segment` | `.text`, `.start`, `.end`, `.speaker_id` | Segment-level transcription |
| `transcription.language` | `.audio_language` | Detected language |
| `transcription.done` | `.model`, `.text`, `.usage`, `.language`, `.segments` | Session complete |
| `error` | `.error.message.detail` | Error details |
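
Dispatching on the `type` field could look like the sketch below. `ServerEvent`, `EventEnvelope`, and `parseServerEvent` are hypothetical names. Only the fields needed for live text are decoded; segment events fall through to `.other`, and the error payload is passed along as raw JSON because its exact shape is not pinned down here.

```swift
import Foundation

enum ServerEvent: Equatable {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(text: String)
    case error(payload: String)   // raw JSON, shape not assumed
    case other(type: String)      // e.g. transcription.segment, unknown types
}

/// Flat view of the fields this sketch cares about; everything else is ignored.
struct EventEnvelope: Decodable {
    let type: String
    let text: String?
    let audioLanguage: String?

    enum CodingKeys: String, CodingKey {
        case type, text
        case audioLanguage = "audio_language"
    }
}

func parseServerEvent(_ json: String) throws -> ServerEvent {
    let envelope = try JSONDecoder().decode(EventEnvelope.self, from: Data(json.utf8))
    switch envelope.type {
    case "session.created":
        return .sessionCreated
    case "transcription.text.delta":
        return .textDelta(envelope.text ?? "")
    case "transcription.language":
        return .language(envelope.audioLanguage ?? "")
    case "transcription.done":
        return .done(text: envelope.text ?? "")
    case "error":
        return .error(payload: json)
    default:
        return .other(type: envelope.type)
    }
}
```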

## Architecture

Four layers; data flows in one direction:

```
Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor
```

### 1. AudioCapture

- `AVAudioEngine` with input node tap
- Converts captured audio to PCM 16-bit LE mono @ 16kHz
- Yields `Data` chunks every ~480ms
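
The tap delivers buffers whose size rarely lines up with 480 ms, so some re-chunking step is needed between the converter and the WebSocket client. The sketch below covers only that platform-independent part: a Float32 to 16-bit LE conversion and a fixed-size chunker. `pcm16LE(from:)` and `PCMChunker` are hypothetical names, and in the real app `AVAudioConverter` may already handle the sample-format conversion.

```swift
import Foundation

/// Converts Float32 samples in -1...1 to 16-bit signed little-endian PCM.
func pcm16LE(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let clamped = max(-1.0, min(1.0, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

/// Accumulates PCM bytes and emits fixed-size chunks (15,360 bytes = 480 ms).
struct PCMChunker {
    let chunkSize: Int
    private var buffer = Data()

    init(chunkSize: Int = 15_360) {
        self.chunkSize = chunkSize
    }

    /// Adds bytes and returns every complete chunk now available.
    mutating func append(_ pcm: Data) -> [Data] {
        buffer.append(pcm)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer.prefix(chunkSize)))
            buffer = Data(buffer.dropFirst(chunkSize))   // re-base indices to zero
        }
        return chunks
    }

    /// Returns the trailing partial chunk, for sending just before flush/end.
    mutating func drain() -> Data {
        let rest = buffer
        buffer = Data()
        return rest
    }
}
```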

### 2. VoxtralWebSocketClient

- Uses `URLSessionWebSocketTask` (no dependencies)
- Connects with Bearer token in headers
- Sends JSON messages with base64-encoded audio
- Sends flush/end control messages on stop
- Parses incoming JSON by `type` field
- Exposes text deltas via AsyncStream or callback
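
A rough sketch of the connection and receive loop, under a few assumptions: the helper names (`makeRealtimeRequest`, `openSocket`, `messages(from:)`) are illustrative, and how the model ID is passed is not covered by this spec, so it is left out. The socket parts are wrapped in `#if canImport(Darwin)` because the app targets macOS only.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Builds the upgrade request with the Bearer token in its headers.
func makeRealtimeRequest(apiKey: String) -> URLRequest {
    let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!
    var request = URLRequest(url: url)
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}

#if canImport(Darwin)
/// Opens the socket; the task starts suspended, so resume() is required.
func openSocket(apiKey: String, session: URLSession = .shared) -> URLSessionWebSocketTask {
    let task = session.webSocketTask(with: makeRealtimeRequest(apiKey: apiKey))
    task.resume()
    return task
}

/// Exposes incoming text frames as a stream; each element is one JSON event.
func messages(from task: URLSessionWebSocketTask) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        let worker = Task {
            do {
                while !Task.isCancelled {
                    // Binary frames are ignored; the protocol above is JSON text.
                    if case .string(let text) = try await task.receive() {
                        continuation.yield(text)
                    }
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
        continuation.onTermination = { _ in worker.cancel() }
    }
}
#endif
```

Sending would then be `try await task.send(.string(json))` with the JSON text of one client message.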

### 3. TranscriptionManager (ObservableObject)

- Owns AudioCapture and VoxtralWebSocketClient
- State machine: `.idle` → `.recording` → `.idle` (or `.error(String)`)
- Accumulates text from deltas into current session buffer
- On stop: appends timestamped session to log file
- Output mode: `.textBox` or `.cursorInjection` (togglable)
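
The state machine and delta accumulation can be kept separate from `ObservableObject` plumbing so they are easy to test. A minimal sketch with hypothetical names (`RecordingState`, `RecordingEvent`, `TranscriptionSession`); allowing a restart directly from `.error` is an assumption, since the transitions above do not say how an error state is left.

```swift
enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

enum RecordingEvent {
    case start
    case stop
    case fail(String)
}

struct TranscriptionSession {
    private(set) var state: RecordingState = .idle
    private(set) var text = ""

    mutating func handle(_ event: RecordingEvent) {
        switch (state, event) {
        case (.idle, .start), (.error, .start):
            text = ""                 // each recording gets a fresh buffer
            state = .recording
        case (.recording, .stop):
            state = .idle             // text stays available for logging
        case (_, .fail(let message)):
            state = .error(message)
        default:
            break                     // ignore events that do not apply
        }
    }

    /// Appends one text delta; deltas arriving outside a recording are dropped.
    mutating func append(delta: String) {
        guard state == .recording else { return }
        text += delta
    }
}
```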

### 4. UI Components

#### MenuBarExtra
- Mic icon in menu bar, changes color when recording
- Dropdown: Start/Stop, Settings, Quit

#### TranscriptionWindow
- Small floating `NSPanel` (`.floating` window level)
- Shows accumulated transcription text
- Copy button
- Opens when recording starts (in textBox mode)

#### SettingsView
- API key text field (stored in UserDefaults)
- Global shortcut picker (configurable key combo)
- Output mode toggle (text box vs cursor injection)
- Latency slider (240–2400 ms, maps to `target_streaming_delay_ms`)

## Cursor Injection

- Uses `CGEvent` to simulate keystrokes for each character of every received text delta
- Requires Accessibility permission
- Check via `AXIsProcessTrusted()` on first use; if it returns `false`, prompt with `AXIsProcessTrustedWithOptions` and `kAXTrustedCheckOptionPrompt`
- Falls back to text box mode if permission denied
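
A sketch of how `CursorInjector.swift` might post text, with caveats. Instead of mapping characters to key codes, it attaches the text to a synthetic key event via `keyboardSetUnicodeString`. The 20 code-unit limit per event is an assumption based on commonly reported behavior, not something stated in this spec, and the function names are illustrative.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
#endif

/// Splits text into pieces of at most `maxUnits` UTF-16 code units without
/// breaking a Character apart. A single Character longer than `maxUnits`
/// is kept whole in its own piece.
func utf16Pieces(of text: String, maxUnits: Int = 20) -> [[UInt16]] {
    var pieces: [[UInt16]] = []
    var current: [UInt16] = []
    for character in text {
        let units = Array(String(character).utf16)
        if !current.isEmpty && current.count + units.count > maxUnits {
            pieces.append(current)
            current = []
        }
        current.append(contentsOf: units)
    }
    if !current.isEmpty { pieces.append(current) }
    return pieces
}

#if canImport(ApplicationServices)
/// Returns true if the app is trusted; otherwise asks macOS to show its prompt.
func ensureAccessibilityPermission() -> Bool {
    if AXIsProcessTrusted() { return true }
    let key = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    return AXIsProcessTrustedWithOptions([key: true] as CFDictionary)
}

/// Posts key-down/key-up pairs carrying the text to the focused application.
func typeText(_ text: String) {
    let source = CGEventSource(stateID: .hidSystemState)
    for piece in utf16Pieces(of: text) {
        for keyDown in [true, false] {
            guard let event = CGEvent(keyboardEventSource: source,
                                      virtualKey: 0,
                                      keyDown: keyDown) else { continue }
            event.keyboardSetUnicodeString(stringLength: piece.count, unicodeString: piece)
            event.post(tap: .cghidEventTap)
        }
    }
}
#endif
```

If `ensureAccessibilityPermission()` returns `false`, the caller would switch the output mode to the text box, as described above.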

## Global Shortcut

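- `NSEvent.addGlobalMonitorForEvents(matching: .keyDown, handler:)` for the configured combo; global monitors can observe key events but not consume them, and key events are only delivered once the app is trusted for Accessibility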
- Default: unset (user must configure)
- Stored in UserDefaults as modifier flags + key code
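
Storing the combo as modifier flags plus key code could be sketched as follows. `StoredShortcut`, `ShortcutStore`, and the `"globalShortcut"` defaults key are hypothetical. The bit positions mirror the raw values of `NSEvent.ModifierFlags` (`.shift`, `.control`, `.option`, `.command`) so the matching logic does not need AppKit; comparing only those four bits keeps Caps Lock from breaking a match.

```swift
import Foundation

struct StoredShortcut: Codable, Equatable {
    var keyCode: UInt16
    var modifierFlags: UInt   // raw value of NSEvent.ModifierFlags

    /// Shift (1<<17), Control (1<<18), Option (1<<19), Command (1<<20).
    static let relevantModifiers: UInt = (1 << 17) | (1 << 18) | (1 << 19) | (1 << 20)

    /// True when an incoming key event matches this shortcut.
    func matches(keyCode: UInt16, modifierFlags: UInt) -> Bool {
        self.keyCode == keyCode &&
            (self.modifierFlags & Self.relevantModifiers) ==
            (modifierFlags & Self.relevantModifiers)
    }
}

enum ShortcutStore {
    static let key = "globalShortcut"   // hypothetical defaults key

    /// Passing nil clears the shortcut, matching the "unset" default.
    static func save(_ shortcut: StoredShortcut?, to defaults: UserDefaults = .standard) {
        guard let shortcut = shortcut,
              let data = try? JSONEncoder().encode(shortcut) else {
            defaults.removeObject(forKey: key)
            return
        }
        defaults.set(data, forKey: key)
    }

    static func load(from defaults: UserDefaults = .standard) -> StoredShortcut? {
        guard let data = defaults.data(forKey: key) else { return nil }
        return try? JSONDecoder().decode(StoredShortcut.self, from: data)
    }
}
```

Inside the monitor handler, the check would be `shortcut.matches(keyCode: event.keyCode, modifierFlags: event.modifierFlags.rawValue)`.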

## Transcription Log

- Append-only text file at `~/Library/Application Support/MyVoxtral/transcription.log`
- Format: `[ISO-8601 timestamp]\n<transcribed text>\n---\n`
- Written on session end (stop recording or `transcription.done`)
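
The entry format and append step might look like this in `TranscriptionLogger.swift`. The function names are illustrative, and the timestamp uses `ISO8601DateFormatter` defaults (UTC, second precision), which is one reasonable reading of "ISO-8601 timestamp".

```swift
import Foundation

/// Formats one entry as `[ISO-8601 timestamp]\n<transcribed text>\n---\n`.
func formatLogEntry(text: String, at date: Date) -> String {
    let formatter = ISO8601DateFormatter()
    return "[\(formatter.string(from: date))]\n\(text)\n---\n"
}

/// Appends an entry, creating the directory and file on first use.
func appendLogEntry(_ entry: String, to fileURL: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: fileURL.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    if !fileManager.fileExists(atPath: fileURL.path) {
        try Data().write(to: fileURL)
    }
    let handle = try FileHandle(forWritingTo: fileURL)
    defer { try? handle.close() }
    try handle.seekToEnd()
    try handle.write(contentsOf: Data(entry.utf8))
}

/// Resolves the log location under Application Support.
func defaultLogURL() -> URL? {
    FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)
        .first?
        .appendingPathComponent("MyVoxtral")
        .appendingPathComponent("transcription.log")
}
```

In a sandboxed build, the Application Support URL resolves inside the app container rather than the literal `~/Library` path.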

## Error Handling

| Condition | Behavior |
|---|---|
| No API key set | Open Settings automatically on first launch |
| WebSocket disconnect | Auto-retry once, then show error in menu bar |
| Mic permission denied | Show alert pointing to System Settings > Privacy & Security > Microphone |
| Accessibility permission missing | Detect via `AXIsProcessTrusted()`, prompt via `AXIsProcessTrustedWithOptions`, fall back to text box |
| Invalid API key (401) | Show error in dropdown, stop recording |

## File Structure

```
MyVoxtral/
├── MyVoxtralApp.swift              # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift  # Orchestrator, state machine
│   └── AppSettings.swift           # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift          # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift # WebSocket protocol
│   └── VoxtralMessages.swift       # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift   # Floating text panel
│   ├── SettingsView.swift          # Preferences window
│   └── MenuBarView.swift           # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift        # CGEvent keystroke sim
│   ├── GlobalShortcut.swift        # Configurable hotkey
│   └── TranscriptionLogger.swift   # Append to log file
└── Resources/
    └── Assets.xcassets              # Menu bar icons
```