Carsten Abele 4b1cae1b5f chore: scaffold MyVoxtral macOS menu bar app

2026-04-07 19:15:49 +02:00

5.1 KiB


MyVoxtral — macOS Realtime Transcription App

Overview

A minimal macOS menu bar app that captures microphone audio and streams it to the Mistral Voxtral Realtime API for live transcription. Output goes either to a floating text window or directly to the active cursor position via simulated keystrokes, which requires the Accessibility permission.

Target

macOS 14+
SwiftUI
No third-party dependencies

API Protocol

Connection

WebSocket to wss://api.mistral.ai/v1/audio/transcriptions/realtime
Auth: Bearer token in WebSocket upgrade headers
Model: voxtral-mini-transcribe-realtime-2602

Audio Format

PCM 16-bit signed little-endian, mono, 16kHz
480ms chunks (7,680 samples × 2 bytes = 15,360 bytes raw)
Sent as base64-encoded strings in JSON messages
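
The chunk size follows directly from the format parameters above. As a quick sanity check, the arithmetic can be sketched like this (the variable names are illustrative, not taken from the project):

```swift
import Foundation

// Audio parameters taken from the format list above.
let sampleRate = 16_000          // samples per second
let bytesPerSample = 2           // 16-bit signed PCM
let channels = 1                 // mono
let chunkMilliseconds = 480

// Raw PCM size of one chunk.
let samplesPerChunk = sampleRate * chunkMilliseconds / 1000
let rawChunkBytes = samplesPerChunk * bytesPerSample * channels

// Base64 turns every 3 input bytes into 4 output characters (with padding).
let base64Characters = (rawChunkBytes + 2) / 3 * 4

// Cross-check against Foundation's encoder using a silent chunk.
let silentChunk = Data(count: rawChunkBytes)
let encodedSilence = silentChunk.base64EncodedString()

print(samplesPerChunk, rawChunkBytes, base64Characters, encodedSilence.count)
```

Each JSON message therefore carries roughly a third more characters than the raw PCM it wraps, which is worth keeping in mind when estimating upstream bandwidth.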

Messages Sent (Client → Server)

{"type": "input_audio.append", "audio": "<base64-encoded PCM>"}
{"type": "input_audio.flush"}
{"type": "input_audio.end"}
{"type": "session.update", "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000}, "target_streaming_delay_ms": 240}}
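
One way to produce these messages is a single `Encodable` enum. The sketch below is an assumption about how `VoxtralMessages.swift` might look; the type name `ClientMessage`, its cases, and the helper `jsonString(for:)` are hypothetical, while the JSON keys and `type` strings are copied from the list above.

```swift
import Foundation

// Hypothetical client-side message type; JSON shapes follow the list above.
enum ClientMessage: Encodable {
    case appendAudio(Data)
    case flush
    case end
    case sessionUpdate(sampleRate: Int, targetStreamingDelayMs: Int)

    private enum CodingKeys: String, CodingKey { case type, audio, session }
    private enum SessionKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
    private enum FormatKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .appendAudio(let pcm):
            try container.encode("input_audio.append", forKey: .type)
            // Raw PCM bytes are sent as a base64 string.
            try container.encode(pcm.base64EncodedString(), forKey: .audio)
        case .flush:
            try container.encode("input_audio.flush", forKey: .type)
        case .end:
            try container.encode("input_audio.end", forKey: .type)
        case .sessionUpdate(let sampleRate, let delay):
            try container.encode("session.update", forKey: .type)
            var session = container.nestedContainer(keyedBy: SessionKeys.self, forKey: .session)
            try session.encode(delay, forKey: .targetStreamingDelayMs)
            var format = session.nestedContainer(keyedBy: FormatKeys.self, forKey: .audioFormat)
            try format.encode("pcm_s16le", forKey: .encoding)
            try format.encode(sampleRate, forKey: .sampleRate)
        }
    }
}

// Serializes a message to the JSON text that would be sent over the socket.
// Sorted keys are used only to make the output deterministic.
func jsonString(for message: ClientMessage) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(message), as: UTF8.self)
}
```

Keeping the wire format in one enum means the WebSocket client only ever deals with `ClientMessage` values and never assembles JSON by hand.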

Events Received (Server → Client)

Event type	Key fields	Description
`session.created`	`.session`	Connection established
`transcription.text.delta`	`.text` (string)	Incremental transcribed text
`transcription.segment`	`.text`, `.start`, `.end`, `.speaker_id`	Segment-level transcription
`transcription.language`	`.audio_language`	Detected language
`transcription.done`	`.model`, `.text`, `.usage`, `.language`, `.segments`	Session complete
`error`	`.error.message.detail`	Error details
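
Incoming events can be decoded by switching on the `type` field. The following is a minimal sketch under assumptions: `ServerEvent` and `parseEvent` are hypothetical names, only a subset of fields is modelled, and event types not handled here (including `error`, whose exact payload shape is not spelled out above) fall through to `.unknown`.

```swift
import Foundation

// Hypothetical decoded form of the events in the table above.
enum ServerEvent: Decodable, Equatable {
    case sessionCreated
    case textDelta(String)
    case language(String)
    case done(text: String)
    case unknown(type: String)

    private enum CodingKeys: String, CodingKey {
        case type, text
        case audioLanguage = "audio_language"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let type = try container.decode(String.self, forKey: .type)
        switch type {
        case "session.created":
            self = .sessionCreated
        case "transcription.text.delta":
            self = try .textDelta(container.decode(String.self, forKey: .text))
        case "transcription.language":
            self = try .language(container.decode(String.self, forKey: .audioLanguage))
        case "transcription.done":
            self = try .done(text: container.decode(String.self, forKey: .text))
        default:
            // Preserve the type string so callers can log unexpected events.
            self = .unknown(type: type)
        }
    }
}

// Parses one text frame received from the server.
func parseEvent(_ json: String) throws -> ServerEvent {
    try JSONDecoder().decode(ServerEvent.self, from: Data(json.utf8))
}
```

Falling back to `.unknown` rather than throwing keeps the receive loop alive if the server adds event types later.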

Architecture

Four layers; data flows in one direction:

Mic → AudioCapture → VoxtralWebSocketClient → TranscriptionManager → UI / Cursor

1. AudioCapture

AVAudioEngine with input node tap
Converts captured audio to PCM 16-bit LE mono @ 16kHz
Yields Data chunks every ~480ms
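
The platform-independent part of this layer can be sketched in plain Swift: converting float samples to 16-bit little-endian PCM and regrouping them into fixed-size chunks. This assumes the samples have already been resampled to 16 kHz mono (for example by an `AVAudioConverter` fed from the input tap, which is omitted here); `pcm16LittleEndian` and `ChunkAccumulator` are hypothetical names.

```swift
import Foundation

// Converts Float32 samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
func pcm16LittleEndian(from samples: [Float]) -> Data {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        // Treat NaN as silence and clamp everything else into range.
        let safe = sample.isNaN ? 0 : max(-1.0, min(1.0, sample))
        let value = Int16((safe * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

// Accumulates PCM bytes and emits fixed-size chunks (15,360 bytes = 480 ms).
struct ChunkAccumulator {
    let chunkSize: Int
    private var buffer = Data()

    init(chunkSize: Int = 15_360) { self.chunkSize = chunkSize }

    // Returns every complete chunk available after appending `data`.
    mutating func append(_ data: Data) -> [Data] {
        buffer.append(data)
        var chunks: [Data] = []
        while buffer.count >= chunkSize {
            chunks.append(Data(buffer.prefix(chunkSize)))
            // Copy the remainder so the buffer's indices start at zero again.
            buffer = Data(buffer.dropFirst(chunkSize))
        }
        return chunks
    }

    // Returns whatever is left (possibly empty) when recording stops.
    mutating func drain() -> Data {
        defer { buffer = Data() }
        return buffer
    }
}
```

Tap callbacks rarely line up with 480 ms boundaries, so an accumulator like this decouples the hardware buffer size from the chunk size the API expects.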

2. VoxtralWebSocketClient

Uses URLSessionWebSocketTask (no dependencies)
Connects with Bearer token in headers
Sends JSON messages with base64-encoded audio
Sends flush/end control messages on stop
Parses incoming JSON by type field
Exposes text deltas via AsyncStream or callback
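
The connection setup reduces to a `URLRequest` carrying the Bearer token. A minimal sketch, assuming a hypothetical helper named `makeRealtimeRequest`; how the model name is communicated to the server is not stated above, so it is left out here.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms
#endif

// Builds the WebSocket upgrade request for the realtime endpoint.
func makeRealtimeRequest(apiKey: String) -> URLRequest {
    let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime")!
    var request = URLRequest(url: url)
    // The token travels in the HTTP headers of the upgrade request.
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    return request
}
```

On macOS the request would be handed to `URLSession.shared.webSocketTask(with:)`, followed by `resume()`; outgoing JSON is then sent as `.string(...)` messages and incoming frames are read with `receive()`.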

3. TranscriptionManager (ObservableObject)

Owns AudioCapture and VoxtralWebSocketClient
State machine: .idle → .recording → .idle (or .error(String))
Accumulates text from deltas into current session buffer
On stop: appends timestamped session to log file
Output mode: .textBox or .cursorInjection (togglable)
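
The state machine and delta accumulation can be illustrated without any UI framework. This is a sketch only: `RecordingState` and `TranscriptionSession` are hypothetical names, and a plain struct stands in for the `ObservableObject` the app would use.

```swift
import Foundation

enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

// Plain-Swift stand-in for the manager's state handling.
struct TranscriptionSession {
    private(set) var state: RecordingState = .idle
    private(set) var buffer = ""

    // Begins a new session with an empty buffer; also clears an error state.
    mutating func start() {
        guard state != .recording else { return }
        buffer = ""
        state = .recording
    }

    // Appends an incremental text delta; ignored unless recording.
    mutating func receive(delta: String) {
        guard state == .recording else { return }
        buffer += delta
    }

    // Ends the session and returns the finished text so the caller can log it.
    mutating func stop() -> String? {
        guard state == .recording else { return nil }
        state = .idle
        return buffer
    }

    mutating func fail(_ message: String) {
        state = .error(message)
    }
}
```

Returning the text from `stop()` gives the caller a single place to hand the finished session to the logger.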

4. UI Components

MenuBarExtra

Mic icon in menu bar, changes color when recording
Dropdown: Start/Stop, Settings, Quit

TranscriptionWindow

Small floating NSPanel (.floating window level)
Shows accumulated transcription text
Copy button
Opens when recording starts (in textBox mode)

SettingsView

API key text field (stored in UserDefaults)
Global shortcut picker (configurable key combo)
Output mode toggle (text box vs cursor injection)
Latency slider (240ms–2400ms, maps to target_streaming_delay_ms)

Cursor Injection

CGEvent to simulate keystrokes for each character of received text delta
Requires Accessibility permission
Check via AXIsProcessTrusted() on first use; prompt if missing
Falls back to text box mode if permission denied
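
A common way to type arbitrary text with `CGEvent` is to attach the character's UTF-16 units to a synthetic key event rather than looking up a virtual key code per character. The sketch below assumes that approach; `utf16Groups(for:)` and `typeText(_:)` are hypothetical names, and the virtual key code `0` is only a placeholder that the attached Unicode string overrides.

```swift
import Foundation
#if canImport(ApplicationServices)
import ApplicationServices
import CoreGraphics
#endif

// Splits text into one UTF-16 unit group per character, so characters
// outside the Basic Multilingual Plane stay together as surrogate pairs.
func utf16Groups(for text: String) -> [[UInt16]] {
    text.map { Array(String($0).utf16) }
}

#if canImport(ApplicationServices)
// Posts a key-down/key-up pair per character. Returns false when the
// Accessibility permission is missing so the caller can fall back to
// text box mode.
func typeText(_ text: String) -> Bool {
    guard AXIsProcessTrusted() else { return false }
    for units in utf16Groups(for: text) {
        for keyDown in [true, false] {
            guard let event = CGEvent(keyboardEventSource: nil,
                                      virtualKey: 0,
                                      keyDown: keyDown) else { return false }
            event.keyboardSetUnicodeString(stringLength: units.count,
                                           unicodeString: units)
            event.post(tap: .cghidEventTap)
        }
    }
    return true
}
#endif
```

Grouping by character rather than by UTF-16 unit avoids sending half of a surrogate pair as its own keystroke.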

Global Shortcut

NSEvent.addGlobalMonitorForEvents(matching: .keyDown, handler:) for the configured combo; key events are only delivered once the app is trusted for Accessibility
Default: unset (user must configure)
Stored in UserDefaults as modifier flags + key code
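
Storing and matching the shortcut might look like the sketch below. The type `StoredShortcut`, the defaults keys, and the helper functions are hypothetical; the mask covers the Shift, Control, Option, and Command bits of `NSEvent.ModifierFlags` so that Caps Lock or device-dependent bits do not prevent a match.

```swift
import Foundation

// Hypothetical persisted form of the shortcut.
struct StoredShortcut: Equatable {
    var modifierFlags: UInt   // NSEvent.ModifierFlags.rawValue
    var keyCode: UInt16       // NSEvent.keyCode

    // Shift (1 << 17), Control (1 << 18), Option (1 << 19), Command (1 << 20).
    static let relevantModifiers: UInt = 0x1E_0000

    // True when an incoming key event corresponds to this shortcut.
    func matches(keyCode: UInt16, modifierFlags: UInt) -> Bool {
        keyCode == self.keyCode &&
        (modifierFlags & Self.relevantModifiers) == (self.modifierFlags & Self.relevantModifiers)
    }
}

// Hypothetical UserDefaults keys.
let shortcutFlagsKey = "shortcutModifierFlags"
let shortcutKeyCodeKey = "shortcutKeyCode"

// Saves the shortcut, or clears it when nil (the "unset" default).
func saveShortcut(_ shortcut: StoredShortcut?, to defaults: UserDefaults) {
    guard let shortcut = shortcut else {
        defaults.removeObject(forKey: shortcutFlagsKey)
        defaults.removeObject(forKey: shortcutKeyCodeKey)
        return
    }
    defaults.set(Int(shortcut.modifierFlags), forKey: shortcutFlagsKey)
    defaults.set(Int(shortcut.keyCode), forKey: shortcutKeyCodeKey)
}

// Returns nil when no shortcut has been configured yet.
func loadShortcut(from defaults: UserDefaults) -> StoredShortcut? {
    guard let flags = defaults.object(forKey: shortcutFlagsKey) as? Int,
          let code = defaults.object(forKey: shortcutKeyCodeKey) as? Int,
          let modifierFlags = UInt(exactly: flags),
          let keyCode = UInt16(exactly: code) else { return nil }
    return StoredShortcut(modifierFlags: modifierFlags, keyCode: keyCode)
}
```

Inside the global monitor's handler, the check would then be `shortcut.matches(keyCode: event.keyCode, modifierFlags: event.modifierFlags.rawValue)`.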

Transcription Log

Append-only text file at ~/Library/Application Support/MyVoxtral/transcription.log
Format: [ISO-8601 timestamp]\n<transcribed text>\n---\n
Written on session end (stop recording or transcription.done)
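
The entry format and the append-only write can be sketched with Foundation alone. `formatLogEntry` and `appendLogEntry` are hypothetical names; the timestamp uses `ISO8601DateFormatter` with its default options (UTC, internet date-time), which is one reasonable reading of "ISO-8601 timestamp".

```swift
import Foundation

// Formats one session as "[timestamp]\n<text>\n---\n".
func formatLogEntry(text: String, date: Date) -> String {
    let formatter = ISO8601DateFormatter()
    return "[\(formatter.string(from: date))]\n\(text)\n---\n"
}

// Appends an entry to the log, creating the directory and file if needed.
func appendLogEntry(_ entry: String, to fileURL: URL) throws {
    let fileManager = FileManager.default
    try fileManager.createDirectory(at: fileURL.deletingLastPathComponent(),
                                    withIntermediateDirectories: true)
    let bytes = Data(entry.utf8)
    if !fileManager.fileExists(atPath: fileURL.path) {
        try bytes.write(to: fileURL)
        return
    }
    let handle = try FileHandle(forWritingTo: fileURL)
    defer { try? handle.close() }
    try handle.seekToEnd()
    try handle.write(contentsOf: bytes)
}
```

In the app, the file URL would be built from `FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)` with `MyVoxtral/transcription.log` appended.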

Error Handling

Condition	Behavior
No API key set	Open Settings automatically on first launch
WebSocket disconnect	Auto-retry once, then show error in menu bar
Mic permission denied	Show alert pointing to System Settings > Privacy & Security > Microphone
Accessibility permission missing	Check via `AXIsProcessTrusted()`, prompt via `AXIsProcessTrustedWithOptions()`, fall back to text box
Invalid API key (401)	Show error in dropdown, stop recording

File Structure

MyVoxtral/
├── MyVoxtralApp.swift              # App entry, MenuBarExtra
├── Models/
│   ├── TranscriptionManager.swift  # Orchestrator, state machine
│   └── AppSettings.swift           # UserDefaults wrapper
├── Audio/
│   └── AudioCapture.swift          # AVAudioEngine mic capture
├── Network/
│   ├── VoxtralWebSocketClient.swift # WebSocket protocol
│   └── VoxtralMessages.swift       # JSON message types (Codable)
├── Views/
│   ├── TranscriptionWindow.swift   # Floating text panel
│   ├── SettingsView.swift          # Preferences window
│   └── MenuBarView.swift           # Dropdown menu content
├── Utilities/
│   ├── CursorInjector.swift        # CGEvent keystroke sim
│   ├── GlobalShortcut.swift        # Configurable hotkey
│   └── TranscriptionLogger.swift   # Append to log file
└── Resources/
    └── Assets.xcassets              # Menu bar icons
