# MyVoxtral Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a macOS menu bar app that streams microphone audio to the Mistral Voxtral Realtime API and outputs transcribed text to a floating window or the active cursor.

**Architecture:** SwiftUI menu bar app with four layers — AudioCapture (AVAudioEngine), VoxtralWebSocketClient (URLSessionWebSocketTask), TranscriptionManager (ObservableObject orchestrator), and UI (MenuBarExtra + floating panel). No third-party dependencies.
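
As a sketch, the intended data flow between the four layers:

```
Mic ──AVAudioEngine tap──▶ AudioCapture ──480 ms PCM chunks──▶ TranscriptionManager
TranscriptionManager ──base64 JSON over WSS──▶ VoxtralWebSocketClient ──▶ Voxtral API
Voxtral API ──text deltas──▶ VoxtralWebSocketClient ──VoxtralEvent──▶ TranscriptionManager ──▶ floating window / CursorInjector
```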

**Tech Stack:** Swift, SwiftUI, AVFoundation, WebSocket (URLSession), CGEvent, macOS 14+

---

## File Structure
```
MyVoxtral/
├── MyVoxtralApp.swift               # App entry point, MenuBarExtra, app lifecycle
├── Models/
│   ├── TranscriptionManager.swift   # State machine orchestrator, owns audio + WS client
│   └── AppSettings.swift            # Settings model persisted to UserDefaults
├── Audio/
│   └── AudioCapture.swift           # AVAudioEngine mic tap, PCM conversion, chunk callback
├── Network/
│   ├── VoxtralWebSocketClient.swift # WebSocket connect/send/receive, session management
│   └── VoxtralMessages.swift        # Codable structs for all WS JSON messages
├── Views/
│   ├── TranscriptionWindow.swift    # NSPanel-based floating text window
│   ├── SettingsView.swift           # API key, shortcut, mode, latency
│   └── MenuBarView.swift            # Menu bar dropdown content
├── Utilities/
│   ├── CursorInjector.swift         # CGEvent keystroke simulation
│   ├── GlobalShortcut.swift         # NSEvent global monitor, key combo storage
│   └── TranscriptionLogger.swift    # Append sessions to log file
└── Info.plist                       # Mic + accessibility usage descriptions
```

---

### Task 1: Swift Package Scaffold

**Files:**
- Create: `MyVoxtral/Package.swift` (Swift Package Manager executable target; the plan builds with `swift build`)
- Create: `MyVoxtral/MyVoxtral/MyVoxtralApp.swift`
- Create: `MyVoxtral/MyVoxtral/Info.plist`
- [ ] **Step 1: Create the Swift package**

Create a new macOS app using the SwiftUI lifecycle.

```bash
mkdir -p MyVoxtral/MyVoxtral
```

Create `MyVoxtral/MyVoxtral/MyVoxtralApp.swift`:

```swift
import SwiftUI

@main
struct MyVoxtralApp: App {
    var body: some Scene {
        MenuBarExtra("MyVoxtral", systemImage: "mic.fill") {
            Text("MyVoxtral")
            Divider()
            Button("Quit") {
                NSApplication.shared.terminate(nil)
            }
        }
    }
}
```

- [ ] **Step 2: Create Info.plist with privacy descriptions**

Create `MyVoxtral/MyVoxtral/Info.plist`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>NSMicrophoneUsageDescription</key>
    <string>MyVoxtral needs microphone access to transcribe your speech in real time.</string>
    <key>LSUIElement</key>
    <true/>
</dict>
</plist>
```

`LSUIElement` = true makes it a menu-bar-only app (no Dock icon). Note that `swift build` produces a bare executable; these keys only take effect once the app is packaged as a `.app` bundle (for example via Xcode or a manual bundling step).

- [ ] **Step 3: Create the Swift package manifest**

```bash
cd MyVoxtral
cat > Package.swift << 'SWIFT'
// swift-tools-version: 5.10
import PackageDescription

let package = Package(
    name: "MyVoxtral",
    platforms: [.macOS(.v14)],
    targets: [
        .executableTarget(
            name: "MyVoxtral",
            path: "MyVoxtral",
            // Info.plist is not Swift source; exclude it so SwiftPM
            // does not complain about an unhandled file.
            exclude: ["Info.plist"]
        )
    ]
)
SWIFT
```
- [ ] **Step 4: Build and verify the empty shell runs**

```bash
swift build
```

Expected: Build succeeds. Running shows a mic icon in the menu bar with "MyVoxtral" and "Quit".

- [ ] **Step 5: Commit**

```bash
git init
git add -A
git commit -m "chore: scaffold MyVoxtral macOS menu bar app"
```
---

### Task 2: WebSocket Message Types

**Files:**
- Create: `MyVoxtral/MyVoxtral/Network/VoxtralMessages.swift`

- [ ] **Step 1: Define all outbound and inbound message types**

Create `MyVoxtral/MyVoxtral/Network/VoxtralMessages.swift`:
```swift
import Foundation

// MARK: - Outbound Messages (Client → Server)

struct AudioAppendMessage: Encodable {
    let type = "input_audio.append"
    let audio: String // base64-encoded PCM
}

struct AudioFlushMessage: Encodable {
    let type = "input_audio.flush"
}

struct AudioEndMessage: Encodable {
    let type = "input_audio.end"
}

struct SessionUpdateMessage: Encodable {
    let type = "session.update"
    let session: SessionConfig
}

struct SessionConfig: Encodable {
    let audioFormat: AudioFormatConfig
    let targetStreamingDelayMs: Int

    enum CodingKeys: String, CodingKey {
        case audioFormat = "audio_format"
        case targetStreamingDelayMs = "target_streaming_delay_ms"
    }
}

struct AudioFormatConfig: Encodable {
    let encoding = "pcm_s16le"
    let sampleRate = 16000

    enum CodingKeys: String, CodingKey {
        case encoding
        case sampleRate = "sample_rate"
    }
}

// MARK: - Inbound Messages (Server → Client)

enum VoxtralEvent {
    case sessionCreated
    case textDelta(String)
    case segment(text: String, start: Double, end: Double)
    case language(String)
    case done(text: String)
    case error(String)
    case unknown(String)
}

struct IncomingEvent: Decodable {
    let type: String
}

struct TextDeltaEvent: Decodable {
    let text: String
}

struct LanguageEvent: Decodable {
    let audioLanguage: String

    enum CodingKeys: String, CodingKey {
        case audioLanguage = "audio_language"
    }
}

struct SegmentEvent: Decodable {
    let text: String
    let start: Double
    let end: Double
}

struct DoneEvent: Decodable {
    let text: String
}

struct ErrorEvent: Decodable {
    let error: ErrorDetail?
}

struct ErrorDetail: Decodable {
    let message: ErrorMessage?
}

struct ErrorMessage: Decodable {
    let detail: String?
}

// MARK: - Event Parsing

func parseVoxtralEvent(from data: Data) -> VoxtralEvent {
    guard let envelope = try? JSONDecoder().decode(IncomingEvent.self, from: data) else {
        return .unknown(String(data: data, encoding: .utf8) ?? "")
    }

    switch envelope.type {
    case "session.created":
        return .sessionCreated
    case "transcription.text.delta":
        guard let e = try? JSONDecoder().decode(TextDeltaEvent.self, from: data) else { return .unknown("") }
        return .textDelta(e.text)
    case "transcription.segment":
        guard let e = try? JSONDecoder().decode(SegmentEvent.self, from: data) else { return .unknown("") }
        return .segment(text: e.text, start: e.start, end: e.end)
    case "transcription.language":
        guard let e = try? JSONDecoder().decode(LanguageEvent.self, from: data) else { return .unknown("") }
        return .language(e.audioLanguage)
    case "transcription.done":
        guard let e = try? JSONDecoder().decode(DoneEvent.self, from: data) else { return .unknown("") }
        return .done(text: e.text)
    case "error":
        if let e = try? JSONDecoder().decode(ErrorEvent.self, from: data) {
            return .error(e.error?.message?.detail ?? "Unknown error")
        }
        return .error("Unknown error")
    default:
        return .unknown(envelope.type)
    }
}
```
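
For orientation, the JSON these structs produce and consume looks roughly like this (field names follow the `CodingKeys` above; payload values are illustrative):

```json
{"type": "session.update",
 "session": {"audio_format": {"encoding": "pcm_s16le", "sample_rate": 16000},
             "target_streaming_delay_ms": 480}}

{"type": "input_audio.append", "audio": "<base64 PCM>"}

{"type": "transcription.text.delta", "text": "hello"}
```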

- [ ] **Step 2: Build to verify it compiles**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Network/VoxtralMessages.swift
git commit -m "feat: add Voxtral WebSocket message types and parser"
```
---

### Task 3: WebSocket Client

**Files:**
- Create: `MyVoxtral/MyVoxtral/Network/VoxtralWebSocketClient.swift`

- [ ] **Step 1: Implement the WebSocket client**

Create `MyVoxtral/MyVoxtral/Network/VoxtralWebSocketClient.swift`:
```swift
import Foundation

@MainActor
final class VoxtralWebSocketClient {
    private var webSocketTask: URLSessionWebSocketTask?
    private var session: URLSession?
    private let encoder = JSONEncoder()

    var onEvent: ((VoxtralEvent) -> Void)?

    func connect(apiKey: String, delayMs: Int) {
        guard let url = URL(string: "wss://api.mistral.ai/v1/audio/transcriptions/realtime") else { return }

        var request = URLRequest(url: url)
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

        session = URLSession(configuration: .default)
        webSocketTask = session?.webSocketTask(with: request)
        webSocketTask?.resume()

        // Send session config
        let config = SessionUpdateMessage(
            session: SessionConfig(
                audioFormat: AudioFormatConfig(),
                targetStreamingDelayMs: delayMs
            )
        )
        sendJSON(config)

        // Start receiving
        receiveLoop()
    }

    func sendAudio(_ pcmData: Data) {
        let base64 = pcmData.base64EncodedString()
        let msg = AudioAppendMessage(audio: base64)
        sendJSON(msg)
    }

    func flush() {
        sendJSON(AudioFlushMessage())
    }

    func disconnect() {
        sendJSON(AudioEndMessage())
        webSocketTask?.cancel(with: .normalClosure, reason: nil)
        webSocketTask = nil
        session?.invalidateAndCancel()
        session = nil
    }

    private func sendJSON<T: Encodable>(_ value: T) {
        guard let data = try? encoder.encode(value),
              let string = String(data: data, encoding: .utf8) else { return }
        webSocketTask?.send(.string(string)) { error in
            if let error {
                print("WebSocket send error: \(error)")
            }
        }
    }

    private func receiveLoop() {
        webSocketTask?.receive { [weak self] result in
            // The completion handler runs on URLSession's delegate queue;
            // hop back onto the main actor before touching state or onEvent.
            Task { @MainActor [weak self] in
                guard let self else { return }
                switch result {
                case .success(let message):
                    switch message {
                    case .string(let text):
                        if let data = text.data(using: .utf8) {
                            self.onEvent?(parseVoxtralEvent(from: data))
                        }
                    case .data(let data):
                        self.onEvent?(parseVoxtralEvent(from: data))
                    @unknown default:
                        break
                    }
                    self.receiveLoop()
                case .failure(let error):
                    self.onEvent?(.error("Connection lost: \(error.localizedDescription)"))
                }
            }
        }
    }
}
```
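
`sendAudio` base64-encodes every chunk before sending, which inflates the payload by a factor of 4/3. For the 15360-byte chunks AudioCapture emits:

```shell
# 15360 raw bytes -> 20480 base64 characters (15360 is divisible by 3, so no padding)
head -c 15360 /dev/zero | base64 | tr -d '\n' | wc -c
```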

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Network/VoxtralWebSocketClient.swift
git commit -m "feat: add Voxtral WebSocket client with connect/send/receive"
```
---

### Task 4: Audio Capture

**Files:**
- Create: `MyVoxtral/MyVoxtral/Audio/AudioCapture.swift`

- [ ] **Step 1: Implement AVAudioEngine mic capture**

Create `MyVoxtral/MyVoxtral/Audio/AudioCapture.swift`:
```swift
import AVFoundation

final class AudioCapture {
    private let engine = AVAudioEngine()
    private let targetSampleRate: Double = 16000
    private let chunkDurationMs: Double = 480

    var onChunk: ((Data) -> Void)?

    private var buffer = Data()
    private let bytesPerChunk: Int

    init() {
        // 16 kHz * 2 bytes (16-bit) * 1 channel * 0.48 s = 15360 bytes
        bytesPerChunk = Int(targetSampleRate * 2 * chunkDurationMs / 1000)
    }

    func start() throws {
        let inputNode = engine.inputNode
        let inputFormat = inputNode.outputFormat(forBus: 0)

        let targetFormat = AVAudioFormat(
            commonFormat: .pcmFormatInt16,
            sampleRate: targetSampleRate,
            channels: 1,
            interleaved: true
        )!

        guard let converter = AVAudioConverter(from: inputFormat, to: targetFormat) else {
            throw AudioCaptureError.converterCreationFailed
        }

        let bufferSize = AVAudioFrameCount(inputFormat.sampleRate * chunkDurationMs / 1000)

        inputNode.installTap(onBus: 0, bufferSize: bufferSize, format: inputFormat) { [weak self] pcmBuffer, _ in
            guard let self else { return }

            let frameCount = AVAudioFrameCount(
                Double(pcmBuffer.frameLength) * self.targetSampleRate / inputFormat.sampleRate
            )
            guard let convertedBuffer = AVAudioPCMBuffer(
                pcmFormat: targetFormat,
                frameCapacity: frameCount
            ) else { return }

            var error: NSError?
            // The converter may invoke the input block more than once per
            // convert call; hand it the tap buffer only once, or the same
            // audio gets duplicated in the output.
            var consumed = false
            let status = converter.convert(to: convertedBuffer, error: &error) { _, outStatus in
                if consumed {
                    outStatus.pointee = .noDataNow
                    return nil
                }
                consumed = true
                outStatus.pointee = .haveData
                return pcmBuffer
            }

            guard status != .error, error == nil else { return }

            let byteCount = Int(convertedBuffer.frameLength) * 2 // 16-bit = 2 bytes
            guard let int16Ptr = convertedBuffer.int16ChannelData?[0] else { return }
            let data = Data(bytes: int16Ptr, count: byteCount)

            self.buffer.append(data)

            while self.buffer.count >= self.bytesPerChunk {
                let chunk = self.buffer.prefix(self.bytesPerChunk)
                self.buffer = Data(self.buffer.dropFirst(self.bytesPerChunk))
                self.onChunk?(Data(chunk))
            }
        }

        engine.prepare()
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()

        // Flush remaining buffer
        if !buffer.isEmpty {
            onChunk?(buffer)
            buffer = Data()
        }
    }
}

enum AudioCaptureError: Error, LocalizedError {
    case converterCreationFailed

    var errorDescription: String? {
        switch self {
        case .converterCreationFailed:
            return "Failed to create audio format converter"
        }
    }
}
```
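
The `bytesPerChunk` arithmetic in `init()` can be sanity-checked in the shell:

```shell
# sample_rate (Hz) * bytes_per_sample * chunk_ms / 1000
echo $((16000 * 2 * 480 / 1000))   # 15360
```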

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Audio/AudioCapture.swift
git commit -m "feat: add AudioCapture with AVAudioEngine mic tap and PCM conversion"
```
---

### Task 5: App Settings

**Files:**
- Create: `MyVoxtral/MyVoxtral/Models/AppSettings.swift`

- [ ] **Step 1: Implement settings model**

Create `MyVoxtral/MyVoxtral/Models/AppSettings.swift`:
```swift
import SwiftUI

enum OutputMode: String, CaseIterable {
    case textBox = "Text Window"
    case cursorInjection = "Type at Cursor"
}

final class AppSettings: ObservableObject {
    static let shared = AppSettings()
    private let defaults = UserDefaults.standard

    // @AppStorage does not fire objectWillChange from inside an
    // ObservableObject (and does not support UInt16/UInt at all), so
    // persist manually via didSet instead.
    @Published var apiKey: String { didSet { defaults.set(apiKey, forKey: "apiKey") } }
    @Published var outputMode: OutputMode { didSet { defaults.set(outputMode.rawValue, forKey: "outputMode") } }
    @Published var streamingDelayMs: Int { didSet { defaults.set(streamingDelayMs, forKey: "streamingDelayMs") } }
    @Published var shortcutKeyCode: Int { didSet { defaults.set(shortcutKeyCode, forKey: "shortcutKeyCode") } }
    @Published var shortcutModifiers: Int { didSet { defaults.set(shortcutModifiers, forKey: "shortcutModifiers") } }

    private init() {
        apiKey = defaults.string(forKey: "apiKey") ?? ""
        outputMode = OutputMode(rawValue: defaults.string(forKey: "outputMode") ?? "") ?? .textBox
        streamingDelayMs = defaults.object(forKey: "streamingDelayMs") as? Int ?? 480
        shortcutKeyCode = defaults.integer(forKey: "shortcutKeyCode")
        shortcutModifiers = defaults.integer(forKey: "shortcutModifiers")
    }

    var hasAPIKey: Bool { !apiKey.isEmpty }

    var hasShortcut: Bool { shortcutKeyCode != 0 || shortcutModifiers != 0 }

    var shortcutDisplayString: String {
        guard hasShortcut else { return "Not Set" }
        var parts: [String] = []
        let mods = NSEvent.ModifierFlags(rawValue: UInt(shortcutModifiers))
        if mods.contains(.control) { parts.append("^") }
        if mods.contains(.option) { parts.append("\u{2325}") }
        if mods.contains(.shift) { parts.append("\u{21E7}") }
        if mods.contains(.command) { parts.append("\u{2318}") }
        // A full virtual-key-code-to-glyph mapping needs UCKeyTranslate;
        // showing the raw code is good enough for a first pass.
        parts.append("Key \(shortcutKeyCode)")
        return parts.joined(separator: " ")
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Models/AppSettings.swift
git commit -m "feat: add AppSettings with API key, shortcut, output mode, delay"
```
---

### Task 6: Transcription Logger

**Files:**
- Create: `MyVoxtral/MyVoxtral/Utilities/TranscriptionLogger.swift`

- [ ] **Step 1: Implement the log file writer**

Create `MyVoxtral/MyVoxtral/Utilities/TranscriptionLogger.swift`:
```swift
import Foundation

struct TranscriptionLogger {
    private static var logFileURL: URL {
        let appSupport = FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask).first!
        let dir = appSupport.appendingPathComponent("MyVoxtral", isDirectory: true)
        try? FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
        return dir.appendingPathComponent("transcription.log")
    }

    static func append(text: String) {
        let timestamp = ISO8601DateFormatter().string(from: Date())
        let entry = "[\(timestamp)]\n\(text)\n---\n\n"

        guard let data = entry.data(using: .utf8) else { return }

        if FileManager.default.fileExists(atPath: logFileURL.path) {
            if let handle = try? FileHandle(forWritingTo: logFileURL) {
                handle.seekToEndOfFile()
                handle.write(data)
                handle.closeFile()
            }
        } else {
            try? data.write(to: logFileURL)
        }
    }
}
```
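
The entry format `append(text:)` writes can be previewed from the shell (the timestamp here is a placeholder):

```shell
printf '[%s]\n%s\n---\n\n' "2025-01-01T00:00:00Z" "transcribed text here"
```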

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Utilities/TranscriptionLogger.swift
git commit -m "feat: add TranscriptionLogger for append-only session log"
```
---

### Task 7: Cursor Injector

**Files:**
- Create: `MyVoxtral/MyVoxtral/Utilities/CursorInjector.swift`

- [ ] **Step 1: Implement CGEvent keystroke simulation**

Create `MyVoxtral/MyVoxtral/Utilities/CursorInjector.swift`:
```swift
import ApplicationServices

struct CursorInjector {
    static var isAccessibilityGranted: Bool {
        AXIsProcessTrusted()
    }

    static func promptAccessibilityPermission() {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: true] as CFDictionary
        AXIsProcessTrustedWithOptions(options)
    }

    static func typeText(_ text: String) {
        guard isAccessibilityGranted else { return }

        let source = CGEventSource(stateID: .hidSystemState)

        for character in text {
            let string = String(character)
            let event = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: true)
            event?.keyboardSetUnicodeString(stringLength: string.utf16.count, unicodeString: Array(string.utf16))
            event?.post(tap: .cghidEventTap)

            let eventUp = CGEvent(keyboardEventSource: source, virtualKey: 0, keyDown: false)
            eventUp?.keyboardSetUnicodeString(stringLength: string.utf16.count, unicodeString: Array(string.utf16))
            eventUp?.post(tap: .cghidEventTap)
        }
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Utilities/CursorInjector.swift
git commit -m "feat: add CursorInjector with CGEvent keystroke simulation"
```
---

### Task 8: Global Shortcut

**Files:**
- Create: `MyVoxtral/MyVoxtral/Utilities/GlobalShortcut.swift`

- [ ] **Step 1: Implement global keyboard shortcut monitor**

Create `MyVoxtral/MyVoxtral/Utilities/GlobalShortcut.swift`:
```swift
import Cocoa

/// Global key-down monitors only deliver events once the user has granted
/// the app Accessibility permission in System Settings.
final class GlobalShortcut {
    private var monitor: Any?
    var onTrigger: (() -> Void)?

    func register(keyCode: UInt16, modifiers: UInt) {
        unregister()
        guard keyCode != 0 || modifiers != 0 else { return }

        let requiredFlags = NSEvent.ModifierFlags(rawValue: modifiers)

        monitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { [weak self] event in
            let mask: NSEvent.ModifierFlags = [.command, .option, .control, .shift]
            if event.keyCode == keyCode && event.modifierFlags.intersection(mask) == requiredFlags {
                self?.onTrigger?()
            }
        }
    }

    func unregister() {
        if let monitor {
            NSEvent.removeMonitor(monitor)
        }
        monitor = nil
    }

    deinit {
        unregister()
    }
}
```
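
The stored `modifiers` value is an `NSEvent.ModifierFlags` raw value; AppKit places shift at bit 17, control at bit 18, option at bit 19, and command at bit 20, so for example ⌥⌘ comes out as:

```shell
# option (1<<19) | command (1<<20)
echo $(( (1 << 19) | (1 << 20) ))   # 1572864
```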

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Utilities/GlobalShortcut.swift
git commit -m "feat: add configurable global keyboard shortcut"
```
---

### Task 9: Transcription Manager

**Files:**
- Create: `MyVoxtral/MyVoxtral/Models/TranscriptionManager.swift`

- [ ] **Step 1: Implement the orchestrator**

Create `MyVoxtral/MyVoxtral/Models/TranscriptionManager.swift`:
```swift
import SwiftUI

enum RecordingState: Equatable {
    case idle
    case recording
    case error(String)
}

@MainActor
final class TranscriptionManager: ObservableObject {
    @Published var state: RecordingState = .idle
    @Published var currentText: String = ""

    private let audioCapture = AudioCapture()
    private let wsClient = VoxtralWebSocketClient()
    private let settings = AppSettings.shared
    private var hasRetried = false

    var isRecording: Bool { state == .recording }

    func toggle() {
        if isRecording {
            stop()
        } else {
            start()
        }
    }

    func start() {
        guard settings.hasAPIKey else {
            state = .error("No API key set. Open Settings.")
            return
        }

        currentText = ""
        hasRetried = false

        wsClient.onEvent = { [weak self] event in
            self?.handleEvent(event)
        }

        wsClient.connect(apiKey: settings.apiKey, delayMs: settings.streamingDelayMs)

        audioCapture.onChunk = { [weak self] chunk in
            Task { @MainActor in
                self?.wsClient.sendAudio(chunk)
            }
        }

        do {
            try audioCapture.start()
            state = .recording
        } catch {
            state = .error("Mic error: \(error.localizedDescription)")
        }
    }

    func stop() {
        audioCapture.stop()
        wsClient.flush()
        // Closing immediately can drop trailing deltas still in flight; a
        // more robust version would wait for `transcription.done` first.
        wsClient.disconnect()
        state = .idle

        if !currentText.isEmpty {
            TranscriptionLogger.append(text: currentText)
        }
    }

    private func handleEvent(_ event: VoxtralEvent) {
        switch event {
        case .sessionCreated:
            break
        case .textDelta(let text):
            currentText += text
            if settings.outputMode == .cursorInjection {
                CursorInjector.typeText(text)
            }
        case .segment:
            break // text deltas already handle text accumulation
        case .language:
            break
        case .done(let text):
            if currentText.isEmpty {
                currentText = text
            }
        case .error(let message):
            if !hasRetried && state == .recording {
                hasRetried = true
                wsClient.disconnect()
                wsClient.connect(apiKey: settings.apiKey, delayMs: settings.streamingDelayMs)
            } else {
                state = .error(message)
                audioCapture.stop()
            }
        case .unknown:
            break
        }
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Models/TranscriptionManager.swift
git commit -m "feat: add TranscriptionManager orchestrating audio, WS, and output"
```
---

### Task 10: Floating Transcription Window

**Files:**
- Create: `MyVoxtral/MyVoxtral/Views/TranscriptionWindow.swift`

- [ ] **Step 1: Implement NSPanel-based floating window**

Create `MyVoxtral/MyVoxtral/Views/TranscriptionWindow.swift`:
```swift
import SwiftUI
import AppKit

struct TranscriptionContentView: View {
    @ObservedObject var manager: TranscriptionManager

    var body: some View {
        VStack(alignment: .leading, spacing: 8) {
            HStack {
                Circle()
                    .fill(manager.isRecording ? .red : .gray)
                    .frame(width: 8, height: 8)
                Text(manager.isRecording ? "Recording..." : "Idle")
                    .font(.caption)
                    .foregroundStyle(.secondary)
                Spacer()
                Button {
                    NSPasteboard.general.clearContents()
                    NSPasteboard.general.setString(manager.currentText, forType: .string)
                } label: {
                    Image(systemName: "doc.on.doc")
                }
                .buttonStyle(.borderless)
                .disabled(manager.currentText.isEmpty)
            }

            ScrollViewReader { proxy in
                ScrollView {
                    Text(manager.currentText.isEmpty ? "Transcription will appear here..." : manager.currentText)
                        .frame(maxWidth: .infinity, alignment: .leading)
                        .foregroundStyle(manager.currentText.isEmpty ? .secondary : .primary)
                        .textSelection(.enabled)
                        .id("bottom")
                }
                .onChange(of: manager.currentText) {
                    proxy.scrollTo("bottom", anchor: .bottom)
                }
            }
        }
        .padding()
        .frame(width: 320, height: 200)
    }
}

final class TranscriptionPanel {
    private var panel: NSPanel?
    private let manager: TranscriptionManager

    init(manager: TranscriptionManager) {
        self.manager = manager
    }

    func show() {
        if panel == nil {
            let panel = NSPanel(
                contentRect: NSRect(x: 0, y: 0, width: 320, height: 200),
                styleMask: [.titled, .closable, .resizable, .nonactivatingPanel, .utilityWindow],
                backing: .buffered,
                defer: false
            )
            panel.title = "MyVoxtral"
            panel.isFloatingPanel = true
            panel.level = .floating
            panel.contentView = NSHostingView(rootView: TranscriptionContentView(manager: manager))
            panel.center()
            self.panel = panel
        }
        panel?.orderFront(nil)
    }

    func hide() {
        panel?.orderOut(nil)
    }

    var isVisible: Bool {
        panel?.isVisible ?? false
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Views/TranscriptionWindow.swift
git commit -m "feat: add floating transcription panel with auto-scroll and copy"
```
---

### Task 11: Settings View

**Files:**
- Create: `MyVoxtral/MyVoxtral/Views/SettingsView.swift`

- [ ] **Step 1: Implement settings window**

Create `MyVoxtral/MyVoxtral/Views/SettingsView.swift`:
```swift
import SwiftUI

struct SettingsView: View {
    @ObservedObject var settings = AppSettings.shared
    @State private var isRecordingShortcut = false
    @State private var shortcutMonitor: Any?

    var body: some View {
        Form {
            Section("API") {
                SecureField("Mistral API Key", text: $settings.apiKey)
            }

            Section("Output") {
                Picker("Mode", selection: $settings.outputMode) {
                    ForEach(OutputMode.allCases, id: \.self) { mode in
                        Text(mode.rawValue).tag(mode)
                    }
                }
                .pickerStyle(.segmented)

                if settings.outputMode == .cursorInjection && !CursorInjector.isAccessibilityGranted {
                    HStack {
                        Image(systemName: "exclamationmark.triangle.fill")
                            .foregroundStyle(.yellow)
                        Text("Accessibility permission required")
                            .font(.caption)
                        Button("Grant") {
                            CursorInjector.promptAccessibilityPermission()
                        }
                        .font(.caption)
                    }
                }
            }

            Section("Shortcut") {
                HStack {
                    Text("Toggle Recording:")
                    Spacer()
                    Button(isRecordingShortcut ? "Press keys..." : settings.shortcutDisplayString) {
                        beginShortcutCapture()
                    }
                }
            }

            Section("Latency") {
                VStack(alignment: .leading) {
                    Text("Streaming delay: \(settings.streamingDelayMs)ms")
                        .font(.caption)
                    Slider(
                        value: Binding(
                            get: { Double(settings.streamingDelayMs) },
                            set: { settings.streamingDelayMs = Int($0) }
                        ),
                        in: 240...2400,
                        step: 120
                    )
                    HStack {
                        Text("Fast").font(.caption2).foregroundStyle(.secondary)
                        Spacer()
                        Text("Accurate").font(.caption2).foregroundStyle(.secondary)
                    }
                }
            }
        }
        .formStyle(.grouped)
        .frame(width: 360, height: 340)
    }

    /// Capture the next key-down with a local monitor so we get a real
    /// NSEvent key code and modifier flags. (SwiftUI's `onKeyPress` reports
    /// characters, not hardware key codes, and its `EventModifiers` raw
    /// values are not compatible with `NSEvent.ModifierFlags`.)
    private func beginShortcutCapture() {
        isRecordingShortcut = true
        shortcutMonitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { event in
            let mods = event.modifierFlags.intersection([.command, .option, .control, .shift])
            settings.shortcutKeyCode = Int(event.keyCode)
            settings.shortcutModifiers = Int(mods.rawValue)
            isRecordingShortcut = false
            if let shortcutMonitor {
                NSEvent.removeMonitor(shortcutMonitor)
            }
            shortcutMonitor = nil
            return nil // swallow the captured keystroke
        }
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Views/SettingsView.swift
git commit -m "feat: add SettingsView with API key, shortcut, mode, and latency"
```
---

### Task 12: Menu Bar View

**Files:**
- Create: `MyVoxtral/MyVoxtral/Views/MenuBarView.swift`

- [ ] **Step 1: Implement the menu bar dropdown**

Create `MyVoxtral/MyVoxtral/Views/MenuBarView.swift`:
```swift
import SwiftUI

struct MenuBarView: View {
    @ObservedObject var manager: TranscriptionManager
    @ObservedObject var settings = AppSettings.shared
    let onShowTranscription: () -> Void
    let onShowSettings: () -> Void

    var body: some View {
        VStack(spacing: 4) {
            Button(manager.isRecording ? "Stop Recording" : "Start Recording") {
                manager.toggle()
            }
            .keyboardShortcut("r")

            if case .error(let msg) = manager.state {
                Text(msg)
                    .font(.caption)
                    .foregroundStyle(.red)
                    .lineLimit(2)
                    .padding(.horizontal, 8)
            }

            Divider()

            Button("Show Transcription") {
                onShowTranscription()
            }

            Button("Settings...") {
                onShowSettings()
            }

            Divider()

            Button("Quit") {
                NSApplication.shared.terminate(nil)
            }
            .keyboardShortcut("q")
        }
        .padding(.vertical, 4)
    }
}
```

- [ ] **Step 2: Build to verify**

```bash
swift build
```

Expected: Build succeeds.

- [ ] **Step 3: Commit**

```bash
git add MyVoxtral/Views/MenuBarView.swift
git commit -m "feat: add MenuBarView with start/stop, settings, and error display"
```
---

### Task 13: Wire Everything Together in App Entry

**Files:**
- Modify: `MyVoxtral/MyVoxtral/MyVoxtralApp.swift`

- [ ] **Step 1: Update the app entry point to integrate all components**

Replace the contents of `MyVoxtral/MyVoxtral/MyVoxtralApp.swift` with:
```swift
import SwiftUI

@main
struct MyVoxtralApp: App {
    @StateObject private var manager = TranscriptionManager()
    @StateObject private var settings = AppSettings.shared
    @State private var transcriptionPanel: TranscriptionPanel?
    @State private var settingsWindow: NSWindow?

    private let globalShortcut = GlobalShortcut()

    var body: some Scene {
        MenuBarExtra {
            MenuBarView(
                manager: manager,
                onShowTranscription: { showTranscriptionWindow() },
                onShowSettings: { showSettingsWindow() }
            )
        } label: {
            // Scene types have no onAppear/onChange, so the lifecycle hooks
            // hang off the label view, which lives as long as the item does.
            // (The menu bar renders template images, so color styling would
            // be ignored; the filled/outline mic distinguishes the state.)
            Image(systemName: manager.isRecording ? "mic.fill" : "mic")
                .onAppear {
                    if !settings.hasAPIKey {
                        showSettingsWindow()
                    }
                    registerShortcut()
                }
                .onChange(of: settings.shortcutKeyCode) { registerShortcut() }
                .onChange(of: settings.shortcutModifiers) { registerShortcut() }
                .onChange(of: manager.isRecording) {
                    if manager.isRecording && settings.outputMode == .textBox {
                        showTranscriptionWindow()
                    }
                }
        }
    }

    private func registerShortcut() {
        globalShortcut.register(
            keyCode: UInt16(settings.shortcutKeyCode),
            modifiers: UInt(settings.shortcutModifiers)
        )
        globalShortcut.onTrigger = { [weak manager] in
            Task { @MainActor in
                manager?.toggle()
            }
        }
    }

    private func showTranscriptionWindow() {
        if transcriptionPanel == nil {
            transcriptionPanel = TranscriptionPanel(manager: manager)
        }
        transcriptionPanel?.show()
    }

    private func showSettingsWindow() {
        if let settingsWindow, settingsWindow.isVisible {
            settingsWindow.makeKeyAndOrderFront(nil)
            return
        }
        let window = NSWindow(
            contentRect: NSRect(x: 0, y: 0, width: 360, height: 340),
            styleMask: [.titled, .closable],
            backing: .buffered,
            defer: false
        )
        window.title = "MyVoxtral Settings"
        window.contentView = NSHostingView(rootView: SettingsView())
        window.center()
        window.makeKeyAndOrderFront(nil)
        NSApp.activate(ignoringOtherApps: true)
        self.settingsWindow = window
    }
}
```

- [ ] **Step 2: Build the full app**

```bash
swift build
```

Expected: Build succeeds with all components linked.

- [ ] **Step 3: Run and verify basic functionality**

```bash
swift run
```

Expected: Menu bar icon appears. Clicking shows the dropdown with Start/Stop, Settings, Quit. Settings window opens if no API key.

- [ ] **Step 4: Commit**

```bash
git add MyVoxtral/MyVoxtralApp.swift
git commit -m "feat: wire all components into app entry with menu bar, shortcut, and windows"
```

---

### Task 14: End-to-End Smoke Test

- [ ] **Step 1: Verify the full recording flow**

```bash
swift run
```

Manual test checklist:

1. App launches as a menu bar icon (no Dock icon)
2. Click icon → dropdown shows Start Recording, Settings, Quit
3. Open Settings → enter a valid Mistral API key
4. Set output mode to "Text Window"
5. Click "Start Recording" → mic icon switches to filled, transcription window opens
6. Speak into the microphone → text appears in the floating window
7. Click "Stop Recording" → icon returns to normal
8. Verify `~/Library/Application Support/MyVoxtral/transcription.log` contains the session
9. Switch to "Type at Cursor" mode → grant Accessibility permission if prompted
10. Open TextEdit, click Start Recording, speak → text appears at the cursor in TextEdit
11. Configure a keyboard shortcut in Settings → verify it toggles recording from any app
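
Item 8 can be checked from the terminal (prints a note when no session has been logged yet):

```shell
log="$HOME/Library/Application Support/MyVoxtral/transcription.log"
if [ -f "$log" ]; then tail -n 20 "$log"; else echo "no log yet: $log"; fi
```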

- [ ] **Step 2: Fix any issues found during smoke testing**

Address any build errors, runtime crashes, or connection issues.

- [ ] **Step 3: Final commit**

```bash
git add -A
git commit -m "test: verify end-to-end transcription flow"
```