Prodigy

A high-performance, real-time speech-to-speech system designed for low-latency telephony communication. Prodigy integrates Whisper (ASR), LLaMA (LLM), and Kokoro or NeuTTS (TTS) into a linear microservice pipeline, using a standalone SIP client as an RTP gateway. Optimized for Apple Silicon (CoreML/Metal) with no PyTorch runtime dependency.

Architecture

                         Telephony Network
                              |
                         [SIP Client]
                          /        \
                    RTP in          RTP out
                        |              ^
                       IAP            OAP
                        |              ^
                       VAD          Kokoro / NeuTTS
                        |              ^
                     Whisper -----> LLaMA

                     [Frontend] (web UI + log aggregation)

The pipeline is a linear chain of C++ programs. Every adjacent pair communicates over two persistent TCP connections (management + data). The frontend manages all services and provides a web UI at http://0.0.0.0:8080/.

Requirements

OS: macOS on Apple Silicon (M1/M2/M3/M4)
Language: C++17, Python 3.9+
Build: CMake 3.22+, Ninja (recommended)
Dependencies (installed automatically by runmetoinstalleverythingfirst):
- whisper.cpp (compiled with CoreML + Metal)
- llama.cpp (compiled with Metal)
- espeak-ng (brew install espeak-ng)
- macOS frameworks: Accelerate, Metal, CoreML, Foundation

Quick Install & Build

# Step 1: Install everything (Homebrew, Miniconda, models, CoreML exports)
./runmetoinstalleverythingfirst

# Step 2: Build all services
./runmetobuildeverything

# Step 3: Launch
cd bin && ./frontend
# Web UI: http://localhost:8080

runmetobuildeverything auto-clones whisper-cpp and llama-cpp if missing, detects Ninja for fast parallel builds, and bypasses the macOS Xcode license check using the Command Line Tools SDK directly.

Manual Build

# Build whisper.cpp (CoreML + Metal, static)
cmake -G Ninja -S whisper-cpp -B whisper-cpp/build \
  -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF \
  -DWHISPER_COREML=ON -DGGML_METAL=ON \
  -DWHISPER_BUILD_TESTS=OFF -DWHISPER_BUILD_EXAMPLES=OFF
cmake --build whisper-cpp/build -j

# Build llama.cpp (Metal, static)
cmake -G Ninja -S llama-cpp -B llama-cpp/build \
  -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF \
  -DGGML_METAL=ON
cmake --build llama-cpp/build -j

# Build Prodigy
cmake -G Ninja -S . -B build \
  -DCMAKE_BUILD_TYPE=Release -DKOKORO_COREML=ON -DBUILD_TESTS=ON
cmake --build build -j

CMake Options

Option	Default	Description
`KOKORO_COREML`	`ON`	Enable CoreML ANE acceleration for Kokoro decoder
`BUILD_TESTS`	`ON`	Build unit/integration tests (requires GoogleTest); disable with `-DBUILD_TESTS=OFF`
`ESPEAK_NG_DATA_DIR`	auto-detected	Path to espeak-ng-data directory

Models

All model files are placed in bin/models/. runmetoinstalleverythingfirst downloads and prepares all of these automatically.

Whisper (ASR)

File	Size	Purpose
`ggml-large-v3-turbo-q5_0.bin`	~547 MB	Default ASR model (best speed/accuracy balance)
`ggml-large-v3-q5_0.bin`	~1.0 GB	Higher accuracy ASR model
`ggml-large-v3-turbo-encoder.mlmodelc/`	varies	CoreML ANE encoder for large-v3-turbo
`ggml-large-v3-encoder.mlmodelc/`	varies	CoreML ANE encoder for large-v3

LLaMA (LLM)

File	Size	Purpose
`Llama-3.2-1B-Instruct-Q8_0.gguf`	~1.2 GB	Response generation (Metal-accelerated)

Kokoro (TTS)

Located in bin/models/kokoro-german/:

File	Purpose
`coreml/kokoro_duration.mlmodelc/`	Duration model (CoreML ANE)
`coreml/kokoro_f0n_{3s,5s,10s}.mlmodelc/`	F0/N predictor buckets (CoreML ANE)
`decoder_variants/*.mlmodelc/`	Split decoder models (CoreML ANE)
`<voice>_voice.bin`	Voice style embedding (256-dim float32). Available: `df_eva_voice.bin`, `dm_bernd_voice.bin`
`vocab.json`	Phoneme-to-token mapping

NeuTTS (alternative TTS)

Located in bin/models/neutts-nano-german/:

File	Size	Purpose
`neutts-nano-german-Q4_0.gguf`	~185 MB	LLaMA-based speech backbone
`neucodec_decoder.mlmodelc/`	~3.4 GB	NeuCodec CoreML decoder
`ref_codes.bin`	-	Pre-computed reference voice codec codes
`ref_text.txt`	-	Reference voice phoneme transcript

Port Map

All services bind to 127.0.0.1:

Service	Mgmt Port	Data Port	Cmd Port	Notes
SIP Client	13100	13101	13102	+ SIP UDP 5060 + RTP UDP 10000+
IAP	13110	13111	13112
VAD	13115	13116	13117
Whisper	13120	13121	13122
LLaMA	13130	13131	13132
Kokoro / NeuTTS	13140	13141	13142	Mutually exclusive
OAP	13150	13151	13152
Frontend	-	-	-	HTTP 8080, Log UDP 22022

Services

1. SIP Client (`bin/sip-client`)

RTP gateway and SIP stack. Handles SIP registration with Digest authentication, incoming/outgoing call management, and routes raw RTP audio between the telephony network and the internal pipeline.

Key behaviors:

Minimal SIP stack over raw UDP (port 5060 by default)
MD5 Digest authentication with WWW-Authenticate challenge parsing
Re-registers every 60 seconds
Multi-line: supports N simultaneous SIP registrations (--lines 0 is valid for test-only mode)
Inbound RTP forwarded to IAP as raw Packet frames (12-byte RTP header included; IAP strips it)
Outbound G.711 frames from OAP wrapped in RTP headers (seq, timestamp, SSRC) and sent via UDP
Stale call auto-hangup after 60 seconds of no RTP traffic
RTP port base: 10000, incremented by 2 per call
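
As a rough illustration of the RTP handling listed above, the following Python sketch wraps a 160-byte G.711 payload in a 12-byte RTP header and strips it again. The function names are hypothetical. Payload type 0 (PCMU) and the fixed-header-only layout (no CSRC entries, no extensions) are assumptions based on RFC 3550, not details taken from the sip-client source.

```python
import struct

RTP_HEADER_LEN = 12    # fixed RTP header without CSRC list or extensions
RTP_PORT_BASE = 10000  # RTP port base from the list above


def rtp_port_for_call(call_index: int) -> int:
    """RTP port for the Nth concurrent call: base 10000, incremented by 2."""
    return RTP_PORT_BASE + 2 * call_index


def wrap_rtp(payload: bytes, seq: int, timestamp: int, ssrc: int,
             payload_type: int = 0) -> bytes:
    """Prepend a 12-byte RTP header (version 2, no padding/extension/CSRC)."""
    first = 2 << 6                # V=2, P=0, X=0, CC=0
    second = payload_type & 0x7F  # M=0, PT (0 = PCMU is an assumption)
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def strip_rtp(packet: bytes) -> bytes:
    """Drop the fixed header, as IAP is described as doing (assumes CC=0)."""
    return packet[RTP_HEADER_LEN:]
```

For 20ms G.711 frames, the sequence number would advance by 1 and the timestamp by 160 per packet.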

Command-Line Parameters:

Argument	Default	Description
`[--lines N] [<user> <server> [port]]`	0 lines	Lines to register at startup; positional args only needed when `lines > 0`
`--log-level <LEVEL>`	`INFO`	Log verbosity: ERROR, WARN, INFO, DEBUG, TRACE

Runtime Commands (cmd port 13102):

Command	Description
`ADD_LINE <user> <server> <port> <password>`	Register a new SIP account dynamically (space-delimited; use `-` for no password)
`REMOVE_LINE <index>`	Unregister and remove a SIP line by index
`LIST_LINES`	List all registered SIP lines
`GET_STATS`	JSON RTP counters for all active calls (rx/tx packets, bytes, forwarded, discarded)
`PING`	Health check → `PONG`
`STATUS`	Registered lines, active calls, connection state
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

2. Inbound Audio Processor (`bin/inbound-audio-processor`)

Converts G.711 μ-law telephony audio (8kHz) to float32 PCM (16kHz) for the VAD service.

Signal chain:

G.711 μ-law decode: 256-entry ITU-T lookup table; each byte → float32 in [-1.0, 1.0]
8kHz→16kHz upsample: 15-tap Hamming-windowed sinc FIR half-band filter (cutoff ~3.8kHz, ~40dB stopband). Zero-stuffs input, then filters to remove spectral copies above 4kHz.

Each 160-byte RTP payload (20ms @ 8kHz) produces 320 float32 samples (20ms @ 16kHz). Continues processing and discards output if VAD is unavailable; auto-reconnects when VAD comes back online.
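
The signal chain above can be sketched in Python as follows. This is a minimal illustration, not the IAP source: the table is generated from the standard G.711 μ-law expansion formula, and the filter is a plain Hamming-windowed sinc design using the tap count and cutoff quoted above, so its coefficients may differ from the real half-band filter. All function names are hypothetical.

```python
import math


def ulaw_to_float(byte: int) -> float:
    """Expand one G.711 mu-law byte to a float in [-1.0, 1.0]."""
    u = ~byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return (-magnitude if sign else magnitude) / 32768.0


# 256-entry lookup table, built once
ULAW_TABLE = [ulaw_to_float(b) for b in range(256)]


def design_lowpass(taps: int = 15, cutoff_hz: float = 3800.0,
                   fs: float = 16000.0) -> list:
    """Hamming-windowed sinc low-pass; gain 2 compensates zero-stuffing."""
    mid = (taps - 1) / 2
    fc = cutoff_hz / fs  # cycles per sample
    coeffs = []
    for n in range(taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))
        coeffs.append(2.0 * ideal * window)
    return coeffs


FIR = design_lowpass()


def process_payload(payload: bytes, history: list) -> list:
    """Decode, zero-stuff to 16 kHz, and filter one RTP payload.

    history holds the last len(FIR) - 1 zero-stuffed samples of the previous
    payload and is updated in place so the filter state carries across calls.
    """
    stuffed = []
    for b in payload:
        stuffed.append(ULAW_TABLE[b])
        stuffed.append(0.0)
    n_hist = len(FIR) - 1
    buf = history + stuffed
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, coeff in enumerate(FIR):
            acc += coeff * buf[i + n_hist - k]
        out.append(acc)
    history[:] = buf[-n_hist:]
    return out
```

A caller would start with `history = [0.0] * (len(FIR) - 1)` per call and pass each 160-byte payload in arrival order.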

Command-Line Parameters:

Argument	Default	Description
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13112):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Active call count, upstream/downstream state, avg/max per-packet latency (μs)
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

3. VAD Service (`bin/vad-service`)

Energy-based Voice Activity Detection. Segments continuous 16kHz PCM into speech chunks (0.5–8 seconds) for Whisper.

Algorithm:

Adaptive noise floor: EMA update (alpha=0.05) during silence frames; time constant ~1 second
Onset detection: requires 3 consecutive frames above threshold × noise_floor to confirm speech start
End detection: 400ms of consecutive sub-threshold frames triggers speech-end
Micro-pause detection: short pauses (~400ms) between words trigger early submission rather than waiting for full silence — reduces Whisper inference latency since inference time scales with chunk length
Smart-split: when max chunk length is reached during speech, finds the lowest-energy frame near the boundary to avoid cutting mid-word
Pre-speech context: 400ms (8 frames × 50ms) before confirmed onset is prepended to each chunk
RMS energy gate: chunks with RMS < 0.005 discarded as near-silence
SPEECH_ACTIVE/SPEECH_IDLE signals: broadcast downstream to LLaMA, Kokoro, and OAP so that generation and TTS playback can be interrupted
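
The core of the algorithm (adaptive noise floor, 3-frame onset, 400ms end detection) can be sketched in Python as below. This is a simplified illustration under stated assumptions: the class name, the initial noise floor, and the 50ms frame length (inferred from "8 frames × 50ms") are not taken from the vad-service source, and micro-pause detection, smart-split, and pre-speech context are left out.

```python
import math

FRAME_MS = 50  # frame length implied by "8 frames x 50ms" above


def rms(frame) -> float:
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


class EnergyVad:
    def __init__(self, threshold=2.0, silence_ms=400, alpha=0.05,
                 onset_frames=3, initial_floor=0.001):
        self.threshold = threshold
        self.silence_frames = silence_ms // FRAME_MS
        self.alpha = alpha                 # 0.05 per 50 ms frame ~ 1 s time constant
        self.onset_frames = onset_frames
        self.noise_floor = initial_floor   # starting value is an assumption
        self.in_speech = False
        self.loud_run = 0
        self.quiet_run = 0

    def process(self, frame):
        """Feed one frame; returns 'SPEECH_ACTIVE', 'SPEECH_IDLE', or None."""
        level = rms(frame)
        loud = level > self.threshold * self.noise_floor
        if not self.in_speech:
            if loud:
                self.loud_run += 1
                if self.loud_run >= self.onset_frames:
                    self.in_speech = True
                    self.quiet_run = 0
                    return "SPEECH_ACTIVE"
            else:
                self.loud_run = 0
                # EMA noise-floor update only during silence frames
                self.noise_floor += self.alpha * (level - self.noise_floor)
            return None
        if loud:
            self.quiet_run = 0
        else:
            self.quiet_run += 1
            if self.quiet_run >= self.silence_frames:
                self.in_speech = False
                self.loud_run = 0
                return "SPEECH_IDLE"
        return None
```

With the defaults, three consecutive loud frames (150ms) confirm onset and eight consecutive quiet frames (400ms) end the segment.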

Command-Line Parameters:

Argument	Default	Description
`--vad-threshold <mult>`	`2.0`	Energy threshold multiplier over noise floor
`--vad-silence-ms <ms>`	`400`	Silence duration to end speech segment
`--vad-max-chunk-ms <ms>`	`8000`	Maximum speech chunk duration
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13117):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Noise floor, threshold, silence_ms, max_chunk_ms, active calls, upstream/downstream state
`SET_VAD_THRESHOLD:<mult>`	Update threshold multiplier at runtime
`SET_VAD_SILENCE_MS:<ms>`	Update silence detection duration at runtime
`SET_VAD_MAX_CHUNK_MS:<ms>`	Update max chunk length at runtime
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

4. Whisper Service (`bin/whisper-service`)

Automatic Speech Recognition (ASR). Receives pre-segmented speech chunks from VAD and returns transcribed text to LLaMA.

Inference details:

Backend: whisper.cpp with CoreML ANE (Apple Neural Engine) + Metal fallback
Decoding: Greedy strategy (not beam search). On 2–8s segments, greedy is 3–5× faster than beam_size=5 with negligible accuracy difference. Temperature fallback with temp_inc=0.2 handles uncertain segments.
Telephony-optimized parameters: no_speech_thold=0.9 (prevents early decoder stop on G.711-degraded audio), entropy_thold=2.8 (tolerant of codec uncertainty)
No audio normalization: audio passed directly to Whisper (matches whisper-cli defaults for optimal accuracy on G.711 input)
RMS energy pre-check: rejects chunks with RMS < 0.005 to prevent hallucinations on near-silence
Packet buffering: if LLaMA is disconnected, buffers up to 64 transcription packets and drains them on reconnect
Hallucination filter (default OFF, runtime-toggleable): exact-match detection of common Whisper hallucination strings (e.g., "Untertitel", "Copyright", "Musik"); repetition detection; trailing suffix stripping

Command-Line Parameters:

Argument	Default	Description
`--language <lang>` / `-l <lang>`	`de`	Whisper language code
`--model <path>` / `-m <path>`	`models/ggml-large-v3-turbo-q5_0.bin`	Path to GGML model file
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13122):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Model name, upstream/downstream state, hallucination filter state
`HALLUCINATION_FILTER:ON` / `OFF`	Enable/disable hallucination filter
`HALLUCINATION_FILTER:STATUS`	Query filter state
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

5. LLaMA Service (`bin/llama-service`)

Generates a spoken German reply from transcribed text using Llama-3.2-1B-Instruct.

Inference details:

Model: Llama-3.2-1B-Instruct Q8_0 GGUF, all layers on Metal GPU (n_gpu_layers=-1)
Template: llama_chat_apply_template() — uses the model's built-in chat template for correct role tagging; no manual prompt formatting
Sampling: Greedy (llama_sampler_init_greedy). Max 64 tokens per response. Stops at sentence-ending punctuation (., ?, !) or EOS.
Context: 2048 tokens, 4 threads
German system prompt: enforces always-German, max 1 sentence / 15 words, polite and natural tone. ~320ms average latency on Apple M-series.
Session isolation: each call gets its own LlamaCall struct with independent message history and KV cache sequence ID. Context cleared on CALL_END.
Shut-up mechanism: SPEECH_ACTIVE from VAD aborts active generation immediately (~5–13ms interrupt latency). Worker loop defers new responses while speech is active.
Tokenizer resilience: retries with progressively larger buffer (up to 4×) if llama_tokenize() returns a negative value
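
The stop conditions described above (sentence-ending punctuation, EOS, 64-token cap, SPEECH_ACTIVE abort) can be illustrated with a small Python sketch. It operates on already-detokenized strings and does not call llama.cpp. The function name, the use of `None` as an EOS marker, and the `is_speech_active` callback are hypothetical.

```python
SENTENCE_END = (".", "?", "!")
MAX_TOKENS = 64


def collect_reply(token_pieces, is_speech_active=lambda: False):
    """Concatenate token pieces until a stop condition is hit.

    token_pieces: iterable of decoded token strings; None marks EOS.
    is_speech_active: polled before each token; True aborts generation.
    Returns the reply text, or None if generation was aborted.
    """
    reply = ""
    for count, piece in enumerate(token_pieces, start=1):
        if is_speech_active():
            return None  # caller started speaking: drop the partial reply
        if piece is None:
            break        # EOS
        reply += piece
        if reply.rstrip().endswith(SENTENCE_END) or count >= MAX_TOKENS:
            break
    return reply.strip()
```

For example, `collect_reply(["Guten", " Tag", "!", " Wie"])` stops after the exclamation mark. A punctuation-based stop of this kind also triggers on abbreviations that end in a period, which is one reason a one-sentence system prompt pairs well with it.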

Command-Line Parameters:

Argument	Default	Description
`--model <path>` / `-m <path>`	`models/Llama-3.2-1B-Instruct-Q8_0.gguf`	Path to GGUF model
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13132):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Model name, active calls, upstream/downstream state, speech active flag
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

6. Kokoro TTS Service (`bin/kokoro-service`)

Text-to-speech using the Kokoro model. Receives response text from LLaMA and streams 24kHz float32 PCM audio to OAP. No PyTorch dependency — all inference via CoreML on Apple Neural Engine.

Phonemization pipeline:

espeak-ng (via libespeak-ng) converts input text → IPA phoneme string. Language auto-detected (de / en-us) via detect_german().
Phoneme cache (LRU, 10,000 entries): avoids re-running espeak-ng for repeated phrases.
KokoroVocab: greedy longest-match scan (up to 4 chars per token, UTF-8 aware) maps phonemes → int64 token IDs from vocab.json. Input padded to 512 tokens.
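
A greedy longest-match scan of the kind described for KokoroVocab might look like the Python sketch below. The function name, the padding ID of 0, and the choice to skip unknown symbols are assumptions. Python strings index by code point, which stands in for the UTF-8 awareness of the C++ implementation.

```python
MAX_TOKEN_CHARS = 4  # longest token considered, in characters
PAD_LENGTH = 512     # fixed model input length
PAD_ID = 0           # assumed padding token ID


def tokenize_phonemes(phonemes: str, vocab: dict) -> list:
    """Map an IPA string to token IDs using greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(phonemes):
        longest = min(MAX_TOKEN_CHARS, len(phonemes) - pos)
        for width in range(longest, 0, -1):
            token_id = vocab.get(phonemes[pos:pos + width])
            if token_id is not None:
                ids.append(token_id)
                pos += width
                break
        else:
            pos += 1  # no match at any width: skip the symbol (assumed behavior)
    ids = ids[:PAD_LENGTH]
    return ids + [PAD_ID] * (PAD_LENGTH - len(ids))
```

With a toy vocabulary `{"a": 1, "ɪ": 2, "aɪ": 3, "t": 4}`, the input `"taɪ"` yields `[4, 3]` followed by padding, because the two-character token `"aɪ"` wins over the single `"a"`.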

Two-stage CoreML inference:

Stage 1 — Duration model (kokoro_duration.mlmodelc): predicts per-phoneme durations, generates alignment tensors (pred_dur, d, t_en, s, ref_s). Style encoding from <voice>_voice.bin (256-dim reference embedding).
Stage 1b — F0/N predictor (kokoro_f0n_{3s,5s,10s}.mlmodelc): three bucketed models (3s/5s/10s) predict fundamental frequency (f0_pred) and voicing (n_pred) from the duration model's d and s outputs. Bucket selected by utterance length. These condition the harmonic/noise excitation signal — without them, speech sounds hoarse/unvoiced.
Stage 2 — Decoder (decoder_variants/*.mlmodelc): split decoder generates the audio waveform from alignment tensors + F0/N conditioning. All models run with MLComputeUnitsAll (ANE + GPU + CPU).

Audio output processing:

normalize_audio(): scales to 0.90 peak ceiling (skips near-silent audio and already-normalized output)
apply_fade_in(): 48-sample linear ramp at onset to prevent click artifacts
Sends audio to OAP in 4800-sample chunks (200ms @ 24kHz) for smooth buffer filling
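
The three output steps can be sketched in Python as follows. The function names `normalize_audio` and `apply_fade_in` come from the list above, but their bodies here are illustrative. The near-silence threshold, the "already normalized" tolerance, and the `chunked` helper are assumptions.

```python
PEAK_CEILING = 0.90
FADE_SAMPLES = 48
CHUNK_SAMPLES = 4800  # 200 ms at 24 kHz
SILENCE_PEAK = 1e-4   # assumed threshold for "near-silent"
TOLERANCE = 0.01      # assumed tolerance for "already normalized"


def normalize_audio(samples):
    """Scale so the peak sits at 0.90; leave silent or normalized audio alone."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < SILENCE_PEAK or abs(peak - PEAK_CEILING) < TOLERANCE:
        return list(samples)
    gain = PEAK_CEILING / peak
    return [s * gain for s in samples]


def apply_fade_in(samples, length=FADE_SAMPLES):
    """Linear ramp from 0 to 1 over the first `length` samples."""
    out = list(samples)
    for i in range(min(length, len(out))):
        out[i] *= i / length
    return out


def chunked(samples, size=CHUNK_SAMPLES):
    """Split audio into fixed-size chunks; the last one may be shorter."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

A 10,000-sample utterance would be sent as chunks of 4800, 4800, and 400 samples.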

SPEECH_ACTIVE handling: Abandons synthesis immediately if VAD signals caller speech. Synthesis runs in per-call threads, so multi-line calls synthesize in parallel.

Command-Line Parameters:

Argument	Default	Description
`--voice <NAME>`	`df_eva`	Voice to use (`df_eva`, `dm_bernd`)
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13142):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Active calls, upstream/downstream state, current speed
`SET_SPEED:<0.5–2.0>`	Set synthesis speed (1.0 = normal, clamped to [0.5, 2.0])
`GET_SPEED`	Query current speed
`TEST_SYNTH:<text>`	Synthesize text and return timing/peak/RMS stats (no audio output)
`BENCHMARK:<text>\|<N>`	Run N synthesis iterations; returns avg/p50/p95 latency and RTF
`SYNTH_WAV:<path>\|<text>`	Synthesize text and save to WAV file at `<path>` (relative paths only)
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

7. NeuTTS Service (`bin/neutts-service`) — Alternative TTS

Alternative TTS engine using the NeuTTS Nano German model. Occupies the same pipeline slot as Kokoro (ports 13140–13142) — only one TTS service can run at a time.

Inference pipeline:

espeak-ng converts input text → IPA phonemes (language de, with stress markers)
Builds a NeuTTS prompt: user: Convert the text to speech:<|TEXT_PROMPT_START|>{ref_phones} {phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{ref_codes}
Tokenize and feed to NeuTTS backbone (llama.cpp, Q4_0 GGUF) with temperature=1.0, top_k=50 autoregressive sampling
Extract <|speech_N|> tokens as integer codec codes
Stop at <|SPEECH_GENERATION_END|> or EOS
Decode codes through NeuCodec CoreML decoder → 24kHz float32 PCM
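
The prompt construction and code extraction steps can be illustrated in Python. The template string is taken from the list above. The function names are hypothetical, and passing `ref_codes` as an already-serialized string of `<|speech_N|>` tokens is an assumption about how the reference codes are rendered.

```python
import re

SPEECH_TOKEN = re.compile(r"<\|speech_(\d+)\|>")
END_MARKER = "<|SPEECH_GENERATION_END|>"


def build_prompt(ref_phones: str, phones: str, ref_codes: str) -> str:
    """Assemble the NeuTTS prompt from reference and target phonemes."""
    return ("user: Convert the text to speech:"
            f"<|TEXT_PROMPT_START|>{ref_phones} {phones}<|TEXT_PROMPT_END|>"
            f"\nassistant:<|SPEECH_GENERATION_START|>{ref_codes}")


def extract_codes(generated: str) -> list:
    """Collect integer codec codes up to the end marker, if present."""
    body = generated.split(END_MARKER, 1)[0]
    return [int(n) for n in SPEECH_TOKEN.findall(body)]
```

For example, `extract_codes("<|speech_12|><|speech_7|><|SPEECH_GENERATION_END|>")` returns `[12, 7]`; these integers are what the NeuCodec decoder would consume.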

Reference voice: Pre-computed codec codes (ref_codes.bin) and phonemized text (ref_text.txt) loaded at startup to define voice timbre and style.

Audio post-processing: Same as Kokoro — normalize to 0.90 peak, 48-sample fade-in.

Command-Line Parameters:

Argument	Default	Description
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13142):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Active calls, upstream/downstream state
`TEST_SYNTH:<text>`	Synthesize and return timing stats
`SYNTH_WAV:<path>\|<text>`	Synthesize text to a WAV file at the given path
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

8. Outbound Audio Processor (`bin/outbound-audio-processor`)

Converts 24kHz float32 PCM from Kokoro/NeuTTS into 160-byte G.711 μ-law frames for the SIP client. Maintains a constant 20ms output cadence.

Signal chain (per call):

DC blocking (first-order high-pass): α = 0.9947697 (~20Hz cutoff). Removes DC offset and LF rumble. Initialized with the first sample value to avoid onset click.
Presence boost (optional, default OFF): High-shelf biquad IIR filter, +3dB shelf at 2500Hz (Audio EQ Cookbook, S=1). Adds air/clarity to the telephone band.
Anti-aliasing FIR (63-tap, Hamming-windowed sinc): Cutoff 3400/12000 (normalized). ~43dB stopband attenuation. Coefficients computed once at startup, shared across all calls. Per-call fir_history[31] preserves filter state across chunks.
3:1 Decimation: Keep every 3rd filtered sample (24kHz → 8kHz).
G.711 μ-law encode (ITU-T compliant): ULAW_CLIP=32635, ULAW_BIAS=132. Encodes int16 PCM to 8-bit μ-law byte.
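
A simplified Python sketch of this chain (without the optional presence boost) is shown below. The constants come from the list above. The μ-law encoder follows the widely used reference algorithm, and the FIR is a generic Hamming-windowed sinc, so neither is guaranteed to match the OAP coefficients bit for bit. For brevity the sketch processes a single chunk and does not carry filter state across chunks, unlike the per-call state described above.

```python
import math

ULAW_CLIP = 32635
ULAW_BIAS = 132
DC_ALPHA = 0.9947697
FIR_TAPS = 63


def ulaw_encode(sample: int) -> int:
    """Encode one int16 PCM sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), ULAW_CLIP) + ULAW_BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def design_fir(taps: int = FIR_TAPS, cutoff: float = 3400.0 / 12000.0) -> list:
    """Hamming-windowed sinc low-pass; cutoff is relative to Nyquist."""
    mid = (taps - 1) // 2
    coeffs = []
    for n in range(taps):
        x = n - mid
        ideal = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        coeffs.append(ideal * (0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))))
    total = sum(coeffs)
    return [c / total for c in coeffs]  # normalize to unity DC gain


def encode_chunk(pcm24k, fir) -> bytes:
    """DC-block, low-pass, decimate 3:1, and mu-law encode one chunk."""
    # DC blocker y[n] = x[n] - x[n-1] + alpha * y[n-1], primed with the
    # first sample so a constant offset produces no onset click.
    prev_x = pcm24k[0] if pcm24k else 0.0
    prev_y = 0.0
    blocked = []
    for x in pcm24k:
        prev_y = x - prev_x + DC_ALPHA * prev_y
        prev_x = x
        blocked.append(prev_y)
    n_hist = len(fir) - 1
    padded = [0.0] * n_hist + blocked
    out = bytearray()
    for i in range(0, len(blocked), 3):  # compute only every 3rd output
        acc = sum(c * padded[i + n_hist - k] for k, c in enumerate(fir))
        pcm16 = int(max(-1.0, min(1.0, acc)) * 32767)
        out.append(ulaw_encode(pcm16))
    return bytes(out)
```

One 4800-sample TTS chunk (200ms at 24kHz) yields 1600 μ-law bytes, i.e. ten 160-byte frames.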

Output scheduler: A dedicated sender thread fires every 20ms using steady_clock and sends exactly 160 bytes per tick. If the TTS buffer is empty (silence, or the TTS service is disconnected), it sends 0xFF (μ-law silence) to maintain RTP clock continuity. Scheduler resync guard: if an OS sleep or load spike causes more than 100ms of drift, the scheduler snaps next_tick to now instead of firing a burst of catch-up frames.
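
The two decisions the scheduler makes on each tick can be sketched as pure Python functions. The names are hypothetical, and padding a partially filled frame with silence is an assumption. In a real loop `now` would come from a monotonic clock such as `time.monotonic()`, the Python counterpart of steady_clock.

```python
FRAME_BYTES = 160
PERIOD_S = 0.020
MAX_DRIFT_S = 0.100
ULAW_SILENCE = 0xFF


def next_frame(buffer: bytearray) -> bytes:
    """Pop up to 160 bytes from the TTS buffer, padding with mu-law silence."""
    frame = bytes(buffer[:FRAME_BYTES])
    del buffer[:FRAME_BYTES]
    return frame + bytes([ULAW_SILENCE]) * (FRAME_BYTES - len(frame))


def advance_tick(next_tick: float, now: float) -> float:
    """Schedule the following tick; resync if more than 100 ms behind."""
    next_tick += PERIOD_S
    if now - next_tick > MAX_DRIFT_S:
        next_tick = now  # drop the backlog instead of bursting catch-up frames
    return next_tick
```

An empty buffer therefore always produces a full frame of 0xFF bytes, which keeps the RTP stream continuous between TTS responses.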

SPEECH_ACTIVE handling: Clears all per-call buffers and resets FIR/DC/biquad state immediately when VAD signals caller speech. A configurable sidetone guard (default 1500ms) suppresses flushes arriving shortly after new TTS audio — prevents echo from triggering a spurious flush.

WAV recording (optional): When enabled, records the 8kHz int16 PCM output per call. Written to disk on CALL_END.

Command-Line Parameters:

Argument	Default	Description
`--save-wav-dir <dir>`	(disabled)	Enable WAV recording and set output directory
`--log-level <LEVEL>`	`INFO`	Log verbosity

Runtime Commands (cmd port 13152):

Command	Description
`PING`	Health check → `PONG`
`STATUS`	Active calls, buffer lengths, upstream/downstream state
`SAVE_WAV:ON` / `OFF` / `STATUS`	Toggle WAV recording
`SET_SAVE_WAV_DIR:<dir>`	Set WAV output directory
`PRESENCE_BOOST:ON` / `OFF` / `STATUS`	Toggle +3dB presence boost biquad
`SET_SIDETONE_GUARD_MS:<ms>`	Set SPEECH_ACTIVE guard window (default 1500ms)
`TEST_ENCODE:<freq>\|<dur_ms>`	Generate sine wave, encode, measure μ-law RMS output
`SET_LOG_LEVEL:<LEVEL>`	Change log verbosity without restart

9. Frontend (`bin/frontend`)

Central control plane. Serves the web UI, manages service lifecycles, aggregates logs, and exposes all configuration via REST API.

Storage: SQLite database (whispertalk.db) — persists service configurations, log level settings, and test results.

Log aggregation: Each service sends structured log entries as UDP datagrams to port 22022. Frontend stores them in SQLite (ring-buffered in memory for fast /recent queries) and streams them live via SSE.

Full HTTP API (port 8080):

Method	Endpoint	Description
GET	`/api/services`	List all managed services + status
POST	`/api/services/start`	Start a service `{name, args}`
POST	`/api/services/stop`	Stop a service `{name}`
POST	`/api/services/restart`	Restart a service `{name}`
GET/POST	`/api/services/config`	Read/write per-service config (persisted in SQLite)
GET	`/api/logs`	Paginated log query `{limit, offset, service, level}`
GET	`/api/logs/recent`	Last N entries from in-memory ring buffer
GET	`/api/logs/stream`	SSE live log stream
POST	`/api/settings/log_level`	Set per-service log level (propagated to running service immediately)
POST	`/api/db/query`	Execute SELECT query (read-only guard)
POST	`/api/db/write_mode`	Toggle write mode for unsafe queries
GET	`/api/db/schema`	Return SQLite schema
GET	`/api/whisper/models`	List available GGML model files in `models/`
POST	`/api/whisper/accuracy_test`	Run offline Whisper accuracy test on a WAV file
POST	`/api/whisper/hallucination_filter`	Enable/disable Whisper hallucination filter
GET/POST	`/api/vad/config`	Read/write VAD parameters (propagated to running service)
GET/POST	`/api/oap/wav_recording`	Read/write OAP WAV recording settings
POST	`/api/sip/add-line`	Register a new SIP account
POST	`/api/sip/remove-line`	Remove a SIP account
GET	`/api/sip/lines`	List registered SIP lines
GET	`/api/sip/stats`	RTP counters per active call
POST	`/api/iap/quality_test`	Offline G.711 codec round-trip quality test
GET	`/api/testfiles`	List WAV+TXT sample pairs in `Testfiles/`
POST	`/api/testfiles/scan`	Rescan `Testfiles/` directory
POST	`/api/tests/start`	Run a test binary
POST	`/api/tests/stop`	Kill a running test
GET	`/api/tests/*/history`	Test run history
GET	`/api/tests/*/log`	Test stdout/stderr
GET	`/api/test_results`	Pipeline WER test results
GET	`/api/status`	System uptime, service health summary
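
As a usage illustration for the API table above, the following Python sketch builds a request for `POST /api/services/start` using only the standard library. The JSON body with `name` and `args` follows the table, but the exact value types, the service name used in the example, and the response format are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"


def build_start_request(name: str, args: str = "") -> urllib.request.Request:
    """Build (but do not send) a request that starts a managed service."""
    body = json.dumps({"name": name, "args": args}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/services/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_service(name: str, args: str = "") -> str:
    """Send the request to a running frontend and return the raw response body."""
    with urllib.request.urlopen(build_start_request(name, args), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

With the frontend running, `start_service("whisper-service", "--language de")` would issue the call; the service name here is a placeholder and should be checked against the output of `GET /api/services`.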

Web UI features:

Service management: start/stop/restart each pipeline service independently
Real-time log streaming with per-service and per-level filtering
Log level control: checkboxes (ERROR/WARN/INFO/DEBUG/TRACE) applied immediately and persisted
VAD configuration: threshold, silence duration, max chunk length — runtime update without restart
Whisper configuration: model selection, hallucination filter toggle
Kokoro configuration: synthesis speed slider, SYNTH_WAV test
OAP configuration: WAV recording toggle + directory, presence boost toggle
SIP management: add/remove SIP lines, view RTP statistics
Beta testing page: audio injection into live calls via Test SIP Provider
Test infrastructure: ASR accuracy tests, pipeline WER tests, LLaMA quality tests, codec quality tests

Interconnect Communication

All inter-service communication uses interconnect.h (a shared header, no external library):

Management channel (base port +0): Typed control messages — CALL_START, CALL_END, SPEECH_ACTIVE, SPEECH_IDLE, PING/PONG
Data channel (base port +1): Binary Packet frames — variable-length payloads tagged with call_id and PacketType (audio PCM, text, G.711)
Command port (base port +2): TCP command interface, one connection per request, 10s recv timeout
TCP_NODELAY: Enabled on all connections for minimum latency
Auto-reconnect: Downstream connections retry every 200ms until reachable; upstream server accepts reconnections at any time
LogForwarder: Sends structured log entries as UDP datagrams to FRONTEND_LOG_PORT (22022)
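
A minimal Python client for the command port could look like the sketch below. The base ports come from the Port Map, and the one-connection-per-request pattern and 10s timeout follow the list above. The newline terminator, the single `recv` call, and the helper names are assumptions, since the wire framing is not specified here.

```python
import socket

# Base (management) ports from the Port Map; cmd port = base + 2
BASE_PORTS = {
    "sip": 13100, "iap": 13110, "vad": 13115, "whisper": 13120,
    "llama": 13130, "tts": 13140, "oap": 13150,
}


def cmd_port(service: str) -> int:
    """Command port for a service key defined in BASE_PORTS."""
    return BASE_PORTS[service] + 2


def format_command(name: str, value=None) -> str:
    """Render 'NAME' or 'NAME:value', the colon style most commands use."""
    return name if value is None else f"{name}:{value}"


def send_command(port: int, command: str, host: str = "127.0.0.1",
                 timeout: float = 10.0) -> str:
    """Open one TCP connection, send one command, return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("utf-8") + b"\n")  # newline framing is assumed
        return sock.recv(4096).decode("utf-8").strip()
```

For example, `send_command(cmd_port("vad"), format_command("SET_LOG_LEVEL", "DEBUG"))` would target port 13117, and `send_command(cmd_port("whisper"), "PING")` should return `PONG` if the service is up.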

Log Level Control

Every service supports 5 levels: ERROR, WARN, INFO, DEBUG, TRACE.

Three ways to set log level:

Startup argument: --log-level DEBUG
Frontend UI: Log level checkboxes — applied immediately to the running service, persisted in SQLite for restarts
Direct command: Send SET_LOG_LEVEL:DEBUG to the service's cmd port via TCP

Testing

Pipeline WER Test

python3 tests/run_pipeline_test.py <MODEL_NAME> [TESTFILES_DIR]

Injects WAV samples through the full pipeline via Test SIP Provider, collects Whisper transcriptions from the frontend log API, and computes character-level similarity against ground truth.

PASS: ≥ 99.5% similarity
WARN: ≥ 90% similarity
FAIL: < 90% similarity
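
The thresholds above can be applied as in the following Python sketch. The exact similarity metric used by run_pipeline_test.py is not stated here, so `difflib.SequenceMatcher` and the case/whitespace normalization are stand-in assumptions. Only the PASS/WARN/FAIL cut-offs are taken from the list.

```python
import difflib


def similarity_percent(expected: str, actual: str) -> float:
    """Character-level similarity in percent after light normalization."""
    a = " ".join(expected.lower().split())
    b = " ".join(actual.lower().split())
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()


def classify(score: float) -> str:
    """Map a similarity score to the PASS/WARN/FAIL buckets."""
    if score >= 99.5:
        return "PASS"
    if score >= 90.0:
        return "WARN"
    return "FAIL"
```

Under these cut-offs a 41.8% score, as reported for one sample in the benchmark notes, is a FAIL.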

Test samples: Testfiles/sample_NN.wav + sample_NN.txt pairs.

Test SIP Provider (`bin/test_sip_provider`)

B2BUA test tool that injects audio files into the pipeline as if they were real phone calls. Supports WAV recording of both legs of each conference call.

./test_sip_provider --port 5060 --http-port 22011 --testfiles-dir Testfiles

HTTP API (port 22011):

Method	Endpoint	Description
POST	`/conference`	Create a test call with optional audio injection
POST	`/hangup`	Hang up a call
GET	`/calls`	List active calls
GET/POST	`/wav_recording`	Read/write WAV recording settings
POST	`/inject`	Inject an audio file into a call leg

Audio Quality Collection (`tests/run_stage7.py`)

End-to-end diagnostic script:

python3 tests/run_stage7.py [--iterations N]

Starts all services, connects test calls, enables WAV recording, injects samples, collects logs, and saves WAV files from both OAP and Test SIP Provider. Produces stage7_output/run_N/ directories with pipeline.log, oap_call_*.wav, and tsp_call_*.wav.

Unit/Integration Tests

./runmetobuildeverything
cd build && ctest --output-on-failure

Tests are built by default (BUILD_TESTS=ON). To skip them: ./runmetobuildeverything --no-tests

Test binaries: test_sanity, test_interconnect, test_kokoro_cpp, test_integration.

Whisper Model Benchmarks

Hardware: Apple M4, macOS 25.2.0
whisper.cpp: v1.8.3 (CoreML + Metal)

whisper-cli Direct Test (5 samples, clean 16kHz PCM)

All model/backend combinations achieved 5/5 perfect transcription on clean input.

Model	Size	Backend	Avg Time
large-v3	2.9 GB	CoreML + ANE	~2580ms
large-v3-q5_0	1.0 GB	CoreML + ANE	~2075ms
large-v3-turbo	1.5 GB	CoreML + ANE	~1575ms
large-v3-turbo-q5_0	547 MB	CoreML + ANE	~1060ms

Full Pipeline Test (20 samples, G.711 μ-law via SIP/RTP)

Audio path: WAV → 8kHz μ-law G.711 → RTP → SIP Client → IAP (8→16kHz) → Whisper

Model	Size	Backend	PASS	WARN	FAIL	Avg ms
large-v3	2.9 GB	Metal	12	8	0	1627
large-v3	2.9 GB	CoreML	11	8	1*	1301
large-v3-q5_0	1.0 GB	Metal	11	9	0	1789
large-v3-turbo	1.5 GB	CoreML	9	10	1*	688
large-v3-turbo-q5_0	547 MB	CoreML	8	11	1**	686

* CoreML warmup caused first-inference timeout
** Sample_01 failed (41.8%) due to CoreML warmup causing VAD to miss first half

Scoring: PASS = ≥99.5% similarity, WARN = ≥90%, FAIL = <90%

Recommendations

Accuracy priority: large-v3 + Metal — 1627ms avg, no warmup delay, best accuracy
Speed priority: large-v3-turbo + CoreML — 688ms avg after initial warmup

Key Findings

Quantization (q5_0): negligible accuracy impact vs. full-precision models
CoreML warmup: 20–35s first-inference compilation cost, one-time per service lifetime
Turbo trade-off: ~2× faster, slightly more WARN results. 4-layer decoder occasionally misses nuances in G.711-degraded audio.
Consistent failures: some samples fail across all model configs due to G.711 codec artifacts, not model limitations
