Providers
ph0ny is a thin orchestration layer over the best models in each modality. Pick a provider per request, or let us route to the lowest-latency / lowest-cost option that meets your quality bar. Every BYOK-eligible provider can run with your own API key — no markup, no metering on your side.
Speech synthesis (TTS)
12 providers, swappable per request via the `provider` field. Cartesia is the production default for sub-second latency.
Production default. Best latency in class.
32 languages. Highest perceived quality on long-form.
Pairs naturally with Deepgram STT for one-vendor pipelines.
Open-source friendly, good multilingual coverage.
Enterprise voice cloning with consent workflows.
Optimized for interactive characters.
Alibaba — strong on Chinese + English.
Tiny, fast, open-source.
Self-hosted, 50ms streaming when on GPU.
Open-source, excellent zero-shot cloning.
Resemble Chatterbox — GPU accelerated.
Internal CPU fallback for the free tier.
Speech recognition (STT)
10 providers covering streaming, batch, diarization, and word-level timestamps.
OpenAI — gold-standard accuracy on accented speech.
Lowest streaming latency on long calls.
Pairs with Cartesia TTS for the lowest end-to-end voice loop.
New from ElevenLabs — strong on noisy audio.
Whisper running on LPU — 100x faster than the OpenAI API.
Same vendor as their TTS — consistent voice loop.
CPU/GPU self-host. 4x faster than reference Whisper.
NVIDIA Parakeet — optimised for real-time transcription.
Alibaba — emotion + event detection alongside transcription.
Useful AI — optimised for on-device.
LLMs
Brain providers for agent reasoning. Use a single model or rotate across providers per request for cost / availability.
Default for high-stakes reasoning. Tool calling is rock-solid.
Best long-context comprehension. 1M context window on Opus.
LPU inference — 10x faster TTFT for streaming agents.
Strong on Chinese + English; cost-competitive.
Open-weights router for use cases mainstream APIs refuse. Apply at sales@ph0ny.com.
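The per-request rotation mentioned at the top of this section can be pictured as a round-robin cursor with wrap-around fallback. This is a minimal sketch under stated assumptions: `rotationOrder`, `selectProvider`, the `llm-a`/`llm-b`/`llm-c` names, and the availability callback are all hypothetical and are not part of the ph0ny SDK, and `requestIndex` is assumed to be a non-negative integer.

```typescript
// Hypothetical helpers for illustration; not part of the ph0ny SDK.

// Order in which to try providers for the n-th request: start at a rotating
// offset so load spreads across providers, then wrap around for fallback.
function rotationOrder(providers: string[], requestIndex: number): string[] {
  if (providers.length === 0) return [];
  const start = requestIndex % providers.length;
  return [...providers.slice(start), ...providers.slice(0, start)];
}

// Pick the first provider in rotation order that is currently available.
function selectProvider(
  providers: string[],
  requestIndex: number,
  isAvailable: (provider: string) => boolean,
): string | undefined {
  return rotationOrder(providers, requestIndex).find(isAvailable);
}

// Placeholder provider names, not real ph0ny provider ids.
const llms = ['llm-a', 'llm-b', 'llm-c'];

console.log(rotationOrder(llms, 4)); // [ 'llm-b', 'llm-c', 'llm-a' ]
console.log(selectProvider(llms, 1, (p) => p !== 'llm-b')); // llm-c
```

The value returned by `selectProvider` could then be passed as the `provider` field of a request, assuming the LLM endpoint accepts that field the same way the TTS and video calls do.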
Video
17 video providers across talking-avatar, lipsync, generation, and understanding. Pick one explicitly or let ph0ny route by quality / latency. See the Video API reference (/api/video).
OpenAI Sora — highest realism for short generative clips.
Gen-3 / Gen-4. Strongest cinematic motion.
Photorealistic talking avatars from a single photo + audio.
Best in class lipsync for dubbing pipelines.
Marengo / Pegasus — semantic video search and Q&A.
State-of-the-art open avatar; expressive head + torso motion.
Fast portrait animation; great for kiosk/web embeds.
Half-body talking avatar with hand motion.
Tencent — strong on subtle micro-expressions.
Kling-class avatar; longer continuity.
Open-source lipsync — high fidelity, GPU required.
Robust lipsync over existing footage.
Realtime lipsync; pairs with avatar generators.
Lightricks open-source; fast iteration.
Long-form video understanding + retrieval.
Vision-language, strong on charts and screen capture.
Latest Qwen vision-language; multilingual.
Telephony & messaging
Production telephony. BYOK strongly recommended for production.
Same account; for handoff after the call.
Direct integration; no Twilio middleman.
How provider selection works
You decide per request:
import { ph0ny } from '@ph0ny/sdk'

await ph0ny.tts.synthesize({
  text: 'Welcome to ph0ny.',
  voiceId: 'sonic-english-male',
  provider: 'cartesia', // explicit
})

await ph0ny.video.avatar({
  image_url: 'https://example.com/headshot.png',
  text: "Hi, I'm your AI host.",
  voice_id: 'rachel',
  provider: 'heygen',
})

Or omit `provider` and let ph0ny route to the lowest-latency option that supports the requested voice/feature. See Pricing for the BYOK metering model.
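One way to picture the omitted-provider case is as a filter-then-sort over a provider table. The sketch below is illustrative only: the `ProviderInfo` shape, the `pickProvider` function, the feature names, and the latency figures are all hypothetical, and ph0ny's actual routing logic is not described on this page.

```typescript
// Hypothetical provider record; field names are assumptions for illustration.
interface ProviderInfo {
  name: string;
  p50LatencyMs: number;   // assumed latency metric
  features: Set<string>;  // voices or capabilities the provider supports
}

// Return the lowest-latency provider that supports every requested feature,
// or undefined when no provider qualifies.
function pickProvider(
  providers: ProviderInfo[],
  required: string[],
): ProviderInfo | undefined {
  return providers
    .filter((p) => required.every((f) => p.features.has(f)))
    .sort((a, b) => a.p50LatencyMs - b.p50LatencyMs)[0];
}

// Made-up names and numbers, not measurements of real providers.
const candidates: ProviderInfo[] = [
  { name: 'provider-a', p50LatencyMs: 90, features: new Set(['streaming']) },
  { name: 'provider-b', p50LatencyMs: 140, features: new Set(['streaming', 'cloning']) },
  { name: 'provider-c', p50LatencyMs: 300, features: new Set(['streaming', 'cloning']) },
];

const chosen = pickProvider(candidates, ['cloning']);
console.log(chosen?.name); // provider-b
```

Filtering before sorting means a fast provider that lacks the requested feature is never selected. A cost-based variant would only change the sort key.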