
Providers

ph0ny is a thin orchestration layer over the best models in each modality. Pick a provider per request, or let us route to the lowest-latency / lowest-cost option that meets your quality bar. Every BYOK-eligible provider can run with your own API key — no markup, no metering on your side.

Speech synthesis (TTS)

12 providers, swappable per request via the `provider` field. Cartesia is the production default for sub-second latency.

Cartesia
cartesia
BYOK
streaming, ~40ms TTFB, voice clone

Production default. Best latency in class.

ElevenLabs
elevenlabs
BYOK
streaming, voice clone, multilingual, conversational

32 languages. Highest perceived quality on long-form.

Deepgram Aura
deepgram
BYOK
streaming, low cost

Pairs naturally with Deepgram STT for one-vendor pipelines.

Fish Audio
fish-audio
BYOK
streaming, voice clone

Open-source friendly, good multilingual coverage.

Resemble AI
resemble-ai
BYOK
voice clone, enterprise

Enterprise voice cloning with consent workflows.

Inworld
inworld
BYOK
gaming, character voices

Optimized for interactive characters.

Qwen TTS
qwen-tts
BYOK
multilingual

Alibaba — strong on Chinese + English.

Kokoro
kokoro
lightweight

Tiny, fast, open-source.

F5 TTS
f5-tts
voice clone, open-source

Self-hosted, 50ms streaming when on GPU.

CosyVoice
cosyvoice
multilingual, voice clone

Open-source, excellent zero-shot cloning.

Chatterbox
chatterbox
open-source

Resemble Chatterbox — GPU accelerated.

Pocket TTS
pocket-tts
free tier, CPU

Internal CPU fallback for the free tier.
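
Because the provider is chosen with a plain string, a typo in the `provider` field is easy to miss. One way to guard against that on the client is sketched below. The slugs are copied from the list above; `TTS_PROVIDERS`, `TtsProvider`, and `isTtsProvider` are hypothetical names for illustration and are not part of the ph0ny SDK.

```ts
// Slugs copied from the TTS provider list above.
const TTS_PROVIDERS = [
  'cartesia', 'elevenlabs', 'deepgram', 'fish-audio', 'resemble-ai', 'inworld',
  'qwen-tts', 'kokoro', 'f5-tts', 'cosyvoice', 'chatterbox', 'pocket-tts',
] as const

// Union of the literal slugs, e.g. 'cartesia' | 'elevenlabs' | ...
type TtsProvider = (typeof TTS_PROVIDERS)[number]

// Hypothetical helper: narrows an arbitrary string to a known TTS slug.
function isTtsProvider(value: string): value is TtsProvider {
  return TTS_PROVIDERS.some((slug) => slug === value)
}

// Example: validate user input before putting it in a request.
const requested: string = 'fish-audio'
if (!isTtsProvider(requested)) {
  throw new Error(`Unknown TTS provider: ${requested}`)
}
```

After the guard, `requested` is typed as `TtsProvider`, so it can be passed on without a cast.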

Speech recognition (STT)

10 providers covering streaming, batch, diarization, and word-level timestamps.

Whisper
whisper
batch, 99 languages, high accuracy

OpenAI — gold-standard accuracy on accented speech.

Deepgram Nova
deepgram
BYOK
streaming, diarize, word timestamps

Lowest streaming latency on long calls.

Cartesia STT
cartesia
BYOK
streaming

Pairs with Cartesia TTS for the lowest-latency end-to-end voice loop.

ElevenLabs Scribe
elevenlabs
BYOK
batch, high quality

New from ElevenLabs — strong on noisy audio.

Groq Whisper
groq
BYOK
batch, fastest, low cost

Whisper running on LPU — 100x faster than the OpenAI API.

Fish Audio
fish-audio
BYOK
multilingual

Same vendor as Fish Audio TTS, for a consistent voice loop.

Faster Whisper
faster-whisper
self-hosted, CTranslate2

CPU/GPU self-host. Up to 4x faster than reference Whisper.

Parakeet
parakeet
streaming, low latency

NVIDIA Parakeet — optimised for real-time transcription.

SenseVoice
sensevoice
multilingual, emotion

Alibaba — emotion + event detection alongside transcription.

Moonshine
moonshine
edge, low latency

Useful Sensors: optimised for on-device use.
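
Several entries above mention pairing STT and TTS from the same vendor. A minimal sketch of finding those pairs follows, assuming that a slug shared by both lists means the same vendor account can serve both sides. The slug arrays are copied from this page; `sharedVendors` is a hypothetical helper, not an SDK function.

```ts
// Slugs copied from the TTS and STT lists on this page.
const TTS_SLUGS = [
  'cartesia', 'elevenlabs', 'deepgram', 'fish-audio', 'resemble-ai', 'inworld',
  'qwen-tts', 'kokoro', 'f5-tts', 'cosyvoice', 'chatterbox', 'pocket-tts',
]
const STT_SLUGS = [
  'whisper', 'deepgram', 'cartesia', 'elevenlabs', 'groq', 'fish-audio',
  'faster-whisper', 'parakeet', 'sensevoice', 'moonshine',
]

// Hypothetical helper: slugs present in both lists, in STT list order.
function sharedVendors(tts: readonly string[], stt: readonly string[]): string[] {
  return stt.filter((slug) => tts.indexOf(slug) !== -1)
}

const singleVendorLoops = sharedVendors(TTS_SLUGS, STT_SLUGS)
console.log(singleVendorLoops)
// [ 'deepgram', 'cartesia', 'elevenlabs', 'fish-audio' ]
```

Any slug in the result could be used as the `provider` value for both the STT and the TTS request in a voice loop.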

LLMs

Brain providers for agent reasoning. Use a single model or rotate across providers per request for cost / availability.

OpenAI
openai
BYOK
gpt-5, gpt-5-mini, gpt-4o, tools, vision

Default for high-stakes reasoning. Tool calling is rock-solid.

Anthropic
anthropic
BYOK
claude-opus-4, claude-sonnet-4, tools, vision

Best long-context comprehension. 1M-token context window available on Sonnet 4 (beta).

Groq
groq
BYOK
llama-4, mixtral, streaming, lowest latency

LPU inference — 10x faster TTFT for streaming agents.

Zhipu GLM
zhipu
glm-4, multilingual

Strong on Chinese + English; cost-competitive.

Obliteratus
obliteratus
uncensored, roleplay

Open-weights router for use cases mainstream APIs refuse. Apply at sales@ph0ny.com.
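
Rotating across LLM providers per request, as mentioned at the top of this section, can be done client-side with a simple round-robin. The sketch below is one possible approach. `createRotation` and `LlmProvider` are hypothetical names, and the shape of the LLM request itself is not shown here because this page does not document it.

```ts
// Slugs taken from the LLM list above (a subset, for illustration).
type LlmProvider = 'openai' | 'anthropic' | 'groq' | 'zhipu'

// Hypothetical helper: returns a function that cycles through the
// given providers in order, wrapping around at the end.
function createRotation<T>(providers: readonly T[]): () => T {
  if (providers.length === 0) {
    throw new Error('Need at least one provider to rotate over')
  }
  let next = 0
  return () => {
    const chosen = providers[next]
    next = (next + 1) % providers.length
    return chosen
  }
}

const nextLlmProvider = createRotation<LlmProvider>(['openai', 'anthropic', 'groq'])

// Each call yields the provider to use for the next request:
// 'openai', then 'anthropic', then 'groq', then 'openai' again.
const firstChoice = nextLlmProvider()
```

Round-robin spreads load evenly but ignores cost and health. A weighted or failure-aware rotation would need extra state that this sketch leaves out.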

Video

17 video providers across talking-avatar, lipsync, generation, and understanding. Pick one explicitly or let ph0ny route by quality / latency. See the Video API reference (/api/video).

Sora
sora
BYOK
text-to-video, commercial

OpenAI Sora — highest realism for short generative clips.

Runway
runway
BYOK
text-to-video, image-to-video, commercial

Gen-3 / Gen-4. Strongest cinematic motion.

HeyGen
heygen
BYOK
avatar, lipsync, commercial

Photorealistic talking avatars from a single photo + audio.

Sync Labs
sync-labs
BYOK
lipsync, multilingual

Best-in-class lipsync for dubbing pipelines.

Twelve Labs
twelve-labs
BYOK
video understanding, embeddings, search

Marengo / Pegasus — semantic video search and Q&A.

Hallo3
hallo3
avatar, self-hosted

State-of-the-art open avatar; expressive head + torso motion.

LivePortrait
liveportrait
avatar, self-hosted

Fast portrait animation; great for kiosk/web embeds.

EchoMimic v2
echomimicv2
avatar, self-hosted

Half-body talking avatar with hand motion.

V-Express
v-express
avatar, self-hosted

Tencent — strong on subtle micro-expressions.

SkyReels A1
skyreels-a1
avatar, self-hosted

Kling-class avatar; longer continuity.

LatentSync
latentsync
lipsync, self-hosted

Open-source lipsync — high fidelity, GPU required.

Video Retalking
video-retalking
lipsync, self-hosted

Robust lipsync over existing footage.

MuseTalk
musetalk
lipsync, self-hosted

Realtime lipsync; pairs with avatar generators.

LTX Video
ltx-video
text-to-video, self-hosted

Lightricks open-source; fast iteration.

InternVideo 2.5
internvideo2.5
understanding, self-hosted

Long-form video understanding + retrieval.

Qwen2.5-VL
qwen2.5-vl
understanding, self-hosted

Vision-language, strong on charts and screen capture.

Qwen3-VL
qwen3-vl
understanding, self-hosted

Latest Qwen vision-language; multilingual.

Telephony & messaging

Twilio Voice
twilio
BYOK
inbound, outbound, PSTN, numbers in 100+ countries

Production-grade telephony. BYOK is strongly recommended for production workloads.

Twilio SMS
twilio-sms
BYOK
SMS, WhatsApp Business

Uses the same Twilio account; useful for handoff after the call.

WhatsApp Cloud API
whatsapp
BYOK
voice, messaging

Direct integration; no Twilio middleman.

How provider selection works

You decide per request:

```ts
import { ph0ny } from '@ph0ny/sdk'

// Text-to-speech with an explicitly chosen provider.
await ph0ny.tts.synthesize({
  text: 'Welcome to ph0ny.',
  voiceId: 'sonic-english-male',
  provider: 'cartesia',          // explicit
})

// Talking avatar from a headshot, again with an explicit provider.
await ph0ny.video.avatar({
  image_url: 'https://example.com/headshot.png',
  text: "Hi, I'm your AI host.",
  voice_id: 'rachel',
  provider: 'heygen',
})
```

Or omit `provider` and let ph0ny route to the lowest-latency option that supports the requested voice or feature. See Pricing for the BYOK metering model.
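
To make the routing rule concrete, here is a minimal sketch of "lowest latency among providers that support the requested features". It is an illustration under stated assumptions, not ph0ny's actual router. `ProviderInfo` and `routeByLatency` are hypothetical names, the feature tags come from the TTS list on this page, and only Cartesia's ~40ms figure comes from this page; the other latency values are placeholders.

```ts
interface ProviderInfo {
  slug: string
  features: string[]
  // Latency in milliseconds. Only the Cartesia value comes from this page;
  // the others are illustrative placeholders, not measurements.
  latencyMs: number
}

const CANDIDATES: ProviderInfo[] = [
  { slug: 'cartesia', features: ['streaming', 'voice clone'], latencyMs: 40 },
  { slug: 'deepgram', features: ['streaming', 'low cost'], latencyMs: 100 },
  {
    slug: 'elevenlabs',
    features: ['streaming', 'voice clone', 'multilingual', 'conversational'],
    latencyMs: 150,
  },
]

// Hypothetical helper: keep providers that support every required feature,
// then pick the one with the lowest latency. Returns undefined if none match.
function routeByLatency(
  candidates: readonly ProviderInfo[],
  required: readonly string[],
): ProviderInfo | undefined {
  const eligible = candidates.filter((candidate) =>
    required.every((feature) => candidate.features.indexOf(feature) !== -1),
  )
  if (eligible.length === 0) return undefined
  return eligible.reduce((best, candidate) =>
    candidate.latencyMs < best.latencyMs ? candidate : best,
  )
}

const streamingChoice = routeByLatency(CANDIDATES, ['streaming'])
const multilingualChoice = routeByLatency(CANDIDATES, ['streaming', 'multilingual'])
```

With these placeholder numbers, a streaming-only request resolves to `cartesia`, while adding `multilingual` narrows the field to `elevenlabs`. A request for a feature no candidate supports returns `undefined`, which a caller would need to handle.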

Built by ph0ny.