Providers

ph0ny is a thin orchestration layer over the best models in each modality. Pick a provider per request, or let us route to the lowest-latency / lowest-cost option that meets your quality bar. Every BYOK-eligible provider can run with your own API key — no markup, no metering on your side.

Speech synthesis (TTS)

Swappable per request via the `provider` field. Cartesia is the production default for sub-second latency.

Cartesia

cartesia

BYOK

streaming, ~40ms TTFB, voice clone

Production default. Best latency in class.

ElevenLabs

elevenlabs

BYOK

streaming, voice clone, multilingual, conversational

32 languages. Highest perceived quality on long-form.

Deepgram Aura

deepgram

BYOK

streaming, low cost

Pairs naturally with Deepgram STT for one-vendor pipelines.

Fish Audio

fish-audio

BYOK

streaming, voice clone

Open-source friendly, good multilingual coverage.

Resemble AI

resemble-ai

BYOK

voice clone, enterprise

Enterprise voice cloning with consent workflows.

Inworld

inworld

BYOK

gaming, character voices

Optimized for interactive characters.

Qwen TTS

qwen-tts

BYOK

multilingual

Alibaba — strong on Chinese + English.

Kokoro

kokoro

lightweight

Tiny, fast, open-source.

F5 TTS

f5-tts

voice clone, open-source

Self-hosted, 50ms streaming when on GPU.

CosyVoice

cosyvoice

multilingual, voice clone

Open-source, excellent zero-shot cloning.

Chatterbox

chatterbox

open-source

Resemble Chatterbox — GPU accelerated.

Pocket TTS

pocket-tts

free tier, CPU

Internal CPU fallback for the free tier.

Speech recognition (STT)

These providers cover streaming and batch transcription, diarization, and word-level timestamps.

Whisper

whisper

batch, 99 languages, high accuracy

OpenAI — gold-standard accuracy on accented speech.

Deepgram Nova

deepgram

BYOK

streaming, diarize, word timestamps

Lowest streaming latency on long calls.

Cartesia STT

cartesia

BYOK

streaming

Pairs with Cartesia TTS for the lowest end-to-end voice loop.

ElevenLabs Scribe

elevenlabs

BYOK

batch, high quality

New from ElevenLabs — strong on noisy audio.

Groq Whisper

groq

BYOK

batch, fastest, low cost

Whisper running on LPU — 100x faster than the OpenAI API.

Fish Audio

fish-audio

BYOK

multilingual

Same vendor as the Fish Audio TTS listed above, for a consistent voice loop.

Faster Whisper

faster-whisper

self-hosted, CTranslate2

Self-hosted on CPU or GPU. Up to 4x faster than the reference OpenAI Whisper implementation.

Parakeet

parakeet

streaming, low latency

NVIDIA Parakeet, optimized for real-time transcription.

SenseVoice

sensevoice

multilingual, emotion

Alibaba — emotion + event detection alongside transcription.

Moonshine

moonshine

edge, low latency

From Useful Sensors, optimized for on-device use.
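
Several entries above mention pairing STT and TTS from one vendor. As a rough sketch, the provider IDs listed in the two sections can be intersected to find single-vendor candidates. The helper name `singleVendorCandidates` is hypothetical and not part of the ph0ny SDK, and a shared ID is only assumed to mean a shared vendor account.

```typescript
// Provider IDs as listed in the TTS and STT sections of this page.
const ttsProviders = [
  "cartesia", "elevenlabs", "deepgram", "fish-audio", "resemble-ai", "inworld",
  "qwen-tts", "kokoro", "f5-tts", "cosyvoice", "chatterbox", "pocket-tts",
] as const;

const sttProviders = [
  "whisper", "deepgram", "cartesia", "elevenlabs", "groq", "fish-audio",
  "faster-whisper", "parakeet", "sensevoice", "moonshine",
] as const;

// Hypothetical helper: IDs that appear in both lists, kept in TTS order.
function singleVendorCandidates(
  tts: readonly string[],
  stt: readonly string[],
): string[] {
  const sttSet = new Set<string>(stt);
  return tts.filter((id) => sttSet.has(id));
}

const sameVendor = singleVendorCandidates(ttsProviders, sttProviders);
console.log(sameVendor);
// → [ 'cartesia', 'elevenlabs', 'deepgram', 'fish-audio' ]
```

Any of the resulting IDs could then be passed as `provider` on both the STT and TTS requests.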

LLMs

Brain providers for agent reasoning. Use a single model or rotate across providers per request for cost / availability.

OpenAI

openai

BYOK

gpt-5, gpt-5-mini, gpt-4o, tools, vision

Default for high-stakes reasoning. Tool calling is rock-solid.

Anthropic

anthropic

BYOK

claude-opus-4, claude-sonnet-4, tools, vision

Best long-context comprehension. 1M context window on Opus.

Groq

groq

BYOK

llama-4, mixtral, streaming, lowest latency

LPU inference — 10x faster TTFT for streaming agents.

Zhipu GLM

zhipu

glm-4, multilingual

Strong on Chinese + English; cost-competitive.

Obliteratus

obliteratus

uncensored, roleplay

Open-weights router for use cases mainstream APIs refuse. Apply at sales@ph0ny.com.
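
Rotating across LLM providers per request can be done on the client side. The following is a minimal sketch of a round-robin rotator, assuming you pass the returned ID as the `provider` field of each request. `createRotator` and the `LlmProvider` type are hypothetical names for illustration, not SDK exports.

```typescript
// LLM provider IDs from the list above.
type LlmProvider = "openai" | "anthropic" | "groq" | "zhipu" | "obliteratus";

// Hypothetical round-robin rotator: each call returns the next item in order,
// wrapping back to the first item after the last.
function createRotator<T>(items: readonly T[]): () => T {
  if (items.length === 0) {
    throw new Error("createRotator needs at least one item");
  }
  let index = 0;
  return () => {
    const item = items[index] as T;
    index = (index + 1) % items.length;
    return item;
  };
}

const nextProvider = createRotator<LlmProvider>(["openai", "anthropic", "groq"]);
const picks = [nextProvider(), nextProvider(), nextProvider(), nextProvider()];
console.log(picks);
// → [ 'openai', 'anthropic', 'groq', 'openai' ]
```

Round-robin spreads load evenly. A weighted or failure-aware rotation may suit cost or availability goals better, depending on your traffic.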

Video

Talking-avatar, lipsync, generation, and understanding. Pick one explicitly or let ph0ny route by quality / latency. See the Video API reference (/api/video).

Sora

sora

BYOK

text-to-video, commercial

OpenAI Sora — highest realism for short generative clips.

Runway

runway

BYOK

text-to-video, image-to-video, commercial

Gen-3 / Gen-4. Strongest cinematic motion.

HeyGen

heygen

BYOK

avatar, lipsync, commercial

Photorealistic talking avatars from a single photo + audio.

Sync Labs

sync-labs

BYOK

lipsync, multilingual

Best-in-class lipsync for dubbing pipelines.

Twelve Labs

twelve-labs

BYOK

video understanding, embeddings, search

Marengo / Pegasus — semantic video search and Q&A.

Hallo3

hallo3

avatar, self-hosted

State-of-the-art open avatar; expressive head + torso motion.

LivePortrait

liveportrait

avatar, self-hosted

Fast portrait animation; great for kiosk/web embeds.

EchoMimic v2

echomimicv2

avatar, self-hosted

Half-body talking avatar with hand motion.

V-Express

v-express

avatar, self-hosted

Tencent — strong on subtle micro-expressions.

SkyReels A1

skyreels-a1

avatar, self-hosted

Kling-class avatar; longer continuity.

LatentSync

latentsync

lipsync, self-hosted

Open-source lipsync — high fidelity, GPU required.

Video Retalking

video-retalking

lipsync, self-hosted

Robust lipsync over existing footage.

MuseTalk

musetalk

lipsync, self-hosted

Realtime lipsync; pairs with avatar generators.

LTX Video

ltx-video

text-to-video, self-hosted

Lightricks open-source; fast iteration.

InternVideo 2.5

internvideo2.5

understanding, self-hosted

Long-form video understanding + retrieval.

Qwen2.5-VL

qwen2.5-vl

understanding, self-hosted

Vision-language, strong on charts and screen capture.

Qwen3-VL

qwen3-vl

understanding, self-hosted

Latest Qwen vision-language; multilingual.
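
When picking a video provider explicitly, it can help to filter the list by capability tags first. The sketch below encodes a subset of the entries above as plain data. The `VideoProvider` shape and the `withTags` helper are assumptions made for illustration, and `byok` only mirrors whether an entry carries a BYOK badge on this page.

```typescript
// Hypothetical record shape; not an SDK type.
interface VideoProvider {
  id: string;
  byok: boolean;
  tags: readonly string[];
}

// A subset of the video providers listed above, with their tags as shown.
const videoProviders: readonly VideoProvider[] = [
  { id: "heygen", byok: true, tags: ["avatar", "lipsync", "commercial"] },
  { id: "sync-labs", byok: true, tags: ["lipsync", "multilingual"] },
  { id: "hallo3", byok: false, tags: ["avatar", "self-hosted"] },
  { id: "latentsync", byok: false, tags: ["lipsync", "self-hosted"] },
  { id: "video-retalking", byok: false, tags: ["lipsync", "self-hosted"] },
  { id: "musetalk", byok: false, tags: ["lipsync", "self-hosted"] },
  { id: "ltx-video", byok: false, tags: ["text-to-video", "self-hosted"] },
];

// Return the IDs of providers that carry every required tag.
function withTags(
  providers: readonly VideoProvider[],
  required: readonly string[],
): string[] {
  return providers
    .filter((p) => required.every((tag) => p.tags.includes(tag)))
    .map((p) => p.id);
}

const selfHostedLipsync = withTags(videoProviders, ["lipsync", "self-hosted"]);
console.log(selfHostedLipsync);
// → [ 'latentsync', 'video-retalking', 'musetalk' ]
```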

Telephony & messaging

Twilio Voice

twilio

BYOK

inbound, outbound, PSTN, numbers in 100+ countries

Production-grade telephony. BYOK is strongly recommended for production traffic.

Twilio SMS

twilio-sms

BYOK

SMS, WhatsApp Business

Uses the same Twilio account; useful for handing off after the call.

WhatsApp Cloud API

whatsapp

BYOK

voice, messaging

Direct integration; no Twilio middleman.

How provider selection works

You decide per request:

import { ph0ny } from '@ph0ny/sdk'

// Text-to-speech with the provider pinned explicitly.
await ph0ny.tts.synthesize({
  text: 'Welcome to ph0ny.',
  voiceId: 'sonic-english-male',
  provider: 'cartesia',          // explicit
})

// Talking avatar with the provider pinned explicitly.
await ph0ny.video.avatar({
  image_url: 'https://example.com/headshot.png',
  text: "Hi, I'm your AI host.",
  voice_id: 'rachel',
  provider: 'heygen',            // explicit
})

Or omit `provider` and let ph0ny route to the lowest-latency option that supports the requested voice or feature. See Pricing for the BYOK metering model.
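
The routing rule described above (lowest latency among providers that support the requested features) can be pictured roughly as follows. This is an illustrative sketch only, not ph0ny's actual router: the `Candidate` shape, the `routeLowestLatency` function, the provider IDs, and the latency figures are all placeholders invented for the example.

```typescript
// Hypothetical candidate record; latencyMs is a caller-supplied measurement.
interface Candidate {
  id: string;
  features: readonly string[];
  latencyMs: number;
}

// Pick the lowest-latency candidate that supports every required feature.
// Returns undefined when no candidate qualifies.
function routeLowestLatency(
  candidates: readonly Candidate[],
  required: readonly string[],
): Candidate | undefined {
  let best: Candidate | undefined;
  for (const candidate of candidates) {
    const supportsAll = required.every((f) => candidate.features.includes(f));
    if (!supportsAll) continue;
    if (best === undefined || candidate.latencyMs < best.latencyMs) {
      best = candidate;
    }
  }
  return best;
}

// Placeholder providers and latencies, for illustration only (not benchmarks).
const candidates: readonly Candidate[] = [
  { id: "provider-a", features: ["streaming", "voice clone"], latencyMs: 40 },
  { id: "provider-b", features: ["streaming", "multilingual"], latencyMs: 120 },
  { id: "provider-c", features: ["streaming"], latencyMs: 90 },
];

const fastest = routeLowestLatency(candidates, ["streaming"]);
const multilingual = routeLowestLatency(candidates, ["streaming", "multilingual"]);
console.log(fastest?.id, multilingual?.id);
// → provider-a provider-b
```

Filtering before comparing latency means a fast provider is never chosen for a feature it lacks. When nothing qualifies, the sketch returns `undefined` so the caller can decide how to fail.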
