# Video API
Five operations cover the entire video surface. Every endpoint is async (returns a `VideoJob`), every endpoint accepts a `provider` field to pin an engine, and every endpoint emits a webhook on completion when `webhook_url` is set.
| Endpoint | Purpose |
|---|---|
| `POST /v1/video/generate` | Text → video |
| `POST /v1/video/avatar` | Image + audio (or text + voice) → talking-head video |
| `POST /v1/video/lipsync` | Video + audio → lipsync'd video |
| `POST /v1/video/dub` | Video → dubbed in another language, lipsync preserved |
| `POST /v1/video/analyze` | Video → embeddings, transcription, semantic Q&A |
| `GET /v1/video/:jobId` | Poll job status |

See Providers → Video for the full engine matrix.
## Generate
Text-to-video. Render up to 30 seconds of generated video from a prompt.
`POST /v1/video/generate`

Request:

```json
{
  "prompt": "A barista pulling a perfect espresso shot, warm morning light, 35mm film",
  "negative_prompt": "blurry, distorted faces",
  "duration_seconds": 6,
  "width": 1280,
  "height": 720,
  "fps": 24,
  "provider": "runway",
  "webhook_url": "https://your-app.com/hooks/video"
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | 1–2,000 chars. |
| `negative_prompt` | string | — | What to avoid. |
| `duration_seconds` | number | 5 | 1–30. |
| `width` | int | 768 | 256–1920. |
| `height` | int | 512 | 256–1080. |
| `fps` | int | 24 | 8–60. |
| `provider` | enum | auto | `sora`, `runway`, `ltx-video`, … |
| `webhook_url` | url | — | Notified on completion. |
Response `202 Accepted`:
```json
{
  "id": "vj_01HM…",
  "type": "generate",
  "status": "pending",
  "provider": "runway",
  "created_at": "2026-05-05T18:00:00Z",
  "owner_id": "user_…"
}
```

Poll `GET /v1/video/:jobId` or wait on the webhook.
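For illustration, a raw-HTTP sketch of the full generate-then-poll flow. The base URL and the Bearer-token header are assumptions; check the Authentication docs for the real scheme.

```ts
const API = 'https://api.ph0ny.com' // assumed base URL
const headers = {
  Authorization: `Bearer ${process.env.PH0NY_API_KEY}`, // assumed auth scheme
  'Content-Type': 'application/json',
}

// Submit the job: 202 + a pending VideoJob.
const { id } = await fetch(`${API}/v1/video/generate`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    prompt: 'A barista pulling a perfect espresso shot',
    duration_seconds: 6,
  }),
}).then((r) => r.json())

// Poll inside the recommended 1-5s window until the job settles.
let job: any
do {
  await new Promise((r) => setTimeout(r, 2000))
  job = await fetch(`${API}/v1/video/${id}`, { headers }).then((r) => r.json())
} while (job.status === 'pending' || job.status === 'processing')

if (job.status === 'failed') throw new Error(job.error.message)
console.log(job.result.video_url)
```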
## Avatar
Talking-head video from a single still image plus audio. Pre-render the audio yourself, or chain through TTS in one call.
`POST /v1/video/avatar`

Request (pre-rendered audio):

```json
{
  "image_url": "https://uploads.example.com/headshot.jpg",
  "audio_url": "https://uploads.example.com/script.mp3",
  "provider": "heygen"
}
```

Request (TTS chaining):
```json
{
  "image_url": "https://uploads.example.com/headshot.jpg",
  "text": "Welcome to our launch. We're shipping in three weeks.",
  "voice_id": "rachel",
  "tts_provider": "elevenlabs",
  "provider": "hallo3"
}
```

| Field | Type | Notes |
|---|---|---|
| `image_url` | url | Required. JPG/PNG. Face must be visible. |
| `audio_url` | url | Provide this or `text` + `voice_id`. |
| `text` | string | 1–10,000 chars. Synthesizes audio first. |
| `voice_id` | string | Library or cloned voice. See `/voices`. |
| `tts_provider` | string | TTS engine. Default `default`. |
| `target_language` | string | If set, translates `text` before synthesis. |
| `provider` | enum | Avatar engine: `hallo3`, `heygen`, `liveportrait`, `echomimicv2`, `v-express`, `skyreels-a1`. |
Self-hosted engines (`hallo3`, `liveportrait`, `echomimicv2`, `v-express`, `skyreels-a1`) run on ph0ny GPUs and are billed per second of generated video. `heygen` requires BYOK.
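As a sketch, chaining TTS with translation via the SDK might look like this; the request mirrors the field table above, and the `de-DE` code is illustrative.

```ts
import { ph0ny } from '@ph0ny/sdk'

// TTS chaining with translation: target_language translates the script
// before synthesis (see the field table above). 'de-DE' is illustrative.
const job = await ph0ny.video.avatar({
  image_url: 'https://uploads.example.com/headshot.jpg',
  text: "Welcome to our launch. We're shipping in three weeks.",
  voice_id: 'rachel',
  tts_provider: 'elevenlabs',
  target_language: 'de-DE',
  provider: 'hallo3', // self-hosted; billed per second of output
})
```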
## Lipsync

Re-align lip motion in an existing video to a new audio track. Useful for dubbing, ADR, and fixing flubbed takes.
`POST /v1/video/lipsync`

```json
{
  "video_url": "https://uploads.example.com/clip.mp4",
  "audio_url": "https://uploads.example.com/replacement.mp3",
  "provider": "sync-labs"
}
```

| Provider | Notes |
|---|---|
| `sync-labs` | Best commercial fidelity. BYOK. |
| `heygen-lipsync` | If you already use HeyGen. |
| `latentsync` | Self-hosted, open-source. |
| `video-retalking` | Self-hosted, robust on noisy footage. |
| `musetalk` | Realtime; lower latency at lower fidelity. |
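A minimal SDK sketch, assuming the client exposes a `video.lipsync` method alongside the `video.avatar` helper shown in the SDK section below.

```ts
import { ph0ny } from '@ph0ny/sdk'

// Assumes the SDK exposes video.lipsync alongside video.avatar.
const job = await ph0ny.video.lipsync({
  video_url: 'https://uploads.example.com/clip.mp4',
  audio_url: 'https://uploads.example.com/replacement.mp3',
  provider: 'latentsync', // self-hosted, so no BYOK required
})
const result = await ph0ny.video.waitForJob(job.id)
console.log(result.video_url)
```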
## Dub
End-to-end localization: translate audio, synthesize in the target language, lipsync the original video to the new audio. One call.
`POST /v1/video/dub`

```json
{
  "video_url": "https://uploads.example.com/launch.mp4",
  "target_language": "es-ES",
  "voice_id": "rachel",
  "tts_provider": "elevenlabs",
  "lipsync_provider": "sync-labs"
}
```

ph0ny will:
1. Transcribe the source audio (Whisper).
2. Translate to `target_language`, preserving timing.
3. Synthesize the new audio using `voice_id` on `tts_provider`.
4. Re-lipsync the video using `lipsync_provider`.
If `voice_id` is omitted, we use a multilingual voice on the same provider. If `source_language` is omitted, we auto-detect.
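A hedged SDK sketch of the same call, assuming a `video.dub` method that mirrors the HTTP body above, with `voice_id` and `source_language` omitted to exercise the defaults.

```ts
import { ph0ny } from '@ph0ny/sdk'

// Assumes a video.dub SDK method mirroring the HTTP body above.
// voice_id omitted -> multilingual voice on the same provider;
// source_language omitted -> auto-detected.
const job = await ph0ny.video.dub({
  video_url: 'https://uploads.example.com/launch.mp4',
  target_language: 'es-ES',
  tts_provider: 'elevenlabs',
  lipsync_provider: 'sync-labs',
})
```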
## Analyze
Multimodal video understanding — semantic search, transcription, scene detection, Q&A.
`POST /v1/video/analyze`

```json
{
  "video_url": "https://uploads.example.com/meeting.mp4",
  "provider": "twelve-labs",
  "tasks": ["transcribe", "scenes", "embed"]
}
```

Use cases:
- Search across hours of footage: embed once, query in natural language. Pairs with Collections.
- Auto-chapter long videos: `scenes` returns timestamps + descriptions (see the sketch below).
- Q&A over recorded calls: pass the analyze output as agent context.
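A sketch of the auto-chaptering flow referenced above, assuming a `video.analyze` SDK method; the `scenes` result shape is a placeholder, not a documented schema.

```ts
import { ph0ny } from '@ph0ny/sdk'

// Assumes a video.analyze SDK method; the shape of result.scenes is a
// placeholder. Inspect a completed job for the real schema.
const job = await ph0ny.video.analyze({
  video_url: 'https://uploads.example.com/meeting.mp4',
  provider: 'twelve-labs',
  tasks: ['transcribe', 'scenes'],
})
const result = await ph0ny.video.waitForJob(job.id)
for (const scene of result.scenes ?? []) {
  console.log(scene.start, scene.description) // hypothetical field names
}
```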
| Provider | Strength |
|---|---|
| `twelve-labs` | Best commercial: Marengo + Pegasus. BYOK. |
| `internvideo2.5` | Self-hosted; long-form. |
| `qwen2.5-vl` / `qwen3-vl` | Self-hosted; charts and screen capture. |
| `videochat-flash` | Realtime conversational video Q&A. |
## Job lifecycle

Every endpoint returns a `VideoJob`:
```
{
  "id": "vj_…",
  "type": "generate" | "avatar" | "lipsync" | "dub" | "analyze",
  "status": "pending" | "processing" | "completed" | "failed",
  "provider": "runway",
  "progress": 0.42,
  "result": {
    "video_url": "https://r2.ph0ny.com/jobs/vj_…/output.mp4",
    "duration_seconds": 6.0,
    "width": 1280,
    "height": 720,
    "provider": "runway"
  },
  "error": { "code": "provider_timeout", "message": "…" },
  "created_at": "2026-05-05T18:00:00Z",
  "completed_at": "2026-05-05T18:00:42Z",
  "owner_id": "user_…"
}
```

Poll: `GET /v1/video/:jobId` (1–5s interval recommended). Webhook: payloads are signed with HMAC-SHA256 using your `webhook_url` secret. See Sessions → Webhooks for verifier code.
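A minimal verification sketch, assuming the signature arrives hex-encoded in a request header; the header name and encoding are not specified here, so Sessions → Webhooks remains the canonical reference.

```ts
import { createHmac, timingSafeEqual } from 'node:crypto'

// Compare the HMAC-SHA256 of the raw request body against the signature
// header. Header name and hex encoding are assumptions.
export function verifyWebhook(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex')
  const a = Buffer.from(expected)
  const b = Buffer.from(signature)
  // timingSafeEqual throws on unequal lengths, so check first.
  return a.length === b.length && timingSafeEqual(a, b)
}
```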
## SDK

```ts
import { ph0ny } from '@ph0ny/sdk'

// TTS chaining: synthesize the script with the 'rachel' voice, then animate.
const job = await ph0ny.video.avatar({
  image_url: 'https://uploads.example.com/headshot.jpg',
  text: 'Welcome to ph0ny.',
  voice_id: 'rachel',
  provider: 'hallo3',
})

// Wait for the job to finish, then read the output URL.
const result = await ph0ny.video.waitForJob(job.id)
console.log(result.video_url)
```