
Video API

Five operations cover the entire video surface. Every endpoint is async (returns a VideoJob), every endpoint accepts a provider field to pin an engine, and every endpoint emits a webhook on completion when webhook_url is set.

text
POST /v1/video/generate     text → video
POST /v1/video/avatar       image + audio (or text + voice) → talking-head video
POST /v1/video/lipsync      video + audio → lipsync'd video
POST /v1/video/dub          video → dubbed in another language, lipsync preserved
POST /v1/video/analyze      video → embeddings, transcription, semantic Q&A
GET  /v1/video/:jobId       poll job status

See Providers → Video for the full engine matrix.


Generate

Text-to-video. Render up to 30 seconds of generated video from a prompt.

POST /v1/video/generate

Request

json
{
  "prompt": "A barista pulling a perfect espresso shot, warm morning light, 35mm film",
  "negative_prompt": "blurry, distorted faces",
  "duration_seconds": 6,
  "width": 1280,
  "height": 720,
  "fps": 24,
  "provider": "runway",
  "webhook_url": "https://your-app.com/hooks/video"
}

Field             Type    Default   Description
prompt            string  required  1–2,000 chars.
negative_prompt   string            What to avoid.
duration_seconds  number  5         1–30.
width             int     768       256–1920.
height            int     512       256–1080.
fps               int     24        8–60.
provider          enum    auto      sora, runway, ltx-video, …
webhook_url       url               Notified on completion.

Response 202

json
{
  "id": "vj_01HM…",
  "type": "generate",
  "status": "pending",
  "provider": "runway",
  "created_at": "2026-05-05T18:00:00Z",
  "owner_id": "user_…"
}

Poll GET /v1/video/:jobId or wait on the webhook.
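
If you're not using the SDK, a polling loop is a few lines. A minimal sketch, assuming bearer-token auth and an api.ph0ny.com base URL (neither is confirmed on this page; substitute the host and auth scheme from your dashboard):

ts
// Minimal polling sketch. The base URL and Authorization header are
// assumptions; adjust to your actual API host and auth scheme.
async function pollVideoJob(jobId: string) {
  while (true) {
    const res = await fetch(`https://api.ph0ny.com/v1/video/${jobId}`, {
      headers: { Authorization: `Bearer ${process.env.PH0NY_API_KEY}` },
    })
    const job = await res.json()
    if (job.status === 'completed') return job.result
    if (job.status === 'failed') throw new Error(job.error?.message ?? 'job failed')
    await new Promise((r) => setTimeout(r, 2000)) // stay inside the recommended 1–5s window
  }
}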


Avatar

Talking-head video from a single still image plus audio. Pre-render the audio yourself, or chain through TTS in one call.

POST /v1/video/avatar

Request — pre-rendered audio

json
{
  "image_url": "https://uploads.example.com/headshot.jpg",
  "audio_url": "https://uploads.example.com/script.mp3",
  "provider": "heygen"
}

Request — TTS chaining

json
{
  "image_url": "https://uploads.example.com/headshot.jpg",
  "text": "Welcome to our launch. We're shipping in three weeks.",
  "voice_id": "rachel",
  "tts_provider": "elevenlabs",
  "provider": "hallo3"
}

Field            Type    Notes
image_url        url     Required. JPG/PNG. Face must be visible.
audio_url        url     Provide this or text + voice_id.
text             string  1–10,000 chars. Synthesizes audio first.
voice_id         string  Library or cloned voice. See /voices.
tts_provider     string  TTS engine. Default: default.
target_language  string  If set, translates text before synthesis.
provider         enum    Avatar engine: hallo3, heygen, liveportrait, echomimicv2, v-express, skyreels-a1.

Self-hosted engines (hallo3, liveportrait, echomimicv2, v-express, skyreels-a1) run on ph0ny GPUs and are billed per second of generated video. heygen requires BYOK.


Lipsync

Re-align lip motion in an existing video to a new audio track. Useful for dubbing, ADR, fixing flubbed takes.

POST /v1/video/lipsync
json
{
  "video_url": "https://uploads.example.com/clip.mp4",
  "audio_url": "https://uploads.example.com/replacement.mp3",
  "provider": "sync-labs"
}

Provider         Notes
sync-labs        Best commercial fidelity. BYOK.
heygen-lipsync   If you already use HeyGen.
latentsync       Self-hosted, open-source.
video-retalking  Self-hosted, robust on noisy footage.
musetalk         Realtime; lower latency at lower fidelity.

Dub

End-to-end localization: translate audio, synthesize in the target language, lipsync the original video to the new audio. One call.

POST /v1/video/dub
json
{
  "video_url": "https://uploads.example.com/launch.mp4",
  "target_language": "es-ES",
  "voice_id": "rachel",
  "tts_provider": "elevenlabs",
  "lipsync_provider": "sync-labs"
}

ph0ny will:

  1. Transcribe the source audio (Whisper).
  2. Translate to target_language preserving timing.
  3. Synthesize the new audio using voice_id on tts_provider.
  4. Re-lipsync the video using lipsync_provider.

If voice_id is omitted we use a multilingual voice on the same provider. If source_language is omitted we auto-detect.
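
In the SDK the whole pipeline is one call plus a wait. A sketch, assuming a video.dub method that mirrors the video.avatar method shown in the SDK section below (the dub method name itself is an assumption):

ts
import { ph0ny } from '@ph0ny/sdk'

// Sketch: assumes video.dub mirrors the documented video.avatar method.
const job = await ph0ny.video.dub({
  video_url: 'https://uploads.example.com/launch.mp4',
  target_language: 'es-ES',
  lipsync_provider: 'sync-labs',
})

// Resolves once all four pipeline stages above have finished.
const result = await ph0ny.video.waitForJob(job.id)
console.log(result.video_url)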


Analyze

Multimodal video understanding — semantic search, transcription, scene detection, Q&A.

POST /v1/video/analyze
json
{
  "video_url": "https://uploads.example.com/meeting.mp4",
  "provider": "twelve-labs",
  "tasks": ["transcribe", "scenes", "embed"]
}

Use cases:

  • Search across hours of footage — embed once, query in natural language. Pairs with Collections.
  • Auto-chapter long videos: scenes returns timestamps + descriptions.
  • Q&A over recorded calls — pass the analyze output as agent context.

Provider               Strength
twelve-labs            Best commercial — Marengo + Pegasus. BYOK.
internvideo2.5         Self-hosted; long-form.
qwen2.5-vl / qwen3-vl  Self-hosted; charts and screen capture.
videochat-flash        Realtime conversational video Q&A.
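
A typical flow is to analyze once and reuse the output. A sketch, assuming the SDK exposes video.analyze alongside the other methods (not confirmed on this page; the REST call above is the documented interface):

ts
import { ph0ny } from '@ph0ny/sdk'

// Sketch: video.analyze on the SDK is an assumption.
const job = await ph0ny.video.analyze({
  video_url: 'https://uploads.example.com/meeting.mp4',
  provider: 'twelve-labs',
  tasks: ['transcribe', 'scenes'],
})

const result = await ph0ny.video.waitForJob(job.id)
// The exact result fields per task aren't documented here;
// inspect the completed job's result object.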

Job lifecycle

Every endpoint returns:

json
{
  "id": "vj_…",
  "type": "generate" | "avatar" | "lipsync" | "dub" | "analyze",
  "status": "pending" | "processing" | "completed" | "failed",
  "provider": "runway",
  "progress": 0.42,
  "result": {
    "video_url": "https://r2.ph0ny.com/jobs/vj_…/output.mp4",
    "duration_seconds": 6.0,
    "width": 1280,
    "height": 720,
    "provider": "runway"
  },
  "error": { "code": "provider_timeout", "message": "…" },
  "created_at": "2026-05-05T18:00:00Z",
  "completed_at": "2026-05-05T18:00:42Z",
  "owner_id": "user_…"
}

Poll: GET /v1/video/:jobId (1–5s interval recommended). Webhook: the payload is signed with HMAC-SHA256 using your webhook secret. See Sessions → Webhooks for verifier code.
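
A verifier sketch in Node (the x-ph0ny-signature header name and hex digest encoding are assumptions; Sessions → Webhooks has the canonical version):

ts
import { createHmac, timingSafeEqual } from 'node:crypto'

// Sketch: compares an HMAC-SHA256 of the raw request body against the
// signature header. Header name and hex encoding are assumptions.
function verifyWebhook(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex')
  const a = Buffer.from(expected)
  const b = Buffer.from(signatureHex)
  return a.length === b.length && timingSafeEqual(a, b)
}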


SDK

ts
import { ph0ny } from '@ph0ny/sdk'

const job = await ph0ny.video.avatar({
  image_url: 'https://uploads.example.com/headshot.jpg',
  text: 'Welcome to ph0ny.',
  voice_id: 'rachel',
  provider: 'hallo3',
})

const result = await ph0ny.video.waitForJob(job.id)
console.log(result.video_url)
