LongCat-Video-Avatar: Audio-Driven AI Avatar for Long Video Generation
About
LongCat-Video-Avatar is a state-of-the-art audio-driven avatar model designed for long-duration video generation. Built on the LongCat-Video architecture, it delivers highly realistic lip synchronization, natural human dynamics, and long-term identity consistency, even across arbitrarily long video sequences.
Key Features
Unified Multi-Mode Generation
Supports Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and audio-conditioned video continuation within one pipeline for flexible inputs and workflows.
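A minimal sketch of the mode-dispatch idea, assuming a hypothetical request type; AvatarRequest and select_mode are illustrative names, not the actual LongCat-Video-Avatar API:

```python
# Hypothetical sketch: how a single entry point might dispatch between
# AT2V, ATI2V, and audio-conditioned continuation based on which inputs
# are provided. Names here are illustrative, not the repo's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvatarRequest:
    audio_path: str                        # required for every mode
    prompt: Optional[str] = None           # text description of the character
    reference_image: Optional[str] = None  # appearance reference (ATI2V)
    prior_video: Optional[str] = None      # existing clip to continue

def select_mode(req: AvatarRequest) -> str:
    """Pick the generation mode from whichever conditioning inputs are set."""
    if req.prior_video is not None:
        return "audio_continuation"  # extend an existing video with new audio
    if req.reference_image is not None:
        return "ATI2V"               # audio + text + image -> video
    return "AT2V"                    # audio + text -> video

print(select_mode(AvatarRequest("speech.wav", prompt="a friendly presenter")))
# -> AT2V
```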
Long-Sequence Temporal Stability
Cross-chunk latent stitching prevents degradation and visual noise accumulation, enabling seamless, artifact-free video across arbitrarily long sequences.
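One plausible reading of cross-chunk stitching, sketched with NumPy: each chunk re-uses the last k latent frames of the previous chunk, and the overlap is cross-faded so chunk boundaries do not accumulate noise. The overlap size and blend schedule are assumptions, not confirmed details of the model:

```python
# Toy cross-chunk latent stitching: cross-fade k overlapping latent frames
# between consecutive chunks so seams do not show at chunk boundaries.
import numpy as np

def stitch_chunks(chunks: list[np.ndarray], k: int) -> np.ndarray:
    """chunks: list of latent tensors shaped (frames, channels); k: overlap."""
    video = chunks[0]
    for nxt in chunks[1:]:
        # Linear cross-fade weights over the k overlapping frames.
        w = np.linspace(0.0, 1.0, k)[:, None]
        blended = (1 - w) * video[-k:] + w * nxt[:k]
        video = np.concatenate([video[:-k], blended, nxt[k:]], axis=0)
    return video

chunks = [np.random.randn(16, 8) for _ in range(4)]  # four 16-frame chunks
print(stitch_chunks(chunks, k=4).shape)  # (52, 8): 16 + 3 * (16 - 4)
```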
Natural Human Dynamics & Expressiveness
Disentangled motion guidance decouples motion from speech, producing natural gestures, idle movements, and expressive behavior even during silent segments.
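A hedged sketch of how disentangled guidance could be combined at sampling time, in the style of multi-condition classifier-free guidance with separate audio and motion scales; the exact formulation and scale values used by the model may differ:

```python
# Speculative multi-condition guidance: combine separate audio-conditioned
# and motion-conditioned denoiser outputs with independent scales, so motion
# can stay expressive even when the audio term is silent.
import numpy as np

def disentangled_guidance(eps_uncond, eps_audio, eps_motion,
                          s_audio=4.0, s_motion=2.0):
    """Each eps_* is the denoiser output under one conditioning setting."""
    return (eps_uncond
            + s_audio * (eps_audio - eps_uncond)     # lip-sync / speech term
            + s_motion * (eps_motion - eps_uncond))  # gestures / idle motion

x = np.zeros(4)
print(disentangled_guidance(x, x + 0.1, x - 0.05))  # toy tensors
```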
Identity Preservation Without Copy-Paste Artifacts
Reference Skip Attention maintains consistent character identity over long durations while avoiding rigid, pasted-reference artifacts common in other models.
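A speculative sketch of the intuition: reference-image tokens are attended to only in a subset of blocks, so identity cues persist without the reference being copied verbatim at every layer. This is an assumption about the mechanism, not the repo's implementation:

```python
# Toy "skip" pattern: concatenate reference keys/values into attention only
# in every Nth block, leaving the other blocks free of the reference signal.
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def block_forward(x, ref_tokens, block_idx, skip_every=2):
    """Inject reference keys/values only when block_idx % skip_every == 0."""
    if block_idx % skip_every == 0:
        kv = np.concatenate([x, ref_tokens], axis=0)  # identity cues available
    else:
        kv = x                                        # reference skipped
    return attention(x, kv, kv)

x = np.random.randn(8, 16)    # 8 video tokens, dim 16
ref = np.random.randn(4, 16)  # 4 reference-image tokens
for i in range(4):
    x = block_forward(x, ref, i)
print(x.shape)  # (8, 16)
```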
Efficient High-Resolution Inference
Coarse-to-fine generation and block-sparse attention make 720p/30fps inference practical for production pipelines and rapid iteration.
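An illustrative block-sparse attention mask, assuming a local-block-plus-strided-global pattern; the block size and the actual sparsity pattern in LongCat-Video are assumptions here:

```python
# Build a sparse temporal attention mask: each token attends within its
# local block plus a strided set of global tokens, cutting cost versus
# full attention over all token pairs.
import numpy as np

def block_sparse_mask(n_tokens: int, block: int, global_stride: int) -> np.ndarray:
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for i in range(n_tokens):
        start = (i // block) * block
        mask[i, start:start + block] = True  # local block
        mask[i, ::global_stride] = True      # strided global tokens
    return mask

m = block_sparse_mask(n_tokens=16, block=4, global_stride=8)
print(f"density: {m.mean():.2f}")  # well under 1.0 => fewer attention pairs
```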
How to Use LongCat Video Avatar
1. Upload your audio file (speech, narration, or music).
2. Add an optional reference image or text description for the character's appearance.
3. Configure settings such as resolution, video length, and multi-person options.
4. Generate the video: LongCat-Video-Avatar produces a dynamic, expressive avatar video with smooth motion and synchronized audio.
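The steps above might map onto a command-line wrapper like the following sketch. All argument names and defaults are hypothetical; consult the repository for the real interface:

```python
# Hypothetical end-to-end usage sketch mirroring the steps above. The
# argument names (resolution, num_frames, multi_person) and the missing
# pipeline call are illustrative, not the actual CLI.
import argparse

def main() -> None:
    p = argparse.ArgumentParser(description="LongCat-Video-Avatar (sketch)")
    p.add_argument("--audio", required=True, help="speech/narration/music file")
    p.add_argument("--ref-image", default=None, help="optional appearance reference")
    p.add_argument("--prompt", default=None, help="optional text description")
    p.add_argument("--resolution", default="720p")
    p.add_argument("--num-frames", type=int, default=900)  # ~30 s at 30 fps
    p.add_argument("--multi-person", action="store_true")
    args = p.parse_args()
    # A real run would hand these to the generation pipeline; here we just
    # echo the resolved configuration.
    print(vars(args))

if __name__ == "__main__":
    main()
```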