Video (Experimental)#
Learn how to generate videos with Xinference.
Introduction#
The Video API provides the ability to interact with videos:
The text-to-video endpoint creates videos from scratch based on a text prompt.
The image-to-video endpoint creates videos from an input image.
| API | Endpoint |
| --- | --- |
| Text-to-Video API | /v1/video/generations |
| Image-to-Video API | /v1/video/generations/image |
Supported models#
The text-to-video API is supported by the following models in Xinference:
The image-to-video API is supported by the following models in Xinference:
Quickstart#
Text-to-video#
You can try the text-to-video API out either via cURL or Xinference’s Python client:
curl -X 'POST' \
'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/video/generations' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "<MODEL_UID>",
"prompt": "<your prompt>"
}'
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
input_text = "an apple"
model.text_to_video(input_text)
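The call returns metadata describing the generated video. The sketch below shows one way to persist the result; it assumes an OpenAI-style response whose data list carries either a url or base64-encoded bytes (b64_json), so adjust it to the structure your Xinference version actually returns.
import base64

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Assumption: the response is a dict with a "data" list whose first entry
# exposes either a downloadable "url" or base64-encoded bytes ("b64_json").
response = model.text_to_video("an apple")
video = response["data"][0]
if video.get("b64_json"):
    with open("output.mp4", "wb") as out:
        out.write(base64.b64decode(video["b64_json"]))
else:
    print("Generated video available at:", video.get("url"))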
Image-to-video#
You can try the image-to-video API out either via cURL or Xinference’s Python client:
curl -X 'POST' \
'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/video/generations/image' \
-F model=<MODEL_UID> \
-F image=@xxx.jpg \
-F prompt=<prompt>
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("xxx.jpg", "rb") as f:
    prompt = ""
    model.image_to_video(image=f.read(), prompt=prompt)
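If you would rather call the REST endpoint directly from Python, the sketch below mirrors the cURL request above using the requests library; the form fields (model, image, prompt) come from that example, while the shape of the JSON response depends on your Xinference version.
import requests

# Replace host, port, and model UID with the values of your deployment.
url = "http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/video/generations/image"
with open("xxx.jpg", "rb") as f:
    resp = requests.post(
        url,
        data={"model": "<MODEL_UID>", "prompt": "<your prompt>"},
        files={"image": f},
    )
resp.raise_for_status()
print(resp.json())  # inspect the returned video metadata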
Memory optimization#
Video generation occupies a huge amount of GPU memory; for instance, running CogVideoX may require up to around 35 GB of GPU memory.
Xinference supports several options to optimize video model memory (VRAM) usage:
CPU offloading or block level group offloading.
Layerwise casting.
Note
CPU offloading and Block Level Group Offloading cannot be enabled at the same time, but layerwise casting can be used in combination with either of them.
CPU offloading#
CPU offloading keeps the model weights on the CPU and only loads them to the GPU when a forward pass needs to be executed. It is suitable for scenarios with extremely limited GPU memory, but it has a significant impact on performance.
When running on a GPU with less than 24 GB of memory,
we recommend adding --cpu_offload True when launching the model.
For the Web UI, add an extra option cpu_offload with the value set to True.
xinference launch --model-name Wan2.1-i2v-14B-480p --model-type video --cpu_offload True
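These memory optimization options can also be supplied when launching the model programmatically. The sketch below assumes that the Python client's launch_model forwards extra keyword arguments such as cpu_offload to the video model, mirroring the CLI flag, so treat that pass-through as an assumption.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

# Assumption: extra keyword arguments like cpu_offload are forwarded to the
# video model at launch time, mirroring the --cpu_offload CLI option.
model_uid = client.launch_model(
    model_name="Wan2.1-i2v-14B-480p",
    model_type="video",
    cpu_offload=True,
)
model = client.get_model(model_uid)
The group_offload, use_stream, and layerwise_cast options described below can be passed in the same way.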
Block Level Group Offloading#
Block Level Group Offloading groups multiple internal layers of the model
(such as torch.nn.ModuleList or torch.nn.Sequential) and loads these groups from the CPU to the GPU
as needed during inference. Compared to CPU offloading, it uses more memory but has less impact on performance.
For the command line, add the --group_offload True option; for the Web UI,
add an additional option group_offload with the value set to True.
We can speed up group offloading inference by enabling the use of CUDA streams. However,
using CUDA streams requires moving the model parameters into pinned memory.
This allocation is handled by PyTorch under the hood and can result in a significant spike in CPU RAM usage.
Please consider this option only if your CPU RAM is at least 2x the size of the model you are group offloading.
Enable CUDA streams by adding --use_stream True on the command line; for the Web UI,
add an additional option use_stream with the value set to True.
xinference launch --model-name Wan2.1-i2v-14B-480p --model-type video --group_offload True --use_stream True
Applying Layerwise Casting to the Transformer#
Layerwise casting will downcast each layer’s weights to torch.float8_e4m3fn,
temporarily upcast to torch.bfloat16 during the forward pass of the layer,
then revert to torch.float8_e4m3fn afterward. This approach reduces memory requirements
by approximately 50% while introducing a minor quality reduction in the generated video due to the precision trade-off.
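To see where the roughly 50% figure comes from, here is a back-of-the-envelope sketch: torch.float8_e4m3fn stores one byte per weight versus two bytes for torch.bfloat16, so the weight memory of a hypothetical 14B-parameter transformer is halved.
# Back-of-the-envelope estimate of weight memory; 14e9 parameters is an
# illustrative assumption, not the exact size of any particular model.
num_params = 14e9
bf16_gb = num_params * 2 / 1024**3  # 2 bytes per bfloat16 weight
fp8_gb = num_params * 1 / 1024**3   # 1 byte per float8_e4m3fn weight
print(f"bf16: {bf16_gb:.1f} GB, fp8: {fp8_gb:.1f} GB")  # ~26 GB vs ~13 GB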
Enable layerwise casting by adding --layerwise_cast True on the command line; for the Web UI,
add an additional option layerwise_cast with the value set to True.
The following example requires about 20 GB of VRAM.
xinference launch --model-name Wan2.1-i2v-14B-480p --model-type video --layerwise_cast True --cpu_offload True