Distributed Inference#

Some language models, such as DeepSeek V3 and DeepSeek R1, are too large to fit into the GPUs of a single machine. Xinference supports running these models across multiple machines.

Note

This feature was added in v1.3.0.

Supported Engines#

Currently, Xinference supports the following engines for running models across workers:

  • SGLang (supported in v1.3.0)

  • vLLM (supported in v1.4.1)

Upcoming support: more engines will gain distributed inference support soon.

Usage#

First, you need at least two workers to use distributed inference. Refer to running Xinference in a cluster to create an Xinference cluster consisting of a supervisor and workers.
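For example, a minimal two-worker cluster can be started with the xinference-supervisor and xinference-worker commands; this is a sketch, and the hostname 192.168.0.1 and the default port 9997 are assumptions to adjust for your environment:

    # On the supervisor machine (assumed host: 192.168.0.1)
    xinference-supervisor -H 192.168.0.1

    # On each worker machine, pointing at the supervisor endpoint
    xinference-worker -e "http://192.168.0.1:9997" -H <worker-host>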

Then, if you are using the web UI, set the worker count in the optional configurations to the number of machines you expect; if you are using the command line, add --n-worker <machine number> when launching a model, as shown below. The model will be launched across multiple workers accordingly.
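For instance, launching a large model across two workers from the command line might look like the following sketch; the model name, size, and format here are illustrative assumptions, not prescriptions:

    # Launch a model across 2 workers; flags other than --n-worker
    # follow the usual xinference launch syntax
    xinference launch --model-engine vllm --model-name deepseek-r1 \
        --size-in-billions 671 --model-format pytorch --n-worker 2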

Note

When using distributed inference, the GPU count on the web UI, or --n-gpu on the command line, now means the number of GPUs per worker.
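For example, under this per-worker semantics, launching with --n-worker 2 and --n-gpu 8 uses 2 × 8 = 16 GPUs in total. A sketch, with an illustrative model name:

    # 2 workers x 8 GPUs per worker = 16 GPUs in total
    xinference launch --model-engine sglang --model-name deepseek-v3 \
        --size-in-billions 671 --model-format pytorch \
        --n-worker 2 --n-gpu 8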