Distributed Inference#

Some language models, such as DeepSeek V3 and DeepSeek R1, are too large to fit into the GPUs of a single machine. Xinference supports running these models across multiple machines.

Note

This feature was added in v1.3.0.

Supported Engines#

Currently, Xinference supports the following engines for running models across workers:

  • SGLang (supported in v1.3.0)

  • vLLM (supported in v1.4.1)

Upcoming support: more engines will gain distributed inference support soon.

Usage#

First, you need at least two workers to use distributed inference. Refer to running Xinference in a cluster to create an Xinference cluster consisting of a supervisor and workers.
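For example, a minimal two-worker cluster can be started with the xinference-supervisor and xinference-worker commands; this is a sketch, and the hostname 192.168.0.1 and the default port 9997 are assumptions to adjust for your environment:

    # On the supervisor machine (assumed host: 192.168.0.1)
    xinference-supervisor -H 192.168.0.1

    # On each worker machine, pointing at the supervisor endpoint
    xinference-worker -e "http://192.168.0.1:9997" -H <worker-host>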

Then, if you are using the web UI, set the worker count in the optional configurations to the number of machines you expect; if you are using the command line, add --n-worker <machine number> when launching a model, as shown below. The model will be launched across multiple workers accordingly.
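For instance, launching a large model across two workers from the command line might look like the following sketch; the model name, size, and format here are illustrative assumptions, not prescriptions:

    # Launch a model across 2 workers; flags other than --n-worker
    # follow the usual xinference launch syntax
    xinference launch --model-engine vllm --model-name deepseek-r1 \
        --size-in-billions 671 --model-format pytorch --n-worker 2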

Note

When using distributed inference, the GPU count on the web UI, or --n-gpu on the command line, now means the number of GPUs per worker.
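For example, under this per-worker semantics, launching with --n-worker 2 and --n-gpu 8 uses 2 × 8 = 16 GPUs in total. A sketch, with an illustrative model name:

    # 2 workers x 8 GPUs per worker = 16 GPUs in total
    xinference launch --model-engine sglang --model-name deepseek-v3 \
        --size-in-billions 671 --model-format pytorch \
        --n-worker 2 --n-gpu 8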