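Installation

Xinference can be installed via pip on Linux, Windows, and macOS. To use Xinference for model inference, you can specify a different engine depending on the model.

If you want to be able to run inference on all supported models, install all required dependencies with the following command: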

pip install "xinference[all]"

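Note

If you want to use models in GGUF format, it is recommended to install the required dependencies manually according to your hardware, so that hardware acceleration is fully used. For more details, see the Llama.cpp Engine section.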
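After installing, it can help to see which engine packages are actually importable in the current environment. The following is a minimal sketch: `installed_engines` is a hypothetical helper, and the import names in `ENGINE_MODULES` are assumptions based on each project's usual package layout, not something Xinference defines.

```python
from importlib.util import find_spec

# Engine label -> importable module name.
# These module names are assumptions, not part of Xinference itself.
ENGINE_MODULES = {
    "transformers": "transformers",
    "vllm": "vllm",
    "sglang": "sglang",
    "mlx": "mlx_lm",
    "llama-cpp-python": "llama_cpp",
    "xllamacpp": "xllamacpp",
}


def installed_engines(modules=ENGINE_MODULES):
    """Return {label: bool}, True when the module can be located without importing it."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, ok in installed_engines().items():
        print(f"{label:18s} {'installed' if ok else 'missing'}")
```

A missing entry only means the package is not installed in this interpreter; it says nothing about whether the hardware is supported.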

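If you only want to install the necessary dependencies, the detailed steps follow.

Transformers Engine

The PyTorch (transformers) engine supports almost all of the latest models. It is the default engine for PyTorch models: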

pip install "xinference[transformers]"

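vLLM Engine

vLLM is a high-performance LLM inference engine that supports high concurrency. Xinference automatically selects vllm as the engine, to achieve higher throughput, when all of the following conditions are met:

The model format is pytorch, gptq, or awq.
When the model format is pytorch, the quantization must be none.
When the model format is awq, the quantization must be Int4.
When the model format is gptq, the quantization must be Int3, Int4, or Int8.
The operating system is Linux and at least one CUDA-capable device is available.
The model_family field (for custom models) or the model_name field (for built-in models) is in vLLM's list of supported models.

Currently, the supported models include: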

llama-2, llama-3, llama-3.1, llama-3.2-vision, llama-2-chat, llama-3-instruct, llama-3.1-instruct, llama-3.3-instruct
mistral-v0.1, mistral-instruct-v0.1, mistral-instruct-v0.2, mistral-instruct-v0.3, mistral-nemo-instruct, mistral-large-instruct
codestral-v0.1
Yi, Yi-1.5, Yi-chat, Yi-1.5-chat, Yi-1.5-chat-16k
code-llama, code-llama-python, code-llama-instruct
deepseek, deepseek-coder, deepseek-chat, deepseek-coder-instruct, deepseek-r1-distill-qwen, deepseek-v2-chat, deepseek-v2-chat-0628, deepseek-v2.5, deepseek-v3, deepseek-r1, deepseek-r1-distill-llama
yi-coder, yi-coder-chat
codeqwen1.5, codeqwen1.5-chat
qwen2.5, qwen2.5-coder, qwen2.5-instruct, qwen2.5-coder-instruct
baichuan-2-chat
internlm2-chat
internlm2.5-chat, internlm2.5-chat-1m
qwen-chat
mixtral-instruct-v0.1, mixtral-8x22B-instruct-v0.1
chatglm3, chatglm3-32k, chatglm3-128k
glm4-chat, glm4-chat-1m
codegeex4
qwen1.5-chat, qwen1.5-moe-chat
qwen2-instruct, qwen2-moe-instruct
QwQ-32B-Preview, QwQ-32B
marco-o1
gemma-it, gemma-2-it
gemma-3-it, gemma-3-27b-it, gemma-3-12b-it, gemma-3-4b-it, gemma-3-1b-it
orion-chat, orion-chat-rag
c4ai-command-r-v01
minicpm3-4b
internlm3-instruct
moonlight-16b-a3b-instruct
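The selection conditions above can be read as a simple predicate. The sketch below is illustrative only: `vllm_eligible` is a hypothetical helper, not Xinference's own selection code, and it assumes the caller supplies the number of CUDA devices and the set of supported model names.

```python
import platform

# Allowed quantization values per model format, as listed in the conditions above.
ALLOWED_QUANTIZATION = {
    "pytorch": {"none"},
    "awq": {"Int4"},
    "gptq": {"Int3", "Int4", "Int8"},
}


def vllm_eligible(model_format, quantization, model_name, supported_models,
                  system=None, cuda_device_count=0):
    """Hypothetical check mirroring the documented conditions for choosing vLLM."""
    system = system or platform.system()
    allowed = ALLOWED_QUANTIZATION.get(model_format)
    # The format must be pytorch, gptq, or awq, with a matching quantization.
    if allowed is None or quantization not in allowed:
        return False
    # Linux with at least one CUDA-capable device is required.
    if system != "Linux" or cuda_device_count < 1:
        return False
    # model_family (custom) or model_name (built-in) must be in the supported list.
    return model_name in supported_models


if __name__ == "__main__":
    supported = {"qwen2.5-instruct", "llama-3.1-instruct"}
    print(vllm_eligible("awq", "Int4", "qwen2.5-instruct", supported,
                        system="Linux", cuda_device_count=1))
```

When any condition fails, the predicate returns False, which corresponds to Xinference not auto-selecting vLLM.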

Install xinference and vLLM:

pip install "xinference[vllm]"

# FlashInfer is optional but required for specific functionalities such as sliding window attention with Gemma 2.
# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html
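The FlashInfer wheel has to match the installed torch and CUDA versions, so it may be worth checking them first. A minimal sketch, assuming torch may or may not be installed; `torch_cuda_versions` is a hypothetical helper name.

```python
def torch_cuda_versions():
    """Return (torch_version, cuda_version), or None if torch is not installed.

    cuda_version is None for CPU-only torch builds.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__, torch.version.cuda


if __name__ == "__main__":
    versions = torch_cuda_versions()
    if versions is None:
        print("torch is not installed")
    else:
        print(f"torch {versions[0]}, CUDA {versions[1]}")
```

The reported versions can then be compared against the wheel indexes listed in the FlashInfer installation documentation.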

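Llama.cpp Engine

Xinference supports models in gguf format through xllamacpp or llama-cpp-python. xllamacpp is developed by the Xinference team and will become the sole backend for llama.cpp in the future.

Note

llama-cpp-python is the default option for the llama.cpp backend. To enable xllamacpp, set the environment variable USE_XLLAMACPP=1.

For example, start a local Xinference as follows: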

USE_XLLAMACPP=1 xinference-local
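The same launch can be scripted from Python by passing the environment variable to a child process. This is a sketch under stated assumptions: the helper names `xinference_local_command` and `launch_xinference_local` are hypothetical, and `xinference-local` is assumed to be on the PATH.

```python
import os
import subprocess


def xinference_local_command(use_xllamacpp=True, base_env=None):
    """Build (argv, env) for xinference-local, toggling USE_XLLAMACPP."""
    env = dict(os.environ if base_env is None else base_env)
    if use_xllamacpp:
        env["USE_XLLAMACPP"] = "1"
    else:
        # Remove the variable so the default llama.cpp backend is used.
        env.pop("USE_XLLAMACPP", None)
    return ["xinference-local"], env


def launch_xinference_local(use_xllamacpp=True):
    """Start xinference-local in the foreground; blocks until it exits."""
    argv, env = xinference_local_command(use_xllamacpp)
    return subprocess.run(argv, env=env, check=True)
```

Copying the environment rather than mutating `os.environ` keeps the setting local to the child process.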

Warning

In the upcoming Xinference v1.5.0, xllamacpp will become the default option for llama.cpp and llama-cpp-python will be deprecated. In Xinference v1.6.0, llama-cpp-python will be removed.

Initial setup:

pip install xinference

Installation instructions for xllamacpp:

CPU or Mac Metal:
```
pip install -U xllamacpp
```

CUDA:

pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124

Installation of llama-cpp-python for different hardware:

Apple M-series:

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

NVIDIA GPUs:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

AMD GPUs:

CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
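The three commands above differ only in the value of CMAKE_ARGS, so the choice can be expressed as a lookup. A minimal sketch: the backend labels ("metal", "cuda", "hip") and the helper `llama_cpp_python_install` are illustrative names, while the flags themselves are copied from the commands above.

```python
import os
import sys

# Build flags taken from the commands above; the keys are illustrative labels.
CMAKE_ARGS_BY_BACKEND = {
    "metal": "-DLLAMA_METAL=on",    # Apple M-series
    "cuda": "-DLLAMA_CUBLAS=on",    # NVIDIA GPUs
    "hip": "-DLLAMA_HIPBLAS=on",    # AMD GPUs
}


def llama_cpp_python_install(backend, base_env=None):
    """Return (argv, env) for installing llama-cpp-python with the given backend."""
    if backend not in CMAKE_ARGS_BY_BACKEND:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = CMAKE_ARGS_BY_BACKEND[backend]
    # Run pip through the current interpreter so the package lands in this environment.
    argv = [sys.executable, "-m", "pip", "install", "llama-cpp-python"]
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` to perform the install.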

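SGLang Engine

SGLang has a high-performance inference runtime based on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls. It also supports other common inference techniques, such as continuous batching and tensor parallelism.

Initial setup: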

pip install "xinference[sglang]"

# For CUDA 12.4 & torch 2.4 to support sliding window attention for gemma 2 and llama 3.1 style rope
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

MLX Engine

MLX-lm provides efficient LLM inference on Apple silicon chips.

Initial setup:

pip install "xinference[mlx]"
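Since this engine targets Apple silicon, a quick platform check before installing may save a failed setup. This is a minimal sketch; `is_apple_silicon` is a hypothetical helper, and it assumes a natively running arm64 Python (an x86_64 interpreter under Rosetta would report a different machine type).

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True when running on macOS with an arm64 CPU."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    print("Apple silicon detected" if is_apple_silicon() else "Not Apple silicon")
```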

Other Platforms

Ascend NPU
