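Troubleshooting#

No permission for the Hugging Face repository#

Fetching a model can sometimes fail with a permission error. For example, fetching the llama2 model may produce the following message: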

Cannot access gated repo for url https://huggingface.co/api/models/meta-llama/Llama-2-7b-hf.
Repo model meta-llama/Llama-2-7b-hf is gated. You must be authenticated to access it.

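This usually means that you have not been granted access to the Hugging Face repository, or that no Hugging Face token is configured. The following steps resolve the problem.

Request access to the Hugging Face repository#

To obtain access, open the corresponding Hugging Face repository and accept its terms and conditions. For llama2, request access at https://huggingface.co/meta-llama/Llama-2-7b-hf

Set the Hugging Face access token#

You can find your token on the Hugging Face settings page: https://huggingface.co/settings/tokens

Set the token through an environment variable: export HUGGING_FACE_HUB_TOKEN=your_token_here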
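
Before launching Xinference, it can help to confirm that a token is actually visible in the current shell. The sketch below is a hypothetical helper (hf_token_status is not part of Xinference or huggingface_hub). It checks HUGGING_FACE_HUB_TOKEN as used above, and also HF_TOKEN, which newer huggingface_hub releases read. It only checks that a variable is non-empty, not that the token is valid.

```shell
# Hypothetical helper: report whether a Hugging Face token variable is set.
# It does not contact Hugging Face, so it cannot tell whether the token is valid.
hf_token_status() {
  if [ -n "${HUGGING_FACE_HUB_TOKEN:-}" ] || [ -n "${HF_TOKEN:-}" ]; then
    echo "token set"
  else
    echo "token missing"
  fi
}

hf_token_status
```

If this prints "token missing" in the shell that starts Xinference, the export above was probably run in a different session.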

NVIDIA driver and PyTorch version mismatch#

If you are using an NVIDIA GPU, you may encounter the following error:

UserWarning: CUDA initialization: The NVIDIA driver on your system is too old
(found version 10010). Please update your GPU driver by downloading and
installing a new version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install a PyTorch version that has
been compiled with your version of the CUDA driver. (Triggered internally at
..\c10\cuda\CUDAFunctions.cpp:112.)

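This usually means that the CUDA version on your system is incompatible with the installed PyTorch build.

Install a prebuilt PyTorch that matches your CUDA version from https://pytorch.org. Also check that the installed CUDA version is at least 11.8, ideally between 11.8 and 12.1.

For example, if your CUDA version is 11.8, install the matching PyTorch with: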

pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
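
The number in the warning (found version 10010) is the CUDA version supported by the driver, encoded as 1000 × major + 10 × minor. As a small sketch, the hypothetical helper below (decode_cuda_version is an illustrative name, not a PyTorch or NVIDIA tool) decodes it so you can compare it with the CUDA version of the PyTorch build you plan to install.

```shell
# Hypothetical helper: decode the integer CUDA version reported in the warning.
# CUDA encodes versions as 1000 * major + 10 * minor.
decode_cuda_version() {
  echo "$(( $1 / 1000 )).$(( ($1 % 1000) / 10 ))"
}

decode_cuda_version 10010   # prints 10.1
decode_cuda_version 11080   # prints 11.8
```

In the example warning, the driver supports only CUDA 10.1, which is below the 11.8 minimum mentioned above, so updating the driver is the likely fix.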

External systems cannot access the Xinference service via `<IP>:9997`#

When starting Xinference, remember to add the -H 0.0.0.0 option:

xinference-local -H 0.0.0.0

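The Xinference service then listens on all network interfaces (not just 127.0.0.1 or localhost).

If you use the Docker image, add -p <PORT>:9997 to the docker run command. You can then reach the service at <IP>:<PORT> of the host machine.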
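
As a rough illustration of the port mapping, the sketch below only prints a docker command for review and does not run it. The function name is hypothetical, and the image name xprobe/xinference:latest and host port 9998 are assumptions; substitute the image and port you actually use.

```shell
# Print (not run) a docker command that maps a host port to Xinference's port 9997.
# Assumptions: the default image name/tag and the chosen host port are placeholders.
xinference_docker_cmd() {
  host_port=$1
  image=${2:-xprobe/xinference:latest}
  echo "docker run -p ${host_port}:9997 ${image} xinference-local -H 0.0.0.0"
}

xinference_docker_cmd 9998
# prints: docker run -p 9998:9997 xprobe/xinference:latest xinference-local -H 0.0.0.0
```

With this mapping, the service inside the container still listens on 9997, while external clients connect to port 9998 on the host.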

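Launching built-in models takes a long time, and model downloads sometimes fail#

Xinference uses Hugging Face as the default model source. If your machine is in mainland China, you may have trouble reaching it when using built-in models.

To work around this, set the environment variable XINFERENCE_MODEL_SRC=modelscope when starting Xinference. This switches the model source to ModelScope, which downloads faster in mainland China.

If you start Xinference with Docker, pass -e XINFERENCE_MODEL_SRC=modelscope in the docker command.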

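RayWorkerVllm dies due to OOM when using the official Docker image, so the model fails to load#

Docker's --shm-size option sets the size of shared memory. The default size of shared memory (/dev/shm) is 64MB, which may not be enough for the vLLM backend.

You can increase it with the --shm-size option: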

docker run --shm-size=128g ...
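
To check how much shared memory a container actually has, you can inspect /dev/shm from inside it. The sketch below is a hypothetical helper built on df and awk. It prints the total size in MiB of the filesystem holding a path, defaulting to /dev/shm.

```shell
# Hypothetical helper: print the total size, in MiB, of the filesystem
# that holds the given path (default: /dev/shm).
fs_size_mb() {
  df -Pm "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

fs_size_mb   # run this inside the container to see the effective --shm-size
```

If the value is still close to the 64MB default after restarting the container, the --shm-size option was probably not applied to that container.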

`model_engine` parameter missing when loading an LLM model#

Since v0.11.0, loading an LLM model requires an additional parameter, model_engine. For details, see the Xinference documentation.

Resolving the MKL threading layer conflict#

When starting the Xinference server, you may encounter the error: ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.

The root cause shown in the logs is:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

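This usually happens because your NumPy was installed through conda. conda's NumPy is built with Intel MKL optimizations, which conflicts with the GNU OpenMP library (libgomp) already loaded in the environment.

Solution 1: Override the threading layer#

Setting MKL_THREADING_LAYER=GNU forces the Intel Math Kernel Library (MKL) to use GNU's OpenMP implementation: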

MKL_THREADING_LAYER=GNU xinference-local

Solution 2: Reinstall NumPy with pip#

Uninstall the conda-installed numpy, then reinstall it with pip.

pip uninstall -y numpy && pip install numpy
# Or just use --force-reinstall
pip install --force-reinstall numpy

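Configuring a PyPI mirror to speed up package installation#

If you are in mainland China, using a PyPI mirror can significantly speed up package installation. Some commonly used mirrors:

Tsinghua University mirror: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud mirror: https://mirrors.aliyun.com/pypi/simple/
Tencent Cloud mirror: https://mirrors.cloud.tencent.com/pypi/simple

Note, however, that some mirrors may be missing certain packages. For example, installing xinference[audio] using only the Alibaba Cloud mirror may fail.

This is because the num2words package, which MeloTTS depends on, is not available on the Alibaba Cloud mirror. As a result, pip install xinference[audio] may fall back to old versions such as xinference==1.2.0 and xoscar==0.8.0 (as of October 27, 2025).

These old versions are incompatible and cause the following error: MainActorPool.append_sub_pool() got an unexpected keyword argument 'start_method'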

curl -s https://mirrors.aliyun.com/pypi/simple/num2words/ | grep -i "num2words"
# Returns NOTHING! But it works on Tsinghua or Tencent mirrors.
# uv pip install "xinference[audio]" will then install the following packages (as of Oct 27, 2025):
+ x-transformers==2.10.2
+ xinference==1.2.0
+ xoscar==0.8.0

To avoid this problem when installing the xinference audio extras, use more than one mirror at the same time:

uv pip install "xinference[audio]" --index-url https://mirrors.aliyun.com/pypi/simple --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Optional: Set this globally in your uv config
mkdir -p ~/.config/uv
cat >> ~/.config/uv/uv.toml << EOF
index-url = "https://mirrors.aliyun.com/pypi/simple"
extra-index-url = ["https://pypi.tuna.tsinghua.edu.cn/simple"]
EOF
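
The mirror check above can be generalized to any package. The following is a minimal sketch with hypothetical helper names. It normalizes the package name the way PEP 503 describes, builds the package's "simple" index URL, and then tests the HTTP status. It assumes that a mirror answers with a non-200 status for a package it does not carry, which may not hold for every mirror.

```shell
# Hypothetical helpers for checking whether a mirror serves a given package.

# Normalize a project name as in PEP 503: lowercase, runs of "-", "_", "." become "-".
normalize_name() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed -E 's/[-_.]+/-/g'
}

# Build the "simple" index URL for a package on a mirror.
simple_index_url() {
  printf '%s/%s/\n' "${1%/}" "$(normalize_name "$2")"
}

# Succeed if the mirror answers HTTP 200 for the package page (needs network access).
mirror_has_package() {
  [ "$(curl -sL -o /dev/null -w '%{http_code}' "$(simple_index_url "$1" "$2")")" = "200" ]
}

simple_index_url https://mirrors.aliyun.com/pypi/simple/ num2words
# prints: https://mirrors.aliyun.com/pypi/simple/num2words/
```

For example, mirror_has_package https://pypi.tuna.tsinghua.edu.cn/simple num2words can be run before choosing which mirror to list as the extra index.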

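Installing Xinference 1.12.0 with uv fails (as of November 2025)#

Note: This is a temporary issue caused by the current package ecosystem and by uv's dependency resolution strategy, which prefers higher versions of direct dependencies over the version requirements of transitive dependencies.

Symptoms#

When installing xinference 1.12.0 with uv pip install xinference in November 2025, you may end up with very old versions of some dependencies, in particular:

transformers==4.12.2 (a release from 2021)
tokenizers==0.10.3 (a release from 2021)
huggingface-hub==1.0.1

uv then reports: "Failed to build tokenizers==0.10.3"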

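Root cause#

The problem occurs because uv prefers higher versions of direct dependencies and gives lower priority to the version requirements of transitive dependencies:

xinference 1.12.0 declares huggingface-hub>=0.19.4 as a direct dependency (with no upper bound)
As of November 6, 2025, uv selects the latest version: huggingface-hub==1.0.1
However, transformers<=4.57.3 (a transitive dependency pulled in through peft) requires huggingface-hub<1.0
To resolve the conflict, uv keeps the direct dependency huggingface-hub==1.0.1 and downgrades the transitive dependency transformers to the very old version 4.12.2.

This is by design in uv: it prioritizes the dependencies you specify explicitly (direct dependencies) over transitive ones. Reference: https://github.com/astral-sh/uv/issues/16601

Update: The latest transformers release, 4.57.3 (as of 2026-01-05), still requires huggingface-hub<1.0.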

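Solutions#

Solution 1: Pin the huggingface-hub version up front (recommended)

Explicitly constrain huggingface-hub to a compatible version range:

uv pip install "huggingface-hub>=0.34.0,<1.0" xinference

This forces uv to select a huggingface-hub version that is compatible with modern transformers releases.

Solution 2: Make transformers a direct dependency

When you specify transformers explicitly, it becomes a direct dependency and uv prefers a higher version of it:

uv pip install transformers xinference

Solution 3: Use pip

Alternatively, run pip install xinference, which resolves to the following combination of versions: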

transformers==4.57.1
huggingface-hub==0.36.0
tokenizers==0.22.1
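
After applying any of these solutions, you may want to confirm that the installed huggingface-hub is still below 1.0. The sketch below uses hypothetical helper names and assumes a python executable with huggingface-hub installed is on PATH (on some systems the command is python3).

```shell
# Hypothetical helper: succeed if a huggingface-hub version string is below 1.0.
hub_is_pre_1_0() {
  major=${1%%.*}
  [ "$major" -lt 1 ]
}

# Read the installed version (requires Python with huggingface-hub installed).
installed_hub_version() {
  python -c 'import importlib.metadata as m; print(m.version("huggingface-hub"))'
}

if hub_is_pre_1_0 "0.36.0"; then echo "compatible"; else echo "too new"; fi
# prints: compatible
```

In a real environment, the check would be hub_is_pre_1_0 "$(installed_hub_version)".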

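vLLM + Torch + Xinference compatibility issue (segmentation fault)#

Symptoms#

If you have vLLM < 0.12.0 installed and then upgrade xinference (especially with uv pip install -U xinference), xinference may crash with a segmentation fault at startup: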

root@server:/home# xinference-local --host 0.0.0.0 --port 9997
INFO 12-30 17:35:37 [__init__.py:216] Automatically detected platform cuda.
Aborted (core dumped)

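Root cause#

Three factors combine to cause this problem:

Binary incompatibility: vLLM versions before 0.12.0 are compiled against PyTorch 2.8.0 and are not compatible with PyTorch 2.9. Reference: vLLM v0.12.0 release notes
No upper bound on Xinference's Torch dependency: Xinference's setup.cfg does not specify a maximum PyTorch version: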
```
[options]
install_requires =
    torch                    # No version constraint!
```
This allows package managers to upgrade PyTorch to incompatible versions.
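Different package manager behavior:
- pip: conservative. It upgrades a dependency only when the installed version is incompatible, and otherwise upgrades only the specified package
- uv with -U: more aggressive. It re-resolves **all** dependencies and selects the latest versions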

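So if you only want to upgrade xinference and are not yet ready to upgrade the whole stack, you can use either of the following:

pip install -U xinference (keeps the PyTorch version unchanged and upgrades only xinference)
uv pip install "xinference==1.16.0" (without -U, this also upgrades only xinference)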
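
To see whether an environment is in the problematic state described above, you can compare the installed vLLM and PyTorch versions. The sketch below is a hypothetical helper that encodes only the rule stated in this section (vLLM < 0.12.0 together with PyTorch >= 2.9). It compares only the major and minor components and is not an official compatibility matrix.

```shell
# Hypothetical helper: succeed if the pair (vLLM version, torch version) matches
# the combination described above: vLLM < 0.12.0 together with PyTorch >= 2.9.
# Only the major and minor components are compared.
is_risky_combo() {
  vllm_major=${1%%.*}; vllm_rest=${1#*.}; vllm_minor=${vllm_rest%%.*}
  torch_major=${2%%.*}; torch_rest=${2#*.}; torch_minor=${torch_rest%%.*}

  # vLLM older than 0.12?
  [ "$vllm_major" -eq 0 ] && [ "$vllm_minor" -lt 12 ] || return 1
  # PyTorch 2.9 or newer?
  [ "$torch_major" -gt 2 ] || { [ "$torch_major" -eq 2 ] && [ "$torch_minor" -ge 9 ]; }
}

if is_risky_combo "0.11.0" "2.9.0"; then
  echo "incompatible: pin torch or upgrade vLLM"
else
  echo "ok"
fi
# prints: incompatible: pin torch or upgrade vLLM
```

The version strings can be taken from pip show vllm and pip show torch. Local suffixes such as +cu128 do not affect the comparison.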
