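Usage

Run Xinference Locally

Let's use a classic large language model, llama-2-chat, to show how to run an LLM locally with Xinference.

After this quickstart, you can go on to learn how to deploy Xinference in a distributed cluster environment.

Start the Local Service

First, follow the installation documentation to make sure Xinference is installed locally. Then start a local Xinference service with the following command: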

xinference-local --host 0.0.0.0 --port 9997

INFO     Xinference supervisor 0.0.0.0:64570 started
INFO     Xinference worker 0.0.0.0:64570 started
INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
INFO     Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)

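Note

By default, Xinference uses <HOME>/.xinference as its home directory for storing necessary files such as logs and models, where <HOME> is the current user's home directory.

You can change this directory by setting the XINFERENCE_HOME environment variable, for example: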

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997
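The lookup rule in this note can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Xinference's own implementation; the function name resolve_xinference_home is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_xinference_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the home directory implied by the rule described in the note."""
    env = os.environ if env is None else env
    override = env.get("XINFERENCE_HOME")
    if override:
        # An explicit XINFERENCE_HOME takes precedence.
        return Path(override)
    # Otherwise fall back to <HOME>/.xinference for the current user.
    return Path.home() / ".xinference"


print(resolve_xinference_home({"XINFERENCE_HOME": "/tmp/xinference"}))
print(resolve_xinference_home({}))
```

Passing the environment as an argument keeps the function easy to check without changing the real process environment.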

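Congratulations! You now have a local Xinference service running. Once it is up, you can use it in several ways: through the web UI, cURL, the command line, or the Xinference Python SDK.

Visit http://127.0.0.1:9997/ui to use the web UI, and http://127.0.0.1:9997/docs to browse the API documentation.

To use the Xinference command-line tool or the Python client, install the package with the following command: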

pip install xinference

The command-line tool is xinference. Run the following command to list the available subcommands:

xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --version       Show the version and exit.
  --log-level TEXT
  -H, --host TEXT
  -p, --port INTEGER
  --help              Show this message and exit.

Commands:
  chat
  generate
  launch
  list
  register
  registrations
  terminate
  unregister

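If you only need the Xinference Python SDK, you can install it with minimal dependencies using the following command. Note that the client version must match the version of the Xinference server.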

pip install xinference-client==${SERVER_VERSION}
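A small helper can build the pinned requirement and check an installed client against a known server version. This is a sketch under assumptions: the server version is supplied by you (how you obtain it is not covered here), and all function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def normalize_version(version: str) -> str:
    """Strip whitespace and an optional leading 'v' (e.g. 'v0.10.3' -> '0.10.3')."""
    version = version.strip()
    return version[1:] if version[:1] in ("v", "V") else version


def client_requirement(server_version: str) -> str:
    """Build the pinned requirement string to pass to pip."""
    return f"xinference-client=={normalize_version(server_version)}"


def installed_client_version() -> Optional[str]:
    """Return the installed xinference-client version, or None if it is absent."""
    try:
        return metadata.version("xinference-client")
    except metadata.PackageNotFoundError:
        return None


def client_matches_server(server_version: str) -> bool:
    """True only if the client is installed and its version equals the server's."""
    installed = installed_client_version()
    if installed is None:
        return False
    return normalize_version(installed) == normalize_version(server_version)


print(client_requirement("v0.10.3"))
```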

About Model Inference Engines

Since v0.11.0, you must specify an inference engine before launching an LLM. Xinference currently supports the following inference engines:

vllm
sglang
llama.cpp
transformers

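For details on these inference engines, refer to the inference engine documentation.

Note that when launching an LLM, the engines that can run it depend closely on the model_format and quantization parameters.

Xinference provides the xinference engine command to help you query the valid parameter combinations.

For example:

I want to query the parameter combinations for the qwen-chat model to decide how it can run on the various inference engines.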

xinference engine -e <xinference_endpoint> --model-name qwen-chat

I want to run qwen-chat on the vLLM inference engine, but I don't know which other parameters are compatible with it.

xinference engine -e <xinference_endpoint> --model-name qwen-chat --model-engine vllm

I want to launch qwen-chat in GGUF format and need to know the remaining parameter combinations.

xinference engine -e <xinference_endpoint> --model-name qwen-chat -f ggufv2

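In summary, compared with earlier versions, launching an LLM now requires the additional model_engine parameter. You can use the xinference engine command to query how your chosen inference engine relates to the other parameters.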
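When these queries are scripted, it can help to assemble the command from its parts. The sketch below builds the argument list for the three queries shown above, using only the flags that appear in those examples; the helper name engine_query_argv is hypothetical, and nothing is executed unless you pass the list to subprocess.run yourself.

```python
import shlex
from typing import List, Optional


def engine_query_argv(
    endpoint: str,
    model_name: str,
    model_engine: Optional[str] = None,
    model_format: Optional[str] = None,
) -> List[str]:
    """Build an `xinference engine` command line from optional filters."""
    argv = ["xinference", "engine", "-e", endpoint, "--model-name", model_name]
    if model_engine is not None:
        argv += ["--model-engine", model_engine]
    if model_format is not None:
        argv += ["-f", model_format]
    return argv


argv = engine_query_argv("http://127.0.0.1:9997", "qwen-chat", model_engine="vllm")
print(shlex.join(argv))
# To actually run the query: subprocess.run(argv, check=True)
```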

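Run Llama-2

Let's run the built-in llama-2-chat model. The first time you run a model, Xinference downloads its weights from Hugging Face, which typically takes 10 to 30 minutes depending on the model size. Xinference caches the downloaded files locally, so later launches of the same model do not download them again.

Note

Xinference can also download models from other model hubs. To do so, set an environment variable when starting Xinference. For example, to download models from ModelScope, use the following command: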

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

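Use the --model-uid or -u option to specify the model's UID; if it is omitted, Xinference generates a random ID. The commands below launch the model with the UID my-llama-2, using the command line, cURL, and Python respectively: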

xinference launch --model-engine <inference_engine> -u my-llama-2 -n llama-2-chat -s 13 -f pytorch

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "<inference_engine>",
  "model_uid": "my-llama-2",
  "model_name": "llama-2-chat",
  "model_format": "pytorch",
  "size_in_billions": 13
}'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
  model_engine="<inference_engine>",
  model_uid="my-llama-2",
  model_name="llama-2-chat",
  model_format="pytorch",
  size_in_billions=13
)
print('Model uid: ' + model_uid)

Model uid: my-llama-2
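If neither the xinference package nor cURL is available, the same launch call can be made with the Python standard library. The sketch below mirrors the cURL example above (same URL, headers, and JSON fields). The helper names and the timeout value are assumptions chosen for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request
from typing import Any, Dict


def build_launch_request(base_url: str, **model_spec: Any) -> urllib.request.Request:
    """Create, but do not send, the POST /v1/models request from the cURL example."""
    body = json.dumps(model_spec).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/models",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def send_json(request: urllib.request.Request, timeout: float = 600.0) -> Dict[str, Any]:
    """Send a request and decode the JSON response (timeout is an arbitrary choice)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_launch_request(
    "http://127.0.0.1:9997",
    model_engine="<inference_engine>",  # placeholder, as in the examples above
    model_uid="my-llama-2",
    model_name="llama-2-chat",
    model_format="pytorch",
    size_in_billions=13,
)
print(request.get_method(), request.full_url)
# With a server running: print(send_json(request))
```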

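Note

Some inference engines, such as vllm, accept engine-specific parameters at launch time. In that case, pass the parameter names and values directly on the command line, for example: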

xinference launch --model-engine vllm -u my-llama-2 -n llama-2-chat -s 13 -f pytorch --gpu_memory_utilization 0.9

When the model is launched, gpu_memory_utilization=0.9 is passed through to the vllm backend.
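The mapping from engine parameters to extra command-line flags can be sketched as follows. This only reproduces the flag style shown in the command above (`--name value`); the helper engine_flags is hypothetical, and which parameters a given engine accepts is outside what this snippet checks.

```python
from typing import Any, Dict, List


def engine_flags(engine_kwargs: Dict[str, Any]) -> List[str]:
    """Turn {'name': value} pairs into ['--name', 'value'] command-line tokens."""
    flags: List[str] = []
    for name, value in engine_kwargs.items():
        flags += [f"--{name}", str(value)]
    return flags


base = [
    "xinference", "launch", "--model-engine", "vllm",
    "-u", "my-llama-2", "-n", "llama-2-chat", "-s", "13", "-f", "pytorch",
]
argv = base + engine_flags({"gpu_memory_utilization": 0.9})
print(" ".join(argv))
```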

Congratulations, you now have llama-2-chat running through Xinference. Once the model is running, you can interact with it from the command line, cURL, or Python:

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my-llama-2",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
  }'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llama-2")
print(model.chat(
    prompt="What is the largest animal?",
    system_prompt="You are a helpful assistant.",
    chat_history=[]
))

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "my-llama-2",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
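In practice you usually want only the assistant's text from a response like the one above. The sketch below works on a plain dict with that shape. The helper names are hypothetical, and treating a negative total_tokens as "usage not reported" is an assumption based solely on the -1 values in this sample.

```python
from typing import Any, Dict


def extract_reply(completion: Dict[str, Any]) -> str:
    """Return the assistant message text from a chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["message"]["content"]


def token_usage_known(completion: Dict[str, Any]) -> bool:
    """Assumption: negative counts (as in the sample) mean usage was not reported."""
    usage = completion.get("usage") or {}
    total = usage.get("total_tokens", -1)
    return isinstance(total, int) and total >= 0


sample = {
    "model": "my-llama-2",
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "The blue whale."}}
    ],
    "usage": {"prompt_tokens": -1, "completion_tokens": -1, "total_tokens": -1},
}
print(extract_reply(sample))
print(token_usage_known(sample))
```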

Xinference provides an OpenAI-compatible API, so a model served by Xinference can act as a local drop-in replacement for OpenAI. For example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
    model="my-llama-2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)

The following OpenAI APIs are supported:

Chat completions: https://platform.openai.com/docs/api-reference/chat
Completions: https://platform.openai.com/docs/api-reference/completions
Embeddings: https://platform.openai.com/docs/api-reference/embeddings

Manage Models

Beyond launching models, Xinference can manage a model's entire lifecycle. As before, you can use the command line, cURL, or Python:

To list all models of a given type that Xinference supports:

xinference registrations -t LLM

curl http://127.0.0.1:9997/v1/model_registrations/LLM

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_model_registrations(model_type='LLM'))

To list all running models:

xinference list

curl http://127.0.0.1:9997/v1/models

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_models())

When you no longer need a running model, stop it and release its resources as follows:

xinference terminate --model-uid "my-llama-2"

curl -X DELETE http://127.0.0.1:9997/v1/models/my-llama-2

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
client.terminate_model(model_uid="my-llama-2")
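Because a running model holds resources until it is terminated, one convenient pattern is to tie launch and terminate together in a context manager. This is a sketch, not part of the Xinference SDK: running_model is a hypothetical helper that relies only on the launch_model and terminate_model calls shown in this guide.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def running_model(client: Any, **launch_kwargs: Any) -> Iterator[str]:
    """Launch a model, yield its UID, and always terminate it on exit."""
    model_uid = client.launch_model(**launch_kwargs)
    try:
        yield model_uid
    finally:
        # Runs on normal exit and on exceptions, so resources are released either way.
        client.terminate_model(model_uid=model_uid)


# Usage against a real server (not executed here):
#
#   from xinference.client import RESTfulClient
#   client = RESTfulClient("http://127.0.0.1:9997")
#   with running_model(
#       client,
#       model_engine="<inference_engine>",
#       model_uid="my-llama-2",
#       model_name="llama-2-chat",
#       model_format="pytorch",
#       size_in_billions=13,
#   ) as uid:
#       print(client.list_models())
```

The trade-off is that the model is reloaded on every entry, so this suits short-lived jobs and tests rather than a long-running service.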

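Deploy Xinference in a Cluster

To deploy Xinference in a cluster, start a supervisor node on one machine and start worker nodes on the same machine or on other machines.

First, follow the installation documentation to make sure Xinference is installed on every server. Then proceed as follows:

Start the Supervisor

Run the following command on the server to start the supervisor node: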

xinference-supervisor -H "${supervisor_host}"

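Replace ${supervisor_host} with the IP address of the current node.

The web UI is available at http://${supervisor_host}:9997/ui, and the API documentation at http://${supervisor_host}:9997/docs.

Start the Workers

Run the following command on each machine that should host an Xinference worker: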

xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

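Note

${worker_host} must be replaced with the IP address of the current worker node.

Note

To interact with the cluster from the command line, specify the supervisor's address with the -e or --endpoint option, for example: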

xinference launch --model-engine <inference_engine> -n llama-2-chat -s 13 -f pytorch -e "http://${supervisor_host}:9997"
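The same rule applies to Python clients: they should point at the supervisor, not at a worker. The sketch below derives the endpoint and the worker command from host addresses. The helper names are hypothetical, and the IP addresses are placeholders from the documentation-only range 192.0.2.0/24.

```python
from typing import List


def supervisor_endpoint(supervisor_host: str, port: int = 9997) -> str:
    """Build the supervisor URL used by workers, the CLI (-e), and clients."""
    return f"http://{supervisor_host}:{port}"


def worker_argv(supervisor_host: str, worker_host: str, port: int = 9997) -> List[str]:
    """Build the xinference-worker command for one worker machine."""
    return [
        "xinference-worker",
        "-e", supervisor_endpoint(supervisor_host, port),
        "-H", worker_host,
    ]


endpoint = supervisor_endpoint("192.0.2.10")
print(endpoint)
print(" ".join(worker_argv("192.0.2.10", "192.0.2.11")))

# A Python client would then connect to the supervisor (not executed here):
#   from xinference.client import RESTfulClient
#   client = RESTfulClient(endpoint)
```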

Deploy Xinference with Docker

Use the following commands to run Xinference in a container:

On a machine with NVIDIA GPUs

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug

On a CPU-only machine

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Replace <your_version> with the Xinference version, for example v0.10.3; use latest for the most recent release.

For more on Docker usage, see the Docker image documentation.
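The two commands above differ only in the --gpus flag and the -cpu image suffix, and both publish container port 9997 on host port 9998, so from the host the service is reached on port 9998. The sketch below captures that; the helper names are hypothetical and the commands are only assembled, not run.

```python
from typing import List


def docker_run_argv(version: str, gpu: bool = True, host_port: int = 9998) -> List[str]:
    """Assemble the docker run command shown above for a GPU or CPU-only machine."""
    image = f"xprobe/xinference:{version}" + ("" if gpu else "-cpu")
    argv = [
        "docker", "run",
        "-e", "XINFERENCE_MODEL_SRC=modelscope",
        "-p", f"{host_port}:9997",  # host_port on the host -> 9997 in the container
    ]
    if gpu:
        argv += ["--gpus", "all"]
    argv += [image, "xinference-local", "-H", "0.0.0.0", "--log-level", "debug"]
    return argv


def host_endpoint(host_port: int = 9998) -> str:
    """URL to use from the host machine, given the published port."""
    return f"http://127.0.0.1:{host_port}"


print(" ".join(docker_run_argv("v0.10.3", gpu=False)))
print(host_endpoint())
```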

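Run Xinference on Kubernetes

To run Xinference on Kubernetes, use KubeBlocks to handle the installation.

This guide assumes you already have a working Kubernetes environment.

Download the KubeBlocks command-line tool, kbcli; see the KubeBlocks documentation.

Make sure the kbcli version is at least v0.7.1.

Install KubeBlocks with kbcli; see the kbcli documentation.

Enable the Xinference addon with the following command: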

kbcli addon enable xinference

Use kbcli to create the Xinference cluster:

kbcli cluster create xinference

If the Kubernetes nodes have no GPU devices, add an extra flag:

kbcli cluster create xinference --cpu-mode

Use -h to view the help text:

kbcli cluster create xinference -h
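For automation, the kbcli steps above can be expressed as a short command list with a version check in front. This is a sketch under assumptions: the kbcli version string is supplied by you, versions are plain dotted numbers (no pre-release suffixes), and the helper names are hypothetical.

```python
from typing import List, Tuple

MIN_KBCLI_VERSION = "v0.7.1"  # minimum stated in this guide


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse 'v0.7.1' or '0.7.1' into (0, 7, 1); assumes purely numeric parts."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def version_at_least(found: str, required: str = MIN_KBCLI_VERSION) -> bool:
    """Compare versions numerically, so that 0.10.0 counts as newer than 0.7.1."""
    return parse_version(found) >= parse_version(required)


def kbcli_steps(cpu_mode: bool = False) -> List[List[str]]:
    """Return the kbcli commands from this section, in order."""
    create = ["kbcli", "cluster", "create", "xinference"]
    if cpu_mode:
        create.append("--cpu-mode")
    return [["kbcli", "addon", "enable", "xinference"], create]


for step in kbcli_steps(cpu_mode=True):
    print(" ".join(step))
```

Comparing parsed tuples rather than raw strings avoids the pitfall where "0.10.0" sorts before "0.7.1" lexically.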