Using Xinference#

Run Xinference Locally#

Let’s start by running Xinference on a local machine and serving a classic LLM model: llama-2-chat.

After this quickstart, you will move on to learning how to deploy Xinference in a cluster environment.

Start Local Server#

First, please ensure that you have installed Xinference according to the instructions provided here. To start a local instance of Xinference, run the following command:

xinference-local --host 0.0.0.0 --port 9997

Note

By default, Xinference uses <HOME>/.xinference as its home path to store necessary files such as logs and models, where <HOME> is the home directory of the current user.

You can change this directory by configuring the environment variable XINFERENCE_HOME. For example:

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997

Congrats! You now have Xinference running on your local machine. Once Xinference is running, there are multiple ways we can try it: via the web UI, via cURL, via the command line, or via Xinference’s Python client.

You can visit the web UI at http://127.0.0.1:9997/ui and visit http://127.0.0.1:9997/docs to inspect the API docs.

You can install the Xinference command line tool and Python client using the following command:

pip install xinference

The command line tool is xinference. You can list the commands that can be used by running:

xinference --help

You can install the Xinference Python client with minimal dependencies using the following command. Please ensure that the version of the client matches the version of the Xinference server.

pip install xinference-client==${SERVER_VERSION}
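
For instance, once the client is installed you can point it at the local server and check connectivity. The snippet below is a minimal sketch that assumes the full xinference package installed above, where the client lives under xinference.client; with the standalone xinference-client package the import path may differ:

from xinference.client import RESTfulClient

# Connect to the local Xinference server started above
client = RESTfulClient("http://127.0.0.1:9997")

# List the models that are currently running (empty right after startup)
print(client.list_models())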

About Model Engine#

Since v0.11.0, before launching an LLM model, you need to specify the inference engine you want to run it with. Currently, Xinference supports the following inference engines:

  • vllm

  • sglang

  • llama.cpp

  • transformers

For details about these inference engines, please refer to here.

Note that when launching an LLM model, the model_format and quantization of the model you want to launch are closely related to the inference engine.

You can use the xinference engine command to query the valid parameter combinations for the model you want to launch. This shows under what conditions a model can run on which inference engines.

For example:

  1. I would like to query which inference engines the qwen-chat model can run on, and what their respective parameters are.

xinference engine -e <xinference_endpoint> --model-name qwen-chat

  2. I want to run qwen-chat with vLLM as the inference engine, but I don’t know how to configure the other parameters.

xinference engine -e <xinference_endpoint> --model-name qwen-chat --model-engine vllm

  3. I want to launch the qwen-chat model in the GGUF format, and I need to know how to configure the remaining parameters.

xinference engine -e <xinference_endpoint> --model-name qwen-chat -f ggufv2

In summary, compared to previous versions, when launching LLM models, you need to additionally pass the model_engine parameter. You can retrieve information about the supported inference engines and their related parameter combinations through the xinference engine command.
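
The same requirement applies when launching models programmatically. Below is a hedged sketch using the Python client’s launch_model call; it assumes a server and client of v0.11.0 or later, where launch_model accepts a model_engine argument, and picks transformers purely as an example engine:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# model_engine is required since v0.11.0; choose one of the engines
# reported by the `xinference engine` command for this model
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_engine="transformers",
    model_size_in_billions=7,
    model_format="pytorch",
)
print(model_uid)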

Run Llama-2#

Let’s start by running a built-in model: llama-2-chat. When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.

Note

Xinference also allows you to download models from other sites. You can do this by setting an environment variable when launching Xinference. For example, if you want to download models from modelscope, do the following:

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

We can specify the model’s UID using the --model-uid or -u flag. If not specified, Xinference will generate a unique ID. The following command creates a new model instance with the unique ID my-llama-2:

xinference launch --model-engine <inference_engine> -u my-llama-2 -n llama-2-chat -s 13 -f pytorch

Note

For some engines, such as vLLM, users need to specify engine-related parameters when running models. In this case, you can specify the parameter name and value directly on the command line, for example:

xinference launch --model-engine vllm -u my-llama-2 -n llama-2-chat -s 13 -f pytorch --gpu_memory_utilization 0.9

gpu_memory_utilization=0.9 will be passed to vLLM when launching the model.
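
If you launch models through the Python client instead of the command line, the same idea applies: in recent versions, extra keyword arguments to launch_model are assumed to be forwarded to the engine. A hedged sketch mirroring the CLI example above:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# Extra keyword arguments (here gpu_memory_utilization) are forwarded to vLLM
client.launch_model(
    model_name="llama-2-chat",
    model_engine="vllm",
    model_uid="my-llama-2",
    model_size_in_billions=13,
    model_format="pytorch",
    gpu_memory_utilization=0.9,
)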

Congrats! You now have llama-2-chat served by Xinference. Once the model is running, we can try it out either via cURL or via Xinference’s Python client:

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my-llama-2",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
  }'

Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
    model="my-llama-2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)
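
You can also talk to the model through Xinference’s own Python client. This is a minimal sketch: in recent client versions Model.chat accepts an OpenAI-style messages list, while older versions expect a prompt string, so check the signature of your installed client:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llama-2")

# Recent clients accept OpenAI-style messages; older ones take a prompt string
print(model.chat(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"},
    ]
))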

The following OpenAI APIs are supported:

Manage Models#

In addition to launching models, Xinference offers various ways to manage the entire lifecycle of models. You can manage models in Xinference through the command line, cURL, or Xinference’s python client.

You can list all models of a certain type that are available to launch in Xinference:

xinference registrations -t LLM

The following command gives you the currently running models in Xinference:

xinference list

When you no longer need a model that is currently running, you can remove it in the following way to free up the resources it occupies:

xinference terminate --model-uid "my-llama-2"
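
The same lifecycle operations are available from the Python client. A brief sketch, assuming the list_models and terminate_model methods of RESTfulClient:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# Show the currently running models, then free the one we no longer need
print(client.list_models())
client.terminate_model(model_uid="my-llama-2")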

Deploy Xinference In a Cluster#

To deploy Xinference in a cluster, you need to start a Xinference supervisor on one server and Xinference workers on the other servers.

First, make sure you have already installed Xinference on each of the servers according to the instructions provided here. Then follow the steps below:

Start the Supervisor#

On the server where you want to run the Xinference supervisor, run the following command:

xinference-supervisor -H "${supervisor_host}"

Replace ${supervisor_host} with the actual host of your supervisor server.

You can visit the supervisor’s web UI at http://${supervisor_host}:9997/ui and visit http://${supervisor_host}:9997/docs to inspect the API docs.

Start the Workers#

On each of the other servers where you want to run Xinference workers, run the following command:

xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

Note

You must replace ${worker_host} with the actual host of your worker server.

Note

If you need to interact with Xinference in a cluster via the command line, you should include the -e or --endpoint flag to specify the supervisor server’s endpoint. For example:

xinference launch -n llama-2-chat -s 13 -f pytorch -e "http://${supervisor_host}:9997"
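
Likewise, when using the Python client against a cluster, point it at the supervisor’s endpoint instead of a local instance (a short sketch; the supervisor host below is a placeholder):

from xinference.client import RESTfulClient

# Replace with the actual host of your supervisor server
client = RESTfulClient("http://${supervisor_host}:9997")
print(client.list_models())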

Using Xinference With Docker#

To start Xinference in a Docker container, run one of the following commands depending on your host:

Run On Nvidia GPU Host#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug

Run On CPU Only Host#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Replace <your_version> with a specific Xinference version, e.g. v0.10.3; latest can be used for the latest version.

For more docker usage, refer to Using Docker Image.

Using Xinference On Kubernetes#

To use Xinference on Kubernetes, KubeBlocks is required to assist with the installation.

The following steps assume Kubernetes is already installed.

  1. Download the cli tool kbcli for KubeBlocks, see install kbcli.

Make sure the kbcli version is at least v0.7.1.

  2. Install KubeBlocks using the kbcli command, see install KubeBlocks with kbcli.

  3. Enable the Xinference addon by running the following command:

kbcli addon enable xinference

  4. Use kbcli to start an Xinference cluster by running the following command:

kbcli cluster create xinference

If the Kubernetes node doesn’t have a GPU on it, run the command with an extra flag:

kbcli cluster create xinference --cpu-mode

Use -h to read the help documentation for more options:

kbcli cluster create xinference -h

What’s Next?#

Congratulations on getting started with Xinference! To help you navigate and make the most out of this powerful tool, here are some resources and guides: