Using Xinference#

Run Xinference Locally#

Let’s start by running Xinference on a local machine and serving a classic LLM model: llama-2-chat.

After this quickstart, you will move on to learning how to deploy Xinference in a cluster environment.

Start Local Server#

First, please ensure that you have installed Xinference according to the instructions provided here. To start a local instance of Xinference, run the following command:

xinference-local --host 0.0.0.0 --port 9997

Note

By default, Xinference uses <HOME>/.xinference as its home path to store necessary files such as logs and models, where <HOME> is the home directory of the current user.

You can change this directory by configuring the environment variable XINFERENCE_HOME. For example:

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997

Congrats! You now have Xinference running on your local machine. Once Xinference is running, there are multiple ways we can try it: via the web UI, via cURL, via the command line, or via Xinference’s Python client.

You can visit the web UI at http://127.0.0.1:9997/ui and visit http://127.0.0.1:9997/docs to inspect the API docs.

You can install the Xinference command line tool and Python client using the following command:

pip install xinference

The command line tool is xinference. You can list the commands that can be used by running:

xinference --help

You can install the Xinference Python client with minimal dependencies using the following command. Please ensure that the version of the client matches the version of the Xinference server.

pip install xinference-client==${SERVER_VERSION}
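
For instance, once the client is installed you can point it at the local server and check connectivity. The snippet below is a minimal sketch that assumes the full xinference package installed above, where the client lives under xinference.client; with the standalone xinference-client package the import path may differ:

from xinference.client import RESTfulClient

# Connect to the local Xinference server started above
client = RESTfulClient("http://127.0.0.1:9997")

# List the models that are currently running (empty right after startup)
print(client.list_models())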

About Model Engine#

Since v0.11.0, before launching an LLM model, you need to specify the inference engine you want to run it with. Currently, Xinference supports the following inference engines:

  • vllm

  • sglang

  • llama.cpp

  • transformers

For details about these inference engines, please refer to here.

Note that when launching an LLM model, the model_format and quantization of the model you want to launch are closely related to the inference engine.

You can use the xinference engine command to query the valid parameter combinations for the model you want to launch. This shows under what conditions a model can run on which inference engines.

For example:

  1. I would like to query which inference engines the qwen-chat model can run on, and what their respective parameters are.

xinference engine -e <xinference_endpoint> --model-name qwen-chat

  2. I want to run qwen-chat with vLLM as the inference engine, but I don’t know how to configure the other parameters.

xinference engine -e <xinference_endpoint> --model-name qwen-chat --model-engine vllm

  3. I want to launch the qwen-chat model in the GGUF format, and I need to know how to configure the remaining parameters.

xinference engine -e <xinference_endpoint> --model-name qwen-chat -f ggufv2

In summary, compared to previous versions, when launching LLM models, you need to additionally pass the model_engine parameter. You can retrieve information about the supported inference engines and their related parameter combinations through the xinference engine command.
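
The same requirement applies when launching models programmatically. Below is a hedged sketch using the Python client’s launch_model call; it assumes a server and client of v0.11.0 or later, where launch_model accepts a model_engine argument, and picks transformers purely as an example engine:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# model_engine is required since v0.11.0; choose one of the engines
# reported by the `xinference engine` command for this model
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_engine="transformers",
    model_size_in_billions=7,
    model_format="pytorch",
)
print(model_uid)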

Run Llama-2#

Let’s start by running a built-in model: llama-2-chat. When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.

Note

Xinference also allows you to download models from other sites. You can do this by setting an environment variable when launching Xinference. For example, if you want to download models from modelscope, do the following:

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

We can specify the model’s UID using the --model-uid or -u flag. If not specified, Xinference will generate a unique ID. The following command creates a new model instance with the unique ID my-llama-2:

xinference launch --model-engine <inference_engine> -u my-llama-2 -n llama-2-chat -s 13 -f pytorch

Note

For some engines, such as vLLM, users need to specify engine-related parameters when running models. In this case, you can specify the parameter name and value directly on the command line, for example:

xinference launch --model-engine vllm -u my-llama-2 -n llama-2-chat -s 13 -f pytorch --gpu_memory_utilization 0.9

gpu_memory_utilization=0.9 will be passed to vLLM when launching the model.
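
If you launch models through the Python client instead of the command line, the same idea applies: in recent versions, extra keyword arguments to launch_model are assumed to be forwarded to the engine. A hedged sketch mirroring the CLI example above:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# Extra keyword arguments (here gpu_memory_utilization) are forwarded to vLLM
client.launch_model(
    model_name="llama-2-chat",
    model_engine="vllm",
    model_uid="my-llama-2",
    model_size_in_billions=13,
    model_format="pytorch",
    gpu_memory_utilization=0.9,
)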

Congrats! You now have llama-2-chat served by Xinference. Once the model is running, we can try it out either via cURL or via Xinference’s Python client:

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my-llama-2",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
  }'

Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
    model="my-llama-2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)
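
You can also talk to the model through Xinference’s own Python client. This is a minimal sketch: in recent client versions Model.chat accepts an OpenAI-style messages list, while older versions expect a prompt string, so check the signature of your installed client:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("my-llama-2")

# Recent clients accept OpenAI-style messages; older ones take a prompt string
print(model.chat(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"},
    ]
))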

The following OpenAI APIs are supported:

Manage Models#

In addition to launching models, Xinference offers various ways to manage the entire lifecycle of models. You can manage models in Xinference through the command line, cURL, or Xinference’s python client.

You can list all models of a certain type that are available to launch in Xinference:

xinference registrations -t LLM

The following command gives you the currently running models in Xinference:

xinference list

When you no longer need a model that is currently running, you can remove it in the following way to free up the resources it occupies:

xinference terminate --model-uid "my-llama-2"
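
The same lifecycle operations are available from the Python client. A brief sketch, assuming the list_models and terminate_model methods of RESTfulClient:

from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:9997")

# Show the currently running models, then free the one we no longer need
print(client.list_models())
client.terminate_model(model_uid="my-llama-2")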

Deploy Xinference In a Cluster#

To deploy Xinference in a cluster, you need to start a Xinference supervisor on one server and Xinference workers on the other servers.

First, make sure you have already installed Xinference on each of the servers according to the instructions provided here. Then follow the steps below:

Start the Supervisor#

On the server where you want to run the Xinference supervisor, run the following command:

xinference-supervisor -H "${supervisor_host}"

Replace ${supervisor_host} with the actual host of your supervisor server.

You can visit the supervisor’s web UI at http://${supervisor_host}:9997/ui and visit http://${supervisor_host}:9997/docs to inspect the API docs.

Start the Workers#

On each of the other servers where you want to run Xinference workers, run the following command:

xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

Note

You must replace ${worker_host} with the actual host of your worker server.

Note

If you need to interact with Xinference in a cluster via the command line, you should include the -e or --endpoint flag to specify the supervisor server’s endpoint. For example:

xinference launch -n llama-2-chat -s 13 -f pytorch -e "http://${supervisor_host}:9997"
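
Likewise, when using the Python client against a cluster, point it at the supervisor’s endpoint instead of a local instance (a short sketch; the supervisor host below is a placeholder):

from xinference.client import RESTfulClient

# Replace with the actual host of your supervisor server
client = RESTfulClient("http://${supervisor_host}:9997")
print(client.list_models())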

Using Xinference With Docker#

To start Xinference in a Docker container, run one of the following commands depending on your host:

Run On Nvidia GPU Host#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug

Run On CPU Only Host#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Replace <your_version> with a specific Xinference version, e.g. v0.10.3; latest can be used for the latest version.

For more docker usage, refer to Using Docker Image.

Using Xinference On Kubernetes#

To use Xinference on Kubernetes, KubeBlocks is required to assist with the installation.

The following steps assume Kubernetes is already installed.

  1. Download the cli tool kbcli for KubeBlocks, see install kbcli.

Make sure the kbcli version is at least v0.7.1.

  2. Install KubeBlocks using the kbcli command, see install KubeBlocks with kbcli.

  3. Enable the Xinference addon by running the following command:

kbcli addon enable xinference

  4. Use kbcli to start an Xinference cluster by running the following command:

kbcli cluster create xinference

If the Kubernetes node doesn’t have a GPU on it, run the command with an extra flag:

kbcli cluster create xinference --cpu-mode

Use -h to read the help documentation for more options:

kbcli cluster create xinference -h

What’s Next?#

Congratulations on getting started with Xinference! To help you navigate and make the most out of this powerful tool, here are some resources and guides: