Continuous Batching (experimental)#
Continuous batching improves throughput during model serving and is already implemented in inference engines such as vLLM.
Xinference aims to provide this optimization capability when using the transformers engine as well.
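To make the throughput claim concrete, here is a toy model of the idea, not Xinference's actual scheduler. It assumes each request needs a known number of decode steps and that one step produces one token for every running sequence. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # A static batch runs as long as its longest sequence.
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots before every step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One step yields one token per sequence; finished sequences leave.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 10, 2, 2, 10, 2]
    print(static_batching_steps(lengths, batch_size=2))
    print(continuous_batching_steps(lengths, batch_size=2))
```

In this toy setting, short requests no longer wait for the longest request in their batch, so the same work completes in fewer total steps.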
Usage#
Currently, this feature can be enabled under the following conditions:
First, set the environment variable XINFERENCE_TRANSFORMERS_ENABLE_BATCHING to 1 when starting xinference. For example:
XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1 xinference-local --log-level debug
Then, ensure that the transformers engine is selected when launching the model. For example:
xinference launch -e <endpoint> --model-engine transformers -n qwen1.5-chat -s 4 -f pytorch -q none
curl -X 'POST' \
'http://127.0.0.1:9997/v1/models' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_engine": "transformers",
"model_name": "qwen1.5-chat",
"model_format": "pytorch",
"size_in_billions": 4,
"quantization": "none"
}'
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
model_engine="transformers",
model_name="qwen1.5-chat",
model_format="pytorch",
model_size_in_billions=4,
quantization="none"
)
print('Model uid: ' + model_uid)
Once this feature is enabled, all LLM requests are managed by continuous batching, which increases the average throughput of requests made to a single model. The LLM interface is used exactly as before.
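Because the gain comes from requests that overlap in time, a client only benefits when it sends requests concurrently. The following is a minimal sketch of one way to do that with a thread pool. The helper name chat_concurrently is hypothetical, and it assumes the model handle accepts the prompt as its first argument, as in the chat call shown on this page.

```python
from concurrent.futures import ThreadPoolExecutor


def chat_concurrently(model, prompts, max_workers=8):
    """Send each prompt on its own worker thread; return replies in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if requests finish out of order.
        return list(pool.map(model.chat, prompts))


# Usage against a running server (needs xinference and a launched model):
#   from xinference.client import Client
#   client = Client("http://127.0.0.1:9997")
#   model = client.get_model("<model_uid>")
#   replies = chat_concurrently(model, ["<prompt 1>", "<prompt 2>"])
```

Keeping max_workers at or below the model's configured concurrency level avoids queuing more requests than the server will batch at once.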
Abort your request#
In this mode, you can abort requests that are in the process of inference.
First, add the request_id option to generate_config. For example:
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")
model.chat("<prompt>", generate_config={"request_id": "<your_unique_request_id>"})
Then, abort the request using the request_id you have set. For example:
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
client.abort_request("<model_uid>", "<your_unique_request_id>")
Note that if your request has already finished, aborting the request will be a no-op.
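The two steps above can be combined into a client-side timeout. The sketch below generates a unique request_id, runs the chat call on a background thread, and aborts it if it is still running after a deadline. The helper chat_with_timeout and its grace_s parameter are hypothetical. How the in-flight call returns after an abort is not specified here, so the sketch simply records whatever result or exception comes back.

```python
import threading
import uuid


def chat_with_timeout(client, model_uid, prompt, timeout_s, grace_s=5.0):
    """Run one chat request and abort it if it exceeds timeout_s seconds.

    Returns (request_id, result_or_None, aborted).
    """
    request_id = str(uuid.uuid4())  # must be unique per request
    model = client.get_model(model_uid)
    outcome = {}

    def worker():
        try:
            outcome["result"] = model.chat(
                prompt, generate_config={"request_id": request_id}
            )
        except Exception as exc:  # an aborted call may raise instead of returning
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    aborted = thread.is_alive()
    if aborted:
        client.abort_request(model_uid, request_id)
        thread.join(grace_s)  # give the aborted call a moment to unwind
    return request_id, outcome.get("result"), aborted
```

Since aborting a finished request is a no-op, the small race between the timeout check and the abort call is harmless.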
Note#
Currently, this feature only supports the generate, chat and vision tasks for LLM models. The tool call tasks are not supported.
For vision tasks, currently only the qwen-vl-chat, cogvlm2, and glm-4v models are supported. More models will be supported in the future. Please let us know your requirements.
If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model. The launch_model interface provides the max_num_seqs parameter to adjust the concurrency level, with a default value of 16.
This feature is still in the experimental stage, and we welcome your active feedback on any issues.
After a period of testing, this method will be enabled by default, and the original inference method will be deprecated.
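As a sketch of how the concurrency level mentioned above might be lowered on a memory-constrained GPU, the helper below builds the launch arguments used earlier on this page and adds max_num_seqs. The helper name launch_kwargs and its validation are hypothetical. Only the max_num_seqs parameter and its default of 16 come from this page.

```python
def launch_kwargs(max_num_seqs=16):
    """Build launch_model arguments with an explicit concurrency level."""
    if not isinstance(max_num_seqs, int) or max_num_seqs < 1:
        raise ValueError("max_num_seqs must be a positive integer")
    return {
        "model_engine": "transformers",
        "model_name": "qwen1.5-chat",
        "model_format": "pytorch",
        "model_size_in_billions": 4,
        "quantization": "none",
        # Fewer concurrent sequences means less GPU memory held at once.
        "max_num_seqs": max_num_seqs,
    }


# Usage against a running server (needs xinference installed):
#   from xinference.client import Client
#   client = Client("http://127.0.0.1:9997")
#   model_uid = client.launch_model(**launch_kwargs(max_num_seqs=8))
```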