Continuous Batching#
Continuous batching, as a means to improve throughput during model serving, has already been implemented in inference engines like vLLM.
Xinference aims to provide this optimization capability when using the transformers engine as well.
Usage#
LLM#
Currently, this feature can be enabled under the following conditions:
First, set the environment variable XINFERENCE_TRANSFORMERS_ENABLE_BATCHING to 1 when starting xinference. For example:
XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1 xinference-local --log-level debug
Note
Since v0.16.0, this feature is turned on by default, and setting the XINFERENCE_TRANSFORMERS_ENABLE_BATCHING environment variable is no longer required; the variable has been removed.
Then, ensure that the transformers engine is selected when launching the model. For example:
xinference launch -e <endpoint> --model-engine transformers -n qwen1.5-chat -s 4 -f pytorch -q none
curl -X 'POST' \
'http://127.0.0.1:9997/v1/models' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model_engine": "transformers",
"model_name": "qwen1.5-chat",
"model_format": "pytorch",
"size_in_billions": 4,
"quantization": "none"
}'
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
model_engine="transformers",
model_name="qwen1.5-chat",
model_format="pytorch",
model_size_in_billions=4,
quantization="none"
)
print('Model uid: ' + model_uid)
Once this feature is enabled, all LLM requests are managed by continuous batching, and the average throughput of requests to a single model increases. The LLM interfaces are used exactly as before, with no changes required.
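Because requests are batched on the server side, you can issue several requests to the same model concurrently and let continuous batching interleave them. Below is a minimal sketch assuming a model launched as above; the placeholder model uid, prompts, and use of a thread pool are illustrative only:

from concurrent.futures import ThreadPoolExecutor

from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")  # uid returned by launch_model above

prompts = [
    "Explain continuous batching in one sentence.",
    "Write a haiku about GPUs.",
    "List three common uses of large language models.",
]

def ask(prompt):
    # Each call is an independent request; the server batches them together.
    return model.chat([{"role": "user", "content": prompt}])

with ThreadPoolExecutor(max_workers=3) as pool:
    for response in pool.map(ask, prompts):
        print(response)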
Image Model#
Currently, for image models, only the text_to_image interface is supported, and only for FLUX.1 series models.
Enabling this feature requires setting the environment variable XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE, which indicates the size of the generated images. For example, start xinference like this:
XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE=1024*1024 xinference-local --log-level debug
Then just use the text_to_image interface as before, and nothing else needs to be changed.
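As a quick sketch of what that looks like with the Python client (the model uid and prompt below are placeholders, and the size value is assumed to match the XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE set above):

from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")  # uid of a launched FLUX.1 model

# The call is identical to the non-batching case; requests are queued and
# batched on the server. The size is assumed to match the configured
# XINFERENCE_TEXT_TO_IMAGE_BATCHING_SIZE value.
image = model.text_to_image(
    prompt="an astronaut riding a horse on the moon",
    size="1024*1024",
)
print(image)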
Abort your request#
In this mode, you can abort requests that are in the process of inference.
First, add the request_id option in generate_config. For example:
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
model = client.get_model("<model_uid>")
model.chat([{"role": "user", "content": "<prompt>"}], generate_config={"request_id": "<your_unique_request_id>"})
Then, abort the request using the request_id you have set. For example:
from xinference.client import Client
client = Client("http://127.0.0.1:9997")
client.abort_request("<model_uid>", "<your_unique_request_id>")
Note that if your request has already finished, aborting the request will be a no-op. Image models also support this feature.
Note#
- Currently, for LLM models, this feature only supports the generate, chat, tool call and vision tasks.
- Currently, for image models, this feature only supports the text_to_image task. Only FLUX.1 series models are supported.
- For vision tasks, currently only the qwen-vl-chat, cogvlm2, glm-4v and MiniCPM-V-2.6 (only for image tasks) models are supported. More models will be supported in the future. Please let us know your requirements.
- If using GPU inference, this method will consume more GPU memory. Please be cautious when increasing the number of concurrent requests to the same model. The launch_model interface provides the max_num_seqs parameter to adjust the concurrency level, with a default value of 16.
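For example, to launch the same model with a lower concurrency limit (a sketch based on the launch_model call shown earlier; the value 8 is arbitrary):

from xinference.client import Client

client = Client("http://127.0.0.1:9997")

# max_num_seqs caps how many requests are batched concurrently (default 16).
# Lowering it reduces GPU memory pressure at the cost of throughput.
model_uid = client.launch_model(
    model_engine="transformers",
    model_name="qwen1.5-chat",
    model_format="pytorch",
    model_size_in_billions=4,
    quantization="none",
    max_num_seqs=8,
)
print('Model uid: ' + model_uid)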