Backends#
Xinference supports multiple backends for different models. After the user specifies the model, Xinference automatically selects the appropriate backend.
llama.cpp#
Xinference now supports xllamacpp, which is developed by the Xinference team, to run the llama.cpp backend. llama.cpp is built on the tensor library ggml and supports inference of the LLaMA series models and their variants.
Warning
Since Xinference v1.5.0, xllamacpp has been the default option for llama.cpp, and llama-cpp-python is deprecated.
Since Xinference v1.6.0, llama-cpp-python has been removed.
For all configurable llama.cpp parameters, please refer to the definition of the common_params structure in llama.cpp common.h.
Some parameters are nested, for example sampling.top_k. Use . to separate the levels of a nested parameter.
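To make the dotted naming concrete, the sketch below expands flat dotted keys into the nested layout they refer to. It only illustrates the naming convention and is not how Xinference or xllamacpp parses parameters. The helper name `expand_dotted` is hypothetical, and the keys and values are arbitrary samples.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys such as "sampling.top_k" into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for name in parents:
            # Create the intermediate level the first time it is seen.
            node = node.setdefault(name, {})
        node[leaf] = value
    return nested


# Hypothetical sample values, chosen only for illustration.
params = expand_dotted({"n_ctx": 4096, "sampling.top_k": 40, "sampling.temp": 0.7})
print(params)
```

Keys without a dot stay at the top level, while each dot adds one level of nesting.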
Here is an example of setting nested sampling parameters in WebUI:

Auto NGL#
Added in version v1.6.1: Auto GPU layers estimation has been enabled since v1.6.1 when n-gpu-layers is not specified (default is -1).
This feature automatically estimates the number of GPU layers (NGL) for the llama.cpp backend. Please be aware that this
is an estimate, not an exact calculation. Therefore, the resulting -ngl value might not be optimal, and there is still a
chance of encountering an out-of-memory error.
Currently, there is no official implementation for auto ngl. Please refer to the following issues for more information:
Our implementation is based on the Ollama auto ngl, but there are some differences:
We utilize device information detected by xllamacpp.
We have removed support for less popular architectures; these architectures use the default calculation.
We fall back to offloading all the layers to the GPU if the auto ngl estimation fails.
We do not support multimodal projectors embedded into the model GGUF, as this is a very experimental feature.
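The general idea of estimating a layer count and falling back on failure can be sketched as follows. This is a deliberately simplified illustration, not the Xinference or Ollama algorithm: it assumes every layer needs the same amount of memory and that free GPU memory is already known. All names (`estimate_gpu_layers`, `choose_gpu_layers`, `reserve_bytes`) and all numbers are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_gpu_layers(free_bytes: int, layer_bytes: int, n_layers: int,
                        reserve_bytes: int = 0) -> int:
    """Estimate how many equally sized layers fit into free GPU memory."""
    if layer_bytes <= 0 or n_layers <= 0:
        raise ValueError("model metadata is missing or invalid")
    # Keep some memory aside (for example for the KV cache and buffers).
    usable = max(free_bytes - reserve_bytes, 0)
    return min(n_layers, usable // layer_bytes)


def choose_gpu_layers(requested: int, free_bytes: int, layer_bytes: int,
                      n_layers: int, reserve_bytes: int = 0) -> int:
    """Honor an explicit setting; otherwise estimate, falling back to all layers."""
    if requested >= 0:
        return requested  # the user specified the layer count explicitly
    try:
        return estimate_gpu_layers(free_bytes, layer_bytes, n_layers, reserve_bytes)
    except ValueError:
        return n_layers  # estimation failed: offload every layer


# Hypothetical numbers: 8 GiB free, 1 GiB reserved, 32 layers of 256 MiB each.
auto = choose_gpu_layers(-1, 8 * GIB, 256 * MIB, 32, reserve_bytes=1 * GIB)
print(auto)
```

Because a rough estimate like this ignores many real memory consumers, the result can still be too high, which is why an out-of-memory error remains possible.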
Common Issues#
Server error: {'code': 500, 'message': 'failed to process image', 'type': 'server_error'}
The error logs from server:
encoding image or slice...
slot update_slots: id 0 | task 0 | kv cache rm [10, end)
srv process_chun: processing image...
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Internal Error (0000000e:Internal Error)
clip_image_batch_encode: ggml_backend_sched_graph_compute failed with error -1
failed to encode image
srv process_chun: image processed in 2288 ms
mtmd_helper_eval failed with status 1
slot update_slots: id 0 | task 0 | failed to process image, res = 1
This could be caused by running out of memory. You can try reducing memory usage by decreasing n_ctx.
Server error: {'code': 400, 'message': 'the request exceeds the available context size. try increasing the context size or enable context shift', 'type': 'invalid_request_error'}
If you are using the multimodal feature, ctx_shift is disabled by default. Please increase the context size by either increasing n_ctx or reducing n_parallel.
Server error: {'code': 500, 'message': 'Input prompt is too big compared to KV size. Please try increasing KV size.', 'type': 'server_error'}
The error logs from server:
ggml_metal_graph_compute: command buffer 1 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
srv update_slots: failed to decode the batch: KV cache is full - try increasing it via the context size, i = 0, n_batch = 2048, ret = -3
This could be caused by a KV cache allocation failure. You can try to reduce the context size by either reducing n_ctx or increasing n_parallel, or load only part of the model onto the GPU by adjusting n_gpu_layers. Be aware that if you are handling inference requests serially, increasing n_parallel cannot improve latency or throughput.
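The relationship between n_ctx and n_parallel in the remedies above can be sketched with a little arithmetic. The sketch assumes the total context is split evenly across parallel slots, which is what the advice above implies. The function names and token counts are hypothetical and are not part of Xinference or llama.cpp.

```python
def per_slot_context(n_ctx: int, n_parallel: int) -> int:
    """Context tokens available to one request, assuming an even split."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return n_ctx // n_parallel


def request_fits(prompt_tokens: int, max_new_tokens: int,
                 n_ctx: int, n_parallel: int) -> bool:
    """Check whether prompt plus generation fits into one slot's context."""
    return prompt_tokens + max_new_tokens <= per_slot_context(n_ctx, n_parallel)


# Hypothetical request: 3000 prompt tokens plus up to 512 generated tokens.
print(per_slot_context(8192, 4))         # 2048 tokens per slot
print(request_fits(3000, 512, 8192, 4))  # False: 3512 > 2048
print(request_fits(3000, 512, 8192, 2))  # True: 3512 <= 4096
```

Under this assumption, halving n_parallel has the same effect on a single request as doubling n_ctx.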
transformers#
Transformers supports inference for most state-of-the-art models. It is the default backend for models in PyTorch format.
vLLM#
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
When the following conditions are met, Xinference will choose vLLM as the inference engine:
The model format is pytorch, gptq, awq, fp4, fp8 or bnb.
When the model format is pytorch, the quantization is none.
When the model format is awq, the quantization is Int4.
When the model format is gptq, the quantization is Int3, Int4 or Int8.
The system is Linux and has at least one CUDA device
The model family (for custom models) / model name (for builtin models) is within the list of models supported by vLLM
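The conditions above can be read as a single predicate. The sketch below is an illustration of that reading, not Xinference's actual selection code: the function name `would_select_vllm`, its parameters, and the small `SUPPORTED` sample set are hypothetical. It also assumes that the formats without a stated quantization rule (fp4, fp8, bnb) accept any quantization.

```python
from typing import Set

ALLOWED_FORMATS = {"pytorch", "gptq", "awq", "fp4", "fp8", "bnb"}

# Quantization rules stated in the text; other formats are left unconstrained
# here, which is an assumption of this sketch.
ALLOWED_QUANTIZATION = {
    "pytorch": {"none"},
    "awq": {"Int4"},
    "gptq": {"Int3", "Int4", "Int8"},
}


def would_select_vllm(model_format: str, quantization: str, model_name: str,
                      system: str, cuda_device_count: int,
                      supported_models: Set[str]) -> bool:
    """Return True only if every listed condition holds."""
    if model_format not in ALLOWED_FORMATS:
        return False
    allowed = ALLOWED_QUANTIZATION.get(model_format)
    if allowed is not None and quantization not in allowed:
        return False
    if system != "Linux" or cuda_device_count < 1:
        return False
    return model_name in supported_models


# A tiny sample of supported names, for illustration only.
SUPPORTED = {"qwen2.5-instruct", "llama-3.1-instruct"}
print(would_select_vllm("pytorch", "none", "qwen2.5-instruct", "Linux", 1, SUPPORTED))
```

If any single condition fails, for example an awq model with a quantization other than Int4 or a machine without a CUDA device, the predicate returns False and another backend would be needed.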
Currently, the supported models include:
code-llama, code-llama-instruct, code-llama-python, deepseek, deepseek-chat, deepseek-coder, deepseek-coder-instruct, deepseek-r1-distill-llama, gorilla-openfunctions-v2, HuatuoGPT-o1-LLaMA-3.1, llama-2, llama-2-chat, llama-3, llama-3-instruct, llama-3.1, llama-3.1-instruct, llama-3.3-instruct, tiny-llama, wizardcoder-python-v1.0, wizardmath-v1.0, Yi, Yi-1.5, Yi-1.5-chat, Yi-1.5-chat-16k, Yi-200k, Yi-chat
codestral-v0.1, mistral-instruct-v0.1, mistral-instruct-v0.2, mistral-instruct-v0.3, mistral-large-instruct, mistral-nemo-instruct, mistral-v0.1, openhermes-2.5, seallm_v2
Baichuan-M2, codeqwen1.5, codeqwen1.5-chat, deepseek-r1-distill-qwen, DianJin-R1, fin-r1, HuatuoGPT-o1-Qwen2.5, KAT-V1, marco-o1, qwen1.5-chat, qwen2-instruct, qwen2.5, qwen2.5-coder, qwen2.5-coder-instruct, qwen2.5-instruct, qwen2.5-instruct-1m, qwenLong-l1, QwQ-32B, QwQ-32B-Preview, seallms-v3, skywork-or1, skywork-or1-preview, XiYanSQL-QwenCoder-2504
llama-3.2-vision, llama-3.2-vision-instruct
baichuan-2, baichuan-2-chat
InternLM2ForCausalLM
qwen-chat
mixtral-8x22B-instruct-v0.1, mixtral-instruct-v0.1, mixtral-v0.1
cogagent
glm-edge-chat, glm4-chat, glm4-chat-1m
codegeex4, glm-4v
seallm_v2.5
orion-chat
qwen1.5-moe-chat, qwen2-moe-instruct
CohereForCausalLM
deepseek-v2-chat, deepseek-v2-chat-0628, deepseek-v2.5, deepseek-vl2
deepseek-prover-v2, deepseek-r1, deepseek-r1-0528, deepseek-v3, deepseek-v3-0324, Deepseek-V3.1, moonlight-16b-a3b-instruct
deepseek-r1-0528-qwen3, qwen3
minicpm3-4b
internlm3-instruct
gemma-3-1b-it
glm4-0414
minicpm-2b-dpo-bf16, minicpm-2b-dpo-fp16, minicpm-2b-dpo-fp32, minicpm-2b-sft-bf16, minicpm-2b-sft-fp32, minicpm4
Ernie4.5
Qwen3-Coder, Qwen3-Instruct, Qwen3-Thinking
glm-4.5, GLM-4.6, GLM-4.7
gpt-oss
seed-oss
Qwen3-Next-Instruct, Qwen3-Next-Thinking
DeepSeek-V3.2, DeepSeek-V3.2-Exp
MiniMax-M2, MiniMax-M2.5, MiniMax-M2.7
glm-5, glm-5.1
SGLang#
SGLang provides a high-performance inference runtime with RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls. It also supports other common techniques such as continuous batching and tensor parallelism.
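The benefit of reusing cached work across calls can be shown with a toy prefix cache. This is only a conceptual sketch and not SGLang's RadixAttention implementation: it compares token lists linearly instead of using a radix tree, and it counts tokens rather than storing any KV tensors. The class and function names are hypothetical.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers earlier requests and reports how many tokens are new."""

    def __init__(self) -> None:
        self._seen: List[List[int]] = []

    def tokens_to_compute(self, tokens: Sequence[int]) -> int:
        # Reuse the longest prefix shared with any earlier request.
        reused = max((shared_prefix_len(tokens, s) for s in self._seen), default=0)
        self._seen.append(list(tokens))
        return len(tokens) - reused


cache = ToyPrefixCache()
first = cache.tokens_to_compute([1, 2, 3, 4])      # nothing cached yet
second = cache.tokens_to_compute([1, 2, 3, 9, 9])  # shares the prefix [1, 2, 3]
print(first, second)
```

Requests that share a long prefix, such as a common system prompt, only pay for the tokens after the shared part in this model.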
MLX#
MLX provides an efficient runtime for running LLMs on Apple silicon. It is recommended for Mac users on Apple silicon when the model is available in MLX format.