xinference.client.Client.launch_model

Launch a model on the server via the RESTful API, using the given parameters.

Parameters:

model_name (str) – The name of the model.
model_type (str) – The type of the model.
model_engine (Optional[str]) – The inference engine to use when launching an LLM.
model_uid (str) – UID of the model; a UUID is generated automatically if this is None.
model_size_in_billions (Optional[Union[int, str, float]]) – The size (in billions) of the model.
model_format (Optional[str]) – The format of the model.
quantization (Optional[str]) – The quantization of the model.
replica (Optional[int]) – The number of replicas of the model; default is 1.
n_worker (int) – Number of workers to run.
n_gpu (Optional[Union[int, str]]) – The number of GPUs used by the model; default is “auto”. If n_worker > 1, this is the number of GPUs per worker. n_gpu=None means CPU only; n_gpu=“auto” lets the system automatically determine the best number of GPUs to use.
peft_model_config (Optional[Dict]) – PEFT (Parameter-Efficient Fine-Tuning) configuration, with the following keys:
- “lora_list”: A list of PEFT models and their paths.
- “image_lora_load_kwargs”: A dict of LoRA load parameters for image models.
- “image_lora_fuse_kwargs”: A dict of LoRA fuse parameters for image models.
request_limits (Optional[int]) – The request limit for this model; default is None, which means no limit.
worker_ip (Optional[str]) – The IP of the worker on which the model should be located in a distributed scenario.
gpu_idx (Optional[Union[int, List[int]]]) – Specify the GPU index where the model is located.
model_path (Optional[str]) – Path to the model. For the gguf format this should be the file path; otherwise it should be the directory of the model.
enable_thinking (Optional[bool]) – Enable or disable thinking mode for hybrid reasoning LLMs (e.g., Qwen3). None uses the model default.
enable_virtual_env (Optional[bool]) – Whether to enable the virtual environment.
virtual_env_packages (Optional[List[str]]) – Packages to install in the virtual environment; can be used to override the built-in packages of the virtual environment.
envs (Optional[Dict[str, str]]) – Environment variables to pass when launching model.
**kwargs – Any other parameters to pass through, e.g. multimodal_projector for multimodal inference with the llama.cpp backend.
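
The parameters above can be gathered into a single keyword dictionary before the call. The following is a minimal sketch, not part of the library: build_launch_kwargs and launch are hypothetical helpers, and the model name "qwen3", the model type "LLM", and the endpoint in the trailing comment are placeholder assumptions. It only illustrates the documented n_gpu semantics and the “lora_list” key of peft_model_config; the shape of each lora_list entry is not specified here and is left to the caller.

```python
from typing import Any, Dict, List, Optional, Union


def build_launch_kwargs(
    model_name: str,
    model_type: str,
    n_gpu: Optional[Union[int, str]] = "auto",
    lora_list: Optional[List[Dict[str, str]]] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for Client.launch_model."""
    kwargs: Dict[str, Any] = {
        "model_name": model_name,
        "model_type": model_type,
        # None means CPU only; "auto" lets the server choose the GPU count.
        "n_gpu": n_gpu,
    }
    if lora_list:
        # Only the "lora_list" key is set in this sketch.
        kwargs["peft_model_config"] = {"lora_list": lora_list}
    # Anything else documented above, e.g. model_engine, replica, envs.
    kwargs.update(extra)
    return kwargs


def launch(endpoint: str, **kwargs: Any) -> str:
    """Launch a model on a running server and return its model_uid."""
    # Imported lazily so the helper above works without xinference installed.
    from xinference.client import Client

    return Client(endpoint).launch_model(**kwargs)


# Placeholder values: a CPU-only launch with thinking mode disabled.
cpu_only = build_launch_kwargs("qwen3", "LLM", n_gpu=None, enable_thinking=False)

# With a server running (the endpoint is an assumption):
# model_uid = launch("http://127.0.0.1:9997", **cpu_only)
```

Keeping the argument assembly separate from the network call makes it possible to inspect or log the exact parameters before anything is launched.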

Returns:

The unique model_uid for the launched model.

Return type:

str
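
The returned model_uid is what later calls use to refer to the launched model. A minimal sketch of that flow follows, assuming a client object that exposes launch_model and get_model, as xinference.client.Client does. The function launch_and_fetch is a hypothetical wrapper, and the endpoint and model values in the trailing comments are placeholders.

```python
from typing import Any, Tuple


def launch_and_fetch(client: Any, **launch_kwargs: Any) -> Tuple[str, Any]:
    """Launch a model, then use the returned model_uid to get a handle to it."""
    model_uid = client.launch_model(**launch_kwargs)
    handle = client.get_model(model_uid)
    return model_uid, handle


# With a server running (the endpoint and model values are assumptions):
# from xinference.client import Client
# client = Client("http://127.0.0.1:9997")
# uid, model = launch_and_fetch(client, model_name="qwen3", model_type="LLM")
# client.terminate_model(uid)  # release the model when finished
```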