xinference.client.Client.launch_model

Client.launch_model(model_name: str, model_type: str = 'LLM', model_engine: str | None = None, model_uid: str | None = None, model_size_in_billions: int | str | float | None = None, model_format: str | None = None, quantization: str | None = None, replica: int = 1, n_gpu: int | str | None = 'auto', peft_model_config: Dict | None = None, request_limits: int | None = None, worker_ip: str | None = None, gpu_idx: int | List[int] | None = None, **kwargs) -> str

Launch a model on the server with the given parameters via the RESTful API.

Parameters:
  • model_name (str) – The name of the model.

  • model_type (str) – The type of the model; default is 'LLM'.

  • model_engine (Optional[str]) – The inference engine to use when launching an LLM.

  • model_uid (Optional[str]) – UID of the model; a UUID is generated automatically if None.

  • model_size_in_billions (Optional[Union[int, str, float]]) – The size (in billions) of the model.

  • model_format (Optional[str]) – The format of the model.

  • quantization (Optional[str]) – The quantization of the model.

  • replica (int) – The number of replicas of the model; default is 1.

  • n_gpu (Optional[Union[int, str]]) – The number of GPUs used by the model; default is "auto". n_gpu=None means CPU only; n_gpu="auto" lets the system determine the best number of GPUs to use.

  • peft_model_config (Optional[Dict]) – A dict that may contain the following keys:

    • "lora_list": A list of PEFT (Parameter-Efficient Fine-Tuning) models and their paths.

    • "image_lora_load_kwargs": A dict of LoRA load parameters for image models.

    • "image_lora_fuse_kwargs": A dict of LoRA fuse parameters for image models.

  • request_limits (Optional[int]) – The request limit for this model; default is None, which means no limit.

  • worker_ip (Optional[str]) – The IP of the worker on which to place the model in a distributed deployment.

  • gpu_idx (Optional[Union[int, List[int]]]) – The GPU index, or list of indices, on which to place the model.

  • **kwargs – Any other parameters that have been specified.

Returns:

The unique model_uid for the launched model.

Return type:

str
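
Example:

The following is a minimal sketch of how the n_gpu and model_uid parameters described above might be combined in a call. The helpers launch_kwargs and launch are hypothetical and not part of xinference; the endpoint URL and the model name "my-model" in the trailing comments are placeholders for illustration only.

```python
from typing import Any, Dict, Optional


def launch_kwargs(
    model_name: str,
    use_gpu: bool = True,
    model_uid: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Client.launch_model.

    Hypothetical helper for illustration; not part of xinference.
    """
    kwargs: Dict[str, Any] = {
        "model_name": model_name,
        "model_type": "LLM",
        # "auto" lets the server choose the GPU count; None means CPU only.
        "n_gpu": "auto" if use_gpu else None,
    }
    # Only pass model_uid when the caller wants a specific one;
    # otherwise the server generates a UUID.
    if model_uid is not None:
        kwargs["model_uid"] = model_uid
    return kwargs


def launch(client: Any, **kwargs: Any) -> str:
    """Call launch_model on a client-like object and return the model UID."""
    return client.launch_model(**kwargs)


# Usage against a running server (assumed endpoint and model name):
#
#   from xinference.client import Client
#   client = Client("http://localhost:9997")
#   uid = launch(client, **launch_kwargs("my-model", use_gpu=False))
#   print(uid)  # the model_uid returned by launch_model
```

Keeping the argument construction separate from the call makes it possible to check the CPU-only and automatic GPU settings without a running server. Depending on the model, further arguments such as model_engine or model_format may also need to be passed.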