xinference.client.Client.launch_model

Client.launch_model(model_name: str, model_type: str = 'LLM', model_engine: str | None = None, model_uid: str | None = None, model_size_in_billions: int | str | float | None = None, model_format: str | None = None, quantization: str | None = None, replica: int = 1, n_worker: int = 1, n_gpu: int | str | None = 'auto', peft_model_config: Dict | None = None, request_limits: int | None = None, worker_ip: str | None = None, gpu_idx: int | List[int] | None = None, model_path: str | None = None, enable_thinking: bool | None = None, enable_virtual_env: bool | None = None, virtual_env_packages: List[str] | None = None, envs: Dict[str, str] | None = None, **kwargs) -> str

Launch a model on the server with the given parameters via the RESTful API.

Parameters:
  • model_name (str) -- The name of the model.

  • model_type (str) -- The type of the model.

  • model_engine (Optional[str]) -- The inference engine to use when launching an LLM.

  • model_uid (Optional[str]) -- UID of the model. If None, a UUID is generated automatically.

  • model_size_in_billions (Optional[Union[int, str, float]]) -- The size (in billions) of the model.

  • model_format (Optional[str]) -- The format of the model.

  • quantization (Optional[str]) -- The quantization of the model.

  • replica (int) -- The number of replicas of the model; default is 1.

  • n_worker (int) -- Number of workers to run.

  • n_gpu (Optional[Union[int, str]]) -- The number of GPUs used by the model; default is "auto". If n_worker > 1, this is the number of GPUs per worker. n_gpu=None means CPU only; n_gpu="auto" lets the system determine the best number of GPUs to use.

  • peft_model_config (Optional[Dict]) -- A dict that may contain the following keys:

    • "lora_list": A list of PEFT (Parameter-Efficient Fine-Tuning) models and their paths.

    • "image_lora_load_kwargs": A dict of LoRA load parameters for image models.

    • "image_lora_fuse_kwargs": A dict of LoRA fuse parameters for image models.

  • request_limits (Optional[int]) -- The request limit for this model; default is None, which means no limit.

  • worker_ip (Optional[str]) -- The IP of the worker where the model is located, in a distributed scenario.

  • gpu_idx (Optional[Union[int, List[int]]]) -- Specify the GPU index where the model is located.

  • model_path (Optional[str]) -- Path to the model. For the gguf format, this should be the file path; otherwise, it should be the model directory.

  • enable_thinking (Optional[bool]) -- Enable or disable thinking mode for hybrid reasoning LLMs (e.g., Qwen3). None uses the model default.

  • enable_virtual_env (Optional[bool]) -- Whether to enable the virtual environment.

  • virtual_env_packages (Optional[List[str]]) -- Packages to install in the virtual environment; can be used to override its built-in packages.

  • envs (Optional[Dict[str, str]]) -- Environment variables to pass when launching the model.

  • **kwargs -- Any other parameters to pass through, e.g. multimodal_projector for multimodal inference with the llama.cpp backend.
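
The n_gpu rules described above can be summarized in a few lines of Python. This is only an illustrative sketch of the documented semantics: describe_n_gpu is a hypothetical helper, not part of the xinference API, and it does not reflect how the server actually allocates devices.

```python
from typing import Optional, Union


def describe_n_gpu(n_gpu: Optional[Union[int, str]] = "auto", n_worker: int = 1) -> str:
    """Hypothetical helper: restate the documented meaning of n_gpu in words."""
    if n_gpu is None:
        # n_gpu=None means the model runs on CPU only.
        return "CPU only"
    if n_gpu == "auto":
        # n_gpu="auto" leaves the choice to the system.
        return "GPU count chosen automatically"
    # With more than one worker, the count applies to each worker.
    scope = "per worker" if n_worker > 1 else "for the model"
    return f"{n_gpu} GPU(s) {scope}"
```

For example, describe_n_gpu(2, n_worker=3) reads as two GPUs on each of the three workers.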

Returns:

The unique model_uid for the launched model.

Return type:

str
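
A minimal usage sketch, assuming an Xinference server is already running and reachable. The endpoint passed to launch, the helper names build_launch_args and launch, and all argument values (model name, engine, size) are illustrative placeholders, not recommendations; substitute values that are valid for your deployment.

```python
from typing import Any, Dict


def build_launch_args() -> Dict[str, Any]:
    """Collect keyword arguments for Client.launch_model (placeholder values)."""
    return {
        "model_name": "qwen3",           # placeholder model name
        "model_type": "LLM",             # the documented default
        "model_engine": "transformers",  # placeholder engine
        "model_size_in_billions": 8,     # placeholder size
        "n_gpu": "auto",                 # let the system pick the GPU count
        "enable_thinking": False,        # None would keep the model default
    }


def launch(endpoint: str, launch_args: Dict[str, Any]) -> str:
    """Launch a model on the server at `endpoint` and return its model_uid."""
    # Imported here so the sketch can be loaded without xinference installed.
    from xinference.client import Client

    client = Client(endpoint)
    return client.launch_model(**launch_args)
```

Calling launch("http://localhost:9997", build_launch_args()) against a running server (the address here is an assumption) would return the model_uid string, which identifies the launched model in later client calls.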