.. _models_custom:

=============
Custom Models
=============
Xinference provides a flexible and comprehensive way to integrate, manage, and utilize custom models.

Directly launch an existing model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since ``v0.14.0``, you can launch an existing model directly by passing ``model_path`` to the launch interface, without downloading it.
This approach requires that the model's ``model_family`` is one of the built-in supported models, and it removes the need to register the model.

For example:

.. tabs::

  .. code-tab:: bash shell

    xinference launch --model-path <model_path> --model-engine <model_engine> -n qwen1.5-chat

  .. code-tab:: bash cURL

    curl -X 'POST' \
      'http://127.0.0.1:9997/v1/models' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "model_engine": "<model_engine>",
      "model_name": "qwen1.5-chat",
      "model_path": "<model_path>"
    }'

  .. code-tab:: python

    from xinference.client import RESTfulClient
    client = RESTfulClient("http://127.0.0.1:9997")
    model_uid = client.launch_model(
      model_engine="<model_engine>",
      model_name="qwen1.5-chat",
      model_path="<model_path>"
    )
    print('Model uid: ' + model_uid)

The above example demonstrates how to launch a qwen1.5-chat model file directly, without registering it.

For distributed scenarios, if your model file is on a specific worker, you can launch it directly by passing the ``worker_ip`` and ``model_path`` parameters to the launch interface.

.. note::

   For CLI usage, prefer ``--model-path`` (kebab-case). The legacy form ``--model_path`` is still accepted but not recommended.

Define a custom model
~~~~~~~~~~~~~~~~~~~~~

Web UI: Automatic LLM Config Parsing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: v2.0.0

When registering a custom LLM via the Web UI, Xinference can automatically parse the model configuration and pre-fill key fields for you.
You only need to provide:

- **Model path / Model ID** (where the model lives: a local path or a hub ID)
- **Model Family**

After parsing, the UI can auto-populate fields such as:

- ``Context Length``
- ``Model_Languages``
- ``Model_Abilities``
- ``Model_Specs``

You can review and edit these fields before saving the custom model.

Define a custom model based on the following templates:

.. tabs::

  .. tab:: LLM

    .. code-block:: json

      {
        "version": 2,
        "context_length": 32768,
        "model_name": "custom-qwen-2.5",
        "model_lang": [
          "en",
          "zh"
        ],
        "model_ability": [
          "generate"
        ],
        "model_description": "This is a custom model description.",
        "model_family": "my-custom-qwen-2.5",
        "model_specs": [
          {
            "model_format": "pytorch",
            "model_size_in_billions": "0_5",
            "quantization": "none",
            "model_id": null,
            "model_hub": "huggingface",
            "model_uri": "file:///path/to/models--Qwen--Qwen2.5-0.5B",
            "model_revision": null,
            "activated_size_in_billions": null
          }
        ],
        "chat_template": null,
        "stop_token_ids": null,
        "stop": null,
        "reasoning_start_tag": null,
        "reasoning_end_tag": null,
        "cache_config": null,
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "is_builtin": false
      }

  .. tab:: embedding

    .. code-block:: json

      {
        "version": 2,
        "model_name": "my-bge-large-zh-v1.5",
        "dimensions": 1024,
        "max_tokens": 512,
        "language": [
          "zh"
        ],
        "model_specs": [
          {
            "model_format": "pytorch",
            "model_hub": "huggingface",
            "model_id": null,
            "model_uri": "file:///path/to/my-bge-large-zh-v1.5",
            "model_revision": null,
            "quantization": "none"
          }
        ],
        "cache_config": null,
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "is_builtin": false
      }

  .. tab:: Rerank

    .. code-block:: json

      {
        "version": 2,
        "model_name": "my-bge-reranker-base",
        "model_specs": [
          {
            "model_format": "pytorch",
            "model_hub": "huggingface",
            "model_id": null,
            "model_revision": null,
            "model_uri": "file:///path/to/my-bge-reranker-base",
            "quantization": "none"
          }
        ],
        "language": [
          "en",
          "zh"
        ],
        "type": "unknown",
        "max_tokens": 512,
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "is_builtin": false
      }

  .. tab:: image

    .. code-block:: json

      {
        "model_name": "my-qwen-image",
        "model_id": null,
        "model_revision": null,
        "model_hub": "huggingface",
        "cache_config": null,
        "version": 2,
        "model_family": "stable_diffusion",
        "model_ability": null,
        "controlnet": [],
        "default_model_config": {},
        "default_generate_config": {},
        "gguf_model_id": null,
        "gguf_quantizations": null,
        "gguf_model_file_name_template": null,
        "lightning_model_id": null,
        "lightning_versions": null,
        "lightning_model_file_name_template": null,
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "model_uri": "file:///path/to/my-qwen-image",
        "is_builtin": false
      }

  .. tab:: audio

    .. code-block:: json

      {
        "model_name": "my-ChatTTS",
        "model_id": null,
        "model_revision": null,
        "model_hub": "huggingface",
        "cache_config": null,
        "version": 2,
        "model_family": "ChatTTS",
        "multilingual": false,
        "language": null,
        "model_ability": [
          "text2audio"
        ],
        "default_model_config": null,
        "default_transcription_config": null,
        "engine": null,
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "model_uri": "file:///path/to/my-ChatTTS",
        "is_builtin": false
      }

  .. tab:: flexible

    .. code-block:: json

      {
        "model_name": "my-flexible-model",
        "model_id": null,
        "model_revision": null,
        "model_hub": "huggingface",
        "cache_config": null,
        "version": 2,
        "model_description": "This is a model description.",
        "model_uri": "file:///path/to/my-flexible-model",
        "launcher": "xinference.model.flexible.launchers.transformers",
        "launcher_args": "{}",
        "virtualenv": {
          "packages": [],
          "inherit_pip_config": true,
          "index_url": null,
          "extra_index_url": null,
          "find_links": null,
          "trusted_host": null,
          "no_build_isolation": null
        },
        "is_builtin": false
      }

* model_name: A string defining the name of the model. The name must start with a letter or a digit and can only contain letters, digits, underscores, or dashes.
* context_length: An optional integer that specifies the maximum context size the model was trained to accommodate, encompassing both the input and output lengths. If not defined, the default value is 2048 tokens (~1,500 words).
* dimensions: An integer defining the size of the vector output by the embedding model.
* max_tokens: An integer defining the maximum number of input tokens the embedding model can process in a single request.
* model_lang: A list of strings representing the languages the model supports. Example: ["en"], which means that the model supports English.
* model_ability: A list of strings defining the abilities of the model. It can include options like "embed", "generate", and "chat". In the LLM template above, the model has the "generate" ability.
* model_family: A required string representing the family of the model you want to register. This parameter must not conflict with any builtin model names.
* model_specs: An array of objects defining the specifications of the model. These include:

  * model_format: A string that defines the model format, like "pytorch" or "ggufv2".
  * model_size_in_billions: An integer defining the size of the model in billions of parameters. Fractional sizes are written as a string with an underscore, as in "0_5" in the LLM template above.
  * quantizations: A list of strings defining the available quantizations for the model. For PyTorch models, it could be "4-bit", "8-bit", or "none". For ggufv2 models, the quantizations should correspond to values that work with the ``model_file_name_template``. Some engines also support ``fp4`` / ``fp8`` / ``bnb`` formats (see :ref:`installation` for backend support details).
  * model_id: A string representing the model ID, possibly referring to an identifier used by Hugging Face. **If model_uri is missing, Xinference will try to download the model from the Hugging Face repository specified here.**
  * model_hub: A string representing where to download the model from, such as "huggingface" or "modelscope".
  * model_uri: A string representing the URI where the model can be loaded from, such as "file:///path/to/llama-2-7b". **When the model format is ggufv2, model_uri must be the specific file path. When the model format is pytorch, model_uri must be the path to the directory containing the model files.** If ``model_uri`` is absent, Xinference will try to download the model from Hugging Face using the model ID.
  * model_revision: A string representing the specific version or commit hash of the model files to use from the repository.

* chat_template: If ``model_ability`` includes ``chat``, you must configure this option so that the correct full prompt is generated during chat. This is a Jinja template string. Usually, you can find it in the ``tokenizer_config.json`` file within the model directory.
* stop_token_ids: If ``model_ability`` includes ``chat``, you can configure this option to control when the model stops during chat. This is a list of integers, and you can typically extract the corresponding values from the ``generation_config.json`` or ``tokenizer_config.json`` file in the model directory.
* stop: If ``model_ability`` includes ``chat``, you can configure this option to control when the model stops during chat.
  This is a list of strings, and you can typically extract the corresponding values from the ``generation_config.json`` or ``tokenizer_config.json`` file in the model directory.
* reasoning_start_tag: A special token or prompt that marks the start of the model's chain-of-thought or reasoning process in its output.
* reasoning_end_tag: A special token or prompt that marks the end of the model's chain-of-thought or reasoning process in its output.
* cache_config: A string representing the parameters and rules for how the system stores and manages temporary data (cache).
* virtualenv: A settings object for model dependency isolation. Please refer to :ref:`this document ` for details.

Register a Custom Model
~~~~~~~~~~~~~~~~~~~~~~~

Register a custom model programmatically:

.. code-block:: python

  from xinference.client import Client

  # The model definition is passed to the server as a JSON string.
  with open('model.json') as fd:
      model = fd.read()

  # replace with real xinference endpoint
  endpoint = 'http://localhost:9997'
  client = Client(endpoint)
  client.register_model(model_type="<model_type>", model=model, persist=False)

Or via CLI:

.. code-block:: bash

  xinference register --model-type <model_type> --file model.json --persist

Replace ``<model_type>`` above with ``LLM``, ``embedding`` or ``rerank``. The same applies to the examples below.

List the Built-in and Custom Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

List built-in and custom models programmatically:

.. code-block:: python

  registrations = client.list_model_registrations(model_type="<model_type>")

Or via CLI:

.. code-block:: bash

  xinference registrations --model-type <model_type>

Launch the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~

Launch the custom model programmatically:

.. code-block:: python

  uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

Or via CLI:

.. code-block:: bash

  xinference launch --model-name custom-llama-2 --model-format pytorch

Interact with the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Invoke the model programmatically:

.. code-block:: python

  model = client.get_model(model_uid=uid)
  model.generate('What is the largest animal in the world?')

Result:

.. code-block:: json

  {
     "id":"cmpl-a4a9d9fc-7703-4a44-82af-fce9e3c0e52a",
     "object":"text_completion",
     "created":1692024624,
     "model":"43e1f69a-3ab0-11ee-8f69-fa163e74fa2d",
     "choices":[
        {
           "text":"\nWhat does an octopus look like?\nHow many human hours has an octopus been watching you for?",
           "index":0,
           "logprobs":"None",
           "finish_reason":"stop"
        }
     ],
     "usage":{
        "prompt_tokens":10,
        "completion_tokens":23,
        "total_tokens":33
     }
  }

Or via CLI, replacing ``${UID}`` with the real model UID:

.. code-block:: bash

  xinference generate --model-uid ${UID}

Unregister the Custom Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unregister the custom model programmatically:

.. code-block:: python

  client.unregister_model(model_type="<model_type>", model_name='custom-llama-2')

Or via CLI:

.. code-block:: bash

  xinference unregister --model-type <model_type> --model-name custom-llama-2
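
The ``model_uri`` rules in the field reference (a single file for ``ggufv2``, a directory for ``pytorch``, and a fallback to ``model_id`` when no URI is given) can be checked locally before a definition is registered. The following is a minimal sketch of such a check, not part of Xinference: ``local_path_from_uri`` and ``check_spec`` are hypothetical helper names, and the sketch assumes that only local ``file://`` URIs are used.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_uri(model_uri: str) -> Path:
    """Convert a ``file://`` URI into a local filesystem path."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "file":
        # Assumption of this sketch: only local file URIs are handled.
        raise ValueError(f"expected a file:// URI, got {model_uri!r}")
    return Path(url2pathname(parsed.path))


def check_spec(spec: dict) -> list:
    """Return a list of problems found in one ``model_specs`` entry."""
    uri = spec.get("model_uri")
    if uri is None:
        # Without a URI, the documentation says the model is fetched by ID.
        return [] if spec.get("model_id") else ["set model_uri or model_id"]
    path = local_path_from_uri(uri)
    model_format = spec.get("model_format")
    if model_format == "ggufv2" and not path.is_file():
        return [f"ggufv2 expects a specific file: {path}"]
    if model_format == "pytorch" and not path.is_dir():
        return [f"pytorch expects a directory: {path}"]
    return []


if __name__ == "__main__":
    # Build a throwaway directory so the check has something real to inspect.
    with tempfile.TemporaryDirectory() as tmp:
        spec = {"model_format": "pytorch", "model_uri": Path(tmp).as_uri()}
        print(check_spec(spec))
```

Returning a list of messages rather than raising on the first problem makes it possible to run the check over every entry in ``model_specs`` and report everything at once.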