Custom Models#

Xinference provides a flexible and comprehensive way to integrate, manage, and utilize custom models.

Directly launch an existing model#

Since v0.14.0, you can launch an existing model directly by passing model_path to the launch interface, without downloading it. This approach requires that the model’s model_family be one of the built-in supported models, and it spares you from registering the model.

For example:

xinference launch --model-path <model_file_path> --model-engine <engine> -n qwen1.5-chat

The above example demonstrates how to directly launch a qwen1.5-chat model file without registering it.

For distributed scenarios, if your model file is on a specific worker, you can launch it directly by passing both worker_ip and model_path to the launch interface.

Note

For CLI usage, prefer --model-path (kebab-case). --model_path is legacy-compatible but not recommended.
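
The same launch can also be driven from Python. The following is a minimal sketch rather than a definitive implementation: launch_existing_model is a hypothetical helper, and it assumes that the client's launch_model method accepts model_engine, model_path, and worker_ip keyword arguments mirroring the launch interface described above. The client is passed in as a parameter, so any object exposing launch_model can be used.

```python
from typing import Any, Optional


def launch_existing_model(
    client: Any,
    model_name: str,
    model_engine: str,
    model_path: str,
    worker_ip: Optional[str] = None,
) -> str:
    """Launch a model from files already on disk, without registering it.

    Assumption: client.launch_model accepts the keyword arguments used here.
    """
    kwargs = {
        "model_name": model_name,  # must belong to a built-in model family
        "model_engine": model_engine,
        "model_path": model_path,
    }
    if worker_ip is not None:
        # Distributed deployments: target the worker that holds the model files.
        kwargs["worker_ip"] = worker_ip
    return client.launch_model(**kwargs)
```

With a real deployment, the client would typically be created with Client('http://localhost:9997') from xinference.client, and the returned value is the model UID.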

Define a custom model#

Web UI: Automatic LLM Config Parsing#

Added in version v2.0.0.

When registering a custom LLM via the Web UI, Xinference can automatically parse the model configuration and pre-fill key fields for you.

You only need to provide:

  • Model path / Model ID (where the model lives, local path or hub ID)

  • Model Family

After parsing, the UI can auto-populate fields such as:

  • Context Length

  • Model_Languages

  • Model_Abilities

  • Model_Specs

You can review and edit these fields before saving the custom model.

Define a custom model based on the following template:

{
    "version": 2,
    "context_length": 32768,
    "model_name": "custom-qwen-2.5",
    "model_lang": [
        "en",
        "zh"
    ],
    "model_ability": [
        "generate"
    ],
    "model_description": "This is a custom model description.",
    "model_family": "my-custom-qwen-2.5",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": "0_5",
            "quantization": "none",
            "model_id": null,
            "model_hub": "huggingface",
            "model_uri": "file:///path/to/models--Qwen--Qwen2.5-0.5B",
            "model_revision": null,
            "activated_size_in_billions": null
        }
    ],
    "chat_template": null,
    "stop_token_ids": null,
    "stop": null,
    "reasoning_start_tag": null,
    "reasoning_end_tag": null,
    "cache_config": null,
    "virtualenv": {
        "packages": [],
        "inherit_pip_config": true,
        "index_url": null,
        "extra_index_url": null,
        "find_links": null,
        "trusted_host": null,
        "no_build_isolation": null
    },
    "is_builtin": false
}
  • model_name: A string defining the name of the model. The name must start with a letter or a digit and can only contain letters, digits, underscores, or dashes.

  • context_length: An optional integer that specifies the maximum context size the model was trained to accommodate, encompassing both the input and output lengths. If not defined, the default value is 2048 tokens (~1,500 words).

  • dimensions: An integer defining the size of the vector output by the embedding model.

  • max_tokens: An integer defining the maximum number of input tokens the embedding model can process in a single request.

  • model_lang: A list of strings representing the supported languages for the model. Example: [“en”], which means that the model supports English.

  • model_ability: A list of strings defining the abilities of the model. It could include options like “embed”, “generate”, and “chat”. In this case, the model has the ability to “generate”.

  • model_family: A required string representing the family of the model you want to register. This parameter must not conflict with any built-in model names.

  • model_specs: An array of objects defining the specifications of the model. These include:
    • model_format: A string that defines the model format, like “pytorch” or “ggufv2”.

    • model_size_in_billions: An integer defining the size of the model in billions of parameters. Fractional sizes are written as a string with an underscore, as in “0_5” in the template above.

    • quantizations: A list of strings defining the available quantizations for the model. For PyTorch models, it could be “4-bit”, “8-bit”, or “none”. For ggufv2 models, the quantizations should correspond to values that work with the model_file_name_template. Some engines also support fp4 / fp8 / bnb formats (see Installation for backend support details).

    • model_id: A string representing the model ID, possibly referring to an identifier used by Hugging Face. If model_uri is missing, Xinference will try to download the model from the Hugging Face repository specified here.

    • model_hub: A string representing where to download the model from, like “huggingface” or “modelscope”.

    • model_uri: A string representing the URI where the model can be loaded from, such as “file:///path/to/llama-2-7b”. When the model format is ggufv2, model_uri must be the path to the specific model file. When the model format is pytorch, model_uri must be the path to the directory containing the model files. If model_uri is absent, Xinference will try to download the model from Hugging Face using model_id.

    • model_revision: A string representing the specific version or commit hash of the model files to use from the repository.

  • chat_template: If model_ability includes chat, you must configure this option to generate the correct full prompt during chat. This is a Jinja template string. Usually, you can find it in the tokenizer_config.json file within the model directory.

  • stop_token_ids: If model_ability includes chat, you can configure this option to control when the model stops during chat. This is a list of integers, and you can typically extract the corresponding values from the generation_config.json or tokenizer_config.json file in the model directory.

  • stop: If model_ability includes chat, you can configure this option to control when the model stops during chat. This is a list of strings, and you can typically extract the corresponding values from the generation_config.json or tokenizer_config.json file in the model directory.

  • reasoning_start_tag: A special token or prompt used to explicitly instruct the LLM to begin its chain-of-thought or reasoning process in its output.

  • reasoning_end_tag: A special token or prompt used to explicitly mark the end of the model’s chain-of-thought or reasoning process in its output.

  • cache_config: A string representing the parameters and rules for how the system stores and manages temporary data (cache).

  • virtualenv: A settings object for model dependency isolation. Please refer to this document for details.
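
Two of the rules above lend themselves to a quick local check before registering. The following is a hedged sketch, not part of Xinference: model_uri_problem and read_chat_settings are hypothetical helpers. The first checks that a file:// model_uri points to a file for ggufv2 and to a directory for pytorch. The second looks for a chat_template string in tokenizer_config.json and for an eos_token_id in generation_config.json, assuming the usual Hugging Face layout of those files. The results are only candidates and should be reviewed before they go into the model definition.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional
from urllib.parse import urlparse
from urllib.request import url2pathname


def model_uri_problem(model_format: str, model_uri: str) -> Optional[str]:
    """Describe what is wrong with a file:// model_uri, or return None."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "file":
        return None  # only local file URIs are checked in this sketch
    path = Path(url2pathname(parsed.path))
    if model_format == "ggufv2" and not path.is_file():
        return f"ggufv2 expects a model file, got: {path}"
    if model_format == "pytorch" and not path.is_dir():
        return f"pytorch expects a model directory, got: {path}"
    return None


def read_chat_settings(model_dir: str) -> Dict[str, Any]:
    """Collect chat_template and stop_token_ids candidates from a model directory."""
    root = Path(model_dir)
    settings: Dict[str, Any] = {"chat_template": None, "stop_token_ids": None}

    tokenizer_cfg = root / "tokenizer_config.json"
    if tokenizer_cfg.is_file():
        data = json.loads(tokenizer_cfg.read_text(encoding="utf-8"))
        template = data.get("chat_template")
        if isinstance(template, str):
            settings["chat_template"] = template

    generation_cfg = root / "generation_config.json"
    if generation_cfg.is_file():
        data = json.loads(generation_cfg.read_text(encoding="utf-8"))
        eos = data.get("eos_token_id")
        # Assumption: eos_token_id is a single integer or a list of integers.
        if isinstance(eos, int):
            settings["stop_token_ids"] = [eos]
        elif isinstance(eos, list):
            settings["stop_token_ids"] = [int(token) for token in eos]
    return settings
```

Missing files are tolerated: the corresponding entries simply stay None, which matches the null defaults in the template.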

Register a Custom Model#

Register a custom model programmatically:

from xinference.client import Client

# Read the model definition as a raw JSON string
with open('model.json') as fd:
    model = fd.read()

# replace with real xinference endpoint
endpoint = 'http://localhost:9997'
client = Client(endpoint)
client.register_model(model_type="<model_type>", model=model, persist=False)

Or via CLI:

xinference register --model-type <model_type> --file model.json --persist

Replace <model_type> above with LLM, embedding, or rerank. The same applies to the examples below.
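
A malformed model.json is easier to diagnose before it reaches the server. The sketch below wraps the programmatic registration in a hypothetical helper, register_from_file, that parses the file first and checks for a model_name. It relies only on the register_model call shown above and takes the client as a parameter; the validation step is an illustrative addition, not something Xinference requires.

```python
import json
from typing import Any


def register_from_file(
    client: Any, path: str, model_type: str, persist: bool = False
) -> str:
    """Validate a model definition file, register it, and return its model_name."""
    with open(path, encoding="utf-8") as fd:
        raw = fd.read()

    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    definition = json.loads(raw)
    if not isinstance(definition, dict) or "model_name" not in definition:
        raise ValueError(f"{path} does not define a model_name")

    # The raw JSON string is what gets sent, as in the example above.
    client.register_model(model_type=model_type, model=raw, persist=persist)
    return definition["model_name"]
```

The returned name is the value to pass as model_name when launching the model later.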

List the Built-in and Custom Models#

List built-in and custom models programmatically:

registrations = client.list_model_registrations(model_type="<model_type>")

Or via CLI:

xinference registrations --model-type <model_type>
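
To pick out only the custom models from the listing, the result can be filtered. This is a sketch under an assumption about the response shape: each registration is taken to be a dictionary with a model_name key and an is_builtin flag (the same flag that appears in the template). custom_model_names is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List


def custom_model_names(registrations: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted names of registrations that are not built-in.

    Assumption: each entry has "model_name" and, optionally, "is_builtin".
    Entries without the flag are treated as custom.
    """
    return sorted(
        entry["model_name"]
        for entry in registrations
        if not entry.get("is_builtin", False)
    )
```

For example, custom_model_names(registrations) can be called on the list returned by list_model_registrations above.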

Launch the Custom Model#

Launch the custom model programmatically:

uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

Or via CLI:

xinference launch --model-name custom-llama-2 --model-format pytorch

Interact with the Custom Model#

Invoke the model programmatically:

model = client.get_model(model_uid=uid)
model.generate('What is the largest animal in the world?')

Result:

{
   "id":"cmpl-a4a9d9fc-7703-4a44-82af-fce9e3c0e52a",
   "object":"text_completion",
   "created":1692024624,
   "model":"43e1f69a-3ab0-11ee-8f69-fa163e74fa2d",
   "choices":[
      {
         "text":"\nWhat does an octopus look like?\nHow many human hours has an octopus been watching you for?",
         "index":0,
         "logprobs":"None",
         "finish_reason":"stop"
      }
   ],
   "usage":{
      "prompt_tokens":10,
      "completion_tokens":23,
      "total_tokens":33
   }
}
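
The generated text sits inside the choices list of the result. A minimal sketch for pulling it out, assuming the response has the shape shown above (first_completion_text is a hypothetical helper):

```python
from typing import Any, Dict


def first_completion_text(result: Dict[str, Any]) -> str:
    """Return the text of the first choice, with surrounding whitespace removed."""
    choices = result.get("choices") or []
    if not choices:
        raise ValueError("completion result contains no choices")
    return choices[0]["text"].strip()
```

The usage entry of the same result reports prompt_tokens, completion_tokens, and total_tokens, which can be read in the same way.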

Or via CLI, replacing ${UID} with the real model UID:

xinference generate --model-uid ${UID}

Unregister the Custom Model#

Unregister the custom model programmatically:

model = client.unregister_model(model_type="<model_type>", model_name='custom-llama-2')

Or via CLI:

xinference unregister --model-type <model_type> --model-name custom-llama-2
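
In cleanup scripts it can be convenient to unregister a model only if it is actually registered as a custom model. The sketch below combines the list_model_registrations and unregister_model calls shown earlier. unregister_if_registered is a hypothetical helper, and it assumes each registration entry is a dictionary with model_name and is_builtin keys.

```python
from typing import Any


def unregister_if_registered(client: Any, model_type: str, model_name: str) -> bool:
    """Unregister a custom model if present; return True when a removal happened."""
    registrations = client.list_model_registrations(model_type=model_type)
    for entry in registrations:
        # Assumption: entries expose "model_name" and an "is_builtin" flag.
        if entry.get("model_name") == model_name and not entry.get("is_builtin", False):
            client.unregister_model(model_type=model_type, model_name=model_name)
            return True
    return False
```

Built-in models and unknown names are left untouched, and the function returns False in both cases.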