Model Memory Calculation#

To better plan GPU memory (VRAM) usage, Xinference provides a tool for model memory calculation: cal-model-mem

The calculation uses the algorithm from RahulSChand/gpu_poor

Output: model_mem, kv_cache, overhead, active_mem

Example: To calculate memory usage for qwen1.5-chat, run the following command:

xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat
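
As a very rough sanity check (this is not the exact gpu_poor formula, which also accounts for quantization overhead, the KV cache, and activations), weight memory is approximately the parameter count times the bytes per parameter: a 7B model at INT4 (about 0.5 bytes per parameter) needs on the order of 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB for the weights alone, before KV cache and overhead.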

Syntax#

  • --size-in-billions {model_size}

    • -s {model_size}

    Specify the model size in billions of parameters. Both 1_8 and 1.8 formats are accepted; for example, 7 for a 7B model. See the illustrative invocation after this list.

  • --quantization {precision}

    • -q {precision} (Optional)

    Define the quantization settings for the model. For example, Int4 for INT4 quantization.

  • --model-name {model_name}

    • -n {model_name} (Optional)

    Specify the model's name. If provided, the model configuration is fetched from Hugging Face or ModelScope; if not specified, default model layer settings are used for the estimate.

  • --context-length {context_length}

    • -c {context_length}

    Specify the maximum number of tokens (context length) that your model supports.

  • --model-format {format}

    • -f {format}

    Specify the format of the model, e.g. pytorch, ggmlv3, etc.
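
As an illustration of combining these options (the values below are illustrative only, not a recommendation), a 1.8B GPTQ model with a 32768-token context window can be specified using the underscore size format:

xinference cal-model-mem -s 1_8 -q Int4 -f gptq -c 32768 -n qwen1.5-chat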

Note

The environment variable HF_ENDPOINT can be used to set the Hugging Face endpoint, e.g. a mirror such as hf-mirror. Please refer to this document for details.
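
For example, assuming the commonly used mirror address https://hf-mirror.com (shown here only as an illustration), the endpoint can be set for a single run:

HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat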