Model Memory Calculation#
To help plan GPU memory (VRAM) usage, Xinference provides a tool for estimating a model's memory footprint: cal-model-mem.
It uses the algorithm from RahulSChand/gpu_poor and reports model_mem, kv_cache, overhead, and active_mem.
Example: To calculate memory usage for qwen1.5-chat, run the following command:
xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat
model_name: qwen1.5-chat
kv_cache_dtype: 16
model size: 7.0 B
quant: Int4
context: 16384
gpu mem usage:
  model mem: 4139 MB
  kv_cache: 8192 MB
  overhead: 650 MB
  active: 17024 MB
  total: 30005 MB (30 GB)
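The kv_cache figure can be cross-checked by hand: keys and values are stored for every layer and every token of the context. Below is a minimal sketch of that arithmetic, assuming Qwen1.5 7B uses 32 transformer layers with a hidden size of 4096 and that the KV cache is kept in fp16 (2 bytes per value); the layer count and hidden size are assumptions taken from the public model config, not output of the tool.

# Rough cross-check of the kv_cache value above (a sketch, not Xinference code).
context_length = 16384   # matches -c 16384
num_layers = 32          # assumed for Qwen1.5 7B
hidden_size = 4096       # assumed for Qwen1.5 7B
bytes_per_value = 2      # fp16 cache, matching kv_cache_dtype: 16
# Both keys and values are cached for every token in every layer.
kv_cache_bytes = 2 * context_length * num_layers * hidden_size * bytes_per_value
print(kv_cache_bytes / 1024 ** 2)  # 8192.0 MB, matching the kv_cache line above

The reported total is the sum of the four components: 4139 + 8192 + 650 + 17024 = 30005 MB.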
Syntax#
--size-in-billions {model_size}
-s {model_size}
Specify the model size in billions of parameters. Both the 1_8 and 1.8 formats are accepted; for example, 7 for a 7.0B model.
--quantization {precision}
-q {precision} (Optional)
Define the quantization settings for the model. For example, Int4 for INT4 quantization.
--model-name {model_name}
-n {model_name} (Optional)
Specify the model name. If provided, the model config is fetched from Hugging Face or ModelScope; if not specified, the default model layers are used for the estimate (see the example after this list).
--context-length {context_length}
-c {context_length}
Specify the maximum number of tokens (context length) that your model supports.
--model-format {format}
-f {format}
Specify the format of the model, e.g. pytorch, ggmlv3, etc.
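If the model name is omitted, the default model layers are used, so a rough estimate is still possible. A hypothetical invocation for an unquantized 7B pytorch-format model (illustrative only, not taken from the official docs) might look like:
xinference cal-model-mem -s 7 -f pytorch -c 4096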
Note
The environment variable HF_ENDPOINT can be used to set the HuggingFace endpoint, e.g. to point at hf-mirror. Please refer to this document for details.
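For example, to fetch the model config through a mirror, the endpoint can be set inline when running the command (the mirror URL below is illustrative):
HF_ENDPOINT=https://hf-mirror.com xinference cal-model-mem -s 7 -q Int4 -f gptq -c 16384 -n qwen1.5-chat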