Audio#

Learn how to turn audio into text or text into audio with Xinference.

Introduction#

The Audio API provides three methods for interacting with audio:

  • The transcriptions endpoint transcribes audio into text in the language of the audio.

  • The translations endpoint translates audio into English text.

  • The speech endpoint generates audio from the input text.

API ENDPOINT         OpenAI-compatible ENDPOINT
Transcription API    /v1/audio/transcriptions
Translation API      /v1/audio/translations
Speech API           /v1/audio/speech
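As a small illustration of how these paths combine with a server address, the sketch below builds full URLs from a base URL. The helper name audio_url and the dictionary AUDIO_ENDPOINTS are hypothetical and not part of Xinference; the address is the local endpoint used later on this page.

```python
# Endpoint paths as listed for the three audio APIs.
AUDIO_ENDPOINTS = {
    "transcription": "/v1/audio/transcriptions",
    "translation": "/v1/audio/translations",
    "speech": "/v1/audio/speech",
}


def audio_url(base_url: str, api: str) -> str:
    """Return the full URL for one audio API (hypothetical helper)."""
    # Strip a trailing slash so the joined URL never contains "//v1".
    return base_url.rstrip("/") + AUDIO_ENDPOINTS[api]


print(audio_url("http://127.0.0.1:9997", "speech"))
# prints http://127.0.0.1:9997/v1/audio/speech
```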

Supported models#

The audio API is supported with the following models in Xinference:

Audio to text#

For Mac M-series chips only:

Text to audio#

For Mac M-series chips only:

Quickstart#

Transcription#

The Transcription API mimics OpenAI’s create transcriptions API. We can try the Transcription API via cURL, the OpenAI client, or Xinference’s Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@<path to audio file>'
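The same request can be sketched with Xinference's Python client. This is a minimal example, assuming model is the handle returned by Client.get_model for a launched audio-to-text model and that it exposes a transcriptions method taking the raw audio bytes. The helper name transcribe_file is hypothetical.

```python
def transcribe_file(model, audio_path: str):
    """Read an audio file and pass its bytes to the model handle.

    `model` is assumed to be the handle returned by Client.get_model(...)
    for an audio-to-text model.
    """
    with open(audio_path, "rb") as f:
        audio = f.read()
    return model.transcriptions(audio)


# Example usage (needs a running Xinference server and a launched model):
# from xinference.client import Client
# client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
# model = client.get_model("<MODEL_UID>")
# print(transcribe_file(model, "<path to audio file>"))
```

Taking the model handle as a parameter keeps the helper independent of how the client was created.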

Translation#

The Translation API mimics OpenAI’s create translations API. We can try the Translation API via cURL, the OpenAI client, or Xinference’s Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@<path to audio file>'
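Because the endpoint is OpenAI-compatible, the OpenAI Python client can also be pointed at it. The sketch below assumes openai_client is an openai.Client whose base_url is the Xinference server's /v1 path, and that the response carries a text field as in OpenAI's API. The helper name translate_file is hypothetical.

```python
def translate_file(openai_client, model_uid: str, audio_path: str) -> str:
    """Translate the speech in an audio file into English text.

    `openai_client` is assumed to be an openai.Client configured with
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1".
    """
    with open(audio_path, "rb") as f:
        result = openai_client.audio.translations.create(model=model_uid, file=f)
    return result.text


# Example usage (needs a running Xinference server and a launched model):
# import openai
# client = openai.Client(
#     api_key="not empty",
#     base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1",
# )
# print(translate_file(client, "<MODEL_UID>", "<path to audio file>"))
```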

Speech#

The Speech API mimics OpenAI’s create speech API. We can try the Speech API via cURL, the OpenAI client, or Xinference’s Python client:

The Speech API is non-streaming by default because:

  1. The stream output of ChatTTS is not as good as the non-stream output; please refer to 2noise/ChatTTS#564.

  2. Streaming requires ffmpeg<7; see https://pytorch.org/audio/stable/installation.html#optional-dependencies

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'
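A non-streaming request through Xinference's Python client can be sketched as follows. It assumes model is the handle returned by Client.get_model for a text-to-audio model and that a non-stream speech call returns the audio as bytes. The helper name synthesize_to_file is hypothetical, and the default voice is simply the one from the cURL example.

```python
def synthesize_to_file(model, text: str, out_path: str, voice: str = "echo") -> int:
    """Generate speech for `text` and write the returned bytes to `out_path`.

    Returns the number of bytes written.
    """
    # Non-stream call: the whole audio is assumed to come back as bytes.
    audio_bytes = model.speech(text, voice=voice)
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)


# Example usage (needs a running Xinference server and a launched model):
# from xinference.client import Client
# client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
# model = client.get_model("<MODEL_UID>")
# synthesize_to_file(model, "<The text to generate audio for>", "speech.mp3")
```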

ChatTTS Usage#

For basic usage, see the Speech section above.

Fixed tone color. We can use a fixed tone color provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv and, taking seed_2155 as an example, extract its emb_data:

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]

Use the fixed tone color of seed_2155 to generate speech.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>",
)

CosyVoice Usage#

Basic usage, launch model CosyVoice-300M-SFT.

# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'

Clone voice, launch model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

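# Transcript of the prompt audio; fill in the words spoken in the file
zero_shot_prompt_text = ""
# Path to the voice file to clone (placeholder)
zero_shot_prompt_file = "<path to the prompt audio file>"
# The words said in the file should be identical to zero_shot_prompt_text
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()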

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)

Cross lingual usage, launch model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# Path to an audio file spoken in some language (placeholder)
cross_lingual_prompt_file = "<path to the prompt audio file>"
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)

Instruction based, launch model CosyVoice-300M-Instruct.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
    "Fights with fervor for justice, but struggles with impulsiveness.",
)

CosyVoice 2.0 stream usage, launch model CosyVoice2-0.5B.

Note

Please note that the latest CosyVoice 2.0 requires use_flow_cache=True for stream generation.

import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
input_string = "你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?"

# Launch model
client = Client(endpoint)
model_uid = client.launch_model(
    model_name="CosyVoice2-0.5B",
    model_type="audio",
    download_hub="modelscope",
    use_flow_cache=True,
)
model = client.get_model(model_uid)

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
# with_streaming_response.create(...) returns a context manager
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)

More instructions and examples can be found at https://fun-audio-llm.github.io/.

FishSpeech Usage#

For basic usage, see the Speech section above.

Clone voice, launch model FishSpeech-1.5. To clone a voice from reference audio with the FishSpeech model, use prompt_speech instead of reference_audio and prompt_text instead of reference_text. These arguments are aligned with the voice cloning arguments of CosyVoice.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# Path to the reference audio file, i.e. the voice to clone (placeholder)
reference_audio_file = "<path to the reference audio file>"
# The words said in the file should be identical to reference_text
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = ""  # text in the audio

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)

SenseVoiceSmall Offline Usage#

SenseVoiceSmall currently uses a small VAD model, fsmn-vad, which is downloaded automatically, so network access is required.

For an offline environment, download the VAD model in advance from Hugging Face or ModelScope. Assume it is downloaded to /path/to/fsmn-vad.

When launching SenseVoiceSmall from the Web UI, add an additional parameter with key vad_model and value /path/to/fsmn-vad (the download path). When launching from the command line, add the option --vad_model /path/to/fsmn-vad.
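Launching through the Python client might look like the sketch below. It is an assumption, not something stated on this page, that launch_model forwards an extra vad_model keyword to the model in the same way as the Web UI additional parameter and the --vad_model option. The helper name launch_sensevoice_offline is hypothetical.

```python
import os


def launch_sensevoice_offline(client, vad_model_path: str) -> str:
    """Launch SenseVoiceSmall with a pre-downloaded fsmn-vad model.

    ASSUMPTION: client.launch_model passes the extra `vad_model` keyword
    through to the model, like the Web UI parameter and CLI option do.
    """
    # Fail early if the VAD model has not been downloaded yet.
    if not os.path.isdir(vad_model_path):
        raise FileNotFoundError(f"VAD model directory not found: {vad_model_path}")
    return client.launch_model(
        model_name="SenseVoiceSmall",
        model_type="audio",
        vad_model=vad_model_path,
    )


# Example usage (needs a running Xinference server):
# from xinference.client import Client
# client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
# model_uid = launch_sensevoice_offline(client, "/path/to/fsmn-vad")
```

Checking the directory first reports a missing download before any request reaches the server.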

Kokoro Usage#

The Kokoro model supports multiple languages, but the default language is English. If you want to use other languages, such as Chinese, you need to install additional dependency packages and add an additional parameter when starting the model.

  1. pip install misaki[zh]

  2. Initialize the model with the parameter lang_code='z'. For all available lang_code options, please refer to the kokoro source code. If the model is started through the web UI, add an additional parameter with key lang_code and value z. If the model is started through the xinference client, pass the parameter via the launch_model interface:

    model_uid = client.launch_model(
        model_name="Kokoro-82M",
        model_type="audio",
        compile=False,
        download_hub="huggingface",
        lang_code="z",
    )
    
  3. At inference time, the voice must start with 'z', for example zf_xiaoyi. The currently supported voices are listed at https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices. For example:

    input_string = "重新启动即可更新"
    response = model.speech(input_string, voice="zf_xiaoyi")
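The voice-prefix rule from step 3 can be checked before sending a request. The sketch below is a hypothetical client-side helper, not part of Xinference or Kokoro. It assumes the rule generalizes from 'z' to any lang_code, which this page only states for Chinese.

```python
def check_kokoro_voice(lang_code: str, voice: str) -> str:
    """Return `voice` if it starts with `lang_code`, else raise ValueError.

    ASSUMPTION: the prefix rule stated for lang_code 'z' holds for other
    language codes as well.
    """
    if not voice.startswith(lang_code):
        raise ValueError(
            f"voice {voice!r} does not start with lang_code {lang_code!r}"
        )
    return voice


print(check_kokoro_voice("z", "zf_xiaoyi"))
# prints zf_xiaoyi
```

The returned name can be passed straight to model.speech(input_string, voice=...).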