Audio#

Learn how to turn audio into text or text into audio with Xinference.

Introduction#

The Audio API provides three methods for interacting with audio:

The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.

API ENDPOINT	OpenAI-compatible ENDPOINT
Transcription API	/v1/audio/transcriptions
Translation API	/v1/audio/translations
Speech API	/v1/audio/speech

Supported models#

The audio API is supported with the following models in Xinference:

Audio to text#

For Mac M-series chips only:

Text to audio (TTS)#

Models supporting zero-shot (direct synthesis without reference audio):

Models supporting voice cloning (requires reference audio):

Models supporting emotion control:

IndexTTS2

For Mac M-series chips only:

Quickstart#

Transcription#

The Transcription API mimics OpenAI’s create transcriptions API. We can try the Transcription API via cURL, the OpenAI client, or Xinference’s Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
print(transcription.text)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Translation#

The Translation API mimics OpenAI’s create translations API. We can try the Translation API via cURL, the OpenAI client, or Xinference’s Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
print(translation.text)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

Speech#

The Speech API mimics OpenAI’s create speech API. We can try Speech API out either via cURL, OpenAI Client, or Xinference’s python client:

The Speech API is non-streaming by default, for two reasons:

The stream output of ChatTTS is not as good as the non-stream output; see 2noise/ChatTTS#564.
Streaming requires ffmpeg<7: https://pytorch.org/audio/stable/installation.html#optional-dependencies
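
Because streaming depends on the installed ffmpeg version, it can be useful to check that version before passing stream=True. The following is a minimal sketch, not part of Xinference: the helper names are hypothetical, and it assumes a release-style version string (for example "ffmpeg version 6.1.1 ...") from the ffmpeg binary on PATH, which may differ from the ffmpeg libraries that torchaudio actually loads.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_ffmpeg_major(version_output: str) -> Optional[int]:
    """Extract the major version from the output of `ffmpeg -version`.

    Assumes a release-style string such as "ffmpeg version 6.1.1 ...".
    Returns None when no such number is found (for example git builds).
    """
    match = re.match(r"ffmpeg version n?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def ffmpeg_major_below_7() -> bool:
    """Return True if the ffmpeg binary on PATH reports a major version < 7."""
    if shutil.which("ffmpeg") is None:
        return False
    output = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True, check=False
    ).stdout
    major = parse_ffmpeg_major(output)
    return major is not None and major < 7
```

A caller could fall back to the default non-stream mode whenever ffmpeg_major_below_7() returns False.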

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
model.speech(
    input="<The text to generate audio for>",
    voice="echo",
    stream=True,
)

The output will be an audio binary.
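
Since the result is raw audio, the usual next step is writing it to a file. The sketch below is a hypothetical helper (save_audio is not part of Xinference) that assumes the result is either a single bytes object, as in the non-stream examples, or an iterable of byte chunks, as in the stream examples; the demo uses dummy bytes instead of real audio.

```python
import os
import tempfile
from typing import Iterable, Union


def save_audio(result: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write a speech result to `path` and return the number of bytes written.

    Accepts a single bytes object (non-stream) or an iterable of byte
    chunks (stream).
    """
    chunks = [result] if isinstance(result, (bytes, bytearray)) else result
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written


# Demo with placeholder bytes standing in for real audio data.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "speech.mp3")
    print(save_audio(b"abc", target))                # non-stream style result
    print(save_audio(iter([b"ab", b"cd"]), target))  # stream style result
```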

ChatTTS Usage#

For basic usage, refer to the Speech section above.

Fixed tone color. We can use the fixed tone colors provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv and, taking seed_2155 as an example, read its emb_data:

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]

Use the fixed tone color of seed_2155 to generate speech.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>"
)

CosyVoice Usage#

CosyVoice has two versions: CosyVoice 1.0 and CosyVoice 2.0. CosyVoice 1.0 has three different models:

CosyVoice-300M-SFT: Choose this model if you just want to convert text to audio. There are pretrained voices available: [‘中文女’, ‘中文男’, ‘日语男’, ‘粤语女’, ‘英文女’, ‘英文男’, ‘韩语女’]
CosyVoice-300M: Choose this model if you want to clone voice or convert text to audio in different languages. The prompt_speech is always required and should be a WAV file. For optimal performance, use a sample rate of 16,000 Hz.
CosyVoice-300M-Instruct: Choose this model if you need precise control over the tone and pitch.

Basic usage, launch model CosyVoice-300M-SFT.

# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)

Clone voice, launch model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

zero_shot_prompt_text = ("<the words in the text exactly match "
                         "the audio file of the zero-shot prompt>")
# The words said in the audio file should be identical
# to zero_shot_prompt_text.
#
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
zero_shot_prompt_file = "<path to the zero-shot prompt WAV file>"
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)
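
Because the prompt audio must be a WAV file and a 16,000 Hz sample rate avoids resampling, it can help to inspect the file before sending it. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, it only handles PCM WAV data that the wave module can read, and the silent clip is generated purely as demo input.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate (Hz) of PCM WAV data held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


prompt = make_silent_wav(44100)
if wav_sample_rate(prompt) != 16000:
    print("The prompt is not 16,000 Hz and will be resampled.")
```

In real use, the bytes read from the prompt file would be passed to wav_sample_rate in place of the generated clip.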

Cross lingual usage, launch model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
cross_lingual_prompt_file = "<path to the cross-lingual prompt WAV file>"
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)

Instruction based, launch model CosyVoice-300M-Instruct.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时，他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
    "Fights with fervor for justice, but struggles with impulsiveness.",
)

CosyVoice 2.0 has only one model, which provides all the capabilities of the three CosyVoice 1.0 models. The usage is the same as for CosyVoice 1.0.

CosyVoice 2.0 stream usage, launch model CosyVoice2-0.5B.

import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
input_string = "你好，我是通义生成式语音大模型，请问有什么可以帮您的吗？"

# Launch model
client = Client(endpoint)
model_uid = client.launch_model(
    model_name="CosyVoice2-0.5B",
    model_type="audio",
    download_hub="modelscope",
)
model = client.get_model(model_uid)

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)

More instructions and examples can be found at https://fun-audio-llm.github.io/.

FishSpeech Usage#

For basic usage, refer to the Speech section above.

Clone voice, launch model FishSpeech-1.5. To clone a voice from reference audio with the FishSpeech model, use prompt_speech instead of reference_audio and prompt_text instead of reference_text. These arguments are aligned with the voice cloning arguments of CosyVoice.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The reference audio file contains the voice to clone.
# The words said in the file should be identical to reference_text.
reference_audio_file = "<path to the reference audio file>"
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = "<the words said in the reference audio>"

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)

Paraformer Usage#

model	vad	punc	timestamp	speaker	hotword
paraformer-zh	yes	yes	no	no	no
paraformer-zh-hotword	yes	yes	no	no	yes
paraformer-zh-spk	yes	yes	yes	yes	no
paraformer-zh-long	yes	yes	yes	yes	no
seaco-paraformer-zh (recommended)	yes	yes	yes	yes	yes
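
To pick a model programmatically, the table above can be encoded as a lookup. This is only an illustrative sketch: the dictionary is copied by hand from the table, and models_supporting is a hypothetical helper, not an Xinference API.

```python
from typing import List

# Feature support copied from the table above.
PARAFORMER_FEATURES = {
    "paraformer-zh": {"vad", "punc"},
    "paraformer-zh-hotword": {"vad", "punc", "hotword"},
    "paraformer-zh-spk": {"vad", "punc", "timestamp", "speaker"},
    "paraformer-zh-long": {"vad", "punc", "timestamp", "speaker"},
    "seaco-paraformer-zh": {"vad", "punc", "timestamp", "speaker", "hotword"},
}


def models_supporting(*features: str) -> List[str]:
    """Return the models that support every requested feature, in table order."""
    wanted = set(features)
    return [name for name, have in PARAFORMER_FEATURES.items() if wanted <= have]


print(models_supporting("timestamp", "hotword"))
```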

VAD & Punctuation Usage

All Paraformer models support VAD and punctuation.

Timestamp & Speaker Usage

Only the following models support timestamp and speaker:
- paraformer-zh-spk
- paraformer-zh-long
- seaco-paraformer-zh

Among them, only paraformer-zh-spk enables speaker info by default.

If you need speaker info when using paraformer-zh-long or seaco-paraformer-zh:
- In Web UI: add an extra parameter with key spk_model and value cam++
- In command line: add the option --spk_model cam++

Example:
```
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, response_format="verbose_json")
```

Hotword Usage

Only the following models support hotword:

paraformer-zh-hotword
seaco-paraformer-zh

Example:

from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, hotword="小艾 魔搭")

SenseVoiceSmall Offline Usage#

SenseVoiceSmall now uses a small VAD model, fsmn-vad, which is downloaded automatically, so network access is required.

In an offline environment, you can download the VAD model in advance from Hugging Face or ModelScope. The steps below assume it was downloaded to /path/to/fsmn-vad.

When launching SenseVoiceSmall with the Web UI, add an additional parameter with key vad_model and value /path/to/fsmn-vad (the download path). When launching with the command line, add the option --vad_model /path/to/fsmn-vad.

Kokoro Usage#

The Kokoro model supports multiple languages, but the default language is English. If you want to use other languages, such as Chinese, you need to install additional dependency packages and add an additional parameter when starting the model.

pip install misaki[zh]

Initialize the model with the parameter lang_code="z". For all available lang_code options, refer to the Kokoro source code. If the model is started through the Web UI, add an additional parameter with key lang_code and value z. If the model is started through the Xinference client, pass the parameter via the launch_model interface:
```
model_uid = client.launch_model(
    model_name="Kokoro-82M",
    model_type="audio",
    compile=False,
    download_hub="huggingface",
    lang_code="z",
)
```
At inference time, the voice name must start with "z", for example zf_xiaoyi. The currently supported voices are listed at https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices. For example:
```
input_string = "重新启动即可更新"
response = model.speech(input_string, voice="zf_xiaoyi")
```
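
A mismatched voice can be caught before the request is sent. The sketch below uses a hypothetical helper, check_voice, and assumes the rule stated above for "z" generalizes to other language codes, that is, that the voice name starts with the lang_code used at launch.

```python
def check_voice(voice: str, lang_code: str) -> str:
    """Return `voice` if it starts with `lang_code`, otherwise raise ValueError."""
    if not voice.startswith(lang_code):
        raise ValueError(
            f"voice {voice!r} does not start with lang_code {lang_code!r}"
        )
    return voice


print(check_voice("zf_xiaoyi", "z"))
```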

IndexTTS2 Usage#

The IndexTTS2 model supports emotion control, which you can use by specifying additional parameters. Here are several examples of how to use IndexTTS2:

Synthesize new speech with a single reference audio file (voice cloning):

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Translate for me, what is a surprise!",
    prompt_speech=test_prompt_speech,
)

Using a separate, emotional reference audio file to condition the speech synthesis:

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()

response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech
)

When an emotional reference audio file is specified, you can optionally set emo_alpha to adjust how much it affects the output. The valid range is 0.0 to 1.0, and the default value is 1.0 (100%):

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()

response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech,
    emo_alpha=0.9
)

It’s also possible to omit the emotional reference audio and instead provide an 8-float list specifying the intensity of each emotion, in the following order: [happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]. You can additionally use the use_random parameter to introduce stochasticity during inference; the default is False, and setting it to True enables randomness:

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Wow, I'm so lucky!",
    prompt_speech=test_prompt_speech,
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],
    use_random=False
)
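
Writing the 8-float list by position is easy to get wrong, so a small helper that builds it from named emotions may be convenient. This is a hypothetical sketch (make_emo_vector is not part of Xinference or IndexTTS2); it only relies on the emotion order stated above.

```python
from typing import List

# Emotion order as stated above.
EMOTIONS = ["happy", "angry", "sad", "afraid",
            "disgusted", "melancholic", "surprised", "calm"]


def make_emo_vector(**intensities: float) -> List[float]:
    """Build the 8-float emotion vector from named intensities.

    Emotions that are not named default to 0.0; unknown names raise ValueError.
    """
    unknown = set(intensities) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [float(intensities.get(name, 0.0)) for name in EMOTIONS]


print(make_emo_vector(surprised=0.45))
```

The returned list can then be passed as the emo_vector argument of model.speech.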

Alternatively, you can enable use_emo_text to guide the emotions based on your provided text script. Your text script will then automatically be converted into emotion vectors. It’s recommended to use emo_alpha around 0.6 (or lower) when using the text emotion modes, for more natural sounding speech. You can introduce randomness with use_random (default: False; True enables randomness):

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    use_random=False
)

It’s also possible to directly provide a specific text emotion description via the emo_text parameter. Your emotion text will then automatically be converted into emotion vectors. This gives you separate control of the text script and the text emotion description:

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    emo_text="You scared the hell out of me! Are you a ghost?",
    use_random=False
)

IndexTTS2 Offline Usage#

IndexTTS2 requires several small models that are downloaded automatically during initialization. For offline environments, you can download these models to a single directory and specify the directory path.

Easy Setup Method

The simplest way to set up offline usage is to use the hf download command to download the small models in advance:

# Create your local models directory
mkdir -p /path/to/small_models

# Download models from Hugging Face
hf download facebook/w2v-bert-2.0 --local-dir /path/to/small_models/w2v-bert-2.0
hf download funasr/campplus --local-dir /path/to/small_models/campplus
hf download nvidia/bigvgan_v2_22khz_80band_256x --local-dir /path/to/small_models/bigvgan
hf download amphion/MaskGCT --local-dir /path/to/small_models/MaskGCT

The final directory structure should look like this:

/path/to/small_models/
├── w2v-bert-2.0/                 # Feature extraction model
├── campplus/                     # Speaker recognition model
├── bigvgan/                      # Vocoder model
└── MaskGCT/                      # Semantic codec model
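
Before launching offline, the layout above can be verified with a short script. The following is a minimal sketch with a hypothetical helper, missing_small_models; it only checks that each expected sub-directory exists and is non-empty, not that the downloaded files are complete. The demo runs on a throwaway directory with a placeholder file.

```python
import os
import tempfile
from typing import List

# Sub-directory names from the layout above.
REQUIRED_SUBDIRS = ["w2v-bert-2.0", "campplus", "bigvgan", "MaskGCT"]


def missing_small_models(small_models_dir: str) -> List[str]:
    """Return the required sub-directories that are absent or empty."""
    missing = []
    for name in REQUIRED_SUBDIRS:
        path = os.path.join(small_models_dir, name)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing


# Demo: a temporary directory that contains only one of the four models.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "campplus"))
    # Placeholder file name, used only so the directory is non-empty.
    open(os.path.join(root, "campplus", "placeholder.bin"), "w").close()
    print(missing_small_models(root))
```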

Required Models

The small models are automatically mapped as follows:

w2v-bert-2.0 (models--facebook--w2v-bert-2.0) - Feature extraction model
campplus (models--funasr--campplus) - Speaker recognition model
bigvgan (models--nvidia--bigvgan_v2_22khz_80band_256x) - Vocoder model
semantic_codec (models--amphion--MaskGCT) - Semantic encoding/decoding model

Launching IndexTTS2 with Offline Models

When launching IndexTTS2 with the Web UI, you can add an additional parameter:

- small_models_dir: path to the directory containing all small models

When launching with command line, you can add the option:

xinference launch --model-name IndexTTS2 --model-type audio \
    --small_models_dir /path/to/small_models

When launching with Python client:

model_uid = client.launch_model(
    model_name="IndexTTS2",
    model_type="audio",
    small_models_dir="/path/to/small_models"
)