Audio

Learn how to use Xinference to convert audio to text or text to audio.

Introduction

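The Audio API provides three ways to interact with audio:

The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.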

API endpoint	OpenAI-compatible endpoint
Transcription API	/v1/audio/transcriptions
Translation API	/v1/audio/translations
Speech API	/v1/audio/speech
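
As a minimal sketch, the table above can be turned into a small lookup that builds the full URL for each endpoint. The helper name audio_endpoint_url and the example host and port are illustrative assumptions, not part of Xinference.

```python
# Paths are copied from the endpoint table above; the helper is hypothetical.
AUDIO_ENDPOINTS = {
    "transcription": "/v1/audio/transcriptions",
    "translation": "/v1/audio/translations",
    "speech": "/v1/audio/speech",
}


def audio_endpoint_url(base_url: str, api: str) -> str:
    """Join a server base URL with the path of the requested audio API."""
    if api not in AUDIO_ENDPOINTS:
        raise ValueError(f"unknown audio API: {api!r}")
    return base_url.rstrip("/") + AUDIO_ENDPOINTS[api]


if __name__ == "__main__":
    # Host and port are placeholders for a local deployment.
    print(audio_endpoint_url("http://127.0.0.1:9997", "speech"))
```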

Supported models

In Xinference, the following models support the Audio API:

Speech to text

For Mac M-series chips only:

Text to speech

For Mac M-series chips only:

F5-TTS-MLX

Quickstart

Transcription

The Transcription API mimics OpenAI's create transcriptions API. You can try it with cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Translation

The Translation API mimics OpenAI's create translations API. You can try it with cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model="<MODEL_UID>",
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

Speech

The Speech API mimics OpenAI's create speech API. You can try it with cURL, the OpenAI client, or Xinference's Python client:

The Speech API is non-streaming by default.

ChatTTS streaming output is not as good as its non-streaming output; see 2noise/ChatTTS#564.
Streaming requires ffmpeg<7: https://pytorch.org/audio/stable/installation.html#optional-dependencies

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
model.speech(
    input="<The text to generate audio for>",
    voice="echo",
    stream=True,
)

The output will be an audio binary.
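
The examples on this page show two shapes for that output: a single bytes object from a non-streaming call, and an iterator of byte chunks when stream=True. A small helper can save either form to a file. This is only a sketch; save_speech is a hypothetical name, not a Xinference function.

```python
from typing import Iterable, Union


def save_speech(result: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write speech output to `path` and return the number of bytes written."""
    # A non-streaming call yields one bytes object; a streaming call yields chunks.
    chunks = [result] if isinstance(result, (bytes, bytearray)) else result
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written
```

With a model handle obtained as in the example above, a call such as save_speech(model.speech(input="...", voice="echo"), "out.mp3") would then cover both modes.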

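ChatTTS usage

For basic usage, see the Speech section above.

Fixed voice. You can use the fixed voices provided by 6drf21e/ChatTTS_Speaker: download evaluation_results.csv and, taking the seed_2155 voice as an example, use the data in the emb_data column.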

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]
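
If pandas is not available, the same lookup can be sketched with the standard-library csv module. This assumes the file has the seed_id and emb_data columns used above; the function name load_emb_data is hypothetical.

```python
import csv
from typing import TextIO


def load_emb_data(csv_file: TextIO, seed_id: str) -> str:
    """Return the emb_data value of the first row whose seed_id matches."""
    for row in csv.DictReader(csv_file):
        if row["seed_id"] == seed_id:
            return row["emb_data"]
    raise KeyError(f"{seed_id} not found")


if __name__ == "__main__":
    # Same file and seed as the pandas example above.
    with open("evaluation_results.csv", newline="", encoding="utf-8") as f:
        emb_data_2155 = load_emb_data(f, "seed_2155")
```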

Create speech with the seed_2155 fixed voice.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>",
)

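CosyVoice usage

CosyVoice has two versions: CosyVoice 1.0 and CosyVoice 2.0. CosyVoice 1.0 has three models:

CosyVoice-300M-SFT: Choose this model if you only want to convert text to speech. It provides several pretrained voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
CosyVoice-300M: Choose this model if you want to clone a voice or convert text into speech in another language. With this model you must provide a prompt_speech audio file in WAV format; use a 16,000 Hz sample rate for better performance.
CosyVoice-300M-Instruct: Choose this model if you want precise control over tone and timbre.

Basic usage, with the CosyVoice-300M-SFT model loaded.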

# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)

Voice cloning, with the CosyVoice-300M model loaded.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

zero_shot_prompt_text = ("<the words in the text exactly match "
                         "the audio file of the zero-shot prompt>")
# The words said in the audio file should be identical
# to zero_shot_prompt_text.
#
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)
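
The comments in the example above ask for a WAV prompt, ideally at 16,000 Hz. A quick check before sending the file can be sketched with Python's standard-library wave module, which reads uncompressed PCM WAV data. The helper names wav_sample_rate and needs_resampling are hypothetical, not part of Xinference.

```python
import io
import wave

TARGET_SAMPLE_RATE = 16000  # rate recommended in the comments above


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate (Hz) stored in the header of a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def needs_resampling(wav_bytes: bytes) -> bool:
    """True when the prompt is not already at the recommended 16,000 Hz."""
    return wav_sample_rate(wav_bytes) != TARGET_SAMPLE_RATE
```

Calling needs_resampling(zero_shot_prompt) on the bytes read above tells you whether the server is likely to spend extra time resampling the prompt.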

Cross-lingual usage, with the CosyVoice-300M model loaded.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)

Instruction-based speech synthesis, with the CosyVoice-300M-Instruct model loaded.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时，他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
    "Fights with fervor for justice, but struggles with impulsiveness.",
)

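CosyVoice 2.0 has a single model that covers all the capabilities of the three CosyVoice 1.0 models. It is used the same way as CosyVoice 1.0; the one thing to note is that streaming generation requires launching the CosyVoice 2.0 model with the use_flow_cache=True launch parameter.

Streaming with CosyVoice 2.0, with the CosyVoice2-0.5B model loaded.

Note

The latest version of CosyVoice 2.0 requires use_flow_cache=True for streaming generation.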

# Launch model
import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
client = Client(endpoint)

model_uid = client.launch_model(
    model_name="CosyVoice2-0.5B",
    model_type="audio",
    download_hub="modelscope",
    use_flow_cache=True,
)
model = client.get_model(model_uid)

input_string = "你好，我是通义生成式语音大模型，请问有什么可以帮您的吗？"

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)

For more instructions and examples, see https://fun-audio-llm.github.io/.

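FishSpeech usage

For basic usage, see the Speech section above.

Voice cloning, with the FishSpeech-1.5 model launched. To give FishSpeech a reference audio, use prompt_speech instead of reference_audio and prompt_text instead of reference_text. These parameters are consistent with CosyVoice voice cloning.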

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The reference audio file is the voice file
# the words said in the file should be identical to reference_text
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = ""  # text in the audio

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)

Paraformer usage

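Model	Voice activity detection (vad)	Punctuation restoration (punc)	Timestamps	Speaker	Hotwords
paraformer-zh	Yes	Yes	No	No	No
paraformer-zh-hotword	Yes	Yes	No	No	Yes
paraformer-zh-spk	Yes	Yes	Yes	Yes	No
paraformer-zh-long	Yes	Yes	Yes	Yes	No
seaco-paraformer-zh (recommended)	Yes	Yes	Yes	Yes	Yes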
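
To choose a model programmatically, the table above can be encoded as a lookup. This is an illustrative sketch: the short feature names and the helper models_supporting are hypothetical, and only the support matrix itself comes from the table.

```python
# Feature support transcribed from the table above.
PARAFORMER_FEATURES = {
    "paraformer-zh": {"vad", "punc"},
    "paraformer-zh-hotword": {"vad", "punc", "hotword"},
    "paraformer-zh-spk": {"vad", "punc", "timestamp", "speaker"},
    "paraformer-zh-long": {"vad", "punc", "timestamp", "speaker"},
    "seaco-paraformer-zh": {"vad", "punc", "timestamp", "speaker", "hotword"},
}


def models_supporting(*features: str) -> list[str]:
    """List the models whose feature set covers every requested feature."""
    wanted = set(features)
    return [name for name, have in PARAFORMER_FEATURES.items() if wanted <= have]


print(models_supporting("timestamp", "hotword"))  # ['seaco-paraformer-zh']
```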

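Using VAD and punctuation

All Paraformer models support VAD and punctuation restoration.

Using timestamps and speaker identification

Only the following models support timestamps and speaker identification:
- paraformer-zh-spk
- paraformer-zh-long
- seaco-paraformer-zh
Of these, only paraformer-zh-spk enables speaker identification by default.

If you use paraformer-zh-long or seaco-paraformer-zh and need speaker identification:
- In the Web UI: add a parameter named spk_model with the value cam++
- On the command line: add the argument --spk_model cam++
Example: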
```
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, response_format="verbose_json")
```

Using hotwords

Only the following models support hotwords:

paraformer-zh-hotword
seaco-paraformer-zh

Example:

from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, hotword="小艾 魔搭")

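Using SenseVoiceSmall offline

SenseVoiceSmall now uses a small VAD model, fsmn-vad, so it needs network access to download it.

In an offline environment, you can download the VAD model in advance.

Download it from huggingface or modelscope. Assume it is downloaded to /path/to/fsmn-vad.

Then, when loading SenseVoiceSmall from the Web UI, add an extra option whose key is vad_model and whose value is the download path /path/to/fsmn-vad. When loading from the command line, add the option --vad_model /path/to/fsmn-vad.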
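
The same option could plausibly be passed from the Python client. The sketch below assumes that launch_model forwards extra keyword arguments as launch options, in the same way lang_code and use_flow_cache are passed in other examples on this page; that behaviour is not confirmed here for vad_model. The helper name launch_sensevoice_offline is hypothetical.

```python
def launch_sensevoice_offline(client, vad_model_path: str) -> str:
    """Launch SenseVoiceSmall with a locally downloaded VAD model.

    `client` is expected to be an xinference.client.Client instance.
    """
    # Assumption: vad_model is accepted as an extra launch keyword argument,
    # mirroring the Web UI option and the --vad_model command-line flag.
    return client.launch_model(
        model_name="SenseVoiceSmall",
        model_type="audio",
        vad_model=vad_model_path,
    )
```

With client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>"), calling launch_sensevoice_offline(client, "/path/to/fsmn-vad") would return the model UID if the assumption holds.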

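Kokoro usage

The Kokoro model supports multiple languages and defaults to English. To use a non-default language such as Chinese, install the extra dependency and pass the corresponding parameter when launching the model.

pip install misaki[zh]

Initialize the model with the lang_code='z' parameter; see the kokoro source code for all supported lang_code values. If you launch the model from the Web UI, add an extra parameter whose key is lang_code and whose value is z. If you launch the model with the xinference client, pass the parameter as in the following code: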
```
model_uid = client.launch_model(
    model_name="Kokoro-82M",
    model_type="audio",
    compile=False,
    download_hub="huggingface",
    lang_code="z",
)
```
At inference time, use a voice whose name starts with 'z', for example zf_xiaoyi. The currently supported voices are listed at https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices. Usage:
```
input_string = "重新启动即可更新"
response = model.speech(input_string, voice="zf_xiaoyi")
```
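
A mismatch between lang_code and the voice is easy to miss, so a small guard before calling model.speech can help. This is a sketch only: pick_voice is a hypothetical helper, and it assumes the prefix rule described above for 'z' applies in the same way to other lang_code values.

```python
def pick_voice(voice: str, lang_code: str = "z") -> str:
    """Return `voice` unchanged, or raise if it does not start with lang_code."""
    if not voice.startswith(lang_code):
        raise ValueError(
            f"voice {voice!r} does not start with lang_code {lang_code!r}"
        )
    return voice
```

For example, model.speech(input_string, voice=pick_voice("zf_xiaoyi")) passes the check, while a voice name with a different first letter raises before any request is sent.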
