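Audio#
Learn how to use Xinference to turn audio into text or text into audio.
Introduction#
The Audio API provides three ways to interact with audio:
The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.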
| API endpoint | OpenAI-compatible endpoint |
|---|---|
| Transcription API | /v1/audio/transcriptions |
| Translation API | /v1/audio/translations |
| Speech API | /v1/audio/speech |
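Supported models#
The following models in Xinference support the Audio API:
Speech to text#
For Mac M-series chips only:
Text to speech (TTS)#
Models that support zero-shot synthesis (no reference audio required)
MeloTTS series
Models that support voice cloning (reference audio required)
Models that support emotion control
For Mac M-series chips only:
Quickstart#
Transcription#
The Transcription API mimics OpenAI's create transcriptions API. You can try it with cURL, the OpenAI client, or Xinference's Python client: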
cURL:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
OpenAI Python client:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
Xinference Python client:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())
Example response:
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
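To transcribe several files in one go, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of Xinference: `transcribe_files` is a hypothetical helper name, and it assumes that `model` is the handle returned by `client.get_model(...)` and that `transcriptions(audio=...)` returns a dict shaped like the example response, with a `text` key.

```python
from pathlib import Path
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of file name to transcribed text.

    `model` is any object exposing transcriptions(audio=bytes) -> dict,
    for example the handle returned by Client.get_model (assumed here).
    """
    results = {}
    for path in paths:
        audio = Path(path).read_bytes()
        response = model.transcriptions(audio=audio)
        # Fall back to an empty string if the response has no "text" key.
        results[Path(path).name] = response.get("text", "")
    return results
```

Typical use would be `transcribe_files(model, ["a.mp3", "b.mp3"])`; results are keyed by file name, so files with the same name in different directories would overwrite each other.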
Translation#
The Translation API mimics OpenAI's create translations API. You can try it with cURL, the OpenAI client, or Xinference's Python client:
cURL:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
OpenAI Python client:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
Xinference Python client:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())
Example response:
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
Speech#
The Speech API mimics OpenAI's create speech API. You can try it with cURL, the OpenAI client, or Xinference's Python client:
The Speech API is non-streaming by default.
ChatTTS streaming output is lower in quality than its non-streaming output; see 2noise/ChatTTS#564.
Streaming requires ffmpeg<7: https://pytorch.org/audio/stable/installation.html#optional-dependencies
cURL:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'
OpenAI Python client:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
)
Xinference Python client:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.speech(
    input="<The text to generate audio for>",
    voice="echo",
    stream=True,
)
The output will be an audio binary.
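Since the output is raw audio, the usual next step is to write it to a file. The sketch below handles both modes under one assumption taken from the examples on this page: a non-streaming call returns `bytes`, while a call with `stream=True` yields byte chunks. `save_speech` is a hypothetical helper name, not an Xinference API.

```python
from typing import Iterable, Union


def save_speech(result: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write speech output to `path` and return the number of bytes written."""
    written = 0
    with open(path, "wb") as f:
        if isinstance(result, (bytes, bytearray)):
            # Non-streaming call: the whole audio arrives at once.
            written += f.write(result)
        else:
            # Streaming call: write each chunk as it arrives.
            for chunk in result:
                written += f.write(chunk)
    return written
```

For example, `save_speech(model.speech(input="...", voice="echo"), "out.mp3")` would cover the non-streaming case, and passing the generator from a `stream=True` call would cover the streaming one.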
ChatTTS usage#
For basic usage, see the Speech section above.
Fixed voice. You can use the fixed voices provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv; taking the seed_2155 voice as an example, use the value in its emb_data column.
import pandas as pd

df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]
Use the fixed seed_2155 voice to generate speech.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>"
)
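If pandas is not installed, the same lookup can be sketched with the standard library's `csv` module. This assumes the CSV has the `seed_id` and `emb_data` columns used above; `load_emb_data` is a hypothetical helper name.

```python
import csv


def load_emb_data(csv_path: str, seed_id: str) -> str:
    """Return the emb_data value of the first row whose seed_id matches."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["seed_id"] == seed_id:
                return row["emb_data"]
    raise KeyError(f"{seed_id} not found in {csv_path}")
```

The returned string can then be passed as `voice=` in the same way as `emb_data_2155`.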
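CosyVoice usage#
CosyVoice comes in two versions: CosyVoice 1.0 and CosyVoice 2.0. CosyVoice 1.0 has three models:
CosyVoice-300M-SFT: Choose this model if you only want to convert text to speech. It provides several pretrained voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
CosyVoice-300M: Choose this model if you want to clone a voice or convert text into speech in another language. With this model you must provide prompt_speech as a WAV audio file; use a 16,000 Hz sample rate for better performance.
CosyVoice-300M-Instruct: Choose this model if you want fine-grained control over tone and timbre.
Basic usage: launch the CosyVoice-300M-SFT model.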
cURL:
# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'
OpenAI Python client:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')
Xinference Python client:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)
Voice cloning: launch the CosyVoice-300M model.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

zero_shot_prompt_text = ("<the words in the text exactly match "
                         "the audio file of the zero-shot prompt>")
# Placeholder: path to your own zero-shot prompt recording.
zero_shot_prompt_file = "<path to the zero-shot prompt WAV file>"

# The words said in the audio file should be identical
# to zero_shot_prompt_text.
#
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)
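Because prompt audio at other sample rates gets resampled to 16,000 Hz, it can be worth checking a file before sending it. The sketch below uses only the standard library's `wave` module; `wav_sample_rate` and `needs_resampling` are hypothetical helper names, and `wave` reads PCM WAV files only, so other encodings raise `wave.Error`.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate in Hz of a PCM WAV file.

    Raises wave.Error if the file is not a WAV file that the
    standard-library reader understands.
    """
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path: str, target_hz: int = 16000) -> bool:
    """Return True if the file's sample rate differs from target_hz."""
    return wav_sample_rate(path) != target_hz
```

A prompt for which `needs_resampling(...)` returns `True` should still work according to the comments above, just with extra processing time.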
Cross-lingual usage: launch the CosyVoice-300M model.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Placeholder: path to your own cross-lingual prompt recording.
cross_lingual_prompt_file = "<path to the cross-lingual prompt WAV file>"

# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)
Instruction-based speech synthesis: launch the CosyVoice-300M-Instruct model.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
                  "Fights with fervor for justice, but struggles with impulsiveness.",
)
CosyVoice 2.0 has only one model, which covers all the capabilities of the three CosyVoice 1.0 models. It is used in the same way as CosyVoice 1.0.
CosyVoice 2.0 streaming usage: launch the CosyVoice2-0.5B model.
# Launch model
import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
client = Client(endpoint)
model_name = "CosyVoice2-0.5B"
model_uid = client.launch_model(
    model_name=model_name,
    model_type="audio",
    download_hub="modelscope",
)
model = client.get_model(model_uid)
input_string = "你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?"

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)
For more instructions and examples, see https://fun-audio-llm.github.io/.
FishSpeech usage#
For basic usage, see the Speech section above.
Voice cloning: launch the FishSpeech-1.5 model. To give the FishSpeech model a reference audio, use prompt_speech instead of reference_audio, and prompt_text instead of reference_text. These parameters match the ones used for CosyVoice voice cloning.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Placeholder: path to your own reference recording.
reference_audio_file = "<path to the reference audio file>"

# The reference audio file is the voice file;
# the words said in the file should be identical to reference_text.
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = ""  # text in the audio

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)
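Paraformer usage#
| Model | Voice activity detection (vad) | Punctuation restoration (punc) | Timestamps | Speaker | Hotword |
|---|---|---|---|---|---|
|  | Yes | Yes | No | No | No |
| paraformer-zh-hotword | Yes | Yes | No | No | Yes |
| paraformer-zh-spk | Yes | Yes | Yes | Yes | No |
| paraformer-zh-long | Yes | Yes | Yes | Yes | No |
| seaco-paraformer-zh (recommended) | Yes | Yes | Yes | Yes | Yes |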
Using VAD and punctuation
All Paraformer models support VAD and punctuation.
Using timestamps and speaker identification
Only the following models support timestamps and speaker identification:
paraformer-zh-spk
paraformer-zh-long
seaco-paraformer-zh
Of these, only paraformer-zh-spk has speaker identification enabled by default.
If you use paraformer-zh-long or seaco-paraformer-zh and need speaker identification:
In the Web UI: add a parameter named spk_model with the value cam++.
On the command line: add the option --spk_model cam++.
Example:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
model.transcriptions(audio, response_format="verbose_json")
Using hotwords
Only the following models support hotwords:
paraformer-zh-hotword
seaco-paraformer-zh
Example:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
model.transcriptions(audio, hotword="小艾 魔搭")
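The support lists in this section can be captured in a small lookup to pick a model by the features you need. This is an illustrative sketch only: the mapping is derived from the lists above (every Paraformer model supports VAD and punctuation), the feature keys are labels chosen here, and `models_supporting` is a hypothetical helper, not part of Xinference.

```python
# Feature support for the Paraformer models named in this section.
PARAFORMER_FEATURES = {
    "paraformer-zh-hotword": {"vad", "punc", "hotword"},
    "paraformer-zh-spk": {"vad", "punc", "timestamp", "speaker"},
    "paraformer-zh-long": {"vad", "punc", "timestamp", "speaker"},
    "seaco-paraformer-zh": {"vad", "punc", "timestamp", "speaker", "hotword"},
}


def models_supporting(*features: str) -> list:
    """Return the models that support every requested feature."""
    needed = set(features)
    return [name for name, have in PARAFORMER_FEATURES.items() if needed <= have]
```

For instance, asking for both `"hotword"` and `"timestamp"` leaves only seaco-paraformer-zh, which is consistent with it being the recommended model.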
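SenseVoiceSmall offline usage#
SenseVoiceSmall now uses a small VAD model, fsmn-vad, so it needs network access to download it.
In an offline environment, you can download this VAD model in advance.
Download it from Hugging Face or ModelScope. Assume it is saved to /path/to/fsmn-vad.
Then, when launching SenseVoiceSmall from the Web UI, add an extra option with key vad_model and the download path /path/to/fsmn-vad as its value. When launching from the command line, add the option --vad_model /path/to/fsmn-vad.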
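Kokoro usage#
The Kokoro model supports multiple languages, with English as the default. To use a non-default language such as Chinese, install the extra dependency and pass the corresponding parameter when launching the model.
pip install misaki[zh]
Initialize the model with lang_code='z'; see the kokoro source code for all supported lang_code values. If you launch the model from the Web UI, add an extra parameter with key lang_code and value z. If you launch the model with the Xinference client, pass the parameter as follows:
model_uid = client.launch_model(
    model_name="Kokoro-82M",
    model_type="audio",
    compile=False,
    download_hub="huggingface",
    lang_code="z",
)
At inference time, use a voice whose name starts with 'z', for example zf_xiaoyi. The currently supported voices are listed at https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices. Usage:
input_string = "重新启动即可更新"
response = model.speech(input_string, voice="zf_xiaoyi")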
IndexTTS2 usage#
The IndexTTS2 model supports emotion control, which you enable through a few extra parameters. IndexTTS2 can be used in the following ways:
Single reference audio (voice cloning):
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Translate for me, what is a surprise!",
    prompt_speech=test_prompt_speech,
)
With an emotion reference audio:
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()
response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech
)
When an emotion reference audio is given, you can optionally set the emo_alpha parameter to adjust how strongly it affects the output. The valid range is 0.0 - 1.0, and the default is 1.0 (100%).
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()
response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech,
    emo_alpha=0.9
)
You can omit the emotion reference audio and instead provide a list of 8 floats giving the intensity of each emotion, in this order: [happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]. You can also use the use_random parameter to introduce randomness into the emotion during inference; the default is False, and setting it to True enables randomness.
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Wow, I'm so lucky!",
    prompt_speech=test_prompt_speech,
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],
    use_random=False
)
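Because the eight positions are easy to mix up, a small helper can build the list from named intensities. This is a sketch under stated assumptions: the English emotion names are labels chosen here to follow the order given above, and `build_emo_vector` is a hypothetical helper, not an IndexTTS2 or Xinference function.

```python
# Position of each emotion in the 8-float emo_vector, in the documented order.
EMOTION_ORDER = (
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
)


def build_emo_vector(**intensities: float) -> list:
    """Return an 8-float list; emotions that are not named default to 0.0."""
    unknown = set(intensities) - set(EMOTION_ORDER)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [float(intensities.get(name, 0.0)) for name in EMOTION_ORDER]
```

With this helper, `build_emo_vector(surprised=0.45)` produces the same vector as the literal list in the example above, and a misspelled emotion name fails fast instead of silently shifting positions.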
Alternatively, you can enable use_emo_text to guide the emotion from the text script you provide. The script is automatically converted into an emotion vector. In text emotion mode, an emo_alpha of around 0.6 (or lower) is recommended for more natural-sounding speech. You can introduce randomness with use_random (default: False; True enables randomness):
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    use_random=False
)
You can also supply a specific emotion description directly through the emo_text parameter. The emotion text is automatically converted into an emotion vector, which lets you control the text script and the emotion description separately:
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    emo_text="You scared the hell out of me! Are you a ghost?",
    use_random=False
)
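IndexTTS2 offline usage#
IndexTTS2 needs several small models, which are downloaded automatically during initialization. In an offline environment, you can download these models into a single directory and point IndexTTS2 at that directory.
Simple setup
The easiest way to prepare for offline use is to download all the small models in advance with the hf download command: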
# Create your local models directory
mkdir -p /path/to/small_models
# Download models from Hugging Face
hf download facebook/w2v-bert-2.0 --local-dir /path/to/small_models/w2v-bert-2.0
hf download funasr/campplus --local-dir /path/to/small_models/campplus
hf download nvidia/bigvgan_v2_22khz_80band_256x --local-dir /path/to/small_models/bigvgan
hf download amphion/MaskGCT --local-dir /path/to/small_models/MaskGCT
The final directory structure should look like this:
/path/to/small_models/
├── w2v-bert-2.0/ # Feature extraction model
├── campplus/ # Speaker recognition model
├── bigvgan/ # Vocoder model
└── MaskGCT/ # Semantic codec model
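Supported models
The small models are mapped automatically as follows:
w2v-bert-2.0 (models--facebook--w2v-bert-2.0) - feature extraction model
campplus (models--funasr--campplus) - speaker recognition model
bigvgan (models--nvidia--bigvgan_v2_22khz_80band_256x) - vocoder model
semantic codec (models--amphion--MaskGCT) - semantic encoding/decoding model
Launching IndexTTS2 in offline mode
When launching IndexTTS2 from the Web UI, add an extra parameter: small_models_dir, the path to the directory that contains all the small models.
When launching from the command line, add the following option: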
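Before launching offline, it can help to confirm that the directory has the expected layout. The sketch below checks only that the four subdirectories from the tree above exist; it does not verify that each download is complete. `missing_small_models` is a hypothetical helper name, not part of Xinference.

```python
from pathlib import Path

# Subdirectory names taken from the directory structure shown above.
EXPECTED_SUBDIRS = ("w2v-bert-2.0", "campplus", "bigvgan", "MaskGCT")


def missing_small_models(small_models_dir: str) -> list:
    """Return the expected subdirectories that are absent from small_models_dir."""
    root = Path(small_models_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list from `missing_small_models("/path/to/small_models")` means all four directories are present.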
xinference launch --model-name IndexTTS2 --model-type audio \
--small_models_dir /path/to/small_models
When launching with the Python client:
model_uid = client.launch_model(
    model_name="IndexTTS2",
    model_type="audio",
    small_models_dir="/path/to/small_models"
)