Audio (Experimental)#
Learn how to turn audio into text or text into audio with Xinference.
Introduction#
The Audio API provides three methods for interacting with audio:
The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.
API | OpenAI-compatible ENDPOINT
---|---
Transcription API | /v1/audio/transcriptions
Translation API | /v1/audio/translations
Speech API | /v1/audio/speech
Supported models#
The Audio API is supported by the following models in Xinference:
Audio to text#
whisper-tiny
whisper-tiny.en
whisper-base
whisper-base.en
whisper-medium
whisper-medium.en
whisper-large-v3
whisper-large-v3-turbo
Belle-distilwhisper-large-v2-zh
Belle-whisper-large-v2-zh
Belle-whisper-large-v3-zh
SenseVoiceSmall
Text to audio#
ChatTTS
CosyVoice
FishSpeech-1.4
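Before trying the endpoints below, an audio model must be launched; its UID is the <MODEL_UID> used throughout this page. A minimal sketch with Xinference's Python client (whisper-large-v3 is just one of the model names listed above):
from xinference.client import Client

# Connect to a running Xinference server.
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

# Launch an audio model; the returned UID identifies the model
# in all of the API calls below.
model_uid = client.launch_model(model_name="whisper-large-v3", model_type="audio")
print(model_uid)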
Quickstart#
Transcription#
The Transcription API mimics OpenAI’s create transcriptions API. We can try the Transcription API out via cURL, the OpenAI Client, or Xinference’s Python client. The endpoint expects a multipart form upload, matching OpenAI’s API:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
import openai
client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model=<MODEL_UID>,
        file=audio_file,
    )
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
model.transcriptions(audio=audio_file.read())
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
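With the OpenAI client, the transcript is returned as a Transcription object rather than raw JSON; its text attribute holds the same string:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
# Prints the transcribed text shown in the JSON above.
print(transcript.text)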
Translation#
The Translation API mimics OpenAI’s create translations API. We can try the Translation API out via cURL, the OpenAI Client, or Xinference’s Python client. The endpoint likewise expects a multipart form upload:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
import openai
client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model=<MODEL_UID>,
        file=audio_file,
    )
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
model.translations(audio=audio_file.read())
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
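With Xinference's Python client, the same JSON arrives as a plain Python dict, so the translated text is read from its "text" key:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    result = model.translations(audio=audio_file.read())
# Prints the translated text shown in the JSON above.
print(result["text"])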
Speech#
The Speech API mimics OpenAI’s create speech API. We can try the Speech API out via cURL, the OpenAI Client, or Xinference’s Python client.
The Speech API uses non-streaming mode by default, because:
The streaming output of ChatTTS is not as good as the non-streaming output; see 2noise/ChatTTS#564.
Streaming requires ffmpeg<7; see https://pytorch.org/audio/stable/installation.html#optional-dependencies.
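If you do want streaming output, the stream flag can be forwarded from the OpenAI client through extra_body. A hedged sketch: with_streaming_response is standard openai-python, while passing the Xinference-specific flag via extra_body is an assumption based on the cURL body below:
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
# Stream the generated audio to disk chunk by chunk instead of
# buffering the entire file in memory.
with client.audio.speech.with_streaming_response.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
    extra_body={"stream": True},  # assumption: mirrors the "stream" field in the cURL body
) as response:
    response.stream_to_file("speech.mp3")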
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'
import openai
client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model=<MODEL_UID>,
    input=<The text to generate audio for>,
    voice="echo",
)
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.speech(
    input=<The text to generate audio for>,
    voice="echo",
    stream=True,
)
The output will be an audio binary.
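In the default non-streaming mode, the returned bytes can be written straight to a file, mirroring the CosyVoice examples further down:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    voice="echo",
)
# Persist the raw audio so it can be played back.
with open("speech.mp3", "wb") as f:
    f.write(speech_bytes)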
ChatTTS Usage#
Basic usage: refer to the audio speech usage above.
Fixed tone color: we can use the fixed tone colors provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv and, taking seed_2155 as an example, extract its emb_data.
import pandas as pd

# Load the speaker evaluation sheet and pick the embedding for seed_2155.
df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]
Use the fixed tone color of seed_2155 to generate speech.
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input=<The text to generate audio for>
)
CosyVoice Usage#
Basic usage: launch the model CosyVoice-300M-SFT. The available voices are '中文女', '中文男', '日语男', '粤语女', '英文女', '英文男' and '韩语女' (Chinese female, Chinese male, Japanese male, Cantonese female, English female, English male, Korean female).
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'
import openai
client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model=<MODEL_UID>,
    input=<The text to generate audio for>,
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input=<The text to generate audio for>,
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)
Voice cloning: launch the model CosyVoice-300M.
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
zero_shot_prompt_text = ""
# The zero shot prompt file is the voice file
# the words said in the file should be identical to zero_shot_prompt_text
with open(zero_shot_prompt_file, "rb") as f:
zero_shot_prompt = f.read()
speech_bytes = model.speech(
"<The text to generate audio for>",
prompt_text=zero_shot_prompt_text,
prompt_speech=zero_shot_prompt,
)
Cross-lingual usage: launch the model CosyVoice-300M.
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
# A reference audio file spoken in some language.
cross_lingual_prompt_file = "<path to the reference audio file>"
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()
speech_bytes = model.speech(
    "<The text to generate audio for>",  # the text may be in another language
    prompt_speech=cross_lingual_prompt,
)
Instruction-based usage: launch the model CosyVoice-300M-Instruct.
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
response = model.speech(
    # "Facing challenges, he showed extraordinary courage and wisdom."
    "在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
    "Fights with fervor for justice, but struggles with impulsiveness.",
)
More instructions and examples can be found at https://fun-audio-llm.github.io/.
FishSpeech Usage#
Basic usage: refer to the audio speech usage above.
Voice cloning: launch the model FishSpeech-1.4. Please use prompt_speech instead of reference_audio to provide the reference audio to the FishSpeech model.
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
reference_text = ""
# The reference audio file is the voice file
# the words said in the file should be identical to reference_text
with open(reference_audio_file, "rb") as f:
reference_audio = f.read()
speech_bytes = model.speech(
"<The text to generate audio for>",
reference_text=reference_text,
prompt_speech=reference_audio,
enable_reference_audio=True,
)