OpenAI
OpenAI's voices, accessible via Azure, offer an enhanced audio experience. OpenAI TTS is provided through Azure in your geography to keep usage compliant. There are existing constraints, most notably the lack of Speech Synthesis Markup Language (SSML) support. The sections below describe how to use OpenAI's TTS features effectively within these limits.
To activate OpenAI as your Text-to-Speech (TTS) provider, please get in touch with your Customer Success Manager.
Integrating OpenAI TTS
To integrate OpenAI's TTS capabilities:
From the available options, choose an OpenAI voice that suits your needs.
Date and Time Formatting
To get the most natural and accurate voice output from OpenAI TTS, format dates and times so that the system can recognize and articulate them correctly. Below are recommended formats and examples to avoid. Each example in this guide lists the input text, the resulting spoken output, and a short comment.
Dates
English - Effective
I can confirm the booking for May 25th 2024.
I can confirm the booking for May 25th 2024.
The month is written out and the day carries its ordinal suffix.
English - Ineffective
I can confirm the booking for the 02.02.2024.
I can confirm the booking for the zo 2, oh 2, 2024.
Date not recognized.
German - Effective
Ich bestätige die Buchung für den 15ten Februar 2024.
Ich bestätige die Buchung für den 15ten Februar 2024.
Correct recognition and pronunciation of ordinal numbers and months in date format.
German - Ineffective
Ich bestätige die Buchung für den 15.2.2024.
Ich bestätige die Buchung für den 15, zwei, 24.
Date is not recognized and not pronounced correctly.
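The date formats above can be produced in code before the text is sent to the TTS provider. The following is a minimal sketch, not part of any product API: the function names `spoken_date` and `ordinal_suffix_en` are hypothetical, and the month lists are hard-coded so the result does not depend on the system locale. Only the "15ten" form is confirmed by the examples above; the "sten" suffix for days 20 and higher follows standard German ordinals and is an assumption that should be tested with your voice.

```python
from datetime import date

EN_MONTHS = ["January", "February", "March", "April", "May", "June", "July",
             "August", "September", "October", "November", "December"]
DE_MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
             "August", "September", "Oktober", "November", "Dezember"]


def ordinal_suffix_en(day: int) -> str:
    """Return the English ordinal suffix for a day of the month."""
    if 11 <= day % 100 <= 13:  # 11th, 12th, 13th are irregular
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def spoken_date(d: date, lang: str = "en") -> str:
    """Format a date with a written-out month and an ordinal day."""
    if lang == "en":
        return f"{EN_MONTHS[d.month - 1]} {d.day}{ordinal_suffix_en(d.day)} {d.year}"
    if lang == "de":
        # Assumption: "ten" below 20 (as in "15ten"), "sten" from 20 upwards.
        suffix = "ten" if d.day < 20 else "sten"
        return f"{d.day}{suffix} {DE_MONTHS[d.month - 1]} {d.year}"
    raise ValueError(f"unsupported language: {lang}")


print(spoken_date(date(2024, 5, 25)))         # May 25th 2024
print(spoken_date(date(2024, 2, 15), "de"))   # 15ten Februar 2024
```

Building the string from a real date object, rather than rewriting "02.02.2024" with string replacements, avoids ambiguity between day-first and month-first notations.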
Times
Properly formatting times is equally essential to ensure that OpenAI TTS can interpret and vocalize them accurately. Here are the effective formats alongside examples of what to avoid:
English - Effective
The bus arrives at 11:15 AM.
The bus arrives at 11:15 AM.
Time is correctly recognized and vocalized in the 12-hour format, which is standard in English.
English - Ineffective
The bus arrives at 17:00.
The bus arrives at 17.
The 24-hour format is misinterpreted and the minutes are dropped.
German - Effective
Der Flug geht um 17 Uhr.
Der Flug geht um 17 Uhr.
Time is correctly recognized and vocalized in the 24-hour format, which is standard in German.
German - Ineffective
Der Flug geht um 17:00 Uhr.
Der Flug geht um 17
"Uhr" will be ignored.
Prices and Currencies
Ensuring that prices and currencies are expressed in a format that OpenAI TTS can accurately interpret is crucial for clear communication. The following table outlines the recommended practices for formatting prices and currencies, as well as common pitfalls to avoid:
Price Formatting
English - Effective
It costs thirteen Euro and forty-five cents.
It costs 13 Euro and 45 cents.
Price is correctly recognized and vocalized.
English - Ineffective
It costs 13.45€.
It costs 13.45.
The € sign following the price is not vocalized.
German - Effective
Es kostet dreizehn Euro und fünfundvierzig Cent.
Es kostet 13 Euro und 45 Cent.
Price is correctly recognized and vocalized in German.
German - Ineffective
Es kostet 13,45€.
Es kostet deiteenand for firth eurs.
OpenAI TTS currently cannot handle a € sign that follows the price.
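Prices can be written out in words programmatically, so that no currency symbol reaches the TTS input. The sketch below is illustrative only: `spoken_price`, `words_en`, and `words_de` are hypothetical helpers, the range is deliberately limited to 0 to 99 Euro to keep the example short, and the singular forms ("one cent", "ein Euro") are assumptions that go beyond the examples above.

```python
from decimal import Decimal

EN_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
           "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
           "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
EN_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
           "eighty", "ninety"]
DE_ONES = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
           "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
           "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
DE_TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig",
           "siebzig", "achtzig", "neunzig"]


def words_en(n: int) -> str:
    """Spell an integer from 0 to 99 in English."""
    if n < 20:
        return EN_ONES[n]
    tens, ones = divmod(n, 10)
    return EN_TENS[tens] if ones == 0 else f"{EN_TENS[tens]}-{EN_ONES[ones]}"


def words_de(n: int) -> str:
    """Spell an integer from 0 to 99 in German (units before tens)."""
    if n < 20:
        return DE_ONES[n]
    tens, ones = divmod(n, 10)
    if ones == 0:
        return DE_TENS[tens]
    unit = "ein" if ones == 1 else DE_ONES[ones]   # "einundzwanzig", not "einsund..."
    return f"{unit}und{DE_TENS[tens]}"


def spoken_price(amount: Decimal, lang: str = "en") -> str:
    """Write a Euro amount in words, without the € sign."""
    total_cents = int((amount * 100).to_integral_value())
    euros, cents = divmod(total_cents, 100)
    if not 0 <= euros <= 99:
        raise ValueError("this sketch only covers amounts from 0 to 99 Euro")
    if lang == "en":
        text = f"{words_en(euros)} Euro"
        if cents:
            text += f" and {words_en(cents)} {'cent' if cents == 1 else 'cents'}"
        return text
    if lang == "de":
        text = f"{'ein' if euros == 1 else words_de(euros)} Euro"
        if cents:
            text += f" und {'ein' if cents == 1 else words_de(cents)} Cent"
        return text
    raise ValueError(f"unsupported language: {lang}")


print(spoken_price(Decimal("13.45")))        # thirteen Euro and forty-five cents
print(spoken_price(Decimal("13.45"), "de"))  # dreizehn Euro und fünfundvierzig Cent
```

The amount is passed as a `Decimal` rather than a float so that values such as 13.45 split into Euro and cents without rounding surprises.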
Numbers and Alphanumerics
For numbers and alphanumeric sequences, transforming them into a format that OpenAI TTS processes without errors ensures accurate and complete voice output. Below are effective inputs alongside formats that may result in less accurate articulation:
Effective
The confirmation number is one two three four five six seven eight.
The confirmation number is 12345678.
Sequence of numbers is articulated clearly and accurately as individual digits.
Ineffective
The confirmation number is 12345678.
The confirmation number is 12567.
Individual digits are dropped when the number is written purely in digits.
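Spelling out each digit, as in the effective example above, is easy to automate. The following minimal sketch uses a hypothetical helper, `spell_out`, that replaces every digit with its English word and separates all characters with spaces. Passing letters through unchanged, one at a time, is an assumption for alphanumeric codes and is not covered by the examples above.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_out(code: str) -> str:
    """Replace each digit with its word; keep other characters as single tokens."""
    return " ".join(DIGIT_WORDS.get(ch, ch) for ch in code if not ch.isspace())


print("The confirmation number is " + spell_out("12345678") + ".")
# The confirmation number is one two three four five six seven eight.
```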
Enhancing Intonation
Emotional Tone
OpenAI's voices may sound monotonous in certain contexts. Adding emotive language and enthusiastic expressions, such as 'Great!', 'Fantastic!', 'Klasse!', and 'Super!', can make the speech output more engaging and give the bot a livelier, more enthusiastic tone.
Pronunciation and Pauses
Although SSML tags for structured pronunciations aren't supported, experimenting with punctuation or separators such as "--" may offer a workaround for inserting pauses. The effectiveness of these techniques varies, so test them in your specific use case.
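If existing response texts already contain SSML, one possible approach is to strip the tags and join the segments with the "--" separator mentioned above. This is a rough sketch under stated assumptions: `strip_ssml` and `join_with_pauses` are hypothetical helpers, the regular expression is a simple tag matcher rather than a full SSML parser, and whether "--" actually produces an audible pause has to be checked by listening to the output.

```python
import re

# Matches opening, closing, and self-closing tags such as <speak> or <break time="500ms"/>.
SSML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_ssml(text: str) -> str:
    """Remove SSML-style tags and collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", SSML_TAG.sub(" ", text)).strip()


def join_with_pauses(segments: list[str], separator: str = "--") -> str:
    """Join text segments with a separator intended to suggest a pause."""
    cleaned = [strip_ssml(s) for s in segments]
    return f" {separator} ".join(s for s in cleaned if s)


print(join_with_pauses(["Your booking is confirmed.", "Is there anything else?"]))
# Your booking is confirmed. -- Is there anything else?
```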