
feat: KokoroTTS text-to-speech-engine option (decoupled from OpenAI TTS) #16036

Open
wants to merge 7 commits into
base: dev

Conversation

silentoplayz
Collaborator

@silentoplayz silentoplayz commented Jul 26, 2025

Pull Request Checklist

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests to validate the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • Adds a new, decoupled KokoroTTS text-to-speech engine option to Open WebUI, so users can point Open WebUI's text-to-speech at a KokoroTTS endpoint independently of the OpenAI TTS engine. Configuration is driven by the new AUDIO_TTS_KOKORO_API_BASE_URL environment variable (or the matching UI setting), and KokoroTTS model/voice discovery and speech generation are integrated throughout the audio pipeline.

Added

  • New KokoroTTS TTS engine (kokoro) selectable in the UI alongside openai, azure, and elevenlabs.
  • Support for fetching KokoroTTS models and voices via /v1/models and /v1/audio/voices endpoints, respectively.
  • Dedicated KokoroTTS speech generation path that POSTs to TTS_KOKORO_API_BASE_URL/v1/audio/speech.
  • Environment variable AUDIO_TTS_KOKORO_API_BASE_URL and corresponding persistent config entry.
  • Cache keys now include TTS_KOKORO_API_BASE_URL to prevent cache collisions when switching engines.
  • UI fields for TTS_KOKORO_API_BASE_URL and an optional TTS_API_KEY for KokoroTTS.
  • Option in the UI to input custom KokoroTTS voice combinations (e.g., af_bella+af_sky or af_bella(2)+af_sky(1)).
  • UI toggle to enable/disable text normalization for KokoroTTS, driven by KOKORO_NORMALIZATION_OPTIONS.normalize in the config.
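
To illustrate the cache-key item above, here is a minimal sketch of how a key could fold the Kokoro base URL in with the engine and model. The function name `speech_cache_key` and the exact hashing scheme are assumptions for illustration, not the code in this PR.

```python
import hashlib
import json


def speech_cache_key(body: dict, engine: str, model: str, base_url: str) -> str:
    """Return a stable hex digest identifying one speech request (hypothetical helper)."""
    # Serialize deterministically so identical requests map to the same key.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    hasher = hashlib.sha256()
    # Terminate each part with a NUL byte so adjacent parts cannot run together.
    for part in (canonical, engine, model, base_url):
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")
    return hasher.hexdigest()
```

With a key like this, the same text and voice sent to two different KokoroTTS servers, or to a different engine, would land in different cache entries.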

Changed

  • Updated backend/open_webui/config.py to register AUDIO_TTS_KOKORO_API_BASE_URL.
  • Updated backend/open_webui/main.py to expose TTS_KOKORO_API_BASE_URL to the application state.
  • Extended backend/open_webui/routers/audio.py:
    • Added Kokoro-related fields to TTSConfigForm, get_audio_config, and update_audio_config, including KOKORO_NORMALIZATION_OPTIONS.
    • Added Kokoro branch in /speech endpoint for KokoroTTS requests.
    • Added Kokoro branches in get_available_models and get_voices for dynamic discovery.
  • Audio.svelte (frontend UI):
    • Implemented dynamic fetching and display of KokoroTTS models and voices based on the TTS_ENGINE selection.
    • Introduced a "Custom Combination..." option for KokoroTTS voices, allowing users to input complex voice strings, with client-side validation for empty custom inputs.
    • Added a toggle for "Enable Text Normalization" for KokoroTTS, reflecting the KOKORO_NORMALIZATION_OPTIONS.normalize setting.
    • Modified the voice and model selection logic to clear previously selected values when switching TTS engines, and to set default OpenAI values when selecting the OpenAI engine.
    • Updated voice and model type definitions for better clarity and consistency across different TTS engines.
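
The dynamic discovery described above could be sketched roughly as follows. The response shapes are assumptions: an OpenAI-style `{"data": [{"id": ...}]}` body for /v1/models and a `{"voices": [...]}` body for /v1/audio/voices. The helper names `parse_models` and `parse_voices` are hypothetical and do not come from this PR.

```python
from typing import Any


def parse_models(payload: Any) -> list:
    """Turn an assumed OpenAI-style /v1/models response into [{"id": ...}] entries."""
    if not isinstance(payload, dict):
        return []
    items = payload.get("data", [])
    if not isinstance(items, list):
        return []
    return [{"id": m["id"]} for m in items if isinstance(m, dict) and "id" in m]


def parse_voices(payload: Any) -> list:
    """Turn an assumed /v1/audio/voices response into [{"id": ..., "name": ...}] entries."""
    if not isinstance(payload, dict):
        return []
    voices = payload.get("voices", [])
    if not isinstance(voices, list):
        return []
    # Keep only plain string voice names; anything else is skipped.
    return [{"id": v, "name": v} for v in voices if isinstance(v, str)]
```

Returning an empty list for malformed or missing payloads lets the UI fall back to an empty drop-down instead of failing when the KokoroTTS server is unreachable.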

Fixed

  • Made the API key for KokoroTTS genuinely optional. Previously, when KokoroTTS was reached through the OpenAI TTS engine, an API key was required.
  • Ensured that the "TTS Model" and "TTS Voice" fields are now marked as required for all TTS engines that utilize them, including KokoroTTS, to prevent users from leaving these essential settings blank.
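
One way to treat the API key as optional is to attach an Authorization header only when a key is actually set. This is a sketch under assumptions: the `kokoro_headers` name is hypothetical, and the Bearer scheme is assumed because the endpoint is OpenAI-compatible.

```python
from typing import Optional


def kokoro_headers(api_key: Optional[str] = None) -> dict:
    """Build request headers, adding Authorization only if a key is provided."""
    headers = {"Content-Type": "application/json"}
    # Treat None, "" and whitespace-only values as "no key configured".
    if api_key and api_key.strip():
        headers["Authorization"] = f"Bearer {api_key.strip()}"
    return headers
```

A blank key field in the UI would then produce a request with no Authorization header at all, rather than an empty `Bearer ` value.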

Hopefully there aren't any breaking changes. All existing TTS engines remain unchanged and KokoroTTS is an optional text-to-speech engine.

Before vs After — KokoroTTS in Open WebUI

BEFORE: Manual Setup – Users had to know and either type out or copy-paste the exact KokoroTTS model name and voice(s) string into the “TTS Voice” and “TTS Model” fields.
AFTER: Automatic Discovery – Open WebUI queries your KokoroTTS server (/v1/models and /v1/audio/voices) endpoints and lists every available model and voice in drop-down menus, no typing required!

BEFORE: Hidden Support – KokoroTTS only worked if you tricked the “OpenAI” engine into pointing at a KokoroTTS endpoint; the UI never mentioned KokoroTTS, so many users assumed it simply wasn’t supported or couldn't figure out setup for it.
AFTER: First-class Option – “KokoroTTS” now appears as its own TTS engine in Settings > Audio > Text-to-Speech Engine. Selecting it instantly tells the UI to use the correct endpoints and parameters (as long as a valid URL is entered in the TTS engine base URL input field), removing all guesswork.

BEFORE: Confusion – Users often asked, “Does Open WebUI support Kokoro?” and struggled to configure voices manually.
AFTER: Clarity – The dedicated Kokoro toggle and auto-populated lists make the answer obvious: Yes, and it’s one click away.

BEFORE: Text-to-Speech Engine: Only "OpenAI" is explicitly available in the dropdown, requiring users to configure KokoroTTS by pointing the OpenAI engine at a KokoroTTS endpoint.
AFTER: Text-to-Speech Engine: "KokoroTTS" is now a dedicated option in the dropdown, making its support explicit and direct.

BEFORE: TTS Voice: Requires manual entry of voice combinations (e.g., af_alloy+af_heart+af_sky+af_bella) into a single text field.
AFTER: TTS Voice: Offers a "Custom Combination..." option with a dropdown, suggesting predefined choices are now available or expected to be. When "Custom Combination..." is selected, a text input field appears below for manual entry of combinations, along with guidance on how to format them (e.g., af_bella(2)+af_sky(1)).

BEFORE: TTS Model: Requires manual entry of the model name (e.g., kokoro) into a text field.
AFTER: TTS Model: Provides a dropdown menu to select the model, implying automatic discovery of available models from the KokoroTTS server.

BEFORE: Missing Feature: "Enable Text Normalization" toggle is not present.
AFTER: New Feature: "Enable Text Normalization" toggle is introduced, with a description: "Disable text normalization if words are missing or timestamps are incorrect in the generated audio." This suggests enhanced control over audio generation.

Additional Information

  • Weighted voice combinations using ratios (e.g., "af_bella(2)+af_heart(1)" for 67%/33% mix)
  • Ratios are automatically normalized to sum to 100%
  • Available through any endpoint by adding weights in parentheses
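
The ratio arithmetic above can be worked through with a small parser. This is only a sketch of the normalization as described, not KokoroTTS's own code: the `voice_mix` name is hypothetical, and treating a term without parentheses as weight 1 is an assumption.

```python
import re

# One term of a combination: a voice name, optionally followed by "(weight)".
_VOICE_TERM = re.compile(r"^([A-Za-z0-9_]+)(?:\((\d+(?:\.\d+)?)\))?$")


def voice_mix(combination: str) -> dict:
    """Return each voice's share of the mix as fractions that sum to 1.0."""
    weights = {}
    for term in combination.split("+"):
        match = _VOICE_TERM.match(term.strip())
        if match is None:
            raise ValueError(f"invalid voice term: {term!r}")
        # Assumption: a missing "(n)" means a weight of 1.
        name, weight = match.group(1), float(match.group(2) or 1)
        weights[name] = weights.get(name, 0.0) + weight
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("voice weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}
```

For "af_bella(2)+af_heart(1)" the weights 2 and 1 are divided by their total of 3, giving roughly 67% and 33%. An empty string raises ValueError, which mirrors the client-side check for empty custom inputs.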

"The API will automatically do text normalization on input text which may incorrectly remove or change some phrases. This can be disabled by adding "normalization_options":{"normalize": false} to your request JSON"

  • I'm unsure if the API key (Optional) logic is working properly for KokoroTTS - UNTESTED!
  • The toggle for the added Enable Text Normalization option is colored gray and still needs styling to be green.
  • AUDIO_TTS_KOKORO_API_KEY likely should be an environment variable added to this PR. Thoughts?
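
For reference, the request body implied by the normalization note quoted earlier might be assembled as in the sketch below. The `build_speech_request` helper is hypothetical; the `model`, `input`, and `voice` fields follow the OpenAI-compatible speech API, and `normalization_options` is taken from the quoted note.

```python
def build_speech_request(base_url, text, model, voice, normalize=True):
    """Return (url, payload) for a KokoroTTS speech request (illustrative only)."""
    # Tolerate a trailing slash on the configured base URL.
    url = base_url.rstrip("/") + "/v1/audio/speech"
    payload = {"model": model, "input": text, "voice": voice}
    # Send the option only when normalization is turned off, as in the quoted note.
    if not normalize:
        payload["normalization_options"] = {"normalize": False}
    return url, payload
```

Leaving the field out when normalization is enabled keeps the request identical to a plain OpenAI-style call and relies on the server default described in the quote.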

Testing is definitely needed for this PR, and any feedback is appreciated. This PR was written with the help of the Gemini 2.5 Flash model! DO NOT JUST BLINDLY MERGE THIS!

BEFORE THIS PR

image

AFTER MODIFICATIONS MADE IN THIS PR

image image image image image

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

@silentoplayz
Collaborator Author

silentoplayz commented Jul 26, 2025

Please, please, PLEASE review this PR before considering merging it. If it is merged, I encourage refactors where necessary to bring the code up to standard!

The reason I could not test this PR thoroughly: Ubuntu kernel-panics when I test TTS models from KokoroTTS unless the test is shorter than 2 seconds; I'm serious. ☹️

@silentoplayz silentoplayz marked this pull request as ready for review July 26, 2025 03:47
@silentoplayz
Collaborator Author

silentoplayz commented Jul 26, 2025

The browser console shows the following errors when the KokoroTTS endpoint is unreachable or not running. I am not sure whether this is a problem that needs solving.

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://host.docker.internal:8880/v1/audio/voices. (Reason: CORS request did not succeed). Status code: (null).

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://host.docker.internal:8880/v1/models. (Reason: CORS request did not succeed). Status code: (null).

@rgaricano
Contributor

I think the voice server connection should be managed the same way as for MCP Tools: server side (https://github.com/open-webui/open-webui/blob/main/src/lib/components/admin/Settings/Connections.svelte) and client side (https://github.com/open-webui/open-webui/blob/main/src/lib/components/chat/Settings/Connections.svelte)
(the client-side connection has to be publicly accessible from the browser)
(I'll try to take a closer look if time allows)

2 participants