TTS on Producthunt daily

Product Hunt Daily | 2025-10-21

Tue, 21 Oct 2025 07:30:35 +0000

1. Fish Audio S1

Tagline: Expressive Voice Cloning and Text-to-Speech
Description: Fish Audio S1 is the most expressive and emotionally rich TTS model—creating lifelike voices that capture emotion, rhythm, and nuance. Clone any voice in 10 seconds, preserving accent, tone, and speaking habits with unmatched realism.

Keyword: Voice cloning, text-to-speech, TTS, expressive, lifelike voices, emotion, rhythm, nuance, voice cloning, accent, tone, realism, Fish Audio S1
VotesCount: 🔺413
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

2. Replymer

Tagline: Human replies that sell your product
Description: Replymer helps your brand grow through authentic, human‑written replies that recommend your product in the right conversations.

Keyword: Human replies, product recommendations, brand growth, authentic replies, social selling, conversation marketing
VotesCount: 🔺379
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

3. Logic, Inc.

Tagline: Automate recurring decisions in plain English
Description: Logic automates recurring decisions and reviews. Write your process once in plain English, and automate it anywhere. From content moderation to invoice processing, Logic lets you deploy in minutes, not months.

Keyword: Automation, Decisions, Plain English, Process Automation, No-Code, Content Moderation, Invoice Processing, Deploy Quickly
VotesCount: 🔺291
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

4. Voice Gecko

Tagline: Voice dictation at your fingertips—type less, say more.
Description: Instant dictation for desktop. Press a shortcut, speak, and instantly get accurate text on your clipboard—perfect for emails, coding, AI prompts, or brain dumps.

Keyword: voice dictation, dictation software, voice to text, speech to text, clipboard, desktop, productivity, typing, shortcut, AI prompts, brain dump, voice input
VotesCount: 🔺237
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

5. Simplora

Tagline: Meetings that make you smarter, not confused
Description: Never feel lost in a meeting again! Simplora turns every conversation into a unique learning experience, in real-time and beyond. Available wherever you meet. No download required. Get started for free.

Keyword: Meetings, learning, real-time, no download, free, smarter, confusion, conversation, Simplora
VotesCount: 🔺188
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

6. diny

Tagline: From git diff to clean commits
Description: diny automates commit messages from your staged changes. Clean, consistent, conventional. Includes a timeline view of past commits to keep your history crystal clear.

Keyword: git commits, commit messages, automation, git diff, clean commits, conventional commits, commit history, timeline view, developer tools
VotesCount: 🔺156
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

7. Pylon

Tagline: The support platform built for B2B
Description: AI-Native support platform built for B2B companies. One tool for your ticketing, chat, knowledge base, AI support, account intelligence, and more.

Keyword: B2B support, AI support, ticketing, chat, knowledge base, account intelligence, support platform, AI-native
VotesCount: 🔺138
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

8. App2.dev

Tagline: Turn ideas & Figma designs into complete web & mobile apps
Description: Turn your ideas & Figma designs into web & mobile apps in minutes with backend, database, and authentication - all powered by AI.

Keyword: App development, Figma to app, web app, mobile app, AI, no-code, backend, database, authentication, rapid development
VotesCount: 🔺114
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

9. Aden AI

Tagline: Turn any file into a chatbot course & get certified with AI
Description: We built the Aden Training Agent - it transforms any file or manual into an interactive AI course for workforce training or certification. Try our Mindfulness Agent that teaches focus under pressure, or upload your own file to create a smart, adaptive course.

Keyword: AI chatbot course, file to course, workforce training, AI certification, adaptive learning, Mindfulness Agent, training agent, smart course
VotesCount: 🔺104
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

10. VibeOnly

Tagline: Helping companies screen and hire AI-fluent employees
Description: Everyone says “AI won’t take your job. People who use it will.” VibeOnly helps you hire those people. It’s a test that shows who really knows how to use AI tools well. Perfect for founders and hiring managers who want elite, AI-fluent talent.

Keyword: AI hiring, AI fluency, employee screening, AI talent, hiring, AI tools, VibeOnly, talent acquisition
VotesCount: 🔺100
Featured: Yes
CreatedAt: 2025-10-20 07:01 AM (UTC)

abogen

Tue, 02 Sep 2025 15:30:10 +0800

denizsafak/abogen

abogen

Abogen is a powerful text-to-speech conversion tool that makes it easy to turn ePub, PDF, or text files into high-quality audio with matching subtitles in seconds. Use it for audiobooks, voiceovers for Instagram, YouTube, TikTok, or any project that needs natural-sounding text-to-speech, using Kokoro-82M.

Demo

https://github.com/user-attachments/assets/094ba3df-7d66-494a-bc31-0e4b41d0b865

This demo was generated in just 5 seconds, producing ∼1 minute of audio with perfectly synced subtitles. To create a similar video, see the demo guide.

`How to install?`

Windows

Go to espeak-ng latest release download and run the *.msi file.

OPTION 1: Install using script

Download the repository
Extract the ZIP file
Run WINDOWS_INSTALL.bat by double-clicking it

This method handles everything automatically - installing all dependencies including CUDA in a self-contained environment without requiring a separate Python installation. (You still need to install espeak-ng.)

[!NOTE] You don’t need to install Python separately. The script will install Python automatically.

OPTION 2: Install using pip

# Create a virtual environment (optional)
mkdir abogen && cd abogen
python -m venv venv
venv\Scripts\activate

# For NVIDIA GPUs:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# For AMD GPUs:
# Not supported yet, because ROCm is not available on Windows. Use Linux if you have an AMD GPU.

# Install abogen
pip install abogen

Mac

# Install espeak-ng
brew install espeak-ng

# Create a virtual environment (recommended)
mkdir abogen && cd abogen
python3 -m venv venv
source venv/bin/activate

# Install abogen
pip3 install abogen

# For Silicon Mac (M1, M2 etc.)
# After installing abogen, we need to install Kokoro's development version which includes MPS support.
pip3 install git+https://github.com/hexgrad/kokoro.git

Linux

# Install espeak-ng
sudo apt install espeak-ng # Ubuntu/Debian
sudo pacman -S espeak-ng # Arch Linux
sudo dnf install espeak-ng # Fedora

# Create a virtual environment (recommended)
mkdir abogen && cd abogen
python3 -m venv venv
source venv/bin/activate

# Install abogen
pip3 install abogen

# For NVIDIA GPUs:
# Already supported, no need to install CUDA separately.

# For AMD GPUs:
# After installing abogen, we need to uninstall the existing torch package
pip3 uninstall torch 
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4

[!TIP] If you get the warning `The script abogen-cli is installed in '/home/username/.local/bin' which is not on PATH`, run the following command to add it to your PATH:

echo "export PATH=\"/home/$USER/.local/bin:\$PATH\"" >> ~/.bashrc && source ~/.bashrc

[!TIP] If you get a “No matching distribution found” error, try installing with a supported Python version (3.10 to 3.12). You can use pyenv to manage multiple Python versions easily on Linux. Watch this video by NetworkChuck for a quick guide.
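
The supported range mentioned in the tip can be checked before running pip. This is a minimal sketch using only the standard library; the helper name `is_supported_python` is hypothetical and not part of Abogen.

```python
import sys

# Supported range taken from the tip above: Python 3.10 to 3.12 inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if is_supported_python():
        print(f"Python {current} is in the supported range.")
    else:
        print(f"Python {current} is outside 3.10-3.12; pip may fail to find a matching distribution.")
```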

Special thanks to @hg000125 for his contribution in #23. AMD GPU support is possible thanks to his work.

`How to run?`

If you installed using pip, you can simply run the following command to start Abogen:

abogen

[!TIP] If you installed using the Windows installer (WINDOWS_INSTALL.bat), it should have created a shortcut in the same folder or on your desktop. You can run it from there. If you lost the shortcut, Abogen is located in python_embedded/Scripts/abogen.exe. You can run it from there directly.

`How to use?`

Drag and drop any ePub, PDF, or text file (or use the built-in text editor)
Configure the settings:
- Set speech speed
- Select a voice (or create a custom voice using voice mixer)
- Select subtitle generation style (by sentence, word, etc.)
- Select output format
- Select where to save the output
Hit Start

`In action`

Here’s Abogen in action: in this demo, it processes ∼3,000 characters of text in just 11 seconds and turns them into 3 minutes and 28 seconds of audio on a low-end RTX 2060 Mobile laptop GPU. Your results may vary depending on your hardware.
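
To put those figures in perspective, the ratio of audio length to processing time can be computed directly. The sketch below uses only the numbers quoted above; the function name is illustrative.

```python
def speedup_over_real_time(audio_seconds, processing_seconds):
    """Seconds of audio produced per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures quoted above: 3 min 28 s of audio generated in 11 s.
audio_seconds = 3 * 60 + 28          # 208 seconds
ratio = speedup_over_real_time(audio_seconds, 11)
print(f"{ratio:.1f}x faster than real time")
```

With the demo's numbers this comes out to roughly 18.9 seconds of audio per second of processing.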

`Configuration`

Options	Description
Input Box	Drag and drop `ePub`, `PDF`, or `.TXT` files (or use built-in text editor)
Queue options	Add multiple files to a queue and process them in batch, with individual settings for each file. See Queue mode for more details.
Speed	Adjust speech rate from `0.1x` to `2.0x`
Select Voice	Voice names start with a two-letter code: the first letter is the language (e.g., `a` for American English, `b` for British English), and the second letter is the gender (`m` for male, `f` for female).
Voice mixer	Create custom voices by mixing different voice models with a profile system. See Voice Mixer for more details.
Voice preview	Listen to the selected voice before processing.
Generate subtitles	`Disabled`, `Sentence`, `Sentence + Comma`, `Sentence + Highlighting`, `1 word`, `2 words`, `3 words`, etc. (Represents the number of words in each subtitle entry)
Output voice format	`.WAV`, `.FLAC`, `.MP3`, `.OPUS (best compression)` and `M4B (with chapters)` (Special thanks to @jborza for chapter support in PR #10)
Output subtitle format	Configures the subtitle format as `SRT (standard)`, `ASS (wide)`, `ASS (narrow)`, `ASS (centered wide)`, or `ASS (centered narrow)`.
Replace single newlines with spaces	Replaces single newlines with spaces in the text. This is useful for hard-wrapped texts, where line breaks fall in the middle of sentences.
Save location	`Save next to input file`, `Save to desktop`, or `Choose output folder`
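
The `1 word`, `2 words`, `3 words` subtitle styles group a fixed number of words into each subtitle entry. The sketch below shows one way such grouping could produce SRT output. It is not Abogen's implementation: the `(word, start, end)` timing tuples and both function names are assumptions made for illustration.

```python
def format_srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(timed_words, words_per_entry):
    """Group (word, start, end) tuples into SRT entries of N words each."""
    if words_per_entry < 1:
        raise ValueError("words_per_entry must be at least 1")
    entries = []
    for i in range(0, len(timed_words), words_per_entry):
        chunk = timed_words[i:i + words_per_entry]
        # An entry spans from its first word's start to its last word's end.
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        index = len(entries) + 1
        entries.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(entries)


# Hypothetical word timings, in seconds.
timed_words = [
    ("Hello", 0.00, 0.40),
    ("from", 0.40, 0.65),
    ("Abogen", 0.65, 1.20),
    ("today", 1.20, 1.75),
]
print(words_to_srt(timed_words, words_per_entry=2))
```

Setting `words_per_entry=1` yields one entry per word, which corresponds to the `1 word` style.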

Book handler options	Description
Chapter Control	Select specific `chapters` from ePUBs or `chapters + pages` from PDFs.
Save each chapter separately	Save each chapter in e-books as a separate audio file.
Create a merged version	Create a single audio file that combines all chapters. (If `Save each chapter separately` is disabled, this option will be the default behavior.)
Save in a project folder with metadata	Save the converted items in a project folder with available metadata files.

Menu options	Description
Theme	Change the application’s theme using `System`, `Light`, or `Dark` options.
Configure max words per subtitle	Configures the maximum number of words per subtitle entry.
Configure max lines in log window	Configures the maximum number of lines to display in the log window.
Separate chapters audio format	Configures the audio format for separate chapters as `wav`, `flac`, `mp3`, or `opus`.
Create desktop shortcut	Creates a shortcut on your desktop for easy access.
Open config directory	Opens the directory where the configuration file is stored.
Open cache directory	Opens the cache directory where converted text files are stored.
Clear cache files	Deletes cache files created during the conversion or preview.
Check for updates at startup	Automatically checks for updates when the program starts.
Disable Kokoro’s internet access	Prevents Kokoro from downloading models or voices from HuggingFace Hub, useful for offline use.
Reset to default settings	Resets all settings to their default values.

Special thanks to @robmckinnon for adding Sentence + Highlighting feature in PR #65

`Voice Mixer`

With voice mixer, you can create custom voices by mixing different voice models. You can adjust the weight of each voice and save your custom voice as a profile for future use. The voice mixer allows you to create unique and personalized voices. (Huge thanks to @jborza for making this possible through his contributions in #5)
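
One plausible reading of "mixing with weights" is a weighted average of per-voice vectors. How Abogen and Kokoro actually represent and combine voices is not described here, so the toy vectors and the `mix_voices` function below are purely illustrative assumptions.

```python
def mix_voices(voices, weights):
    """Weighted average of equal-length voice vectors.

    voices:  dict mapping voice name -> list of floats (hypothetical embeddings)
    weights: dict mapping voice name -> non-negative weight
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    length = len(next(iter(voices.values())))
    mixed = [0.0] * length
    for name, weight in weights.items():
        vector = voices[name]
        if len(vector) != length:
            raise ValueError(f"voice {name!r} has a different vector length")
        # Normalise by the weight total so the result stays on the same scale.
        for i, value in enumerate(vector):
            mixed[i] += (weight / total) * value
    return mixed


# Toy 3-dimensional vectors; real voice data would be much larger.
voices = {"voice_a": [1.0, 0.0, 2.0], "voice_b": [0.0, 1.0, 4.0]}
profile = {"voice_a": 0.75, "voice_b": 0.25}
print(mix_voices(voices, profile))
```

A saved profile would then amount to little more than the `profile` dictionary of names and weights.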

`Queue Mode`

Abogen supports queue mode, allowing you to add multiple files to a processing queue. This is useful if you want to convert several files in one batch.

You can add text files (.txt) directly using the Add files button in the Queue Manager. To add PDF or EPUB files, use the input box in the main window and click the Add to Queue button.
Each file in the queue keeps the configuration settings that were active when it was added. Changing the main window configuration afterward does not affect files already in the queue.
You can view each file’s configuration by hovering over it.

Abogen will process each item in the queue automatically, saving outputs as configured.
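
The rule that queued files keep the settings that were active when they were added can be modelled by copying the settings at enqueue time. This is a sketch of that idea rather than Abogen's code: `ConversionQueue` and the settings keys are hypothetical.

```python
import copy
from collections import deque


class ConversionQueue:
    """Toy queue that stores a settings snapshot with each file."""

    def __init__(self):
        self._items = deque()

    def add(self, path, settings):
        # Deep-copy so later edits to the live settings do not leak into the queue.
        self._items.append((path, copy.deepcopy(settings)))

    def process_all(self):
        """Yield (path, settings) in the order the files were added."""
        while self._items:
            yield self._items.popleft()


# Hypothetical settings keys, loosely named after the options table.
settings = {"speed": 1.0, "subtitles": "Sentence"}
queue = ConversionQueue()
queue.add("book1.txt", settings)

settings["speed"] = 1.5            # change the main-window configuration
queue.add("book2.txt", settings)

for path, snapshot in queue.process_all():
    print(path, snapshot["speed"])
```

Here `book1.txt` is still processed at speed 1.0 even though the live settings changed afterwards.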

Special thanks to @jborza for adding queue mode in PR #35

`About Chapter Markers`

When you process ePUB or PDF files, Abogen converts them into text files stored in your cache directory. When you click “Edit,” you’re actually modifying these converted text files. In these text files, you’ll notice tags that look like this:

`<<CHAPTER_MARKER:Chapter Title>>`

These are chapter markers. They are automatically added when you process ePUB or PDF files, based on the chapters you select. They serve an important purpose:

Allow you to split the text into separate audio files for each chapter
Save time by letting you reprocess only specific chapters if errors occur, rather than the entire file

You can manually add these markers to plain text files for the same benefits. Simply include them in your text like this:

<<CHAPTER_MARKER:Introduction>>
This is the beginning of my text...  

<<CHAPTER_MARKER:Main Content>> 
Here's another part...

When you process the text file, Abogen will detect these markers automatically and ask if you want to save each chapter separately and create a merged version.
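
For anyone scripting around these files, the marker syntax is easy to parse. The following sketch splits a text into chapters with a regular expression; it assumes only the `<<CHAPTER_MARKER:Title>>` format shown above, and `split_chapters` is a hypothetical helper, not Abogen's own parser.

```python
import re

CHAPTER_RE = re.compile(r"<<CHAPTER_MARKER:(.*?)>>")


def split_chapters(text, default_title="Untitled"):
    """Return a list of (title, body) pairs split at chapter markers."""
    parts = CHAPTER_RE.split(text)
    # parts = [text before first marker, title1, body1, title2, body2, ...]
    chapters = []
    preamble = parts[0].strip()
    if preamble:
        chapters.append((default_title, preamble))
    for title, body in zip(parts[1::2], parts[2::2]):
        chapters.append((title.strip(), body.strip()))
    return chapters


sample = """<<CHAPTER_MARKER:Introduction>>
This is the beginning of my text...

<<CHAPTER_MARKER:Main Content>>
Here's another part...
"""

for title, body in split_chapters(sample):
    print(f"{title}: {body}")
```

Text that appears before the first marker is kept under a placeholder title rather than dropped.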

`About Metadata Tags`

Similar to chapter markers, it is possible to add metadata tags for M4B files. This is useful for audiobook players that support metadata, allowing you to add information like title, author, year, etc. Abogen automatically adds these tags when you process ePUB or PDF files, but you can also add them manually to your text files. Add metadata tags at the beginning of your text file like this:

<<METADATA_TITLE:Title>>
<<METADATA_ARTIST:Author>>
<<METADATA_ALBUM:Album Title>>
<<METADATA_YEAR:Year>>
<<METADATA_ALBUM_ARTIST:Album Artist>>
<<METADATA_COMPOSER:Narrator>>
<<METADATA_GENRE:Audiobook>>
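
The tag format above can be read with a short regular expression. This is a sketch under the assumption that each tag sits on its own line at the top of the file; `extract_metadata` is a hypothetical helper, not part of Abogen.

```python
import re

# One tag per line, e.g. <<METADATA_TITLE:My Book>>
METADATA_RE = re.compile(r"^<<METADATA_([A-Z_]+):(.*?)>>[ \t]*$", re.MULTILINE)


def extract_metadata(text):
    """Return (metadata dict, text with the metadata lines removed)."""
    metadata = {key.lower(): value.strip() for key, value in METADATA_RE.findall(text)}
    remaining = METADATA_RE.sub("", text).lstrip("\n")
    return metadata, remaining


sample = """<<METADATA_TITLE:My Book>>
<<METADATA_ARTIST:Jane Doe>>
<<METADATA_YEAR:2025>>
Chapter text starts here.
"""

metadata, body = extract_metadata(sample)
print(metadata)
print(body)
```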

`Supported Languages`

# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇧🇷 'p' => Brazilian Portuguese pt-br
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]

For a complete list of supported languages and voices, refer to Kokoro’s VOICES.md. To listen to sample audio outputs, see SAMPLES.md.
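
The two-letter prefix convention (language, then gender) can be decoded with a small lookup. This sketch uses the language codes listed above; the voice ids `af_heart` and `bm_george` are used as examples only, so check VOICES.md for the names that actually exist.

```python
# Language prefixes from the list above.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}
GENDERS = {"m": "male", "f": "female"}


def describe_voice(voice_id):
    """Decode the two-letter prefix of a voice id such as 'af_heart'."""
    if len(voice_id) < 2:
        raise ValueError("voice id needs at least two characters")
    lang_code, gender_code = voice_id[0], voice_id[1]
    try:
        return LANGUAGES[lang_code], GENDERS[gender_code]
    except KeyError as exc:
        raise ValueError(f"unrecognised prefix in {voice_id!r}") from exc


print(describe_voice("af_heart"))
print(describe_voice("bm_george"))
```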

[!NOTE] Japanese audio may require additional configuration. Please check #56 for more information.

`MPV Config`

I highly recommend using MPV to play your audio files, as it supports displaying subtitles even without a video track. Here’s my mpv.conf:

# --- MPV Settings ---
save-position-on-quit
keep-open=yes
# --- Subtitle ---
sub-ass-override=no
sub-margin-y=50
sub-margin-x=50
# --- Audio Quality ---
audio-spdif=ac3,dts,eac3,truehd,dts-hd
audio-channels=auto
audio-samplerate=48000
volume-max=200

`Docker Guide`

If you want to run Abogen in a Docker container:

Download the repository and extract, or clone it using git.
Go to the abogen folder. You should see the Dockerfile there.
Open your terminal in that directory and run the following commands:

# Build the Docker image:
docker build --progress plain -t abogen .

# Note that building the image may take a while.
# After building is complete, run the Docker container:

# Windows
docker run --name abogen -v %cd%:/shared -p 5800:5800 -p 5900:5900 --gpus all abogen

# Linux
docker run --name abogen -v $(pwd):/shared -p 5800:5800 -p 5900:5900 --gpus all abogen

# MacOS
docker run --name abogen -v $(pwd):/shared -p 5800:5800 -p 5900:5900 abogen

# We expose port 5800 for use by a web browser, 5900 if you want to connect with a VNC client.

Abogen launches automatically inside the container.

You can access it via a web browser at http://localhost:5800 or connect to it using a VNC client at localhost:5900.
You can use the /shared directory to share files between your host and the container.
For later use, start it with docker start abogen and stop it with docker stop abogen.

Known issues:

Audio preview does not work inside the container (ALSA error).
The Open cache directory and Open configuration directory options in settings do not work. (Tried pcmanfm; it did not work with Abogen.)

(Special thanks to @geo38 from Reddit, who provided the Dockerfile and instructions in this comment.)

`Similar Projects`

Abogen is a standalone project, but it is inspired by and shares some similarities with other projects. Here are a few:

audiblez: Generate audiobooks from e-books. (Has CLI and GUI support)
autiobooks: Automatically convert epubs to audiobooks
pdf-narrator: Convert your PDFs and EPUBs into audiobooks effortlessly.
epub_to_audiobook: EPUB to audiobook converter, optimized for Audiobookshelf
ebook2audiobook: Convert ebooks to audiobooks with chapters and metadata using dynamic AI models and voice cloning

`Roadmap`

Add OCR scan feature for PDF files using docling/tesseract.
Add chapter metadata for .m4a files. (Issue #9, PR #10)
Add support for different languages in GUI.
Add voice formula feature that enables mixing different voice models. (Issue #1, PR #5)
Add support for kokoro-onnx (If it’s necessary).
Add dark mode.

`Troubleshooting`

If you encounter any issues while running Abogen, try launching it from the command line with:

`abogen-cli`

This will start Abogen in command-line mode and display detailed error messages. Please open a new issue on the Issues page with the error message and a description of your problem.

`Contributing`

I welcome contributions! If you have ideas for new features, improvements, or bug fixes, please fork the repository and submit a pull request.

For developers and contributors

If you’d like to modify the code and contribute to development, you can download the repository, extract it and run the following commands to build or install the package:

# Go to the directory where you extracted the repository and run:
pip install -e .      # Installs the package in editable mode
pip install build     # Install the build package
python -m build       # Builds the package in dist folder (optional)
abogen                # Opens the GUI

Feel free to explore the code and make any changes you like.

`Credits`

Abogen uses Kokoro for its high-quality, natural-sounding text-to-speech synthesis. Huge thanks to the Kokoro team for making this possible.
Thanks to @wojiushixiaobai for Embedded Python packages. These modified packages include pip pre-installed, enabling Abogen to function as a standalone application without requiring users to separately install Python in Windows.
Thanks to creators of EbookLib, a Python library for reading and writing ePub files, which is used for extracting text from ePub files.
Special thanks to the PyQt team for providing the cross-platform GUI toolkit that powers Abogen’s interface.
Icons: US, Great Britain, Spain, France, India, Italy, Japan, Brazil, China, Female, Male, Adjust and Voice Id icons by Icons8.

`License`

This project is available under the MIT License - see the LICENSE file for details. Kokoro is licensed under Apache-2.0 which allows commercial use, modification, distribution, and private use.

[!IMPORTANT] Subtitle generation currently works only for English. This is because Kokoro provides timestamp tokens only for English text. If you want subtitles in other languages, please request this feature in the Kokoro project. For more technical details, see this line in Kokoro’s code.

Tags: audiobook, kokoro, text-to-speech, TTS, audiobook generator, audiobooks, text to speech, audiobook maker, audiobook creator, audiobook generator, voice-synthesis, text to audio, text to audio converter, text to speech converter, text to speech generator, text to speech software, text to speech app, epub to audio, pdf to audio, content-creation, media-generation

NeMo

Sat, 10 May 2025 15:25:32 +0800

NVIDIA/NeMo

NVIDIA NeMo Framework

Latest News

Pretrain and finetune 🤗 Hugging Face models via AutoModel

NeMo Framework's latest feature, AutoModel, enables broad support for 🤗 Hugging Face models, with 25.02 focusing on AutoModelForCausalLM in the text generation category. Future releases will enable support for more model families, such as vision language models.

Training on Blackwell using NeMo

NeMo Framework has added Blackwell support, with 25.02 focusing on functional parity for B200. More optimizations to come in the upcoming releases.

NeMo Framework 2.0

We've released NeMo 2.0, an update on the NeMo Framework which prioritizes modularity and ease-of-use. Please refer to the NeMo Framework User Guide to get started.

New Cosmos World Foundation Models Support

Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform (2025-01-09)

The end-to-end NVIDIA Cosmos platform accelerates world model development for physical AI systems. Built on CUDA, Cosmos combines state-of-the-art world foundation models, video tokenizers, and AI-accelerated data processing pipelines. Developers can accelerate world model development by fine-tuning Cosmos world foundation models or building new ones from the ground up. These models create realistic synthetic videos of environments and interactions, providing a scalable foundation for training complex systems, from simulating humanoid robots performing advanced actions to developing end-to-end autonomous driving models.

Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities (2025-01-07)

The NeMo Framework now supports training and customizing the NVIDIA Cosmos collection of world foundation models. Cosmos leverages advanced text-to-world generation techniques to create fluid, coherent video content from natural language prompts.

You can also now accelerate your video processing step using the NeMo Curator library, which provides optimized video processing and captioning features that can deliver up to 89x faster video processing when compared to an unoptimized CPU pipeline.

Large Language Models and Multimodal Models

State-of-the-Art Multimodal Generative AI Model Development with NVIDIA NeMo (2024-11-06)

NVIDIA recently announced significant enhancements to the NeMo platform, focusing on multimodal generative AI models. The update includes NeMo Curator and the Cosmos tokenizer, which streamline the data curation process and enhance the quality of visual data. These tools are designed to handle large-scale data efficiently, making it easier to develop high-quality AI models for various applications, including robotics and autonomous driving. The Cosmos tokenizers, in particular, efficiently map visual data into compact, semantic tokens, which is crucial for training large-scale generative models. The tokenizer is available now on the NVIDIA/cosmos-tokenizer GitHub repo and on Hugging Face.

New Llama 3.1 Support (2024-07-23)

The NeMo Framework now supports training and customizing the Llama 3.1 collection of LLMs from Meta.

Accelerate your Generative AI Distributed Training Workloads with the NVIDIA NeMo Framework on Amazon EKS (2024-07-16)

NVIDIA NeMo Framework now runs distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. For step-by-step instructions on creating an EKS cluster and running distributed training workloads with NeMo, see the GitHub repository here.

NVIDIA NeMo Accelerates LLM Innovation with Hybrid State Space Model Support (2024-06-17)

NVIDIA NeMo and Megatron Core now support pre-training and fine-tuning of state space models (SSMs). NeMo also supports training models based on the Griffin architecture as described by Google DeepMind.

NVIDIA releases 340B base, instruct, and reward models pretrained on a total of 9T tokens. (2024-06-18)

See documentation and tutorials for SFT, PEFT, and PTQ with Nemotron 340B in the NeMo Framework User Guide.

NVIDIA sets new generative AI performance and scale records in MLPerf Training v4.0 (2024-06-12)

Using NVIDIA NeMo Framework and NVIDIA Hopper GPUs, NVIDIA was able to scale to 11,616 H100 GPUs and achieve near-linear performance scaling on LLM pretraining. NVIDIA also achieved the highest LLM fine-tuning performance and raised the bar for text-to-image training.

Accelerate your generative AI journey with NVIDIA NeMo Framework on GKE (2024-03-16)

An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.

Speech Recognition

Accelerating Leaderboard-Topping ASR Models 10x with NVIDIA NeMo (2024-09-24)

NVIDIA NeMo team released a number of inference optimizations for CTC, RNN-T, and TDT models that resulted in up to 10x inference speed-up. These models now exceed an inverse real-time factor (RTFx) of 2,000, with some reaching RTFx of even 6,000.

New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model (2024-04-18)

The NeMo team just released Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization. Canary also provides bi-directional translation, between English and the three other supported languages.

Pushing the Boundaries of Speech Recognition with NVIDIA NeMo Parakeet ASR Models (2024-04-18)

NVIDIA NeMo, an end-to-end platform for the development of multimodal generative AI models at scale anywhere—on any cloud and on-premises—released the Parakeet family of automatic speech recognition (ASR) models. These state-of-the-art ASR models, developed in collaboration with Suno.ai, transcribe spoken English with exceptional accuracy.

Turbocharge ASR Accuracy and Speed with NVIDIA NeMo Parakeet-TDT (2024-04-18)

NVIDIA NeMo, an end-to-end platform for developing multimodal generative AI models at scale on any cloud and on-premises, recently released Parakeet-TDT. This new addition to the NeMo ASR Parakeet model family boasts better accuracy and 64% greater speed than the previously best model, Parakeet-RNNT-1.1B.

Introduction

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and PyTorch developers working on Large Language Models (LLMs), Multimodal Models (MMs), Automatic Speech Recognition (ASR), Text to Speech (TTS), and Computer Vision (CV) domains. It is designed to help you efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

For technical documentation, please see the NeMo Framework User Guide.

What’s New in NeMo 2.0

NVIDIA NeMo 2.0 introduces several significant improvements over its predecessor, NeMo 1.0, enhancing flexibility, performance, and scalability.

Python-Based Configuration - NeMo 2.0 transitions from YAML files to a Python-based configuration, providing more flexibility and control. This shift makes it easier to extend and customize configurations programmatically.
Modular Abstractions - By adopting PyTorch Lightning’s modular abstractions, NeMo 2.0 simplifies adaptation and experimentation. This modular approach allows developers to more easily modify and experiment with different components of their models.
Scalability - NeMo 2.0 seamlessly scales large-scale experiments across thousands of GPUs using NeMo-Run, a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across computing environments.

Overall, these enhancements make NeMo 2.0 a powerful, scalable, and user-friendly framework for AI model development.
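
To illustrate why Python-based configuration is easier to extend programmatically than static YAML, here is a generic sketch using only standard-library dataclasses. It does not use the NeMo or NeMo-Run API; `TrainConfig` and all of its fields are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical training configuration (not a NeMo class)."""
    model_name: str
    num_gpus: int
    micro_batch_size: int
    learning_rate: float = 3e-4

    @property
    def global_batch_size(self):
        # Derived values are ordinary code instead of duplicated YAML entries.
        return self.num_gpus * self.micro_batch_size


base = TrainConfig(model_name="toy-llm", num_gpus=8, micro_batch_size=4)

# Generate a small sweep programmatically by overriding one field at a time.
sweep = [replace(base, learning_rate=lr) for lr in (1e-4, 3e-4, 1e-3)]

for cfg in sweep:
    print(cfg.model_name, cfg.learning_rate, cfg.global_batch_size)
```

Because the configuration is plain Python, sweeps, derived fields, and validation can live next to the values they describe. For the real configuration objects, refer to the NeMo Framework User Guide.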

[!IMPORTANT]
NeMo 2.0 is currently supported by the LLM (large language model) and VLM (vision language model) collections.

Get Started with NeMo 2.0

Refer to the Quickstart for examples of using NeMo-Run to launch NeMo 2.0 experiments locally and on a Slurm cluster.
For more information about NeMo 2.0, see the NeMo Framework User Guide.
NeMo 2.0 Recipes contains additional examples of launching large-scale runs using NeMo 2.0 and NeMo-Run.
For an in-depth exploration of the main features of NeMo 2.0, see the Feature Guide.
To transition from NeMo 1.0 to 2.0, see the Migration Guide for step-by-step instructions.

Get Started with Cosmos

NeMo Curator and NeMo Framework support video curation and post-training of the Cosmos World Foundation Models, which are open and available on NGC and Hugging Face. For more information on video datasets, refer to NeMo Curator. To post-train World Foundation Models using the NeMo Framework for your custom physical AI tasks, see the Cosmos Diffusion models and the Cosmos Autoregressive models.

LLMs and MMs Training, Alignment, and Customization

All NeMo models are trained with Lightning. Training is automatically scalable to 1000s of GPUs. You can check the performance benchmarks using the latest NeMo Framework container here.

When applicable, NeMo models leverage cutting-edge distributed training techniques, incorporating parallelism strategies to enable efficient training of very large models. These techniques include Tensor Parallelism (TP), Pipeline Parallelism (PP), Fully Sharded Data Parallelism (FSDP), Mixture-of-Experts (MoE), and Mixed Precision Training with BFloat16 and FP8, as well as others.
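
As a rough illustration of how these strategies combine, tensor and pipeline parallelism split one model replica across several GPUs, and the remaining GPUs hold data-parallel replicas. The sketch below assumes the common relationship world size = TP × PP × DP and ignores other dimensions such as context or expert parallelism. It is back-of-the-envelope arithmetic, not a NeMo API.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas when each replica spans TP x PP GPUs."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    return world_size // gpus_per_replica


# Example: 64 GPUs with TP=8 and PP=2 leaves 4 data-parallel replicas.
print(data_parallel_size(64, tensor_parallel=8, pipeline_parallel=2))
```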

NeMo Transformer-based LLMs and MMs utilize NVIDIA Transformer Engine for FP8 training on NVIDIA Hopper GPUs, while leveraging NVIDIA Megatron Core for scaling Transformer model training.

NeMo LLMs can be aligned with state-of-the-art methods such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). See NVIDIA NeMo Aligner for more information.

In addition to supervised fine-tuning (SFT), NeMo also supports the latest parameter efficient fine-tuning (PEFT) techniques such as LoRA, P-Tuning, Adapters, and IA3. Refer to the NeMo Framework User Guide for the full list of supported models and techniques.
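
To give a sense of why PEFT methods such as LoRA are called parameter efficient: LoRA freezes a dense weight matrix and trains two low-rank factors in its place. The sketch below only counts parameters for a single layer. The layer size and rank are illustrative values, not NeMo defaults.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA keeps frozen."""
    return d_in * d_out


d_in = d_out = 4096   # illustrative layer size
rank = 8              # illustrative LoRA rank
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA trains {lora:,} of {full:,} weights ({100 * lora / full:.2f}%)")
```

For this example layer, the trainable factors amount to well under 1% of the frozen matrix.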

LLMs and MMs Deployment and Optimization

NeMo LLMs and MMs can be deployed and optimized with NVIDIA NeMo Microservices.

Speech AI

NeMo ASR and TTS models can be optimized for inference and deployed for production use cases with NVIDIA Riva.

NeMo Framework Launcher

[!IMPORTANT]
NeMo Framework Launcher is compatible with NeMo version 1.0 only. NeMo-Run is recommended for launching experiments using NeMo 2.0.

NeMo Framework Launcher is a cloud-native tool that streamlines the NeMo Framework experience. It is used for launching end-to-end NeMo Framework training jobs on CSPs and Slurm clusters.

The NeMo Framework Launcher includes extensive recipes, scripts, utilities, and documentation for training NeMo LLMs. It also includes the NeMo Framework Autoconfigurator, which is designed to find the optimal model parallel configuration for training on a specific cluster.

To get started quickly with the NeMo Framework Launcher, please see the NeMo Framework Playbooks. The NeMo Framework Launcher does not currently support ASR and TTS training, but it will soon.

Get Started with NeMo Framework

Getting started with NeMo Framework is easy. State-of-the-art pretrained NeMo models are freely available on Hugging Face Hub and NVIDIA NGC. These models can be used to generate text or images, transcribe audio, and synthesize speech in just a few lines of code.

We have extensive tutorials that can be run on Google Colab or with our NGC NeMo Framework Container. We also have playbooks for users who want to train NeMo models with the NeMo Framework Launcher.

For advanced users who want to train NeMo models from scratch or fine-tune existing NeMo models, we have a full suite of example scripts that support multi-GPU/multi-node training.

Key Features

Requirements

Python 3.10 or above
PyTorch 2.5 or above
NVIDIA GPU (if you intend to do model training)
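
A minimal sketch of how these requirements could be checked before installing. The helper names `version_ge` and `check_nemo_requirements` are hypothetical and not part of NeMo, and the comparison assumes a `sort` that supports `-V` (version sort), as in GNU coreutils.

```shell
# Succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical pre-install check; not part of NeMo itself.
check_nemo_requirements() {
  py_ver=$(python3 -c 'import platform; print(platform.python_version())') || return 1
  version_ge "$py_ver" 3.10 || {
    echo "Python 3.10 or above is required (found $py_ver)" >&2
    return 1
  }
  torch_ver=$(python3 -c 'import torch; print(torch.__version__)') || return 1
  version_ge "$torch_ver" 2.5 || {
    echo "PyTorch 2.5 or above is required (found $torch_ver)" >&2
    return 1
  }
  # A GPU is only needed for training, so a missing one is just a warning.
  command -v nvidia-smi >/dev/null 2>&1 ||
    echo "warning: nvidia-smi not found; training needs an NVIDIA GPU" >&2
}
```

Running `check_nemo_requirements` in the target environment returns a non-zero status when Python or PyTorch is missing or too old.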

Developer Documentation

Version	Status	Description
Latest		Documentation of the latest (i.e. main) branch.
Stable		Documentation of the stable (i.e. most recent release) branch.

Install NeMo Framework

The NeMo Framework can be installed in several ways. Depending on your domain and needs, one of the following installation methods may be more suitable.

Conda / Pip: Install NeMo-Framework with native Pip into a virtual environment.
- Used to explore NeMo on any supported platform.
- This is the recommended method for ASR and TTS domains.
- Limited feature-completeness for other domains.
NGC PyTorch container: Install NeMo-Framework from source with feature-completeness into a highly optimized container.
- For users that want to install from source in a highly optimized container.
NGC NeMo container: Ready-to-go solution of NeMo-Framework
- For users that seek highest performance.
- Contains all dependencies installed and tested for performance and convergence.

Support matrix

NeMo-Framework provides tiers of support based on OS / Platform and mode of installation. Please refer to the following overview of support levels:

Full support: Max performance and feature-completeness.
Limited support: Used to explore NeMo.
No support yet: In development.
Deprecated: Support has reached end of life.

Please refer to the following table for current support levels:

OS / Platform	Install from PyPI	Source into NGC container
`linux` - `amd64/x86_64`	Limited support	Full support
`linux` - `arm64`	Limited support	Limited support
`darwin` - `amd64/x86_64`	Deprecated	Deprecated
`darwin` - `arm64`	Limited support	Limited support
`windows` - `amd64/x86_64`	No support yet	No support yet
`windows` - `arm64`	No support yet	No support yet
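
The table can be read as a simple lookup. The sketch below encodes it as a shell function; the name `support_level`, the install-method labels `pypi` and `ngc-source`, and the mapping from `uname` output to table rows (including the Windows shell prefixes) are assumptions made for illustration.

```shell
# Hypothetical lookup of the support table above.
# $1: OS as printed by `uname -s`
# $2: architecture as printed by `uname -m`
# $3: install method, either "pypi" or "ngc-source" (labels are assumptions)
support_level() {
  case "$1/$2" in
    Linux/x86_64|Linux/amd64)
      if [ "$3" = "ngc-source" ]; then
        echo "Full support"
      else
        echo "Limited support"
      fi ;;
    Linux/aarch64|Linux/arm64|Darwin/arm64)
      echo "Limited support" ;;
    Darwin/x86_64)
      echo "Deprecated" ;;
    MINGW*/*|MSYS*/*|CYGWIN*/*)
      # Assumed `uname -s` prefixes for Windows shells.
      echo "No support yet" ;;
    *)
      echo "Unknown" ;;
  esac
}
```

For the current machine, `support_level "$(uname -s)" "$(uname -m)" pypi` prints the tier for a PyPI install.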

Conda / Pip

Install NeMo in a fresh Conda environment:

conda create --name nemo python==3.10.12
conda activate nemo

Pick the right version

NeMo-Framework publishes pre-built wheels with each release. To install nemo_toolkit from such a wheel, use the following installation method:

pip install "nemo_toolkit[all]"

If a more specific version is desired, we recommend a Pip-VCS install. From NVIDIA/NeMo, fetch the commit, branch, or tag that you would like to install.
To install nemo_toolkit from this Git reference $REF, use the following installation method:

git clone https://github.com/NVIDIA/NeMo
cd NeMo
git checkout ${REF:-'main'}
pip install '.[all]'

Install a specific Domain

To install a specific domain of NeMo, you must first install the nemo_toolkit using the instructions listed above. Then, you run the following domain-specific commands:

pip install "nemo_toolkit[all]" # or pip install "nemo_toolkit[all]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
pip install "nemo_toolkit[asr]" # or pip install "nemo_toolkit[asr]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
pip install "nemo_toolkit[nlp]" # or pip install "nemo_toolkit[nlp]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
pip install "nemo_toolkit[tts]" # or pip install "nemo_toolkit[tts]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
pip install "nemo_toolkit[vision]" # or pip install "nemo_toolkit[vision]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
pip install "nemo_toolkit[multimodal]" # or pip install "nemo_toolkit[multimodal]@git+https://github.com/NVIDIA/NeMo@${REF:-main}"
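
Because the commands above differ only in the extra name and the optional Git reference, the requirement string can be built in one place. The following is a small sketch; `nemo_requirement` is a hypothetical helper, not something shipped with NeMo.

```shell
# Hypothetical helper: print the pip requirement for a NeMo domain extra.
# $1: extra name (for example asr or tts)
# $2: optional Git reference (commit, branch, or tag)
nemo_requirement() {
  if [ -n "${2:-}" ]; then
    echo "nemo_toolkit[$1]@git+https://github.com/NVIDIA/NeMo@$2"
  else
    echo "nemo_toolkit[$1]"
  fi
}
```

It would be used as `pip install "$(nemo_requirement asr)"` for the released wheel, or `pip install "$(nemo_requirement asr main)"` for a Git reference. Quoting the whole string keeps the shell from interpreting the square brackets.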

NGC PyTorch container

NOTE: The following steps are supported beginning with 24.04 (NeMo-Toolkit 2.3.0)

We recommend that you start with a base NVIDIA PyTorch container: nvcr.io/nvidia/pytorch:25.01-py3.

If starting with a base NVIDIA PyTorch container, you must first launch the container:

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  ${NV_PYTORCH_TAG:-'nvcr.io/nvidia/pytorch:25.01-py3'}

From NVIDIA/NeMo, fetch the commit/branch/tag that you want to install.
To install nemo_toolkit including all of its dependencies from this Git reference $REF, use the following installation method:

cd /opt
git clone https://github.com/NVIDIA/NeMo
cd NeMo
git checkout ${REF:-'main'}
bash reinstall.sh --library all

NGC NeMo container

NeMo containers are launched concurrently with NeMo version updates. NeMo Framework now supports LLMs, MMs, ASR, and TTS in a single consolidated Docker container. You can find additional information about released containers on the NeMo releases page.

To use a pre-built container, run the following code:

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  ${NV_PYTORCH_TAG:-'nvcr.io/nvidia/nemo:25.02'}

Future Work

The NeMo Framework Launcher does not currently support ASR and TTS training, but it will soon.

Discussions Board

FAQ can be found on the NeMo Discussions board. You are welcome to ask questions or start discussions on the board.

Contribute to NeMo

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.

Publications

We provide an ever-growing list of publications that utilize the NeMo Framework.

To contribute an article to the collection, please submit a pull request to the gh-pages-src branch of this repository. For detailed information, please consult the README located at the gh-pages-src branch.

Blogs

Large Language Models and Multimodal Models

Bria Builds Responsible Generative AI for Enterprises Using NVIDIA NeMo, Picasso (2024/03/06)

Bria, a Tel Aviv startup at the forefront of visual generative AI for enterprises, now leverages the NVIDIA NeMo Framework. The Bria.ai platform uses reference implementations from the NeMo Multimodal collection, trained on NVIDIA Tensor Core GPUs, to enable high-throughput and low-latency image generation. Bria has also adopted NVIDIA Picasso, a foundry for visual generative AI models, to run inference.

New NVIDIA NeMo Framework Features and NVIDIA H200 (2023/12/06)

NVIDIA NeMo Framework now includes several optimizations and enhancements, including: 1) Fully Sharded Data Parallelism (FSDP) to improve the efficiency of training large-scale AI models, 2) Mixture of Experts (MoE)-based LLM architectures with expert parallelism for efficient LLM training at scale, 3) Reinforcement Learning from Human Feedback (RLHF) with TensorRT-LLM for inference stage acceleration, and 4) up to 4.2x speedups for Llama 2 pre-training on NVIDIA H200 Tensor Core GPUs.

NVIDIA now powers training for Amazon Titan Foundation models (2023/11/28)

NVIDIA NeMo Framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.

Licenses

The NeMo GitHub repository is licensed under the Apache 2.0 license.
The NeMo container is licensed under the NVIDIA AI PRODUCT AGREEMENT. By pulling and using the container, you accept the terms and conditions of this license.