Looking for the right text-to-speech model? The 1.6-billion-parameter model Dia might just be the one for you. You’d be surprised to hear that this model was created by two undergraduates, with zero funding! In this article, you’ll learn about the model, how to access and use it, and also see its results to really understand what it is capable of. Before using the model, it helps to get acquainted with it.
What’s Dia-1.6B?
Models trained to take text as input and produce natural speech as output are called text-to-speech models. The Dia-1.6B model, developed by Nari Labs, belongs to this text-to-speech family. It’s an interesting model capable of generating realistic dialogue from a transcript. It’s also worth noting that the model can produce nonverbal communication like laughs, sneezes, whistles, etc. Exciting, isn’t it?
How to Access Dia-1.6B?
There are two ways in which we can access the Dia-1.6B model:
- Using the Hugging Face API with Google Colab
- Using Hugging Face Spaces
The first requires getting an API key and then integrating it into Google Colab with code. The latter is no-code and lets us use Dia-1.6B interactively.
1. Using Hugging Face and Colab
The model is available on Hugging Face and can be run with 10 GB of VRAM, provided by the T4 GPU in a Google Colab notebook. We’ll demonstrate this with a mini dialogue.
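You can verify the GPU from a notebook cell once the runtime is up; a quick check (PyTorch comes preinstalled on Colab):
import torch

# Confirm a CUDA GPU is attached and report its name and total memory
assert torch.cuda.is_available(), "No GPU found - switch the runtime type to T4 GPU"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")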
Before we begin, let’s get our Hugging Face access token, which will be required to run the code. Go to https://huggingface.co/settings/tokens and generate a key if you don’t have one already.
Make sure to enable the following permissions:

Open a new notebook in Google Colab and add this key in the secrets (the Name should be HF_Token):

Note: Switch to the T4 GPU to run this notebook. Only then will you be able to use the 10 GB of VRAM required for running this model.
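You can also confirm the secret is wired up and log in to Hugging Face from code. A minimal sketch, assuming you named the secret HF_Token as above:
from google.colab import userdata
from huggingface_hub import login

# Read the token from Colab's secrets manager; the name must match the secret
hf_token = userdata.get("HF_Token")
login(token=hf_token)  # authenticate this session with Hugging Face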
Let’s now get our hands on the model:
- First, clone Dia’s Git repository:
!git clone https://github.com/nari-labs/dia.git
- Install the local package:
!pip install ./dia
- Install the soundfile audio library:
!pip install soundfile
After running the commands above, restart the session before proceeding.
- After the installations, let’s do the necessary imports and initialize the model:
import soundfile as sf
from dia.model import Dia
import IPython.display as ipd

# Download the weights from Hugging Face and initialize the model
model = Dia.from_pretrained("nari-labs/Dia-1.6B")
- Initialize the text for the text-to-speech conversion:
text = "[S1] This is how Dia sounds. (laughs) [S2] Don't laugh too much. [S1] (clears throat) Do share your thoughts on the model."
- Run inference on the model:
output = model.generate(text)
sampling_rate = 44100  # Dia uses a 44.1 kHz sampling rate
output_file = "dia_sample.mp3"
sf.write(output_file, output, sampling_rate)  # Save the audio
ipd.Audio(output_file)  # Play the audio in the notebook
Output:
The speech is very human-like, and the model does a great job with non-verbal communication. It’s worth noting that the results aren’t reproducible, as there are no templates for the voices.
Note: You can try fixing the model’s seed to reproduce the results, as in the sketch below.
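A minimal sketch of seed fixing, assuming Dia samples through PyTorch’s default generators (note that full determinism on a GPU still isn’t guaranteed):
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed the common RNG sources so repeated runs start from the same state
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
output = model.generate(text)  # same seed, same starting RNG state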
2. Using Hugging Face Spaces
Let’s try to clone a voice using the model via Hugging Face Spaces. Here we have the option to use the model directly through the web interface: https://huggingface.co/spaces/nari-labs/Dia-1.6B
Here you can pass the input text, and you can additionally use the ‘Audio Prompt’ to replicate a voice. I passed the audio we generated in the previous section.
The following text was passed as input:
[S1] Dia is an open weights text to dialogue model.
[S2] You get full control over scripts and voices.
[S1] Wow. Amazing. (laughs)
[S2] Try it now on GitHub or Hugging Face.
I’ll let you be the judge: do you feel that the model has successfully captured and replicated the earlier voices?
Note: I got a few errors while generating the speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work. You can also attempt the same voice cloning locally, as sketched below.
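The sketch prepends the transcript of the reference clip to the new script and passes the clip as an audio prompt. The audio_prompt_path argument name is an assumption here; check the voice-cloning example in the Dia repository for the exact signature:
# Hypothetical voice-cloning call; verify the argument name in the Dia repo
clone_transcript = "[S1] This is how Dia sounds. (laughs) [S2] Don't laugh too much."
new_script = " [S1] Dia is an open weights text to dialogue model."

output = model.generate(
    clone_transcript + new_script,
    audio_prompt_path="dia_sample.mp3",  # assumed argument: the reference audio
)
sf.write("dia_clone.mp3", output, 44100)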
Things to Remember While Using Dia-1.6B
Here are a few things you should keep in mind while using Dia-1.6B:
- The model is not fine-tuned on a specific voice, so you will get a different voice on every run. You can try fixing the model’s seed to reproduce the results.
- Dia uses a 44.1 kHz sampling rate.
- After installing the libraries, make sure to restart the Colab notebook.
- I got a few errors while generating speech using Hugging Face Spaces; try changing the input text or audio prompt to get the model to work.
Conclusion
The model’s results are very promising, especially when we see what it can do compared to the competition. Its biggest strength is its support for a wide range of non-verbal communication. The model has a distinct tone and the speech feels natural, but on the other hand, since it isn’t fine-tuned on specific voices, it may not be easy to reproduce a particular voice. Like any other generative AI tool, this model should be used responsibly.
Frequently Asked Questions
Q. Can Dia-1.6B generate only single-speaker audio?
A. No, you can use multiple speakers, but you need to add the speaker tags [S1], [S2], [S3]… in the prompt.
Q. Is Dia-1.6B a paid model?
A. No, it’s a completely free-to-use model available on Hugging Face.