This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: a perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, even more so when the spoken language is not English, things get a lot more complicated. You need high-quality transcription with reliable speaker identification, precise timestamps, and all of that at a reasonable cost. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable, production-ready transcription pipeline using Google's Vertex AI and Gemini models. From unexpected model limitations to budget considerations and timestamp drift disasters, we walk you through the real challenges, and how we solved them.
Whether you are building your own audio processing tool or simply curious about what happens "under the hood" of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this situation, and lasting from a few minutes to over an hour. The final output was expected to be just a raw transcript, but it had to reflect the natural spoken dialogue, written in a "book-like" dialogue style, ensuring both a faithful transcription of the original audio content and readability.
Before diving into development, we conducted a short market review of existing solutions, but the results were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and in most cases, both at once. At that point, we realized a custom pipeline would be necessary.
Because our organization is invested in the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as "Chirp," "Latestlong," or "Phone call," whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
First attempts and limitations
We started our project by evaluating all these models on our use case. However, after extensive testing, we quickly came to the following conclusion: no Vertex AI service fully meets the complete set of requirements and would allow us to achieve our goal in a simple and effective manner. There was always at least one missing specification, usually on timestamping or diarization.
The poor Google documentation, it must be said, cost us a significant amount of time during this initial research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The entire set of requirements could not be fulfilled by a single Google service, and a custom implementation of a Vertex AI S2T service had to be developed.
We presented our initial work and decided to continue exploring two strategies:
- Use Chirp2 to generate the transcription and timestamping of long audio files, then use Gemini for diarization.
- Use Gemini 2.0 Flash for transcription and diarization, even though the timestamping is approximate and the output token limit requires looping.
In parallel with these investigations, we also had to consider the financial aspect. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to think about, audio can be quite costly. We therefore included this parameter from the beginning of our exploration to avoid ending up with a solution that worked but was too expensive to run in production.
Deep dive into transcription with Chirp2
We started with a deeper investigation of the Chirp2 mannequin since it’s thought-about because the “finest at school” Google S2T service. An easy software of the documentation supplied the anticipated outcome. The mannequin turned out to be fairly efficient, providing good transcription with word-by-word timestamping based on the next output in json format:
"transcript":"Oui, en effet",
"confidence":0.7891818284988403
"phrases":[
{
"word":"Oui",
"start-offset":{
"seconds":3.68
},
"end-offset":{
"seconds":3.84
},
"confidence":0.5692862272262573
}
{
"word":"en",
"start-offset":{
"seconds":3.84
},
"end-offset":{
"seconds":4.0
},
"confidence":0.758037805557251
},
{
"word":"effet",
"start-offset":{
"seconds":4.0
},
"end-offset":{
"seconds":4.64
},
"confidence":0.8176857233047485
},
]
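To give an idea of how this word-level output can be consumed downstream, here is a minimal sketch (not our production code) that groups Chirp2-style word timings into timestamped text segments; the field names match the JSON above and the grouping rule is an illustrative assumption.

import json

def words_to_segments(response_json: str, max_gap: float = 1.0):
    """Group word timings into timestamped segments.

    A new segment starts whenever the silence between two consecutive
    words exceeds `max_gap` seconds. Simplified illustration only.
    """
    data = json.loads(response_json)
    segments, current, seg_start, last_end = [], [], None, None

    for w in data["words"]:
        start = w["start-offset"]["seconds"]
        end = w["end-offset"]["seconds"]
        if current and start - last_end > max_gap:
            segments.append((seg_start, last_end, " ".join(current)))
            current, seg_start = [], None
        if seg_start is None:
            seg_start = start
        current.append(w["word"])
        last_end = end

    if current:
        segments.append((seg_start, last_end, " ".join(current)))
    return segments  # e.g. [(3.68, 4.64, "Oui en effet")]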
However, a new requirement was added to the project by the operational team: the transcription must be as faithful as possible to the original audio content and include small filler words, interjections, onomatopoeia and even mumbling that can add meaning to a conversation, and that often come from the non-speaking participant, either at the same time or towards the end of a sentence of the speaking one. We are talking about words like "oui oui," "en effet" but also simple expressions like (hmm, ah, etc.), so typical of the French language! It is actually not uncommon to validate or, more rarely, oppose someone's point with a simple "Hmm Hmm". Upon analyzing the Chirp2 transcriptions, we noticed that while some of these small words were present, many of these expressions were missing. First drawback for Chirp2.
The main challenge of this approach lies in the reconstruction of the speakers' sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This method could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We requested a TSV format, an ideal structured format for transcription: "human readable" for fast quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:
First line with the speaker mapping:
Diarization\tSpeaker_1:speaker_name\tSpeaker_2:speaker_name\tSpeaker_3:speaker_name\tSpeaker_4:speaker_name, etc.
Then the transcription in the following format:
speaker_id\ttime_start\ttime_stop\ttext with:
- speaker_id: Numeric speaker ID (e.g., 1, 2, etc.)
- time_start: Segment start time in the format 00:00:00
- time_stop: Segment end time in the format 00:00:00
- text: Transcribed text of the dialogue segment
An example output:
Diarization Speaker_1:Lea Finch Speaker_2:David Albec
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Fine thanks.
1 00:04:00 00:07:00 So, let's start the interview
2 00:07:00 00:08:00 All right.
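As an aside, part of the appeal of this TSV structure is how easily it can be parsed downstream. A minimal sketch (the helper name is ours, tab-separated fields as in the format above):

def parse_tsv_transcript(raw: str):
    """Parse the diarization header and the tab-separated segments."""
    lines = [l for l in raw.strip().splitlines() if l.strip()]

    # First line, e.g. "Diarization\tSpeaker_1:Lea Finch\tSpeaker_2:David Albec"
    speakers = {}
    for field in lines[0].split("\t")[1:]:
        speaker_id, name = field.split(":", 1)
        speakers[speaker_id.split("_")[1]] = name  # "1" -> "Lea Finch"

    segments = []
    for line in lines[1:]:
        speaker_id, start, stop, text = line.split("\t", 3)
        segments.append({"speaker": speakers.get(speaker_id, speaker_id),
                         "start": start, "stop": stop, "text": text})
    return segments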
A simple version of the context provided to the LLM:
Here is the interview of David Albec, professional football player, by journalist Lea Finch
The result was fairly good, with what appeared to be accurate diarization and sentence reconstruction. However, instead of getting the exact same text back, it came out slightly modified in several places. Our conclusion was that, despite our clear instructions, Gemini probably does more than just diarization and actually performed a partial re-transcription.
We also evaluated at this point the cost of transcription with this strategy. Below is the approximate calculation based only on audio processing:
Chirp2 cost / min: $0.016
Gemini 2.0 Flash cost / min: $0.001875
Price / hour: $1.0725
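The hourly figure is simply the sum of the two per-minute rates multiplied by 60; a quick sanity check in Python:

CHIRP2_PER_MIN = 0.016            # USD per minute of audio
GEMINI_FLASH_PER_MIN = 0.001875   # USD per minute of audio

price_per_hour = (CHIRP2_PER_MIN + GEMINI_FLASH_PER_MIN) * 60
print(price_per_hour)             # 1.0725 USD per hour of interview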
Chirp2 is indeed quite "expensive", about ten times more than Gemini 2.0 Flash at the time of writing, and it still requires the audio to be processed by Gemini for diarization. We therefore decided to put this strategy aside for now and find a way using the brand-new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.
Next: exploring audio transcription with Gemini 2.0 Flash
We provided Gemini with both the interview context and the audio file, requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition to the usual elements any prompt engineer would include, we emphasized several key instructions essential for ensuring a high-quality transcription (comments follow the "=>" markers); a simplified sketch of such a prompt is shown after the list:
- Transcribe interjections and onomatopoeia, even mid-sentence.
- Preserve the full expression of words, including slang, insults, or inappropriate language. => the model tends to change words it considers inappropriate. For this specific point, we had to ask Google to deactivate the safety filters on our Google Cloud project.
- Build complete sentences, paying particular attention to changes of speaker mid-sentence, for example when one speaker finishes another's sentence or interrupts. => Such errors affect diarization and accumulate throughout the transcript until the context is strong enough for the LLM to correct them.
- Normalize prolonged words or interjections like "euuuuuh" to "euh," and not "euh euh euh euh euh …" => this was a classic bug we kept encountering, called the "repetition bug", and it is discussed in more detail below.
- Identify speakers by voice tone while using the context to determine who is the journalist and who is the interviewee. => in addition, we can pass the identity of the first speaker in the prompt.
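For illustration only, here is a heavily simplified sketch of how these instructions can be assembled into a prompt; the wording and variable names are placeholders, not our production prompt.

TRANSCRIPTION_PROMPT = """You are transcribing a French interview.
Context: {context}
The first speaker is: {first_speaker}

Rules:
- Output TSV only: speaker_id<TAB>time_start<TAB>time_stop<TAB>text.
- First line: "Diarization" followed by the Speaker_N:name mappings.
- Transcribe interjections and onomatopoeia (hmm, ah, euh), even mid-sentence.
- Keep slang and inappropriate language exactly as spoken.
- Rebuild full sentences and watch for mid-sentence speaker changes.
- Normalize prolonged interjections ("euuuuuh" -> "euh"), never repeat them.
- Identify speakers by voice tone; use the context to tell journalist from guest.
"""

def build_prompt(context: str, first_speaker: str) -> str:
    return TRANSCRIPTION_PROMPT.format(context=context, first_speaker=first_speaker)

# The resulting prompt is then sent to Gemini together with the audio file.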
Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete… until we tried longer files.
Dealing with long audio and LLM token limitations
Our early tests on short audio clips were encouraging, but scaling the process to longer audio quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. Processing files longer than just a few minutes indeed revealed a series of challenges related to model constraints, token limits, and output reliability:
- One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (8,192 output tokens), forcing us to implement a looping mechanism that repeatedly calls Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file.
Here is an example of the continuation prompt we used:
Continue transcribing the audio interview from the previous result. Start processing the audio file from the previously generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content, which is provided between the following tags <previous_result>.
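A rough sketch of that looping mechanism, assuming a hypothetical call_gemini(prompt, audio_bytes) wrapper around the Gemini API that returns the generated text and a flag telling whether generation stopped on the output token limit:

def transcribe_with_continuation(audio_bytes: bytes, base_prompt: str,
                                 continuation_prompt: str, max_loops: int = 10) -> str:
    """Repeatedly call Gemini, resending the previous result, until the model
    stops because it reached the end of the audio rather than the token limit."""
    transcript = ""
    for _ in range(max_loops):
        if not transcript:
            prompt = base_prompt
        else:
            prompt = (f"{base_prompt}\n{continuation_prompt}\n"
                      f"<previous_result>{transcript}</previous_result>")
        chunk_text, hit_token_limit = call_gemini(prompt, audio_bytes)  # hypothetical wrapper
        transcript += chunk_text
        if not hit_token_limit:   # finished before the output limit: audio fully transcribed
            break
    return transcript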
- Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. While a drift of a few seconds was considered compatible with our intended use, a drift of several minutes made timestamping useless.
Our initial tests on short audios of a few minutes resulted in a maximum drift of 5 to 10 seconds, and significant drift was usually observed after the first loop, once the output token limit had been reached. We concluded from these experimental observations that, while this looping approach ensures continuity in transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamp accuracy.
- We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or sentence over dozens of lines. This behavior made entire portions of the transcript unusable and typically looked something like this:
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Fine thanks.
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks.
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks.
etc.
This bug seems erratic but appears more frequently with medium-quality audio: strong background noise or a distant speaker, for example. And "in the field", this is often the case. Likewise, speaker hesitations or word repetitions seem to trigger it. We still do not know exactly what causes this "repetition bug". The Google Vertex team is aware of it but has not provided a clear explanation.
The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. Unsurprisingly, the longer the audio file, the higher the likelihood of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.
- To make things worse, resuming transcription after an output-token "cutoff" required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without re-outputting the earlier transcription), meaning we were billed for the full audio length on every resend.
In practice, we found that the token limit was typically reached between the 15th and 20th minute of audio. As a result, transcribing a one-hour interview often required 4 to 5 separate LLM calls, leading to a total billing equivalent of 4 to 5 hours of audio for a single file.
With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio is billed as 15 minutes, in a single LLM call, a 1-hour file could effectively cost 4 hours, and a 2-hour file could increase to 16 hours, following a near-quadratic pattern (≈ 4x², where x = number of hours).
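A back-of-the-envelope model of that billing behavior, assuming the output limit is hit roughly every 15 minutes of audio and that every call is billed for the full file (the figures, not the code, come from our tests):

import math

def billed_hours(audio_hours: float, minutes_per_call: float = 15.0) -> float:
    """Each call re-processes (and bills) the whole file; a new call is needed
    roughly every `minutes_per_call` minutes of transcribed audio."""
    calls = math.ceil(audio_hours * 60 / minutes_per_call)
    return calls * audio_hours

print(billed_hours(0.25))  # 0.25 -> one call, billed 15 minutes
print(billed_hours(1.0))   # 4.0  -> 4 calls, billed 4 hours
print(billed_hours(2.0))   # 16.0 -> 8 calls, billed 16 hours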
This made long audio processing not just unreliable, but also expensive.
Pivoting to Chunked Audio Transcription
Given these major limitations, and being far more confident in the ability of the LLM to handle text-based tasks than audio, we decided to shift our approach and isolate the audio transcription step in order to maintain high transcription quality. A high-quality transcription is indeed the key part of the need, and it makes sense to put this part of the process at the core of the strategy.
At this point, splitting the audio into chunks became the obvious solution. Not only did it seem likely to greatly improve timestamp accuracy, by avoiding the LLM timestamping degradation after looping and the cumulative drift, but it would also reduce cost, since each chunk would ideally be processed only once. While it introduced new uncertainties around merging partial transcriptions, the trade-off seemed to work in our favor.
We thus focused on breaking long audio into shorter chunks that could be handled in a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks for safety. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which can hurt diarization.
Although this setup drastically reduced the repetition bug, we noticed that it still occurred occasionally. Wanting to provide the best possible service, we definitely wished to find an efficient countermeasure to this problem, and we identified an opportunity in the previously annoying output token limit: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure the repetition bug had occurred and could restart that chunk's transcription. This pragmatic strategy turned out to be very effective at detecting and avoiding the bug. Great news.
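A sketch of that detection logic, again using the hypothetical call_gemini wrapper introduced earlier; the retry count is an illustrative choice:

def transcribe_chunk(chunk_audio: bytes, prompt: str, max_retries: int = 3) -> str:
    """Transcribe a single 10-minute chunk. Hitting the output token limit on
    such a short chunk almost certainly means the repetition bug occurred,
    so we simply retry the chunk from scratch."""
    for attempt in range(max_retries):
        text, hit_token_limit = call_gemini(prompt, chunk_audio)  # hypothetical wrapper
        if not hit_token_limit:
            return text
    raise RuntimeError("Repetition bug suspected on every attempt for this chunk")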
Correcting audio chunk transcripts
With good transcripts of 10-minute audio chunks in hand, we implemented at this stage an algorithmic post-processing of each transcript to deal with minor issues (a sketch of this post-processing follows the list):
- Removal of wrapper tags like "tsv" or "json" added at the beginning and end of the transcription content:
Despite optimizing the prompt, we could not fully eliminate this side effect without hurting the transcription quality. Since this is easily handled algorithmically, we chose to do so.
- Replacing speaker IDs with names:
Speaker identification by name only starts once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may have a different ID order (the first person to talk being speaker 1), this could create confusion during merging. We therefore instructed the LLM to only use IDs and provide a diarization mapping in the first line during the transcription process. The speaker IDs are then replaced during the algorithmic correction and the diarization header line removed.
- Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user: "formatting issue on this line", so users are at least aware of a potential content loss and can correct it by hand if needed. In our final optimized version, such lines were extremely rare.
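A condensed sketch of these three corrections; the regular expression and the wording of the user note are illustrative, not our exact implementation.

import re

def clean_chunk_transcript(raw: str) -> list[str]:
    """Remove code-fence wrappers, replace numeric speaker IDs with names,
    and flag malformed lines."""
    # 1. Strip wrapper tags such as ```tsv / ```json around the output
    raw = re.sub(r"^```(?:tsv|json)?\s*|\s*```$", "", raw.strip())

    lines = [l for l in raw.splitlines() if l.strip()]
    # 2. Read the diarization header, e.g. "Diarization\tSpeaker_1:Lea Finch\t..."
    speakers = dict(
        (field.split(":", 1)[0].split("_")[1], field.split(":", 1)[1])
        for field in lines[0].split("\t")[1:]
    )

    cleaned = []
    for line in lines[1:]:
        parts = line.split("\t", 3)
        if len(parts) != 4 or not parts[3].strip():
            # 3. Malformed or empty line: drop it but leave a trace for the user
            cleaned.append("## formatting issue on this line ##")
            continue
        speaker_id, start, stop, text = parts
        cleaned.append("\t".join([speakers.get(speaker_id, speaker_id), start, stop, text]))
    return cleaned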
Merging chunks and maintaining content continuity
At the audio chunking stage, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to the loss of words or even full sentences at the cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the overlap size to the chunk-merging process.
Without a clean cut between chunks, the possibility of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines can come out quite differently, with breaks at different points of the sentences and even filler words or hesitations being rendered differently. In such a situation, it is complex, not to say impossible, to write an effective algorithm for a clean merge.
This left us with the LLM option, of course. A few quick tests showed the LLM could better merge segments together when overlaps included full sentences. A 30-second overlap proved sufficient. With a 10-minute audio chunk structure, this implies the following chunk cuts (a small sketch for computing these boundaries follows the list):
- 1st transcript: 0 to 10 minutes
- 2nd transcript: 9m30s to 19m30s
- 3rd transcript: 19m to 29m… and so on.
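A sketch of how these boundaries can be computed (pure arithmetic, durations in seconds):

def chunk_boundaries(total_seconds: float, chunk: float = 600.0, overlap: float = 30.0):
    """Yield (start, end) cut points: 10-minute chunks, each new chunk starting
    30 seconds before the previous one ended."""
    step = chunk - overlap          # 570 s between consecutive chunk starts
    start = 0.0
    while True:
        end = min(start + chunk, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start += step

print(list(chunk_boundaries(28 * 60 + 42)))   # the 28 min 42 s interview mentioned later
# [(0.0, 600.0), (570.0, 1170.0), (1140.0, 1722)]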

These overlapping chunk transcripts were corrected by the previously described algorithm and sent to the LLM for merging, in order to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the full merged transcript in TSV format, as in the previous LLM transcription step. In this configuration, the merging process has essentially three quality criteria:
- Ensure transcription continuity without content loss or duplication.
- Adjust timestamps to resume from where the previous chunk ended.
- Preserve diarization.
As expected, the output token limit was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in most cases but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we could not achieve sufficiently reliable results for production use.
As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred, even thousands, of text lines containing the set of partial transcripts to fuse, a roughly similar amount with the previous transcript, and a few more with the prompt and its example. Definitely too much for a precise application of our set of instructions.
On the plus side, timestamp accuracy did indeed improve significantly with this chunking approach: we maintained a drift of just 5 to 10 seconds at most on transcriptions over an hour. Since the start of a transcript should have minimal drift in timestamping, we instructed the LLM to use the timestamps of the "ending chunk" as the reference for the fusion and to correct any drift by a second per sentence. This made the cut points seamless and kept the overall timestamp accuracy.
Splitting the chunk transcripts for full transcript reconstruction
In a modular approach similar to the workaround we used for transcription, we decided to carry out the merge of the transcripts separately, in order to avoid the previously described issues. To do so, each 10-minute transcript is split into three parts based on the start_time of the segments (see the sketch after the note below):
- Overlap segment to merge at the beginning: 0 to 1 minute
- Main segment to keep: 1 to 9 minutes
- Overlap segment to merge at the end: 9 to 10 minutes
NB: Since each chunk, including the first and last ones, is processed the same way, the overlap at the beginning of the first chunk is directly merged with the main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.
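A sketch of this split, assuming each segment has been parsed into a dict carrying a numeric start time in seconds relative to its chunk (the field name is ours):

def split_chunk_transcript(segments, chunk_minutes: float = 10.0, overlap_minutes: float = 1.0):
    """Split one chunk's segments into (start_overlap, main, end_overlap)
    based on each segment's start time within the chunk."""
    low = overlap_minutes * 60                      # 0-1 min  -> start overlap
    high = (chunk_minutes - overlap_minutes) * 60   # 9-10 min -> end overlap

    start_overlap = [s for s in segments if s["start"] < low]
    main = [s for s in segments if low <= s["start"] < high]
    end_overlap = [s for s in segments if s["start"] >= high]
    return start_overlap, main, end_overlap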
The end and beginning segments are then sent in pairs to be merged. As expected, the quality of the output drastically increased, resulting in an efficient and reliable merge between the transcript chunks. With this procedure, the response of the LLM proved to be highly reliable and showed none of the previously mentioned errors encountered during the looping process.
The process of transcript assembly for an audio of 28 minutes 42 seconds:

Full transcript reconstruction
At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternating between the two.
Overall process overview
The overall process involves 6 steps, 2 of which are performed by Gemini (a high-level orchestration sketch follows the list):
- Chunking the audio into overlapping audio chunks
- Transcribing each chunk into a partial text transcript (LLM step)
- Correcting the partial transcripts
- Splitting the audio chunk transcripts into start, main, and end text splits
- Fusing the end and start splits of each pair of chunk splits (LLM step)
- Reconstructing the full transcript
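Putting the previous sketches together, the whole pipeline can be expressed at a high level as follows. Every helper not defined in an earlier sketch (audio_duration, slice_audio, parse_segments, merge_overlaps_with_llm, format_tsv) is hypothetical; this is an outline of the flow, not our production code.

def transcribe_interview(audio: bytes, context: str, first_speaker: str) -> str:
    prompt = build_prompt(context, first_speaker)

    # 1. Chunk the audio into overlapping 10-minute chunks
    cuts = chunk_boundaries(audio_duration(audio))                      # audio_duration: hypothetical
    chunks = [slice_audio(audio, start, end) for start, end in cuts]    # slice_audio: hypothetical

    # 2. Transcribe each chunk (LLM step), then 3. correct each partial transcript
    transcripts = [clean_chunk_transcript(transcribe_chunk(c, prompt)) for c in chunks]

    # 4. Split each chunk transcript into start / main / end parts
    splits = [split_chunk_transcript(parse_segments(t)) for t in transcripts]   # parse_segments: hypothetical

    # 5. Fuse the end split of each chunk with the start split of the next (LLM step)
    junctions = [merge_overlaps_with_llm(splits[i][2], splits[i + 1][0])        # hypothetical LLM merge call
                 for i in range(len(splits) - 1)]

    # 6. Reconstruct the full transcript by alternating main parts and merged junctions
    pieces = [format_tsv(splits[0][0])]            # opening overlap of the first chunk, kept as-is
    for i, (_, main, _) in enumerate(splits):
        pieces.append(format_tsv(main))
        if i < len(junctions):
            pieces.append(format_tsv(junctions[i]))
    pieces.append(format_tsv(splits[-1][2]))       # closing overlap of the last chunk
    return "\n".join(pieces)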

The overall process takes about 5 minutes per hour of transcription delivered to the user through an asynchronous tool. Quite reasonable considering the amount of work done behind the scenes, and all this for a fraction of the price of other tools or pre-built Google models like Chirp2.
One additional improvement that we considered but ultimately decided not to implement was timestamp correction. We observed that timestamps at the end of each chunk typically ran about 5 seconds ahead of the actual audio. A straightforward solution could have been to incrementally adjust the timestamps algorithmically by roughly one second every two minutes to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.
Conclusion
Building a high-quality, scalable transcription pipeline for long interviews turned out to be far more complex than simply picking the "right" Speech-to-Text model. Our journey with Google's Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to capture the full information of a recording.
Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.
There is still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we are optimistic about even more streamlined solutions in the near future.
Key takeaways
- No Vertex AI S2T model met all our needs: Google Vertex AI offers specialized models, but each one has limitations in terms of transcription accuracy, diarization, or timestamping for long audio.
- Token limits and long prompts drastically impact transcription quality: Gemini's output token limit significantly degrades transcription quality for long audio, requiring heavily prompted looping strategies and finally forcing us to shift to shorter audio chunks.
- Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency: Splitting the audio into 10-minute overlapping segments minimized critical bugs like repeated sentences and timestamp drift, enabling higher-quality results at drastically reduced costs.
- Careful prompt engineering remains essential: Precision in prompts, especially regarding diarization and interjections for transcription, as well as transcript fusion, proved crucial for reliable LLM performance.
- Fusing short transcript splits maximizes reliability: Splitting each chunk transcript into smaller segments, with end-to-start merging of the overlaps, provided high accuracy and avoided common LLM issues like hallucinations or incorrect formatting.