28 May 2025
·
Roberto Morales
What if you could call artificial intelligence on the phone?
In this series, we’re diving into how we’re building real-time AI tools. Today, we’re turning a simple phone call into a real conversation with an AI assistant: no screens, no clicks, just speech.
Talking to ChatGPT by Phone: How We Built a Real-Time AI Assistant with GPT-4o, Whisper, and ElevenLabs
Imagine asking ChatGPT anything, not by typing, but by making a phone call. No browser, no screen. Just your voice… and a real-time response. At Borah Digital, we made it happen.
We built a phone-based AI assistant that lets you have fluid, natural, low-latency conversations simply by calling a phone number (using Twilio).
Yes: you call, and the AI picks up.
How it works: technical architecture
This seamless experience is powered by a beautifully orchestrated system of tools and logic:
Twilio receives the call we make from our personal phone.
A Node server collects the audio frames until it detects 2 seconds of silence.
We convert the audio from Twilio’s u-law 8kHz format to PCM 16kHz, so OpenAI Whisper can understand it.
Whisper transcribes that audio into text, which is then sent to an LLM for interpretation.
Using the LLM, we generate a response to our question and send it to ElevenLabs.
ElevenLabs turns the LLM response into natural speech, converts it back to u-law 8kHz format, and sends it to Twilio through a WebSocket.
The result: a real-time conversation with an AI that feels… real. Sounds interesting, doesn’t it? Let’s dive into the details.
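Before the deep dive, here’s a minimal sketch of the entry point that ties everything together, assuming the ws library (handleFrame is the function we build in the voice-detection section below):

```js
const WebSocket = require('ws');

// Twilio Media Streams connects here and sends JSON messages over the socket.
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (raw) => {
    const msg = JSON.parse(raw);
    if (msg.event === 'start') {
      ws.streamSid = msg.start.streamSid; // needed later to send audio back
    } else if (msg.event === 'media') {
      // One ~20ms frame of 8kHz 8-bit u-law audio, base64-encoded
      handleFrame(ws, Buffer.from(msg.media.payload, 'base64'));
    }
  });
});
```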
Transformations from 8kHz u-law to 16kHz PCM:
Twilio gives us audio in 8kHz 8-bit u-law, which is perfect for telephony but terrible for AI models.
So, converting it to 16kHz PCM is essential.
For this, we used wavefile, a library that makes working with audio files easy:
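In essence, the conversion looks like this (a simplified sketch of our helper):

```js
const { WaveFile } = require('wavefile');

// Convert one buffered utterance from Twilio's 8kHz 8-bit u-law
// into a 16kHz 16-bit PCM WAV that Whisper handles well.
function muLawToPcm16k(muLawBuffer) {
  const wav = new WaveFile();
  wav.fromScratch(1, 8000, '8m', muLawBuffer); // mono, 8kHz, u-law ('8m')
  wav.fromMuLaw();                             // decode u-law -> 16-bit PCM
  wav.toSampleRate(16000);                     // upsample to 16kHz
  return Buffer.from(wav.toBuffer());          // a complete WAV file
}
```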
We use PCM 16kHz because most modern ASR models perform really well with this configuration. Increasing the sample rate doesn’t invent new information, but it helps us avoid strange guesses made by Whisper’s internal resampler. (We tried running everything in 8kHz u-law, but Whisper couldn’t understand what was being said over the phone and therefore couldn’t transcribe it.)
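Once converted, the WAV buffer goes straight to Whisper. A minimal sketch, assuming the official openai Node SDK:

```js
const OpenAI = require('openai');
const { toFile } = require('openai');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Transcribe a 16kHz PCM WAV buffer with Whisper.
async function transcribe(wavBuffer) {
  const result = await openai.audio.transcriptions.create({
    file: await toFile(wavBuffer, 'utterance.wav'),
    model: 'whisper-1',
  });
  return result.text;
}
```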
When returning the response, ElevenLabs already gives us an 8kHz u-law output optimized for Twilio, so there’s no need for re-encoding.
Voice detection, energy and silence:
One interesting challenge was avoiding mid-sentence cuts, where the user's voice would get chopped off and the LLM would lose context.
To solve this, we applied an ultralight voice activity detector (VAD) based on energy levels:
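A simplified sketch (the concrete threshold and timing values here are placeholders, not our tuned ones):

```js
const SILENCE_THRESHOLD = 15;      // energy threshold (placeholder value)
const SILENCE_DURATION_MS = 2000;  // this much silence ends an utterance
const FRAME_MS = 20;               // Twilio sends a frame roughly every 20ms

// Rough energy estimate for one u-law frame.
function calculateSimpleEnergy(audioBuffer) {
  let total = 0;
  for (let i = 0; i < audioBuffer.length; i++) {
    const sample = audioBuffer[i] ^ 0xFF; // Twilio delivers the bits inverted
    // Not an exact u-law decode, but close enough for an energy estimate
    total += sample < 128 ? sample * 2 : (sample - 128) * 4;
  }
  return total / audioBuffer.length;
}

// For clarity these are globals (one call at a time); per-connection in practice.
let state = 'SILENCE';
let silenceMs = 0;
const speechBuffer = [];

function handleFrame(ws, frame) {
  if (calculateSimpleEnergy(frame) > SILENCE_THRESHOLD) {
    state = 'SPEAKING';
    silenceMs = 0;
    speechBuffer.push(frame);
  } else if (state === 'SPEAKING') {
    speechBuffer.push(frame);
    silenceMs += FRAME_MS;
    if (silenceMs >= SILENCE_DURATION_MS) {
      // Enough silence: ship the buffered utterance downstream
      onUtterance(ws, Buffer.concat(speechBuffer.splice(0)));
      state = 'SILENCE';
      silenceMs = 0;
    }
  }
  // Frames arriving while in SILENCE are treated as background noise and ignored
}
```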
If the value returned by calculateSimpleEnergy, which gives us an energy level roughly every 20ms, is higher than the threshold set in the SILENCE_THRESHOLD constant, we mark the state as SPEAKING. Otherwise it's interpreted as background noise, marked as SILENCE, and ignored. Once the energy drops below SILENCE_THRESHOLD, we start counting silence, and if it lasts longer than the time set in SILENCE_DURATION_MS, we send the buffer.
Additionally, we invert the bits with audioBuffer[i] ^ 0xFF, since Twilio's Media Streams delivers them in reverse. And while sample < 128 ? sample * 2 : (sample - 128) * 4 isn't an accurate u-law decode, it's good enough for now.
As a first version, this VAD is functional (with a stable SNR at 8kHz), although for noisier environments we might need a more advanced solution like WebRTC-VAD, which can detect whether speech is present in an audio segment.
Sending data with asynchronous processes:
Using asynchronous logic was essential in this project due to the real-time nature of phone calls (transcription, reasoning, synthesis…). To handle this, we implemented the following solution:
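In outline, each utterance makes one async pass through the whole chain (helper names here are illustrative, not our production code):

```js
// One utterance = one pass through transcription, reasoning, and synthesis.
async function runPipeline(ws, muLawAudio) {
  const wav = muLawToPcm16k(muLawAudio);         // 8kHz u-law -> 16kHz PCM WAV
  const question = await transcribe(wav);         // Whisper: speech -> text
  const answer = await generateReply(question);   // LLM: question -> response
  const speech = await synthesizeSpeech(answer);  // ElevenLabs: text -> 8kHz u-law
  // Twilio expects outbound audio as a base64 payload in a 'media' message
  ws.send(JSON.stringify({
    event: 'media',
    streamSid: ws.streamSid, // captured from the 'start' event
    media: { payload: speech.toString('base64') },
  }));
}
```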
In addition to this, all the logic runs in parallel thanks to a processing flag for each connection.
While the response is being generated, we continue receiving and discarding incoming audio, so back-pressure is effectively zero. In broad terms, here’s how we used this flag:
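Simplified, the flag works like this:

```js
// One `processing` flag per connection gates the whole pipeline.
async function onUtterance(ws, muLawAudio) {
  if (ws.processing) return; // a response is in flight: drop this audio
  ws.processing = true;
  try {
    await runPipeline(ws, muLawAudio); // transcribe -> reason -> synthesize -> send
  } finally {
    ws.processing = false;             // ready for the caller's next question
  }
}
```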
Results:
Conversation latency is around one second, virtually unnoticeable to the user.
The simple energy threshold combined with a variable buffer size based on that value works surprisingly well.
At Borah, we believe that voice is still the most natural interface.
With just a few lines of code, signal processing, and AI, we've turned a phone number into a virtual assistant ready to become your next customer service "employee."
Would you like to implement this in your business?
Write to us; we’d love to give voice to your ideas.