May 28, 2025 · Roberto Morales

What if you could call artificial intelligence on the phone?

In this series, we’re diving into how we’re building real-time AI tools. Today, we’re turning a simple phone call into a real conversation with an AI assistant: no screens, no clicks, just speech.

Talking to ChatGPT by Phone: How We Built a Real-Time AI Assistant with GPT-4o, Whisper, and ElevenLabs

Imagine asking ChatGPT anything, not by typing, but just by making a phone call. No browser, no screen. Just your voice… and a real-time response. At Borah Digital, we made it happen.

We built a phone-based AI assistant that lets you have fluid, natural, low-latency conversations by simply calling a phone number (using Twilio).

Yes: you call, and the AI picks up.

How it works: technical architecture

This seamless experience is powered by a beautifully orchestrated system of tools and logic:

  1. Twilio receives the call we make from our personal phone

  2. A Node server collects the audio frames until it detects 2 seconds of silence

  3. We convert the audio from Twilio’s u-law 8kHz format to PCM 16kHz, so OpenAI Whisper can understand it

  4. Whisper transcribes that audio into text, which is then sent to an LLM for interpretation.

  5. The LLM generates a response to our question, which we then send to ElevenLabs.

  6. ElevenLabs turns the LLM response into natural speech, converts it back to u-law 8kHz format, and sends it to Twilio through a WebSocket.

The result: a real-time conversation with an AI that feels… real. Sounds interesting, doesn’t it? Let’s dive into the details.
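
Before diving into each step, here's a minimal sketch of the entry point: a webhook that answers the call with TwiML telling Twilio to open a media stream, plus a WebSocket server that receives the audio frames. This is an illustration of the shape of the setup, assuming Express and the ws package (route, port, and URL are placeholders, not our production values):

// server.js (illustrative sketch; Express and ws assumed)

const express = require('express');
const { WebSocketServer } = require('ws');

const app = express();

// Twilio hits this webhook when the call arrives; the TwiML response
// tells it to open a bidirectional media stream to our WebSocket endpoint.
app.post('/incoming-call', (req, res) => {
  res.type('text/xml').send(
    `<Response>
      <Connect>
        <Stream url="wss://our-domain.example/media" />
      </Connect>
    </Response>`
  );
});

const server = app.listen(3000);
const wss = new WebSocketServer({ server, path: '/media' });

wss.on('connection', (ws) => {
  ws.on('message', (raw) => {
    const msg = JSON.parse(raw);
    if (msg.event === 'media') {
      // msg.media.payload is base64-encoded 8kHz u-law audio, ~20ms per frame
      const frame = Buffer.from(msg.media.payload, 'base64');
      // ...accumulate frames and run the VAD described below...
    }
  });
});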


Transforming 8kHz u-law to 16kHz PCM:

Twilio gives us audio in 8kHz 8-bit u-law, which is perfect for telephony but terrible for AI models.
So, converting it to 16kHz PCM is essential.

For this, we used wavefile, a library that makes working with audio files easy:

// utils/audio.js

const { WaveFile } = require('wavefile');

function ulawToWav(buf) {
  const wav = new WaveFile();
  wav.fromScratch(1, 8000, '8m', buf); // 8kHz, 8-bit u-law
  wav.fromMuLaw();                     // decode u-law into 16-bit PCM at 8kHz
  wav.toBitDepth('16');                // ensure 16-bit samples
  wav.toSampleRate(16000);             // upsample to 16kHz
  return wav.toBuffer();               // WAV buffer, ready for Whisper
}

module.exports = { ulawToWav };

We use PCM 16kHz because most modern ASR models perform really well with this configuration. Increasing the sample rate doesn’t invent new information, but it helps us avoid strange guesses made by Whisper’s internal resampler. (We tried running everything in 8kHz u-law, but Whisper couldn’t understand what was being said over the phone and therefore couldn’t transcribe it.)
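
With the conversion in place, the WAV buffer can go straight to Whisper. Here's a minimal sketch of the transcribe helper used later in the pipeline, assuming the official openai Node SDK (the file name passed to toFile is just a label):

// utils/transcribe.js (illustrative sketch; official openai SDK assumed)

const { OpenAI, toFile } = require('openai');
const { ulawToWav } = require('./audio');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function transcribe(ulawBuffer) {
  const wavBuffer = ulawToWav(ulawBuffer); // 8kHz u-law -> 16kHz PCM WAV
  const result = await openai.audio.transcriptions.create({
    file: await toFile(wavBuffer, 'speech.wav'),
    model: 'whisper-1',
  });
  return result.text;
}

module.exports = { transcribe };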

When returning the response, ElevenLabs already gives us an 8kHz u-law output optimized for Twilio, so there’s no need for re-encoding.
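
That u-law output can be requested straight from the API. Here's a minimal sketch of the helper we invoke as elevenlabs.tts later on, using ElevenLabs' raw HTTP endpoint (the voice ID is a placeholder, and global fetch assumes Node 18+):

// utils/tts.js (illustrative sketch; raw HTTP endpoint, Node 18+ for fetch)

async function tts(text) {
  const voiceId = 'YOUR_VOICE_ID'; // placeholder: pick a voice in your ElevenLabs account
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}?output_format=ulaw_8000`,
    {
      method: 'POST',
      headers: {
        'xi-api-key': process.env.ELEVENLABS_API_KEY,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ text }),
    }
  );
  return Buffer.from(await res.arrayBuffer()); // 8kHz u-law, ready for Twilio
}

module.exports = { tts };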

Voice detection, energy and silence:

One interesting challenge was avoiding mid-sentence cuts, where the user's voice would get chopped off and the LLM would lose context.

To solve this, we applied an ultralight voice activity detector (VAD) based on energy levels:

// websocket.js

const SILENCE_THRESHOLD = 10; // Energy below this level is treated as silence
const SILENCE_DURATION_MS = 2000; // After 2000ms of silence we end the utterance
const MIN_SPEECH_DURATION_MS = 1000; // Require at least 1000ms of speech

function calculateSimpleEnergy(audioBuffer) {
  if (audioBuffer.length === 0) return 0;
  
  let sum = 0;
  for (let i = 0; i < audioBuffer.length; i++) {
    // Rough u-law-to-amplitude conversion
    const sample = audioBuffer[i] ^ 0xFF; // u-law bytes arrive bit-inverted (G.711)
    const linear = sample < 128 ? sample * 2 : (sample - 128) * 4; // crude approximation of linear PCM
    sum += linear; // accumulate amplitude
  }
  
  return sum / audioBuffer.length; // mean of amplitude in the buffer
}

If the value returned by calculateSimpleEnergy (which gives us an energy level approximately every 20ms, once per Twilio frame) is higher than the threshold set in the SILENCE_THRESHOLD constant, we mark the state as SPEAKING. Otherwise, the frame is interpreted as background noise and marked as SILENCE, to be ignored.

Once the energy drops below the SILENCE_THRESHOLD, we start counting silence.
If this silence lasts longer than the time set in SILENCE_DURATION_MS, we send the buffer.
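
Putting the counters together, the per-frame update looks roughly like this (a simplified sketch; FRAME_MS and onFrame are illustrative names, while silenceDuration and totalSpeechDuration are the same counters checked in the snippet further down):

// websocket.js (sketch of the per-frame state update)

const FRAME_MS = 20; // Twilio delivers one media frame roughly every 20ms

let speaking = false;
let silenceDuration = 0;
let totalSpeechDuration = 0;
const speechFrames = [];

function onFrame(frame) {
  const energy = calculateSimpleEnergy(frame);

  if (energy > SILENCE_THRESHOLD) {
    speaking = true;
    totalSpeechDuration += FRAME_MS;
    silenceDuration = 0;         // any voice resets the silence counter
    speechFrames.push(frame);
  } else if (speaking) {
    silenceDuration += FRAME_MS; // count silence only once speech has started
    speechFrames.push(frame);    // keep trailing silence so pauses sound natural
  }
  // frames before any speech are background noise and simply ignored
}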

Additionally, we invert the bits with audioBuffer[i] ^ 0xFF because u-law codewords are transmitted with their bits inverted; this is part of the G.711 encoding, and it's how Twilio's Media Streams deliver them. And while sample < 128 ? sample * 2 : (sample - 128) * 4 is not an accurate u-law decode, it's good enough for now.

As a first version, this VAD is functional (given a stable SNR at 8kHz), although noisier environments might require more advanced solutions like WebRTC-VAD, which can detect whether an audio segment contains speech.

Sending data with asynchronous processes:

Asynchronous logic was essential in this project: transcription, reasoning, and synthesis all have to happen while the phone call keeps streaming audio in real time. To handle this, we implemented the following solution:

// websocket.js

async function processAudioSafely(audioBuffer, ws) {
  try {
    const text  = await transcribe(audioBuffer); // Whisper
    const reply = await chat(text);              // LLM
    const tts   = await elevenlabs.tts(reply);   // ElevenLabs
    sendAudio(tts, ws);                          // stream back
  } catch (err) {
    console.error('Pipeline error:', err);       // log failures so the call stays alive
  }
}
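
For completeness, the chat step is a plain chat-completions call. A minimal sketch, assuming gpt-4o via the official openai SDK (the system prompt is illustrative):

// utils/chat.js (illustrative sketch; official openai SDK assumed)

const { OpenAI } = require('openai');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function chat(text) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a helpful phone assistant. Keep answers short and speakable.' },
      { role: 'user', content: text },
    ],
  });
  return completion.choices[0].message.content;
}

module.exports = { chat };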

Beyond the pipeline itself, all the logic runs in parallel with the incoming stream thanks to a processing flag for each connection.
While the response is being generated, we continue receiving and discarding incoming audio, so back-pressure is effectively zero. In broad terms, here's how we used this flag:

let processing = false;

if (msg.event === 'media' && !processing) {
  // Check whether the buffered audio is ready to be processed
  if (silenceDuration >= SILENCE_DURATION_MS && totalSpeechDuration >= MIN_SPEECH_DURATION_MS) {
    console.log('PROCESSING AUDIO...');
    processing = true;
    // ...

    // Non-blocking parallel processing
    processAudioSafely(audioToProcess, ws)
      .finally(() => {
        processing = false;
        console.log('PROCESSING COMPLETED');
      });
  }
}
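
And sendAudio simply wraps the u-law buffer in the JSON envelope that Twilio's Media Streams expect. A sketch (stashing streamSid on the socket when the stream's 'start' event arrives is our illustrative choice):

// websocket.js (sketch of returning audio to the caller)

function sendAudio(ulawBuffer, ws) {
  ws.send(JSON.stringify({
    event: 'media',
    streamSid: ws.streamSid, // saved earlier from the stream's 'start' event
    media: { payload: ulawBuffer.toString('base64') }, // base64 8kHz u-law
  }));
}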

Results:

  • Conversation latency is around one second, virtually unnoticeable for the user.

  • The simple energy threshold combined with a variable buffer size based on that value works surprisingly well.

At Borah, we believe that voice is still the most natural interface.
With just a few lines of code, signal processing, and AI, we've turned a phone number into a virtual assistant ready to become your next customer service "employee."

Would you like to implement this in your business?
Write to us; we’d love to give voice to your ideas.