How We Built Echo Cancellation with WebRTC AEC3

If you've ever tried recording a meeting and running speech-to-text on it, you've probably hit a frustrating problem: the remote speaker's voice comes through your speakers, gets picked up by your microphone, and ends up in the transcription twice. The speech-to-text engine can't tell the difference between you speaking and the echo of the other person's voice bouncing off your walls and into your mic.

This is called acoustic echo, and canceling it in real time is the single hardest technical problem we had to solve to build Raven.

The problem in detail

In a typical video call, audio flows in two directions. The remote speaker's voice arrives at your computer, gets played through your speakers (or headphones), and is heard by you. Simultaneously, your voice is captured by your microphone and sent to the remote speaker.

The problem arises because microphones aren't perfectly directional. Even with headphones, some audio leaks. With laptop speakers, the leakage is massive - the mic picks up a significant portion of the speaker output. If you're capturing both system audio and mic audio for transcription, you end up with the remote speaker's voice in both streams.

Without echo cancellation, a transcription engine processes the remote speaker's words from the system audio (correct), and then processes them again from the mic audio (incorrect - that's just echo). You get duplicate, garbled transcripts that are useless.

Our solution: GStreamer + WebRTC AEC3

We use the WebRTC AEC3 engine for echo cancellation. This is the same acoustic echo canceller that runs inside Google Chrome when you make a WebRTC call. It's been battle-tested by billions of Chrome users across every imaginable hardware configuration: laptop speakers, external monitors, Bluetooth headphones, conference room setups.

The AEC3 engine is exposed through GStreamer's audio processing plugins: webrtcechoprobe and webrtcdsp.

Here's the pipeline:

Step 1: Capture two streams. Raven's native audio capture binary simultaneously records system audio (what the remote speaker is saying) and microphone audio (what you're saying). On macOS, this is a Swift process using ScreenCaptureKit for system audio and CoreAudio for the mic. On Windows, it's a Rust module using WASAPI loopback capture for system audio and standard WASAPI capture for the mic.

Step 2: Feed the reference signal. The system audio stream is sent to webrtcechoprobe. This component doesn't modify the audio - it simply tells the AEC3 engine: "This is the sound that was played through the speakers and might leak into the microphone."

Step 3: Cancel the echo. The microphone audio stream is sent through webrtcdsp. This component uses the reference signal from step 2 to identify and subtract the echo from the mic signal. The AEC3 algorithm adapts in real time to the acoustic characteristics of your environment - room size, speaker placement, mic sensitivity, and more.

Step 4: Output clean streams. After processing, we have two clean audio streams: the original system audio (untouched) and the echo-cancelled mic audio. These are sent over two separate WebSocket connections to Deepgram Nova-3 for transcription.
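To make the "identify and subtract" idea in Step 3 concrete, here is a toy sketch in plain Python. It uses a normalized LMS (NLMS) adaptive filter, which is a much simpler technique than AEC3 and is not what Raven or WebRTC actually runs. The echo path, filter length, and step size are all made-up values for illustration.

```python
import random


def simulate_echo(reference, echo_path):
    """Convolve the reference with a short impulse response (a stand-in for the room)."""
    return [
        sum(echo_path[k] * reference[n - k]
            for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo from the mic signal."""
    weights = [0.0] * taps   # current guess at the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate   # what remains after subtraction
        norm = sum(x * x for x in history) + eps
        step = mu * error / norm
        weights = [w + step * x for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned, weights


def power(samples):
    return sum(v * v for v in samples) / len(samples)


rng = random.Random(0)
far_end = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # stand-in for system audio
echo_path = [0.0, 0.5, 0.3, -0.1]                      # hypothetical speaker-to-mic leak
mic = simulate_echo(far_end, echo_path)                # echo only, no near-end speech
cleaned, weights = nlms_cancel(far_end, mic)

tail_before = power(mic[-1000:])
tail_after = power(cleaned[-1000:])
print(f"echo power before: {tail_before:.4f}, after: {tail_after:.2e}")
```

In this idealized setup (white-noise reference, no near-end speech, a fixed four-tap echo path) the filter weights settle close to the simulated echo path and the residual shrinks to a small fraction of the original echo. Real rooms, real speech, and simultaneous talkers are far harder, which is why a production canceller like AEC3 is so much more involved.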

Building the native addon

GStreamer is a C-based multimedia framework. To use it from an Electron app, we built a native Node.js addon in C++ using cmake-js. The addon lives in src/native/aec/ and provides a JavaScript API for creating, starting, and stopping the echo cancellation pipeline.
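The pipeline the addon constructs is not shown in this post. As a rough sketch, the snippet below builds a GStreamer pipeline description string of the kind that could be handed to gst_parse_launch. The element names webrtcechoprobe and webrtcdsp come from the article; everything else (appsrc and appsink endpoints, the element names, mono 48 kHz S16LE caps, the probe name) is an assumption for illustration, not Raven's actual configuration. It is written in Python only to keep the example self-contained; the string itself is language-agnostic.

```python
def build_aec_pipeline(rate_hz=48000, probe_name="far_end_probe"):
    """Return a gst-launch style description with two branches in one pipeline.

    All names and caps here are hypothetical. A real appsrc would also need
    its input caps configured by the application.
    """
    caps = (
        f"audio/x-raw,format=S16LE,rate={rate_hz},"
        "channels=1,layout=interleaved"
    )
    # Far-end branch: system audio passes through the probe unchanged,
    # which records it as the echo reference.
    far_end = (
        "appsrc name=system_src is-live=true format=time ! "
        f"audioconvert ! audioresample ! {caps} ! "
        f"webrtcechoprobe name={probe_name} ! "
        "appsink name=system_sink"
    )
    # Near-end branch: mic audio goes through webrtcdsp, which looks up
    # the probe by name and removes the echo it recorded.
    near_end = (
        "appsrc name=mic_src is-live=true format=time ! "
        f"audioconvert ! audioresample ! {caps} ! "
        f"webrtcdsp probe={probe_name} echo-cancel=true ! "
        "appsink name=mic_sink"
    )
    # Branches separated by whitespace belong to the same pipeline.
    return f"{far_end} {near_end}"


description = build_aec_pipeline()
print(description)
```

Both branches have to live in the same pipeline so that webrtcdsp can find the probe by name. The audioconvert and audioresample elements in front of each branch are there so both streams reach the canceller in the same format and rate.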

The build process compiles against Electron's specific Node.js runtime version (currently Electron 40.4.1) to ensure ABI compatibility. On macOS, GStreamer is installed via Homebrew. On Windows, it requires the GStreamer MSVC runtime and development installers, which set the necessary environment variables automatically.

The challenges we faced

Getting echo cancellation to work reliably across different hardware was the most time-consuming part of building Raven. Here are some of the issues we encountered:

Sample rate mismatches. Different audio devices run at different sample rates (44.1 kHz, 48 kHz, 96 kHz). The AEC3 engine requires both streams to be at the same rate. We handle resampling in the GStreamer pipeline, but getting the buffer sizes right to avoid underruns or latency spikes took significant tuning.
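A little arithmetic shows why buffer sizes get awkward. The sketch below assumes audio is processed in 10 ms frames (WebRTC's audio processing works on 10 ms chunks) and uses a naive linear-interpolation resampler purely for illustration. GStreamer's audioresample element is a much higher quality resampler, and none of this is Raven's actual code.

```python
def samples_per_frame(rate_hz, frame_ms=10):
    """Number of samples in one frame of the given duration."""
    return rate_hz * frame_ms // 1000


def resample_linear(samples, src_rate, dst_rate):
    """Toy resampler: linear interpolation between neighbouring samples."""
    if not samples:
        return []
    out_len = len(samples) * dst_rate // src_rate
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


for rate in (44100, 48000, 96000):
    print(rate, "Hz ->", samples_per_frame(rate), "samples per 10 ms")

# One 10 ms frame captured at 44.1 kHz, converted to 48 kHz.
frame_44k = [float(n) for n in range(samples_per_frame(44100))]
frame_48k = resample_linear(frame_44k, 44100, 48000)
print(len(frame_44k), "->", len(frame_48k))
```

A 10 ms frame is 441 samples at 44.1 kHz but 480 at 48 kHz, so a device that delivers buffers in power-of-two sizes (say 512 or 1024 samples) never lines up with frame boundaries at either rate. Something has to accumulate and re-chunk the audio, and that is where underruns and latency spikes tend to creep in.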

Latency alignment. For echo cancellation to work, the reference signal (system audio) and the mic signal need to be roughly time-aligned. If the system audio arrives too early or too late relative to the mic, the AEC3 engine can't correlate the echo. We had to carefully manage pipeline latency to keep the offset within acceptable bounds.
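One common way to measure the offset between a reference and a mic signal is cross-correlation: slide one signal against the other and pick the lag where they match best. The sketch below is a brute-force illustration on synthetic data, with a made-up delay and leak gain. It is not how Raven or AEC3 measures delay.

```python
import random


def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) at which mic best matches the reference.

    Assumes both lists have the same length and that the echo in mic
    lags the reference by at most max_lag samples.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[n - lag] * mic[n] for n in range(lag, len(mic)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # stand-in for system audio
true_delay = 37                                          # hypothetical offset in samples
mic = [0.0] * true_delay + [0.6 * s for s in reference[:-true_delay]]

estimated = estimate_delay(reference, mic, max_lag=100)
print("estimated delay:", estimated, "samples")
```

The search window (max_lag) is the practical constraint. If the true offset falls outside the range the canceller can search or model, no amount of adaptation will find the echo, which is why the pipeline latency has to be kept within bounds in the first place.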

Bluetooth audio. Bluetooth headphones introduce significant and variable latency that changes depending on the codec (SBC, AAC, aptX, LDAC). This threw off our timing assumptions and required additional buffering logic.
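The post does not describe the buffering logic itself. One plausible shape for it, sketched below under that assumption, is an adjustable delay line that holds back the reference signal by a number of frames so it lines up with the late-arriving echo. The class name, the frame-based granularity, and the delay values are all hypothetical.

```python
from collections import deque


class ReferenceDelayLine:
    """Delay reference frames by an adjustable number of frames (hypothetical helper)."""

    def __init__(self, delay_frames, frame_size):
        self._delay = delay_frames
        self._silence = [0.0] * frame_size
        self._frames = deque()

    def set_delay(self, delay_frames):
        """Change the delay; if it shrinks, drop the oldest queued frames."""
        self._delay = delay_frames
        while len(self._frames) > self._delay:
            self._frames.popleft()

    def push(self, frame):
        """Queue a new frame and return the frame that is now due.

        Returns silence while the queue is still filling up.
        """
        self._frames.append(frame)
        if len(self._frames) > self._delay:
            return self._frames.popleft()
        return list(self._silence)


line = ReferenceDelayLine(delay_frames=2, frame_size=1)
outputs = [line.push([float(n)]) for n in range(1, 5)]
print(outputs)

line.set_delay(1)            # e.g. the output device switched to a lower-latency codec
print(line.push([5.0]))
```

Delaying the reference is the right direction for Bluetooth output, because the headphones play the audio later than the system capture sees it. The hard part, which this sketch does not address, is deciding what the delay should be and changing it without causing an audible glitch or confusing the canceller.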

Platform differences. macOS's ScreenCaptureKit and Windows's WASAPI have fundamentally different APIs, threading models, and audio format conventions. We ended up writing completely separate capture implementations for each platform, sharing only the GStreamer pipeline and downstream processing.

The result

After months of iteration, Raven's echo cancellation works reliably across laptop speakers, external monitors with built-in speakers, wired headphones, wireless earbuds, and conference room setups. The transcription clearly separates "you" from "them," with minimal cross-talk.

Is it perfect? No. AEC is inherently lossy - there are situations where the algorithm aggressively suppresses parts of the mic signal to remove echo, which can occasionally clip the beginning or end of your sentences. But for the purpose of real-time transcription and AI assistance, it's more than good enough.

The entire echo cancellation pipeline - the native C++ addon, the GStreamer configuration, and the audio routing logic - is about 800 lines of code. It's open source, and if you're building anything that needs real-time echo cancellation in a desktop app, you're welcome to use it.

Chaitanya Laxman

Product


Engineering

Mar 3, 2026
