Dual-Stream Audio Capture on macOS with ScreenCaptureKit
Capturing system audio on macOS used to require kernel extensions. With ScreenCaptureKit, Apple gave us a supported API - here's how Raven uses it alongside CoreAudio to capture both sides of a conversation.

Chaitanya Laxman
Product
Mar 2, 2026

Before ScreenCaptureKit, capturing system audio on a Mac was a hack. You needed a virtual audio device driver (Soundflower, BlackHole, Loopback) that routed audio through a kernel extension or user-space audio plug-in. It worked, but it was fragile, often required administrator access, broke with OS updates, and was a significant barrier for any app that needed system audio capture.
With macOS 12.3, Apple introduced ScreenCaptureKit, a high-level framework for capturing screen content, and macOS 13 Ventura added system audio capture to it. For the first time, apps could capture system audio without any third-party drivers, using a supported API that works reliably across OS updates.
Raven uses ScreenCaptureKit for system audio capture and CoreAudio for microphone capture, running simultaneously in a dedicated Swift process.
Why a separate Swift process?
Raven is an Electron app (TypeScript/JavaScript), but audio capture on macOS is best done from native code. ScreenCaptureKit is a Swift/Objective-C framework that requires specific entitlements and works most reliably from a native macOS context. Rather than building a complex native Node.js module, we created a lightweight Swift command-line tool that handles audio capture and streams the results to the Electron main process via standard I/O pipes.
The Swift binary lives in src/native/swift/AudioCapture/ and is built with swift build -c release. When Raven starts recording, the Electron main process spawns this binary as a child process and reads its output.
System audio capture with ScreenCaptureKit
ScreenCaptureKit provides the SCStream class for capturing screen content and audio. To capture system audio:
1. We create an SCStreamConfiguration with capturesAudio set to true and excludesCurrentProcessAudio set to true, so Raven doesn't capture its own UI sounds.
2. We create an SCContentFilter that specifies what to capture. For audio-only capture, we select a display but configure the stream to include audio and exclude video, minimizing overhead.
3. We start the stream and receive audio samples via the SCStreamOutput delegate. Each sample is a CMSampleBuffer containing PCM audio data.
4. We extract the raw PCM bytes from each sample buffer and write them to stdout.
Microphone capture with CoreAudio
Simultaneously, we set up an AVAudioEngine to capture microphone input:
1. We get the input node from the audio engine, which represents the system's default microphone.
2. We install a tap on the input node that receives audio buffers in real time.
3. We extract the raw PCM bytes from each buffer and write them to stdout, interleaved with the system audio data using a simple framing protocol so the Electron process can separate the two streams.
The framing protocol
Since both audio streams are written to the same stdout pipe, we need a way to tell them apart. Each audio chunk is preceded by a small header that includes:
- The stream identifier (system or mic)
- The size of the audio data
- A timestamp for synchronization
The Electron main process reads from the child process's stdout, parses these headers, and routes each chunk to the appropriate input of the GStreamer echo cancellation pipeline.
Permissions
ScreenCaptureKit requires the Screen Recording permission, and microphone capture requires the Microphone permission. Raven's onboarding flow prompts the user to grant both permissions in step 3. On macOS, these permissions are managed in System Settings → Privacy & Security.
One important note: the Screen Recording permission dialog can be confusing because Raven isn't capturing video — it's only using ScreenCaptureKit for audio. But macOS bundles screen and audio capture under the same permission, so the Screen Recording grant is required even though no visual content is being captured.
The Windows equivalent
On Windows, the audio capture is handled by a completely different implementation: a Rust module built with NAPI-RS that uses WASAPI (Windows Audio Session API).
WASAPI provides two capture modes that map to our two streams:
- Loopback capture - captures the audio being played through the default audio output device (equivalent to ScreenCaptureKit for system audio)
- Standard capture - captures audio from the default input device (microphone)
The Rust module runs both captures in parallel threads and delivers the raw PCM data to the Node.js runtime via NAPI callbacks. From there, the audio enters the same GStreamer echo cancellation pipeline as on macOS.
We chose Rust for the Windows implementation because WASAPI is a COM-based API with complex initialization, threading, and buffer management requirements. Rust's type system and NAPI-RS bindings made it significantly more manageable than writing it as a C++ addon or trying to wrangle COM from Node.js directly.
The result
Both platform implementations produce the same output: two clean PCM audio streams - system audio and microphone audio - ready for echo cancellation and transcription. The rest of Raven's pipeline (GStreamer AEC, Deepgram transcription, AI assistance) is platform-independent and works identically on both macOS and Windows.
The native audio capture code is about 600 lines of Swift on macOS and 900 lines of Rust on Windows. Both are open source and available in the Raven repository.
