🏆 World-class separation performance

Separate Vocals, Speakers & Music
Free, Online, in Seconds

Music playing during your shoot? Unwanted noise in the background? Drop any audio or video file below — Perso Dubbing splits it into vocals, individual speakers, and background music, and you can hear every track before signing up.

No signup needed · First 60 seconds free · Files are never stored

Audio Separation

Click or drag and drop your file

Separation starts instantly — no account needed (Up to 200MB)

MP4 · MOV · WebM · WAV · MP3 · M4A


Benchmarks

World-class performance — measured, not claimed

Three industry-standard public benchmarks — MUSDB18 for vocal separation, VoiceBank-DEMAND for speech denoising, and the Open ASR Leaderboard for transcription. The same datasets every research paper uses, against named engines, with per-sample data published so anyone can re-run the tests.

Vocal separation · higher = better

MUSDB18 (vocals) · median SI-SDR

Perso Dubbing 🏆: 10.67 dB
HTDemucs (Meta): 8.36 dB
LALAL.AI · MDX-Net: not yet tested

Wins on 44 of 50 tracks — and when we lose, the gap is at most 0.66 dB.

Noise removal quality · higher = better

VoiceBank-DEMAND · PESQ-WB

DeepFilterNet3: 2.77
Perso Dubbing: 2.64
ElevenLabs: 2.38
Noisy input (before cleanup): 1.70

The specialist DeepFilterNet3 leads by a hair (2.77 vs 2.64) — both far ahead of ElevenLabs.

Speech clarity · higher = better

VoiceBank-DEMAND · ESTOI

DeepFilterNet3: 0.821
Perso Dubbing: 0.817
ElevenLabs: 0.769
Noisy input (before cleanup): 0.747

Top two are effectively tied. ElevenLabs makes speech harder to understand on half of samples — we improve it on 96%.

Voice-clone fidelity · higher = better

30 speakers · 2 cloning systems · cos_sim

Clean original (ceiling): 0.736
Perso Dubbing 🏆: 0.674
ElevenLabs Audio Isolation: 0.665
DeepFilterNet3: 0.652

First on both cloning systems tested — even inside ElevenLabs' own cloner. The striped bar is the clean original: the natural ceiling.

Transcription accuracy (WER) · lower = better

Open ASR Leaderboard · 8 configs · word error rate

Average of 8 benchmarks · statistical tie

Scribe v2 (ElevenLabs): 7.52%
Perso Dubbing: 7.61%

Multi-speaker content (GigaSpeech)

Perso Dubbing 🏆: 10.70%
Scribe v2 (ElevenLabs): 11.48%
Whisper large-v3: not yet tested

Overall a statistical tie with Scribe v2 — but on multi-speaker content like podcasts, we come out ahead (shorter bar = fewer errors).

Bars are zoomed into the competitive range so small gaps stay visible — the exact score next to each bar is what counts.

What do these tests actually measure?

🎯 Vocal separation (SI-SDR) · Higher = better

How cleanly voice and music are pulled apart — like extracting a karaoke track with zero voice left in it. Our score: 10.67 dB vs HTDemucs 8.36 dB — less leaking between tracks, and we win on 44 of 50 songs.
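For readers who want to see what the SI-SDR number computes, here is a minimal Python sketch of the usual scale-invariant signal-to-distortion ratio for one reference/estimate pair. It illustrates the metric's standard definition and is not Perso Dubbing's benchmark code; the function name `si_sdr` and the synthetic sine-wave signals are assumptions made for the example.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio (dB) for two equal-length signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0:
        raise ValueError("reference signal is silent")
    # Project the estimate onto the reference: that part counts as "target".
    scale = sum(r * e for r, e in zip(ref, est)) / ref_energy
    target = [scale * r for r in ref]
    # Everything left over (leakage, artifacts) counts as distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    if distortion_energy == 0:
        return math.inf
    return 10 * math.log10(target_energy / distortion_energy)


# Synthetic example (assumption): a "vocal" tone plus leaked "music" at another pitch.
N = 1000
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
leak = [math.sin(2 * math.pi * 50 * n / N) for n in range(N)]
clean_estimate = [r + 0.1 * m for r, m in zip(reference, leak)]
leaky_estimate = [r + 0.5 * m for r, m in zip(reference, leak)]

print(round(si_sdr(reference, clean_estimate), 2))
print(round(si_sdr(reference, leaky_estimate), 2))
```

With these synthetic signals the cleaner estimate scores about 20 dB and the leakier one about 6 dB, and rescaling an estimate leaves its score unchanged. A benchmark figure like the median quoted above would be the median of such per-track scores.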

🔊 Noise removal (PESQ · ESTOI) · Higher = better

How clear and natural speech sounds after noise is stripped out. PESQ is the same scoring used for phone-call quality. We score 2.64 on PESQ, a hair behind the specialist DeepFilterNet3 (2.77) and well ahead of ElevenLabs (2.38). On clarity (ESTOI), we effectively tie for first: 0.817 vs 0.821.

📝 Transcription accuracy (WER) · Lower = better

Out of 100 spoken words, how many get written down wrong. Our 7.61% means about 92 of 100 words right — statistically the same as ElevenLabs Scribe v2 (7.52%), and ahead of it on multi-speaker recordings like podcasts.
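As a rough illustration of how a word error rate is usually computed, the Python sketch below counts the minimum number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, then divides by the number of reference words. The function name `word_error_rate` is an assumption for the example, and the sketch only splits on whitespace; public leaderboards typically also normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


spoken = "the quick brown fox jumps"
transcript = "the quick brown box"
print(word_error_rate(spoken, transcript))
```

In this example one word is substituted and one is dropped, so 2 errors over 5 reference words gives a WER of 0.4, or 40%.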

🎤 Voice-clone fidelity (cos_sim) · Higher = better

After cleanup, does a voice clone made from that audio still sound like the same person? Scored 0 to 1 against the original voice. Our 0.674 ranks first on both cloning systems tested — including inside ElevenLabs' own cloner.
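The cos_sim score is a cosine similarity between two voice "fingerprints" (speaker embedding vectors). A minimal Python sketch of that comparison follows. The three-number vectors are made up for illustration, the function name `cosine_similarity` is an assumption, and the page does not say which embedding model produced the real scores. Cosine similarity can in general range from -1 to 1; the reported scores fall between 0 and 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the original voice and a clone built from cleaned audio.
original_voice = [3.0, 4.0, 0.0]
cloned_voice = [4.0, 3.0, 0.0]
print(cosine_similarity(original_voice, cloned_voice))
```

The closer the result is to 1, the more the clone's embedding points in the same direction as the original speaker's.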

Honest footnotes: vocal separation is measured on the MUSDB18 sample set (full MUSDB18-HQ rerun in progress, expected within ±0.5 dB). DeepFilterNet3 edges us on PESQ by 0.13 (2.77 vs 2.64); we tie on clarity and lead on waveform fidelity (+18.66 vs +17.31 dB SI-SDR). MDX-Net and LALAL.AI are not yet tested, so we don't claim to beat every separator. Verified May 2026.

The bottom line: on public benchmarks, our engine separated vocals more cleanly than Meta's HTDemucs on 44 of 50 songs, effectively tied the dedicated denoising specialist DeepFilterNet3 on speech clarity while trailing it slightly on PESQ, and beat ElevenLabs Audio Isolation on 92–100% of test samples. It even builds better voice clones inside ElevenLabs' own cloning system than ElevenLabs' own preprocessor does. Verified May 2026, with per-sample data published for anyone to re-check.

How it works

Three steps, under a minute

STEP 1

Upload your file

Drag and drop an audio or video file — MP3, WAV, M4A, MP4, MOV, or WebM, up to 200MB. No account needed for the first 60 seconds.

STEP 2

Preview separated tracks

The AI splits your file into individual speakers, pure background music, and background with reactions. Play each track right in the browser.

STEP 3

Export your mix

Pick the tracks you need and export them as one file. Sign in to download, or to process the full length of longer files.

Why Perso Dubbing

More than a vocal remover

😂 Dual background audio modes

Pure BGM, or BGM with laughter and applause kept intact. No other separation tool offers both from one upload.

👤 Multi-speaker separation

Not just vocals vs. music — speaker separation gives every person in the recording their own track, plus a speaker-labeled transcript in 99+ languages.

🔒 Nothing is stored

Trial files are processed in temporary storage and deleted when your session ends. Never kept, never used for training.

📝 Transcription in 99+ languages

Every separation includes automatic speech-to-text with speaker labels, shown right next to your tracks. Language detection is automatic — no extra tools, no extra steps.

🎬 Works with audio & video

Upload MP3, WAV, M4A, MP4, MOV, or WebM. Export tracks with embedded subtitles or separate SRT files.

🎚 Selective mix export

Combine any tracks into one file — Background Music plus Speaker 1, for example. No other separation tool exports a custom mix in one step.

Dual Background Audio Mode

Remove background music or noise from your video — two ways

A podcast laugh track, an audience reaction, a cough during a keynote: most vocal removers can't tell these from speech. Perso Dubbing gives you both options from a single upload: one background track with those reactions removed, and one with them kept.

MODE 1

Background Music

Removes every human sound — speech, laughter, claps — leaving only the background sound. Ideal for copyright-free BGM and clean audio beds for re-dubbing.

🗣 Speech: REMOVED
😂 Laughter / Applause: REMOVED
🎵 Background Music: KEPT

MODE 2 · Only in Perso Dubbing

Background with Reaction

Removes only speech, keeping laughter, applause, and crowd energy intact. Perfect for podcasts, live events, and variety shows where atmosphere matters.

🗣 Speech: REMOVED
😂 Laughter / Applause: KEPT
🎵 Background Music: KEPT

Multi-Speaker Separation

One track per voice — speaker separation for interviews, podcasts & meetings

Most vocal removers stop at two stems: voice and music. Perso Dubbing's multi-speaker separation goes further. The AI detects how many people are talking and splits the recording into individual speaker tracks, each with a labeled transcript in 99+ languages.

INPUT

One mixed recording

An interview, podcast, or meeting recording with several people talking over music and room noise — uploaded as a single audio or video file.

🎙 Speaker 1 + Speaker 2 + Music: MIXED

OUTPUT · Speaker separation

A separate track for every speaker

Separate speakers from audio in one click: export a single speaker's track, or any mix you choose — no manual editing.

🎤 Speaker 1: OWN TRACK
🎤 Speaker 2: OWN TRACK
🎵 Background Music: OWN TRACK

Use cases

Who uses audio separation?

🛡 Copyright resolution

Remove copyrighted BGM while keeping dialogue intact, swap in royalty-free music, and re-upload claim-free.

🎙 Podcast editing

Cut filler words and unwanted speech while keeping audience laughter and ambient reactions untouched.

🌍 Video dubbing

Extract a clean BGM track with zero speech bleed, then lay new voice-over in any of 99+ languages on top.

💼 Meetings & conferences

Separate speakers from audio in Zoom or Meet recordings — each participant gets their own track, with speaker-labeled transcripts built in.

📱 Social media clips

Swap the original BGM in short-form videos for a trending track — without touching your voiceover.

🎤 Concerts & fancams

Strip crowd noise and venue reverb from live clips to isolate the artist's voice or the music.

📰 Journalism & interviews

Use multi-speaker separation to pull each interviewee's voice out of noisy field recordings, with clean transcripts for fact-checking.

♻️ Repurpose content

One upload becomes podcast audio, promo BGM, speaker clips for social, and a full transcript for your blog.


FAQ

Frequently asked questions

Is Perso Dubbing Audio Separation free to use?

Yes. You can upload any audio or video file and separate the first 60 seconds completely free, with no signup and no credit card. To download results or process files longer than 60 seconds, subscribe to Perso Dubbing. Paid plans extend processing limits and add speaker editing features.

Do I need to create an account to try audio separation?

No. The 60-second trial runs entirely without an account. Upload a file, listen to every separated track in your browser, and decide whether the quality meets your needs. An account is only required when you download a result or process longer files.

What happens if my file is longer than 60 seconds?

Files longer than 60 seconds are still accepted — the AI processes the first 60 seconds so you can judge separation quality on your own content. To separate the full file, sign in and upload the file again.

Are my uploaded files stored on Perso Dubbing servers?

No. Trial uploads are processed in temporary storage and deleted automatically when your session ends. Perso Dubbing does not keep, reuse, or train on files uploaded through the free trial.

What file formats and sizes are supported?

Perso Dubbing accepts MP3, WAV, and M4A audio files plus MP4, MOV, and WebM video files, up to 200MB per upload. Video files are handled automatically — the AI extracts the audio and separates it.
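If you prepare files in a script before uploading, a quick pre-check against the limits above might look like the Python sketch below. The function name `check_upload` is hypothetical, and treating 200MB as 200,000,000 bytes is an assumption, since the page does not say how the limit is counted. This is a local convenience check, not part of any Perso Dubbing API.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}
MAX_UPLOAD_BYTES = 200 * 1000 * 1000  # assumption: 200MB counted in decimal bytes


def check_upload(filename: str, size_bytes: int) -> str:
    """Return "audio" or "video" for an acceptable file, else raise ValueError."""
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"{filename} is larger than the 200MB upload limit")
    extension = Path(filename).suffix.lower()
    if extension in AUDIO_EXTENSIONS:
        return "audio"
    if extension in VIDEO_EXTENSIONS:
        return "video"
    raise ValueError(f"{filename} is not one of the supported formats")


print(check_upload("interview.MP4", 150_000_000))
print(check_upload("episode.wav", 42_000_000))
```

The extension check is case-insensitive, so `interview.MP4` and `interview.mp4` are treated the same way.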

What is the difference between Background Music and Background with Reaction mode?

Background Music removes every human-generated sound — speech, laughter, applause — and leaves only the pure background sound. Background with Reaction removes only speech while keeping laughter, applause, and crowd sounds, which preserves the live atmosphere of podcasts and event recordings. Perso Dubbing generates both tracks from a single upload.
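The difference between the two modes can be summarized as a small lookup table. The Python sketch below is only a restatement of the description above; the mode keys and the `kept_sounds` helper are hypothetical names, not product settings.

```python
SOUND_TYPES = ("speech", "laughter_applause", "background_music")

# True = kept in the exported background track, False = removed.
BACKGROUND_MODES = {
    "background_music": {
        "speech": False,
        "laughter_applause": False,
        "background_music": True,
    },
    "background_with_reaction": {
        "speech": False,
        "laughter_applause": True,
        "background_music": True,
    },
}


def kept_sounds(mode: str) -> list:
    """List the sound types that remain in the given background mode."""
    settings = BACKGROUND_MODES[mode]
    return [sound for sound in SOUND_TYPES if settings[sound]]


for mode in BACKGROUND_MODES:
    print(mode, "keeps", kept_sounds(mode))
```

Both modes remove speech; the only difference is whether laughter and applause stay in the track.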

Can Perso Dubbing do multi-speaker separation, not just vocals and music?

Yes. Beyond the vocal/music split, Perso Dubbing runs full speaker separation (also called speaker split): the AI detects each speaker in the recording and produces a separate track per speaker, along with a speaker-labeled transcript in 99+ languages. This makes it suited for interviews, podcasts, and meeting recordings, not only music.

How accurate is Perso Dubbing audio separation compared to other tools?

On the standard MUSDB18 benchmark, Perso Dubbing separates vocals more cleanly than Meta's HTDemucs on 44 of 50 tracks (10.67 vs 8.36 dB median SI-SDR). On VoiceBank-DEMAND speech denoising, it effectively ties the dedicated specialist DeepFilterNet3 on speech clarity (ESTOI 0.817 vs 0.821) and outperforms ElevenLabs Audio Isolation on 92–100% of samples. Per-sample results are published so anyone can verify the numbers.

Can I remove copyrighted background music from my video?

Yes. Upload your video, let the AI separate the audio tracks, then export only the vocal and speaker tracks without the background music. It is the fastest way to resolve copyright claims on YouTube, TikTok, or Instagram without re-recording your content.

How do I remove background music from a video I filmed?

Upload the video file directly — no need to extract the audio first. Perso Dubbing separates speech, background music, and ambience into individual tracks; export the speech-only mix to drop the music, or keep any combination you want. MP4, MOV, and WebM are supported, and the first 60 seconds are free.

How is Perso Dubbing different from LALAL.AI or Moises?

Music tools split vocals and instruments — and stop there. Perso Dubbing combines separation with transcription in 99+ languages, speaker reassignment, dual background audio modes, and selective track mixing in one workflow, built for video creators and content editors rather than musicians only.

Can I combine selected audio tracks into one file?

Yes. Choose any combination of separated tracks — Background Music plus Speaker 1, for example — and export them as a single merged audio file. This selective mix export is unique to Perso Dubbing.
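Conceptually, merging selected tracks means adding them together sample by sample. The Python sketch below shows that idea on tiny made-up tracks; the function name `mix_tracks` and the hard clipping to the range -1.0 to 1.0 are assumptions for illustration, and the page does not describe how the actual export handles levels.

```python
def mix_tracks(*tracks):
    """Sum equal-length tracks sample by sample, clipping to [-1.0, 1.0]."""
    if not tracks:
        raise ValueError("at least one track is required")
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all tracks must have the same number of samples")
    # zip(*tracks) yields the samples that play at the same instant.
    return [max(-1.0, min(1.0, sum(samples))) for samples in zip(*tracks)]


# Made-up samples standing in for two separated tracks.
background_music = [0.5, -0.5, 0.25]
speaker_1 = [0.25, -0.75, 0.25]
print(mix_tracks(background_music, speaker_1))
```

The second sample sums to -1.25, so it is clipped to -1.0; the other two samples pass through as plain sums.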

Explore our product features

AI Dubbing · Video Translation · AI Lip Sync · Voice Cloning · Voice Translator · Speech to Text · Text-to-Speech · AI Voice Generator · Video Transcriber · Subtitle Editor · SRT Subtitles to MP4 · Extract Audio from Video

Try it on your own file — right now

The first 60 seconds are free. No signup, no stored files, no catch.

