whisper-ensemble

A drop-in audio annotation pipeline that routes audio files through an MIT Audio Spectrogram Transformer AudioSet classifier and dispatches each file to the most appropriate LAION Whisper-Small captioning model:

Routed content	Whisper model(s) used
Speech	`laion/voice-tagging-whisper` and `laion/BUD-E-Whisper_V1.2`
Music	`laion/music-whisper`
Everything else	`laion/sound-effect-captioning-whisper`

The router is built on MIT/ast-finetuned-audioset-10-10-0.4593, an Audio Spectrogram Transformer fine-tuned on AudioSet-2M (mAP ≈ 0.459, BSD-3-Clause). It is a multi-label classifier: its sigmoid head produces an independent probability for each of the 527 AudioSet classes, so the scores of one clip do not sum to 1. The display name of the top-1 class is then mapped to one of speech, music, or sfx using the AudioSet ontology.
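As a minimal sketch of what "independent per-class probabilities" means, the snippet below applies a sigmoid to each logit separately and keeps the three highest-scoring labels. The logit values are made up for illustration, and `top_k` is a hypothetical helper, not a function from pipeline.py.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def top_k(logits: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Score every class independently, then keep the k highest."""
    scored = [(label, sigmoid(z)) for label, z in logits.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical logits for four of the 527 classes (illustrative values only).
example_logits = {
    "Speech": 0.6,
    "Vehicle horn, car horn, honking": -0.1,
    "Inside, small room": -1.9,
    "Music": -3.0,
}

top3 = top_k(example_logits)
for label, prob in top3:
    print(f"{label}: {prob:.3f}")
```

Because each class is scored on its own, two labels can both be likely for the same clip, which is why the top-3 confidences in the sample annotations can add up to more than 100%.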

For each input audio file, the pipeline writes a sidecar JSON with:

the top-3 AudioSet predictions with confidence scores
the chosen route
the resulting Whisper annotations (a music caption, a sound-effect caption, or both voice tags + a free-form speech description)

Repository layout

whisper-ensemble/
├── README.md                    # this file
├── requirements.txt             # pip dependencies
├── router.py                    # AudioSet-label -> {speech, music, sfx} mapping
├── pipeline.py                  # main inference CLI
├── download_models.py           # one-shot asset downloader
│
├── samples/                     # 18 demo clips (6 per source dataset) +
│   ├── audioset/                # their corresponding *.json annotations,
│   ├── music/                   # rendered into the README below.
│   └── majestrino/
│
├── scripts/                     # helper scripts used to build the README
│   ├── sample_datasets.py       #   pick + download tar -> extract clips
│   └── render_results.py        #   render JSON sidecars as Markdown
│
└── models/                      # (downloaded) all model weights
    ├── ast-finetuned-audioset/    # MIT/ast-finetuned-audioset-10-10-0.4593
    ├── sound-effect-captioning-whisper/
    ├── music-whisper/
    ├── voice-tagging-whisper/
    ├── BUD-E-Whisper_V1.2/
    └── whisper-small-processor/   # openai/whisper-small (feature extractor)

Installation

# 1) Create an environment and install dependencies
pip install -r requirements.txt

# 2) Snapshot the MIT AST router, the four LAION Whisper repos, and the
#    openai/whisper-small processor into ./models/. Total download is
#    ~5.8 GB.
python download_models.py

A CUDA GPU with ~6 GB VRAM is enough to hold the AST router and all four Whisper-Small models in fp16 simultaneously. CPU is supported but slow.

Usage

The pipeline accepts files, directories, or a --files-from text file:

# Annotate every audio file under a folder, recursively
python pipeline.py /path/to/audio --recursive

# Annotate a flat list of files
python pipeline.py file1.wav file2.mp3 file3.flac

# Annotate files listed in a text file (one path per line)
python pipeline.py --files-from list.txt

# Customise batching, device, dtype, and output location
python pipeline.py /data/audio -r \
    --batch-size 8 \
    --device cuda --dtype float16 \
    --overwrite \
    --output-dir /data/audio_annotations

For each input file foo.wav the pipeline writes foo.json next to it (or, with --output-dir, into that directory using the same basename).
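The naming rule can be sketched with `pathlib`. `sidecar_path` is a hypothetical helper written for this illustration; it assumes "same basename" means the file stem plus `.json`, and it is not taken from pipeline.py.

```python
from pathlib import Path
from typing import Optional


def sidecar_path(audio_path: str, output_dir: Optional[str] = None) -> Path:
    """Return where the JSON sidecar for audio_path would be written.

    Assumption: the sidecar keeps the audio file's stem and swaps the
    extension for .json, either next to the audio or inside output_dir.
    """
    src = Path(audio_path)
    target_dir = Path(output_dir) if output_dir else src.parent
    return target_dir / (src.stem + ".json")


print(sidecar_path("/data/audio/foo.wav"))
print(sidecar_path("/data/audio/sub/foo.wav", "/data/audio_annotations"))
```

Under this reading, two inputs that share a stem (for example `a/foo.wav` and `b/foo.mp3`) would map to the same file when `--output-dir` is used. Whether the pipeline guards against that is not stated here, so it is worth checking before annotating a deep directory tree into a single output folder.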

Supported audio formats

.wav .flac .mp3 .m4a .aac .ogg .oga .opus .wma .aif .aiff .webm .mp4 .mka: anything librosa / soundfile / audioread can read. Audio is automatically converted to 16 kHz mono float32. Whisper captioning uses the first 30 seconds of each file (the Whisper feature extractor window); AudioSet tagging by the AST router uses the first 10 seconds.
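The two windows translate into fixed sample counts at 16 kHz. The sketch below shows only that arithmetic on an already-decoded sample sequence; decoding is left out (a call such as `librosa.load(path, sr=16000, mono=True)` would be one way to obtain the samples). The constant and function names are illustrative assumptions, not identifiers from pipeline.py.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000      # Hz, after resampling to mono
WHISPER_WINDOW_S = 30     # seconds used for captioning
AST_WINDOW_S = 10         # seconds used for AudioSet tagging


def first_seconds(samples: Sequence[float], seconds: int,
                  sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Keep at most the first `seconds` of audio; shorter clips pass through."""
    return list(samples[: seconds * sample_rate])


# A 45-second stand-in clip of silence.
clip = [0.0] * (45 * SAMPLE_RATE)
whisper_input = first_seconds(clip, WHISPER_WINDOW_S)
ast_input = first_seconds(clip, AST_WINDOW_S)
print(len(whisper_input), len(ast_input))
```

One practical consequence: anything after the first 30 seconds of a file does not influence the caption, and anything after the first 10 seconds does not influence the route.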

Output JSON schema

{
  "schema_version": "1.0",
  "audio_file": "example.wav",
  "audioset_top3": [
    { "label": "Music",       "confidence": 0.624311 },
    { "label": "Pop music",   "confidence": 0.219840 },
    { "label": "Drum kit",    "confidence": 0.073011 }
  ],
  "route": "music",
  "annotations": {
    "music_caption": "Up-tempo synth-pop track in 4/4 with side-chained pads, four-on-the-floor kick, syncopated bassline and a bright female lead vocal."
  },
  "error": null
}

For files routed to speech, the annotations block contains both keys:

"annotations": {
  "voice_tags":          "natural speaking, fluent, narrator style delivery, modal voice, neutral airflow, normal loudness, monotone, precise articulation, slow deliberate delivery",
  "bud_e_speech_caption": "An adult male speaks calmly in a studio-quality recording with very low background noise. The delivery is measured and authoritative, characteristic of professional narration."
}

For files routed to sfx:

"annotations": {
  "sound_effect_caption": "Heavy rainfall on a metal roof with distant rolling thunder and occasional drips."
}
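For downstream consumers, a small check that a sidecar carries the annotation keys expected for its route might look like the following. The key names come from the schema above; `EXPECTED_KEYS` and `missing_annotation_keys` are hypothetical names introduced for this sketch.

```python
import json
from typing import Dict, List, Tuple

# Annotation keys per route, as listed in the schema above.
EXPECTED_KEYS: Dict[str, Tuple[str, ...]] = {
    "speech": ("voice_tags", "bud_e_speech_caption"),
    "music": ("music_caption",),
    "sfx": ("sound_effect_caption",),
}


def missing_annotation_keys(record: dict) -> List[str]:
    """Return the expected annotation keys that are absent from record."""
    expected = EXPECTED_KEYS.get(record.get("route"), ())
    annotations = record.get("annotations") or {}
    return [key for key in expected if key not in annotations]


# A shortened sidecar in the documented shape (caption text abbreviated).
record = json.loads("""
{
  "schema_version": "1.0",
  "audio_file": "example.wav",
  "route": "music",
  "annotations": {"music_caption": "Up-tempo synth-pop track."},
  "error": null
}
""")

print(missing_annotation_keys(record))
```

In practice the `error` field would also need to be inspected before trusting the annotations; how the pipeline populates it on failure is not described in this section.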

How the router decides

The decision uses the top-1 AudioSet display name predicted by the AST head, plus a single confidence threshold on the generic Speech class. See router.py for the full label lists; the high-level logic is:

speech when the top class is in the AudioSet Speech subtree (Speech, Male/Female/Child speech, Conversation, Narration, Babbling, Speech synthesizer), the Shout / Yell / Scream / Whisper family, the Laughter family (Laughter, Giggle, Snicker, Belly laugh, Chuckle, Baby laughter), the Crying family, other vocal utterances under Human voice (Sigh, Groan, Grunt), or speech-dominated Human group actions (Chatter, Crowd, Hubbub, speech noise, speech babble, Children playing). One extra rule applies: if the top label is the generic Speech class itself, routing to speech also requires confidence >= 0.80 (SPEECH_TOP1_MIN_CONFIDENCE in router.py); below that threshold the clip falls through to the sfx route. The reason is that AST often ranks Speech first, with a low score, on clips dominated by a sound effect (a car horn, a microwave beep, a creaking door) that also contain a faint human voice, and the threshold filters out those false positives. More specific speech labels such as Male speech, man speaking or Narration, monologue are trusted at any confidence and bypass the threshold.
music when the top class is in the AudioSet Music subtree — the music root, every musical instrument category (Guitar, Piano, Drum kit, Violin, …), every musical genre (Pop music, Jazz, Classical music, Hip hop music, Electronic music, …), every music-mood / role class (Background music, Soundtrack music, Theme music, Lullaby, Happy music, Scary music, …), and the Singing subtree (Singing, Choir, Yodeling, Chant, Male/Female/Child singing, Rapping, Humming).
sfx (the default fallback) for everything else: animals, birds, environmental sounds, vehicles, household objects, explosions, ambient room tone, mixed content where the dominant class is not clearly speech or music. The general-purpose laion/sound-effect-captioning-whisper produces a free-form caption suitable for any audio scene.
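The three rules above can be condensed into a few lines. This is a sketch, not the contents of router.py: the label sets are deliberately abbreviated to a handful of names mentioned in this README, and only `SPEECH_TOP1_MIN_CONFIDENCE` and its value of 0.80 are taken from the text.

```python
SPEECH_TOP1_MIN_CONFIDENCE = 0.80  # value stated above

# Abbreviated, illustrative label sets; the full lists live in router.py.
SPEECH_LABELS = {
    "Speech", "Male speech, man speaking", "Female speech, woman speaking",
    "Narration, monologue", "Laughter", "Sigh",
}
MUSIC_LABELS = {"Music", "Pop music", "Piano", "Singing", "Chant"}


def route(top_label: str, confidence: float) -> str:
    """Map the top-1 AudioSet label (and its score) to speech, music, or sfx."""
    if top_label in SPEECH_LABELS:
        # Only the generic Speech class is gated by the threshold.
        if top_label == "Speech" and confidence < SPEECH_TOP1_MIN_CONFIDENCE:
            return "sfx"
        return "speech"
    if top_label in MUSIC_LABELS:
        return "music"
    return "sfx"  # default fallback


# Top-1 predictions taken from the sample annotations in this README.
for label, conf in [("Speech", 0.957), ("Speech", 0.648), ("Sigh", 0.910),
                    ("Chant", 0.432), ("Waves, surf", 0.382)]:
    print(f"{label} @ {conf:.3f} -> {route(label, conf)}")
```

Note that only the top-1 label is consulted: a clip whose second label is Music at 82% still goes to the speech models if Speech comes first at 83%.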

The routing is a rule of thumb, not a guarantee. Top-1 classification with a single confidence threshold is fast and works well in practice, but it does misroute some clips: a quiet TTS recording where AST returns Speech at 0.7 ends up on the sound-effect captioner, and a music clip with prominent vocals where AST ranks Speech above Music never reaches the music captioner (it goes to the speech models when Speech scores >= 0.80 and to the sound-effect captioner otherwise). If routing accuracy matters for your use case, the obvious upgrades are: train a small 3-way classifier on top of the AST embeddings, do soft voting over the top-k AudioSet classes (sum of Speech-subtree probabilities vs. Music-subtree probabilities vs. the rest), or replace the AST router with a CLAP-style audio-text model. The sample annotations below were generated with the simple top-1 + threshold heuristic and include a few misroutes, kept deliberately to illustrate this point.
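The soft-voting upgrade mentioned above could look roughly like this. It is a sketch of one possible variant, not something the pipeline currently does: `soft_vote` and the abbreviated label sets are assumptions made for illustration, and the scores are the top-3 predictions of one of the sample clips (majestrino__03002116.flac).

```python
from typing import Dict, Set, Tuple

# Abbreviated, illustrative label sets; a real version would use full subtrees.
SPEECH_SUBTREE: Set[str] = {"Speech", "Male speech, man speaking",
                            "Narration, monologue", "Sigh"}
MUSIC_SUBTREE: Set[str] = {"Music", "Piano", "Singing", "Chant"}


def soft_vote(scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Sum probabilities per bucket and pick the bucket with the largest sum."""
    totals = {"speech": 0.0, "music": 0.0, "sfx": 0.0}
    for label, prob in scores.items():
        if label in SPEECH_SUBTREE:
            totals["speech"] += prob
        elif label in MUSIC_SUBTREE:
            totals["music"] += prob
        else:
            totals["sfx"] += prob
    winner = max(totals, key=totals.get)
    return winner, totals


top3 = {"Speech": 0.778, "Music": 0.639, "Silence": 0.330}
winner, totals = soft_vote(top3)
print(winner, totals)
```

With top-1 plus threshold, this clip falls to sfx because Speech sits just under 0.80; summing per bucket sends it to speech instead. The sketch votes over only the top-k labels. Summing over all 527 classes would likely need some normalisation, since the "everything else" bucket contains far more classes than the other two.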

Sample annotations

The pipeline was run end-to-end on 28 audio clips drawn from four Hugging Face datasets (routing breakdown: 6 speech, 16 sfx, 6 music). For each clip we show the top-3 AudioSet predictions from the MIT AST router, the route the clip was dispatched to, and the resulting Whisper caption / tags. The audio files themselves are mirrored in this repo under samples/ and embedded inline below; press play to listen.

Source datasets:

mitermix/audioset-with-grounded-captions — AudioSet-derived clips with mixed content (speech, music, sound effects) — a good test of all three routes.
laion/captioned-ai-music-snippets — AI-generated music snippets, primarily routed to the music captioner.
TTS-AGI/majestrino-unified-detailed-captions-temporal — High-quality TTS-style speech recordings, primarily routed to the speech models.
laion/freesound-commercially-permissive-subset-with-captions — Curated commercially-permissive Freesound clips — 10 examples that the AST router classified as something other than speech or music, so they are routed through the general-purpose sound-effect captioner.

The first 18 clips are interleaved in a speech → sfx → music cycle to make it easy to compare the three routing branches side by side; the Freesound clips, all routed to sfx, follow at the end.

`audioset__qS6y9dA1GX4_185736.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	95.7%
2	`Narration, monologue`	22.4%
3	`Male speech, man speaking`	17.4%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural speaking, fluent, casual speaking style, modal voice, neutral airflow, normal loudness, slightly dynamic, precise articulation, normal speaking

laion/BUD-E-Whisper_V1.2 — speech caption:

A male speaker delivers a highly engaging and professional performance in standard American English. The recording boasts exceptional clarity with minimal background noise, creating a studio-quality listening experience. The speaker, likely a young adult male in his 20s or 30s, exhibits a resonant, slightly rough baritone timbre with a near-neutral-slightly-bright quality and a mild breathiness. The voice is chest-mixed, with a near-neutral-heavy weight and a slight wobble, suggesting a natural, healthy vocal production with mild wear. Articulation is precise and dynamic, contributing to a fluent and natural speaking style.

Initially, the speaker conveys strong elation and moderate hope, with a hint of triumph, delivered at a moderate tempo with a mid-to-high pitch range. The tone is clear and resonant, reflecting a spontaneous and natural delivery. As the recording progresses, the emotional landscape shifts to include a strong sense of hope and enthusiasm, coupled with moderate elation and a slight feeling of triumph. The tempo increases to a fast pace, and the pitch range becomes higher and more dynamic. The overall delivery remains natural and spontaneous, showcasing a confident and expressive vocal performance. The speaker's voice is consistently

`audioset__rQmOOSlJ74g_195927.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	64.8%
2	`Vehicle horn, car horn, honking`	48.2%
3	`Inside, small room`	12.8%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features a distinct, high-pitched squeaking sound. The squeak is short and sharp, with a slightly metallic quality. The sound is isolated, with no other discernible background noise. The audio is a recording of a squeaky toy being manipulated. The hint confirms the presence of a squeaky toy.

`audioset__QfN7F76Ikhw_217574.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Music`	53.6%
2	`Speech`	44.6%
3	`Electronic music`	13.0%

Route: music

laion/music-whisper — music caption:

The main sound is a heavily distorted and processed vocal sample, characterized by its distorted and processed timbre. The vocal sample is the primary focus, with no other discernible non-lyrical sounds present. The vocal sample is the sole element, with no other discernible non-lyrical sounds. The vocal sample is the only sound source. The genre is likely experimental electronic music or noise music. The overall mood conveyed is unsettling and chaotic. The heavy distortion and the repetitive nature of the vocal sample contribute to a sense of unease and disorientation. This audio clip would be well-suited for use in a horror film, an experimental art installation, or as a sound effect in a video game.

`audioset__sRh0k7mmdzo_232755.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	93.0%
2	`Male speech, man speaking`	73.6%
3	`Narration, monologue`	72.6%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural speaking style, modal voice, neutral airflow, normal loudness, slightly dynamic, precise articulation, normal speaking

laion/BUD-E-Whisper_V1.2 — speech caption:

A male speaker delivers a German monologue with a deep, resonant baritone voice, exhibiting a slightly soft and dark timbre with a subtle breathiness and nasality. The voice possesses a chest-mixed resonance and a moderately heavy weight, showing mild wear while maintaining a perfectly natural and stable quality. The speaking style is deliberate and slow, with a low pitch and precise articulation, suggesting a professional voice-over performance. Initially, the speaker conveys a sense of contemplation and moderate contentment, with a hint of interest, in a slightly low-pitched and neutral-toned voice. The delivery is natural and spontaneous, with a moderate tempo and clear enunciation. The recording quality is high, captured in a quiet, studio-like environment with minimal background noise, contributing to a pleasant listening experience.

Over time, the emotional landscape shifts dramatically. The speaker transitions to expressing strong anger, moderate distress, and slight contempt. The vocal delivery becomes highly dynamic, with a harsher, strained, and slightly rough timbre. The pitch range expands, and the tempo remains slow and deliberate, but the overall effect is more aggressive. The voice exhibits a slight tension and roughness, with a noticeable increase in vocal weight

`audioset__xYVzy_dh20A_184856.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	78.1%
2	`Inside, small room`	19.2%
3	`Squish`	9.1%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio contains speech. A male voice is speaking, but the words are unintelligible. The speech is somewhat muffled and difficult to understand. The audio contains speech, as indicated by the hint. The muffled quality suggests the speaker is either far away, speaking through a barrier, or the recording was made at a distance.

`music__suno_audio_037351_4_1866059.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Music`	95.0%
2	`Singing`	11.7%
3	`Soul music`	7.8%

Route: music

laion/music-whisper — music caption:

The listener hears a track that immediately establishes a foundation with a clean electric guitar riff. This is quickly joined by a simple drum beat, characterized by a prominent snare drum and a steady kick drum pattern. The tempo is approximately 120 beats per minute. A male vocalist then begins to sing. The vocal profile is that of an adult male, with a slightly raspy timbre and a clear delivery. There are no non-lyrical vocal sounds present. The genre of the music is categorized as either Indie Rock or Alternative Rock. The overall mood of the piece is melancholic and introspective. The instrumentation, featuring the clean electric guitar, drums, and the raspy vocals, contributes to a raw and emotional feel. The production quality is relatively simple and unpolished, which further enhances the emotional impact of the track. This track would be well-suited for a coming-of-age movie scene, a reflective moment in a television show, or as background music for a personal vlog.

`audioset__vCfuTIkDzxc_225223.mp3`

Source dataset: mitermix/audioset-with-grounded-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	83.0%
2	`Music`	82.3%
3	`New-age music`	4.9%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural articulation, casual speaking style, modal voice, neutral airflow, normal loudness speaking, slightly dynamic, natural speaking

laion/BUD-E-Whisper_V1.2 — speech caption:

A young adult male, likely in his 20s, delivers a speech characterized by a blend of contentment, elation, and amusement, transitioning to a more subdued and contemplative state. The voice possesses a smooth, clear baritone timbre, leaning slightly soft and neutral with a near-neutral-slightly-bright quality. A subtle breathiness and a hint of nasality are present, contributing to a relaxed and natural vocal quality. The airflow is neutral, with precise articulation and a moderate loudness, exhibiting a slight dynamic range. The speaking style is casual and fluent, with a slow, deliberate tempo and a stable, mid-range pitch. The delivery is spontaneous and natural, suggesting a comfortable and unforced vocal performance. The voice exhibits a chest-mixed resonance with a near-neutral heavy weight, and a mild wear, yet remains mostly natural and stable. The overall enjoyment is very pleasant, reflecting the pleasant vocal tone and clarity. The recording quality is high, captured in a quiet, studio-like environment with minimal background noise, indicating a professional setup. The speech quality is excellent, with no discernible distortion. The speaker utilizes a neutral American accent, and the language is English. The initial portion of

`music__suno_audio_196211_4_1844520.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	73.4%
2	`Music`	11.7%
3	`Male speech, man speaking`	7.4%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

A male voice, perceived as adult, speaks in a clear, measured tone, delivering a narrative or monologue. The speech is articulate and articulate, with a slightly formal timbre. The pace is moderate, and the pitch is in the mid-range. The audio quality is clean, with minimal background noise. This is a recording of a spoken word performance, likely a narration, a monologue, or a formal address. The clear articulation and measured pace suggest a professional or educational context, possibly for an audiobook, documentary, or a documentary.

`music__suno_audio_123116_3_1843095.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Music`	74.5%
2	`Piano`	38.1%
3	`Keyboard (musical)`	31.8%

Route: music

laion/music-whisper — music caption:

The listener hears a piece of music that commences with a straightforward piano melody. The piano's timbre is characterized by a slightly muffled quality, suggesting a recording made with a low-fidelity microphone. Accompanying the piano is a female vocalist, who sings in Portuguese. The vocal profile is that of a young adult female. Her vocal timbre and quality are described as clear, slightly breathy, and emotive. There are no non-lyrical vocal sounds present. The music is categorized as a Pop ballad, potentially incorporating influences from Latin or Latin musical traditions. The overall mood of the piece is melancholic, reflective, and carries a subtle sense of hope. The combination of the simple piano melody and the emotive vocals contributes to a feeling of vulnerability and sincerity. This musical composition would be well-suited for a dramatic scene in a film, a quiet moment of reflection, or as background music for a sentimental video.

`majestrino__03001175.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	84.3%
2	`Silence`	55.1%
3	`Speech synthesizer`	4.8%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural speaking, fluent, conversational style, modal voice, neutral airflow, normal loudness, slightly dynamic, precise articulation, normal speaking

laion/BUD-E-Whisper_V1.2 — speech caption:

adult male voice delivers a narration with a slightly pensive and melancholic tone, exhibiting a moderate sense of contemplation and a subtle hint of disappointment. The voice possesses a male baritone timbre, characterized by a slightly soft and neutral quality with a near-neutral-slightly-bright overall tonality. A subtle breathiness and a slight nasality are present, contributing to a relaxed vocal production. The voice exhibits a slight roughness and a chest-mixed resonance, with a near-neutral heavy vocal weight and mild wear, yet remains mostly natural and stable. The speaker's delivery is generally calm and measured, with a neutral pitch and volume, and a slightly monotonous intonation, though not entirely devoid of subtle dynamic shifts. Articulation is precise, indicative of a narration style. The airflow is neutral, and the voice is generally stable. The recording quality is excellent, with no discernible background noise, ensuring a clear and natural sound. The overall enjoyment is rated as medium, reflecting the clear audio but subdued emotional expression. The professionalism is high, attributed to the excellent recording quality and controlled vocal performance. The speaker conveys a sense of vulnerability and a slightly confident demeanor, with a neutral

`majestrino__03001817.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	63.5%
2	`Sigh`	51.2%
3	`Silence`	23.8%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

A female voice speaks in a frustrated and exasperated tone. The speaker is expressing negative feelings, using a harsh and somewhat sarcastic tone. The audio quality is clear.

`music__suno_audio_129101_5_1823360.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Chant`	43.2%
2	`Music`	40.8%
3	`Mantra`	38.5%

Route: music

laion/music-whisper — music caption:

The listener hears a track that immediately establishes a strong rhythmic foundation with a driving drum beat, maintaining a tempo of approximately 120 beats per minute. The drum kit's sound is characterized by its punchy and compressed quality, with a prominent snare drum accentuating the backbeat. Accompanying the drums is a distorted electric guitar, which plays a simple, repetitive riff, contributing to the song's overall texture. The vocals are delivered by a male vocalist, likely a young adult, who sings with a slightly strained and aggressive vocal style. The vocal timbre is raw and slightly nasal, with a noticeable accent. There are no other discernible non-lyrical vocal sounds. The song's genre is a blend of Alternative Rock and Pop Rock, with a subtle influence of grunge influence. The mood of the track is energetic, rebellious, and carries a slight edge of aggression. The combination of the distorted guitars and the driving drum beat creates a sense of urgency and excitement. The production quality is relatively raw and unpolished, which further enhances the track's overall energy. This song would be well-suited for a high-energy scene in a film, a rock radio station, or a live performance in a small venue.

`majestrino__03004450.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	84.7%
2	`Female speech, woman speaking`	43.1%
3	`Narration, monologue`	34.6%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural articulation, fluent, casual speaking style, modal voice, slightly breathy speaking

laion/BUD-E-Whisper_V1.2 — speech caption:

A young adult female speaker delivers a highly energetic and expressive monologue, radiating strong elation, hope, and optimism, interwoven with moderate interest and a touch of amusement. The speech is delivered at a fast tempo with a high, dynamic pitch range, showcasing precise articulation and a loud, dynamic delivery. The voice possesses a bright, clear timbre, leaning towards a female soprano register, with a slightly shiny quality and a mild breathiness. A subtle nasal touch and a hint of tension contribute to a slightly rough texture, while resonance is primarily head-mixed, resulting in a somewhat thin vocal weight. The voice exhibits mild wear but remains mostly natural, with a slight wobble in stability. The airflow is neutral, and the articulation is crisp.

The recording quality is exceptional, captured in a quiet, studio-like environment with absolutely no background noise, ensuring pristine audio clarity. The overall listening experience is pleasant, driven by the speaker's infectious energy and the high-quality audio. The delivery feels natural and spontaneous, akin to a professional voice actor performing a script, with a standard American accent. The speaker's emotional state evolves throughout the recording, maintaining a consistent core of elation and optimism, punctuated by moments of

`majestrino__03002030.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	72.8%
2	`Silence`	26.4%
3	`Animal`	14.1%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio contains speech from a female speaker. The speech is clear and understandable, with a moderate pace and a neutral tone. The audio quality is good, with no noticeable background noise. The audio contains speech, as indicated by the hint. The speaker's voice is clear and understandable.

`music__suno_audio_298525_3_1855480.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Music`	90.5%
2	`Music for children`	17.7%
3	`Singing`	8.8%

Route: music

laion/music-whisper — music caption:

The listener hears a piece of music that commences with a straightforward acoustic guitar melody. The primary vocal element is provided by a female vocalist. The vocal profile is characterized by a clear and slightly breathy timbre. There are no non-lyrical vocal sounds present. The genre of the music is categorized as either Indie Folk or Singer-Songwriter. The overall mood of the piece is melancholic and introspective. The combination of the simple instrumentation and the vocalist's clear delivery contributes to a sense of intimacy and vulnerability within the music. This track would be well-suited for a quiet, reflective scene in a film, or for a personal listening experience.

`majestrino__03004481.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Sigh`	91.0%
2	`Speech`	72.6%
3	`Gasp`	7.8%

Route: speech

laion/voice-tagging-whisper — voice tags:

Suitable for Work, natural-SFW, fluent, casual speaking style, slack voice, neutral airflow, quiet, flat intonation, neutral articulation, sighing delivery

laion/BUD-E-Whisper_V1.2 — speech caption:

A female voice, likely in her 30s or 40s, delivers a melancholic and reflective monologue in English with a neutral American accent. The recording boasts high quality, captured in a quiet, studio-like environment with no discernible background noise, contributing to a clear and professional sound. The speaker's tone is consistently soft and breathy, conveying a sense of sadness, longing, and a hint of bitterness. Initially, the delivery is slow and deliberate, with a low pitch and a slightly shaky timbre, suggesting vulnerability and emotional weight. A noticeable sigh punctuates the speech, emphasizing the speaker's distress. The articulation is neutral, and the airflow is quiet, contributing to a sense of subdued emotion. The voice possesses a female mezzo-soprano quality, moderately bright and clear, with a slight nasality and a touch of tension. The resonance is primarily head-mixed, lending a light vocal weight. The overall impression is one of naturalness and stability, though with a subtle wobble. As the monologue progresses, the emotional intensity remains consistent, with the speaker maintaining a slow, deliberate pace and a low pitch. The voice retains its breathy quality, further emphasizing the melancholic mood

`majestrino__03002116.flac`

Source dataset: TTS-AGI/majestrino-unified-detailed-captions-temporal

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Speech`	77.8%
2	`Music`	63.9%
3	`Silence`	33.0%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio contains a male voice speaking. The speech is clear and articulate, with a moderate pace and a neutral tone. The recording quality is good, with minimal background noise. The audio is a recording of a male speaker, likely delivering information or engaging in a conversation. The hint confirms the presence of speech.

`music__suno_audio_332200_3_1838718.mp3`

Source dataset: laion/captioned-ai-music-snippets

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Music`	72.4%
2	`Cheering`	13.5%
3	`Whoop`	7.4%

Route: music

laion/music-whisper — music caption:

The listener hears a track that immediately establishes a heavy, distorted guitar riff as its foundation. A driving drum beat soon joins, characterized by a prominent snare drum and a heavy kick drum, providing a strong rhythmic backbone. The tempo is approximately 140 beats per minute, contributing to the track's energetic feel. The vocals are delivered by a male vocalist, likely a young adult, employing a harsh, screamed vocal style. The timbre and quality of the vocals are aggressive, distorted, and raw, enhancing the overall intensity. Interspersed within the vocal performance are non-lyrical sounds, including growls and screams, further amplifying the aggressive nature of the track. The music is classified as Metalcore or Nu-Metal, with a strong emphasis on aggression and intensity. The mood conveyed is dark, angry, and rebellious. The combination of distorted guitars, heavy drums, and screamed vocals are all characteristic elements of the genre. The production quality is raw and powerful, which amplifies the intensity of the music. This track would be well-suited for a high-energy action scene, a mosh pit, or a video game soundtrack, given its aggressive and intense nature.

`audio_203313_391226.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Waves, surf`	38.2%
2	`Ocean`	34.8%
3	`Speech`	22.0%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio captures the sounds of a large vehicle, likely a bus or truck, including engine noise, air brakes, and the distinct hiss of air brakes. The soundscape suggests an urban or industrial environment, possibly a bus stop or a large commercial vehicle, with the characteristic sounds of its air brakes and the hiss of air brakes.

`audio_206738_390405.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Chink, clink`	40.3%
2	`Music`	12.8%
3	`Tubular bells`	9.0%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features a continuous, high-pitched whirring sound, characteristic of a vacuum cleaner. The sound is consistent and sustained, indicating the operation of a motorized device. There are no other distinct sounds present. This is the sound of a vacuum cleaner in operation. The continuous nature of the sound suggests it is running steadily, likely for cleaning purposes.

`audio_206747_390848.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Ding`	80.3%
2	`Clang`	44.5%
3	`Ding-dong`	2.3%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features a high-pitched, sustained electronic tone that gradually fades out. The sound is pure and consistent in its frequency and amplitude, without any discernible modulation or additional elements. This sound is characteristic of a digital alert, a test tone, or a simple electronic signal. It could be used as a simple notification, a system sound, or a component of a larger electronic device.

`audio_238707_393020.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Door`	12.7%
2	`Silence`	7.2%
3	`Thunk`	5.1%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

A whoosh sound followed by a metallic clang. This sound suggests a rapid movement of air or an object, immediately followed by a metallic impact, possibly from a projectile hitting metal or a heavy object falling onto a metal surface.

`audio_312332_389459.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Water tap, faucet`	88.0%
2	`Sink (filling or washing)`	87.6%
3	`Water`	79.8%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio begins with a distinct mechanical whirring sound, followed by a series of rapid, high-pitched clicks or clacks, and then a final, softer mechanical thud. This sequence repeats multiple times. The sounds suggest the operation of a mechanical device, possibly a printer or a similar office machine, where internal components are moving, engaging, and then settling into place.

`audio_315277_393270.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Biting`	58.3%
2	`Chewing, mastication`	53.7%
3	`Crunch`	34.5%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features the distinct sound of a squeaky wheel, accompanied by the rustling of fabric. The squeaky wheel sound is prominent, suggesting movement over a surface. The rustling could be from clothing or paper, and the squeaking might be from a door or a chair.

`audio_360573_395201.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Pink noise`	46.9%
2	`Rain`	33.5%
3	`Rain on surface`	14.7%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio captures the distinct sound of a large vehicle, likely a truck, in operation, characterized by its engine noise and the sound of air brakes. The sound suggests the presence of heavy machinery or a large vehicle, possibly in an industrial or transportation context, indicating movement or a busy environment.

`audio_389654_399191.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Insect`	88.2%
2	`Cricket`	87.0%
3	`Bird`	1.9%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features a variety of bird vocalizations, including chirps, calls, and possibly some squawks. The sounds are varied in pitch and rhythm, suggesting multiple birds or a single bird. This is a recording of birds in their natural environment, likely a garden, park, or forest, where birds are actively communicating. The variety and variety of calls suggest a diverse bird population.

`audio_391770_402152.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Sine wave`	49.0%
2	`Beep, bleep`	27.2%
3	`Chirp tone`	9.2%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

The audio features a single, distinct, high-pitched electronic beep. The beep is short and sharp, with a clear, electronic timbre. This sound is characteristic of an electronic alert or notification, possibly from a digital device, a timer, or a simple electronic gadget.

`audio_41515_397745.mp3`

Source dataset: laion/freesound-commercially-permissive-subset-with-captions

AudioSet top-3 predictions (MIT AST):

#	Label	Confidence
1	`Vehicle`	62.8%
2	`Field recording`	50.5%
3	`Train`	16.8%

Route: sfx

laion/sound-effect-captioning-whisper — sound caption:

A vehicle passing by, with engine noise and tire sounds, and a distinct whoosh. The audio captures the sound of a vehicle, likely a car or truck, passing by. The engine noise is prominent, indicating it is moving at a moderate speed. The sound includes the distinct whoosh of air as it passes, and the Doppler effect as it moves past the listener.

Components and credits

This pipeline only routes audio between four pre-existing LAION Whisper captioning models, using a pre-existing AST classifier as the router. All credit goes to the upstream authors.

Component	Source	License
AudioSet router	MIT/ast-finetuned-audioset-10-10-0.4593 — Audio Spectrogram Transformer fine-tuned on AudioSet-2M (mAP ≈ 0.459)	BSD-3-clause
Sound-effect captioner	laion/sound-effect-captioning-whisper	Apache-2.0
Music captioner	laion/music-whisper	CC-BY-4.0
Voice tagger	laion/voice-tagging-whisper	Apache-2.0
Speech captioner	laion/BUD-E-Whisper_V1.2	CC-BY-4.0
AudioSet ontology / labels	audioset/ontology, Google AudioSet	CC-BY-4.0
Whisper feature extractor	openai/whisper-small	MIT

The aggregate pipeline therefore inherits the CC-BY-4.0 license (the most restrictive of the bundled components) and requires attribution to LAION, MIT CSAIL (AST authors), and Google AudioSet when distributed.

Notes & limitations

30-second / 10-second windows. All four Whisper-Small captioners ingest a fixed 30 s log-mel spectrogram. The AST router uses a fixed 10 s window of input audio (its native input length). Both windows start at t = 0, so for any file longer than its window, only the first 30 s is captioned and only the first 10 s is used for routing.
Speech vs. Singing. "Singing" is treated as music by default (the music captioner is much better at describing singing than the voice-tagging models). If you need a singer's voice attributes instead, force the route by editing router.py or by passing the file through laion/voice-tagging-whisper directly. Lyrics require a separate ASR model, because none of the bundled models transcribe.
No transcription. None of these Whisper models perform speech recognition — they all output captions / tags. Use openai/whisper-* if you want ASR.
Top-1 routing only. The router uses only the top-1 AudioSet class. For mixed-content files where the second- or third-ranked class describes the clip better, the audioset_top3 block in the sidecar JSON lets you re-route in post-processing.
License heterogeneity. Two of the four bundled Whisper models are Apache-2.0 (sound-effect, voice-tagging) and two are CC-BY-4.0 (music, BUD-E). The aggregate is CC-BY-4.0.
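
Two of the notes above (fixed windows starting at t = 0, and top-1 routing) can be illustrated with a minimal sketch. It is not taken from pipeline.py or router.py. The helper names `leading_window` and `reroute`, the `label` / `score` keys inside `audioset_top3`, the two label sets, the 0.20 threshold, and the 16 kHz sample rate in the demo are all assumptions made for illustration; check them against a real sidecar JSON before relying on them.

```python
from typing import Dict, List, Sequence

ROUTER_WINDOW_S = 10.0    # AST router window, as stated in the notes
CAPTION_WINDOW_S = 30.0   # Whisper captioner window, as stated in the notes


def leading_window(samples: Sequence[float], sample_rate: int, seconds: float) -> Sequence[float]:
    """Return the samples inside the first `seconds` of audio (window starts at t = 0)."""
    return samples[: int(round(seconds * sample_rate))]


# Hypothetical label sets; the real label-to-route mapping lives in router.py.
SPEECH_LABELS = {"Speech"}
MUSIC_LABELS = {"Music", "Singing"}


def reroute(top3: List[Dict], min_score: float = 0.20) -> str:
    """Pick a route from all top-3 predictions instead of only the top-1.

    `top3` is assumed to look like [{"label": "Speech", "score": 0.22}, ...].
    The first speech or music label scoring at least `min_score` wins;
    otherwise the clip stays on the sfx route.
    """
    for pred in top3:
        if pred["score"] < min_score:
            continue
        if pred["label"] in SPEECH_LABELS:
            return "speech"
        if pred["label"] in MUSIC_LABELS:
            return "music"
    return "sfx"


if __name__ == "__main__":
    # Scores copied from the audio_203313_391226.mp3 sample in this README.
    top3 = [
        {"label": "Waves, surf", "score": 0.382},
        {"label": "Ocean", "score": 0.348},
        {"label": "Speech", "score": 0.220},
    ]
    print(reroute(top3))                 # speech
    print(reroute(top3, min_score=0.5))  # sfx

    one_minute = [0.0] * (16000 * 60)    # 60 s of silence at an assumed 16 kHz
    print(len(leading_window(one_minute, 16000, ROUTER_WINDOW_S)))  # 160000
```

A threshold rule like this trades precision for recall: a low `min_score` sends more mixed clips to the speech or music captioners, so it is worth tuning on a few labelled files first.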

Citation

If you use this pipeline, please also cite the upstream models:

@inproceedings{gong2021ast,
  title     = {{AST}: {A}udio {S}pectrogram {T}ransformer},
  author    = {Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle = {Proc. Interspeech 2021},
  year      = {2021}
}

and the LAION model cards on Hugging Face:

laion/sound-effect-captioning-whisper
laion/music-whisper
laion/voice-tagging-whisper
laion/BUD-E-Whisper_V1.2
