S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Key Takeaways

  • S-SONDO is the first framework to distill general audio foundation models using only their output embeddings, yielding students up to 61 times smaller that retain up to 96% of teacher performance.
  • General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks.
  • However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices.
  • Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques.
  • Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models.
Paper Abstract

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: this https URL.

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
General audio foundation models have become incredibly powerful, but they are often massive, containing hundreds of millions of parameters. This size makes them difficult to run on edge devices like mobile phones or embedded hardware. While "knowledge distillation"—a technique used to compress large models into smaller, more efficient ones—is a popular solution, most existing methods for audio rely on supervised learning, which requires class labels or specific model architectures. S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models) addresses this by providing the first framework to distill audio models using only their output embeddings, making it compatible with a wider range of modern, self-supervised models.

How S-SONDO Works

S-SONDO operates by aligning the output of a small "student" model with the output of a large "teacher" model. Because the teacher and student often have different latent space dimensions, the framework uses a mapping head to project the student’s embeddings into the teacher’s space. Once aligned, the student is trained to minimize the difference between its projected embeddings and the teacher’s embeddings. By using cosine similarity as the default loss function, the student learns to capture the rich, semantically structured information already present in the teacher’s representation, without needing any labels or specific architectural requirements.
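As a concrete illustration, here is a minimal PyTorch sketch of that alignment step. The embedding dimensions, the single linear layer as the mapping head, and the `1 - cosine` formulation are assumptions made for illustration; the article states only that a mapping head projects student embeddings into the teacher's space and that cosine similarity is the default loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed embedding sizes for illustration; not values from the paper.
STUDENT_DIM, TEACHER_DIM = 384, 1024

# Mapping head: projects student embeddings into the teacher's latent space.
# A single linear layer is an assumption; the paper does not specify its form here.
mapping_head = nn.Linear(STUDENT_DIM, TEACHER_DIM)

def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between projected student and teacher embeddings."""
    projected = mapping_head(student_emb)                       # (B, TEACHER_DIM)
    cos = F.cosine_similarity(projected, teacher_emb, dim=-1)   # (B,)
    return (1.0 - cos).mean()

# Stand-in batch; in training these would come from the student and the frozen
# teacher run on the same audio clips, with the teacher's outputs detached.
student_emb = torch.randn(8, STUDENT_DIM)
teacher_emb = torch.randn(8, TEACHER_DIM)
loss = distillation_loss(student_emb, teacher_emb.detach())
loss.backward()
```

One appeal of a cosine objective in this setting is that it matches the direction of the embeddings rather than their magnitude, which fits representation spaces where semantic similarity is typically measured by angle.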

Enhancing Training with Balanced Data Sampling

A key challenge in training models on large, unlabeled datasets is ensuring the model learns from a diverse range of audio. To address this, the researchers introduced a Balanced Data Sampling (BDS) strategy. Since ground-truth labels are unavailable in a self-supervised setting, the team used clustering to group teacher embeddings and assigned sampling weights based on the frequency of these clusters. This ensures that the student model is exposed to a more balanced distribution of audio concepts, which helps prevent the model from overfitting or collapsing during the training process.
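The following sketch shows one way such clustering-based weighting could be implemented, assuming scikit-learn k-means over precomputed teacher embeddings and PyTorch's WeightedRandomSampler. The cluster count, the inverse-frequency weighting rule, and the file name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from torch.utils.data import WeightedRandomSampler

# Hypothetical setup: teacher_embs is an (N, D) array of precomputed teacher
# embeddings, and N_CLUSTERS is an assumed hyperparameter.
N_CLUSTERS = 100
teacher_embs = np.load("teacher_embeddings.npy")  # assumed precomputed file

# 1. Cluster the teacher embeddings as a proxy for the unavailable labels.
cluster_ids = KMeans(n_clusters=N_CLUSTERS, random_state=0).fit_predict(teacher_embs)

# 2. Weight each sample inversely to its cluster's frequency, so audio from
#    rare clusters is drawn about as often as audio from common ones.
cluster_sizes = np.bincount(cluster_ids, minlength=N_CLUSTERS)
weights = 1.0 / cluster_sizes[cluster_ids]

# 3. Sample training examples according to these weights.
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(weights),
    replacement=True,
)
# Pass `sampler=sampler` to the DataLoader over the unlabeled audio dataset.
```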

Performance and Efficiency

The researchers tested S-SONDO by distilling two large foundation models (M2D and MATPAC++) into three different student architectures. The approach proved highly effective: the distilled students were up to 61 times smaller than their teachers while retaining up to 96% of teacher performance. In most cases, the students trained with S-SONDO actually outperformed models trained through traditional supervised methods. This demonstrates that embedding-based distillation is a viable and powerful way to bring high-performance audio intelligence to resource-constrained devices.

Future Directions

While S-SONDO is effective, the researchers note that the current clustering-based sampling method can be further refined. Future work will focus on developing more advanced clustering techniques that can better capture complex semantic structures, especially for multi-label audio tasks. Additionally, there is potential to improve the training process by combining the current alignment approach with contrastive objectives, which could help the student model better distinguish between semantically similar audio samples.
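The paper only names contrastive objectives as a direction for future work. Purely as an illustration of what such an objective could look like, here is a standard InfoNCE-style loss over a batch, in which each projected student embedding treats its own teacher embedding as the positive and the rest of the batch as negatives; the temperature is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def info_nce(projected_student: torch.Tensor,
             teacher_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull each projected student embedding toward its own
    teacher embedding, push it away from the other samples in the batch."""
    s = F.normalize(projected_student, dim=-1)   # (B, D)
    t = F.normalize(teacher_emb, dim=-1)         # (B, D)
    logits = s @ t.T / temperature               # (B, B) pairwise similarities
    targets = torch.arange(s.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Compared with a pure cosine-alignment loss, a batch-wise objective like this adds explicit repulsion between non-matching pairs, which is why it could help a student separate semantically similar audio samples.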
