S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
General audio foundation models have become incredibly powerful, but they are often massive, containing hundreds of millions of parameters. This size makes them difficult to run on edge devices like mobile phones or embedded hardware. While "knowledge distillation"—a technique used to compress large models into smaller, more efficient ones—is a popular solution, most existing methods for audio rely on supervised learning, which requires class labels or specific model architectures. S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models) addresses this by providing the first framework to distill audio models using only their output embeddings, making it compatible with a wider range of modern, self-supervised models.
How S-SONDO Works
S-SONDO operates by aligning the output of a small "student" model with the output of a large "teacher" model. Because the teacher and student often have different latent space dimensions, the framework uses a mapping head to project the student’s embeddings into the teacher’s space. Once aligned, the student is trained to minimize the difference between its projected embeddings and the teacher’s embeddings. By using cosine similarity as the default loss function, the student learns to capture the rich, semantically structured information already present in the teacher’s representation, without needing any labels or specific architectural requirements.
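The alignment step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the single-linear-layer mapping head, and the function name `cosine_distill_loss` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the article does not specify the actual dimensions.
STUDENT_DIM, TEACHER_DIM, BATCH = 256, 1024, 8

# Mapping head: here assumed to be a single linear projection
# from the student's latent space into the teacher's.
W = rng.standard_normal((STUDENT_DIM, TEACHER_DIM)) * 0.02

def cosine_distill_loss(student_emb, teacher_emb, W):
    """Mean (1 - cosine similarity) between projected student
    embeddings and teacher embeddings."""
    projected = student_emb @ W                                   # (B, TEACHER_DIM)
    p = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    cos = np.sum(p * t, axis=1)                                   # per-sample similarity
    return float(np.mean(1.0 - cos))

student = rng.standard_normal((BATCH, STUDENT_DIM))
teacher = rng.standard_normal((BATCH, TEACHER_DIM))
loss = cosine_distill_loss(student, teacher, W)
```

In training, the gradient of this loss would update both the student and the mapping head, pulling the projected student embeddings toward the teacher's; the teacher stays frozen throughout.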
Enhancing Training with Balanced Data Sampling
A key challenge in training models on large, unlabeled datasets is ensuring the model learns from a diverse range of audio. To address this, the researchers introduced a Balanced Data Sampling (BDS) strategy. Since ground-truth labels are unavailable in a self-supervised setting, the team used clustering to group teacher embeddings and assigned sampling weights based on the frequency of these clusters. This ensures that the student model is exposed to a more balanced distribution of audio concepts, which helps prevent the model from overfitting or collapsing during the training process.
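One plausible reading of the weighting scheme above is inverse-frequency weighting over the clusters: rare clusters get upsampled, common ones downsampled. The sketch below assumes that interpretation (the article only says weights are "based on the frequency of these clusters") and uses toy cluster labels in place of real k-means assignments over teacher embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cluster assignments standing in for k-means labels over
# teacher embeddings: a heavily imbalanced 90/9/1 split.
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)

def balanced_sampling_weights(cluster_labels):
    """Inverse-frequency weights so each cluster receives roughly
    equal sampling mass, regardless of its size."""
    counts = np.bincount(cluster_labels)
    w = 1.0 / counts[cluster_labels]   # each sample weighted by 1 / |its cluster|
    return w / w.sum()                 # normalize to a probability distribution

weights = balanced_sampling_weights(labels)

# Draw a training epoch's worth of indices with these weights:
# each of the three clusters now contributes about a third of the draws.
draw = rng.choice(len(labels), size=30_000, p=weights)
```

With uniform sampling, cluster 2 would appear in roughly 1% of batches; with these weights its total sampling mass equals that of the 90-sample cluster, which is what keeps the student from seeing only the dominant audio concepts.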
Performance and Efficiency
The researchers tested S-SONDO by distilling two large foundation models (M2D and MATPAC++) into three different student architectures. The results were highly effective: the distilled students were up to 61 times smaller than their teachers while retaining up to 96% of the teacher's performance. In most cases, the students trained using S-SONDO actually outperformed models trained through traditional supervised methods. This demonstrates that embedding-based distillation is a viable and powerful way to bring high-performance audio intelligence to resource-constrained devices.
Future Directions
While S-SONDO is effective, the researchers note that the current clustering-based sampling method can be further refined. Future work will focus on developing more advanced clustering techniques that can better capture complex semantic structures, especially for multi-label audio tasks. Additionally, there is potential to improve the training process by combining the current alignment approach with contrastive objectives, which could help the student model better distinguish between semantically similar audio samples.
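To make the proposed combination concrete: a contrastive objective of the kind hinted at here is often an InfoNCE-style loss, where each student embedding must match its own teacher embedding against the other samples in the batch. The sketch below is purely illustrative of that idea; it is not part of S-SONDO, and the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce(student_proj, teacher_emb, temperature=0.07):
    """InfoNCE over a batch: each projected student embedding should be
    most similar to its own teacher embedding (the diagonal of the
    similarity matrix), with other batch samples acting as negatives."""
    s = student_proj / np.linalg.norm(student_proj, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                 # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # cross-entropy on the diagonal

rng = np.random.default_rng(2)
emb = rng.standard_normal((4, 16))
matched = info_nce(emb, emb)                # positives perfectly aligned: low loss
shuffled = info_nce(emb, emb[::-1].copy())  # positives misaligned: high loss
```

Unlike the pure cosine alignment loss, this objective explicitly pushes apart embeddings of different samples, which is why it could help the student separate semantically similar audio.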