Your Guide to Model Distillation
Unpacking the science of creating smaller, faster, and more efficient models from larger ones. Your one-stop source for concepts, papers, and tools.
About ModelDistill.com
Model distillation is a powerful technique in machine learning for compressing large, complex models (the "teacher") into smaller, more efficient models (the "student"). The goal of ModelDistill.com is to provide a curated, high-quality collection of resources for students, researchers, and engineers. We aim to be the definitive starting point for anyone looking to learn about or apply model distillation techniques in their work.
Core Concepts
The fundamental ideas behind knowledge distillation.
Teacher-Student Paradigm
At its core, distillation involves a large, pre-trained 'teacher' model and a smaller 'student' model. The student learns to mimic the teacher's outputs, effectively inheriting its knowledge in a more compact form.
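To make the setup concrete, here is a minimal PyTorch sketch of the two roles. The layer widths, input size, and random batch are illustrative assumptions, not taken from any particular paper; in practice the teacher would be a large pre-trained network.

```python
# A minimal sketch of the teacher-student setup (toy sizes assumed).
import torch
import torch.nn as nn

teacher = nn.Sequential(  # stands in for a large, pre-trained "teacher"
    nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10)
)
student = nn.Sequential(  # the small "student" we actually want to deploy
    nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10)
)

x = torch.randn(32, 784)       # a batch of inputs (assumed data)
with torch.no_grad():          # the teacher is frozen during distillation
    teacher_logits = teacher(x)
student_logits = student(x)    # the student is trained to mimic these outputs
```

Only the student's parameters receive gradients; the teacher just supplies targets.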
Soft Targets (Logits)
Instead of just learning from hard labels (e.g., "cat" or "dog"), the student model learns from the teacher's full probability distribution over classes. These "soft targets" contain richer information about how the teacher generalizes.
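One common way to implement this is to match the student's distribution to the teacher's with a KL-divergence term, mixed with ordinary cross-entropy on the hard label. The logits, class names, and 0.5/0.5 weighting below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for a 3-class problem (cat, dog, fox).
teacher_logits = torch.tensor([[5.0, 2.5, 0.5]])
hard_label = torch.tensor([0])  # "cat"

soft_targets = F.softmax(teacher_logits, dim=-1)
# ~[[0.915, 0.075, 0.010]] -- the near-miss probability on "dog" is the
# extra information a bare hard label throws away.

student_logits = torch.tensor([[3.0, 1.0, 0.2]], requires_grad=True)

# Soft-target loss: KL divergence between the two distributions.
kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                   soft_targets, reduction="batchmean")
# Hard-label loss: ordinary cross-entropy, typically mixed in as well.
ce_loss = F.cross_entropy(student_logits, hard_label)
loss = 0.5 * kd_loss + 0.5 * ce_loss
```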
Distillation Temperature
A temperature hyperparameter scales the logits inside the softmax to soften the teacher's probability distribution. A higher temperature produces a softer distribution, revealing more about how the teacher ranks the incorrect classes relative to one another.
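The effect is easy to see numerically. A quick sketch, using the same assumed teacher logits as above:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.5, 0.5])   # assumed teacher logits

for T in (1.0, 4.0):                     # distillation temperature
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {probs}")
# T=1.0: ~[0.915, 0.075, 0.010] -- sharp, close to a hard label
# T=4.0: ~[0.538, 0.288, 0.175] -- softer, inter-class structure visible
```

In Hinton et al.'s original formulation, the soft-target term of the loss is also multiplied by T squared so that its gradient magnitude stays comparable as the temperature changes.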
Key Resources
Foundational papers and essential tools.
Model Distillation in Action
Real-world examples of how distillation creates efficient, powerful models.
Hugging Face's DistilBERT
A landmark case in NLP. The massive BERT model was distilled into DistilBERT, which is 40% smaller and 60% faster, while retaining over 97% of BERT's language understanding capabilities. This made high-performance NLP accessible for production environments.
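The distilled model is a drop-in checkpoint on the Hugging Face Hub. A minimal sketch of loading it with the transformers library (the checkpoint name is the standard public release; the input sentence is just an example):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Distillation makes BERT practical.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```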
Google's On-Device AI
Many features in Android, such as live captions and Reading Mode, rely on powerful models. Google uses distillation to shrink these models to run directly on your phone with low latency and without needing an internet connection, preserving privacy and speed.
Edge Computing & Vision
In fields like autonomous driving and smart cameras, complex computer vision models must run on small, power-efficient "edge" devices. Distillation is a key technique used to compress large, accurate vision models into smaller forms for real-time analysis.
Get In Touch
Have a question or a resource to suggest? We'd love to hear from you.
Our team of experts can help with your model distillation needs, from consultation and strategy to implementing bespoke student models for your specific use case. Reach out to see how we can help.