Dynamic Token Pruning via Semantic-aware Distillation for Efficient Transformer Compression
Published 2024-11-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Transformer-based models have achieved state-of-the-art performance across various domains, but they often suffer from high computational cost due to the quadratic complexity of self-attention. In this paper, we introduce a Dynamic Token Pruning framework that integrates Semantic-aware Distillation to compress large-scale Transformers effectively. Our method identifies and removes redundant tokens at each layer based on their contribution to the final prediction, guided by a teacher model trained on full sequences. A lightweight gating module evaluates token importance dynamically, enabling adaptive pruning without retraining from scratch. Experiments on image classification (DeiT, ViT) and natural language understanding (BERT, RoBERTa) show that our approach achieves up to a 50% reduction in FLOPs with less than a 1% drop in accuracy. Furthermore, ablation studies validate the effectiveness of semantic guidance in retaining critical contextual information. This work provides a scalable, model-agnostic solution for deploying Transformer models on resource-constrained devices.
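
To make the gating idea concrete, the following is a minimal sketch, not the paper's released implementation: a small MLP scores each token, and only the top-scoring fraction is forwarded to the next layer, with the [CLS] token always retained. The class name TokenGate, the keep_ratio parameter, the MLP scorer, and the hard top-k selection are illustrative assumptions; training such a gate end to end would in practice require a differentiable relaxation (e.g., Gumbel-based selection) or the semantic distillation signal described in the abstract.

```python
# Hypothetical sketch of a per-layer token gating module (assumed PyTorch-style ViT pipeline).
import torch
import torch.nn as nn


class TokenGate(nn.Module):
    """Scores tokens with a small MLP and keeps only the most important ones."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); token 0 is assumed to be [CLS] and is always kept.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.scorer(patches).squeeze(-1)            # (batch, num_tokens - 1)
        num_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=1).indices      # indices of the highest-scoring tokens
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        pruned = torch.gather(patches, 1, keep_idx)          # (batch, num_keep, dim)
        return torch.cat([cls_tok, pruned], dim=1)


if __name__ == "__main__":
    gate = TokenGate(dim=384, keep_ratio=0.5)
    tokens = torch.randn(2, 197, 384)    # e.g. a ViT-S sequence: 196 patch tokens + [CLS]
    print(gate(tokens).shape)            # torch.Size([2, 99, 384])
```

In this sketch the keep ratio is fixed per layer; a dynamic variant along the lines the abstract describes would condition the number of retained tokens on the gate's scores rather than on a constant.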