Dynamic Token Pruning via Semantic-aware Distillation for Efficient Transformer Compression
Published 2024-11-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Transformer-based models have achieved state-of-the-art performance across various domains, but they often suffer from high computational cost due to the quadratic complexity of self-attention. In this paper, we introduce a Dynamic Token Pruning framework that integrates Semantic-aware Distillation to compress large-scale Transformers effectively. Our method identifies and removes redundant tokens at each layer based on their contribution to the final prediction, guided by a teacher model trained on full sequences. A lightweight gating module evaluates token importance dynamically, enabling adaptive pruning without retraining from scratch. Experiments on image classification (DeiT, ViT) and natural language understanding (BERT, RoBERTa) show that our approach achieves up to a 50% reduction in FLOPs with less than a 1% drop in accuracy. Furthermore, ablation studies validate the effectiveness of semantic guidance in retaining critical contextual information. This work provides a scalable, model-agnostic solution for deploying Transformer models on resource-constrained devices.
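
To make the gating idea concrete, the following is a minimal sketch, not the paper's released implementation: a small MLP scores each token, and only the top-scoring fraction is forwarded to the next layer, with the [CLS] token always retained. The class name TokenGate, the keep_ratio parameter, the MLP scorer, and the hard top-k selection are illustrative assumptions; training such a gate end to end would in practice require a differentiable relaxation (e.g., Gumbel-based selection) or the semantic distillation signal described in the abstract.

```python
# Hypothetical sketch of a per-layer token gating module (assumed PyTorch-style ViT pipeline).
import torch
import torch.nn as nn


class TokenGate(nn.Module):
    """Scores tokens with a small MLP and keeps only the most important ones."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); token 0 is assumed to be [CLS] and is always kept.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.scorer(patches).squeeze(-1)            # (batch, num_tokens - 1)
        num_keep = max(1, int(patches.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=1).indices      # indices of the highest-scoring tokens
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        pruned = torch.gather(patches, 1, keep_idx)          # (batch, num_keep, dim)
        return torch.cat([cls_tok, pruned], dim=1)


if __name__ == "__main__":
    gate = TokenGate(dim=384, keep_ratio=0.5)
    tokens = torch.randn(2, 197, 384)    # e.g. a ViT-S sequence: 196 patch tokens + [CLS]
    print(gate(tokens).shape)            # torch.Size([2, 99, 384])
```

In this sketch the keep ratio is fixed per layer; a dynamic variant along the lines the abstract describes would condition the number of retained tokens on the gate's scores rather than on a constant.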