Abstract
We propose a novel feature enhancement module for fine-grained visual classification (FGVC) tasks that can be seamlessly integrated into various backbone architectures, including both convolutional neural network (CNN)-based and Transformer-based networks. The plug-and-play module outputs pixel-level feature maps and performs a weighted fusion of filtered features to enhance fine-grained feature representation. We also introduce a class-centric loss function that pulls each sample toward the center of its target class while pushing it away from the center of the most visually similar non-target class. Soft labels are employed to mitigate overfitting, helping the model generalize to unseen examples. Our approach consistently delivers significant accuracy improvements across mainstream backbone architectures, underscoring its versatility and robustness, and achieves the highest accuracy on the NABirds (NAB) dataset and our proprietary lock cylinder dataset.
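The abstract describes the class-centric loss only at a high level; a minimal sketch of the pull/push idea might look like the following. The function name, the Euclidean distance, the hinge form, and the `margin` parameter are all assumptions for illustration, not the paper's actual formulation.

```python
import math

def class_centric_loss(embedding, label, centers, margin=0.5):
    # Hypothetical sketch: pull the sample toward its target class center
    # and push it away from the nearest (most similar) non-target center.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    pull = dist(embedding, centers[label])            # distance to target center
    push = min(dist(embedding, c)                     # nearest non-target center
               for k, c in enumerate(centers) if k != label)
    return max(0.0, pull - push + margin)             # hinged, triplet-style

# Toy usage: 2-D embeddings with three class centers.
centers = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
loss = class_centric_loss([0.2, 0.1], label=0, centers=centers)
```

A sample already much closer to its own center than to any other incurs zero loss; a sample sitting near a confusable non-target center is penalized.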
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Neural Networks and Learning Systems |
| State | Accepted/In press - 2025 |
Keywords
- Class center
- Transformer
- convolutional neural network (CNN)
- fine-grained visual classification (FGVC)
- soft label
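The "soft label" keyword refers to the smoothed targets the abstract uses against overfitting. A minimal sketch, assuming standard label smoothing over a softmax (the function name and the `smoothing` value are illustrative, not taken from the paper):

```python
import math

def soft_label_cross_entropy(logits, target, smoothing=0.1):
    # Hypothetical sketch: soften the one-hot target so the model is not
    # driven toward fully confident predictions, mitigating overfitting.
    n = len(logits)
    m = max(logits)                                   # stabilized softmax
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    # Smoothed target: 1 - smoothing on the true class, the rest spread
    # uniformly over the remaining classes.
    soft = [smoothing / (n - 1)] * n
    soft[target] = 1.0 - smoothing
    return -sum(t * math.log(p) for t, p in zip(soft, probs))

loss = soft_label_cross_entropy([2.0, 0.5, 0.1], target=0)
```

With `smoothing=0.0` this reduces to ordinary cross-entropy against a one-hot label.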