Hybrid Vision Transformer–ConvNeXt Architecture with Multi-Task Focal Loss and Medical Test-Time Augmentation for Image-Based Skin Lesion Classification
DOI: https://doi.org/10.63643/jodens.v5i2.325
Keywords: Skin lesion classification, Vision Transformer, ConvNeXt, Focal Loss, Test-Time Augmentation, HAM10000
Abstract
Classifying skin lesions from dermatoscopic images is challenging in dermatology because lesion types vary widely in appearance and the class distribution in the dataset is imbalanced. This study proposes a Hybrid Vision Transformer–ConvNeXt architecture that combines the global attention of the Vision Transformer (ViT) with the spatial feature representation of ConvNeXt to improve skin lesion classification on the HAM10000 dataset. To address class imbalance, the study applies a Multi-Task Focal Loss, an auxiliary classifier, and a Weighted Random Sampler. At inference, Medical Test-Time Augmentation (TTA) is used to stabilize predictions. The model is trained with a two-stage strategy (head training followed by full fine-tuning) and optimized with AdamW and Cosine Annealing Warm Restarts. The proposed model achieves a validation F1-score of 0.8723, which rises to 0.90 after TTA, surpassing the standalone ViT and ConvNeXt baselines. These findings indicate that integrating ViT–ConvNeXt with the focal-loss strategy and medical TTA can significantly improve skin lesion classification performance and has potential for use as a clinical diagnosis support system.
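The abstract names a Multi-Task Focal Loss and test-time augmentation without giving their formulas. As a rough illustration only, the pure-Python sketch below shows one common way these two ideas are written: the standard focal loss, -alpha * (1 - p_t)^gamma * log(p_t), summed over a main and an auxiliary head, and TTA as the mean of softmax probabilities over augmented views. The values of gamma, alpha, and aux_weight, the flip augmentations, and the toy_model stand-in are all assumptions made for this example and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def multi_task_focal_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main-head loss plus a down-weighted auxiliary-head loss (assumed form)."""
    return (focal_loss(main_logits, target)
            + aux_weight * focal_loss(aux_logits, target))

def tta_predict(predict_logits, image, augmentations):
    """Average softmax probabilities over augmented views of one image.

    Returns (predicted class index, averaged probabilities).
    """
    views = [aug(image) for aug in augmentations]
    probs = [softmax(predict_logits(v)) for v in views]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean

# Hypothetical augmentations on an image stored as a list of pixel rows.
def identity(img):
    return img

def hflip(img):
    return [row[::-1] for row in img]

def vflip(img):
    return img[::-1]

def toy_model(img):
    """Hypothetical stand-in for the trained network: two logits from pixel sums."""
    left = sum(row[0] for row in img)
    right = sum(row[-1] for row in img)
    return [left, right]

image = [[0.0, 1.0], [0.0, 3.0]]
label, probs = tta_predict(toy_model, image, [identity, hflip, vflip])
loss = multi_task_focal_loss([0.5, 2.0], [0.2, 1.0], target=1)
```

With gamma set to 0 the focal loss reduces to ordinary cross-entropy; larger gamma values shrink the contribution of samples the model already classifies confidently, which is why the loss is often used on imbalanced datasets.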
License
Copyright (c) 2025 Hendry, Ferry Govert Anwar, David Chow, Andi Saputra, Muhammad Khaerul Naim Mursalim

This work is licensed under a Creative Commons Attribution 4.0 International License.









