Vision Transformer Enhanced by Contrastive Learning: A Self-Supervised Strategy for Pulmonary Tuberculosis Diagnosis
Widia Marlina, Umar Zaky



Introduction

A novel Vision Transformer (ViT) enhanced with self-supervised learning improves pulmonary tuberculosis diagnosis from chest X-ray (CXR) images, achieving high recall, strong generalization, and scalability for global health efforts.


Abstract

Tuberculosis (TB) diagnosis from Chest X-ray (CXR) images poses a significant challenge in radiology due to inherent data imbalance and subtle lesion heterogeneity. These factors cause traditional deep learning models, such as standard CNNs and conventional Vision Transformers (ViT), to exhibit poor generalization and inadequate sensitivity (recall) for the minority TB class. We address this critical research gap by introducing a novel methodology: an enhanced ViT architecture that leverages Self-Supervised Learning (SSL) via the SimCLR framework, subsequently optimized with an Adaptive Weighted Focal Loss. Our primary objective was to develop a generalizable model that minimizes false negatives without sacrificing overall precision, thereby establishing a new performance benchmark for automated TB detection. The methodology conceptually separates feature learning, performed via SSL pre-training on unlabeled data to generate robust and domain-invariant features, from classification optimization. During fine-tuning, the Adaptive Weighted Focal Loss mechanistically counters majority-class gradient dominance. We validated this approach using K-Fold Cross-Validation. The final ViT SSL Weighted model achieved a peak internal accuracy of 0.9861 and an AUPRC of 0.9781. Crucially, it maintained generalization stability when externally tested on the TBX11K dataset, securing an AUPRC of 0.9795 and a high recall of 0.9527. This minimal variance strongly confirms the reproducibility and robustness of our features against institutional variation. The resulting high recall directly translates to enhanced diagnostic decision-making, significantly lowering the clinical risk associated with a missed TB diagnosis. This study establishes an effective, stable, and generalizable SSL-based ViT framework, offering a scalable solution for public health efforts in resource-constrained settings.
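The abstract names an Adaptive Weighted Focal Loss but does not give its exact form. A minimal NumPy sketch of a common weighted focal loss formulation (an assumption for illustration, not the authors' implementation) shows how per-class weights and the focal term jointly counter majority-class gradient dominance:

```python
import numpy as np

def weighted_focal_loss(probs, labels, alpha, gamma=2.0):
    """Weighted focal loss for binary classification (illustrative sketch).

    probs  : predicted probability of the positive (TB) class, shape (N,)
    labels : ground-truth labels in {0, 1}, shape (N,)
    alpha  : per-class weights [alpha_neg, alpha_pos]; up-weighting the
             minority TB class counters majority-class dominance
    gamma  : focusing parameter; gamma=0 recovers weighted cross-entropy
    """
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(labels == 1, probs, 1 - probs)
    a_t = np.where(labels == 1, alpha[1], alpha[0])
    # (1 - p_t)^gamma down-weights easy, well-classified examples,
    # so the gradient concentrates on hard minority-class cases
    return np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t))
```

With `gamma=0` and equal `alpha`, this reduces to standard cross-entropy; increasing `gamma` suppresses the contribution of confidently classified majority-class examples.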


Review

This paper presents a highly relevant and timely contribution to the field of automated medical diagnosis, specifically addressing the challenging task of pulmonary tuberculosis (TB) detection from Chest X-ray (CXR) images. The authors adeptly identify critical limitations of existing deep learning approaches—namely, poor generalization and inadequate sensitivity stemming from data imbalance and the subtle nature of TB lesions. By proposing a novel methodology that integrates an enhanced Vision Transformer (ViT) with Self-Supervised Learning (SSL) via the SimCLR framework, the study targets the minimization of false negatives, a crucial clinical objective. The overall impression is one of a well-conceived and potentially impactful strategy designed to establish a new performance benchmark for TB diagnosis.

The methodological design is robust and thoughtfully constructed. The core innovation lies in separating feature learning, performed through SSL pre-training on unlabeled data to generate robust and domain-invariant features, from the subsequent classification optimization. This strategic disentanglement is key to overcoming the domain shift and data scarcity often encountered in medical imaging. Furthermore, the incorporation of Adaptive Weighted Focal Loss during the fine-tuning phase is a judicious choice, directly confronting the majority-class gradient dominance that commonly plagues imbalanced datasets. The use of K-Fold Cross-Validation for internal assessment and, crucially, external validation on the TBX11K dataset provides strong evidence for the model's reliability and generalizability, a frequently overlooked aspect in many deep learning studies.

The reported results are highly compelling and underscore the efficacy of the proposed ViT SSL Weighted model. The peak internal accuracy of 0.9861 and AUPRC of 0.9781 are strong indicators of internal performance. More importantly, the model's performance on the external TBX11K dataset, yielding an AUPRC of 0.9795 and an impressive recall of 0.9527 with minimal variance, speaks volumes about its robustness and reproducibility across institutional variations. This high recall is particularly significant, as it directly translates to a reduced clinical risk associated with missed TB diagnoses, enhancing diagnostic decision-making. The study successfully establishes an effective, stable, and generalizable SSL-based ViT framework, positioning it as a scalable solution with considerable potential for public health initiatives, especially in resource-constrained settings where access to expert radiologists is limited.
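The SimCLR framework the study builds on is defined by its NT-Xent contrastive objective: two augmented views of the same image form a positive pair, and all other views in the batch serve as negatives. A minimal NumPy sketch of that loss (the standard SimCLR formulation, not the paper's own code, with a hypothetical batch layout where rows i and i+N are the two views of image i) illustrates the pre-training signal:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent contrastive loss as used in SimCLR (minimal NumPy sketch).

    z : embeddings of shape (2N, D), where rows i and i+N are the two
        augmented views of the same CXR image (assumed layout).
    """
    n2 = z.shape[0]
    n = n2 // 2
    # cosine similarities between all pairs of views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # mask out self-similarity so a view is never its own candidate
    np.fill_diagonal(sim, -np.inf)
    # index of each row's positive partner (the other augmented view)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # softmax cross-entropy: positive pair vs. the other 2N-2 candidates
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(n2), pos])
```

Minimizing this loss pulls the two views of each image together while pushing apart views of different images, which is what yields label-free, domain-invariant features before any TB annotations are used.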


Full Text

The full text of this article is available from Jurnal Teknokes.
