AIROGS: Artificial Intelligence for Robust Glaucoma Screening Challenge


Abstract:

The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.
Published in: IEEE Transactions on Medical Imaging ( Volume: 43, Issue: 1, January 2024)
Page(s): 542 - 557
Date of Publication: 15 September 2023

PubMed ID: 37713220

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

Glaucoma is one of the main causes of irreversible blindness and impaired vision in the world. It affects the optic nerve, which connects the eye with the brain, and leads to progressive visual field damage. This damage initially passes unnoticed by the patient. Only in later stages will glaucoma patients experience visual loss. According to estimates, by 2040, over 110 million people will have varying degrees of visual impairment caused by glaucoma [1], with 10% experiencing blindness in both eyes and 25% in one eye [2]. Many people experience visual impairment from glaucoma because it is often not detected until later stages [3], [4]. Current treatments of glaucoma cannot repair the damage, but can only halt or slow the progression of the condition [5]. Implementing screening programs to identify patients early on for treatment can alleviate the consequences of the disease. Artificial intelligence (AI) may be the enabling technology for the cost-effective implementation of these programs by automatically detecting perimetric glaucoma (i.e., glaucoma in which there is already visual field damage) in color fundus photographs (CFPs) [6], [7], [8], [9], [10].

Existing AI solutions have been shown to drop in performance in real-world screening practice due to comorbidities, poor quality images, different ethnicities, or unexpected out-of-distribution (OOD) samples [9]. Ad-hoc quality check modules have been added to AI solutions to overcome this performance drop, but recent research has indicated that these quality checks are not sufficiently accurate when deployed in real-world settings [11]. To allow a safe and effective deployment in screening, the reliability and robustness of such solutions need to be assessed. Medical image analysis challenges often focus exclusively on performance metrics that are potentially unrealistic and overestimated due to the use of test sets that do not represent real-world scenarios. Moreover, metrics to measure reliability and robustness are often neglected due to the difficulty of estimating them in the provided test sets.

To develop solutions that overcome the aforementioned issues related to robustness in glaucoma screening, we organized the Artificial Intelligence for RObust Glaucoma Screening (AIROGS) challenge. The goal of this challenge was to evaluate the feasibility of the development of a state-of-the-art, reliable AI solution that takes a CFP as input and provides as output the likelihood of referable glaucoma, accompanied with outputs for robustness (i.e., predicting whether the input image can be graded reliably or not). The screening task was to distinguish no referable glaucoma (i.e., either no glaucoma at all or non-referable glaucoma, that is suspected pre-perimetric glaucoma) from referable glaucoma. This is different from, for example, distinguishing between multiple glaucoma severity stages.

To encourage the development of solutions that are robust to any kind of ungradable and unexpected input data and are equipped with inherent robustness mechanisms, the training set we provided was a subset of the full AIROGS dataset where only gradable images were included and ungradable images excluded. The test set, however, is unfiltered, containing all images found in screening settings (gradable and ungradable), representing a real-world scenario.

AIROGS was part of the International Symposium on Biomedical Imaging (ISBI) 2022 challenge program. It reopened after presenting the results during ISBI 2022 and submissions can still be made on Grand Challenge.1

Our challenge, along with the dataset we made publicly available, distinguishes itself from previous glaucoma challenges. First, to the best of our knowledge, our dataset is the largest publicly available CFP dataset with glaucoma labels by a large margin. In total, our dataset contains 112,732 CFPs, exceeding the size of other publicly available datasets containing CFPs with glaucoma, of which the sizes range from 22 to 2,000 CFPs [12], [13], [14], [15], [16], [17], [18], [19], [20]. It is a highly diverse dataset as it originates from 500 screening centers across the United States of America and was acquired with a large variety of cameras. Second, the AIROGS challenge is the first challenge to emphasize robustness in glaucoma screening. Third, AIROGS is one of the first challenges on grand-challenge.org that requires participants to submit an algorithm (a Type 2 challenge), rather than a file with their predictions on the test set (a Type 1 challenge), as is done in more traditional challenges. This makes human intervention in the generation of test set results impossible, reducing the possibility of cheating. Moreover, it greatly improves reproducibility, allowing everyone to reuse the trained algorithms that were submitted and apply them to new data in a cloud-based environment. Fourth, this reproducibility enabled testing of the participating algorithms on three external datasets: two for evaluating the screening task and one for evaluating robustness.

SECTION II.

Datasets

A. The Rotterdam EyePACS AIROGS Dataset

The Rotterdam EyePACS AIROGS dataset contains 112,732 CFPs from 60,071 subjects and 500 different sites with a heterogeneous ethnicity. The images were originally acquired for a diabetic retinopathy screening program [21]. For grading of the CFPs, all graders were trained and then selected for this task using the European Optic Disc Assessment Trial (EODAT) [22], containing 110 stereoscopic optic nerve photographs, in which all glaucomatous eyes had reproducible visual field defects on standard automated perimetry. 90 experienced ophthalmologists and optometrists were examined and those who scored at least 85% overall accuracy and 92% specificity were selected to label images for the present study. Eventually, 30 out of 90 candidates passed.

For each eye, three images were taken by the camera operators to reduce the number of ungradable eyes. When labeling the images, graders classified one eye at a time. The labeling tool first presented the first CFP for each eye, after which graders could choose from the options “Referable glaucoma” (RG), “No referable glaucoma” (NRG), or “Ungradable” (U). If a grader selected U for the first image, the tool showed the next CFP. The third image was presented if the second image was also deemed U. Each eye was scored by two separate graders, who were both unaware of the identity of the other grader. If the two graders agreed on the label of a CFP, this became the final label. If they disagreed, the image was scored by one of the glaucoma specialists who passed the EODAT test with at least 95% accuracy. The final label was then based on that specialist's judgment.
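The labeling protocol described above can be summarized in a short sketch. This is only an illustration of the described logic, not the labeling tool itself: the function names and the callables `grade_image` and `ask_specialist` are hypothetical stand-ins for a human grader and a glaucoma specialist.

```python
def grade_eye(images, grade_image):
    """Present up to three CFPs of one eye in order; stop at the first gradable one.

    `grade_image` is a hypothetical callable returning "RG", "NRG", or "U".
    Returns (label, index of the labeled image), or ("U", None) if every image was U.
    """
    for index, image in enumerate(images):
        label = grade_image(image)
        if label != "U":
            return label, index
    return "U", None


def final_label(label_a, label_b, ask_specialist):
    """Two independent graders; a specialist decides only when they disagree."""
    if label_a == label_b:
        return label_a
    return ask_specialist()
```

For example, a grader who marks the first image U and the second NRG never sees the third image, and the specialist is consulted only when the two graders' labels differ.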

The graders were instructed to select RG if they found glaucomatous signs which they expected to be associated with visual field defects on standard automated perimetry. The signs that could be selected were “appearance neuroretinal rim superiorly”, “appearance neuroretinal rim inferiorly”, “baring of the circumlinear vessel superiorly”, “baring of the circumlinear vessel inferiorly”, “disc hemorrhage(s)”, “retinal nerve fiber layer defect superiorly”, “retinal nerve fiber layer defect inferiorly”, “nasalization (nasal displacement) of the vessel trunk”, “laminar dots” and “large cup”. If the graders did not expect any glaucomatous visual field defects, NRG was to be selected, ignoring any comorbidities (e.g., age-related macular degeneration and diabetic retinopathy). If there was not enough information visible in the CFP to decide between RG and NRG, graders were instructed to select U. Since the goal of this study was to develop solutions for automated screening, glaucoma severity was not reported by the human graders.

The graders were not only evaluated at the start, but they were periodically monitored during the grading process, as well. If their sensitivity or specificity dropped below 80% or 95%, respectively, they were removed from the study and all images they labeled were re-graded by any of the remaining graders. In case a grader wrongly classified a CFP as U, while its final label was NRG or RG, their specificity or sensitivity went down, respectively. In the end, 20 graders remained.

Out of the three CFPs that were available for each eye, we only included the RG or NRG photograph in the dataset if it was available. Otherwise, only one of the U photographs was used. We split the data into a training set of 101,442 CFPs and a test set of 11,290 CFPs, ensuring that data from patients in the training set was not in the test set. We randomly sampled patients when making the split, oversampling patients with ungradable and RG CFPs for the test set, such that approximately 1,600 RG and 1,600 U photographs ended up in the test set. When making the split, we ensured that data from a single patient ended up in either the training set or the test set. Since we were interested in AI solutions that can identify ungradable data without training on ungradable data, we left out all U photographs that ended up in the AIROGS training set. Table I shows statistics about RG, NRG, and U prevalence, age, sites, and cameras for the full dataset, the training set, and the test set. For further information about the acquisition process, labeling, and dataset statistics, including the prevalence of ethnicity, please refer to the paper on the REGAIS dataset, of which AIROGS is a subset [23]. The AIROGS dataset includes 99% of the data from the REGAIS dataset. The difference in size is due to the exclusion of ungradable eyes of patients in the AIROGS set, which was not done in the REGAIS set. The dataset included data from “people of African descent, Whites, Asians, Latin Americans, native Americans, people from the Indian subcontinent, people of mixed ethnicity, and people of unspecified ethnicity” [23]. Approval from the Institutional Review Board of the Rotterdam Eye Hospital was obtained to conduct this research.

TABLE I Statistics of the Rotterdam EyePACS AIROGS Dataset. # = Number of. CFPs = Color Fundus Photographs

B. External Datasets

The participants uploaded their trained algorithms, rather than a file with predictions on our test set, to our challenge platform. This enabled us to reuse the developed models on external data after the challenge ended. To evaluate model generalization and to demonstrate this reusability, we applied all trained algorithms to three external datasets: Retinal Fundus Glaucoma Challenge (REFUGE) [18], Glaucoma grAding from Multi-Modality imAges (GAMMA) [20] and Diabetic Retinopathy Image Database (DRIMDB) [24]. The former two are datasets with positive and negative glaucoma CFPs, which we used for externally evaluating the screening performance. We used the latter dataset, which contained different types of ungradable images, to evaluate the robustness externally.

The REFUGE test set contained 400 CFPs, of which 40 CFPs showed glaucoma and 360 CFPs did not. The definition of glaucoma was glaucomatous damage in the optic nerve head area and reproducible glaucomatous visual field defects, which is similar to our definition of glaucoma described earlier [18].

The GAMMA dataset is a multi-modal dataset with optical coherence tomography scans and CFPs for each eye. We used the CFP data from the 100-sample training set as only that subset of the GAMMA dataset had publicly available labels. We defined positive glaucoma in the same way as Wu et al. [20], i.e., as the union of the early, intermediate, and advanced glaucoma stages. These stages were defined using the mean deviation (MD) from the visual field reports as follows: an MD better than −6 dB for the early stage, an MD between −6 dB and −12 dB for the intermediate stage, and an MD worse than −12 dB for the advanced stage [20]. This resulted in 50 negative and 50 positive glaucoma samples.

DRIMDB is a dataset with 125 “Good” CFPs, 69 “Bad” CFPs, and 22 “Outlier” CFPs. According to Şevik et al. [24], one of the criteria of the “Good” category was OD presence. We also manually confirmed that the OD was visible in all CFPs that were labeled “Good” in the DRIMDB dataset. Therefore, we assumed the CFPs with the category “Good” were gradable. The images labeled “Bad” and “Outlier” were assumed to be ungradable. This resulted in 125 gradable and 91 ungradable images.

SECTION III.

Challenge Setup

The AIROGS challenge consisted of four phases (see Fig. 1). The Training Phase opened on the 1st of December 2021 and closed on the 4th of March 2022, providing the participants with approximately three months to develop their solutions. At the start of this phase, the training set was released and has since been available for download under the CC BY-NC-ND license on Zenodo.2

Fig. 1. - Overview of all phases in the AIROGS challenge. A world map is shown for each phase that indicates with red circles from which countries the teams that participated in that phase originated. A circle is shown for each country from which at least one team participated and its size represents the number of teams that joined from that country. The relevant subset of the AIROGS dataset for each phase is shown at the bottom of the figure. *All phases reopened for new submissions after the winning teams were announced.

To ensure fair competition and to encourage the development of inherent robustness mechanisms, teams were not permitted to use additional fundus image training data. This restriction also covered weights pre-trained on fundus image data, including those used in pre-processing steps such as OD segmentation. Manually labeling the challenge data and using the resulting annotations during training was allowed.

To have their algorithms tested, participants needed to wrap their trained algorithm in a Docker3 container and submit it to our challenge platform. This allows the submitted algorithms to be run on data that is not directly accessible by the participating teams. Example code for generating such a containerized submission can be found on GitHub.4 Preliminary Test Phase 1 opened and closed simultaneously with the Training Phase and served as a check for whether the submitted algorithms could be run on the challenge platform and produced the output in the expected format. Algorithms were tested on 10 images from the training set for this check. All algorithms were executed on the challenge platform using an NVIDIA T4 GPU (16 GB VRAM) with 8 CPUs (32 GB RAM).

The test set was and still is closed, meaning the image data and the labels are private and cannot be downloaded. Preliminary Test Phase 2 opened on the 1st of February 2022 and we allowed three submissions per team to this phase, as it used 10% of the test set for evaluation. All challenge metrics were also computed and reported back to the participants. The Final Test Phase opened simultaneously with Preliminary Test Phase 2, but algorithms were tested on 100% of the test data and only one submission per team was allowed. The challenge metrics computed for this phase were used for the final team ranking.

The algorithms were expected to produce four outputs, of which two were related to glaucoma screening performance (i.e., image classification of RG and NRG) and the other two to robustness (i.e., the identification of U). The glaucoma screening outputs were a likelihood score for RG (O_{1}) and a binary decision for RG (O_{2}, positive if RG and negative if NRG). The ungradability outputs were a binary decision on whether the image is ungradable (O_{3}, positive if ungradable and negative if gradable) and a non-thresholded scalar value that is positively correlated with the likelihood for ungradability (O_{4}, e.g., the entropy of a probability vector produced by a machine learning model or the variance of an ensemble). Output O_{2} was not used in the evaluation pipeline for the challenge leaderboard, but it was requested by the challenge organizers for further analysis.

The evaluation was also based on the two aspects of screening performance and robustness, with two metrics per aspect. Screening performance was evaluated using the standardized partial area under the receiver operating characteristic curve [25] (90-100% specificity) for RG (pAUC_{S}), and the sensitivity at 95% specificity (SE \text{@} 95SP_{S}). These metrics were based on these specificity ranges, as a high specificity is required for cost-effective glaucoma screening due to its relatively low prevalence [26], [27]. pAUC_{S} and SE \text{@} 95SP_{S} are both based on output O_{1}. For evaluating the robustness, we determined the model’s agreement with the human reference on ungradability using Cohen’s kappa score (\kappa_{U}), calculated using output O_{3}. Furthermore, we calculated the area under the receiver operating characteristic curve using the human reference for ungradability as the true labels and output O_{4} as the target scores (AUC_{U}).
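To make two of these metrics concrete, the following is a minimal pure-Python sketch of the sensitivity at a fixed specificity and of Cohen's kappa for binary labels. It is not the challenge's evaluation code. The function names are illustrative, and the convention that a case is called positive when its score is at least the threshold is an assumption.

```python
def sensitivity_at_specificity(labels, scores, target_specificity=0.95):
    """Highest sensitivity over all thresholds whose specificity meets the target.

    labels: 1 for positive (e.g., RG), 0 for negative; scores: e.g., output O_1.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    # Assumed convention: a case is called positive when score >= threshold.
    for threshold in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best


def cohen_kappa(a, b):
    """Cohen's kappa between two binary (0/1) label sequences of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (observed - expected) / (1 - expected)
```

Note that `cohen_kappa` is undefined when the chance agreement equals 1 (for instance, when both sequences contain a single class), so a production implementation would need to handle that case.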

To determine the final ranking, we first ranked all participants on the four individual metrics pAUC_{S}, SE \text{@} 95SP_{S}, \kappa_{U}, and AUC_{U}, resulting in the rankings R_{pAUC_{S}}, R_{SE \text{@} 95SP_{S}}, R_{\kappa_{U}}, and R_{AUC_{U}}, respectively. The final score S_{final} was then calculated as the mean of those rankings:
\begin{equation*} S_{final} = \frac{R_{pAUC_{S}} + R_{SE \text{@} 95SP_{S}} + R_{\kappa_{U}} + R_{AUC_{U}}}{4}. \tag{1}\end{equation*}

The final ranking (later also referred to as Mean position) was based on S_{final}, where a lower value for S_{final} resulted in a higher ranking.
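The ranking scheme can be sketched as follows. This is an illustrative reimplementation, not the organizers' code. The team names and metric values in the usage note are made up, and assigning tied teams the mean of their ranks is an assumption, since tie handling is not specified here.

```python
def rank_descending(values):
    """Rank values so that the highest gets rank 1; ties share the mean rank (assumed)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def final_scores(metrics):
    """metrics: {team: (pAUC_S, SE@95SP_S, kappa_U, AUC_U)} -> {team: S_final}."""
    teams = list(metrics)
    per_metric = [rank_descending([metrics[t][m] for t in teams]) for m in range(4)]
    return {t: sum(r[i] for r in per_metric) / 4 for i, t in enumerate(teams)}
```

With three hypothetical teams `A = (0.90, 0.85, 0.80, 0.99)`, `B = (0.88, 0.86, 0.70, 0.95)`, and `C = (0.80, 0.70, 0.75, 0.97)`, team A ranks 1, 2, 1, and 1 on the four metrics, giving S_{final} = 1.25 and the top position.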

We calculated 95% confidence intervals (CIs) with non-parametric bootstrapping using 1000 iterations [28]. The code for evaluating submissions can be found on GitHub.5 The performance of human graders was calculated by comparing the labels given by the individual graders (excluding the two glaucoma specialists, since the final labels were equal to their decision in case of disagreement) to the final labels as defined in Section II-A. To compute the performance of all human graders combined, each image was weighted equally in the calculation of the metrics. We also evaluated ensembles of participating algorithms, which were generated by averaging the outputs of these algorithms.
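A non-parametric bootstrap of this kind can be sketched as below. The percentile method, the fixed seed, and the `accuracy` example metric are assumptions made for illustration; the evaluation code referenced above is the authoritative implementation.

```python
import random


def accuracy(labels, predictions):
    """Example metric: fraction of predictions equal to the labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


def bootstrap_ci(labels, outputs, metric, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(metric([labels[i] for i in idx], [outputs[i] for i in idx]))
    estimates.sort()
    lower = estimates[round((alpha / 2) * n_iterations)]
    upper = estimates[round((1 - alpha / 2) * n_iterations) - 1]
    return lower, upper
```

For metrics that need both classes to be present (such as an AUC), a resample containing a single class would have to be redrawn or skipped; the sketch does not handle this.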

SECTION IV.

Participating Methods

Fifteen teams submitted a working solution to the Final Test Phase, of which one team did not opt in to contribute to the current paper. In this section, we present the methods of the fourteen participating teams. More extensive descriptions are available on the AIROGS challenge website6 and a selection of the participating methods was included in the ISBI challenge proceedings [37], [38], [39]. Tables II and III summarize the participating methods in a structured manner.

TABLE II Method Overview From All Participating Teams for the Screening Task. OD = Optic Disc. #ODs = Number of CFPs in Which the OD Was Manually labeled. #vessels = Number of CFPs in Which the Vessels Were Manually labeled
TABLE III Method Overview From All Participating Teams for the Ungradability Task and the Deep Learning Frameworks They Used. OD = Optic Disc. AE = Autoencoder. VAE = Variational Autoencoder. Rec. Error = Reconstruction Error. OOD = Out-of-Distribution

A. PUMCH-Eye [37]

The PUMCH-eye team proposed an approach with five trained models in their workflow. The first model (M_{disc} ) was a segmentation model with ResNet101-UperNet [40] as the backbone that segmented the OD in the input CFP. For the development of this model, they manually labeled the OD in 40 images. In case M_{disc} successfully detected the OD, they computed the center c and the diameter d of the segmentation to crop the input image around c with size 3d . This cropped image was then fed into a vision transformer [41] for the binary classification of RG and NRG. If the OD detection was unsuccessful, they fed the original input image to a different vision transformer for binary classification of RG and NRG.
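The cropping step can be illustrated with a small helper that turns the center c and diameter d into a square box with side 3d. This is a sketch only: the function name is hypothetical, and rounding to whole pixels and clipping the box to the image bounds are assumptions not stated in the team's description.

```python
def od_crop_box(center_x, center_y, diameter, image_width, image_height, scale=3.0):
    """Square crop of side `scale * diameter` centered on the segmented optic disc.

    Returns (left, top, right, bottom) in pixels, clipped to the image (assumed).
    """
    half_side = scale * diameter / 2.0
    left = max(0, round(center_x - half_side))
    top = max(0, round(center_y - half_side))
    right = min(image_width, round(center_x + half_side))
    bottom = min(image_height, round(center_y + half_side))
    return left, top, right, bottom
```

For an optic disc near the image border, the clipped box is smaller than 3d; padding the image instead would be an equally plausible choice.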

The team also developed a vessel segmentation model (M_{vessel}) with 40 images in which they manually annotated vessels. They trained a ResNet-18 (R_{vessel}) which took the output of M_{vessel} as input data, using the first 500 images in the training dataset and 100 manually selected images in the training set with relatively poor image quality. This classification model served as one of the inputs for ungradability classification. The second input was taken from M_{disc}. The ungradability likelihood output (O_{4}) was then defined as the output likelihood of the binary classification model R_{vessel} (i.e., O_{vessel}) or, if M_{disc} could not detect an OD, as O_{vessel} + 0.75. O_{3} was positive if O_{4} was at least 0.95 and negative otherwise. The vessel segmentation model was evaluated on four randomly selected images, on which a Dice score of 0.787 was achieved. This was lower than the state-of-the-art for this problem [42]. However, this was expected to be sufficient for the downstream task of ungradability detection.

B. RWTH-CuP [38]

The RWTH-CuP team proposed an approach with two steps consisting of cropping around the OD by employing a detection network, followed by an ensemble of transformers (Swin Transformer-B [43] and DeiT-S [44]) and convolutional neural networks (CNNs) (EfficientNet-B4 [45] and EfficientNetV2-M [46]) that classifies the cropped image. They manually labeled the OD and its environment in 3,221 CFPs to develop this detection network, for which they trained a YOLOv5 [47] object detector network.

For ungradability classification, the team used a hybrid approach. As the probability that an image is ungradable is high if the OD could not be found by the object detector network, they employed the confidence score of the YOLOv5 detection model as one of the ungradability measures. To capture other ungradability causes, such as blurred depictions of the OD, they trained an additional classifier on a manually selected subset of the CFPs in the development set. The team considered the 4,000 CFPs with the lowest confidence score of the object detector and manually selected 600 images that were assumed by the team to be very close to being classified as ungradable. They used these together with another set of 2,000 high-quality images to train an EfficientNet-B4 [45] ungradability classification model. O_{4} was then defined as (1-c) + g, where c is the object detection confidence and g the output of the ungradability classification model. The binary O_{3} output value was determined using a cut-off manually determined by a medical doctor in 20,000 images from the development set for which O_{4} was computed.

C. Eyelab [48]

The Eyelab team employed a two-stage approach for glaucoma classification. The first step was to detect and crop the OD area and the second step was a vision transformer [49] that classified the cropped image from the first step. For the detection model, they trained a YOLOv5 [47] model using semi-automatically generated labels. Their method for ungradability detection was based on whether the optic disc detection model from the first step found an optic disc to be present.

D. Tien [50]

Tien used an ensemble of an EfficientNet [45] and DenseNet [51] for the classification of RG and NRG. For the ungradability task, they used an autoencoder network and a blending engine. They used the reconstruction error as a measure of the likelihood of ungradability. The higher the reconstruction error, the more likely it is that the image is ungradable. The blending engine applied a weight factor, derived from the probability output of the binary classification model, to the reconstruction error. The weight was highest (equal to 1) when the probability was 0.5 and lowest when the probability was certain (either 0 or 1).
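One way to picture such a blending engine is sketched below. The exact weight function is not given in this description, so the triangular form and the `min_weight` floor are assumptions chosen only to match the stated endpoints (weight 1 at probability 0.5, lowest weight at probabilities 0 and 1).

```python
def blending_weight(p_rg, min_weight=0.0):
    """Hypothetical weight: 1 at p_rg = 0.5, falling linearly to min_weight at 0 or 1."""
    uncertainty = 1.0 - abs(2.0 * p_rg - 1.0)
    return min_weight + (1.0 - min_weight) * uncertainty


def ungradability_score(reconstruction_error, p_rg, min_weight=0.0):
    """Scale the autoencoder reconstruction error by the classifier's uncertainty."""
    return blending_weight(p_rg, min_weight) * reconstruction_error
```

Under this sketch, an image with a high reconstruction error but a confident classification receives a lower ungradability score than an equally poorly reconstructed image on which the classifier is undecided.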

E. UPF+AIML [52]

Team UPF+AIML trained two separate models, both based on the MobileNet-V2 [53] architecture for lightweight training, and both optimized with the Sharpness-Aware Minimization (SAM) [31] technique for better generalization. The first model was trained on the available training set for the screening task. The second model was tasked with identifying out-of-distribution data, i.e., ungradable images in this case. For this, ungradable images were simulated by applying four image transformations (brightness, gamma, saturation, and blur) online to the data, with such a strength that they would destroy image content and render images useless for diagnostic purposes. The ungradability detection model was trained on a mixture of gradable images (sampled directly from the original training set) and ungradable (simulated) images. After training, this model was applied to the training set, where all images were expected to be gradable. The threshold that would classify 0.1% of the training set as ungradable was selected for ungradability detection.
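The threshold selection in the last step can be sketched as a simple quantile rule. The function name is hypothetical, and treating a score strictly above the threshold as ungradable is an assumed convention.

```python
def threshold_for_flag_rate(train_scores, flag_rate=0.001):
    """Threshold such that about `flag_rate` of the training scores lie strictly above it.

    train_scores: ungradability scores of the (assumed gradable) training images.
    """
    ordered = sorted(train_scores)
    n_flagged = int(len(ordered) * flag_rate)
    return ordered[len(ordered) - n_flagged - 1]
```

With 1,000 distinct training scores and a flag rate of 0.1%, exactly one training image ends up above the returned threshold. Ties among the highest scores would lower the realized rate.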

F. FMS-CETCV [54]

The FMS-CETCV team used a binary classifier with ResNet-50 as the backbone for classifying RG and NRG. They used focal loss [55] to account for the class imbalance in the training set.

For the classification of ungradable images, they used a self-supervised learning approach, inspired by the work of Oza et al. [56], where a one-class classification method was presented for unsupervised anomaly detection. The one-class classifier builds a feature space by extracting the features of the training samples, which contain only positive samples (i.e., gradable images). They used an encoder with ResNet-18 as the backbone, which was trained on the AIROGS training set that only contains gradable images. The feature space produced by this encoder was then used by a Gaussian anomaly classifier to distinguish gradable and ungradable images.

G. ICT_HCI [57]

Team ICT_HCI used ResNet-50 for their RG and NRG classification model. During inference, they made five random 512\times 512 crops of the image and provided each crop separately to the model to obtain five scores. If the maximum of the five scores was greater than 0.9, they used it as the output of the model; otherwise, they took the mean of the five scores as the output of the model.

The team used the minimum class probability of the two classes RG and NRG as the likelihood for ungradability. If and only if the ungradability likelihood was greater than 0.1, they set O_{3} to be positive.
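Both rules can be written down in a few lines. This is a sketch under two assumptions that are not stated above: that the two class probabilities sum to one, and that the ungradability likelihood is computed from the aggregated RG score. The function names are illustrative.

```python
def aggregate_crop_scores(crop_scores, max_threshold=0.9):
    """Use the maximum crop score if it exceeds the threshold, otherwise the mean."""
    top = max(crop_scores)
    if top > max_threshold:
        return top
    return sum(crop_scores) / len(crop_scores)


def ungradability_outputs(p_rg, threshold=0.1):
    """O_4 = minimum of the RG and NRG probabilities; O_3 is positive iff O_4 > threshold."""
    o4 = min(p_rg, 1.0 - p_rg)  # assumes P(NRG) = 1 - P(RG)
    o3 = o4 > threshold
    return o3, o4
```

Under these assumptions, an image is flagged as ungradable whenever the RG probability lies strictly between 0.1 and 0.9, i.e., whenever the classifier is not confident in either class.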

H. SK [58]

The SK team employed ResNet-RS [59] for RG and NRG classification. They replaced the final linear layer of ResNet-RS with a single linear layer with two channel outputs.

For ungradability classification, they used an inference-time OOD energy-based method [60] combined with activation rectification [61]. The energy-based method uses a scoring function based on energy, instead of softmax, to discriminate in-distribution (ID) and OOD data. In activation rectification, the outsized activations of a few units are attenuated by clipping the activations at an upper limit. After rectification, the output distributions for ID and OOD data become much better separated. The approach is based on the observation that the mean activation for ID data is well-behaved with a near-constant mean and standard deviation, whereas the mean activation for OOD data has significantly larger variations across units and is biased towards having sharp positive values.
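The combination of the two ideas can be sketched in plain Python. This is a toy illustration under stated assumptions, not the team's implementation: the energy is taken as E = -T log sum_i exp(f_i / T) over the logits f_i, the rectification clips penultimate-layer features at `clip_value`, and the final linear layer (`weights`, `biases`) and all parameter values are hypothetical.

```python
import math


def rectify(activations, clip_value):
    """Clip each activation at an upper limit (activation rectification)."""
    return [min(a, clip_value) for a in activations]


def energy_score(logits, temperature=1.0):
    """E = -T * log(sum_i exp(logit_i / T)); higher energy suggests OOD input."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def rectified_energy(features, weights, biases, clip_value, temperature=1.0):
    """Clip penultimate features, apply the final linear layer, then score by energy."""
    clipped = rectify(features, clip_value)
    logits = [sum(w * f for w, f in zip(row, clipped)) + b
              for row, b in zip(weights, biases)]
    return energy_score(logits, temperature)
```

Because a higher energy indicates a more likely OOD sample, the energy itself can serve as a score that is positively correlated with ungradability, as required for O_{4}.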

I. SACM [62]

Team SACM used the YOLOv5 [47] detection model to crop the optic disc in the CFP input image. In a semi-automated process, they manually labeled the locations of 735 optic discs and trained the detection model with 4,088 CFPs in total. The cropped image was then passed through an ensemble of classifiers (SeResNext-50 [63], VGG-16 [64], DenseNet-161 [51], EfficientNet-B5 [45], EfficientNet-B7 [45] and Inception-V3 [65]) to make the final prediction. They also used test-time augmentation.

For the robustness task, they used the detection model confidence, an autoencoder, and a variational autoencoder (VAE) [66]. They combined these three components into a final ungradability score O_{4}, defined as (1-c)\cdot s \cdot p_{autoencoder} \cdot p_{vae}, where c refers to the detection network confidence and p_{autoencoder} and p_{vae} refer to the mean squared error between the input and output of the autoencoder and VAE, respectively.

J. UPRetina-UR [67]

The UPRetina-UR team used ResNet-RS-50 [59] for the classification of RG and NRG. They oversampled cases with RG during training to account for the class imbalance.

They employed a closed-set classification approach for the ungradability task based on the method proposed by Vaze et al. [68]. They applied test-time augmentation to obtain five predictions that are averaged to produce O_{4} .

K. OPTIMATeam [39]

OPTIMATeam used the first two blocks from the Inception-V3 [65] network for the classification of RG and NRG. They only used these two blocks to reduce the receptive field size, which was necessary for their ungradability approach.

The ungradability approach was based on the direct modeling of the uncertainty following the evidential deep learning approach [69]. They used Deep Dirichlet uncertainty estimation as the ungradability score O_{4} . To set a threshold for getting a binary value for O_{3} based on O_{4} , they assumed that diagnosis is only possible if the OD has enough image quality for diagnosis, as glaucoma’s main structural manifestation occurs in that region. They applied Grad-CAM [70] on the trained model for the screening task and occluded the region where Grad-CAM was greater than 0.5. This allowed them to produce ID and OOD samples in their validation set, with which they computed the threshold for the binary ungradability decision. In particular, they constructed a receiver operating characteristic (ROC) curve using their values for O_{4} with these ID and OOD samples. The threshold at which this ROC curve reached a sensitivity of 0.5 was used to calculate O_{3} .
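The thresholding step can be sketched as follows, assuming that a higher O_{4} means more ungradable and that the OOD samples (here, the occluded validation images) are the positive class. The function names and the toy scores are hypothetical; this is an illustration of picking the ROC operating point at a sensitivity of 0.5, not the team's code.

```python
import math

def threshold_at_sensitivity(ood_scores, target=0.5):
    """Largest threshold t for which the fraction of OOD scores >= t reaches `target`."""
    ranked = sorted(ood_scores, reverse=True)
    needed = max(1, math.ceil(target * len(ranked)))  # OOD samples that must be flagged
    return ranked[needed - 1]

def specificity(id_scores, threshold):
    """Fraction of ID samples that stay below the threshold (not flagged)."""
    return sum(s < threshold for s in id_scores) / len(id_scores)

def binarize(score, threshold):
    """Binary ungradability decision derived from the scalar score."""
    return score >= threshold

# Toy uncertainty scores (made up for illustration).
ood_validation = [0.2, 0.9, 0.4, 0.7]   # occluded images
id_validation = [0.1, 0.3, 0.75, 0.2]   # unmodified images
t = threshold_at_sensitivity(ood_validation)
```

With these toy values the threshold is 0.7: half of the occluded images are flagged, and three of the four unmodified images remain below the threshold.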

L. MA [71]

Team MA used an ensemble of these twelve different architectures for the glaucoma screening task: SeNet-154 [63], SeResNet-101 [63], SeResNeXt-101 [63], EfficientNet-B1 [45], EfficientNet-B2 [45], EfficientNet-B3 [45], EfficientNet-B4 [45], EfficientNet-B5 [45], EfficientNet-B6 [45], EfficientNet-B7 [45], DenseNet-201 [51], Inception-ResNet-v2 [72]. The RG likelihood was computed by averaging the likelihoods of all respective models in the ensemble.

The ungradability output O_{4} was the sum of the variances between all models in the ensemble for the positive and negative class probabilities. O_{3} was positive if O_{4} exceeded 0.2 and negative otherwise.
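As a minimal sketch of this disagreement-based score, the function below takes one (RG, NRG) probability pair per ensemble member for a single image. The use of the population variance is an assumption, since the text does not say which variance estimator was used, and the function name is hypothetical.

```python
from statistics import pvariance

def ensemble_outputs(class_probs, threshold=0.2):
    """class_probs: list of (p_rg, p_nrg) pairs, one pair per model in the ensemble.

    Returns (o1, o4, o3): mean RG likelihood, summed per-class variance
    across models, and the binary ungradability decision.
    """
    rg = [p[0] for p in class_probs]
    nrg = [p[1] for p in class_probs]
    o1 = sum(rg) / len(rg)               # averaged RG likelihood
    o4 = pvariance(rg) + pvariance(nrg)  # disagreement between models
    o3 = o4 > threshold                  # ungradable if models disagree strongly
    return o1, o4, o3
```

For example, two models that output (0.9, 0.1) and (0.1, 0.9) disagree strongly and give O_{4} = 0.32, which exceeds 0.2, whereas (0.8, 0.2) and (0.82, 0.18) give a value close to zero.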

M. YC [73]

The YC team used two DenseNet-121 networks to classify RG and NRG in the CFPs. The first network was trained with the full CFP as input and the second network used a version of the CFP that was cropped around the optic disc as input. After the last convolutions of these networks, a fully connected layer with dropout was added. The outputs of these fully connected layers were then concatenated and used as the input to another fully connected layer with dropout, which was followed by the final layer of the network. For cropping the CFPs around the optic disc, they trained a U-Net [74] with a DenseNet-121 [51] backbone. To train this segmentation network, they first roughly annotated the position of the optic disc in 101 CFPs in the training set. Subsequently, they generated reference segmentation maps using a probability density function of the multivariate normal distribution around the annotated optic disc position.

They used Monte-Carlo drop-out [75] with 20 predicted probabilities per image for the robustness task. They then used a one-sample Wilcoxon test to test whether the mean of the predicted probabilities was equal to 0.5. The team defined the ungradability score for predicting glaucoma as the logarithm of the p-value of this Wilcoxon test.
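A rough sketch of this score is given below. It assumes the 20 Monte-Carlo drop-out probabilities are already available as a list and approximates the two-sided one-sample Wilcoxon signed-rank p-value with the usual normal approximation, without tie or continuity correction. The team may have used an exact test or a library routine; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def wilcoxon_one_sample_p(values, center=0.5):
    """Two-sided signed-rank p-value (normal approximation, an assumption)."""
    diffs = [v - center for v in values if v != center]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return min(1.0, 2 * (1 - NormalDist().cdf(abs(z))))

def ungradability_score(mc_probs):
    """Log p-value: close to 0 when predictions scatter around 0.5."""
    p = wilcoxon_one_sample_p(mc_probs)
    return math.log(max(p, 1e-300))  # guard against log(0)

# Toy Monte-Carlo drop-out outputs (made up for illustration).
confident = [0.95 + 0.001 * i for i in range(20)]
uncertain = [0.5 + (-1) ** i * 0.01 * (i + 1) for i in range(20)]
```

Predictions that consistently fall on one side of 0.5 give a small p-value and a strongly negative score, while predictions scattered around 0.5 give a p-value near 1 and a score near 0, so a higher score indicates a more ungradable image.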

N. Mirazzak [76]

Team Mirazzak used an ensemble of ConvNeXts [77] and a vision transformer for the screening performance task.

For the ungradability task, they employed the regret function, which was proposed by Bibas et al. [78] as the generalization error of an explicit expression of the predictive normalized maximum likelihood learner. If the value of the regret function was high, the samples were considered OOD and were marked as ungradable.

SECTION V.

Results

This section presents the glaucoma screening performance and robustness of the fourteen participating teams. The final rankings and mean positions of the teams are shown in the first plot of Fig. 2. Two pairs of teams had exactly equal mean positions and therefore shared a rank, so that ranks #2 and #11 were each held by two teams.

Fig. 2. Final rankings of all participating teams. The teams are sorted by their final ranking and therefore also by their mean position. The mean position is shown in the left plot and the four challenge metrics are shown in the other four plots. The \kappa _{U} of all human graders is indicated with a red dotted line. The width of the horizontal lines in all plots and the shaded area in the plot for \kappa _{U} are 95% CIs. We consistently use the same colors to refer to teams in other figures in this manuscript.

A. Glaucoma Screening Performance

The glaucoma screening performance of the participating teams is summarized in Fig. 2, showing pAUC_{S} and SE \text{@} 95SP_{S} in the second and third plot, respectively. The highest scores for pAUC_{S} and SE \text{@} 95SP_{S} were 0.90 (95% CI: 0.89 – 0.91) and 0.85 (95% CI: 0.83 – 0.87), respectively. These scores were both achieved by team PUMCH-eye.

Fig. 3a and Fig. 3b show the pAUC_{S} and SE \text{@} 95SP_{S} for the ensembles when averaging the RG likelihood output O_{1} of the best M participants in terms of the relevant metric. An optimal pAUC_{S} of 0.91 (95% CI: 0.90 – 0.92) was achieved at M=3 . At M=2 , an optimal value for SE \text{@} 95SP_{S} was reached, which was 0.87 (95% CI: 0.85 – 0.89).
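The incremental fusion can be sketched as follows, under the assumptions that each team's O_{1} is available as a list of scores aligned with the reference labels and that teams are ranked by the same metric used to evaluate the ensemble. The sensitivity-at-specificity function is a simple empirical version and the toy scores are made up; the challenge's own evaluation code may differ in details such as interpolation.

```python
import math

def sensitivity_at_specificity(scores, labels, specificity=0.95):
    """Sensitivity at the lowest threshold that keeps at least `specificity`."""
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    keep = math.ceil(specificity * len(negatives))  # negatives that must not be flagged
    threshold = negatives[keep - 1]
    return sum(s > threshold for s in positives) / len(positives)

def fuse_top_m(outputs_by_team, labels, metric):
    """Average the outputs of the best M teams for M = 1..N and score each ensemble."""
    ranked = sorted(outputs_by_team,
                    key=lambda team: metric(outputs_by_team[team], labels),
                    reverse=True)
    results = []
    for m in range(1, len(ranked) + 1):
        members = [outputs_by_team[team] for team in ranked[:m]]
        fused = [sum(column) / m for column in zip(*members)]
        results.append((m, metric(fused, labels)))
    return results

# Toy data: four RG (1) and four NRG (0) images, three hypothetical teams.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
outputs = {
    "a": [0.9, 0.8, 0.3, 0.7, 0.1, 0.2, 0.4, 0.5],
    "b": [0.9, 0.3, 0.8, 0.7, 0.2, 0.1, 0.5, 0.4],
    "c": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
results = fuse_top_m(outputs, labels, sensitivity_at_specificity)
```

In this toy example each of the two best teams misses a different RG image, so averaging their outputs raises the sensitivity from 0.75 at M=1 to 1.0 at M=2, which mirrors how complementary errors allow a small ensemble to outperform its best member.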

Fig. 3. The four challenge metrics (a) pAUC_{S} , (b) SE \text{@} 95SP_{S} , (c) \kappa _{U} , and (d) AUC_{U} for the ensembles generated by incrementally fusing one algorithm at a time. The algorithms were fused by averaging the outputs of all algorithms in the ensemble. The vertical lines in the left plot and the shaded areas in the right plots indicate 95% CIs.

Fig. 4a shows the partial ROC curves between 90% and 100% specificity for all participants. The plot also presents the sensitivity and specificity of the human graders with a 95% CI. These were 0.86 (95% CI: 0.84 – 0.87) and 0.94 (95% CI: 0.94 – 0.95), respectively.

Fig. 4. ROC curves for both challenge tasks. The sensitivity and specificity of all human graders on the AIROGS test set combined are indicated with black lines. Respectively, the width and height of the black horizontal and vertical lines are 95% CIs. In (a), the partial ROC curve (90%-100% specificity) for screening is shown, with 1,602 positive (RG) and 8,134 negative (NRG) images from the AIROGS test set. In (b), the ROC curve for robustness is shown with 1,554 positive (ungradable) and 9,736 negative (gradable) images from the AIROGS test set.

In Fig. 5, we compare the performance on the REFUGE test set of the final AIROGS algorithms, which were trained on the AIROGS train set, to the performance of the algorithms that were submitted to the REFUGE challenge, which were trained on the REFUGE train set. The top three REFUGE algorithms achieved AUCs of 0.99, 0.98 and 0.96. For the AIROGS algorithms, the best three AUCs were 0.98, 0.97 and 0.97. The mean ± std. dev. AUCs of all REFUGE and AIROGS algorithms were 0.94 ± 0.04 and 0.95 ± 0.02, respectively. Fig. 6 presents the relation between the two glaucoma screening performance metrics of all participating AIROGS algorithms on the AIROGS test set and the same metrics on the REFUGE test set. For both metrics, almost all AIROGS algorithms (except for team PUMCH-eye for SE \text{@} 95SP_{S} ) scored higher on REFUGE than on AIROGS. Of all AIROGS participants, the best pAUC_{S} and SE \text{@} 95SP_{S} on REFUGE were 0.94 and 0.88, respectively.

Fig. 5. Comparison of the AIROGS and REFUGE algorithms, tested on the REFUGE test set, visualized as violin and swarm plots. The final algorithms that were developed for the REFUGE challenge itself and for the AIROGS challenge are shown on the left and right, respectively. The AIROGS algorithms were only trained on the AIROGS train set and were not retrained with the REFUGE dataset.

Fig. 6. Performance of the participating AIROGS algorithms on the REFUGE dataset, compared to their performance on the AIROGS dataset. Both screening metrics (a) pAUC_{S} and (b) SE \text{@} 95SP_{S} are shown.

In Fig. 7, the relation between the glaucoma screening performance of all participating AIROGS algorithms on the AIROGS test set and their performance on GAMMA is shown. For both screening metrics, all AIROGS algorithms scored higher on GAMMA than on AIROGS. Of all AIROGS participants, the best pAUC_{S} and SE \text{@} 95SP_{S} on GAMMA were 1.0 and 1.0, respectively.

Fig. 7. Performance of the participating AIROGS algorithms on the GAMMA dataset, compared to their performance on the AIROGS dataset. Both screening metrics (a) pAUC_{S} and (b) SE \text{@} 95SP_{S} are shown.

B. Robustness

The robustness metrics of the participating teams are summarized in Fig. 2, showing \kappa _{U} and AUC_{U} in the fourth and fifth plot, respectively. The highest scores for \kappa _{U} and AUC_{U} were 0.82 (95% CI: 0.80 – 0.84) and 0.99 (95% CI: 0.98 – 0.99), respectively. These scores were achieved by teams Temirgali and RWTH-CuP, respectively.

Fig. 3c and Fig. 3d show the \kappa _{U} and AUC_{U} for the ensembles when averaging output O_{3} and output O_{4} , respectively, of the M best algorithms in terms of these respective metrics. An optimal \kappa _{U} of 0.85 (95% CI: 0.84 – 0.86) was achieved at M=6 . Also at M=6 , an optimal value for AUC_{U} was reached, which was 0.99 (95% CI: 0.99 – 0.99).

In Fig. 4b, ROC curves for robustness are shown for all participants. The plot also presents the sensitivity and specificity for separating ungradable from gradable images of the human graders with a 95% CI. These were 0.95 (95% CI: 0.94 – 0.96) and 0.97 (95% CI: 0.97 – 0.97), respectively.

The results on the external DRIMDB dataset are shown in Fig. 8, indicating the relation between the ungradability metrics \kappa _{U} and AUC_{U} of all participating AIROGS algorithms on DRIMDB and those metrics on AIROGS. Of all AIROGS participants, the best \kappa _{U} and AUC_{U} on DRIMDB were 0.94 and 1.0, respectively.

Fig. 8. Performance of the participating AIROGS algorithms on the DRIMDB dataset, compared to their performance on the AIROGS dataset. Both robustness metrics (a) \kappa _{U} and (b) AUC_{U} are shown.

C. Inference Time

The time that each algorithm took to perform inference on the test set of the Final Test Phase is shown in Fig. 9. Please note that these results reflect the total time it took to run the Docker containers provided by the participants. The time it took for the software to load the model weights and to run other setup code defined by the teams was also included in this analysis. Since the test set was split into 38 separate chunks, this initialization and setup code was run at least 38 times for each participating team, as well.

Fig. 9. Average inference time per CFP in the test set of the Final Test Phase. This time includes the actual inference time, model initialization, and other setup code executed by the submitted Docker containers.

SECTION VI.

Discussion

AI models have been shown to be effective at detecting glaucoma in CFPs, but most studies lack evidence of robustness to real-world scenarios in which unexpected OOD data can be presented due to various causes. To this end, we relied on the community to develop robust AI solutions for glaucoma screening based on the largest multi-center real-world CFP dataset with glaucoma labels. We organized the AIROGS challenge around this dataset, ensuring the resulting algorithms are reusable in a cloud-based environment. The participants could only train on gradable data, to encourage robustness to any kind of ungradable data. We then applied their algorithms to ungradable data, and to other publicly available datasets to assess their generalization.

A. Overall Findings

The team with the highest SE \text{@} 95SP_{S} achieved expert-level screening performance on the AIROGS test set with a sensitivity of 0.85 (95% CI: 0.83 - 0.87) at 95% specificity, similar to the sensitivity of 0.86 (95% CI: 0.84 - 0.87) at a specificity of 0.94 (95% CI: 0.94 - 0.95) of human graders. The highest pAUC_{S} that was achieved by any of the teams was 0.90 (95% CI: 0.89 - 0.91). Ensembling the different participating methods improved the screening performance even further, to 0.91 (95% CI: 0.90 - 0.92) and 0.87 (95% CI: 0.85 - 0.89) for pAUC_{S} and SE \text{@} 95SP_{S} , respectively. Our analysis revealed that ensembling improved the performance for all metrics up to a certain point, after which adding further models to the ensemble resulted in a decline in performance. A probable reason for this is that the models added after this point under-perform to such a degree that their outputs negatively impact the performance of the ensemble, instead of improving it. Seven out of fourteen teams exceeded the minimum performance of 80% sensitivity and 95% specificity that was required of human graders, who were periodically monitored during the grading process. This shows these models can provide similar performance to human graders for glaucoma screening, suggesting that AI can potentially play a role in an automated screening process.

We also evaluated the screening performance of the algorithms on two external test sets. Even though the algorithms were trained on AIROGS data, they achieved very high performance on the two external test sets, showing reproducible results in different sets and populations. On average, the participating AIROGS algorithms scored slightly higher on the REFUGE dataset than the REFUGE participants. We found that the participating algorithms scored substantially higher on these external datasets than on the AIROGS test set, indicating the value of a challenging real-world dataset. This strong generalization of the developed solutions also shows the potential of models trained on our dataset to be successfully implemented in screening programs with limited to no loss of performance. Unlike the external datasets, the AIROGS dataset represents a screening population, which likely consists of a relatively large number of individuals with lower severity levels of glaucoma compared to clinical populations. Since more severe cases are expected to be picked up more easily than less severe cases, the underlying ratio between less and more severe positive glaucoma cases could be a cause of the observation that the external test set performance was higher than the internal test set performance.

The robustness to ungradable data in the AIROGS test set was evaluated for each team using the metrics \kappa _{U} and AUC_{U} . The teams that performed the best in terms of these metrics achieved 0.82 (95% CI: 0.80 - 0.84) and 0.99 (95% CI: 0.98 - 0.99) for \kappa _{U} and AUC_{U} , respectively. Human experts did reach a higher \kappa _{U} of 0.85 (95% CI: 0.84-0.86) for this task. Moreover, they achieved a sensitivity of 0.95 (95% CI: 0.94 - 0.96) and a specificity of 0.97 (95% CI: 0.97 - 0.97) for detecting ungradable cases, while the team with the best AUC_{U} achieved a lower sensitivity at 97% specificity of 0.90 (95% CI: 0.88-0.92). Although the teams achieved relatively high performances, they still achieved lower performance at the robustness task than human experts. This shows this task was especially challenging, possibly because the participating teams could not use ungradable development data or because their robustness approaches focused on specific forms of ungradability.

We also assessed robustness on the external DRIMDB dataset. The best-scoring team on this dataset achieved very high performance, with 0.94 and 1.0 for \kappa _{U} and AUC_{U} , respectively. These two metrics were lower on the AIROGS dataset for that team. This also indicates very strong generalization to other datasets for the robustness task. The high ungradability detection performance also indicates robustness to other diseases in the image, as diabetic retinopathy was prevalent in the gradable subset of DRIMDB and the best algorithms did not classify images with this disease as ungradable.

A large difference in performance between participating teams can be observed, both for the screening and the robustness task. Therefore, we think it is important to identify which methodological choices were made predominantly by top-performing teams. One of the most notable differences between the top three participants and the rest was the use of transformers. Outside of the top three, only the last-placed team used a transformer. One possible reason for the superiority of transformers over CNNs could be their effectiveness at modeling long-range dependencies [79], [80]. This allows for a better understanding of contextual information, which is generally believed to be beneficial in medical imaging [80]. Empirically, transformers and methods that combine transformers with CNNs have previously also been shown to outperform CNNs in medical image analysis [81], [82], [83], which is in line with our findings.

Moreover, all three top participants manually labeled ODs for training either a segmentation or detection model to crop around the OD during pre-processing. Even though this was also done by two other teams, this seems like an effective strategy to achieve higher screening performance. A likely reason for the effectiveness of this approach is that most glaucoma-related imaging features can be found on or around the OD. This shows how a priori medical knowledge could still be of value even when a large amount of data is available. A less important factor appears to be the number of manually labeled ODs. A possible reason for this could be that the OD detection or segmentation network is not required to be extremely accurate, as combining a rough localization of the OD with a large enough padding margin could also suffice to crop the image during pre-processing.

Since the development set only consisted of images that were labeled gradable (either RG or NRG) and the use of external fundus data was prohibited, all teams came up with an uncertainty or OOD detection method based on the gradable data for the robustness task. The ungradability methods of the top three participants in terms of mean position, \kappa _{U} , and AUC_{U} , all revolved around the confidence of a neural network that localized the OD. Of the other participants, only the ninth-placed team had such an approach. Apart from these methods based on OD detection, only team UPF+AIML implemented a different robustness technique that was also based on domain knowledge. This raises the impression that solutions based on domain knowledge are more effective for robustness than more general OOD detection solutions. However, it still needs to be evaluated if such approaches are robust for other general tasks (not glaucoma screening) or other sources of OOD data.

For calculating the \kappa _{U} metric, the participants were required to output a binary decision on ungradability. A popular approach, especially among the top participants, was to manually identify relatively low-quality images in the development set and base a threshold for this binary output on that subset. This technique was employed by the best three, fifth, tenth, and twelfth teams in terms of \kappa _{U} . This indicates that this could be a successful approach, although not in general, as the accuracy of this binary value is also highly dependent on the quality of the scalar output for ungradability O_{4} that is being thresholded. The difference between the rankings in terms of \kappa _{U} and AUC_{U} stood out for one team in particular. Team YC ranked only eleventh for AUC_{U} (which depended on the scalar output O_{4} ), but ranked fourth in terms of \kappa _{U} (which depended on the binary output O_{3} ), indicating the approach they used for thresholding their scalar value was highly effective. The difference between their ranks for AUC_{U} and \kappa _{U} was 7, while the next biggest value of this difference was only 3. Team YC indeed came up with a relatively sophisticated method for binarizing O_{4} compared to others, based on a Wilcoxon one-sample test to statistically test whether the mean of the predicted probabilities from a Monte-Carlo drop-out approach was 0.5 or not.

B. Strengths and Limitations

The dataset presented in this paper substantially exceeds what was publicly available before in terms of number of images and patients. The dataset is also highly diverse because of the large number of different sites, cameras that were used, and ethnicities. Comorbidities were not excluded from the dataset, as our goal was to develop tools that are robust to data with such conditions in real-world screening settings. These conditions should not have an influence on the classification of glaucoma or ungradability. The quality of the labels was controlled by the initial and periodical evaluation of human graders, the fact that each image was independently labeled twice by two trained graders, and, in case of disagreement, by a highly experienced reader. The participants submitted their solutions as containerized algorithms, allowing reproducibility, facilitating inference on other data, and preventing manual manipulation of the test set.

One of the rules of the AIROGS challenge was the prohibition of the use of external fundus data for development. A limitation of this work is the fact that we cannot be sure whether any of the teams used such data in their development process. A possible approach to prevent this and make the process fairer is to have participants submit a containerized algorithm for training, which would be trained by the challenge organizers with private challenge training data. Nevertheless, with such an approach it would still be challenging and time-consuming for the challenge organizers to verify that the training containers do not contain any weights pre-trained on other data.

The teams that participated in the competition were permitted to create their own manual annotations on the data and employ them in the development of their models. We made a deliberate decision not to forbid this practice, as we believed that the possibility of achieving superior results outweighed the disadvantage of potentially introducing a slight unfairness due to some teams having access to larger manual labor workforces than others. Teams PUMCH-eye, RWTH-CuP, Eyelab, UPF+AIML, ICT_HCI, SACM, and YC took advantage of this opportunity and produced at least one of the subsequent manual annotations: OD detection or segmentation, vessel segmentation, and identification of low-quality images in the development set. As a result, the final rankings of the teams have likely been influenced by this practice, and it is important to consider this potential bias when comparing solutions. We think it is also worth noting that the aforementioned solution for promoting fairness of submitting a containerized algorithm for training on private data would disable manual annotation of development data.

The dataset used for the challenge is diverse, but improvements could still be made in that respect. All screening sites were based across the United States of America, raising the question of whether a more generalizable model could be obtained with data from across the world. On the other hand, we showed that many algorithms trained on the AIROGS dataset performed at least as well on three external test sets, of which two originated from China and one from Turkey, as on our internal test set.

Not all research groups working in the field of retinal image analysis participated in this challenge and many teams that joined the challenge did not submit a solution to the Final Test Phase. Possible reasons for this include that many teams saw that their results did not match those already present on the leaderboard, that the barrier to getting a solution wrapped in a Docker container was too high for some teams, or that they were not able to finish in time. Therefore, we would like to stress that the challenge is still open and we are curious to see if the community can make further improvements. After all, especially for the robustness task, there seems to be room for improvement, given the gap with the human grader performance.

C. Future Directions

Based on the solutions that were presented by the teams, we think it would be valuable to combine methodologies from different participants and to work further on their ideas. For example, as we mentioned before, team YC apparently had a highly effective method for thresholding their ungradability scores, as their \kappa _{U} was very high compared to their AUC_{U} . A possible future direction would be to combine methods that perform well in terms of AUC_{U} with the binarization technique from team YC. Moreover, we observed that algorithms that scored high in terms of robustness used domain knowledge for this aspect of the challenge. Possible future directions could be to explore other ways to incorporate domain knowledge into an ungradability method. This observation also leads to the question of whether there are more fields in medical image analysis in which domain knowledge can be leveraged for uncertainty estimation and OOD detection.

In addition to a decision on RG and NRG presence, the graders were asked to indicate which clinical, glaucomatous features were present in the eyes they classified as RG, as listed in Section II-A and further described by [84]. This information was not yet included in the dataset release for this challenge, as it fell outside the scope of this challenge. Future solutions and challenges could be developed with this information, possibly resulting in more explainable algorithms.

This challenge only focused on classification based on a single CFP. It may be interesting to explore the effect on screening performance and robustness of including various types of metadata in our dataset, which we have available but have not yet published. This metadata, although missing for some images, includes the camera type, age, and anonymous patient identification (which can be used to link two eyes to a single patient).

In order to ensure the safe and effective implementation of the AI models for glaucoma screening described in this paper, several important steps need to be undertaken. González-Gonzalo et al. [85] provide valuable insights into the key aspects that are crucial for the integration of AI models in ophthalmic practice.

Among these aspects, additional retrospective validation studies play a significant role in validating the performance and generalization of these models. An external evaluation with substantially different data was already performed in this study. However, we think it is crucial to evaluate with additional large screening datasets that represent real-world scenarios before practical implementation. Prospective validation studies and cost-effectiveness analyses are also essential for evaluating the accuracy, reliability, and generalization of glaucoma screening AI models in real-world settings. These analyses are especially important for screening programs since screening solutions that are not specific enough can have substantial negative financial impacts, as they can lead to unnecessary hospital visits. It is also essential to identify and mitigate potential limitations such as data quality, which the AIROGS challenge aimed to address, model interpretability, integration with screening workflows, and potential biases. Finally, establishing mechanisms for post-market surveillance is vital to monitor and evaluate the performance and safety of these AI models after their regulatory approval.

Further implementation and real-world evaluation of these algorithms are needed, as described above. As this was considered out of the scope of the current manuscript, we leave the execution of these steps to future work.

SECTION VII.

Conclusion

We presented the results of community-acquired algorithms tested on real-world data for robust glaucoma screening from CFP. The best algorithms performed similarly in terms of screening to the carefully trained and selected human graders, and were shown to be effective at flagging images that could not be graded. Methodological choices predominantly made by the best teams included, for the screening task, the use of vision transformers and the incorporation of optic disc detection models in pre-processing and, for the robustness task, out-of-distribution detection approaches based on domain knowledge. We hope the unprecedented size and real-world nature of the dataset we released and the algorithms that were developed using this dataset will help towards implementing robust AI for glaucoma screening.

ACKNOWLEDGMENT
