Introduction
Infertility is a widespread global concern, affecting approximately 17.5% of adults, or about one in six people [1]. To address this issue, affordable and high-quality fertility care is needed.
Some studies suggest that male factors contribute to infertility in up to 50% of couples [2], [3], [4]. Therefore, infertility treatment is important for both males and females. Intracytoplasmic sperm injection (ICSI) is a common treatment for male infertility. In ICSI, individual sperm cells are meticulously selected and injected into eggs by experts. In this process, it is important to quickly and accurately classify sperm as either normal or abnormal. This assessment, however, requires expertise and is time-consuming.
Several computer-based systems for sperm analysis have been proposed to mitigate the burden on experts. However, no existing system considers both sperm motility and morphology or directly assesses sperm suitability for ICSI. Computer-assisted semen analysis (CASA) systems [5] have automated sperm analysis to some extent. These systems provide sperm concentration, motility analysis, and morphology assessment, but they cannot assess sperm considering both motility and morphology, and they do not assess sperm suitability for ICSI. Several studies have also used machine learning [6], [7], [8], [9], [10], [11], [12]. However, they focused primarily on sperm images, ignored a crucial aspect, motion information, and cannot assess sperm suitability for ICSI.
To address this problem, we collected multiple expert evaluations of sperm videos, curated the multi-expert rated sperm video (MERSV) dataset, which includes motion information, and developed an end-to-end sperm grade distribution estimation model trained on this dataset. Figure 1 illustrates this process. Owing to end-to-end inference from video, the model can consider both motion and morphology. The predicted distribution, which reflects the evaluations of multiple experts, can easily help determine the suitability of a particular sperm for ICSI. By referring to the predicted sperm grade distributions, experts can efficiently select a sperm from multiple candidates without the need to check every sperm. Furthermore, the collected dataset and the predicted sperm grade distributions can be valuable resources for professional development. Therefore, our system can significantly help to reduce the workload of experts.
Overview of our sperm grade estimation system. We collected multiple expert evaluations of sperm videos, curated the multi-expert rated sperm video (MERSV) dataset for analysis, and developed an end-to-end sperm grade distribution estimation model trained by this dataset. By referring to the sperm grade distributions predicted by our model, experts can reduce their workload.
There are two research questions in developing the end-to-end sperm grade distribution estimation model. First, how can sperm motion and morphological features be extracted from video data? Second, which loss function is appropriate for a model that estimates the grade distribution? To answer these research questions, we conducted a thorough comparison of models based on different feature extractors and loss functions. We considered three video feature extractors, R(2+1)D, SlowFast, and TimeSformer, together with four candidate loss functions, and compared them against an image-based ResNet baseline.
Our contributions can be summed up in three points:
We constructed the multi-expert rated sperm video (MERSV) dataset and proposed an end-to-end, video-based sperm grade distribution estimation system that helps to reduce experts’ workload for ICSI.
We analyzed three feature extractors and found that TimeSformer was the most effective feature extractor for video-based sperm analysis, outperforming ResNet and highlighting the indispensability of video data.
Earth mover’s distance (EMD) loss was identified as the most suitable loss function for estimating the grade distribution, demonstrating superior performance on lower-scoring segment samples.
The rest of this paper is organised as follows. Section II briefly reviews previous work on machine learning approaches for sperm assessment, video recognition models, and label distribution learning. Section III describes the details of our compiled MERSV dataset. Section IV explains the details of our proposed sperm grade distribution estimation model. Section V presents the experimental setup for comparing models based on different feature extractors and loss functions, along with the results. Finally, Section VI presents the conclusions.
Related Work
A. Datasets and Machine Learning Approaches for Sperm Assessment
Machine learning has been used in healthcare and medicine [13], [14], [15], [16], [17]. Several machine learning methods have also been proposed for sperm morphology analysis. These methods were validated using three datasets: SMIDS, HuSHeM, and SCIAN.
The Sperm Morphology Image (SMIDS) dataset was constructed by the Medical Faculty of Istanbul University using smartphone-based data acquisition [6]. It included 3,000 segmented RGB sperm images labeled as normal (1,021), abnormal (1,005), and non-sperm (974) by an expert.
The Human Sperm Head Morphology (HuSHeM) dataset was curated from sperm samples of 15 patients at the Isfahan Fertility and Infertility Center [7]. This dataset included 216 RGB images of sperm heads.
The Laboratory for Scientific Image Analysis Gold-Standard for Morphological Sperm Analysis (SCIAN) dataset consists of sperm samples from the Medical Faculty of Chile University [8]. It included 1,132 grayscale sperm head images.
Numerous studies used image recognition models, particularly CNN-based models. Riordon et al. [9] fine-tuned a VGG16 pre-trained on ImageNet. Spencer et al. [10] combined VGG16, VGG19, ResNet-34, and DenseNet-161 using a multi-class meta-classifier. Yüzkat et al. [11] proposed six CNN models and combined their decisions with soft voting. Non-CNN methods were also introduced. Ilhan et al. [12] presented a computational framework using multi-stage cascade-connected preprocessing techniques, region-based descriptor features, and non-linear kernel SVM-based learning.
However, these approaches do not consider sperm movement and focus exclusively on sperm images. Therefore, we compiled the MERSV dataset, which allows analysis that includes movement information. Using this dataset, we developed an end-to-end sperm grade distribution estimation model.
B. Video Recognition Models
Deep learning is used for video data analysis such as human action recognition [19], [20], [21], [22], [23], [24], [25], surveillance systems [26], [27], and visual speech recognition [28], [29]. Video recognition models, mainly used for action recognition, have evolved in parallel with image recognition models such as ResNet [30]. These models can be broadly classified into two categories: CNN-based and transformer-based models.
CNN-based video recognition models began with R3D [19], which extended 2D convolution to 3D. However, owing to the large number of parameters, R3D’s performance is limited. R(2+1)D reduced this burden by factorizing the 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, and SlowFast further improved accuracy by combining a slow pathway that samples frames sparsely with a fast pathway that samples them densely.
Transformer-based video recognition models began with ViViT [22], which was inspired by the transformer-based image recognition model ViT [31]. TimeSformer [23] introduced divided space-time attention, which outperformed various other self-attention schemes in terms of both computational complexity and accuracy.
We focus on commonly used video recognition models, such as R(2+1)D, SlowFast, and TimeSformer, as feature extractors for sperm video analysis.
There are alternative methods for extracting features from a video using video coding techniques that involve motion estimation. Kumar et al. [32] proposed the K-MCSP algorithm for motion estimation, which incorporates a non-linear function as opposed to MCSP, which uses a linear function. This modification reduces computational complexity while minimizing PSNR variation during reconstruction. We did not employ this method because our task requires learning, end to end, which features to extract from a video based on the grade distributions that reflect multiple expert assessments in the training data.
In addition to frame features, some studies have achieved high accuracy by incorporating new task-specific features. Cao et al. [33] proposed e-TSN, a TSN network for hand gesture recognition that incorporates hand skeletal features instead of optical flow. In this study, in addition to the video frame features of the detected sperm, we add the speed calculated from the position information obtained during detection as a feature.
C. Label Distribution Learning
Geng introduced the concept of a label distribution which includes different degrees of description across multiple labels, and investigated the best algorithm [34]. Gao et al. [35] subsequently demonstrated the effectiveness of end-to-end learning with KL-Divergence for tasks with label distributions.
In our context, the grade distribution serves as the estimation target. This distribution is a special case of a label distribution whose labels have an ordinal relationship. Consequently, we propose using earth mover’s distance for estimating the grade distribution, as this metric considers the distances between labels within the distribution.
Multi-Expert Rated Sperm Video (MERSV) Dataset
A. Dataset Construction
We compiled the multi-expert rated sperm video (MERSV) dataset for precise sperm assessment. The dataset includes 615 videos recorded under a microscope, with each video having a grade distribution determined by annotations from approximately 40 experts. The study was approved by the Ethics Committee for Medical and Biological Research Involving Human Subjects (approval No. 2023-27).
1) Sperm Video Recording Method
The videos were recorded from the semen of patients who had undergone ICSI treatment. The number of patients is 615, corresponding to the number of videos. Consent was obtained from all patients for the collection and utilization of their data. Sperm suspensions for video recording were prepared using density-gradient centrifugation followed by the swim-up method to obtain sperm with good motility. Video recording was conducted using an Olympus IX73 microscope. Owing to variations in the thickness of the sperm suspension, the focus and light source were adjusted for each recording. Other settings are consistent with those described in [36].
2) Details of Recorded Sperm Video
The videos were recorded at a rate of 15 frames per second (fps). Each frame had a resolution of
3) Sperm Evaluation by Multiple Experts
The experts assessed the sperm using a five-grade grading system: A (good), B (better), C (middle), D (worse), and E (bad). We refer to these labels as “grade class labels.” By combining the experts’ ratings, we obtain a distribution, which we refer to as the “grade distribution.”
Figure 2 shows examples of sperm videos and their grade distributions. The sufficiency of the video data quantity is discussed in Section VI-A. While we cannot publish this dataset now, we plan to make it public in the future.
Six examples from our MERSV dataset. For each sample, 4 out of 16 video frames (left) and the corresponding grade distribution (right) are shown. Top: samples with concentrated good ratings; middle: medium ratings; bottom: concentrated bad ratings. Samples on the left side are examples of low variability in ratings, while samples on the right side are examples of high variability in ratings.
B. Statistics of Grade Distribution
The total number of expert ratings collected was 24,533. For statistical analysis, we represented the grade class labels as categorical variables by assigning numerical grade scores to them: A = 1, B = 2, C = 3, D = 4, and E = 5. The means and standard deviations of the grade distributions are shown in Figure 3(a). We refer to the most frequently assigned grade class label/score as the “grade mode class label/score.” Histograms of the grade mode class scores are shown in Figure 3(b). These graphs show that most samples had grade distributions concentrated in B (2) and C (3). When training a model with this dataset, data imbalance may occur; therefore, the evaluation results for each grade mode class label should be examined.
Statistical analysis of the multiple experts’ ratings (grade distributions). Left (a): Joint histogram of the means and standard deviations of grade scores. Right (b): Histogram of the grade mode scores. The grade mode score is the most frequently assigned grade score. These graphs show that most samples had grade distributions concentrated in B (2) and C (3).
C. Grade Distribution Estimation Task
The grade distribution prediction task is referred to as the “grade distribution estimation task.” As mentioned in Section III-A, the grade distribution consists of grade class labels annotated by multiple experts. The grade distribution was normalized by dividing it by the number of experts so that the sum of its values is 1. Therefore, each value in the grade distribution represents the probability of the corresponding grade class (A to E).
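To make the normalization step concrete, the following is a minimal sketch (not the dataset’s actual preprocessing code) of how per-video expert ratings could be turned into a normalized grade distribution and its grade mode class label; the function and variable names are hypothetical.

```python
import numpy as np

GRADE_CLASSES = ["A", "B", "C", "D", "E"]  # five-grade system used by the experts

def to_grade_distribution(ratings):
    """Convert a list of expert ratings (e.g. ["B", "C", "B", ...])
    into a normalized grade distribution that sums to 1."""
    counts = np.array([ratings.count(g) for g in GRADE_CLASSES], dtype=np.float64)
    return counts / counts.sum()  # divide by the number of experts

# Example: 40 hypothetical expert ratings for one sperm video
example = ["A"] * 4 + ["B"] * 18 + ["C"] * 15 + ["D"] * 3
dist = to_grade_distribution(example)           # -> [0.10, 0.45, 0.375, 0.075, 0.0]
mode_label = GRADE_CLASSES[int(dist.argmax())]  # grade mode class label, here "B"
```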
Sperm Grade Distribution Estimation Model
A. Model Structure
We propose a sperm grade distribution estimation model for automated sperm assessment and provide an overview of the model in Figure 4. This model predicts the grade distribution from detected frames and sperm positions.
Overview of the sperm grade distribution estimation model. The model predicts the grade distribution from detected frames and sperm positions. In our model, a video recognition model is used as the feature extractor (Backbone), and the last layer (Head) outputs the grade distribution based on the extracted features and sperm speed.
Our proposed sperm grade estimation model is based on a video recognition model architecture. More explicitly, we examined three video recognition model architectures: R(2+1)D, SlowFast, and TimeSformer. The backbone $B$ extracts features from the input video frames $\boldsymbol{x}$, and the head $H$ outputs the predicted grade distribution $\hat{\boldsymbol{y}}$ from the extracted features and the sperm speed $\boldsymbol{v}$:\begin{equation*} \hat {\boldsymbol {y}} = H(B(\boldsymbol {x}), \boldsymbol {v}) \tag {1}\end{equation*}
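As an illustration of Eq. (1), the following is a minimal PyTorch sketch of the backbone-plus-head structure, assuming the backbone returns a pooled feature vector and the sperm speed is appended as a single scalar per sample; the class and argument names are hypothetical, and details such as the head depth may differ from our actual implementation.

```python
import torch
import torch.nn as nn

class GradeDistributionModel(nn.Module):
    """Minimal sketch of Eq. (1): y_hat = H(B(x), v).
    `backbone` is any video feature extractor returning a (batch, feat_dim) tensor;
    the head appends the sperm speed v and outputs a 5-class grade distribution."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_grades: int = 5):
        super().__init__()
        self.backbone = backbone                         # B(.)
        self.head = nn.Linear(feat_dim + 1, num_grades)  # H(., v)

    def forward(self, frames: torch.Tensor, speed: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(frames)                    # (batch, feat_dim)
        z = torch.cat([feats, speed.unsqueeze(1)], dim=1)
        return torch.softmax(self.head(z), dim=1)        # predicted grade distribution
```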
In addition, we prepared an image-based model as a baseline for comparison. We believe that video information is necessary for sperm analysis, as image information alone is not adequate, and demonstrate this in later experiments.
B. Loss Function
There is a lack of research that clarifies the appropriate loss function for grade distribution estimation tasks. Four loss functions were prepared to determine the best loss function for this task: cross entropy loss (CE), mean squared error loss (MSE), Jensen-Shannon divergence loss (JSD) and earth mover’s distance loss (EMD).
1) Cross Entropy Loss (CE)
CE is widely used as a training loss in classification tasks:\begin{equation*} \mathcal {L}_{CE}(\boldsymbol {p}, \boldsymbol {q}) = - \sum _{i=1}^{n} p_{i} \log (q_{i}) \tag {2}\end{equation*} where $\boldsymbol{p}$ and $\boldsymbol{q}$ denote the ground-truth and predicted grade distributions, respectively, and $n$ is the number of grade classes.
2) Mean Squared Error (MSE)
MSE is widely used as a training loss function for regression tasks.\begin{equation*} \mathcal {L}_{MSE}(\boldsymbol {p}, \boldsymbol {q}) = \frac {1}{n} \sum _{i=1}^{n} (p_{i} - q_{i})^{2} \tag {3}\end{equation*}
3) Jensen-Shannon Divergence (JSD)
JSD measures the difference between two probability distributions. JSD is based on the Kullback-Leibler divergence (KLD), but it differs in that it is symmetric and always has a finite value.\begin{align*} \mathcal {L}_{JS}(\boldsymbol {p}, \boldsymbol {q}) & = \frac {1}{2} \mathcal {L}_{KL}\left ({{\boldsymbol {p}, \frac {\boldsymbol {p}+\boldsymbol {q}}{2}}}\right ) + \frac {1}{2} \mathcal {L}_{KL}\left ({{\boldsymbol {q}, \frac {\boldsymbol {p}+\boldsymbol {q}}{2}}}\right ) \tag {4}\\ \mathcal {L}_{KL}(\boldsymbol {p}, \boldsymbol {q}) & = \sum _{i=1}^{n} p_{i} \log \frac {p_{i}}{q_{i}} \tag {5}\end{align*}
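A minimal PyTorch sketch of Eqs. (4)–(5), assuming $\boldsymbol{p}$ and $\boldsymbol{q}$ are batches of grade distributions; the small epsilon is an implementation choice for numerical stability rather than part of the definition.

```python
import torch

def kl_div(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # KL(p || q) summed over grade classes, averaged over the batch (Eq. 5)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

def jsd_loss(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Symmetric combination of two KL terms against the mixture m (Eq. 4)
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```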
However, these three loss functions do not account for the ordinal relationships between grade classes.
4) Earth Mover’s Distance (EMD)
EMD is defined as the minimum cost of transporting the mass from one distribution to another. The EMD is also known as the Wasserstein distance. It is used as a loss function in various scenarios, such as ordinal classification [37], [38], [39], adversarial training [40], and modality alignment learning [41]. In classification tasks where labels have strong relationships, EMD-based losses yield better results than other loss functions [37]. If the sums of the two distributions are equal and the class space can be represented by a one-dimensional embedding, an exact closed-form solution can be obtained [42]. The grade distributions considered in this study satisfy these conditions: because the grade classes have an ordering relation (A through E) and each distribution sums to 1, the EMD loss can be computed from the cumulative distribution functions (CDFs) as\begin{equation*} \mathcal {L}_{EMD}(\boldsymbol {p}, \boldsymbol {q}) = \sum _{i = 1}^{n}\left ({{CDF_{i}(\boldsymbol {p}) - CDF_{i}(\boldsymbol {q})}}\right )^{2} \tag {6}\end{equation*}
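Because the closed form of Eq. (6) reduces to cumulative sums, the EMD loss can be sketched in a few lines of PyTorch; the batch averaging below is an assumption about how per-sample losses are aggregated.

```python
import torch

def emd_loss(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Squared-CDF form of Eq. (6) for ordinal grade classes (A < B < ... < E).
    p, q: (batch, num_grades) distributions that each sum to 1."""
    cdf_p = torch.cumsum(p, dim=1)
    cdf_q = torch.cumsum(q, dim=1)
    return ((cdf_p - cdf_q) ** 2).sum(dim=1).mean()
```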
Experiments and Results
A. Experimental Settings
1) Datasets
We used the MERSV dataset described in Section III. For evaluation, stratified 3-fold cross-validation was performed using the grade mode class as the stratification label. This ensured that the distribution of grade mode classes in the training samples was the same as in the validation samples. In each fold, there were 492 training samples and 123 test samples, and videos of the same patient never appeared in both the training and test data. As each video differs in length, only the initial second, equivalent to 16 frames, was used. Additionally, fair sampling was performed so that the number of samples in each grade mode class in the training set matched that in the validation set. To align with the pretrained models, we upsampled the frames/images to the input size expected by each pretrained model.
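A minimal sketch of the stratified split, assuming scikit-learn is used and the grade mode class index of each video is available; the actual fold construction (e.g., seeding) may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(grade_mode: np.ndarray, n_splits: int = 3, seed: int = 0):
    """grade_mode: grade mode class index (0..4) for each of the 615 videos.
    One video per patient, so splitting by video also keeps patients disjoint."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy_x = np.zeros((len(grade_mode), 1))  # features are not needed for the split
    return list(skf.split(dummy_x, grade_mode))

# folds = make_folds(grade_mode)  # each entry: (train_indices, test_indices)
```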
2) Implementation Details
The sperm grade distribution estimation models presented in this study were implemented using PyTorch [43]. We used pretrained image/video recognition models as the image/video feature extractors (Backbone). Specifically, ResNet [30] was used as the image feature extractor. In this image-based model, predictions were obtained from all 16 frames and evaluated individually. For the video feature extractors, we used R(2+1)D, SlowFast, and TimeSformer.
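For illustration, the snippet below shows one way to obtain pretrained backbones from torchvision and strip their classification layers before attaching our head; the exact model variants and pretrained weights used in our implementation are not specified here, so this is an assumed setup (TimeSformer and SlowFast would require additional libraries or implementations).

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# Image backbone (per-frame baseline): drop the classification layer, keep features.
img_backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
img_feat_dim = img_backbone.fc.in_features          # 2048 for ResNet-50
img_backbone.fc = nn.Identity()

# Video backbone: R(2+1)D pretrained on Kinetics-400.
vid_backbone = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
vid_feat_dim = vid_backbone.fc.in_features           # 512 for r2plus1d_18
vid_backbone.fc = nn.Identity()
# Either backbone can then be wrapped by the GradeDistributionModel sketch above.
```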
3) Evaluation Metrics
We employed multiple evaluation metrics to accurately compare the performance of the models and loss functions. To evaluate distribution differences, we used EMD, MSE, JSD, and CE. For a fair evaluation of the loss functions, the histogram intersection (HI) [44], which is not used as a loss function, was also used as an evaluation metric. The HI is a similarity measure that quantifies the degree of overlap between two histograms; we define HI as the “grade distribution accuracy.”\begin{equation*} HI(\boldsymbol {p}, \boldsymbol {q}) = \sum _{i=1}^{n} \min (p_{i}, q_{i}) \tag {7}\end{equation*}
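HI in Eq. (7) can be computed directly, as in the following sketch (helper name hypothetical).

```python
import numpy as np

def histogram_intersection(p: np.ndarray, q: np.ndarray) -> float:
    """Grade distribution accuracy (Eq. 7): overlap between two distributions, in [0, 1]."""
    return float(np.minimum(p, q).sum())

# Example with hypothetical grade distributions over A-E:
# histogram_intersection(np.array([0.1, 0.5, 0.3, 0.1, 0.0]),
#                        np.array([0.2, 0.4, 0.3, 0.1, 0.0]))  # -> 0.9
```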
B. Results
Table 1 shows the performance of the grade distribution estimation models for different backbone models and loss function settings. Lower values of EMD, MSE, JSD, and CE indicate higher performance, whereas higher values of HI (Macro HI) and MacroF1 indicate higher performance. We analyzed these results by comparing the backbone models and the loss functions.
1) Comparison Between Backbone Models
For each loss function, we ranked the models and compared them based on their average rankings across the different evaluation metrics.
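The ranking procedure can be sketched as follows; the numeric values below are purely illustrative placeholders, not the results in Table 1, and the metric set is abbreviated.

```python
import pandas as pd

# results: rows = backbone models, columns = evaluation metrics (illustrative values only).
results = pd.DataFrame(
    {"EMD": [0.08, 0.10, 0.11, 0.12],
     "MSE": [0.006, 0.007, 0.008, 0.009],
     "HI":  [0.70, 0.68, 0.66, 0.65]},
    index=["TimeSformer", "R(2+1)D", "SlowFast", "ResNet"],
)

# Rank per metric (lower is better for EMD/MSE, higher is better for HI),
# then average the ranks across metrics for each backbone.
ranks = pd.concat(
    [results[["EMD", "MSE"]].rank(ascending=True),
     results[["HI"]].rank(ascending=False)],
    axis=1,
)
average_rank = ranks.mean(axis=1).sort_values()  # smaller average rank = better model
```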
Table 2 shows the average rankings for each metric and their overall averages. For all metrics, the average rankings of the models exhibited similar trends. TimeSformer consistently outperformed all other models in all metrics.
2) Comparison Between Loss Functions
For each model, we ranked the loss functions and compared them based on their average rankings across different evaluation metrics.
Table 3 shows the average rankings for each metric and the overall averages. For every loss function, the average ranking was highest when the evaluation metric was identical to the loss function. EMD had the best overall average ranking, indicating that it is the most appropriate loss function for the grade distribution estimation task. Conversely, CE ranked lowest for almost all metrics, suggesting that it is not suitable for this task. However, in some cases, CE outperformed the other loss functions with respect to MacroF1, indicating its suitability for classification tasks. JSD outperformed CE but lagged behind EMD and MSE in average ranking.
3) Observations During Training
Figure 5 shows the learning curves on the test data for each loss function. The numbers in parentheses indicate the epochs at which each model performed best. These graphs confirm that all models converged well. Additionally, TimeSformer consistently attained the lowest value at the earliest epochs for all loss functions, indicating efficient and rapid convergence during training.
Learning curves for each loss function on the test data. The number in parentheses indicates the epoch at which the model performs best.
4) Accuracy in Grade Mode Class
We also assessed the accuracy of the predicted grade mode class (grade mode accuracy), and Table 4 shows the results. Note that this accuracy is assessed under label imbalance. The accuracy ranges from 65% to 70%. On average, TimeSformer shows a 3.41% improvement over ResNet, and the video recognition models outperform ResNet, an image-based model, for almost all loss functions. These results reinforce the need for video-based sperm analysis.
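A sketch of how grade mode accuracy could be computed, assuming the grade mode class is taken as the argmax of each predicted and ground-truth grade distribution; the helper name is hypothetical.

```python
import numpy as np

def grade_mode_accuracy(pred_dists: np.ndarray, true_dists: np.ndarray) -> float:
    """Fraction of samples whose predicted grade mode class matches the
    experts' grade mode class (argmax of each (num_samples, 5) distribution array)."""
    pred_mode = pred_dists.argmax(axis=1)
    true_mode = true_dists.argmax(axis=1)
    return float((pred_mode == true_mode).mean())
```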
C. Analysis of HI Distribution in Test Data
We also compared the HI distributions in the test data. Figure 6 shows the HI distributions for each grade mode class label; the blue dotted line shows the average HI. To analyze these distributions, we focused on specific segments of the HI score distribution, namely the higher (Top25%), middle (Top50%), and lower (Top75%) segments. Table 5 lists the Top25/50/75% HI results, which we analyze below by comparing backbone models and loss functions.
Distributions of HI for each grade mode class label. For a more detailed analysis, we focused on specific segments of the HI score distribution (Table 5). The blue dotted line shows the average HI score.
1) Comparison Between Backbone Models
For each loss function, we ranked the backbone models and compared them based on their average rankings from the results in Table 5. Table 6 lists the average rankings for each metric and overall averages.
Analyzing Table 6, we find that TimeSformer performs exceptionally well across all segments of the HI score, indicating its superiority on this dataset, which is consistent with the findings in Section V-B. SlowFast outperformed ResNet in the Top25% segment but performed worse than ResNet in the Top50% and Top75% segments. This discrepancy suggests that SlowFast has difficulty effectively extracting generic features from video data.
2) Comparison Between Loss Functions
For each model, we ranked the loss functions and compared them based on their average rankings from the results in Table 5.
Table 7 lists a summary of the average rankings for each loss function across the various metrics. Upon analyzing the overall average, we find that EMD was the best among the loss functions. Specifically, EMD outperformed the other loss functions in the Top50% and Top75% HI score segments. This can be attributed to the ability of EMD to consider the relationships between grade classes and to prevent fatal mistakes in the grade distribution estimation task. Particularly in the medical domain, it is important to guarantee the worst-case score from the perspective of reliability. Therefore, EMD is a suitable choice for practical applications. Conversely, CE performed the worst among all the loss functions. Similar to the findings presented in Section V-B, these results reinforce the unsuitability of CE for the grade distribution estimation task.
From the analysis so far, we conclude that the best backbone is TimeSformer and the best loss function is EMD. We visualized the latent space of our model under this setting and discuss the sufficiency of the quantity of video data in Section VI-A.
Conclusion
In this study, we curated the MERSV dataset and proposed a model for end-to-end sperm grade distribution estimation from videos to help reduce experts’ workload in sperm selection for ICSI. Based on our experimental results, TimeSformer is the most promising of the video recognition models and outperforms the image recognition model ResNet. This result indicates that image-based analysis alone is insufficient for comprehensive sperm analysis and highlights the importance of video-based analysis. We also identified EMD as the most suitable loss function for the grade distribution estimation task, demonstrating its superior performance on lower-scoring segment samples.
However, our study has three limitations. Firstly, we cannot evaluate our models on multiple databases because there are no other datasets with sperm videos and multi-expert annotations. In addition, our dataset cannot be published yet. Therefore, we will continue collecting new data and prepare to publish our dataset in the future. Secondly, the video models have more parameters than the image model, which may make them difficult to use in clinics and reduces inference speed (see Section VI-B). Therefore, we will also work on model compression or build a new, more efficient model in the future. Thirdly, although we assessed the performance of our proposed model in estimating grade distributions, we have not yet been able to gauge its effectiveness in reducing the workload of experts. Hence, we aim to introduce this system into clinical environments and validate its effectiveness.
Appendix
Video Feature Distributions and Dataset Size Discussion
We examined the distribution of video features in our dataset using the 768-dimensional features extracted by the TimeSformer model trained with EMD loss. Figure 7 shows the 2-dimensional video features compressed from 768 dimensions by PCA. The proportion of variance explained by the first principal component (PC1) was 69%, while that of the second principal component (PC2) was 4.7%. The color of each point in this plot indicates the grade mode score of the corresponding sample. In this plot, the samples become reddish towards the upper right and bluish towards the lower left. This confirms that TimeSformer can extract appropriate features from a video.
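A minimal sketch of this projection, assuming the 768-dimensional embeddings have been extracted beforehand; scikit-learn’s PCA is used here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_features(features: np.ndarray):
    """features: (615, 768) array of per-video TimeSformer embeddings (assumed precomputed).
    Returns 2-D coordinates for plotting and the explained variance ratios."""
    pca = PCA(n_components=2)
    coords = pca.fit_transform(features)        # 2-D coordinates, colored by grade mode score
    explained = pca.explained_variance_ratio_   # e.g. roughly [0.69, 0.047] as reported above
    return coords, explained
```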
Figure 8(a) shows the histogram of PC1 with a bias added to prevent negative values, while Figure 8(b) presents its logarithmically transformed version. The mean and standard deviation of the logarithmically transformed PC1 were computed; the standard deviation was $\sigma = 0.82$, which is used in the estimate below.
PC1 distribution. Left (a): histogram of PC1 with a bias added to prevent negative values. Right (b): histogram of the logarithmically transformed PC1.
The number of samples ($n$) is 615, and the 95% confidence interval range ($\mu_{r}$) is obtained as\begin{equation*} \mu _{r} = 1.96 \times \sqrt {\frac {\sigma ^{2}}{n}} = 1.96 \times \sqrt {\frac {0.82^{2}}{615}} \fallingdotseq 0.065 \tag {8}\end{equation*}
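The following few lines reproduce the arithmetic of Eq. (8):

```python
import math

n, sigma = 615, 0.82
mu_r = 1.96 * math.sqrt(sigma**2 / n)   # 95% confidence interval half-width
print(round(mu_r, 3))                   # -> 0.065, matching Eq. (8)
```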
Model Inference Time
We measured the inference speed; Table 8 shows the results along with the number of parameters. The inference speed was measured with a batch size of 8 on an Intel(R) Core(TM) i9-10940X CPU, assuming use in a medical setting.
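For reference, a simple way to measure CPU inference time per batch is sketched below; this is an illustrative procedure, not necessarily the exact measurement protocol used for Table 8.

```python
import time
import torch

def measure_inference_time(model: torch.nn.Module, batch: torch.Tensor, runs: int = 20) -> float:
    """Average CPU inference time per batch in seconds. `batch` matches the model input,
    e.g. a tensor of shape (8, 3, 16, H, W) for a video backbone with batch size 8 and 16 frames."""
    model.eval()
    with torch.no_grad():
        model(batch)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
        return (time.perf_counter() - start) / runs
```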