Introduction
Infertility is a widespread global concern, affecting approximately 17.5% of adults, or about one in six people [1]. To address this issue, affordable and high-quality fertility care is needed.
Some studies suggest that male factors contribute to infertility in up to 50% of couples [2], [3], [4]. Therefore, infertility treatment is important for both males and females. Intracytoplasmic sperm injection (ICSI) is a common treatment for male infertility. In ICSI, individual sperm cells are meticulously selected and injected into eggs by experts. In this process, it is important to quickly and accurately classify sperm as either normal or abnormal. This assessment, however, requires expertise and is time-consuming.
Several computer-based systems for sperm analysis have been proposed to mitigate the burden on experts. However, no existing system considers both sperm motility and morphology or directly assesses sperm suitability for ICSI. Computer-assisted semen analysis (CASA) systems [5] have automated sperm analysis to some extent. These systems provide sperm concentration, motility analysis, and morphology assessment, but they cannot assess sperm considering both motility and morphology, and they do not assess sperm suitability for ICSI. Several studies have also used machine learning [6], [7], [8], [9], [10], [11], [12]. However, they focused primarily on sperm images, ignored a crucial aspect, motion information, and cannot assess sperm suitability for ICSI.
To address this problem, we collected multiple expert evaluations of sperm videos, curated the multi-expert rated sperm video (MERSV) dataset, which includes motion information, and developed an end-to-end sperm grade distribution estimation model trained on this dataset. Figure 1 illustrates this process. Owing to end-to-end inference from video, the model can consider both motion and morphology. The predicted distribution, which reflects the evaluations of multiple experts, can easily help determine the suitability of a particular sperm for ICSI. By referring to the predicted sperm grade distributions, experts can efficiently select a sperm from multiple candidates without the need to check every sperm. Furthermore, the collected dataset and the predicted sperm grade distributions can be valuable resources for professional development. Therefore, our system can significantly help to reduce the workload of experts.
Overview of our sperm grade estimation system. We collected multiple expert evaluations of sperm videos, curated the multi-expert rated sperm video (MERSV) dataset for analysis, and developed an end-to-end sperm grade distribution estimation model trained by this dataset. By referring to the sperm grade distributions predicted by our model, experts can reduce their workload.
There are two research questions in developing the end-to-end sperm grade distribution estimation model. First, how can sperm motion and morphological features be extracted from video data? Second, which loss function is appropriate for a model that estimates the grade distribution? To answer these research questions, we conducted a thorough comparison of models based on different feature extractors and loss functions. We considered three video feature extractors, R(2+1)D, SlowFast, and TimeSformer, together with four candidate loss functions, and compared them against an image-based ResNet baseline.
Our contributions can be summed up in three points:
We constructed the multi-expert rated sperm video (MERSV) dataset and proposed an end-to-end, video-based sperm grade distribution estimation system that helps to reduce experts’ workload for ICSI.
We analyzed three feature extractors and found that TimeSformer was the most effective feature extractor for video-based sperm analysis, outperforming ResNet and highlighting the indispensability of video data.
Earth mover’s distance (EMD) loss was identified as the most suitable loss function for estimating the grade distribution, demonstrating superior performance on lower-scoring segment samples.
The rest of this paper is organised as follows. Section II briefly reviews previous work on machine learning approaches for sperm assessment, video recognition models, and label distribution learning. Section III describes the details of our compiled MERSV dataset. Section IV explains the details of our proposed sperm grade distribution estimation model. Section V presents the experimental setup for comparing models based on different feature extractors and loss functions, along with the results. Finally, Section VI presents the conclusions.
Related Work
A. Datasets and Machine Learning Approaches for Sperm Assessment
Machine learning has been used in healthcare and medicine [13], [14], [15], [16], [17]. Several machine learning methods have also been proposed for sperm morphology analysis. These methods were validated using three datasets: SMIDS, HuSHeM, and SCIAN.
The Sperm Morphology Image (SMIDS) dataset was constructed by the Medical Faculty of Istanbul University using smartphone-based data acquisition [6]. It included 3,000 segmented RGB sperm images labeled as normal (1,021), abnormal (1,005), and non-sperm (974) by an expert.
The Human Sperm Head Morphology (HuSHeM) dataset was curated from sperm samples of 15 patients at the Isfahan Fertility and Infertility Center [7]. This dataset included 216 RGB images of sperm heads.
The Laboratory for Scientific Image Analysis Gold-Standard for Morphological Sperm Analysis (SCIAN) dataset consists of sperm samples from the Medical Faculty of Chile University [8]. It included 1,132 grayscale sperm head images.
Numerous studies used image recognition models, particularly CNN-based models. Riordon et al. [9] fine-tuned a VGG16 pre-trained on ImageNet. Spencer et al. [10] combined VGG16, VGG19, ResNet-34, and DenseNet-161 using a multi-class meta-classifier. Yüzkat et al. [11] proposed six CNN models and combined their decisions with soft voting. Non-CNN methods were also introduced. Ilhan et al. [12] presented a computational framework using multi-stage cascade-connected preprocessing techniques, region-based descriptor features, and non-linear kernel SVM-based learning.
However, these approaches do not consider sperm movement and focus exclusively on sperm images. Therefore, we compiled the MERSV dataset, which allows analysis that includes movement information. Using this dataset, we developed an end-to-end sperm grade distribution estimation model.
B. Video Recognition Models
Deep learning is used for video data analysis such as human action recognition [19], [20], [21], [22], [23], [24], [25], surveillance systems [26], [27], and visual speech recognition [28], [29]. Video recognition models, mainly used for action recognition, have evolved in parallel with image recognition models such as ResNet [30]. These models can be broadly classified into two categories: CNN-based and transformer-based models.
CNN-based video recognition models began with R3D [19], which extended 2D convolution to 3D. However, owing to the large number of parameters, R3D’s performance is limited. R(2+1)D reduced this burden by factorizing the 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, and SlowFast further improved accuracy by combining a slow pathway that samples frames sparsely with a fast pathway that samples them densely.
Transformer-based video recognition models began with ViViT [22], which was inspired by the transformer-based image recognition model ViT [31]. TimeSformer [23] introduced divided space-time attention, which outperformed various other self-attention schemes in terms of both computational complexity and accuracy.
We focus on commonly used video recognition models, such as R(2+1)D, SlowFast, and TimeSformer, as feature extractors for sperm video analysis.
There are alternative methods for extracting features from a video using video coding techniques that involve motion estimation. Kumar et al. [32] proposed the K-MCSP algorithm for motion estimation, which incorporates a non-linear function as opposed to MCSP, which uses a linear function. This modification reduces computational complexity while minimizing PSNR variation during reconstruction. We did not employ this method because our task requires learning, end to end, which features to extract from a video based on the grade distributions that reflect multiple expert assessments in the training data.
In addition to frame features, some studies have achieved high accuracy by incorporating new task-specific features. Cao et al. [33] proposed e-TSN, a TSN network for hand gesture recognition that incorporates hand skeletal features instead of optical flow. In this study, in addition to the video frame features of the detected sperm, we add the speed calculated from the position information obtained during detection as a feature.
C. Label Distribution Learning
Geng introduced the concept of a label distribution which includes different degrees of description across multiple labels, and investigated the best algorithm [34]. Gao et al. [35] subsequently demonstrated the effectiveness of end-to-end learning with KL-Divergence for tasks with label distributions.
In our context, the grade distribution serves as the estimation target. This distribution is a special case of a label distribution whose labels have an ordinal relationship. Consequently, we propose using earth mover’s distance for estimating the grade distribution, as this metric considers the distances between labels within the distribution.
Multi-Expert Rated Sperm Video (MERSV) Dataset
A. Dataset Construction
We compiled the multi-expert rated sperm video (MERSV) dataset for precise sperm assessment. The dataset includes 615 videos recorded under a microscope, with each video having a grade distribution determined by annotations from approximately 40 experts. The study was approved by the Ethics Committee for Medical and Biological Research Involving Human Subjects (approval No. 2023-27).
1) Sperm Video Recording Method
The videos were recorded from the semen of patients who had undergone ICSI treatment. The number of patients is 615, corresponding to the number of videos. Consent was obtained from all patients for the collection and utilization of their data. Sperm suspensions for video recording were prepared using density-gradient centrifugation followed by the swim-up method to obtain sperm with good motility. Video recording was conducted using an Olympus IX73 microscope. Owing to variations in the thickness of the sperm suspension, the focus and light source were adjusted for each recording. Other settings are consistent with those described in [36].
2) Details of Recorded Sperm Video
The videos were recorded at a rate of 15 frames per second (fps). Each frame had a resolution of
3) Sperm Evaluation by Multiple Experts
The experts assessed the sperm using a five-grade grading system: A (good), B (better), C (middle), D (worse), and E (bad). We refer to these labels as “grade class labels.” By combining the experts’ ratings, we obtain a distribution, which we refer to as the “grade distribution.”
Figure 2 shows examples of sperm videos and their grade distributions. The sufficiency of the video data quantity is discussed in Section VI-A. While we cannot publish this dataset now, we plan to make it public in the future.
Six examples from our MERSV dataset. For each sample, 4 out of 16 video frames (left) and the corresponding grade distribution (right) are shown. Top: samples with concentrated good ratings; middle: medium ratings; bottom: concentrated bad ratings. Samples on the left side are examples of low variability in ratings, while samples on the right side are examples of high variability in ratings.
B. Statistics of Grade Distribution
The total number of expert ratings collected was 24,533. For statistical analysis, we represented the grade class labels as categorical variables by assigning numerical grade scores to them: A = 1, B = 2, C = 3, D = 4, and E = 5. The means and standard deviations of the grade distributions are shown in Figure 3(a). We refer to the most frequently assigned grade class label/score as the “grade mode class label/score.” Histograms of the grade mode class scores are shown in Figure 3(b). These graphs show that most samples had grade distributions concentrated in B (2) and C (3). When training a model with this dataset, data imbalance may occur; therefore, the evaluation results for each grade mode class label should be examined.
Statistical analysis of the multiple experts’ ratings (grade distributions). Left (a): Joint histogram of the means and standard deviations of grade scores. Right (b): Histogram of the grade mode scores. The grade mode score is the most frequently assigned grade score. These graphs show that most samples had grade distributions concentrated in B (2) and C (3).
C. Grade Distribution Estimation Task
The grade distribution prediction task is referred to as the “grade distribution estimation task.” As mentioned in Section III-A, the grade distribution consists of grade class labels annotated by multiple experts. The grade distribution was normalized by dividing it by the number of experts so that the sum of its values is 1. Therefore, each value in the grade distribution represents the probability of the corresponding grade class (A to E).
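To make the normalization step concrete, the following is a minimal sketch (not the dataset’s actual preprocessing code) of how per-video expert ratings could be turned into a normalized grade distribution and its grade mode class label; the function and variable names are hypothetical.

```python
import numpy as np

GRADE_CLASSES = ["A", "B", "C", "D", "E"]  # five-grade system used by the experts

def to_grade_distribution(ratings):
    """Convert a list of expert ratings (e.g. ["B", "C", "B", ...])
    into a normalized grade distribution that sums to 1."""
    counts = np.array([ratings.count(g) for g in GRADE_CLASSES], dtype=np.float64)
    return counts / counts.sum()  # divide by the number of experts

# Example: 40 hypothetical expert ratings for one sperm video
example = ["A"] * 4 + ["B"] * 18 + ["C"] * 15 + ["D"] * 3
dist = to_grade_distribution(example)           # -> [0.10, 0.45, 0.375, 0.075, 0.0]
mode_label = GRADE_CLASSES[int(dist.argmax())]  # grade mode class label, here "B"
```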
Sperm Grade Distribution Estimation Model
A. Model Structure
We propose a sperm grade distribution estimation model for automated sperm assessment and provide an overview of the model in Figure 4. This model predicts the grade distribution from detected frames and sperm positions.
Overview of the sperm grade distribution estimation model. The model predicts the grade distribution from detected frames and sperm positions. In our model, a video recognition model is used as the feature extractor (Backbone), and the last layer (Head) outputs the grade distribution based on the extracted features and sperm speed.
Our proposed sperm grade estimation model is based on a video recognition model architecture. More explicitly, we examined three video recognition model architectures: R(2+1)D, SlowFast, and TimeSformer. The backbone $B$ extracts features from the input video frames $\boldsymbol{x}$, and the head $H$ outputs the predicted grade distribution $\hat{\boldsymbol{y}}$ from the extracted features and the sperm speed $\boldsymbol{v}$:\begin{equation*} \hat {\boldsymbol {y}} = H(B(\boldsymbol {x}), \boldsymbol {v}) \tag {1}\end{equation*}
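As an illustration of Eq. (1), the following is a minimal PyTorch sketch of the backbone-plus-head structure, assuming the backbone returns a pooled feature vector and the sperm speed is appended as a single scalar per sample; the class and argument names are hypothetical, and details such as the head depth may differ from our actual implementation.

```python
import torch
import torch.nn as nn

class GradeDistributionModel(nn.Module):
    """Minimal sketch of Eq. (1): y_hat = H(B(x), v).
    `backbone` is any video feature extractor returning a (batch, feat_dim) tensor;
    the head appends the sperm speed v and outputs a 5-class grade distribution."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_grades: int = 5):
        super().__init__()
        self.backbone = backbone                         # B(.)
        self.head = nn.Linear(feat_dim + 1, num_grades)  # H(., v)

    def forward(self, frames: torch.Tensor, speed: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(frames)                    # (batch, feat_dim)
        z = torch.cat([feats, speed.unsqueeze(1)], dim=1)
        return torch.softmax(self.head(z), dim=1)        # predicted grade distribution
```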
In addition, we prepared an image-based model as a baseline for comparison. We believe that video information is necessary for sperm analysis, as image information alone is not adequate, and demonstrate this in later experiments.
B. Loss Function
There is a lack of research that clarifies the appropriate loss function for grade distribution estimation tasks. Four loss functions were prepared to determine the best loss function for this task: cross entropy loss (CE), mean squared error loss (MSE), Jensen-Shannon divergence loss (JSD) and earth mover’s distance loss (EMD).
1) Cross Entropy Loss (CE)
CE is widely used as a training loss in classification tasks:\begin{equation*} \mathcal {L}_{CE}(\boldsymbol {p}, \boldsymbol {q}) = - \sum _{i=1}^{n} p_{i} \log (q_{i}) \tag {2}\end{equation*} where $\boldsymbol{p}$ and $\boldsymbol{q}$ denote the ground-truth and predicted grade distributions, respectively, and $n$ is the number of grade classes.
2) Mean Squared Error (MSE)
MSE is widely used as a training loss function for regression tasks.\begin{equation*} \mathcal {L}_{MSE}(\boldsymbol {p}, \boldsymbol {q}) = \frac {1}{n} \sum _{i=1}^{n} (p_{i} - q_{i})^{2} \tag {3}\end{equation*}
3) Jensen-Shannon Divergence (JSD)
JSD measures the difference between two probability distributions. JSD is based on the Kullback-Leibler divergence (KLD), but it differs in that it is symmetric and always has a finite value.\begin{align*} \mathcal {L}_{JS}(\boldsymbol {p}, \boldsymbol {q}) & = \frac {1}{2} \mathcal {L}_{KL}\left ({{\boldsymbol {p}, \frac {\boldsymbol {p}+\boldsymbol {q}}{2}}}\right ) + \frac {1}{2} \mathcal {L}_{KL}\left ({{\boldsymbol {q}, \frac {\boldsymbol {p}+\boldsymbol {q}}{2}}}\right ) \tag {4}\\ \mathcal {L}_{KL}(\boldsymbol {p}, \boldsymbol {q}) & = \sum _{i=1}^{n} p_{i} \log \frac {p_{i}}{q_{i}} \tag {5}\end{align*}
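A minimal PyTorch sketch of Eqs. (4)–(5), assuming $\boldsymbol{p}$ and $\boldsymbol{q}$ are batches of grade distributions; the small epsilon is an implementation choice for numerical stability rather than part of the definition.

```python
import torch

def kl_div(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # KL(p || q) summed over grade classes, averaged over the batch (Eq. 5)
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

def jsd_loss(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Symmetric combination of two KL terms against the mixture m (Eq. 4)
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```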
However, these three loss functions do not account for the ordinal relationships between grade classes.
4) Earth Mover’s Distance (EMD)
EMD is defined as the minimum cost of transporting the mass from one distribution to another. The EMD is also known as the Wasserstein distance. It is used as a loss function in various scenarios, such as ordinal classification [37], [38], [39], adversarial training [40], and modality alignment learning [41]. In classification tasks where labels have strong relationships, EMD-based losses yield better results than other loss functions [37]. If the sums of the two distributions are equal and the class space can be represented by a one-dimensional embedding, an exact closed-form solution can be obtained [42]. The grade distributions considered in this study satisfy these conditions: because the grade classes have an ordering relation (A through E) and each distribution sums to 1, the EMD loss can be computed from the cumulative distribution functions (CDFs) as\begin{equation*} \mathcal {L}_{EMD}(\boldsymbol {p}, \boldsymbol {q}) = \sum _{i = 1}^{n}\left ({{CDF_{i}(\boldsymbol {p}) - CDF_{i}(\boldsymbol {q})}}\right )^{2} \tag {6}\end{equation*}
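Because the closed form of Eq. (6) reduces to cumulative sums, the EMD loss can be sketched in a few lines of PyTorch; the batch averaging below is an assumption about how per-sample losses are aggregated.

```python
import torch

def emd_loss(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Squared-CDF form of Eq. (6) for ordinal grade classes (A < B < ... < E).
    p, q: (batch, num_grades) distributions that each sum to 1."""
    cdf_p = torch.cumsum(p, dim=1)
    cdf_q = torch.cumsum(q, dim=1)
    return ((cdf_p - cdf_q) ** 2).sum(dim=1).mean()
```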
Experiments and Results
A. Experimental Settings
1) Datasets
We used the MERSV dataset described in Section III. For evaluation, stratified 3-fold cross-validation was performed using the grade mode class as the stratification label. This ensured that the distribution of grade mode classes in the training samples was the same as in the validation samples. In each fold, there were 492 training samples and 123 test samples, and videos of the same patient never appeared in both the training and test data. As each video differs in length, only the initial second, equivalent to 16 frames, was used. Additionally, fair sampling was performed so that the number of samples in each grade mode class in the training set matched that in the validation set. To align with the pretrained models, we upsampled the frames/images to the input size expected by each pretrained model.
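A minimal sketch of the stratified split, assuming scikit-learn is used and the grade mode class index of each video is available; the actual fold construction (e.g., seeding) may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(grade_mode: np.ndarray, n_splits: int = 3, seed: int = 0):
    """grade_mode: grade mode class index (0..4) for each of the 615 videos.
    One video per patient, so splitting by video also keeps patients disjoint."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy_x = np.zeros((len(grade_mode), 1))  # features are not needed for the split
    return list(skf.split(dummy_x, grade_mode))

# folds = make_folds(grade_mode)  # each entry: (train_indices, test_indices)
```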
2) Implementation Details
The sperm grade distribution estimation models presented in this study were implemented using PyTorch [43]. We used pretrained image/video recognition models as the image/video feature extractors (Backbone). Specifically, ResNet [30] was used as the image feature extractor. In this image-based model, predictions were obtained from all 16 frames and evaluated individually. For the video feature extractors, we used R(2+1)D, SlowFast, and TimeSformer.
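For illustration, the snippet below shows one way to obtain pretrained backbones from torchvision and strip their classification layers before attaching our head; the exact model variants and pretrained weights used in our implementation are not specified here, so this is an assumed setup (TimeSformer and SlowFast would require additional libraries or implementations).

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# Image backbone (per-frame baseline): drop the classification layer, keep features.
img_backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
img_feat_dim = img_backbone.fc.in_features          # 2048 for ResNet-50
img_backbone.fc = nn.Identity()

# Video backbone: R(2+1)D pretrained on Kinetics-400.
vid_backbone = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
vid_feat_dim = vid_backbone.fc.in_features           # 512 for r2plus1d_18
vid_backbone.fc = nn.Identity()
# Either backbone can then be wrapped by the GradeDistributionModel sketch above.
```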
3) Evaluation Metrics
We employed multiple evaluation metrics to accurately compare the performance of the models and loss functions. To evaluate distribution differences, we used EMD, MSE, JSD, and CE. For a fair evaluation of the loss functions, the histogram intersection (HI) [44], which is not used as a loss function, was also used as an evaluation metric. The HI is a similarity measure that quantifies the degree of overlap between two histograms; we define HI as the “grade distribution accuracy.”\begin{equation*} HI(\boldsymbol {p}, \boldsymbol {q}) = \sum _{i=1}^{n} \min (p_{i}, q_{i}) \tag {7}\end{equation*}
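HI in Eq. (7) can be computed directly, as in the following sketch (helper name hypothetical).

```python
import numpy as np

def histogram_intersection(p: np.ndarray, q: np.ndarray) -> float:
    """Grade distribution accuracy (Eq. 7): overlap between two distributions, in [0, 1]."""
    return float(np.minimum(p, q).sum())

# Example with hypothetical grade distributions over A-E:
# histogram_intersection(np.array([0.1, 0.5, 0.3, 0.1, 0.0]),
#                        np.array([0.2, 0.4, 0.3, 0.1, 0.0]))  # -> 0.9
```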
B. Results
Table 1 shows the performance of the grade distribution estimation models for different backbone models and loss function settings. Lower values of EMD, MSE, JSD, and CE indicate higher performance, whereas higher values of HI (Macro HI) and MacroF1 indicate higher performance. We analyzed these results by comparing the backbone models and the loss functions.
1) Comparison Between Backbone Models
For each loss function, we ranked the models and compared them based on their average rankings across the different evaluation metrics.
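The ranking procedure can be sketched as follows; the numeric values below are purely illustrative placeholders, not the results in Table 1, and the metric set is abbreviated.

```python
import pandas as pd

# results: rows = backbone models, columns = evaluation metrics (illustrative values only).
results = pd.DataFrame(
    {"EMD": [0.08, 0.10, 0.11, 0.12],
     "MSE": [0.006, 0.007, 0.008, 0.009],
     "HI":  [0.70, 0.68, 0.66, 0.65]},
    index=["TimeSformer", "R(2+1)D", "SlowFast", "ResNet"],
)

# Rank per metric (lower is better for EMD/MSE, higher is better for HI),
# then average the ranks across metrics for each backbone.
ranks = pd.concat(
    [results[["EMD", "MSE"]].rank(ascending=True),
     results[["HI"]].rank(ascending=False)],
    axis=1,
)
average_rank = ranks.mean(axis=1).sort_values()  # smaller average rank = better model
```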
Table 2 shows the average rankings for each metric and their overall averages. For all metrics, the average rankings of the models exhibited similar trends. TimeSformer consistently outperformed all other models in all metrics.
2) Comparison Between Loss Functions
For each model, we ranked the loss functions and compared them based on their average rankings across different evaluation metrics.
Table 3 shows the average rankings for each metric and the overall averages. For every loss function, the average ranking was highest when the evaluation metric was identical to the loss function. EMD had the best overall average ranking, indicating that it is the most appropriate loss function for the grade distribution estimation task. Conversely, CE ranked lowest for almost all metrics, suggesting that it is not suitable for this task. However, in some cases, CE outperformed the other loss functions with respect to MacroF1, indicating its suitability for classification tasks. JSD outperformed CE but lagged behind EMD and MSE in average ranking.
3) Observations During Training
Figure 5 shows the learning curves on the test data for each loss function. The numbers in parentheses indicate the epochs at which each model performed best. These graphs confirm that all models converged well. Additionally, TimeSformer consistently attained the lowest value at the earliest epochs for all loss functions, indicating efficient and rapid convergence during training.
Learning curves for each loss function on the test data. The number in parentheses indicates the epoch at which the model performs best.
4) Accuracy in Grade Mode Class
We also assessed the accuracy of the predicted grade mode class (grade mode accuracy), and Table 4 shows the results. Note that this accuracy is assessed under label imbalance. The accuracy ranges from 65% to 70%. On average, TimeSformer shows a 3.41% improvement over ResNet, and the video recognition models outperform ResNet, an image-based model, for almost all loss functions. These results reinforce the need for video-based sperm analysis.
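A sketch of how grade mode accuracy could be computed, assuming the grade mode class is taken as the argmax of each predicted and ground-truth grade distribution; the helper name is hypothetical.

```python
import numpy as np

def grade_mode_accuracy(pred_dists: np.ndarray, true_dists: np.ndarray) -> float:
    """Fraction of samples whose predicted grade mode class matches the
    experts' grade mode class (argmax of each (num_samples, 5) distribution array)."""
    pred_mode = pred_dists.argmax(axis=1)
    true_mode = true_dists.argmax(axis=1)
    return float((pred_mode == true_mode).mean())
```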
C. Analysis of HI Distribution in Test Data
We also compared the HI distributions in the test data. Figure 6 shows the HI distributions for each grade mode class label; the blue dotted line shows the average HI. To analyze these distributions, we focused on specific segments of the HI score distribution, namely the higher (Top25%), middle (Top50%), and lower (Top75%) segments. Table 5 lists the Top25/50/75% HI results, which we analyze below by comparing backbone models and loss functions.
Distributions of HI for each grade mode class label. For a more detailed analysis, we focused on specific segments of the HI score distribution (Table 5). The blue dotted line shows the average HI score.
1) Comparison Between Backbone Models
For each loss function, we ranked the backbone models and compared them based on their average rankings from the results in Table 5. Table 6 lists the average rankings for each metric and overall averages.
Analyzing Table 6, we find that TimeSformer performs exceptionally well across all segments of the HI score, indicating its superiority on this dataset, which is consistent with the findings in Section V-B. SlowFast outperformed ResNet in the Top25% segment but performed worse than ResNet in the Top50% and Top75% segments. This discrepancy suggests that SlowFast has difficulty effectively extracting generic features from video data.
2) Comparison Between Loss Functions
For each model, we ranked the loss functions and compared them based on their average rankings from the results in Table 5.
Table 7 lists a summary of the average rankings for each loss function across the various metrics. Upon analyzing the overall average, we find that EMD was the best among the loss functions. Specifically, EMD outperformed the other loss functions in the Top50% and Top75% HI score segments. This can be attributed to the ability of EMD to consider the relationships between grade classes and to prevent fatal mistakes in the grade distribution estimation task. Particularly in the medical domain, it is important to guarantee the worst-case score from the perspective of reliability. Therefore, EMD is a suitable choice for practical applications. Conversely, CE performed the worst among all the loss functions. Similar to the findings presented in Section V-B, these results reinforce the unsuitability of CE for the grade distribution estimation task.
From the analysis so far, we conclude that the best backbone is TimeSformer and the best loss function is EMD. We visualized the latent space of our model under this setting and discuss the sufficiency of the quantity of video data in Section VI-A.
Conclusion
In this study, we curated the MERSV dataset and proposed a model for end-to-end sperm grade distribution estimation from videos to help reduce experts’ workload in sperm selection for ICSI. Based on our experimental results, TimeSformer is the most promising of the video recognition models and outperforms the image recognition model ResNet. This result indicates that image-based analysis alone is insufficient for comprehensive sperm analysis and highlights the importance of video-based analysis. We also identified EMD as the most suitable loss function for the grade distribution estimation task, demonstrating its superior performance on lower-scoring segment samples.
However, our study has three limitations. Firstly, we cannot evaluate our models on multiple databases because there are no other datasets with sperm videos and multi-expert annotations. In addition, our dataset cannot be published yet. Therefore, we will continue collecting new data and prepare to publish our dataset in the future. Secondly, the video models have more parameters than the image model, which may make them difficult to use in clinics and reduces inference speed (see Section VI-B). Therefore, we will also work on model compression or build a new, more efficient model in the future. Thirdly, although we assessed the performance of our proposed model in estimating grade distributions, we have not yet been able to gauge its effectiveness in reducing the workload of experts. Hence, we aim to introduce this system into clinical environments and validate its effectiveness.
Appendix
Video Feature Distributions and Dataset Size Discussion
We examined the distribution of video features in our dataset using the 768-dimensional features extracted by the TimeSformer model trained with EMD loss. Figure 7 shows the 2-dimensional video features compressed from 768 dimensions by PCA. The proportion of variance explained by the first principal component (PC1) was 69%, while that of the second principal component (PC2) was 4.7%. The color of each point in this plot indicates the grade mode score of the corresponding sample. In this plot, the samples become reddish towards the upper right and bluish towards the lower left. This confirms that TimeSformer can extract appropriate features from a video.
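A minimal sketch of this projection, assuming the 768-dimensional embeddings have been extracted beforehand; scikit-learn’s PCA is used here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_features(features: np.ndarray):
    """features: (615, 768) array of per-video TimeSformer embeddings (assumed precomputed).
    Returns 2-D coordinates for plotting and the explained variance ratios."""
    pca = PCA(n_components=2)
    coords = pca.fit_transform(features)        # 2-D coordinates, colored by grade mode score
    explained = pca.explained_variance_ratio_   # e.g. roughly [0.69, 0.047] as reported above
    return coords, explained
```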
Figure 8(a) shows the histogram of PC1 with a bias added to prevent negative values, while Figure 8(b) presents its logarithmically transformed version. The mean and standard deviation of the logarithmically transformed PC1 were computed; the standard deviation was $\sigma = 0.82$, which is used in the estimate below.
PC1 distribution. Left (a): histogram of PC1 with a bias added to prevent negative values. Right (b): histogram of the logarithmically transformed PC1.
The number of samples ($n$) is 615, and the 95% confidence interval range ($\mu_{r}$) is obtained as\begin{equation*} \mu _{r} = 1.96 \times \sqrt {\frac {\sigma ^{2}}{n}} = 1.96 \times \sqrt {\frac {0.82^{2}}{615}} \fallingdotseq 0.065 \tag {8}\end{equation*}
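The following few lines reproduce the arithmetic of Eq. (8):

```python
import math

n, sigma = 615, 0.82
mu_r = 1.96 * math.sqrt(sigma**2 / n)   # 95% confidence interval half-width
print(round(mu_r, 3))                   # -> 0.065, matching Eq. (8)
```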
Model Inference Time
We measured the inference speed; Table 8 shows the results along with the number of parameters. The inference speed was measured with a batch size of 8 on an Intel(R) Core(TM) i9-10940X CPU, assuming use in a medical setting.
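For reference, a simple way to measure CPU inference time per batch is sketched below; this is an illustrative procedure, not necessarily the exact measurement protocol used for Table 8.

```python
import time
import torch

def measure_inference_time(model: torch.nn.Module, batch: torch.Tensor, runs: int = 20) -> float:
    """Average CPU inference time per batch in seconds. `batch` matches the model input,
    e.g. a tensor of shape (8, 3, 16, H, W) for a video backbone with batch size 8 and 16 frames."""
    model.eval()
    with torch.no_grad():
        model(batch)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
        return (time.perf_counter() - start) / runs
```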