Introduction
The tomato is among the most common and familiar vegetables in our daily lives and an essential horticultural crop in all regions of the world. Tomatoes play a significant role in the food industry: according to data from FAOSTAT [1] and TILASTO [2], worldwide tomato production reached 189 million tons in 2021. Sorting and grading these tomatoes is a heavy and tiresome task, and the quality of a fruit depends on several parameters. Sorting and grading is a crucial process that ensures the quality and safety of the products for consumers. Grading involves sorting tomatoes into different quality grades based on external features such as size, shape, color, ripeness, and defects. Traditionally, tomatoes are graded manually by trained experts: each tomato is individually inspected and sorted by a human. Manual grading has several drawbacks, including the time it consumes and the significant amount of labor it requires. In most cases, sorting and grading decisions differ from expert to expert, and performance can be affected by fatigue, subjectivity, or human error. This method is also expensive and inefficient because it cannot handle large volumes of tomatoes [3].
The Organization for Economic Cooperation and Development (OECD) [4] and the United States Department of Agriculture (USDA) [5] have established standards for tomatoes intended for consumption. These standards focus on the quality, classification, and grading of tomatoes for domestic and international markets, and typically cover aspects such as size, color, shape, ripeness, and the defect tolerance allowed in tomatoes. The aim of establishing standard criteria is to facilitate fair and transparent trade practices among member countries. According to these standards, defects affect the fruit's appearance, quality, and marketability. Typical defects include cracks on the tomato's surface, scars caused by wounds or marks, sunburn due to excessive exposure to sunlight, insect damage from pest penetration, discoloration that degrades the tomato's appearance, blemishes or irregular marks, distortions leading to misshapen tomatoes, bruises resulting from impact or pressure, and blossom-end rot, which appears as brown patches at the blossom end of the fruit. These defects pose challenges to growers, distributors, and consumers, and adhering to standardized practices is essential to maintain quality and meet market demands. Figure 1 shows some of the tomato defects.
Tomato defects: (a) defect-free (b) cracks (c) pest (d) skin cracks (e) sunburn (f) end rot.
Machine learning has proven effective in numerous applications, ranging from quality assessment and the detection of diseases in crops to handwritten digit recognition, natural language processing, and the identification of audio and speech patterns [6], [7], [8], [9], [10]. Recently, machine learning has seen significant use in the food industry for sorting and grading vegetables and fruits, tackling the challenges of human error, subjectivity, labor cost, and time consumption while increasing performance [11], [12], [13]. Within tomato quality analysis, many different approaches have been implemented, and many studies have been conducted on developing and deploying machine learning systems for tomato sorting and grading. Various image processing techniques, such as color-based segmentation, morphological operations, and edge detection, have been used for tomato sorting and grading. Moreover, shallow and deep learning algorithms, such as convolutional neural networks (CNN), support vector machines (SVM), k-nearest neighbors (KNN), and random forests (RF), have been employed to classify tomatoes into different quality grades based on color.
Traditional machine learning methods require manual selection and extraction of features, while deep learning methods, such as CNNs, can be hindered by longer training times [14], [15]. The surveyed research demonstrated enhanced performance, but this improvement often comes at the cost of increased system complexity, including longer training periods and larger model memory footprints. Despite the advantages of improved performance, researchers must strike a balance between achieving better results and managing the increased computational demands and resource requirements of the more complex models employed in deep learning approaches.
Although there are many studies on fruit and vegetable quality assessment, the following points spotlight the uniqueness of this study:
This study is uniquely tailored to address the characteristics, attributes, and challenges associated with tomato quality assessment.
This study proposes a model with the capability to effectively operate in real-world scenarios across various backgrounds as the dataset was collected in uncontrolled environmental conditions with varying light sources.
This study introduces a hybrid model combining CNN and traditional machine learning algorithms for tomato image classification. CNNs are efficient at feature extraction, avoiding the manual feature engineering of traditional machine learning, while traditional machine learning classifiers require far less training time than a full CNN. In this regard, a CNN was deployed for tomato image feature extraction, and traditional machine learning algorithms were deployed for classification, which speeds up training significantly. To the best of our knowledge, this is the first model attempting tomato classification using hybrid algorithms (shallow and deep learning).
The rest of this paper is organized as follows: Section II introduces the related work, Section III introduces the framework of the proposed method, and Section IV presents results and discussions. Finally, Section V presents the conclusion and future work.
Related Work
Significant efforts have been directed toward developing automated systems for sorting and grading fruits and vegetables, focusing on evaluating external attributes such as shape, size, and color. An algorithm for identifying defects in orange fruits was presented in [11]. Red, green, and blue (RGB) and near-infrared (NIR) images captured by a dual-sensor camera were used. A thresholding approach followed by a voting technique over color combinations was used for defect detection, and the algorithm attained an overall accuracy of 95%. Another study [16] presented a comparative evaluation of different deep learning algorithms (ResNet, VGGNet, GoogleNet, and AlexNet) for fruit classification. The study also utilized the you only look once (YOLO) algorithm to detect the region of interest and thereby improve fruit grading performance.
The study in [17] proposed a system that employs image processing techniques for vegetable and fruit quality grading and classification. The system classifies produce by extracting external features (color and texture). A comparative study of two machine learning algorithms for vegetable and fruit classification was presented in [12]; the SVM achieved a superior accuracy of 94.3% over the KNN. The study in [18] introduced a hybrid model for weed identification in winter rape fields. The study compared five models, and the hybrid model combining VGGNet with SVM achieved the highest accuracy of 92.1%. The study in [19] performed work similar to [18] by adding a residual filter network for feature enhancement to the hybrid network (CNN-SVM), with the proposed model attaining an overall accuracy of 99%, outperforming the others.
Classification of appearance quality in red grapes using transfer learning with convolutional neural networks was introduced in [20]. The investigation employed the transfer learning technique with four pre-trained networks, namely VGG19, Inceptionv3, GoogleNet, and ResNet50, to achieve its objectives. Notably, ResNet50 demonstrated the highest accuracy, reaching 82.85% in the precise categorization of red grapes into three distinct quality categories. To further enhance performance, the research employed ResNet50 for feature extraction and SVM for classification. The ResNet50-SVM model excelled, achieving the highest accuracy of 95.08% in effectively categorizing red grapes into the aforementioned categories.
In a study conducted by [13] in the domain of fruit quality assessment, a comparative analysis was undertaken involving CNNs and Vision Transformers (ViT). The findings revealed that the CNN model outperformed the ViT model: the CNN achieved a remarkable accuracy of 95% in classifying apple fruits into four distinct classes and banana fruits into two classes, surpassing the 93% attained by the ViT model. The study in [21] performed similar work to [13] for olive disease classification, employing CNN and ViT for feature extraction and SoftMax as a classifier. The study achieved notable results by fusing ViT and VGG-16 features, attaining an average accuracy of 97% for binary and multiclass classification. However, according to [22], ViT requires a large training dataset of more than 14 million images to outperform CNN models such as ResNet.
A system for tomato grading based on machine vision was presented in [23]. The system attained an average accuracy of 95% in detecting the calyx and stalk of tomatoes and an accuracy of 97% in classifying tomatoes into healthy and defective classes, with radial basis function support vector machines (RBF-SVM) outperforming the other classifiers. Notably, in that study the training set was also used for testing. In [24], the study presented an advanced tomato detection algorithm that utilizes an enhanced hue, saturation, and value (HSV) color space and watershed techniques for efficient fruit separation. The algorithm achieved an overall accuracy of 81.6% in red fruit detection.
The study in [15] proposed tomato ripeness detection and classification using VGG-based CNN models. The proposed model applied transfer learning and fine-tuning of VGG-16 to classify tomatoes into ripe and unripe classes, attaining an average accuracy of 96%. The study in [25] introduced an automated grading system for assessing tomato ripeness through deep learning methodologies. The investigation leveraged transfer learning with ResNet18, achieving an impressive average validation accuracy of 93.85% in categorizing tomatoes into ripe, under-ripe, and over-ripe classes.
In [26], the study suggested a system for classifying tomatoes into three categories based on maturity: immature, partially mature, and mature. The system applied deep transfer learning with five pre-trained models; among them, VGG-19 demonstrated the highest accuracy of 97.37%. A tomato classification system based on size was presented in [27], utilizing thresholding, machine learning, and deep learning techniques based on area, perimeter, and enclosed circle radius features. The machine learning techniques showed the best performance, with an average accuracy of 94.5% achieved by the SVM. In [28], the study developed an automatic tomato detection method in which the SVM algorithm was used as a classifier, achieving an average accuracy of 90% and outperforming the others. A computer vision-based grading and sorting system that separates tomatoes into defective and non-defective classes was presented in [29]. The task was performed using a backpropagation neural network, attaining an accuracy of 92%.
Materials and Methods
In this section, we present the proposed framework for tomato classification, as shown in Figure 2. The proposed system includes five major stages: image acquisition, preprocessing, transfer learning, feature extraction, and classification.
A. Image Acquisition
The images were acquired in an uncontrolled lighting environment, as illustrated in Figure 4, with a fixed distance of 20 cm between the table surface and the camera. The acquisition system employed an NVIDIA Jetson TX1 single-board computer [30] with an onboard camera equipped with an RGB sensor capturing images in JPEG format.
B. Image Preprocessing
The captured tomato images have non-uniform illumination due to reflected light, which leads to contrast variation. These contrast variations may impact the performance of the classification model and increase computational complexity. To overcome this problem, we performed segmentation and background cancellation to obtain the region of interest (ROI) before training on our dataset. The red and green channels of the image were extracted, followed by simple thresholding using the Otsu method [33] to obtain two separate binary images (masks).
In the Otsu method, the threshold value for image binarization is determined by maximizing the between-class variance. First, the histogram of the grayscale image or of an individual channel is computed, representing the distribution of pixel intensities. The histogram is then normalized, and cumulative probabilities and means are calculated. The global mean, representing the average intensity of the entire image, is obtained. Next, for each possible threshold value, the between-class variance is computed by evaluating the separation between the class below the threshold and the class above it. The threshold that maximizes this between-class variance is identified as the optimal threshold for image binarization. By maximizing the separation between foreground and background pixel intensities, the Otsu method provides an automated and robust way to determine the threshold value for accurate image segmentation and object extraction. A simplified mathematical representation of Otsu thresholding is shown in equation (1),\begin{align*} B(x,y)= \begin{cases} \displaystyle 1, & p(x,y) > thr\\ \displaystyle 0, & \text{otherwise} \end{cases} \tag{1}\end{align*} where B(x,y) is the binary mask, p(x,y) is the pixel intensity at location (x,y), and thr is the Otsu threshold.
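To make the procedure concrete, the sketch below reconstructs the steps just described (histogram, normalization, cumulative statistics, and between-class variance) in MATLAB, the environment used in our experiments. It is an illustrative reconstruction, not our production code; MATLAB's built-in graythresh implements the same method, and the function name is our own.

```matlab
% Minimal from-scratch sketch of Otsu threshold selection as described above.
function thr = otsuThreshold(channel)
    counts = imhist(channel, 256);        % histogram of pixel intensities
    p = counts / sum(counts);             % normalized histogram (probabilities)
    omega = cumsum(p);                    % cumulative class probability
    mu = cumsum(p .* (0:255)');           % cumulative mean
    muT = mu(end);                        % global mean of the whole image
    % between-class variance for every candidate threshold
    sigmaB2 = (muT * omega - mu).^2 ./ (omega .* (1 - omega));
    [~, idx] = max(sigmaB2);              % threshold maximizing separation
    thr = (idx - 1) / 255;                % normalized to [0,1], like graythresh
end
% Usage: BW = imbinarize(img(:,:,1), otsuThreshold(img(:,:,1)));
```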
The generated masks were fused with the logical "OR" operator, followed by morphological operations to remove gaps and holes and obtain a final mask that includes both color features. In addition, we performed background cancellation by multiplying the original image with its binary mask, which isolates the ROI. The resulting images were then used for training. Figure 5 demonstrates the preprocessing techniques investigated in our work for a tomato image with a plain background, and Figure 6 demonstrates the capability of the preprocessing techniques to segment a tomato image against a complex background of rollers, similar to the rotating mechanical support in an industrial setup.
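The sketch below strings the whole preprocessing pipeline together in MATLAB. The input file name and the structuring-element size are illustrative assumptions, not values reported in this paper.

```matlab
% Illustrative reconstruction of the preprocessing pipeline described above.
img   = imread('tomato.jpg');                    % RGB input image (assumed name)
maskR = imbinarize(img(:,:,1), graythresh(img(:,:,1)));  % Otsu on red channel
maskG = imbinarize(img(:,:,2), graythresh(img(:,:,2)));  % Otsu on green channel
mask  = maskR | maskG;                           % fuse masks with logical OR
mask  = imclose(mask, strel('disk', 7));         % close small gaps (size assumed)
mask  = imfill(mask, 'holes');                   % fill interior holes
roi   = img .* uint8(mask);                      % background cancellation (ROI)
```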
Image preprocessing: (a) channels extraction (b) thresholding and binary logical combination (c) background cancellation.
C. Data Augmentation
Data augmentation is a machine learning and computer vision technique used to increase the amount of training data available for a model to learn from [34]. It involves applying various transformations to the existing data to create new, slightly altered versions of the original samples. Data augmentation aims to improve the generalization and robustness of machine learning models by exposing them to a larger and more diverse set of examples during training. By introducing variations in the training data, the model learns to be more flexible and adaptable and to handle new, unseen inputs during inference. We augmented our training set by applying rotation, reflection, translation, and scaling.
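A minimal MATLAB sketch of this augmentation step is shown below. The transformation ranges, folder path, and output size (299x299, matching Inceptionv3's input) are assumptions for illustration, not our reported settings.

```matlab
% Rotation, reflection, translation, and scaling applied on the fly.
augmenter = imageDataAugmenter( ...
    'RandRotation',     [-30 30], ...    % degrees (assumed range)
    'RandXReflection',  true, ...
    'RandXTranslation', [-10 10], ...    % pixels (assumed range)
    'RandYTranslation', [-10 10], ...
    'RandScale',        [0.9 1.1]);      % assumed range
imdsTrain = imageDatastore('dataset/train', 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');       % labels from folder names
augTrain = augmentedImageDatastore([299 299], imdsTrain, ...
    'DataAugmentation', augmenter);      % 299x299 matches Inceptionv3 input
```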
D. Transfer Learning
Transfer learning [35] addresses the challenges of training deep learning networks when the amount of available data is limited. Instead of starting from scratch, transfer learning adapts pre-trained deep learning networks to the specific task. A pre-trained network can be utilized as a feature extractor or for end-to-end classification by carefully adjusting some of its parameters. This study used pre-trained networks as feature extractors and classified the extracted features using traditional machine learning classifiers. We evaluated four pre-trained networks, namely MobileNetv2 [36], Inceptionv3 [37], ResNet50 [38], and AlexNet [39], for transfer learning in our proposed approach. We analyzed the performance of these networks in terms of accuracy before deploying them for feature extraction and then fine-tuned the selected pre-trained networks. Fine-tuning teaches the network to recognize classes it was not trained on before. It involves replacing the final learnable layer and the classification layer of the adopted pre-trained network to match the number of classes in our tomato classification task. To prevent overfitting, we froze some network layers and applied batch normalization and dropout during training.
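A hedged sketch of this fine-tuning step on Inceptionv3 follows. The layer names ('predictions', 'ClassificationLayer_predictions') follow MATLAB's pretrained model and should be verified with analyzeNetwork; the new layer names and learn-rate factors are our assumptions.

```matlab
% Replace the final learnable and classification layers for our classes.
net = inceptionv3;                               % ImageNet-pretrained network
lgraph = layerGraph(net);
numClasses = 2;                                  % e.g., healthy vs. reject
lgraph = replaceLayer(lgraph, 'predictions', ...
    fullyConnectedLayer(numClasses, 'Name', 'fc_tomato', ...
        'WeightLearnRateFactor', 10, 'BiasLearnRateFactor', 10));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_predictions', ...
    classificationLayer('Name', 'cls_tomato'));
% Early layers can be frozen by setting their learn-rate factors to zero
% (see the freezeWeights helper in MATLAB's transfer-learning examples).
```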
E. Feature Extraction
In our proposed system, the selected pre-trained networks were used to extract deep features from the last convolutional or dense layers of the network [40]. These networks are typically designed with many convolutional layers to increase performance: sets of weight layers are cascaded, separated by activation layers such as the rectified linear unit (ReLU). Features were therefore extracted from the deepest layers of the network and used to train traditional machine learning classifiers.
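As a sketch, deep features can be pulled from the final pooling layer as shown below. The layer name 'avg_pool' is Inceptionv3's global average pooling layer in MATLAB (verify with analyzeNetwork); augTrain/augTest and the net variable continue the earlier sketches.

```matlab
% Extract deep features as row vectors, one row per image.
featTrain = activations(net, augTrain, 'avg_pool', 'OutputAs', 'rows');
featTest  = activations(net, augTest,  'avg_pool', 'OutputAs', 'rows');
% For Inceptionv3, each row is a 2048-dimensional feature vector,
% ready to train a traditional machine learning classifier.
```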
F. Classifiers
Machine learning classifiers [41] are algorithms trained on labeled datasets to learn patterns and relationships between input features and the corresponding output labels. The goal of a machine learning classifier is to predict the correct label for a new input based on the relationships learned from the training dataset. Machine learning classifiers include decision trees, random forests, logistic regression, support vector machines, k-nearest neighbors, and neural networks. The choice of classifier depends on the specific characteristics of the dataset and the desired performance metrics. A labeled dataset is typically divided into training and validation sets: the training set is used to train the classifier, and the validation set is used to evaluate its performance on new, unseen data. Performance is measured using accuracy, precision, recall, specificity, and F1-score metrics. This study uses support vector machine, random forest, and k-nearest neighbors classifiers, with principal component analysis for feature dimensionality reduction.
1) Support Vector Machine – SVM
The support vector machine [28] is highly effective in image classification and regression tasks because it handles high-dimensional data well. The algorithm identifies a subset of training observations, the support vectors, that define a decision boundary with the maximum margin between classes. It can produce accurate outcomes even in high-dimensional spaces, such as when the number of dimensions exceeds the number of samples.
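A minimal sketch of training a multiclass SVM on the extracted deep features is shown below. fitcecoc wraps binary SVM learners in an error-correcting output code scheme; the label variable names are assumptions continuing the earlier sketches.

```matlab
% Multiclass SVM on deep features (linear SVM learners by default).
svmMdl  = fitcecoc(featTrain, trainLabels);
predSVM = predict(svmMdl, featTest);
accSVM  = mean(predSVM == testLabels);    % test accuracy
```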
2) Random Forest - RF
The random forest classifier is a popular machine learning algorithm that belongs to the ensemble learning family of methods [41]. It is widely used for classification and regression tasks where the goal is to make accurate predictions. The random forest algorithm builds an ensemble of decision trees by randomly selecting a subset of features and data samples to train each tree. During training, the trees learn to classify the input based on splitting rules. When making a prediction, the input is passed through each tree, and the final prediction is determined by aggregating the individual tree predictions, often using majority voting. Random forest is known for its ability to handle high-dimensional data, provide accurate predictions, and identify the features that are important for the task at hand.
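A minimal random forest sketch using MATLAB's TreeBagger follows; the tree count (100) is an illustrative assumption, not a reported hyperparameter.

```matlab
% Bagged decision trees (random forest) on the deep features.
rfMdl  = TreeBagger(100, featTrain, trainLabels, 'Method', 'classification');
predRF = categorical(predict(rfMdl, featTest));   % predict returns a cellstr
accRF  = mean(predRF == testLabels);
```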
3) K-Nearest Neighbor -KNN
The k-nearest neighbors (KNN) algorithm is a machine learning method for classification and regression tasks. It is a simple but effective algorithm that finds the k data points in the training set nearest to a new data point and assigns the majority class label among those neighbors as the predicted label [42]. KNN works well with linear and non-linear decision boundaries and can be applied to various problem domains such as image recognition, text classification, and recommendation systems. However, KNN can be sensitive to the choice of k and of the distance metric used to measure similarity between data points, which varies with the problem domain, and it may be computationally expensive for large datasets. Because KNN does not scale well to large, high-dimensional data such as image features, it is recommended to apply principal component analysis [43] for feature dimensionality reduction before utilizing the KNN classifier.
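The sketch below applies PCA before KNN, as recommended above. The number of retained components (100) and k = 5 are illustrative assumptions.

```matlab
% Reduce feature dimensionality with PCA, then classify with KNN.
[coeff, scoreTrain] = pca(featTrain, 'NumComponents', 100);
scoreTest = (featTest - mean(featTrain)) * coeff;   % project with training mean
knnMdl    = fitcknn(scoreTrain, trainLabels, 'NumNeighbors', 5);
accKNN    = mean(predict(knnMdl, scoreTest) == testLabels);
```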
Results and Discussion
A. Experimental Setup
The training and testing of the proposed model were performed on a PC with 128 GB of RAM, an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory, and an Intel(R) Xeon(R) Silver 4114 CPU at 2.20 GHz, using the MATLAB R2022b release. Our experiments deployed four pre-trained networks, namely MobileNetv2 [36], Inceptionv3 [37], ResNet50 [38], and AlexNet [39], for feature extraction. These features were then classified using three traditional machine learning classifiers, namely SVM, RF, and KNN. A dataset of 2400 tomato images was used in our experiments, with the first batch containing healthy and reject classes and the second batch containing ripe, unripe, and reject classes, as described in Table 1. The dataset was randomly divided into 70% (1680 images) for training, 10% (240 images) for validation, and 20% (480 images) for testing. The standard hyperparameters used in this experiment were a learning rate of 0.005, a minibatch size of 32, and 60 epochs.
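The reported hyperparameters, expressed as MATLAB trainingOptions, would look like the sketch below. The solver choice ('sgdm') and the remaining settings are our assumptions; only the learning rate, minibatch size, and epoch count come from the text above.

```matlab
% Fine-tuning configuration (augVal is an assumed validation datastore name).
opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 0.005, ...
    'MiniBatchSize',    32, ...
    'MaxEpochs',        60, ...
    'ValidationData',   augVal, ...
    'Shuffle',          'every-epoch', ...
    'Verbose',          false);
trainedNet = trainNetwork(augTrain, lgraph, opts);   % fine-tuned network
```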
B. Discussion
To evaluate the performance of our proposed model, we performed four different experiments using our dataset and one experiment using a public dataset available at [44], as described in experiments 1 to 5. The performance was evaluated based on training time, testing time, accuracy, recall, precision, specificity, and F1-score, calculated according to equations (2) to (6), where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.\begin{align*} \textit {Accuracy} &= \frac {\textit {TP} + \textit {TN}}{\textit {TP} + \textit {TN} + \textit {FP} + \textit {FN}} \tag{2}\\ \textit {Recall}~(R) &=\frac {\textit {TP}}{\textit {TP} + \textit {FN}} \tag{3}\\ \textit {Precision}~(P) &=\frac {\textit {TP}}{\textit {TP} + \textit {FP}} \tag{4}\\ \textit {Specificity} &=\frac {\textit {TN}}{\textit {TN} + \textit {FP}} \tag{5}\\ \textit {F1-score} &= 2 \left ({\frac {R \cdot P}{R + P}}\right) \tag{6}\end{align*}
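For reproducibility, the sketch below computes equations (2) to (6) one-vs-rest from a confusion matrix; it is written for illustration, with predSVM and testLabels following the earlier sketches.

```matlab
% Per-class metrics from a confusion matrix, following equations (2)-(6).
C = confusionmat(testLabels, predSVM);    % rows: true class, cols: predicted
N = sum(C(:));
accuracy = sum(diag(C)) / N;              % equation (2)
for k = 1:size(C, 1)
    TP = C(k, k);
    FN = sum(C(k, :)) - TP;
    FP = sum(C(:, k)) - TP;
    TN = N - TP - FN - FP;
    R  = TP / (TP + FN);                  % recall, equation (3)
    P  = TP / (TP + FP);                  % precision, equation (4)
    S  = TN / (TN + FP);                  % specificity, equation (5)
    F1 = 2 * (R * P) / (R + P);           % F1-score, equation (6)
    fprintf('class %d: R=%.3f P=%.3f S=%.3f F1=%.3f\n', k, R, P, S, F1);
end
```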
1) Experiment 1: Performance of the Proposed Model Using the Binary Dataset
Table 2 summarizes the performance of the transfer learning, CNN-SVM, CNN-RF, and CNN-KNN models on binary classification (healthy and reject) in terms of accuracy, with corresponding training and testing times. We first performed transfer learning to evaluate the performance of our selected pre-trained networks, attaining an average accuracy of 95.52% for classifying tomatoes into the binary classes. The performance of transfer learning provides a starting point for determining whether a network can be deployed for feature extraction. Among the proposed hybrid models, CNN-SVM attained the highest accuracy of 97.50% when Inceptionv3 was used as a feature extractor. Furthermore, the proposed hybrid CNN-SVM model averaged a training time of 4.45 seconds for 1680 images and a testing time of 2.43 milliseconds per image, faster than the other models.
2) Experiment 2: Performance of the Proposed Model Using the Multiclass Dataset
We evaluated our model using the batch-two (multiclass) dataset shown in Table 1. The model's performance is summarized in Table 3. Transfer learning was first performed to evaluate the selected pre-trained networks for feature extraction, obtaining an average accuracy of 94.95% for multiclass classification. Among the proposed hybrid models, the highest accuracy of 96.67% was attained by CNN-SVM when Inceptionv3 was used as a feature extractor. On average, a training time of 4.54 seconds for 1680 images and a testing time of 2.44 milliseconds per image were recorded for the proposed CNN-SVM model, faster than the other models.
3) Experiment 3: Performance of the Proposed Model in Lightweight CNN on the Binary Dataset
We evaluated our proposed model with lightweight pre-trained networks, i.e., ShuffleNet (5.4 MB) [45], MobileNetv2 (20 MB) [36], and EfficientNetB0 (13 MB) [46], using the binary dataset (healthy and reject). The model's performance is summarized in Table 4, where an average accuracy of 92.12% was obtained for transfer learning when evaluating the feature extractors. The highest accuracy of 95.21% was achieved by CNN-SVM when MobileNetv2 was used as a feature extractor. On average, a training time of 5.68 seconds for 1680 images and a testing time of 3.28 milliseconds per image were recorded for the proposed hybrid CNN-SVM model, faster than the others.
4) Experiment 4: Performance of the Proposed Model in Lightweight CNN Using The Multiclass Dataset
We also evaluated our proposed model with the lightweight pre-trained models on the multiclass dataset (ripe, unripe, and reject). The model's performance is summarized in Table 5, where an average accuracy of 93.13% was obtained for transfer learning during feature extractor evaluation. Among the proposed hybrid models, CNN-SVM attained the highest accuracy of 94.37% when MobileNetv2 was used as a feature extractor. On average, a training time of 5.72 seconds for 1680 images and a testing time of 3.30 milliseconds per image were recorded for the proposed hybrid CNN-SVM model, faster than the others.
5) Experiment 5: Performance of the Proposed Model Using the Public Dataset
We evaluated our model using the public dataset available at [44], which has four classes, i.e., ripe, unripe, old, and damaged. The model's performance is summarized in Table 6, where an average accuracy of 95.38% was obtained for transfer learning during feature extraction evaluation. Among the proposed hybrid models, CNN-SVM attained the highest accuracy of 97.54% when Inceptionv3 was used as a feature extractor. Furthermore, on average, a training time of 4.20 seconds for 1423 images and a testing time of 2.68 milliseconds per image were recorded for the proposed hybrid CNN-SVM model, faster than the others. The performance of our proposed model on the public dataset is slightly higher than on our dataset because the inter-class color differences in the public dataset are larger than in ours, as shown in Figures 7 to 12.
C. Performance Metrics of the Proposed Model
Figures 7 to 11 show the confusion matrices of the best-performing proposed model in each experiment. From each confusion matrix, we computed the standard performance metrics, i.e., recall, precision, specificity, and F1-score, calculated according to equations (3) to (6) and summarized in Table 7.
D. Image Analysis and Design Optimization
Within the scope of our investigation, the examination of image features involved aspects related to both appearance and resolution. Our analysis revealed that Inceptionv3 outperforms ResNet50 when handling images with closely aligned appearance characteristics, such as those in our dataset, whose inter-class color features lie close together. Conversely, ResNet50 excels with images characterized by slightly broader inter-class color features, as found in the public dataset. Figure 12 provides visual representations of inter-class color features from both our dataset and the public dataset.
The resolution of our dataset is fixed by the acquisition setup described in Section III-A; images were resized to each network's input size before training and feature extraction.
In the process of model optimization, we conducted feature extraction from various layers, observing a consistent improvement in classification accuracy as progressively deeper layers were used for feature extraction [49]. Additionally, significant improvements in classification accuracy were achieved by employing classical machine learning classifiers in the hybrid model, as opposed to the softmax classifier used during transfer learning [50].
E. Limitations of the Study
While our model demonstrates superior performance, it is important to acknowledge the following limitations.
The performance of our model degrades when inter-class color feature differences are minimal.
The performance of our model is poor when low-intensity images are used.
F. Performance Comparison
We investigated the state-of-the-art (SOTA) works [20], [25], [26] in the field of fruit and vegetable quality assessment and deployed their proposed models on our dataset [32] for binary and multiclass classification. Table 8 summarizes the performance comparison of our proposed model with the SOTA in terms of accuracy; the proposed model performs better than the SOTA.
Conclusion
We presented a new approach for tomato quality grading based on external image features. The proposed approach utilizes pre-trained networks for feature extraction and traditional machine learning algorithms as classifiers (a hybrid model). We took advantage of fine-tuning so that the deep layers of the pre-trained networks learn and concentrate on the complex and significant features of tomato images. We also analyzed and applied various image preprocessing techniques for feature enhancement and performance improvement. Features obtained from the networks were then classified by SVM, RF, and KNN classifiers. We demonstrated the performance of the various fine-tuned networks, with the highest accuracy attained by Inceptionv3; Inceptionv3 was therefore selected for feature extraction and used with our classifiers.
Among the proposed hybrid models, the CNN-SVM method outperformed the others, with an accuracy of 97.50% in the binary classification of tomatoes into healthy and reject and an accuracy of 96.67% in the multiclass classification of tomatoes into ripe, unripe, and reject when Inceptionv3 was used as a feature extractor. The hybrid CNN-SVM method also outperformed the other models when deployed on a public dataset, achieving an accuracy of 97.54% with Inceptionv3 as a feature extractor in categorizing tomatoes into ripe, unripe, old, and damaged. The classification accuracy of the proposed model on our dataset could potentially approach that on the public dataset if the color differences between classes in our dataset were slightly larger. The investigation revealed that the proposed model outperforms the SOTA. Furthermore, the proposed model can operate against different backgrounds with varying light sources.
This study is subject to certain limitations, notably the potential for overfitting when training the model on images with low intensity and minimal inter-class color differences. In future work, we plan to extend the application of our model to various crops and items, while employing an in-depth analysis of image feature characteristics as a complementary approach, in order to derive sound and comparable conclusions and enhance the robustness of our findings beyond post-training assessment.
Furthermore, we will investigate optimization techniques that reduce model size without hurting accuracy, making the model suitable for hardware implementation and real-time inference. For deployment, the system may require multiple cameras or rotating mechanical support for the tomatoes to facilitate capturing images from different angles.
Conflict of Interest
The authors have no conflict of interest.
ACKNOWLEDGMENT
The authors would like to thank the Egypt-Japan University of Science and Technology (E-JUST) and the TICAD7 scholarship program for graciously providing them with the essential research facilities that were instrumental in the successful execution of their study.