Tomatoes and Chilies Type Classifications by using Machine Learning Methods

Vegetables such as tomatoes and chilies are used as flavoring ingredients. Both are processed into sauces and seasonings that accompany people's staple food. These vegetables are easy to find in supermarkets, but many people do not know how to choose chilies and tomatoes by type and quality. This study discusses the classification of cayenne, curly, green, and red chilies and of tomatoes in good and bad condition using machine learning and contrast enhancement techniques. The machine learning methods used are Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Linear Discriminant Analysis (LDA), and Random Forest (RF). The best method is determined from the accuracy obtained in testing. In addition to accuracy, this study measures computation time to assess the efficiency of each method. In the experiments, Random Forest achieved the highest accuracy of all methods, 85.21%. However, Random Forest also had the longest computation time, 1092.93 s. The shortest computation time was obtained by K-NN, 17.7488 s.


Introduction
Fruit is a complementary food that is consumed daily. It benefits the body because it contains vitamins and fiber. Several types of fruit are easy to recognize because of prominent physical differences, such as size and color.
Even so, some people still have difficulty determining the maturity and freshness of fruit. In addition, sellers sometimes defraud buyers by exchanging ripe fruit for fruit of lower quality. A technique that classifies the maturity and freshness of fruit automatically is therefore needed to prevent such fraud.
Fruit classification is a technique for differentiating fruit types based on physical characteristics, from the skin to the shape of the fruit. Previous research (Hossain, 2019) built a fruit harvesting robot to make it easier for farmers to harvest tomatoes. That study aimed to ensure that the harvested tomatoes have a uniform quality for sauce processing. However, the accuracy obtained is quite low, and the method can only distinguish tomatoes.
Other previous research (Pavithra, 2015) focused on differentiating tomato species by their level of freshness. The methods proposed in that study are the K-Nearest Neighbor (K-NN) algorithm and Support Vector Machine (SVM). However, its accuracy does not reach 90%, and the dataset only distinguishes rotten from non-rotten tomatoes.
Machine learning is not only used to classify tomatoes (Sabilla, 2019). That study compared several machine learning algorithms on a dataset of seven types of bananas and three categorical levels of maturity. The average accuracy obtained was 96%. Preprocessing techniques were also used to improve time efficiency and accuracy. However, the dataset does not represent bananas as a whole, because a single dataset was used.
This paper focuses on classifying tomatoes and chilies by level of maturity and species using machine learning. The machine learning algorithms used are K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), and Decision Tree (DT). This research also proposes reducing the size of the images so that computation runs faster.

Materials and Methods
Machine learning is a branch of artificial intelligence that learns the characteristics of data (Sunaryono, 2019). In this study, machine learning is applied to image data. Each image is converted into pixels, producing a color intensity value for each pixel. These color values are fed to the machine learning algorithm together with their labels, so that the intensity characteristics of each class can be learned. This paper uses the K-NN, SVM, LDA, RF, and DT algorithms. Each algorithm separates the data in a different way, but all share the goal of comparing values in order to group them into classes.
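The conversion from an image to per-pixel intensity values can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name image_to_features, the 4 x 4 random stand-in image, and the scaling to the range [0, 1] are all assumptions made for the example.

```python
import numpy as np

def image_to_features(image):
    """Flatten an H x W x C image into a 1-D vector of pixel intensities."""
    pixels = np.asarray(image, dtype=np.float32)
    # Scaling 8-bit intensities to [0, 1] is an assumption, not stated in the paper.
    return pixels.reshape(-1) / 255.0

# Hypothetical 4 x 4 RGB image standing in for a photograph.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
features = image_to_features(image)
print(features.shape)  # (48,)
```

Each image then becomes one row of the feature matrix, and its class label becomes the matching entry of the label vector.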

K-Nearest Neighbour (K-NN)
K-Nearest Neighbour (K-NN) (Connell, 1996) is an algorithm that compares an unknown sample with its neighbors and assigns it to the class of the closest ones. It is a lazy algorithm: because every prediction requires a comparison with all available data, computation can take a long time. The advantage of K-NN is its precision, since all data contribute to the decision. Figure 1 shows the K-NN algorithm, which takes the class labels, the training data, and the unknown samples as inputs.

Support Vector Machine (SVM)
Support Vector Machine (SVM) is an algorithm that determines classes using straight or curved separating lines. For image classification, however, linear lines are considered the better choice. SVM starts from a random line and calculates its distance to the two different classes. If the line is still far away, it is shifted. The size of the shift and the maximum distance of each class are each set by a parameter. This study tries the values 0.0001, 0.001, and 0.01 and selects the best setting using a parameter grid (Sabilla, 2019). The computation time of SVM depends on the initial value of the random line. If its distance to the classes is already right, SVM does not need to search for the line position again. If the initial line is far from the class values, however, the search can take much longer. Figure 2 shows the SVM algorithm.
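A parameter grid over the three values mentioned above could look like the following sketch. It assumes Scikit-learn's GridSearchCV and SVC, assumes that the two tuned parameters correspond to SVC's C and gamma, and uses synthetic data in place of the image features. The RBF kernel is chosen here only because gamma has no effect on a linear kernel; the kernel used in the study is not stated.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for flattened image features: two classes, 40 samples each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 10)), rng.normal(3.0, 1.0, (40, 10))])
y = np.array([0] * 40 + [1] * 40)

# Candidate values come from the text; applying them to both C and gamma
# is an assumption made for this example.
param_grid = {"C": [0.0001, 0.001, 0.01], "gamma": [0.0001, 0.001, 0.01]}

# Five-fold cross-validation is also an assumption.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid search fits one model per parameter combination and fold, so its cost grows with the size of the grid.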

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a machine learning technique that separates data classes by analyzing the characteristics of the data (Sunaryono, 2019). When the data is labeled, LDA forms a linear boundary that separates the classes. This boundary is derived from the preceding analysis of the data. LDA is closely related to Principal Component Analysis (PCA), since both derive a linear projection from an analysis of the data; unlike PCA, LDA uses the class labels. Equation 1 gives the LDA formula in terms of the input feature, the class, and the mean.
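For intuition, a two-class linear discriminant can be sketched with NumPy as below. This is the textbook Fisher discriminant, written under the assumption of two classes with equal priors; it is not necessarily the form of Equation 1, and the function names and synthetic data are made up for the example.

```python
import numpy as np

def fit_lda_two_class(X, y):
    """Return the projection vector w and a threshold for a two-class Fisher discriminant."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: scatter of each class around its own mean, summed.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    # Midpoint of the projected class means (assumes equal class priors).
    threshold = w @ (mu0 + mu1) / 2.0
    return w, threshold

def predict_lda(X, w, threshold):
    """Assign class 1 to samples whose projection exceeds the threshold."""
    return (X @ w > threshold).astype(int)

# Synthetic, well-separated two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, threshold = fit_lda_two_class(X, y)
accuracy = (predict_lda(X, w, threshold) == y).mean()
print(accuracy)
```

The projection vector depends only on the class means and the within-class scatter, which is why the resulting decision boundary is linear.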

Random Forest (RF)
Random Forest (RF) is a machine learning technique that distributes data across a set of trees (Chen, 2020). The data is branched until it is used up, and the class is then decided by a vote among the decision trees. Random Forest has the advantage of handling incomplete data because of its feature extraction mechanism. Its drawback is that it has many computational parameters and uses all data as parameters, even data that may be noise, so it takes a long time to compute.
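The voting mechanism can be illustrated with a hand-rolled forest built from Scikit-learn decision trees. This is a simplified sketch: the helper names fit_forest and predict_forest, the number of trees, and the synthetic data are assumptions. Scikit-learn's own RandomForestClassifier averages class probabilities rather than counting hard votes, so this is an illustration of the idea rather than of that implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on a bootstrap sample of the data."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        # Each split considers a random subset of features (sqrt of the total).
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the class predicted by each tree."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    return np.array([np.bincount(column).argmax() for column in votes.T])

# Synthetic data: three well-separated classes with 40 samples each.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 40)
X = rng.normal(loc=4.0 * y[:, None], scale=1.0, size=(120, 4))
trees = fit_forest(X, y)
accuracy = (predict_forest(trees, X) == y).mean()
print(accuracy)
```

Training cost grows with the number of trees, which is consistent with the long computation time reported for Random Forest.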

Decision Tree (DT)
Decision Tree (DT) is a machine learning method that uses a chained tree mechanism (Grana, 2012). The algorithm is easy to understand. DT takes a middle or random value as the parent at the head of the tree. In this study, the tree branches repeatedly, dividing the data into two sides at each node. Once the class data reaches a final position, the results are combined. Each data item is compared along the chain above it, back to the parent. Like K-NN, DT has the advantage of comparing each data item in detail. DT computes faster than K-NN because its computation is split across branches and depends on the number of branches. However, DT has the disadvantage that, when the initial value of the tree is far from the intended class, accuracy becomes unstable. Figure 3 shows the DT algorithm.
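How a single node divides the data into two sides can be sketched as follows. The sketch uses the Gini impurity criterion common in CART-style trees, which is an assumption: the text describes choosing a middle or random value and does not name a split criterion. The function names and the toy data are also made up for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array (0 means the node is pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(values, labels):
    """Return the threshold on one feature that minimises weighted Gini impurity."""
    best_threshold, best_score = None, np.inf
    candidates = np.unique(values)
    for low, high in zip(candidates[:-1], candidates[1:]):
        threshold = (low + high) / 2.0  # midpoint between neighbouring values
        left, right = labels[values <= threshold], labels[values > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy feature with two cleanly separated classes.
values = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 1, 1, 1])
threshold, score = best_split(values, labels)
print(threshold, score)  # 5.0 0.0
```

A full tree applies the same search recursively to each side until the nodes are pure or a stopping rule is met.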

Results and Discussions
Software Scenario
Image processing is done in Python 3.7 using the Spyder development environment. This research uses two open source libraries: OpenCV for image processing and Scikit-learn for building the machine learning models (Varoquaux, 2015).

Data Collection
Data is collected using a 5 MP camera held at an average height of 30 cm. The data comprises 16 classes. The chili classes consist of red chili, cayenne pepper, and curly chili; the tomato classes consist of fruit tomatoes, cherry tomatoes, and green tomatoes. Each class has two conditions, mature and bad. Example data can be seen in Figure 4. Each class has 17 photographs with different variations, so the study uses 272 images in total.

Testing Scenario
The test scenario consists of two stages: preprocessing and classification. Preprocessing aims to help the classifiers reach high accuracy and to shorten computation time. Figure 5 shows the research workflow.

Preprocessing
Before classification, each image is preprocessed by increasing contrast and equalizing colors using Contrast Limited Adaptive Histogram Equalization (CLAHE). CLAHE aims to give the images consistent color and pattern values in order to improve classification. The results of the CLAHE experiment can be seen in Figure 6. It is evident that preprocessing with CLAHE improves image quality in terms of color spread. After CLAHE, the image is resized to a smaller size to shorten computation time. After resizing, the computation time is 4.34 seconds, whereas the computation time before resizing

Classification
Classification is the process of determining the type and class of each image. The data is split into 70% for training and 30% for testing. The split data is placed in a matrix and flattened. Flattening speeds up computation and gives every image the same representation. The classifiers used include K-NN with K = 3, SVM with C = 0.0001 and γ = 0.0001, Random Forest, Multi-Layer Perceptron, and LDA. The results of the classification experiment can be seen in Table 1.
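A comparison of accuracy and computation time along these lines could be sketched with Scikit-learn as follows. Synthetic features stand in for the 272 flattened images, the stratified split and the random seeds are assumptions, and the second SVM parameter is assumed to be Scikit-learn's gamma. The Multi-Layer Perceptron is left out to keep the example short, and all settings not stated in the text are library defaults.

```python
import time
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 16 classes x 17 images, each flattened to 48 values.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(16), 17)
X = rng.normal(loc=y[:, None], scale=1.0, size=(272, 48))

# 70% training, 30% testing; stratifying by class is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "KNN (K=3)": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(C=0.0001, gamma=0.0001),
    "Random Forest": RandomForestClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
    elapsed = time.perf_counter() - start
    results[name] = (accuracy, elapsed)
    print(f"{name}: accuracy={accuracy:.4f}, time={elapsed:.4f} s")
```

The timings cover both training and testing. Numbers obtained on synthetic data say nothing about the results reported in Table 1.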

Conclusion
It can be concluded that Random Forest achieves the highest accuracy, 85.21%. However, Random Forest also has the longest computation time of all the classifiers. This is because the Random Forest mechanism branches from the start value to the end value, including noisy data that does not contribute to the parameters. Random Forest also compares every branch, which adds to the computation time. The shortest computation time is obtained by K-NN, 17.7488 s.
For further research, more data classes and preprocessing stages will be added. Random Forest can be an effective algorithm if the data to be classified is free from noise. In general, Principal Component Analysis (PCA) is well suited to selecting data, so that the random forest has fewer branches, which reduces computation time and increases accuracy.