Semantic Search for Scientific Articles by Language Using Cosine Similarity Algorithm and Weighted Tree Similarity

Writing scientific articles is a routine activity for academics at universities, but authors often struggle to find ideas, literature, and reference sources. Search engines frequently fail to return the right document because the keywords being searched often appear not in the title but in another part of the article's structure. Since most search engines match only titles, the other parts are usually excluded from matching, so the search results do not always match what the user wants. In addition, a single scientific article often uses more than one language across its structure, for example in the abstract. In this work, the weighted tree similarity algorithm is used to detect similarity across the structure of scientific articles, the N-gram algorithm is used to detect language, and the cosine similarity algorithm is used to measure how similar the keyword text is to the text of each scientific article.


Introduction
Writing scientific articles is an activity often carried out by academics at universities, but authors frequently have difficulty finding ideas or reference sources. Current technology makes it easier to find reference sources, but the search results are often not what is expected. The keywords being searched may appear in the body of a scientific article, yet some search engines read only the title, so the results do not match or even exclude relevant articles because the keywords are absent from the title.
In addition to the title, scientific articles have other general structures, such as the abstract, keywords, authors, and year, that can be matched against the search keywords. This requires retrieving data from the internet, a process called web scraping. Each structure in the scientific article is then given a weight based on how important that structure is to the search results.
Most search engines rely on information retrieval, which involves several processes, one of which is stemming. Stemming reduces words to their base form by removing affixes. Many scientific articles are written in Indonesian but have abstracts in a different language, namely English. Language detection is therefore needed to choose the stemming algorithm, because the stemming algorithms for Indonesian and English differ.
To address the problem of finding reference sources when writing a scientific article, a similarity detection system based on the structure of the scientific article is needed. One algorithm used to represent the structure of an article is weighted tree similarity. In a semantic search that uses the weighted tree similarity algorithm, metadata is arranged in a tree with labeled nodes and labeled, weighted branches (Sarno & Rahutomo, 2015).
Research on applying the weighted tree similarity algorithm to semantic search reached several important conclusions, including that its search accuracy was higher than that of full-text search and ordinary metadata search (Sarno & Rahutomo, 2015). That study also combined weighted tree similarity with cosine similarity to measure the level of similarity, and the results were considered effective.
Wahyuni et al. (2017) obtained an accuracy of 98% using cosine similarity with TF-IDF weighting. Sugiyamto et al. (2014) compared Jaccard and cosine similarity in a document similarity test; cosine similarity had a slightly higher accuracy, 0.949808 compared with 0.949077 for Jaccard. For language detection, Zaman et al. (2015) found that an N-gram language detection system performed well, with an average F-measure of 0.93.
Based on this literature review, the algorithms used in this work are N-gram, cosine similarity, and weighted tree similarity. N-gram is used to identify the language of a sentence before stemming is carried out, cosine similarity is used to check the similarity of the text, and weighted tree similarity is used to represent the structure of the article and to detect similarity based on structure. Figure 1 shows the steps taken in this research to build the scientific article search system.

Document Scraping
Web scraping, or document scraping, is the process of fetching an HTML document from the internet and extracting specific data from the page for other purposes.
In this study, the scientific articles were taken from the website neliti.com, which provides a wide range of scientific article documents. Figure 2 shows the flow of the document scraping process: a keyword is entered, the system retrieves the data for the desired structure, namely title, abstract, keywords, authors, and year, and the data is then saved to the database.
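The extraction step can be sketched as follows. This is a minimal illustration and not the authors' scraper: it assumes, hypothetically, that an article page exposes its metadata in `<meta name="citation_...">` tags, and it parses an invented sample page instead of fetching anything from the network. The tag names, the `FIELD_BY_META_NAME` mapping, and the sample content are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical mapping from <meta name="..."> values to the five fields.
# The real page markup of the source website may differ.
FIELD_BY_META_NAME = {
    "citation_title": "title",
    "citation_abstract": "abstract",
    "citation_keywords": "keywords",
    "citation_author": "authors",
    "citation_publication_date": "year",
}


class ArticleMetaParser(HTMLParser):
    """Collects the article fields from the <meta> tags of one HTML page."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "abstract": "", "keywords": "",
                       "authors": [], "year": ""}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        field = FIELD_BY_META_NAME.get(attributes.get("name", ""))
        content = (attributes.get("content") or "").strip()
        if field is None or not content:
            return
        if field == "authors":
            self.record["authors"].append(content)  # one tag per author
        elif field == "year":
            self.record["year"] = content[:4]  # keep only the year part
        else:
            self.record[field] = content


def extract_article(html):
    """Return a dict with title, abstract, keywords, authors, and year."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return parser.record


# Invented sample page, used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<meta name="citation_title" content="Sistem Informasi Akademik">
<meta name="citation_author" content="A. Penulis">
<meta name="citation_author" content="B. Penulis">
<meta name="citation_publication_date" content="2019/05/01">
<meta name="citation_keywords" content="sistem; informasi">
<meta name="citation_abstract" content="This paper describes an academic information system.">
</head><body></body></html>
"""

record = extract_article(SAMPLE_PAGE)
print(record["title"], record["authors"], record["year"])
```

In a full pipeline, the returned record would be written to the database as one row per article, so that each of the five fields can later be matched separately.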

Language Detection
In the language detection process, the data is read from the database and its language is detected using the N-gram algorithm. Figure 3 shows the language detection process, which begins by entering a text or sentence; the sentence is then cut into pieces using the N-gram concept.
The use of N-grams for language detection rests on the assumption that the N-gram distribution pattern of a language is unique, because it reflects how frequently letters or letter pairs, whether vowels or consonants, are used in that language, which generally differs from other languages (Zaman et al., 2015).
N-grams are distinguished by the number of characters n in each piece. To help extract these character pieces, each word is padded with a blank at the beginning and at the end. For example, the word "TEXT" can be broken down into the following n-grams ("_" represents a blank):
uni-grams: T, E, X, T
bi-grams: _T, TE, EX, XT, T_
tri-grams: _TE, TEX, EXT, XT_
quad-grams: _TEX, TEXT, EXT_
quint-grams: _TEXT, TEXT_
After the text is converted to N-grams, it is matched against sample text data for each language. The matching results are calculated using the TF-IDF formula.
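The n-gram splitting described above can be sketched in a few lines. The `char_ngrams` function reproduces the "TEXT" example with one blank of padding on each side. The `detect_language` function is a deliberately simplified stand-in: it scores each language by counting shared trigram occurrences rather than by the TF-IDF matching used in the study, and the two sample sentences in `SAMPLES` are invented for illustration.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of one word, padding with "_" when n > 1."""
    padded = word if n == 1 else f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(text, n=3):
    """Count the n-grams of every whitespace-separated word in a text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(char_ngrams(word, n))
    return counts


def detect_language(text, samples, n=3):
    """Pick the language whose sample text shares the most n-gram occurrences."""
    profile = ngram_profile(text, n)
    scores = {}
    for language, sample in samples.items():
        sample_profile = ngram_profile(sample, n)
        # Counter & Counter keeps the minimum count of each shared n-gram.
        scores[language] = sum((profile & sample_profile).values())
    return max(scores, key=scores.get)


# Invented, very small sample texts; real profiles need far more text.
SAMPLES = {
    "id": "sistem informasi yang digunakan untuk penelitian dan pengembangan",
    "en": "the information system that is used for research and development",
}

print(char_ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(detect_language("penelitian ini menggunakan sistem", SAMPLES))
print(detect_language("this study uses a search system", SAMPLES))
```

With sample texts this short the detector is only a toy, but it shows why padded n-grams work: word beginnings and endings such as "_pe" or "an_" become features of their own.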

Preprocessing
The next stage after the language has been detected is preprocessing. Preprocessing is carried out so that the data is free of noise, has smaller dimensions, and is more structured so that it can be processed further. The preprocessing stage consists of several processes, namely case folding, stopword removal, tokenizing, and stemming (Hermawan & Bellaniar Ismiati, 2020).
The following are the stages in preprocessing:
1. Case Folding. Case folding is the first process in document preprocessing. In this process, all the letters in the document are converted to lowercase. Only the letters a through z are accepted (Riyani, 2019).
2. Stopword Removal. The stopword removal stage removes unnecessary words from the text. Stopwords are non-descriptive words that can be discarded in a bag-of-words approach (Naf'an et al., 2019). At this stage, the stopword removal function of the Sastrawi stemmer library is used.
3. Stemming. At this stage, different forms of a word are reduced to the same representation. Stemming maps each word in a sentence to its root word (without prefixes, suffixes, infixes, or their combinations) using a specific algorithm (Rofiqi et al., 2019). In this study, stemming for Indonesian uses the Sastrawi stemmer, while stemming for English uses the Snowball stemmer. The results of stemming are passed to the next stage, where the words are weighted using the TF-IDF algorithm (Melita et al., 2018).
4. Tokenizing. Tokenization is the process of dividing text, in the form of sentences or paragraphs in a document, into tokens (Naf'an et al., 2019). At this stage, the text is assembled from the terms produced by stemming.
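The four stages can be chained as in the sketch below. It is an illustration under stated assumptions, not the system's code: the stopword lists are tiny invented samples, and `toy_stem` is a placeholder that strips a single suffix and only stands in for the real Indonesian and English stemmers named in the text. Splitting on whitespace before stopword removal is an implementation convenience; the list returned at the end plays the role of the tokens.

```python
import re

# Tiny illustrative stopword lists; a real system would use the lists
# shipped with its stemmer or NLP library.
STOPWORDS = {
    "id": {"yang", "dan", "di", "untuk", "ini"},
    "en": {"the", "and", "of", "for", "this", "is"},
}


def case_fold(text):
    """Lowercase the text and keep only the letters a-z and spaces."""
    return re.sub(r"[^a-z ]+", " ", text.lower())


def remove_stopwords(words, language):
    """Drop words that appear in the stopword list of the given language."""
    return [word for word in words if word not in STOPWORDS[language]]


def toy_stem(word, language):
    """Placeholder stemmer: strips one common suffix, nothing more."""
    suffixes = ("kan", "nya", "an") if language == "id" else ("ing", "es", "s")
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, language):
    """Run case folding, stopword removal, and stemming; return the tokens."""
    words = case_fold(text).split()
    words = remove_stopwords(words, language)
    return [toy_stem(word, language) for word in words]


print(preprocess("The Searching of Documents!", "en"))
print(preprocess("Sistem ini digunakan untuk Pencarian", "id"))
```

The `language` argument is where the detected language would be plugged in, which is why language detection has to run before this stage.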

Term Weighting
At this stage, the terms in the search query and in the scientific article dataset are weighted by calculating how frequently each search query word appears in each scientific article in the dataset. TF-IDF is the method most commonly used in information retrieval to calculate the weight of each word. It is also known to be efficient, simple, and accurate (Maarif, 2015).
The term weighting in this study uses the TF-IDF formula.
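As a rough illustration of term weighting, the sketch below uses one common TF-IDF variant, weight = tf × log10(N / df), where tf is the number of times a term occurs in a document, N is the number of documents, and df is the number of documents containing the term. Which exact variant the study uses is not stated here, so this choice is an assumption, and the three token lists are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses weight = tf * log10(N / df), one common TF-IDF variant.
    """
    total = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        term_frequency = Counter(tokens)
        weights.append({
            term: count * math.log10(total / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return weights


# Invented token lists standing in for preprocessed articles.
docs = [
    ["sistem", "informasi", "sistem"],
    ["sistem", "pakar"],
    ["informasi", "data"],
]
doc_weights = tf_idf(docs)
print(doc_weights[0])
```

With this variant, a term that occurs in every document receives a weight of zero, which is the intended effect: such a term does not help to tell documents apart.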

Cosine Similarity
Cosine similarity measures the similarity between two documents or texts, each of which is treated as a vector. For text matching, vectors A and B are the term-frequency vectors of the documents (Samuel et al., 2018). Cosine similarity is used here to calculate the similarity of scientific articles, with the following formula:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (2)$$
where
A = vector A, which will be compared for similarity
B = vector B, which will be compared for similarity
A · B = dot product between vector A and vector B
‖A‖ = length of vector A
‖B‖ = length of vector B
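A small sketch of the calculation follows: the dot product of the two vectors divided by the product of their lengths. Representing each vector as a sparse dictionary from term to weight is a choice made for the example, as is returning 0 when one of the vectors is empty.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot_product = sum(
        weight * vector_b.get(term, 0.0) for term, weight in vector_a.items()
    )
    length_a = math.sqrt(sum(weight ** 2 for weight in vector_a.values()))
    length_b = math.sqrt(sum(weight ** 2 for weight in vector_b.values()))
    if length_a == 0 or length_b == 0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot_product / (length_a * length_b)


# Invented term-frequency vectors for a query and one article field.
query = {"sistem": 1, "informasi": 1}
field = {"sistem": 2, "pakar": 1}
print(round(cosine_similarity(query, field), 3))  # 0.632
```

Here the dot product is 2, the lengths are √2 and √5, and the result is 2/√10. Because term weights are non-negative, the score always falls between 0 and 1.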

Weighted Tree Similarity
Documents whose similarity is to be calculated are represented as a tree with labeled nodes and labeled, weighted branches (Sarno & Rahutomo, 2015). The article can be split to create a structure. Splits are chosen to follow the paths that may control a process (Anugrah et al., 2016). An example of the tree representation of a scientific article in this study is shown in Figure 4. The tree structure of a scientific article is divided into five branches, namely title, abstract, keywords, authors, and year. Each branch is given a preference weight according to its level of importance. The weights used in this study are title 0.25, abstract 0.35, keywords 0.2, authors 0.15, and year 0.05.
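For a tree with a single level of five branches, the combination step can be sketched as a weighted sum of the per-branch similarity scores. This is a simplification made for illustration: the general weighted tree similarity algorithm compares two trees recursively, and how the study combines the branch scores in detail is an assumption here. The branch scores in the example are invented.

```python
# Branch weights as listed in the text; they sum to 1.
BRANCH_WEIGHTS = {
    "title": 0.25,
    "abstract": 0.35,
    "keywords": 0.20,
    "authors": 0.15,
    "year": 0.05,
}


def weighted_tree_similarity(branch_similarities, weights=BRANCH_WEIGHTS):
    """Weighted sum of per-branch similarity scores, each in [0, 1]."""
    return sum(
        weight * branch_similarities.get(branch, 0.0)
        for branch, weight in weights.items()
    )


# Invented cosine similarity scores of one article against a query.
scores = {"title": 0.8, "abstract": 0.5, "keywords": 1.0,
          "authors": 0.0, "year": 0.0}
print(round(weighted_tree_similarity(scores), 3))  # 0.575
```

Because the weights sum to 1 and each branch score lies between 0 and 1, the combined score also stays between 0 and 1, with the abstract contributing the most.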

Result and Discussion
This section discusses the implementation of a search system for scientific articles by language using the cosine similarity and weighted tree similarity algorithms.

Implementation Results
A scientific article search system was successfully built by following the flow in Figure 1. The process starts with scraping from the neliti.com website; the data collected include the title, abstract, keywords, authors, and year. Language detection is then carried out using the language detection library by Patrick Schur. The detected language determines the next stage, namely stemming: if the text is detected as Indonesian, the Sastrawi algorithm is used for stemming, whereas if it is detected as English, the Porter algorithm from the Snowball library is used.
After preprocessing is complete, the next stage is term weighting. The term weighting results are then used to calculate the cosine similarity with the formula in equation 2. Once the cosine similarity value has been obtained for each structure, the final calculation is carried out using the weighted tree similarity algorithm. Table 1 shows the results of an experiment using the keyword "sistem" with a total of 140 scientific articles. The similarity level produced by the Extended Weighted Tree Similarity algorithm lies in the range 0 to 1 (Suharso et al., 2017). A value of 1 indicates that the scientific article is highly similar to the keyword, whereas the closer the value is to 0, the lower the similarity to the keyword.
Recall describes the success of the model in retrieving information. The recall formula is given in equation 5.
Table 2 shows that testing with only the cosine similarity method produces average accuracy, precision, and recall values that are almost the same in each language. Table 3 shows the test results when only the weighted tree similarity method is used. The system has three pages, namely the scraping page, the preprocessing page, and the search page. Figure 5 shows the scraping page, where users can scrape data from a research website; the scraped results are stored in the database as a structured dataset. Figure 6 shows the preprocessing page, where the dataset is processed by detecting the language and carrying out the preprocessing stages. Figure 7 shows the search page, where keywords and scientific articles are matched using the cosine similarity and weighted tree similarity algorithms; the calculated values are then sorted from largest to smallest, and the results are displayed to the user as shown in Figure 8.

System Testing
To determine the level of accuracy of the research results, in this study the authors used a [...]
Table 5 shows that testing with language detection, the cosine similarity method, and weighted tree similarity together produces average accuracy, precision, and recall values that are almost the same in every language. Table 6 compares the average accuracy, precision, and recall values of all tested methods. The search results for scientific articles have the best accuracy, precision, and recall when language detection, cosine similarity, and weighted tree similarity are combined. With all methods combined, the accuracy is 0.973, the precision is 0.904, and the recall is 0.924.
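For reference, the three reported metrics are commonly computed from the counts of true and false positives and negatives, as sketched below. The standard definitions are assumed, since the study's own equations are not reproduced here, and the counts in the example are invented and are not the study's data.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from confusion-matrix counts.

    tp/fp: relevant/irrelevant articles that were returned,
    tn/fn: irrelevant/relevant articles that were not returned.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented counts for one query over 100 articles.
metrics = evaluation_metrics(tp=8, fp=2, tn=88, fn=2)
print(metrics)
```

Accuracy can look high simply because most articles are irrelevant to any single query, so precision and recall are usually the more informative values for a search system.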

Conclusion
Based on the research, implementation, and testing, the following conclusions can be drawn: (1) The language detection process has a significant effect on the search results. When only the cosine similarity and weighted tree similarity methods are used, without language detection, a search with the keyword "education" returns no results, even though 15 scientific articles contain that keyword. (2) The average test values are above 0.90, which means that searching for scientific articles by language using the cosine similarity and weighted tree similarity algorithms gives fairly good similarity results. The search system also retrieves information fairly well, as evidenced by the average recall value of 0.924.

Suggestion
For further research, the approach could be tried on documents with more sub-tree levels to find out how stable the algorithm is. It could also be combined with new methods or extended to detect languages other than Indonesian and English.