Implementation of the Winnowing Algorithm and Simple Additive Weighting (SAW) for a Publication Reference Journal Search System

Information retrieval systems play an important role in everyday life, especially in searching for reference journals. Reference journals follow several different writing structures, and each contains information or words that are needed when writing a scientific article. A retrieval system helps to find journals whose content is similar to the input, so that they can be used as references. In this study, the input and the journal text are processed with the Winnowing algorithm, which measures the similarity of words or texts using N-grams, a rolling hash, and the Jaccard Coefficient. In general, search relies only on matching words or text, without weighting the part of the reference journal in which the match occurs. To rank journals by their level of importance, a weighting method is needed. This study therefore also uses the Simple Additive Weighting (SAW) method to determine the importance value of each journal, so that results are ranked by both search similarity and the importance weights of the journal structure. Query-to-document similarity obtained 60% precision, 77% recall, and 81% accuracy, while document-to-document similarity obtained 41% precision, 83% recall, and 66% accuracy. Using the Winnowing algorithm, the search system can detect text similarity. Choosing the correct N-gram, window, and base values can increase the similarity value of a document. The document scraping process plays an important role in retrieving the text of each document structure. Stemming can affect system performance because it helps to obtain the base words (terms) in the bag of words.


Introduction
The reuse of information now plays an important role, one example being the search for reference sources used in writing up research results (scientific publications). Searching for reference sources requires a mechanism that recognizes the similarity between the text structure of a query, given either as keywords or as a document, and the available publications.
Journal publications have a required taxonomy or writing structure, including Title, Abstract, Introduction, Literature Study, Methods, Discussion, Results and Conclusions, and Bibliography. A document consists of sentences that are in turn composed of words, so a mechanism is needed to detect the similarity of the query to the contents of the published document.
There are several obstacles in finding references among publication documents. The first is how to recognize the structure or arrangement of words so that the similarity of sentences can be measured. The second is that sentence similarity detection applications usually detect only similarity in word structure, without weighting the matches by the level of "importance" of the part of the publication document in which they occur, for example the title, abstract, keywords, introduction, and conclusions. The taxonomic structure of a text needs to be known in textual structure analysis (Anugrah & Sarno, 2017), which in turn makes information retrieval easier.
This study focuses on measuring text similarity with the Winnowing Algorithm and on weighting importance within the taxonomy of publication documents with the Simple Additive Weighting (SAW) method, with the aim of producing the expected publication journal reference search system.
The Winnowing algorithm finds similar text by forming grams, hashing them, and selecting hash values from each window of hash values. The similarity is then calculated using the Jaccard Coefficient, which gives a percentage value for the similarity of texts or documents.
The Simple Additive Weighting method is a multi-attribute decision-making method used to obtain the best alternative by determining criteria, creating a normalization matrix, and calculating the total value of each alternative, which is then used as its ranking value.
The result of applying the proposed method is a ranking of scientific publication documents based on the similarity between a query or document and the scientific journal documents in the publication repository.
Previous research (Wibowo & Hastuti, 2016) concluded that the Winnowing Algorithm can detect the similarity of text structures fairly quickly. Its weakness is that it cannot provide guarantees or evidence regarding similar texts.
Earlier work (Sunardi et al., 2018) concluded that the choice of n-gram greatly affects the similarity percentage. To address the importance of the different parts of a publication's structure or taxonomy, the Simple Additive Weighting (SAW) method can be used, as in (Mukroji et al., 2019).
SAW is a multi-attribute decision-making method that is usually used to obtain the best alternative by determining the criteria and calculating the total value of each alternative, which is then used as the ranking value (Mukroji et al., 2019).
Other researchers concluded that SAW provides better results than the alternative WP ranking method, because it was shown to minimize the occurrence of identical preference values, so that the alternatives can be ranked well (Berlilana et al., 2018).
From these previous studies, it can be hypothesized that the Winnowing Algorithm and Simple Additive Weighting can be used together for document search: the similarity of text structures is measured and weighted according to the taxonomic importance of each part of the publication journal, and the journals are then ranked for the entered keywords.

Materials and Methods
The search for publication journals is a form of information retrieval. It uses the Winnowing algorithm, which detects the similarity between a reference text and the contents of documents by comparing the proportion of words or sentences they share. The Simple Additive Weighting method rates the alternatives against weighted criteria, and the weighted results yield the best order (Berlilana et al., 2018).
This study combines the measurement of text structure similarity using the Winnowing Algorithm with the ranking of publication journal reference documents using the Simple Additive Weighting (SAW) method. When a document is uploaded, its text is first extracted according to the taxonomy or structure of a journal publication. The parts of the structure used are the title, abstract, keywords, introduction, and results. Text pre-processing is then carried out to obtain the base words that make up the document.
The similarity of the text structure is then measured, and the final step is to calculate the ranking of the publication journal reference documents. An overview of the flow of the research method is shown in Figure 1. This study uses an N-gram setting of N = 8, a base (prime number) of B = 2, and a window of W = 7. A high N-gram value takes more characters per gram, so the comparison is less detailed, although the results remain close to the same similarity. A small N-gram value takes fewer characters per gram and is more detailed, which affects the percentage value of the search. For the window value, a large window finds fewer results and a small window finds more, which affects the similarity percentage in document search. Figure 1 shows the process flow. The system starts with uploading documents and continues with the document scraping process, in which the text is extracted from the document in PDF format.
The next step is text pre-processing. The text is filtered using case folding to normalize it, stopword removal to delete irrelevant words, and stemming to reduce words to their base forms. The process then continues with the similarity search using the Winnowing Algorithm. First, N-grams are formed by splitting the text into grams of N characters. The characters of each gram are converted to ASCII values, and the rolling hash is calculated to obtain a value for each gram. Windows are then formed, with the value of W determining the number of rolling hash values in each window. In the fingerprint process, the smallest rolling hash value in each window is selected, so that a set of fingerprint values is obtained for every document. In the search section, the same process is applied to the query but without scraping, because the query text can be used directly.
The next step is to compare the document fingerprint with the query fingerprint using the Jaccard Coefficient, which gives a text similarity value for each part of the taxonomy of the document being searched. Because there is one similarity value per part of the taxonomy, the documents are ranked using the Simple Additive Weighting (SAW) method, which finds the documents with the largest value according to the weights of the taxonomic structure. The user can also select a document from the query results and search for documents similar to it, in which case the document-to-document similarity results are displayed.

Winnowing Algorithm
The Winnowing algorithm is a method of finding text similarities. It extends the Rabin-Karp fingerprint method by adding windows to the process. The Winnowing algorithm builds document fingerprints and compares them to find similar text. It is considered quite reliable because it can find text similarities at the level of words (Sibarani et al., 2019). The smallest hash value selected from each window forms the fingerprint, and fingerprints are compared using the Jaccard Coefficient.
The steps of the Winnowing Algorithm are as follows:

Word Combining
At this point, all the words will be combined without spaces.

N-gram process
An N-gram is a sequence of n consecutive characters. The first gram runs from the 1st to the nth character, the next from the 2nd to the (n + 1)th character, and so on. Each gram is then hashed in the rolling hash process.
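The word-combining and N-gram steps can be sketched in a few lines of Python. This is only an illustration: the function name `char_ngrams` and the sample sentence are assumptions made for this example, not part of the system described in this study.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters, sliding one character at a time."""
    # Word-combining step: drop all whitespace so the words are joined together.
    joined = "".join(text.split())
    # Gram i covers characters i .. i + n - 1 of the joined text.
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]


grams = char_ngrams("team a won", 3)
print(grams)
```

A text shorter than n characters yields an empty list in this sketch.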

Rolling Hash Process
Each character of a gram is converted to its ASCII value, and the hash of the gram is calculated from these values using the rolling hash formula in Equation (1).
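Equation (1) is not reproduced in this text. The sketch below assumes the polynomial form commonly used with Winnowing, H = c1·b^(k-1) + c2·b^(k-2) + ... + ck, where c1..ck are the ASCII values of the k characters of a gram and b is the base; the function names are hypothetical. The second function shows why the hash is called "rolling": each hash is derived from the previous one instead of being recomputed from scratch.

```python
def gram_hash(gram, base=2):
    """Polynomial hash of one gram (assumed form of Equation (1))."""
    k = len(gram)
    return sum(ord(ch) * base ** (k - 1 - i) for i, ch in enumerate(gram))


def rolling_hashes(text, n, base=2):
    """Hash every n-gram of text, updating the previous hash at each step."""
    if len(text) < n:
        return []
    high = base ** (n - 1)          # weight of the character that leaves the gram
    h = gram_hash(text[:n], base)   # hash of the first gram
    hashes = [h]
    for i in range(n, len(text)):
        # Remove the leftmost character, shift by the base, add the new character.
        h = (h - ord(text[i - n]) * high) * base + ord(text[i])
        hashes.append(h)
    return hashes


print(gram_hash("tan"))           # 116*4 + 97*2 + 110
print(rolling_hashes("tank", 3))
```

With base 2, the gram "tan" (ASCII 116, 97, 110) hashes to 116·4 + 97·2 + 110 = 768 under this assumed form.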

Process Window
The hash values of the grams are grouped into windows of size w. The 1st window contains the hash values of the 1st to the wth gram, the 2nd window contains those of the 2nd to the (w + 1)th gram, and so on.
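A minimal sketch of window formation, assuming the hash values are already available as a list. The name `make_windows` and the sample values are illustrative only.

```python
def make_windows(hashes, w):
    """Group hash values into overlapping windows of w consecutive values."""
    # Window i holds hash values i .. i + w - 1, so consecutive windows overlap by w - 1.
    return [hashes[i:i + w] for i in range(len(hashes) - w + 1)]


print(make_windows([768, 715, 701, 689, 720], 3))
```

In this sketch, a sequence with fewer than w hash values produces no windows.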

Fingerprint Process
After the windows are formed, the smallest hash value in each window is selected as a fingerprint. If a window contains the smallest value more than once, only the rightmost occurrence is taken.
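The selection rule can be sketched as follows. The sketch records each selected value together with its position in the hash sequence, so a minimum that stays the minimum across several overlapping windows is stored only once; that bookkeeping, the function name, and the sample hash values are assumptions for illustration rather than details taken from this study.

```python
def select_fingerprints(hashes, w):
    """Pick the smallest hash of every window, preferring the rightmost on ties."""
    selected = {}  # position in the hash sequence -> selected hash value
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum inside this window.
        offset = w - 1 - window[::-1].index(smallest)
        selected[start + offset] = smallest
    return selected


sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(select_fingerprints(sample, 4))
```

For comparison with the Jaccard Coefficient, the set of selected values, `set(select_fingerprints(...).values())`, can be used as the fingerprint of a text.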

The Jaccard Coefficient Equation
The Winnowing algorithm produces a set of fingerprint values for each text. These values are used to measure the level of text similarity with the Jaccard coefficient, as in Equation (2):

$$ \mathrm{Similarity}(d_i, d_j) = \frac{|w(d_i) \cap w(d_j)|}{|w(d_i) \cup w(d_j)|} \times 100\% \qquad (2) $$

where $w(d_i)$ is the set of fingerprint values of the ith text, $w(d_j)$ is the set of fingerprint values of the jth text, $|w(d_i) \cap w(d_j)|$ is the number of fingerprint values shared by the ith and jth texts, and $|w(d_i) \cup w(d_j)|$ is the total number of distinct fingerprint values in the ith and jth texts.
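The Jaccard coefficient over two fingerprint sets can be sketched directly with Python sets. The function name and the small integer sets below are hypothetical and serve only to show the calculation.

```python
def jaccard_percent(fingerprint_a, fingerprint_b):
    """Jaccard coefficient of two fingerprint collections, as a percentage."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treated as no similarity in this sketch
    return 100.0 * len(a & b) / len(union)


print(jaccard_percent({1, 2, 3}, {2, 3, 4}))  # 2 shared values out of 4 distinct values
```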

Simple Additive Weighting
Simple Additive Weighting (SAW) is a decision-making method used to determine the best alternative. The criterion weights in SAW affect the values obtained in the ranking process, which consists of building the decision matrix, normalizing the matrix, and calculating the total value over the criteria whose weights have been determined. SAW is combined with the Winnowing algorithm because this study produces several similarity values per document, one each for the title, abstract, keywords, introduction, and results, and therefore needs a way to combine them into a single document value by which documents are ranked from the highest value (Alamsyah, 2017). The normalization is given in Equation (3).

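$$ R_{ij} = \begin{cases} \dfrac{x_{ij}}{\max_i x_{ij}} & \text{if } j \text{ is a benefit criterion} \\[2ex] \dfrac{\min_i x_{ij}}{x_{ij}} & \text{if } j \text{ is a cost criterion} \end{cases} \qquad (3) $$

where
$R_{ij}$: normalized performance rating value
$x_{ij}$: the attribute value of alternative i on criterion j
$\max_i x_{ij}$: the largest value of each criterion
$\min_i x_{ij}$: the smallest value of each criterion
Benefit: the greatest value is best.
Cost: the smallest value is best.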
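The normalization in Equation (3) can be sketched as below. The function names, the "benefit"/"cost" labels passed as strings, and the sample numbers are assumptions for illustration. Cost values are assumed to be strictly positive, and an all-zero benefit column is mapped to zeros to avoid division by zero.

```python
def normalize_column(values, kind="benefit"):
    """Normalize the values of one criterion across all alternatives."""
    if kind == "benefit":
        largest = max(values)
        # Guard: a column of zeros has no meaningful maximum to divide by.
        return [v / largest if largest else 0.0 for v in values]
    if kind == "cost":
        smallest = min(values)
        return [smallest / v for v in values]  # assumes every cost value is > 0
    raise ValueError("kind must be 'benefit' or 'cost'")


def normalize_matrix(x, kinds):
    """Normalize an alternatives-by-criteria matrix column by column."""
    columns = [normalize_column(list(col), kind) for col, kind in zip(zip(*x), kinds)]
    return [list(row) for row in zip(*columns)]


print(normalize_column([20, 40, 10], "benefit"))
print(normalize_column([20, 40, 10], "cost"))
```

Similarity percentages are benefit criteria, since a larger similarity is better.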

Scraping Documents
Document scraping retrieves information from the text contents of a file by using start and end boundaries that are determined by the structure of scientific publication journals, namely "Title", "Abstract", "Keywords", "Introduction", and "Results". The scraping source code is shown in Figure 2.
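The idea of cutting the extracted text at section boundaries can be sketched as follows. This is not the scraping code of this study: the function name, the assumption that everything before the first heading is the title, and the rule that the last section runs to the end of the text are all simplifications made for the example.

```python
def split_sections(text, markers=("abstract", "keywords", "introduction", "results")):
    """Split extracted text into sections, using heading words as boundaries."""
    lowered = text.lower()  # case-insensitive search; assumes lower() keeps the length
    found = []
    cursor = 0
    for marker in markers:
        pos = lowered.find(marker, cursor)  # headings are expected in this order
        if pos != -1:
            found.append((marker, pos))
            cursor = pos + len(marker)
    # Everything before the first heading is treated as the title (an assumption).
    first = found[0][1] if found else len(text)
    sections = {"title": text[:first].strip()}
    for idx, (marker, pos) in enumerate(found):
        end = found[idx + 1][1] if idx + 1 < len(found) else len(text)
        sections[marker] = text[pos + len(marker):end].strip()
    return sections


sample = ("Winnowing Search Abstract We rank journals. Keywords winnowing, saw "
          "Introduction Reuse matters. Results Precision improved.")
print(split_sections(sample))
```

A real document would need stricter boundary rules, because a heading word such as "results" can also occur inside ordinary sentences.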

Preprocess Text
This process filters the text obtained from scraping. The text is normalized to make it uniform, for example by converting all letters to lowercase and removing punctuation and symbols. Meaningless words are deleted, and stemming is then carried out using the Sastrawi stemmer so that good root words are obtained. The text pre-processing stages are as follows:

Case Folding
At this stage the text is normalized: all text is converted to lowercase, and punctuation marks, symbols, and numbers are deleted, for example @ # $ ! * & ^ % < ? / : , and 123456789. Case folding is done to produce more consistent text.
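A minimal sketch of case folding using Python's standard `re` module. The function name is hypothetical, and the sketch assumes the text uses only unaccented Latin letters, so any other character is treated as a symbol and removed.

```python
import re


def case_fold(text):
    """Lowercase the text and remove digits, punctuation, and symbols."""
    lowered = text.lower()
    # Replace every character that is not a-z or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by the removed characters.
    return " ".join(letters_only.split())


print(case_fold("Team A won 3-1, @Home!"))
```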

Stopword Removal
After case folding, words that have no meaning, that do not provide important information, or that cannot stand alone are removed at this stage, for example 'yang', 'di', 'ke', 'because', etc. (Anwar et al., 2019).

Stemming
The sentence produced after case folding and stopword removal still contains affixed words. Affixed words cause a root word to have a double meaning, so the stemming process must be carried out. Stemming is a mechanism for generating base words by eliminating prefixes, infixes, and suffixes. The stemming process must take the grammatical structure of the language into account, for example Indonesian or English. In this study the documents are published journal documents in Indonesian, so a stemmer for Indonesian is used. The Sastrawi stemmer is a stemmer for Indonesian words. The Sastrawi library has a dictionary obtained from kateglo.com, an Indonesian dictionary website (Anwar et al., 2019). Figure 3 is an example from the Sastrawi stemmer.

Copyright © 2021, JDR, E-ISSN 2579-9347 P-ISSN 2579-9290

Scraping source code (see Figure 2):

    // Location of the uploaded PDF and the folder for the converted output.
    $source_pdf = "jurnal/$name_file";
    $output_folder = "MyFolder/jurnal";
    // Create the output folder if it does not exist yet.
    if (!file_exists($output_folder)) {
        mkdir($output_folder, 0777, true);
    }
    // Convert the PDF to HTML with the pdftohtml command-line tool.
    $a = passthru("pdftohtml $source_pdf $output_folder/$name_file", $b);
    // Read the converted file and lowercase its contents.
    $myfile = fopen("MyFolder/jurnal/$name_file$s.html", "r") or error($name_file);
    $text = strtolower(fread($myfile, filesize("MyFolder/jurnal/$name_file$s.html")));
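The stopword removal step described in this section can be sketched as below. The three-word list is only a tiny illustrative sample and the function name is hypothetical; a real system would load a full Indonesian stopword list. Stemming is not shown because it depends on an external stemmer library.

```python
# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"yang", "di", "ke"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop every word that appears in the stopword list."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(remove_stopwords("tim yang menang di final"))
```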

Winnowing Algorithm
The sentence in Figure 3 is pre-processed, giving the result shown in Figure 4, and is then processed with the Winnowing Algorithm. In this example the Winnowing Algorithm is applied with N-gram N = 3, base (prime number) = 2, and Window W = 7 to find the similarity value of the text structure. Before similarity is measured, the spaces between words are removed, which produces the text in Figure 5.

Splitting the Sentence into N-Grams
The combined text is then split into N-grams with N = 3, as shown in Figure 6.

Fingerprint
After the hash values are segmented into windows, the smallest value in each window is selected in the Fingerprint process. Figure 9 and Figure 10 show the process of finding the smallest values for the Fingerprint of the Query and the Fingerprint of each Document. The query in this example after text pre-processing is "The final match of the European league between team a against team b directly results from the team alike champions" and the document in this example is "team a won the European league after defeating team b in the final match last night".

Rolling Hash Process
After the sentence is split in the N-gram process, each character of a gram is converted to its ASCII value, and the hash of the gram is calculated from these values using the Rolling Hash formula in Equation (1). Figure 7 shows the Rolling Hash calculation for a gram formed in Figure 6. The gram used in Figure 7 is "tan".

Process Window
Process Window is the mechanism that forms segments, each containing a group of Rolling Hash values. In this example a Window value of 7 is used, so each segment shown in Figure 8 has 7 rows.

Similarity Jaccard Coefficient
The last stage of the Winnowing Algorithm is to calculate the similarity between the Fingerprint of the Query and the Fingerprint of each Document using the Jaccard Coefficient.
At this stage, the similarity value between the Query and the Document from Figure 11 is calculated. The fingerprint values shared by the Query and the Document (their intersection) are 708, 720, 707, 704, 701, and 689, which is 6 values. The number of fingerprint values that are not shared is 12, so the union contains 12 + 6 = 18 values. By Equation (2), the similarity between the Fingerprint of the Query and the Document is (6/18) x 100 = 33.33%.

Simple Additive Weighting (SAW)
The similarity of the text structure is measured with the Winnowing Algorithm for each part of the taxonomy of a journal publication. In this study, the parts used are the title, abstract, keywords, introduction, and results. The similarity value of each part is combined according to the preference weights in Table 1 using Simple Additive Weighting (SAW) to rank the documents. The normalization formula is given in Equation (3).

Formation of Matrix X
The Simple Additive Weighting (SAW) method begins with the creation of an M x N matrix of the suitability of each alternative on each criterion (Supriyatna & Ekaputra, 2017), where M is the number of scientific journal publication documents and N is the number of taxonomy parts of a scientific journal publication document. Table 2 is an example of the matrix generated from the results of the Jaccard Coefficient equation.

SAW Ratings
After the R Matrix (Normalized Matrix) is produced using Equation (3), the ranking process is carried out so that the documents are ordered from the largest to the smallest value. The greater the calculated value, the higher the rank of the document. Table 3 shows the Simple Additive Weighting (SAW) ranking obtained from the R Matrix and the taxonomic weights in Table 1. To evaluate the document ranking produced by SAW, manual testing by experts is needed. In manual testing, an expert matches the text of each sentence in the ranked documents by hand and compares the outcome with the SAW ranking. Figure 14 is the test result.
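The ranking step can be sketched as below, assuming the usual SAW preference value, which is the weighted sum of a document's normalized ratings. The weights and the normalized matrix in the example are hypothetical and are not the values of Table 1 or Table 2; the function names are also assumptions.

```python
def saw_scores(r, weights):
    """Weighted sum of the normalized ratings of each alternative (document)."""
    return [sum(w * value for w, value in zip(weights, row)) for row in r]


def rank_documents(r, weights):
    """Return (document index, score) pairs ordered from largest to smallest score."""
    scores = saw_scores(r, weights)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Two documents, two criteria; rows are documents, columns are criteria.
normalized = [[0.5, 1.0],
              [1.0, 0.5]]
print(rank_documents(normalized, weights=[0.75, 0.25]))
```

In this example the second document ranks first because it scores higher on the criterion with the larger weight.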

System Implementation
The application system that implements the Winnowing Algorithm and Simple Additive Weighting (SAW) for searching publication journal references has the following interface:

Upload documents
The document upload page is for users who want to save journals found in earlier internet searches. On upload, the scraping, pre-processing, and Winnowing algorithm processes are carried out. The page is shown in Figure 12.

Display search and search results
The search input is on the same page as the upload and is shown in Figure 13; it takes the query for the search. The search results contain documents from the upload repository. They are obtained by running the Winnowing algorithm to detect text similarities and ranking with the Simple Additive Weighting method, so that the documents whose contents are most similar to the query are returned. The results are shown in Figure 14.

Document view by document
This page shows the results of searching for documents similar to a document selected from Figure 14. Figure 15 shows the document-to-document results.

Testing
The proposed combination of the Winnowing Algorithm and Simple Additive Weighting (SAW) was tested to evaluate the recommendations of the publication journal reference search system. The test was carried out with 4 different queries against the documents. The standard value for the 40 existing documents was obtained using the threshold formula (Francq, 2014), so every query that is entered has a different standard value. The recommended search results were calculated, and the system output was matched against the judgment of a human user. Precision, Recall, and Accuracy values were calculated using the Confusion Matrix (Falsini et al., 2012).
Recall for document-to-document similarity is higher than recall for query-to-document similarity. This is because document-to-document similarity compares the entire contents of each structure of each document, whereas query-to-document similarity compares only the text of the query with the overall contents of each structure of each publication journal document. This is proportional to the precision performance, because the more text is compared, the greater the possible similarity value.
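Precision, recall, and accuracy follow from the four cells of a confusion matrix: precision = TP / (TP + FP), recall = TP / (TP + FN), and accuracy = (TP + TN) / (TP + FP + FN + TN). The sketch below uses hypothetical counts for a collection of 40 documents; they are not the counts measured in this study, and the function name is an assumption.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # relevant among retrieved
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # retrieved among relevant
    accuracy = (tp + tn) / total if total else 0.0    # correct decisions overall
    return precision, recall, accuracy


# Hypothetical counts: 3 relevant documents retrieved, 2 irrelevant retrieved,
# 1 relevant missed, 34 irrelevant correctly left out (40 documents in total).
print(confusion_metrics(tp=3, fp=2, fn=1, tn=34))
```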

Conclusion
From the test results, the following conclusions were obtained: (1) Query-to-document similarity testing gave 60% precision, 77% recall, and 81% accuracy, while document-to-document similarity gave 41% precision, 83% recall, and 66% accuracy. (2) By using the Winnowing Algorithm, the search system can detect text similarities. The research used N-Gram 8, Window 7, and base 2, and the Simple Additive Weighting method helped to rank documents based on the structure of published journal documents. (3) Determining the correct N-Gram, window, and base in the Winnowing algorithm can increase the similarity value in the document. (4) The process of scraping documents plays an important role in retrieving the text of each journal publication document structure, which is used as data in the document corpus. (5) The use of stemming algorithms can affect system performance because the stemming process helps to obtain the base words (terms) in the bag of words.

Suggestion
This study used Indonesian-language journals. It is hoped that further development can process words or texts with an English stemming method so that English words can be filtered, and can find better N-gram and rolling hash settings for the Winnowing algorithm to increase precision in document search.