Implementation of Weighted Tree Similarity and Cosine Sorensen-Dice Algorithms for Semantic Search in Document Repository Information System

Document search has several approaches, including full-text search, plain metadata search, and semantic search. This study combines the Weighted Tree Similarity algorithm with the Cosine Sorensen-Dice algorithm to calculate semantic search similarity. Document metadata is represented as a tree with labeled nodes, labeled branches, and weighted branches. The similarity of subtree edge labels is calculated with Cosine Sorensen-Dice, while the total similarity of a document is calculated with Weighted Tree Similarity. The metadata structure of the document uses the taxonomy owner, description, title, disposition content, and type. The result of this research is a document search application with taxonomic weights on file storage. In the experiments, the combination of Weighted Tree Similarity with Tanimoto Cosine has an average recall of 58%, precision of 88%, and accuracy of 83%, while the combination of Weighted Tree Similarity with Cosine Sorensen-Dice has an average recall of 66%, precision of 88%, and accuracy of 85%. The combination of Weighted Tree Similarity with Cosine Sorensen-Dice therefore performs better than the combination of Weighted Tree Similarity with Tanimoto Cosine for document search at the University of Muhammadiyah Gresik, with an average recall of 66% and an average accuracy of 85%. The similarity value on text labels using Cosine Sorensen-Dice is also influenced by the number of terms and documents in the repository.


INTRODUCTION
As the number of documents we manage increases, information retrieval becomes more important, and as more documents are stored, the search process becomes increasingly difficult. The usual approach to finding information in a document is a query. Queries in data retrieval work by matching words, so the search result only determines the presence or absence of words in the database. In contrast to data retrieval, information retrieval is an attempt to process data with the aim of obtaining relationships from that data. The data in this case is a collection of documents containing words. Relationships between words are usually identified through textual analysis (Anugrah & Sarno, 2017). This relationship within the data is the main focus of information retrieval.
Each document usually has a structure, and this structure can be used in the search process through metadata search. Several document search algorithms use metadata, one of which is the Weighted Tree Similarity (WT Similarity) method. The WT Similarity algorithm generates a tree similarity value by visiting the lowest nodes (leaves) and then calculating the similarity of each pair of branches through the weight of the edge that connects the leaf to the branch. The problem with WT Similarity lies in the calculation of the similarity of node pairs, which depends on the labels of the edges that connect the nodes. If the labels on the edges are exactly the same (exact match), the pair has a weight of 1; otherwise (non-exact match), the weight is 0 (Basmalah Wicaksono et al., 2016).
Several related studies underlie this research. Basmalah Wicaksono et al. (2016) concluded that search using the Weighted Tree Similarity method gave better precision than VSM, although its recall was lower than that of VSM, as evidenced by system and expert testing.
Meanwhile, Alkaff et al. (2020) used Weighted Tree Similarity and Content-Based Filtering; across five test scenarios, the system performed well, with a precision value of 88%.
Adi P & Palgunadi (2014) found that the combination of weighted tree similarity with tanimoto cosine resulted in a better search, with a precision of 100% and a recall of 84.44%.
Putro & Thamrin (2018) assessed three similarity functions on sentence pairs and found that dice similarity has the best score for calculating sentence similarity, whereas euclidean distance has a poor score.
From these previous studies it can be concluded that the weighted tree similarity method can be used in document search. In the study by Adi P & Palgunadi (2014), combining weighted tree similarity with tanimoto similarity gave a low recall of 63.94%, while weighted tree similarity with cosine similarity gave 80.89%; combining tanimoto and cosine gave better searches, increasing recall to 84.44%. In the comparison of cosine, dice, and euclidean by Putro & Thamrin (2018), dice similarity has the best score for calculating sentence similarity. This study therefore analyzes the combination of weighted tree similarity with cosine similarity and sorensen-dice similarity, and whether it can provide a better search than tanimoto cosine when used for file search at the University of Muhammadiyah Gresik.

MATERIALS AND METHODS
This semantic search research uses Weighted Tree Similarity (WT Similarity) to measure the similarity of tree structures, and the Vector Space Model (VSM) with cosine sorensen-dice to measure the semantic similarity between the edge labels being compared. Cosine sorensen-dice is a combination of cosine similarity and sorensen-dice similarity. The document data follows the letter structure of the document storage application at the University of Muhammadiyah Gresik (UMG). The storage structure for letters and documents consists of the title, owner, description, disposition content, and type (or description of the letter).
At the initial stage, documents are extracted according to this structure and then represented as a tree. Next, the similarity of paired subtrees is measured. Subtree similarity is measured by taking every two vertices connected by one edge (adjacent nodes), and each pair of adjacent nodes is evaluated using Weighted Tree Similarity. Measuring the similarity of adjacent node pairs includes measuring the similarity of edge labels: the label text goes through text pre-processing, and the similarity of the compared label texts is then computed using cosine sorensen-dice. Once the subtree similarities are known, the similarity of the trees being compared is calculated. Figure 1 shows an overview of the research methods used.
Figure 1. Research overview
A tree consists of several subtrees, and each subtree has at least two nodes and one connecting edge (Adi P & Palgunadi, 2014). A branch is itself a subtree consisting of at least two connected (adjacent) nodes, one of which is a branch node and the other a leaf. Figure 2 is an example of the tree representation used in this research, with the branches owner, description, title, disposition content, and type. This study gives more emphasis to the search results for the owner branch, which has a weight of 0.3, than to the other branches of the subtree. The total of the weights at the tree branch level is 1.
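The tree representation described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the helper names, the leaf texts, and every weight other than the owner weight of 0.3 are hypothetical. The sub-branches of disposition content (contents, attachment) and type (letter, type) follow the branch names mentioned in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; a node without edges is a leaf holding label text."""
    label: str
    # Each edge is a tuple (edge_label, weight, child_node).
    edges: list = field(default_factory=list)

    def add(self, edge_label, weight, child):
        self.edges.append((edge_label, weight, child))
        return self


def weights_sum_to_one(node, tol=1e-9):
    """Check that the branch weights under every non-leaf node sum to 1."""
    if not node.edges:
        return True
    total = sum(weight for _, weight, _ in node.edges)
    return abs(total - 1.0) <= tol and all(
        weights_sum_to_one(child) for _, _, child in node.edges
    )


def build_document_tree():
    # Only the owner weight (0.3) is from the text; all other weights and
    # the leaf texts are hypothetical placeholders.
    disposition = Node("disposition content")
    disposition.add("contents", 0.5, Node("leaf text"))
    disposition.add("attachment", 0.5, Node("leaf text"))

    doc_type = Node("type")
    doc_type.add("letter", 0.5, Node("leaf text"))
    doc_type.add("type", 0.5, Node("leaf text"))

    root = Node("document")
    root.add("owner", 0.3, Node("leaf text"))
    root.add("description", 0.2, Node("leaf text"))
    root.add("title", 0.2, Node("leaf text"))
    root.add("disposition content", 0.15, disposition)
    root.add("type", 0.15, doc_type)
    return root


tree = build_document_tree()
print(weights_sum_to_one(tree))
```

Checking that the weights at each level sum to 1 is a cheap guard when the taxonomy weights are edited by hand.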

RESULTS AND DISCUSSION
This section presents the results and discusses the calculation of a document's similarity to the input query. The semantic search combines the Weighted Tree Similarity method, which calculates structural similarity, with cosine sorensen-dice, which calculates the semantic similarity of the subtree labels being compared. Table 1 is an example of the document structure to be processed.

Preprocess Text
This process is performed on each edge label of the document tree and on the input query. The steps of the text preprocessing stage are as follows:

Case Folding
This process normalizes the text by converting it to lowercase and removing punctuation marks. Example: Query
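As a minimal sketch of case folding, assuming that only ASCII punctuation needs to be handled, the step could look like the following. The function name `case_fold` and the sample string are hypothetical and are not taken from the paper's data.

```python
import string


def case_fold(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() + join() collapses the runs of spaces left by the replacement.
    return " ".join(text.lower().translate(table).split())


print(case_fold("Undang-Undang, Nomor 12!"))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words as separate tokens, which is a design choice made for this sketch.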

Stopword Removal
After case folding, this stage eliminates conjunctions and other words that carry no meaning. Example: Query
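Stopword removal can be sketched as a simple filter over the tokens. The stopword set below is a tiny hypothetical list used only for illustration; the paper does not state which stopword list was used.

```python
# Hypothetical, deliberately small stopword list (Indonesian function words).
STOPWORDS = {"dan", "yang", "di", "ke", "dari"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split case-folded text on whitespace and drop stopword tokens."""
    return [token for token in text.split() if token not in stopwords]


print(remove_stopwords("surat undangan dari rektor"))
```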

Stemming
The output of the stopword removal process still contains words with affixes, so the stemming process must be carried out. The stemming process removes word affixes, either prefix,

TF/IDF
After the text preprocessing results are obtained, TF/IDF calculations are carried out. This stage produces the weight of each word and is carried out on each document subtree. For example, calculations are performed on the title subtree. The results in Table 3 are obtained by multiplying the IDF value by the number of occurrences of each term in the query and the sample documents. For example, the term "undang" occurs once in Q and its idf is 0.176, so the result is 1 * 0.176 = 0.176, and so on.
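The weighting step can be sketched as follows. The idf value of 0.176 quoted above is consistent with log10(3/2), that is, a term occurring in two of three texts; this reading is an assumption, since the idf formula is not reproduced here. The function names and the sample token lists are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """Assumed idf: log10(N / df), with df = number of texts containing term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / df) if df else 0.0


def tf_idf(tokens, documents):
    """Weight each term of `tokens` by raw term count times idf."""
    tf = Counter(tokens)
    return {term: count * idf(term, documents) for term, count in tf.items()}


# Hypothetical token lists; "undang" occurs in two of the three texts.
documents = [["undang", "rapat"], ["undang", "dosen"], ["laporan", "keuangan"]]
weights = tf_idf(["undang"], documents)
print(round(weights["undang"], 3))
```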

Similarity Calculations
After the word weights are obtained from the TF/IDF process, the similarity of the leaf node subtrees is calculated using cosine sorensen-dice in Equations (1), (2), and (3).
(1)
(2)
(3)
Information:
CS = cosine sorensen-dice
∑ = total data
D1 = the first sentence to be compared
D2 = the second sentence to be compared
An example of data for which the similarity is calculated is the input query and the title leaf node subtree of D1. After the values of cosine similarity and sorensen-dice similarity are known, the value of cosine sorensen-dice is obtained using Equation (3). After this is applied to all the data, the results are as shown in Table 4.
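The two underlying measures can be sketched on TF/IDF weight vectors using their standard textbook definitions. How the paper combines them in Equation (3) is not reproduced here, so the sketch takes the combining rule as a parameter and uses the arithmetic mean purely as a placeholder assumption. All function names and sample weights are hypothetical.

```python
import math


def _dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def cosine(a, b):
    """Standard cosine similarity: a.b / (|a| * |b|)."""
    norm_a = math.sqrt(_dot(a, a))
    norm_b = math.sqrt(_dot(b, b))
    return _dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sorensen_dice(a, b):
    """Standard weighted Sorensen-Dice: 2 * a.b / (|a|^2 + |b|^2)."""
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def cosine_sorensen_dice(a, b, combine=lambda c, d: (c + d) / 2):
    # ASSUMPTION: the mean is a placeholder, not the paper's Equation (3).
    return combine(cosine(a, b), sorensen_dice(a, b))


query = {"undang": 0.176}                    # hypothetical weights
title = {"undang": 0.176, "rapat": 0.477}    # hypothetical weights
print(cosine(query, title), sorensen_dice(query, title))
```

Passing `combine` explicitly makes it easy to swap in the actual combining rule once it is known.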

Weighted Tree Similarity
After the similarity of all leaf node subtrees of the document tree has been calculated, the total tree similarity to the input query is computed. To determine the owner branch similarity, the owner node similarity is multiplied by the average weight of the owner branches in Figure 2: 0 * (0.3 + 0.3) / 2 yields the owner branch similarity. The algorithm then computes the similarity of the next branches, description and title, in the same way. Because the disposition content node is not a leaf, the algorithm descends to calculate the similarity of its contents and attachment branches; these results are summed to give the similarity value of the disposition content node, which is then multiplied by the average weight of the disposition content branch. The algorithm then proceeds to the type branch; because this node is also not a leaf, it descends to the letter and type branches to calculate the similarity value. The following is an example of calculating the similarity of document tree D1 to Q using Equation (4).

$$\sum_{i} A(s_i) \cdot \frac{w_i + w_i'}{2} \qquad (4)$$
Information:
A(s_i) = similarity in leaf nodes
w_i, w_i' = arc weights of the paired weighted trees
The similarity of the query with document D1 is 0.32. After the weighted tree similarity algorithm is applied to all data, the results are as shown in Table 5; the higher a document's similarity value, the higher its position in the ranking.
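The traversal described above can be sketched recursively, reading Equation (4) as the sum over paired branches of the node similarity multiplied by the average of the two arc weights. This is a simplified sketch, not the full algorithm: trees are plain nested dicts of the form {branch: (weight, leaf_text_or_subtree)}, and the function name, the exact-match leaf comparison, and all sample values are hypothetical.

```python
def wt_similarity(query, doc, leaf_sim):
    """Sum sim * (w + w') / 2 over branches present in both trees."""
    total = 0.0
    for label, (w_query, sub_query) in query.items():
        if label not in doc:
            continue  # a branch missing from one tree contributes nothing
        w_doc, sub_doc = doc[label]
        if isinstance(sub_query, dict) and isinstance(sub_doc, dict):
            # Not a leaf: descend and sum the similarities of the sub-branches.
            sim = wt_similarity(sub_query, sub_doc, leaf_sim)
        elif isinstance(sub_query, dict) or isinstance(sub_doc, dict):
            sim = 0.0  # structural mismatch between leaf and subtree
        else:
            sim = leaf_sim(sub_query, sub_doc)
        total += sim * (w_query + w_doc) / 2
    return total


def exact_match(a, b):
    """Placeholder leaf similarity; a text similarity could be used instead."""
    return 1.0 if a == b else 0.0


# Hypothetical two-branch trees: owner differs, title matches.
q = {"owner": (0.3, "rektor"), "title": (0.7, "undang rapat")}
d = {"owner": (0.3, "dekan"), "title": (0.7, "undang rapat")}
print(wt_similarity(q, d, exact_match))
```

In this example the owner branch contributes 0 * (0.3 + 0.3) / 2 and the title branch contributes 1 * (0.7 + 0.7) / 2.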

Implementation System
The implementation of weighted tree similarity and cosine sorensen-dice in the file search system has the following interfaces:

Search page
This page is used by the user to enter a query to find the desired document, as shown in Figure 3.

Search Results page
This page displays the documents most similar to the keywords the user has entered on the search page, as shown in Figure 4.

Testing
The search results are tested by entering 5 different keywords, and the resulting documents are filtered by a threshold value determined from the cosine similarity score (Francq, 2014). If a document's similarity value is less than the threshold, the document does not appear; conversely, if it exceeds the threshold, the document appears. For evaluation, the results are matched by an expert (secretarial staff), and the recall, precision, and accuracy values are then calculated using a confusion matrix (Alkaff et al., 2020). The results of the confusion matrix calculation can be seen in Table 6. Table 6 shows that the combination of weighted tree similarity with tanimoto cosine has an average recall of 58%, precision of 88%, and accuracy of 83%, while the combination of weighted tree similarity with cosine sorensen-dice has an average recall of 66%, precision of 88%, and accuracy of 85%. The combination of weighted tree similarity with cosine sorensen-dice has better recall and accuracy values. The accuracy scores differ for several queries because cosine sorensen-dice found more documents related to the query, so the accuracy value for those queries is higher.
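The evaluation procedure can be sketched with the standard confusion-matrix definitions of recall, precision, and accuracy. The threshold of 0.3, the document identifiers, the expert judgments, and all scores except the 0.32 reported for D1 are hypothetical; treating a score equal to the threshold as a hit is also an assumption.

```python
def above_threshold(scores, threshold):
    """Return the documents shown to the user (assumed: score >= threshold)."""
    return {doc for doc, score in scores.items() if score >= threshold}


def confusion_counts(retrieved, relevant, collection):
    """Count TP, FP, FN, TN from sets of document identifiers."""
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = len(collection - retrieved - relevant)
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) with zero-division guards."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return recall, precision, accuracy


scores = {"d1": 0.32, "d2": 0.10, "d3": 0.45, "d4": 0.05}  # mostly hypothetical
relevant = {"d1", "d2", "d3"}                              # hypothetical expert labels
retrieved = above_threshold(scores, 0.3)
counts = confusion_counts(retrieved, relevant, set(scores))
print(metrics(*counts))
```

In this toy run one relevant document falls below the threshold, which lowers recall while precision stays at 1.0, the same kind of trade-off visible in the reported averages.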

CONCLUSION
From the results of the analysis and discussion, it can be concluded that the combination of the weighted tree similarity method with cosine sorensen-dice results in a better search for documents at the University of Muhammadiyah Gresik, with an average recall of 66% and an average accuracy of 85%, which is higher than the combination of weighted tree similarity with tanimoto cosine. The similarity value on text labels using cosine sorensen-dice is also influenced by the number of terms and documents in the repository. This research can be developed further by adding a synonym detection method for words contained in the leaf nodes.
Copyright © 2021, JDR, E ISSN 2579-9347 P ISSN 2579-9290