Normalized Google Distance for Collocation Extraction from Islamic Domain

Abstract

This study investigates the properties of Arabic collocations and classifies them according to their structural patterns in the Islamic domain. The patterns and variations of the collocations are identified on the basis of linguistic information. A system that extracts collocations from Islamic-domain text using statistical measures is then described. In the candidate ranking step, the normalized Google distance is adapted to measure the association between the words in the candidate set. For evaluation, n-best lists are selected for each association measure, and all candidates in these lists are annotated manually. Four further association measures (log-likelihood ratio, t-score, mutual information, and enhanced mutual information) are applied in the candidate ranking step for comparison with the normalized Google distance in Arabic collocation extraction. In the experiment, the normalized Google distance achieved the highest precision, 93%, among the association measures compared. This result supports using the normalized Google distance to measure the relatedness between the constituent words of collocations, instead of the frequency-based association measures used in state-of-the-art methods.

Keywords: normalized Google distance, collocation extraction, Islamic domain
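
As a rough illustration of the ranking measure named in the abstract, the standard normalized Google distance can be computed from four counts: the frequency of each word, their joint frequency, and a total size. The abstract does not say how the measure was adapted, so the sketch below simply assumes corpus counts stand in for search-engine page counts; the function name `ngd` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def ngd(fx, fy, fxy, n):
    """Standard normalized Google distance from raw counts.

    fx, fy : occurrence counts of the two words (assumed corpus counts)
    fxy    : count of the two words occurring together
    n      : total number of units counted (e.g. documents or windows)

    Lower values indicate a stronger association between the words.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Words that never co-occur are treated as maximally distant.
        return math.inf
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of the two word counts")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def rank_candidates(candidates, n):
    """Sort (pair, fx, fy, fxy) tuples by ascending NGD (strongest first)."""
    return sorted(candidates, key=lambda c: ngd(c[1], c[2], c[3], n))
```

Because a smaller distance means a stronger association, candidates are sorted in ascending order, which is the reverse of how scores such as the t-score or mutual information are usually ranked.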
