2,772 research outputs found

    Arabic Information Retrieval: A Relevancy Assessment Survey

    Get PDF
    The paper presents a research in Arabic Information Retrieval (IR). It surveys the impact of statistical and morphological analysis of Arabic text in improving Arabic IR relevancy. We investigated the contributions of Stemming, Indexing, Query Expansion, Text Summarization (TS), Text Translation, and Named Entity Recognition (NER) in enhancing the relevancy of Arabic IR. Our survey emphasizing on the quantitative relevancy measurements provided in the surveyed publications. The paper shows that the researchers achieved significant enhancements especially in building accurate stemmers, with accuracy reaches 97%, and in measuring the impact of different indexing strategies. Query expansion and Text Translation showed positive relevancy effect. However, other tasks such as NER and TS still need more research to realize their impact on Arabic IR

    Development of Arabic Information Retrieval Systems in the 21st Century

    Get PDF
    The present study deals with the development of Arabic Information Retrieval Systems starting from 2000, its vital role in the Text Retrieval Conference (TREC), and in the cross-language information retrieval track. It has overviewed the developments concerning the Holy Qur'an, Arabic language, terms relevant to Arabic information retrieval systems, and the characteristics of the Arabic language compared with other languages since the early 21st century. These developments include rich resources of up to date information so as to develop research in this area, modern developments in assessing and measuring Arabic information retrieval systems, relevant theses, and some research studies of contemporary universities on the use of TREC in Arabic information retrieval, and the researchers with no prior knowledge of Arabic language. The study ends with some studies of the Arab universities. Keywords: Retrieval Systems, Arabic Information, Twenty- first centur

    The Enhancement of Arabic Information Retrieval Using Arabic Text Summarization

    Get PDF
    The massive upload of text on the internet makes the text overhead one of the important challenges faces the Information Retrieval (IR) system. The purpose of this research is to maintain reasonable relevancy and increase the efficiency of the information retrieval system by creating a short and informative inverted index and by supporting the user query with a set of semantically related terms extracted automatically. To achieve this purpose, two new models for text mining are developed and implemented, the first one called Multi-Layer Similarity (MLS) model that uses the Latent Semantic Analysis (LSA) in the efficient framework. And the second is called the Noun Based Distinctive Verbs (NBDV) model that investigates the semantic meanings of the nouns by identifying the set of distinctive verbs that describe them. The Arabic Language has been chosen as the language of the case study, because one of the primary objectives of this research is to measure the effect of the MLS model and NBDV model on the relevancy of the Arabic IR (AIR) systems that use the Vector Space model, and to measure the accuracy of applying the MLS model on the recall and precision of the Arabic language text extraction systems. The initiating of this research requires holding a deep reading about what has been achieved in the field of Arabic information retrieval. In this regard, a quantitative relevancy survey to measure the enhancements achieved has been established. The survey reviewed the impact of statistical and morphological analysis of Arabic text on improving the AIR relevancy. The survey measured the contributions of Stemming, Indexing, Query Expansion, Automatic Text Summarization, Text Translation, Part of Speech Tagging, and Named Entity Recognition in enhancing the relevancy of AIR. Our survey emphasized the quantitative relevancy measurements provided in the surveyed publications. The survey showed that the researchers achieved significant achievements, especially in building accurate stemmers, with precision rates that convergent to 97%, and in measuring the impact of different indexing strategies. Query expansion and Text Translation showed a positive relevancy effect. However, other tasks such as Named Entity Recognition and Automatic Text Summarization still need more research to realize their impact on Arabic IR. The use of LSA in text mining demands large space and time requirements. In the first part of this research, a new text extraction model has been proposed, designed, implemented, and evaluated. The new method sets a framework on how to efficiently employ the statistical semantic analysis in the automatic text extraction. The method hires the centrality feature that estimates the similarity of the sentence with respect to every sentence found in the text. The new model omits the segments of text that have significant verbatim, statistical, and semantic resemblance with previously processed texts. The identification of text resemblance is based on a new multi-layer process that estimates the text-similarity at three statistical layers. It employes the Jaccard coefficient similarity and the Vector Space Model (VSM) in the first and second layers respectively and uses the Latent Semantic Analysis in the third layer. Due to high time complexity, the Multi-Layer model restricts the use of the LSA layer for the text segments that the Jaccard and VSM layers failed to estimate their similarities. ROUGE tool is used in the evaluation, and because ROUGE does not consider the extract’s size, it has been supplemented with a new evaluation strategy based on the ratio of sentences intersections between the automatic and the reference extracts and the condensation rate. The MLS model has been compared with the classical LSA that uses the traditional definition of the singular value decomposition and with the traditional Jaccard and VSM text extractions. The results of our comparison showed that the run of the LSA procedure in the MLS-based extraction reduced by 52%, and the original matrix dimensions dwindled by 65%. Also, the new method achieved remarkable accuracy results. We found that combining the centrality feature with the proposed multi-layer framework yields a significant solution regarding the efficiency and precision in the field of automatic text extraction. The automatic synonym extractor built in this research is based on statistical approaches. The traditional statistical approach in synonyms extraction is time-consuming, especially in real applications such as query expansion and text mining. It is necessary to develop a new model to improve the efficiency and accuracy during the extraction. The research presents the NBDV model in synonym extraction that replaces the traditional tf.idf weighting scheme with a new weighting scheme called the Orbit Weighing Scheme (OWS). The OWS weights the verbs based on their singularity to a group of nouns. The method was manipulated over the Arabic language because it has more varieties in constructing the verbal sentences than the other languages. The results of the new method were compared with traditional models in automatic synonyms extraction, such as the Skip-Gram and Continuous Bag of Words. The NBDV method obtained significant accuracy results (47% R and 51% P in the dictionary-based evaluation, and 57.5% precision using human experts’ assessment). It is found that on average, the synonyms extraction of a single noun requires the process of 186 verbs, and in 63% of the runs, the number of singular verbs was less than 200. It is concluded that the developed new method is efficient and processed the single run in linear time complexity (O(n)). After implementing the text extractors and the synonyms extractor, the VSM model was used to build the IR system. The inverted index was constructed from two sources of data, the original documents taken from various datasets of the Arabic language (and one from the English language for comparison purposes), and from the automatic summaries of the same documents that were generated from the automatic extractors developed in this research. A series of experiments were held to test the effectiveness of the extraction methods developed in this research on the relevancy of the IR system. The experiments examined three groups of queries, 60 Arabic queries with manual relevancy assessment, 100 Arabic queries with automatic relevancy assessment, and 60 English queries with automatic relevancy assessment. Also, the experiments were performed with and without synonyms expansions using the synonyms generated by the synonyms extractor developed in the research. The positive influence of the MLS text extraction was clear in the efficiency of the IR system without noticeable loss in the relevancy results. The intrinsic evaluation in our research showed that the bag of words models failed to reduce the text size, and this appears clearly in the large values of the condensation Rate (68%). Comparing with the previous publications that addressed the use of summaries as a source of the index, The relevancy assessment of our work was higher than their relevancy results. And, the relevancy results were obtained at 42% condensation rate, whereas, the relevancy results in the previous publication achieved at high values of condensation rate. Also, the MLS-based retrieval constructed an inverted index that is 58% smaller than the Main Corpus inverted index. The influence of the NBDV synonyms expansion on the IR relevancy had a slightly positive impact (only 1% improvement in both recall and precision), but no negative impact has been recorded in all relevancy measures

    Arabic stemmers and their effectiveness on the information retrieval system

    Full text link
    Arabic is a semitic language that has a complex morphology. Therefore, using a stemmer algorithm in an information retrieval system is almost always beneficial; An Arabic stemmer has been implemented and included in the information retrieval system developed at the Information Science Research Institute at the University of Nevada Las Vegas. The Arabic stemmer is written in the Ruby Language and removes affixes then matches the remaining word against patterns of the same length. The retrieval experiment uses the TREC collection which consists of over a million documents. We will test the effectiveness of the Arabic stemmer using recall/precision measurement and compare the result to other stemmers

    The design, implementation, and effectiveness of a Farsi word stemmer

    Full text link
    A stemmer for the Farsi language has been designed, implemented, and evaluated. The stemmer uses Farsi morphology to remove affixes, producing effective stems. The implementation is written in C, using strings of unicode-encoded characters to represent Farsi words. It is meant to enhance the Farsi information retrieval system currently being developed at the Information Science Research Institute at the University of Nevada at Las Vegas. The effectiveness of the Farsi stemmer and stopword list on recall/precision was tested

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries
    • …
    corecore