
    The Effect of the Multi-Layer Text Summarization Model on the Efficiency and Relevancy of the Vector Space-based Information Retrieval

    The massive upload of text to the internet creates a huge inverted index in information retrieval systems, which hurts their efficiency. The purpose of this research is to measure the effect of the Multi-Layer Similarity model of automatic text summarization on building an informative and condensed inverted index in IR systems. To achieve this purpose, we summarized a considerable number of documents using the Multi-Layer Similarity model and built the inverted index from the automatic summaries generated by this model. A series of experiments was conducted to test the performance in terms of efficiency and relevancy. The experiments include comparisons with three existing text summarization models: the Jaccard Coefficient model, the Vector Space Model, and the Latent Semantic Analysis model. The experiments examined three groups of queries with manual and automatic relevancy assessment. The positive effect of the Multi-Layer Similarity model on the efficiency of the IR system was clear, without noticeable loss in the relevancy results. However, the evaluation showed that the traditional statistical models without semantic investigation failed to improve the information retrieval efficiency. Compared with previous publications that addressed the use of summaries as the index source, the relevancy assessment of our work was higher, and the Multi-Layer Similarity retrieval constructed an inverted index that was 58% smaller than the main corpus inverted index.
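The core idea above — indexing automatic summaries instead of full documents to shrink the inverted index — can be sketched minimally as follows. The toy documents and summaries are hypothetical illustrations, not the paper's actual corpus or summarizer:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def index_size(index):
    """Total number of postings (term-document pairs)."""
    return sum(len(postings) for postings in index.values())

# Full documents vs. their (hypothetical) automatic summaries.
corpus = {
    "d1": "the massive upload of text creates a huge inverted index",
    "d2": "automatic text summarization condenses documents before indexing",
}
summaries = {
    "d1": "huge inverted index",
    "d2": "summarization condenses documents",
}

full_index = build_inverted_index(corpus)
summary_index = build_inverted_index(summaries)
reduction = 1 - index_size(summary_index) / index_size(full_index)
print(f"postings reduced by {reduction:.0%}")  # → postings reduced by 65%
```

A good summarizer keeps the informative terms, so the smaller index can still answer most queries — which is what the reported relevancy experiments test.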

    The Enhancement of Arabic Information Retrieval Using Arabic Text Summarization

    The massive upload of text to the internet makes text overhead one of the important challenges facing Information Retrieval (IR) systems. The purpose of this research is to maintain reasonable relevancy and increase the efficiency of the information retrieval system by creating a short and informative inverted index and by supporting the user query with a set of semantically related terms extracted automatically. To achieve this purpose, two new models for text mining were developed and implemented. The first, called the Multi-Layer Similarity (MLS) model, uses Latent Semantic Analysis (LSA) within an efficient framework. The second, called the Noun Based Distinctive Verbs (NBDV) model, investigates the semantic meanings of nouns by identifying the set of distinctive verbs that describe them. Arabic was chosen as the language of the case study because one of the primary objectives of this research is to measure the effect of the MLS and NBDV models on the relevancy of Arabic IR (AIR) systems that use the Vector Space Model, and to measure the effect of the MLS model on the recall and precision of Arabic text extraction systems. This research began with a thorough review of what has been achieved in the field of Arabic information retrieval. In this regard, a quantitative survey was conducted to measure the relevancy enhancements achieved so far. The survey reviewed the impact of statistical and morphological analysis of Arabic text on improving AIR relevancy, and measured the contributions of stemming, indexing, query expansion, automatic text summarization, text translation, part-of-speech tagging, and named entity recognition to enhancing the relevancy of AIR. Our survey emphasized the quantitative relevancy measurements provided in the surveyed publications.
The survey showed that researchers have made significant achievements, especially in building accurate stemmers, with precision rates approaching 97%, and in measuring the impact of different indexing strategies. Query expansion and text translation showed a positive effect on relevancy. However, other tasks such as named entity recognition and automatic text summarization still need more research to realize their impact on Arabic IR. The use of LSA in text mining demands large space and time resources. In the first part of this research, a new text extraction model was proposed, designed, implemented, and evaluated. The new method sets a framework for efficiently employing statistical semantic analysis in automatic text extraction. It uses the centrality feature, which estimates the similarity of a sentence with respect to every other sentence in the text. The new model omits segments of text that have significant verbatim, statistical, or semantic resemblance to previously processed text. The identification of text resemblance is based on a new multi-layer process that estimates text similarity at three statistical layers: it employs the Jaccard coefficient similarity and the Vector Space Model (VSM) in the first and second layers respectively, and uses Latent Semantic Analysis in the third layer. Because of its high time complexity, the Multi-Layer model restricts the LSA layer to the text segments whose similarities the Jaccard and VSM layers failed to estimate. The ROUGE tool was used in the evaluation, and because ROUGE does not consider the extract's size, it was supplemented with a new evaluation strategy based on the ratio of sentence intersections between the automatic and reference extracts and on the condensation rate.
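The layered similarity check described above — cheap Jaccard and VSM cosine layers first, with the expensive LSA layer reserved for segments those layers cannot decide — can be sketched as follows. The thresholds are illustrative assumptions, and the LSA layer (which needs an SVD) is injected as a stand-in callable rather than implemented:

```python
import math
from collections import Counter

def jaccard(t1, t2):
    """Layer 1: set-overlap similarity between token lists."""
    a, b = set(t1), set(t2)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(t1, t2):
    """Layer 2: VSM cosine similarity over raw term frequencies."""
    a, b = Counter(t1), Counter(t2)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_redundant(s1, s2, lsa_sim, hi=0.8, lo=0.2):
    """Cascade: each cheap layer may decide on its own; only a
    borderline score falls through to the expensive LSA layer."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    for sim in (jaccard(t1, t2), cosine(t1, t2)):
        if sim >= hi:
            return True   # clearly redundant: omit the segment
        if sim <= lo:
            return False  # clearly novel: keep the segment
    return lsa_sim(s1, s2) >= hi  # borderline: invoke LSA

# Layer 3 stand-in: a real implementation would compute LSA similarity.
fake_lsa = lambda s1, s2: 0.9
print(is_redundant("a b c d", "a b x y", fake_lsa))  # borderline pair reaches LSA
```

Restricting LSA to the undecided pairs is what yields the reported 52% reduction in LSA runs.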
The MLS model was compared with the classical LSA that uses the traditional definition of the singular value decomposition, and with traditional Jaccard and VSM text extraction. The comparison showed that the runs of the LSA procedure in the MLS-based extraction were reduced by 52%, and the original matrix dimensions shrank by 65%. The new method also achieved remarkable accuracy results. We found that combining the centrality feature with the proposed multi-layer framework yields a significant gain in efficiency and precision in the field of automatic text extraction. The automatic synonym extractor built in this research is based on statistical approaches. The traditional statistical approach to synonym extraction is time-consuming, especially in real applications such as query expansion and text mining, so it was necessary to develop a new model that improves efficiency and accuracy during extraction. The research presents the NBDV model for synonym extraction, which replaces the traditional tf.idf weighting scheme with a new scheme called the Orbit Weighting Scheme (OWS). The OWS weights verbs based on their singularity to a group of nouns. The method was applied to the Arabic language because it has greater variety in constructing verbal sentences than other languages. The results of the new method were compared with traditional models for automatic synonym extraction, such as Skip-Gram and Continuous Bag of Words. The NBDV method obtained significant accuracy results (47% recall and 51% precision in the dictionary-based evaluation, and 57.5% precision using human experts' assessment). On average, the synonym extraction for a single noun requires processing 186 verbs, and in 63% of the runs the number of singular verbs was less than 200. We conclude that the new method is efficient, processing a single run in linear time (O(n)).
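The weighting idea behind the OWS — scoring a verb highly when its occurrences concentrate on a target group of nouns — can be sketched roughly as follows. The co-occurrence table and the concentration formula are illustrative assumptions, not the thesis's exact OWS definition:

```python
def orbit_weights(cooccurrence, noun_group):
    """Weight each verb by the share of its occurrences that fall on the
    target noun group: verbs singular to the group score near 1."""
    weights = {}
    for verb, noun_counts in cooccurrence.items():
        total = sum(noun_counts.values())
        in_group = sum(c for n, c in noun_counts.items() if n in noun_group)
        weights[verb] = in_group / total if total else 0.0
    return weights

# Toy verb -> {noun: co-occurrence count} table (hypothetical data).
cooc = {
    "barks": {"dog": 9, "cat": 1},
    "eats":  {"dog": 5, "cat": 5, "bird": 5},
}
w = orbit_weights(cooc, {"dog"})
distinctive = [v for v, s in w.items() if s >= 0.8]
print(distinctive)  # → ['barks']: "eats" is too generic to be distinctive
```

Two nouns that share the same high-weight distinctive verbs are then candidate synonyms, which is the intuition the NBDV extractor builds on.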
After implementing the text extractors and the synonym extractor, the VSM was used to build the IR system. The inverted index was constructed from two sources of data: the original documents, taken from various Arabic-language datasets (and one English-language dataset for comparison purposes), and the automatic summaries of the same documents generated by the extractors developed in this research. A series of experiments was conducted to test the effect of the extraction methods on the relevancy of the IR system. The experiments examined three groups of queries: 60 Arabic queries with manual relevancy assessment, 100 Arabic queries with automatic relevancy assessment, and 60 English queries with automatic relevancy assessment. The experiments were also performed with and without synonym expansion, using the synonyms generated by the synonym extractor developed in this research. The positive influence of the MLS text extraction on the efficiency of the IR system was clear, without noticeable loss in the relevancy results. The intrinsic evaluation showed that the bag-of-words models failed to reduce the text size, which appears clearly in their large condensation rate (68%). Compared with previous publications that addressed the use of summaries as the index source, the relevancy assessment of our work was higher; moreover, our relevancy results were obtained at a 42% condensation rate, whereas the relevancy results in previous publications were achieved at much higher condensation rates. The MLS-based retrieval also constructed an inverted index 58% smaller than the main corpus inverted index. The NBDV synonym expansion had a slightly positive impact on IR relevancy (only a 1% improvement in both recall and precision), but no negative impact was recorded in any relevancy measure.
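The retrieval setup described above — VSM ranking with optional synonym expansion of the query — can be sketched minimally. The toy documents and synonym map are hypothetical stand-ins; the actual system expands queries with NBDV-extracted synonyms over Arabic text:

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def retrieve(query_terms, docs, synonyms=None):
    """Rank documents by cosine similarity; optionally expand the
    query with automatically extracted synonyms before matching."""
    terms = list(query_terms)
    if synonyms:
        for t in query_terms:
            terms.extend(synonyms.get(t, []))
    q = Counter(terms)
    scores = {i: cosine(q, Counter(text.lower().split()))
              for i, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": "car repair shop",
        "d2": "automobile maintenance",
        "d3": "train schedule"}
syn = {"car": ["automobile"]}
print(retrieve(["car"], docs))       # d2 is unreachable without expansion
print(retrieve(["car"], docs, syn))  # expansion lets d2 match and rank first
```

The abstract's finding — a small positive relevancy effect with no measured downside — corresponds to expansion recovering documents like d2 without displacing already-relevant lexical matches.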

    Pre Processing Techniques for Arabic Documents Clustering

    Clustering of text documents is an important technique for document retrieval; it aims to organize documents into meaningful groups or clusters. Text preprocessing plays a major role in enhancing the clustering of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering, studying the effectiveness of term pruning, term weighting using TF-IDF, morphological analysis (root-based stemming, light stemming, and raw text), and normalization. The experimental work compared the most widely used partitional clustering algorithm, K-means, with another partitional algorithm, Expectation Maximization (EM). The Euclidean and Manhattan distance functions were also compared in order to determine which produces the best document clustering results. Clustering quality was evaluated across many combinations of preprocessing techniques. The experimental results show that document clustering can be enhanced by applying TF-IDF term weighting and term pruning with a small minimum term frequency. In morphological analysis, light stemming was found more appropriate than root-based stemming or raw text. Normalization also improved the clustering of Arabic documents and enhanced the evaluation results.
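The two preprocessing steps the abstract singles out — TF-IDF weighting plus term pruning with a small minimum frequency — can be sketched together. The tokenization and the pruning rule (drop terms below a collection-frequency floor) are simplified assumptions; the real pipeline also applies stemming and normalization to Arabic text:

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_tf=2):
    """TF-IDF document vectors with term pruning: terms whose collection
    frequency is below min_tf are dropped before weighting."""
    tokens = [d.lower().split() for d in docs]
    coll_freq = Counter(t for doc in tokens for t in doc)
    kept = {t for t, c in coll_freq.items() if c >= min_tf}
    n = len(docs)
    # Document frequency, counted only over the kept vocabulary.
    df = Counter(t for doc in tokens for t in set(doc) & kept)
    return [{t: f * math.log(n / df[t])
             for t, f in Counter(doc).items() if t in kept}
            for doc in tokens]

docs = ["retrieval of documents",
        "clustering of documents",
        "stemming improves clustering"]
vecs = tfidf_vectors(docs, min_tf=2)  # rare terms like "retrieval" are pruned
```

These sparse vectors would then be fed to K-means or EM with a Euclidean or Manhattan distance, which is the comparison the experiments perform.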

    Ontology-Based Model for Information Retrieval: an Application of Time Nouns in Nahj Al-Balagha

    The internet plays a key role in life through the massive amount of data it provides. Currently, managing data and finding information on the internet are inaccurate because they depend on the form of a word rather than its meaning. Data representation and access are important factors in Information Retrieval (IR). To overcome the problem of document similarity, various similarity measurements are in place that function according to weighting, indexing, and matching. Ontology is a data management infrastructure that gives precedence to the meaning of a word, the relationships between words, and the domain of knowledge. This paper presents a proposal for a semantic system based on a particular field of knowledge (here, time nouns in Nahj Al-Balagha) that relies on semantic input by indexing the search engine using a Vector Space Model (VSM). The aim of this work is to improve the retrieved semantic information by constructing a query based on the matching and similarity between the query words in the system. This paper builds upon previous work carried out in the same area [1]. The system was evaluated using the similarity rate, average precision, and recall of the experimental results.

    ON RELEVANCE FILTERING FOR REAL-TIME TWEET SUMMARIZATION

    Real-time tweet summarization (RTS) systems require mechanisms for capturing relevant tweets, identifying novel tweets, and capturing timely tweets. In this thesis, we tackle the RTS problem with a main focus on relevance filtering. We experimented with different traditional retrieval models, and we propose two extensions to alleviate the sparsity and topic-drift challenges that affect relevance filtering. For sparsity, we propose leveraging word embeddings in Vector Space Model (VSM) term weighting to empower the system to use semantic similarity alongside lexical matching. To mitigate the effect of topic drift, we exploit explicit relevance feedback to enhance the profile representation so that it copes with the topic's development in the stream over time. We conducted extensive experiments over three standard English TREC test collections built specifically for RTS. Although the extensions do not generally exhibit better performance, they are comparable to the baselines used. Moreover, we extended an Arabic-tweets event detection test collection, called EveTAR, to support tasks that require novelty in the system's output. We collected novelty judgments using in-house annotators and used the collection to test our RTS system. We report preliminary results on EveTAR using different models of the RTS system. This work was made possible by NPRP grants # NPRP 7-1313-1-245 and # NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation).
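The embedding extension sketched in the abstract — letting a profile term match a semantically similar tweet term instead of requiring exact lexical overlap — can be illustrated as follows. The tiny 2-d embeddings and the max-similarity aggregation are illustrative assumptions, not the thesis's exact term-weighting scheme:

```python
import math

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_match(profile_terms, tweet_terms, emb):
    """Each profile term matches its most similar tweet term: exact
    lexical match scores 1, otherwise embedding cosine (0 if either
    term is out of vocabulary)."""
    def term_sim(p, t):
        if p == t:
            return 1.0
        if p in emb and t in emb:
            return cos(emb[p], emb[t])
        return 0.0
    if not profile_terms:
        return 0.0
    return sum(max(term_sim(p, t) for t in tweet_terms)
               for p in profile_terms) / len(profile_terms)

# Toy 2-d embeddings (hypothetical); real systems use pretrained vectors.
emb = {"quake": [1.0, 0.0], "earthquake": [0.9, 0.1], "football": [0.0, 1.0]}
print(semantic_match(["earthquake"], ["quake", "hits"], emb))  # ≈ 0.99
print(semantic_match(["earthquake"], ["quake", "hits"], {}))   # 0.0: lexical only
```

This is the sparsity remedy in miniature: a short tweet with no lexical overlap can still score against the profile through embedding similarity.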