11 research outputs found

    Arabic Book Retrieval using Class and Book Index Based Term Weighting

    One of the most common issues in information retrieval is document ranking. A document ranking system collects search terms from the user and retrieves documents ordered by relevance. The vector space model with TF.IDF term weighting is the most common method for this task. In this study, we are concerned with automatic retrieval from a collection of Islamic Fiqh (Law) books. The collection contains many books, each of which has tens to hundreds of pages. Each page of a book is treated as a document to be ranked against the user query. We developed a class-based indexing method called inverse class frequency (ICF) and a book-based indexing method called inverse book frequency (IBF) for this Arabic information retrieval task. These methods were then combined with the existing scheme to yield TF.IDF.ICF.IBF. The term weighting method is also used for feature selection, given the high dimensionality of the feature space. This novel method was tested on a dataset of 13 Arabic Fiqh e-books. The experimental results showed that the proposed method achieved higher precision, recall, and F-measure than the other three methods across variations of feature selection. Its best performance was obtained with the best 1000 features: precision of 76%, recall of 74%, and F-measure of 75%
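
The combined weighting can be sketched as a product of four factors, each penalizing terms that spread across many documents, classes, or books. This is a minimal illustration of the idea, not the authors' implementation; the data structures and the exact (unsmoothed) logarithm forms are assumptions.

```python
import math
from collections import Counter

def tfidf_icf_ibf(docs, classes, books):
    """Weight each term in each document by TF * IDF * ICF * IBF.

    docs:    list of token lists (one per page/document)
    classes: class label of each document
    books:   book id of each document
    """
    N = len(docs)
    n_classes = len(set(classes))
    n_books = len(set(books))

    df = Counter()   # number of documents containing each term
    cf = {}          # set of classes containing each term
    bf = {}          # set of books containing each term
    for doc, c, b in zip(docs, classes, books):
        for t in set(doc):
            df[t] += 1
            cf.setdefault(t, set()).add(c)
            bf.setdefault(t, set()).add(b)

    weights = []
    for doc in docs:
        tf = Counter(doc)
        w = {}
        for t, f in tf.items():
            idf = math.log(N / df[t])                 # rare in collection
            icf = math.log(n_classes / len(cf[t]))    # concentrated in few classes
            ibf = math.log(n_books / len(bf[t]))      # concentrated in few books
            w[t] = f * idf * icf * ibf
        weights.append(w)
    return weights
```

A term that appears in every class (or every book) receives a zero ICF (or IBF) factor, which is also what makes the combined weight usable for feature selection.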

    Arabic Text Classification Using Learning Vector Quantization

    Text classification aims to automatically assign documents to predefined categories. In our research, we used a neural network model called Learning Vector Quantization (LVQ) to classify Arabic text. This model has not been applied before in this area. It is based on the Kohonen self-organizing map (SOM), which can organize vast document collections according to textual similarity. Past experience also suggests that the model requires fewer training examples and is much faster than other classification methods. In this research we first selected Arabic documents from different domains. We then applied suitable pre-processing methods, such as term weighting schemes and Arabic morphological analysis (stemming and light stemming), to prepare the data set for classification with the selected algorithm. After that, we compared the results obtained from different improved LVQ versions (LVQ2.1, LVQ3, OLVQ1 and OLVQ3). Finally, we compared our work with the best-known classification algorithms: decision tree (DT), K Nearest Neighbors (KNN) and Naïve Bayes. The results show that the LVQ algorithms, especially LVQ2.1, achieved higher accuracy in less time than the other classification algorithms and other neural network algorithms
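
The core LVQ idea can be shown in a few lines: each class is represented by prototype vectors, and the prototype nearest to a training sample is pulled toward it when their labels agree and pushed away otherwise. This is a sketch of plain LVQ1 (the abstract's LVQ2.1/LVQ3/OLVQ variants refine the update rule); the toy data and hyperparameters are illustrative.

```python
import numpy as np

def lvq1_train(X, y, prototypes, proto_labels, lr=0.1, epochs=20):
    """One LVQ1 training loop: attract the winning prototype to
    same-class samples, repel it from different-class samples."""
    P = prototypes.astype(float).copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            d = np.linalg.norm(P - x, axis=1)
            w = int(np.argmin(d))             # index of winning prototype
            if proto_labels[w] == label:
                P[w] += lr * (x - P[w])       # attract
            else:
                P[w] -= lr * (x - P[w])       # repel
    return P

def lvq_predict(X, P, proto_labels):
    """Assign each sample the label of its nearest prototype."""
    return [proto_labels[int(np.argmin(np.linalg.norm(P - x, axis=1)))]
            for x in X]
```

For text, `X` would hold the term-weighted document vectors produced by the pre-processing step described in the abstract.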

    An Intelligent Framework for Natural Language Stems Processing

    This work describes an intelligent framework that derives stems from inflected words. Word stemming is one of the most important factors affecting the performance of many language applications, including parsing, syntactic analysis, speech recognition, retrieval systems, medical systems, tutoring systems, biological systems, …, and translation systems. Computational stemming is essential for natural language processing of a language such as Arabic, since Arabic is highly inflected; it is an urgent necessity for Arabic natural language processing. The framework is based on logic programming, enabling the computer to reason logically. It provides information on the semantics of words and resolves ambiguity. It determines the position of each addition or bound morpheme and identifies whether an inflected word is a subject, object, or something else; position identification is vital for enhancing understandability mechanisms. The proposed framework adopts a bi-directional approach: it can deduce morphemes from inflected words, or it can build inflected words from stems. It also handles multi-word expressions and identification of names. The framework is based on a definite-clause grammar whose rules are built according to Arabic patterns (templates), expressed in the programming language Prolog as predicates in first-order logic, combined with object-oriented programming conventions to address problems of complexity. The complexity of natural language processing comes from the huge amount of storage required, which reduces the efficiency of the software system. To deal with this complexity, the research uses Prolog, as it is based on efficient and simple proof routines and has dynamic memory allocation with automatic garbage collection. 
This facility, in addition to relieve th
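
The general mechanism of stripping bound morphemes from an inflected surface form can be illustrated outside Prolog as well. The sketch below is a simplified light stemmer in Python; the prefix and suffix lists are small illustrative samples, not the framework's actual rule set, which the thesis encodes as definite-clause grammar rules over Arabic templates.

```python
# Illustrative affix lists only; real Arabic stemmers use much larger,
# linguistically motivated sets and template (pattern) matching.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]

def light_stem(word, min_stem=2):
    """Strip at most one prefix and one suffix, longest match first,
    refusing any strip that would leave the stem too short."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            word = word[:-len(s)]
            break
    return word
```

The bi-directional capability the abstract describes would correspond to also running these rules generatively, composing a stem with affixes, which is where a logic-programming formulation is naturally stronger than this one-way procedure.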

    Mixed-Language Arabic-English Information Retrieval

    This thesis addresses the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is essential first to suppress the impact of the problems caused by the mixed-language nature of both queries and documents, which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this model, the term frequency, document frequency and document length components of mixed queries are estimated and adjusted regardless of language, while the model also accounts for uniquely mixed-language features of queries and documents, such as terms co-occurring in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly those in English), because the latter have high document frequencies (and thus low weights) in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries
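
The skew the abstract describes comes from comparing raw IDF values computed against monolingual collections of very different sizes and term distributions. One simple way to picture a "moderated" IDF is to dampen it with an exponent so that rare non-English terms cannot dominate a mixed query; this is an illustrative stand-in, not the thesis's actual re-weighting formula, and the smoothing constants are assumptions.

```python
import math

def monolingual_idf(df, n):
    """Standard smoothed IDF of a term within its own collection."""
    return math.log((n + 1) / (df + 1))

def moderated_idf(df, n, alpha=0.5):
    """Dampened IDF: an exponent alpha < 1 compresses the gap between
    very rare and very common terms, reducing how much a low-frequency
    non-English term can outweigh a high-frequency English term."""
    return math.log((n + 1) / (df + 1)) ** alpha
```

With, say, an Arabic term in 5 of 100 Arabic documents and an English technical term in 900 of 1000 English documents, the ratio between their moderated weights is much smaller than between their raw IDFs, which is the qualitative effect the thesis aims for.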

    Matching Meaning for Cross-Language Information Retrieval

    Cross-language information retrieval concerns the problem of finding information in one language in response to search requests expressed in another language. The explosive growth of the World Wide Web, with access to information in many languages, has provided a substantial impetus for research on this important problem. In recent years, significant advances in cross-language retrieval effectiveness have resulted from the application of statistical techniques to estimate accurate translation probabilities for individual terms from automated analysis of human-prepared translations. With few exceptions, however, those results have been obtained by applying evidence about the meaning of terms to translation in one direction at a time (e.g., by translating the queries into the document language). This dissertation introduces a more general framework for the use of translation probability in cross-language information retrieval, based on the notion that information retrieval depends fundamentally on matching what the searcher means with what the document author meant. This perspective yields a simple computational formulation that provides a natural way of combining what have traditionally been known as query translation and document translation. When combined with the use of synonym sets as a computational model of meaning, cross-language search results are obtained using English queries that approximate a strong monolingual baseline for both French and Chinese documents. Two well-known techniques (structured queries and probabilistic structured queries) are also shown to be special cases of this model under restrictive assumptions
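
The matching-meaning idea can be pictured as scoring a query term against a document term through a shared space of meanings: each term maps to a probability distribution over synonym sets, and the match score sums the product of the two distributions over shared synsets. This is a minimal sketch of that formulation; the synset identifiers and probabilities below are invented for illustration.

```python
def meaning_match(p_query, p_doc):
    """Probability that a query term and a document term refer to the
    same meaning: sum over synsets of p(synset | query term) *
    p(synset | document term). Each argument maps synset id -> prob."""
    return sum(p * p_doc.get(s, 0.0) for s, p in p_query.items())
```

In this view, estimating `p_query` corresponds to query translation and `p_doc` to document translation, which is how the framework unifies the two directions.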

    Contribution to improving information retrieval through semantic methods: application to the Arabic language

    An information retrieval system is a set of programs and modules that interfaces with the user to accept and interpret a query, search the index, and return a ranked list of selected documents to that user. The greatest challenge for such a system is coping with the large volume of multimodal and multilingual information available in document bases or on the web, in order to find the items that best match users' needs. In this work we present two contributions. In the first, we propose a new approach to query reformulation in the context of Arabic information retrieval. The principle is to represent the query as a weighted semantic tree, to better identify the user's information need; its nodes represent concepts (synsets) linked by semantic relations. The tree is built by pseudo-relevance feedback combined with the Arabic WordNet semantic resource. Experimental results show a good improvement in the performance of the information retrieval system. In the second contribution, we propose a new approach to building an Arabic information retrieval test collection. The approach combines a pooling strategy using search engines with the Naïve Bayes machine-learning classification algorithm. For the experiments we created a new test collection consisting of a document base of 632 documents and 165 queries with their relevance judgments over several topics. 
    The experiments also showed the effectiveness of the Bayesian classifier for recovering document relevance; moreover, it performed well after semantic enrichment of the document base with the word2vec model
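
The pseudo-relevance feedback step at the heart of the first contribution can be sketched simply: assume the top-ranked documents of an initial retrieval run are relevant, and add their most frequent new terms to the query. The sketch below shows only this base mechanism; organizing the expansion terms into a weighted semantic tree via Arabic WordNet, as the thesis does, is omitted, and the parameters are illustrative.

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=3, n_terms=2):
    """Pseudo-relevance feedback: pool term counts from the top-k
    ranked documents and append the most frequent terms not already
    in the query."""
    pool = Counter()
    for doc in ranked_docs[:k]:       # docs are token lists
        pool.update(doc)
    for t in query_terms:             # keep only new terms
        pool.pop(t, None)
    expansion = [t for t, _ in pool.most_common(n_terms)]
    return query_terms + expansion
```

The thesis's refinement is to weight and structure these candidates by their semantic relations (synset links) rather than taking raw frequency at face value.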

    New techniques and framework for sentiment analysis and tuning of CRM structure in the context of Arabic language

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Knowing customers’ opinions regarding services received has always been important for businesses. It has been acknowledged that both Customer Experience Management (CEM) and Customer Relationship Management (CRM) can help companies take informed decisions to improve their performance in the decision-making process. However, real-world applications are not so straightforward. A company may face hard decisions over the differences between the opinions predicted by CRM and actual opinions collected in CEM via social media platforms. Until recently, how to integrate the unstructured feedback from CEM directly into CRM, especially for the Arabic language, was still an open question. Furthermore, accurate labelling of unstructured feedback is essential for the quality of CEM. Finally, CRM needs to be tuned and revised based on the feedback from social media to realise its full potential, but the tuning mechanism for CEM at different levels has not yet been clarified. Facing these challenges, this thesis presents key techniques and a framework to integrate Arabic sentiment analysis into CRM. First, as text pre-processing and classification are considered crucial to sentiment classification, an investigation is carried out to find the optimal techniques for the pre-processing and classification steps of Arabic sentiment analysis. Recommendations are proposed for sentiment classification in MSA as well as Saudi dialects. Second, to deal with the complexities of the Arabic language and to help operators identify possible conflicts in their original labelling, this study proposes techniques to improve the labelling process of Arabic sentiment analysis through the introduction of neural classes and relabelling. 
    Finally, a framework for adjusting CRM via CEM is proposed, covering both the structure of the CRM system (at the sentence level) and inaccuracies in the criteria or weights employed in the CRM system (at the aspect level). To ensure the robustness and repeatability of the proposed techniques and framework, the results of the study are further validated with real-world applications from different domains
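
Pre-processing is singled out above as crucial for Arabic sentiment classification. A typical first step is orthographic normalization, sketched below for illustration; the exact normalization choices (which characters to unify, whether to strip diacritics) are among the options such a study would compare, and these particular rules are common conventions rather than the thesis's prescribed pipeline.

```python
import re

# Arabic diacritic (tashkeel) marks occupy U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text):
    """Common MSA normalization steps before feature extraction."""
    text = DIACRITICS.sub("", text)        # strip diacritics
    text = re.sub("[إأآا]", "ا", text)      # unify alef variants
    text = re.sub("ى", "ي", text)          # alef maqsura -> ya
    text = re.sub("ؤ", "و", text)          # hamza on waw -> waw
    text = re.sub("ة", "ه", text)          # ta marbuta -> ha
    return text
```

Normalizing this way collapses spelling variants of the same word into one token, which directly affects the vocabulary the sentiment classifier sees.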

    Assessing relevance using automatically translated documents for cross-language information retrieval

    This thesis focuses on the Relevance Feedback (RF) process, in the scenario of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, leading to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first relates to the user's ability to assess the relevance of texts in a foreign language, texts hand-translated into their language, and texts automatically translated into their language. The second concerns the relationship between the accuracy of the participants' judgements and the improvement achieved through the RF process. To answer these questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall moderate, and varies greatly across query topics. This work advances the existing research on RF by considering a CLIR scenario and carrying out user experiments which analyse aspects of RF and CLIR that had remained unexplored until now. 
    The contributions of this work also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and several experiments using Latent Semantic Indexing which contribute data points to CLIR theory
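
The query reformulation step that relevance judgements feed into is classically done with the Rocchio method: move the query vector toward the centroid of the judged-relevant documents and away from the non-relevant ones. This sketch shows that standard formulation (the thesis studies how judging errors on translated documents affect such a process, not this exact code); the weights alpha, beta, gamma are conventional defaults.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback over sparse term-weight dicts.

    query:       dict term -> weight
    relevant:    list of dicts for documents judged relevant
    nonrelevant: list of dicts for documents judged non-relevant
    """
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        non = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        new_q[t] = alpha * query.get(t, 0.0) + beta * rel - gamma * non
    return new_q
```

A document misjudged as relevant pulls the query toward its terms, which is precisely the mechanism by which judging accuracy on machine-translated text can degrade (or, as the results here show, only moderately affect) retrieval.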