
    Semantic-based Ontology for Malay Qur'an Reader

    The Qur’an has been translated into many languages around the world by Muslim experts, including Malay. Numerous applications have been built to facilitate the retrieval of knowledge from the Malay Qur’an, yet few resources and tools are available or accessible for research on it. Furthermore, several issues must be considered when dealing with Malay Qur’an translation, such as word ambiguity, the lack of equivalent words between Malay and English or Malay and Arabic, and differences in word, sentence, and discourse structure across these languages. This research therefore summarizes the search techniques used in existing research on the Qur’an. It also reviews previous work on Qur’an semantic search and Qur’an ontology-based search, focusing on the Malay Qur’an. The review helps to address the general problems and limitations that influence the accessibility of the Malay Qur’an, and the paper proposes a research framework for a new semantic-based ontology for the Malay Qur’an. The final outcome will be an accessible tool that helps Malay readers understand the Qur’an better.
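The ontology-based retrieval idea can be sketched as a synonym/concept expansion step over query terms. The concept map below is a hypothetical two-entry fragment for illustration only, not the ontology this research proposes:

```python
# Minimal sketch: ontology-backed query expansion for a Malay Qur'an search.
# The concept map is a hypothetical fragment, not the paper's ontology.
ONTOLOGY = {
    "sembahyang": {"solat", "salat"},   # prayer and its variant spellings
    "puasa": {"saum", "siyam"},         # fasting
}

def expand_query(terms):
    """Expand each query term with its ontology synonyms."""
    expanded = set()
    for t in terms:
        expanded.add(t)
        expanded |= ONTOLOGY.get(t, set())
    return expanded
```

The expanded term set would then be passed to an ordinary keyword retrieval stage, so verses using a variant spelling or synonym are still matched.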

    Improving Arabic Light Stemming in Information Retrieval Systems

    Information retrieval (IR) refers to the retrieval of textual documents such as newsprint and magazine articles or Web documents. Due to extensive research in the IR field, many retrieval techniques have been developed for the Arabic language. The main objectives of this research are to improve Arabic information retrieval by enhancing light stemming and the preprocessing stage, to contribute to the open-source community, and to establish guidelines for Arabic normalization and stop-word removal. To achieve these objectives, we created a GUI toolkit that implements the preprocessing stages necessary for information retrieval. One of these steps is normalization, for which we introduce a set of rules intended to be standardized and improved by other researchers. The next preprocessing step we improved is stop-word removal: we introduce two stop-word lists, an intensive list that reduces the size of the index and the number of ambiguous words, and a light list that yields better recall in IR applications. We improved light stemming by updating a suffix rule and by introducing a manually collected list of 100 Arabized words; these loanwords entered Arabic from other languages and should not follow the stemming rules. We show how this improves results compared with two popular stemming algorithms, the Khoja and Larkey stemmers. The proposed toolkit was integrated into the popular Terrier IR platform, for which we implemented Arabic language support, and we used Terrier's TF-IDF scoring model. We tested our results on the OSAC datasets, implementing the proposed systems in Java. The test infrastructure consisted of a Core i7 CPU running at 3.4 GHz with 8 GB of RAM.
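A minimal sketch of the normalization plus light-stemming pipeline described above. The normalization rules, affix lists and the Arabized-word exception set are illustrative stand-ins, not the authors' exact rules or their 100-word list:

```python
import re

# Sketch of Arabic normalization followed by light stemming with an
# exception list for Arabized loanwords. All rule sets here are
# illustrative, not the authors' exact rules.
ARABIZED = {"تلفزيون", "كمبيوتر"}        # loanwords that must not be stemmed

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel marks

def normalize(word):
    word = DIACRITICS.sub("", word)      # strip diacritics
    word = re.sub("[أإآ]", "ا", word)    # unify alef forms
    word = word.replace("ة", "ه")        # ta marbuta -> ha
    word = word.replace("ى", "ي")        # alef maqsura -> ya
    return word

PREFIXES = ["ال", "و", "ف", "ب", "ك", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ه"]

def light_stem(word):
    word = normalize(word)
    if word in ARABIZED:                 # loanwords bypass stemming
        return word
    for p in PREFIXES:                   # strip at most one prefix
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:                   # strip at most one suffix
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```

A real light stemmer (e.g. Larkey's) applies more rules and ordering constraints; the point here is only the exception-list mechanism for Arabized words.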

    Users' Traces for Enhancing Arabic Facebook Search

    This paper proposes an approach to Facebook search in Arabic that exploits several users' traces (e.g. comments, shares, reactions) left on Facebook posts to estimate their social importance. Our goal is to show how these social traces (signals) can play a vital role in improving Arabic Facebook search. Firstly, we identify the polarities (positive or negative) carried by textual signals (e.g. comments) and non-textual ones (e.g. the reactions love and sad) for a given Facebook post; the polarity of each comment on a post is estimated with an Arabic neural sentiment model. Secondly, we group signals according to their complementarity using feature selection algorithms. Thirdly, we apply learning-to-rank (LTR) algorithms to re-rank Facebook search results based on the selected groups of signals. Finally, experiments are carried out on 13,500 Facebook posts collected from 45 topics in Arabic. The results reveal that Random Forests combined with ReliefFAttributeEval (RLF) was the most effective LTR approach for this task.
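The signal-based re-ranking idea can be sketched with a hand-weighted score combining text relevance with social signals. The feature names and weights below are assumptions; the paper learns the combination with LTR algorithms rather than fixing weights:

```python
# Illustrative re-ranking of Facebook posts: text relevance is blended
# with social signals (comment polarity counts, reactions). The weights
# are made up; in the paper they are learned with learning-to-rank.
def social_score(post):
    polarity = post["positive_comments"] - post["negative_comments"]
    reactions = post["love"] - post["sad"]
    return 0.7 * post["relevance"] + 0.2 * polarity + 0.1 * reactions

def rerank(posts):
    """Sort posts by combined relevance + social importance, best first."""
    return sorted(posts, key=social_score, reverse=True)
```

Under this sketch, a slightly less relevant post with strongly positive signals can outrank a more relevant but negatively received one.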

    Mixed-Language Arabic-English Information Retrieval

    This thesis addresses the problem of mixed querying in cross-language information retrieval (CLIR). It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents regardless of their languages. To achieve this goal, it is first essential to suppress most of the problems caused by the mixed-language nature of both queries and documents, which bias the final ranked list. A cross-lingual re-weighting model was therefore developed. In this model, the term frequency, document frequency and document length components of mixed queries are estimated and adjusted regardless of language, while unique mixed-language features of queries and documents, such as terms co-occurring in two different languages, are taken into account. Furthermore, in mixed queries the non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of the technical terms (mostly those in English), because the latter have high document frequencies, and thus low weights, in their corresponding collection (mostly the English one); this phenomenon is caused by the dominance of English in scientific domains. Accordingly, the thesis also proposes a re-weighted Inverse Document Frequency (IDF) scheme to moderate the effect of overweighted terms in mixed queries.
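One simple way to realise the re-weighting intuition is to compute a term's IDF within its own language's sub-collection rather than over the whole mixed collection, so English technical terms are not penalised for the sheer size of the English collection. The exact formula below, including the smoothing, is an assumption for illustration, not the thesis's model:

```python
import math

# Sketch of per-language IDF: document frequency is taken from the term's
# own language sub-collection, not the whole mixed collection, so terms
# from the larger collection are not automatically down-weighted. The +1
# smoothing is a common convention, assumed here.
def mixed_idf(term_df, lang_collection_size):
    return math.log((lang_collection_size + 1) / (term_df + 1)) + 1
```

With this scheme a rare English term in a large English collection and a rare Arabic term in a small Arabic collection are each judged against their own collection's scale.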

    Inexpensive fusion methods for enhancing feature detection

    Recent successful approaches to high-level feature detection in image and video data have treated the problem as a pattern classification task. These typically leverage techniques from statistical machine learning, coupled with ensemble architectures that create multiple feature detection models. Once created, co-occurrence between learned features can be captured to further boost performance. At multiple stages throughout these frameworks, various pieces of evidence can be fused together in order to boost performance. These approaches, whilst very successful, are computationally expensive and, depending on the task, require significant computational resources. In this paper we propose two fusion methods that combine the output of an initial basic statistical machine learning approach with a lower-quality information source, in order to gain diversity in the classified results whilst requiring only modest computing resources. Our approaches, validated experimentally on TRECVid data, are designed to be complementary to existing frameworks and can be regarded as possible replacements for the more computationally expensive combination strategies used elsewhere.
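A minimal sketch of the kind of inexpensive late fusion argued for above: a weighted linear combination of a primary detector's confidence scores with a cheaper, lower-quality source. The weight `alpha` is illustrative and would in practice be tuned on held-out data:

```python
# Inexpensive late-fusion sketch: blend the confidences of a primary
# classifier with a cheaper, lower-quality information source via a
# weighted sum. alpha is an assumed constant, tuned in practice.
def fuse(primary_scores, secondary_scores, alpha=0.8):
    return [alpha * p + (1 - alpha) * s
            for p, s in zip(primary_scores, secondary_scores)]
```

The cost is a single pass over the score lists, in contrast to retraining or ensemble-level combination strategies.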

    Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages

    In the literature, high-dimensional data reduces the efficiency of clustering algorithms. Clustering Arabic text is challenging because interpreting its semantics requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential for selecting and identifying the appropriate features in a reduced high-dimensional space. There is a need for a suitable design of feature selection and reduction methods that yields a more relevant, meaningful and reduced representation of Arabic texts to ease the clustering process. This research developed three methods for analyzing the features of Arabic Web text. The first is a hybrid feature selection method that selects the informative term representation within Arabic Web pages; it incorporates three feature selection methods, Chi-square, Mutual Information and Term Frequency–Inverse Document Frequency, to build a hybrid model. The second is a latent document vectorization method that represents documents as probability distributions in the vector space; it overcomes the problem of high dimensionality by reducing the dimensional space. To extract the best features, two document vectorizer methods were implemented: a Bayesian vectorizer and a semantic vectorizer. The third is an Arabic semantic feature analysis that improves the capability of Arabic Web analysis; it ensures a good design for the clustering method, optimizing clustering ability when analysing these Web pages by overcoming the problems of term representation, semantic modeling and dimensionality reduction. Different experiments were carried out with k-means clustering on two data sets. The methods reduce high-dimensional data and identify the semantic features shared between similar Arabic Web pages, which are grouped together in one cluster. These pages were clustered according to the semantic similarities between them, yielding a small Davies–Bouldin index and high accuracy. This study contributes to research on clustering algorithms by developing three methods that identify the most relevant features of Arabic Web pages.
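The hybrid selection step can be sketched as rank aggregation over several scoring criteria. The score dictionaries below stand in for Chi-square, Mutual Information and TF-IDF scores, and the rank-sum combination rule is an assumption about how such a hybrid could work, not the thesis's exact model:

```python
# Hybrid feature selection sketch: each criterion scores every term,
# scores are converted to ranks (0 = best), and terms with the lowest
# combined rank are kept. The combination rule is illustrative.
def combined_ranks(score_lists):
    """score_lists: list of {term: score} dicts, higher score is better."""
    terms = score_lists[0].keys()
    totals = {t: 0 for t in terms}
    for scores in score_lists:
        ranked = sorted(terms, key=lambda t: scores[t], reverse=True)
        for rank, t in enumerate(ranked):
            totals[t] += rank
    return sorted(terms, key=lambda t: totals[t])  # best combined rank first

def select_features(score_lists, k):
    """Keep the k terms with the best combined rank."""
    return combined_ranks(score_lists)[:k]
```

Rank aggregation has the advantage of being scale-free: the three criteria produce scores on very different scales, and ranks sidestep any normalization.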