24 research outputs found
Sentiment Analysis or Opinion Mining: A Review
Opinion Mining (OM), or Sentiment Analysis (SA), can be defined as the task of detecting, extracting and classifying opinions about something. It is a type of natural language processing (NLP) used to track the public mood towards a particular law, policy, marketing campaign, etc. It involves developing ways to collect and examine comments and opinions about legislation, laws, policies, etc. that are posted on social media. Information extraction is a very useful technique, but also a challenging task: extracting sentiment about an object from across the web requires automated opinion-mining systems. Existing techniques for sentiment analysis include machine learning (supervised and unsupervised) and lexicon-based approaches. Hence, the main aim of this paper is to present a survey of sentiment analysis (SA) and opinion mining (OM) approaches and the various techniques used in this field. It also discusses the application areas and challenges of sentiment analysis, with insight into past researchers' work
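As a rough illustration of the lexicon-based family of approaches mentioned in this abstract, the sketch below scores a text by summing word polarities from a small dictionary. The lexicon entries, the negation rule and the function names are all assumptions made for this example; they are not taken from the paper, and real systems rely on much larger curated lexicons.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (illustration only).
LEXICON = {"good": 1, "great": 2, "support": 1,
           "bad": -1, "terrible": -2, "unfair": -1}
# Assumed negation cue words.
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Sum lexicon polarities; a negator flips the next lexicon word."""
    score = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


def classify(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `classify("This policy is not good")` yields `"negative"` because the negator flips the polarity of "good". A supervised machine learning approach would instead learn such weights from labelled examples.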
Schema Matching for Large-Scale Data Based on Ontology Clustering Method
Holistic schema matching is the process of identifying semantic correspondences among multiple schemas at once. The key challenge in holistic schema matching lies in selecting a method that maintains both effectiveness and efficiency. Effectiveness refers to the quality of matching, while efficiency refers to the time and memory consumed by the matching process. Several approaches have been proposed for holistic schema matching, and these have mainly depended on clustering techniques. Clustering aims to group similar fields within the schemas into multiple groups or clusters. However, fields in schemas contain complicated semantic relations that depend on the schema level. An ontology, which is a hierarchy of taxonomies, has the ability to identify semantic correspondences at various levels. Hence, this study proposes an ontology-based clustering approach for holistic schema matching. Two datasets from the ICQ query interfaces have been used, consisting of 40 interfaces from the Airfare and Job domains. The ontology used in this study has been built using XBenchMatch, a benchmark lexicon that contains rich semantic correspondences for the field of schema matching. In order to accommodate schema matching using the ontology, a rule-based clustering approach is used with multiple distance measures including Dice, Cosine and Jaccard. The evaluation has been conducted using the common information retrieval metrics: precision, recall and f-measure. In order to assess the performance of the proposed ontology-based clustering, two experiments have been compared. The first experiment applies the ontology-based clustering approach (i.e. using ontology and rule-based clustering), while the second applies traditional clustering approaches without the use of ontology. Results show that the proposed ontology-based clustering approach outperformed the traditional clustering approaches without ontology, achieving an f-measure of 94% for the Airfare dataset and 92% for the Job dataset. This emphasizes the strength of ontology in identifying correspondences with semantic level variation
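The three measures and the evaluation metrics named in this abstract can be sketched as follows. This is a minimal illustration that assumes field labels are compared as sets of lowercase word tokens; the tokenization, the function names and the example labels are assumptions, not details from the study.

```python
import math


def tokens(label):
    """Split a field label into a set of lowercase word tokens."""
    return set(label.lower().replace("_", " ").split())


def jaccard(a, b):
    """|A & B| / |A | B| over the token sets of two labels."""
    a, b = tokens(a), tokens(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 * |A & B| / (|A| + |B|) over the token sets of two labels."""
    a, b = tokens(a), tokens(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Cosine of binary token vectors: |A & B| / sqrt(|A| * |B|)."""
    a, b = tokens(a), tokens(b)
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def precision_recall_f1(predicted, gold):
    """Compare predicted correspondences against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With the hypothetical labels "departure city" and "departure date", the token sets share one of three distinct words, so Jaccard is 1/3 while Dice and Cosine are both 0.5. The f-measure here is the harmonic mean of precision and recall.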
Classification of Encouragement (Targhib) And Warning (Tarhib) Using Sentiment Analysis on Classical Arabic
The Holy Qur’an is the main religious text of Islam. The Qur’an has its own methods of Targhib (encouragement) and Tarhib (warning), which are important features of the text. Most Quranic verses urge and encourage people to do right and good deeds, and warn them against committing evil and bad deeds. The method of classifying a text into two opposing opinions has been applied previously to the problem of sentiment analysis. Here, it is applied to distinguish between Targhib (encouragement) and Tarhib (warning) verses in the Qur’an. Each verse of the Qur’an can be treated as either an encouragement, a warning or neutral. The language of the Holy Qur’an is one of the most challenging natural languages for sentiment analysis. The aim of this work is to classify the verses of encouragement and warning using sentiment analysis and NLP techniques. Several approaches are used in sentiment classification, such as the machine learning approach, the lexicon-based approach and the hybrid approach. In this work, the machine learning approach was applied, and the impact of different techniques such as POS tagging, N-grams and correlation-based feature selection was evaluated and investigated. An accuracy of 95.6% was achieved using Naïve Bayes (NB) and 91.5% using Support Vector Machines (SVM). This study is significant for extracting information and knowledge from the Holy Qur’an, both for researchers in the field of Islamic studies and for non-specialized researchers
Named entity recognition for quranic text using rule based approaches
The variety and difference between domains for textual data require customization in the Natural Language
Processing component especially in Named Entity Recognition where different domains contain several types of
entities. The current NER model is deemed not fit to accurately extract entities from Quranic text due to its unique
content. This paper describes the building of a rule-based Named Entity Recognition method to extract the entities
that exist in the English translation of the meaning of the Quranic text, and its performance evaluation. Named
entity tagging is a common task in text annotation, in which entities (nouns) in unstructured text are identified
and assigned a class. A few rules are built to extract several types of entities such as the names of prophets and
people, creation, location, time, and the various names of God. The rules are built mainly using regular expressions
and gazetteers. The rules that have been built result in high precision and recall as well as a satisfactory F-score
of over 90%. The results from this experiment can be used as annotation in building a machine learning model to
extract entities from the same type of domain specifically on the Quranic text or generally in the Islamic domain
text
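The combination of gazetteers and regular expressions described in this abstract can be sketched roughly as below. The gazetteer entries, the single TIME pattern and the label names are illustrative assumptions chosen for this example; the paper's actual rules and lists are not given in the abstract.

```python
import re

# Illustrative gazetteers (assumed entries, not the paper's lists).
GAZETTEER = {
    "PROPHET": ["Moses", "Abraham", "Noah"],
    "LOCATION": ["Egypt"],
    "GOD": ["Allah", "the Most Merciful"],
}

# Assumed regular-expression rule for time expressions such as "Day of Judgment".
TIME_PATTERN = re.compile(r"\bDay of [A-Z][a-z]+\b")


def build_gazetteer_pattern(names):
    """Compile one alternation per class, trying longer names first."""
    alternatives = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in alternatives)
    return re.compile(r"\b(?:" + body + r")\b")


RULES = [(label, build_gazetteer_pattern(names))
         for label, names in GAZETTEER.items()]
RULES.append(("TIME", TIME_PATTERN))


def tag_entities(text):
    """Return (surface form, class) pairs in order of appearance."""
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.start(), match.group(), label))
    return [(surface, label) for _, surface, label in sorted(entities)]
```

Calling `tag_entities("Moses left Egypt before the Day of Judgment")` tags "Moses" as PROPHET, "Egypt" as LOCATION and "Day of Judgment" as TIME. Precision and recall would then be measured by comparing such output against manually annotated text.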
The effectiveness of url features on phishing emails classification using machine learning approach
Phishing email classification requires well-chosen features to achieve good accuracy. One of
the reasons for the lack of development of models for detecting phishing emails is the complexity of feature
selection. Feature selection is an essential part of getting a good classification result; commonly used
features are the header, body, and Uniform Resource Locator (URL). Besides the email body text content, the URL
is one of the leading indicators of a phishing attack. The URL is commonly located in
the body of the phishing email to get the victim's attention. It redirects the victim to a fake website to obtain
personal information from the victim. There is a lack of information about how the URL features affect the
phishing email classification results. Therefore, this work focuses on using URL features to determine whether an
email is phishing or legitimate using machine learning approaches. Two public datasets used in this work are the
Online Phishing Corpus and Enron Corpus. The URL features are extracted using the Beautiful Soup library. Two
machine learning classifiers used in this work are Support Vector Machine (SVM) and Artificial Neural Network
(ANN). Two experiments were run, differing in the features given to the classifiers. The first experiment used
raw email data with URL features, while the second used only raw email data. The first experiment shows higher
accuracy for both classifiers, SVM and ANN. Hence, this research indicates that including URL features
improves classification performance
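To make the idea of URL features concrete, the following sketch derives a few numeric and boolean features from a URL string. It uses only the Python standard library rather than Beautiful Soup, which the abstract names, and the chosen features and their names are assumptions for illustration; the abstract does not list the features actually used.

```python
import re
from urllib.parse import urlparse

# Matches hosts written as dotted IPv4 addresses, e.g. 192.168.0.1.
IP_HOST = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def extract_urls(text):
    """Find http(s) URLs in raw email text with a simple pattern."""
    return re.findall(r"https?://[^\s\"'<>]+", text)


def url_features(url):
    """Compute a small, assumed feature set for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_ip_host": bool(IP_HOST.match(host)),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "num_path_segments": len([p for p in parsed.path.split("/") if p]),
    }
```

Feature dictionaries like these could be converted to vectors and appended to the text features before training a classifier such as an SVM or ANN.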
Preliminary study on an ontology learning from textual data
Natural language understanding is needed to intelligently handle the large volumes of information that have grown explosively on the WWW over the last decade. Ontologies may help with analyzing and understanding text, since an ontology provides the capability to represent objects, concepts and other entities and the relationships between them. Ontologies may be used as a tool for finding possible meanings of words in text, and the meaning of text in general. Much ontology development is now directed towards extraction from textual data, as human language is a primary mode of knowledge transfer. The aim of this paper is to give a general overview and preliminary study of ontology learning from text, which plays a prominent role in knowledge retrieval, and of how ontological semantics can be improved through the adoption of semantic web technology
Pashto language stemming algorithm
This paper presents a stemming algorithm for morphological analysis of less popular or minor languages such as
Pashto. There is a lack of resources and tools that can be applied in different applications such as
document indexing, clustering, language processing, text analysis, database search systems, information retrieval,
and linguistic applications. The review of literature shows that only a few morphological studies have been
conducted in the Pashto language, and research which focused on automatic stemming has not yet been fully
analysed. In addition, no stemming algorithm has been proposed for extracting Pashto root words from the Pashto
corpus, which is applicable for the above mentioned functions. Therefore, the objective of the current thesis is to
develop a rule-based stemming algorithm for the Pashto language. The Pashto corpus is directly used as the input
and the stemming algorithm uses both inflectional and derivational morphemes. The output is a
meaningful root word without affixes. Furthermore, the accuracy and strength of the proposed algorithm are
evaluated using a word count method. To validate the developed algorithm, two native speakers of
Pashto were recruited to evaluate it in terms of its accuracy and strength. The results of the study show
that the proposed algorithm has an accuracy of 87%. This study can make a significant contribution to the Pashto language
in terms of extracting root words useful for different purposes including data indexing, information retrieval,
linguistic applications, etc. This research also lays the ground for further studies on Pashto language analysis
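The general shape of a rule-based affix-stripping stemmer can be sketched as follows. The affix lists here are English placeholders used purely for illustration and are not the Pashto rules developed in the study; the minimum stem length and the accuracy calculation (the fraction of words whose stem matches a reference stem) are likewise assumptions.

```python
# Placeholder affixes (English, illustration only; NOT the study's Pashto rules).
PREFIXES = ["un", "re"]
SUFFIXES = ["ness", "ing", "ed", "s"]
# Assumed guard so that very short words are not over-stemmed.
MIN_STEM_LENGTH = 3


def strip_affix(word, affixes, is_prefix):
    """Remove the longest matching affix if enough of the word remains."""
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if is_prefix else word.endswith(affix)
        if matches and len(word) - len(affix) >= MIN_STEM_LENGTH:
            return word[len(affix):] if is_prefix else word[:-len(affix)]
    return word


def stem(word):
    """Strip at most one prefix and then at most one suffix."""
    word = word.lower()
    word = strip_affix(word, PREFIXES, is_prefix=True)
    word = strip_affix(word, SUFFIXES, is_prefix=False)
    return word


def accuracy(words, expected_stems):
    """Fraction of words whose stem equals the reference stem."""
    correct = sum(stem(w) == s for w, s in zip(words, expected_stems))
    return correct / len(words)
```

With these placeholder rules, `stem("unkindness")` returns `"kind"`, while `stem("running")` returns `"runn"`, which shows why reference stems judged by native speakers are needed to measure accuracy.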
Ontology extraction
Ontology is an important emerging discipline that has huge potential to improve information organization, management and understanding. Ontology has become an important means of structuring knowledge and building knowledge-intensive systems. The importance of domain ontologies is widely recognized, particularly in relation to the expected advent of the Semantic Web. As the term refers to the shared understanding of some domain of interest, which is often conceived as a set of concepts, relations, functions, axioms and instances (Gruber, 1993), the goal of a domain ontology is to reduce the conceptual and terminological confusion among the members of a virtual community of users who need to share electronic documents and information of various kinds. According to Uschold and Jasper (1999), ‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are interrelated which collectively impose a structure on the domain and constrain the possible interpretations of terms.’ Gruber (1993) defines ontology as ‘the specification of conceptualizations, used to help programs and humans share knowledge’. The conceptualization is the couching of knowledge about the world in terms of entities (things, the relationships they hold and the constraints between them). The specification is the representation of this conceptualization in a concrete form. One step in this specification is the encoding of the conceptualization in a knowledge representation language
Towards context-sensitive domain of Islamic knowledge ontology extraction
Ontology is an essential topic within an important area of current computer science and the Semantic Web. Ontologies present well-defined, straightforward and standardized forms of repositories (of vast and reliable knowledge) that are interoperable and machine-understandable. There are many possible uses of ontologies, from automatic annotation of web resources to domain representation and reasoning tasks. Ontology is an effective conceptualization used for the semantic web. However, no research has tried to construct an ontology from Islamic knowledge, which consists of the Holy Quran, Hadiths, etc. Therefore, as a first stage, this paper proposes a simple methodology for extracting concepts based on the Al-Quran. Finally, we discuss the experiments that have been conducted