
    Taxonomy learning from Malay texts using artificial immune system based clustering

    In taxonomy learning from texts, the extracted features that describe the context of a term are usually erroneous and sparse. Various attempts have been made to overcome data sparseness and noise using clustering algorithms such as Hierarchical Agglomerative Clustering (HAC), Bisecting K-means and Guided Agglomerative Hierarchical Clustering (GAHC). However, these methods suffer from low recall. Therefore, the purpose of this study is to investigate the application of two hybridized artificial immune system (AIS) algorithms in taxonomy learning from Malay texts, and to develop a Google-based Text Miner (GTM) for feature selection to reduce data sparseness. Two novel taxonomy learning algorithms have been proposed and compared with the benchmark methods (i.e., HAC, GAHC and Bisecting K-means). The first algorithm, GCAINT (Guided Clustering and aiNet for Taxonomy Learning), is a hybridization of GAHC and the Artificial Immune Network (aiNet). GCAINT exploits a Hypernym Oracle (HO) to guide the hierarchical clustering process and produces better results than the benchmark methods. However, the Malay HO introduces erroneous hypernym-hyponym pairs, which affects the result. Therefore, a second novel algorithm, CLOSAT (Clonal Selection Algorithm for Taxonomy Learning), is proposed by hybridizing the Clonal Selection Algorithm (CLONALG) with Bisecting K-means. CLOSAT produces the best results compared with the benchmark methods and GCAINT. To reduce sparseness in the obtained dataset, the GTM is proposed. However, the experimental results reveal that GTM introduces too much noise into the dataset, which leads to many false positive hypernym-hyponym pairs. The effect of different combinations of affinity measures (i.e., Hamming, Jaccard and Rand) on the performance of the developed methods was also studied. Jaccard is found to be better than Hamming and Rand at measuring the similarity between terms.
In addition, the use of Particle Swarm Optimization (PSO) for automatic parameter tuning of GCAINT and CLOSAT was also proposed. Experimental results demonstrate that, in most cases, PSO-tuned CLOSAT and GCAINT produce better results than the benchmark methods and are able to reduce data sparseness and noise in the dataset.
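    The finding that Jaccard outperforms Hamming as an affinity measure can be illustrated on binary term-context vectors. A minimal sketch, assuming hypothetical feature vectors (not the paper's dataset): Jaccard ignores positions where both vectors are zero, so it is not inflated by the many shared absences in sparse data, while Hamming rewards agreement on zeros.

```python
def jaccard(a, b):
    # |intersection| / |union| over the positions set to 1
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 0.0

def hamming(a, b):
    # fraction of positions where the two vectors agree (0-0 counts too)
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Hypothetical binary context-feature vectors for two terms
t1 = [1, 0, 1, 1, 0, 0, 0, 0]
t2 = [1, 0, 0, 1, 0, 0, 0, 0]
print(jaccard(t1, t2))  # 2 shared features / 3 in union ≈ 0.667
print(hamming(t1, t2))  # 7 of 8 positions agree = 0.875
```

    Note how Hamming looks high mainly because of the shared zeros, which dominate sparse term-context vectors.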

    Self-adaptive Based Model for Ambiguity Resolution of The Linked Data Query for Big Data Analytics

    Integration of heterogeneous data sources is a crucial step in big data analytics, but it creates ambiguity during mapping between the sources due to variation in query terms, data structure and granularity conflicts. However, there is limited research on effective big data integration that addresses this ambiguity issue. This paper introduces a self-adaptive model for big data integration that exploits the data structure during querying in order to mitigate and resolve ambiguities. An assessment of preliminary work on the Geography and Quran datasets is reported to illustrate the feasibility of the proposed model and to motivate future work such as solving complex queries.

    Normalization of common noisy terms in Malaysian online media

    This paper proposes a normalization technique for noisy terms that occur in Malaysian micro-texts. Noisy terms are common in online messages and influence the results of activities such as text classification and information retrieval. Even though many researchers have studied methods to solve this problem, few have looked into it for languages other than English. In this study, about 5000 noisy texts were extracted from 15000 documents created by Malaysians. The normalization process was executed using specific translation rules as part of the preprocessing steps in opinion mining of movie reviews. The result shows up to a 5% improvement in the accuracy of opinion mining.
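    The translation-rule step described above amounts to a lookup table applied token by token. A minimal sketch, assuming a hypothetical rule table (the entries below are common Malay short forms used for illustration, not the paper's actual rules):

```python
# Hypothetical translation rules mapping common Malay noisy terms
# to their standard forms (illustrative entries only)
RULES = {
    "x": "tidak",     # "not"
    "sy": "saya",     # "I"
    "yg": "yang",     # "that/which"
    "dgn": "dengan",  # "with"
}

def normalize(text):
    # replace each token found in the rule table with its standard form;
    # tokens without a rule pass through unchanged
    return " ".join(RULES.get(tok, tok) for tok in text.split())

print(normalize("sy x suka dgn filem yg itu"))
# → "saya tidak suka dengan filem yang itu"
```

    In practice the rule table would be built from the corpus analysis the paper describes, rather than written by hand.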

    An evolutionary variable neighbourhood search for the unrelated parallel machine scheduling problem

    This article addresses a challenging industrial problem known as the unrelated parallel machine scheduling problem (UPMSP) with sequence-dependent setup times. In UPMSP, we have a set of machines and a group of jobs, and the goal is to find the optimal way to schedule jobs for execution on one of the several available machines. UPMSP has been classified as an NP-hard optimisation problem and thus cannot be solved by exact methods in reasonable time. Meta-heuristic algorithms are commonly used to find sub-optimal solutions. However, large-scale UPMSP instances pose a significant challenge to meta-heuristic algorithms. To effectively solve large-scale UPMSP instances, this article introduces a two-stage evolutionary variable neighbourhood search (EVNS) methodology. The proposed EVNS adaptively integrates a variable neighbourhood search algorithm and an evolutionary descent framework. The evolutionary framework is employed in the first stage and uses a mix of crossover and mutation operators to generate diverse solutions. In the second stage, an adaptive variable neighbourhood search exploits the area around the solutions generated in the first stage. A dynamic strategy determines the switching time between the two stages, and a diversity-based fitness function guides the search towards promising, unexplored areas of the search landscape. We demonstrate the competitiveness of the proposed EVNS with computational results and comparisons on the 1640 UPMSP benchmark instances commonly used in the literature. The experimental results show that our EVNS obtains better results than the compared algorithms on several UPMSP instances.
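    The neighbourhood-switching idea can be sketched on a drastically simplified UPMSP that ignores sequence-dependent setup times. This is a plain variable neighbourhood descent with shaking, not the paper's full EVNS (no evolutionary stage, crossover operators, or diversity-based fitness); all instance data below are hypothetical.

```python
import random

def makespan(assign, proc):
    # assign[j] = machine of job j; proc[j][m] = time of job j on machine m
    # (sequence-dependent setup times are omitted in this toy version)
    loads = [0] * len(proc[0])
    for j, m in enumerate(assign):
        loads[m] += proc[j][m]
    return max(loads)

def vns(proc, iters=100, seed=0):
    rng = random.Random(seed)
    n_jobs, n_mach = len(proc), len(proc[0])
    best = [rng.randrange(n_mach) for _ in range(n_jobs)]
    best_cost = makespan(best, proc)
    for _ in range(iters):
        # shake: perturb the incumbent by reassigning one random job
        cur = best[:]
        cur[rng.randrange(n_jobs)] = rng.randrange(n_mach)
        # variable neighbourhood descent over two neighbourhoods
        k = 0
        while k < 2:
            improved = False
            if k == 0:  # neighbourhood 1: move one job to another machine
                for j in range(n_jobs):
                    for m in range(n_mach):
                        cand = cur[:]
                        cand[j] = m
                        if makespan(cand, proc) < makespan(cur, proc):
                            cur, improved = cand, True
            else:       # neighbourhood 2: swap the machines of two jobs
                for i in range(n_jobs):
                    for j in range(i + 1, n_jobs):
                        cand = cur[:]
                        cand[i], cand[j] = cand[j], cand[i]
                        if makespan(cand, proc) < makespan(cur, proc):
                            cur, improved = cand, True
            k = 0 if improved else k + 1  # restart neighbourhoods on success
        if makespan(cur, proc) < best_cost:
            best, best_cost = cur[:], makespan(cur, proc)
    return best, best_cost

# 4 hypothetical jobs on 2 machines; proc[j][m] is the processing time
proc = [[4, 2], [3, 5], [1, 6], [2, 2]]
best, cost = vns(proc)
print(best, cost)  # optimal makespan for this instance is 4
```

    The swap neighbourhood matters: some assignments (e.g. only job 1 on machine 1 here) are local optima for single-job moves but escape via a swap, which is exactly why VNS cycles through several neighbourhood structures.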

    Normalization of noisy texts in Malaysian online reviews

    The process of gathering useful information from online messages has grown as more and more people use the Internet and online applications such as Facebook and Twitter to communicate with each other. One of the problems in processing online messages is the high number of noisy texts in these messages. A few studies have shown that noisy texts degrade the results of text mining activities. On the other hand, very few works have investigated the patterns of noisy texts created by Malaysians. In this study, a common noisy terms list and an artificial abbreviations list were created using specific rules and were utilized to select candidates of correct words for a noisy term. The correct term was then selected based on a bi-gram word index. The experiments used online messages created by Malaysians. The result shows that normalization of noisy texts using the artificial abbreviations list complements the use of the common noisy terms list.
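    The bi-gram selection step can be sketched as follows: given several candidate expansions for a noisy term, pick the one that forms the most frequent bi-gram with the preceding word. The bi-gram counts and candidates below are hypothetical, not taken from the paper's index.

```python
# Hypothetical bi-gram counts harvested from a clean Malay corpus
BIGRAMS = {
    ("filem", "bagus"): 12,  # "good film" — frequent
    ("filem", "baju"): 1,    # "film shirt" — rare
}

def pick_candidate(prev_word, candidates):
    # choose the expansion forming the most frequent bi-gram with the
    # preceding word; unknown bi-grams score 0
    return max(candidates, key=lambda c: BIGRAMS.get((prev_word, c), 0))

# the noisy term "bgs" could expand to "bagus" (good) or "baju" (shirt)
print(pick_candidate("filem", ["baju", "bagus"]))  # → "bagus"
```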

    Using Bayesian Network for Determining The Recipient of Zakat in BAZNAS Pekanbaru

    The National Amil-Zakat Agency (Baznas) in Pekanbaru collects and distributes zakat in Pekanbaru city. Baznas Pekanbaru should be able to determine Mustahik properly; a Mustahik is a person eligible to receive zakat. The Baznas committee interviews and observes every Mustahik candidate to decide who should receive zakat. The current Mustahik determination process can lead to subjective assessment, due to the large number of zakat recipient applicants and the complexity of the rules for determining a Mustahik. Therefore, this study utilizes artificial intelligence in determining Mustahik, applying the Bayesian Network method as an inference engine. Based on the experimental results, we found that the Bayesian network produces a good accuracy of 93.24% and is effective on a data set with an uneven class distribution. In addition, experiments show that setting the alpha estimator's value between 0.6 and 1.0 can increase the accuracy of the Bayesian Network to 95.95%.
    Keywords—bayesian network, baznas pekanbaru, mustahik, zakat
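    As a rough illustration of how an alpha estimator enters the probability estimates, here is a naive Bayes classifier with Lidstone (alpha) smoothing. This is a simplification of the paper's Bayesian network (naive Bayes assumes feature independence), and the applicant features and labels are hypothetical.

```python
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (the paper tunes 0.6-1.0)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n = {c: y.count(c) for c in self.classes}  # rows per class
        self.total = len(y)
        self.counts = defaultdict(int)       # (class, feature, value) count
        self.values = [set() for _ in X[0]]  # observed values per feature
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(c, i, v)] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        def logpost(c):
            lp = math.log(self.n[c] / self.total)  # class prior
            for i, v in enumerate(row):            # smoothed likelihoods
                num = self.counts[(c, i, v)] + self.alpha
                den = self.n[c] + self.alpha * len(self.values[i])
                lp += math.log(num / den)
            return lp
        return max(self.classes, key=logpost)

# Hypothetical applicant records: (income_level, dependants) -> eligible?
X = [("low", "many"), ("low", "few"), ("high", "few"), ("high", "many")]
y = ["yes", "yes", "no", "yes"]
model = NaiveBayes(alpha=0.6).fit(X, y)
print(model.predict(("high", "few")))  # → "no"
```

    With alpha greater than zero, unseen feature-value combinations receive a small non-zero probability instead of zeroing out the whole posterior, which is why tuning alpha affects accuracy on skewed data.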

    Time Series Prediction of Bitcoin Cryptocurrency Price Based on Machine Learning Approach

    Over the past few years, Bitcoin has attracted the attention of numerous parties, ranging from academic researchers to institutional investors. Bitcoin is the first and most widely used cryptocurrency to date. Due to the significant volatility of its price and the fact that its trading does not require a third party, Bitcoin has gained great popularity since its inception in 2009. Given the difficulty of predicting cryptocurrency prices, this project develops and implements a time-series prediction model using machine learning algorithms, including Support Vector Regression (SVR), K-Nearest Neighbor Regression (KNN), Extreme Gradient Boosting (XGBoost), and Long Short-Term Memory (LSTM), to determine the trend of Bitcoin price movement and to assess the effectiveness of the machine learning models. The data used are the close prices of Bitcoin from 2018 to 2023. The performance of the models is evaluated by comparing R-squared, mean absolute error (MAE) and root mean squared error (RMSE), and through a dashboard visualization of the original and predicted close prices. Among the models compared, LSTM emerged as the most accurate, followed by SVR, while XGBoost and KNN exhibited comparatively lower performance.
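    The three reported metrics can be computed directly from the actual and predicted close prices. A minimal sketch with toy values (not actual Bitcoin prices):

```python
import math

def mae(y, p):
    # mean absolute error: average magnitude of the prediction errors
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    # root mean squared error: penalizes large errors more than MAE
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def r2(y, p):
    # R-squared: fraction of variance in y explained by the predictions
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Toy close prices vs. one model's predictions (illustrative values)
actual    = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 102.0, 104.0]
print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # 1.0
print(r2(actual, predicted))    # ≈ 0.714
```

    Comparing the same three numbers across SVR, KNN, XGBoost and LSTM on a held-out period is what ranks the models in the abstract.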

    Automatic Rule Generator via FP-Growth for Eye Diseases Diagnosis

    The conventional approach to developing a rule-based expert system usually involves a tedious, lengthy and costly knowledge acquisition process, known as the bottleneck of expert system development. Furthermore, manual knowledge acquisition can lead to errors in decision-making and to an ineffective expert system. Another dilemma for knowledge engineers is handling conflicts of interest, or the high variance of inter- and intra-personal decisions among domain experts, during the knowledge elicitation stage. The aim of this research is to improve knowledge acquisition using a data mining technique. This paper investigates the effectiveness of an association rule mining technique in generating new rules for an expert system. FP-Growth is the machine learning technique used to acquire rules from eye disease diagnosis records collected from the Sumatera Eye Center (SMEC) Hospital in Pekanbaru, Riau, Indonesia. The developed system was tested on 17 cases, and ophthalmologists inspected the results of the automatic rule generator for eye disease diagnosis. We found that introducing FP-Growth association rules into the eye disease knowledge-based system produces acceptable and promising diagnosis results, with an average accuracy of approximately 88%. Based on the test results, we conclude that Conjunctivitis and Presbyopia are the most prevalent diseases among the cases studied. In conclusion, FP-Growth association rules show strong potential as an automatic rule generator, but there is still plenty of room for improvement in the context of eye disease diagnosis.
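    The rule-generation idea can be sketched with a brute-force frequent-itemset count over symptom-diagnosis transactions. A real FP-Growth builds an FP-tree instead of enumerating combinations (Apriori-style enumeration is used here only for brevity), and the diagnosis records below are hypothetical.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support):
    # count the support of every itemset of size 1 or 2
    # (FP-Growth would mine these from a compressed FP-tree instead)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in (1, 2):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical diagnosis records: symptoms plus the confirmed disease
records = [
    ["red_eye", "discharge", "conjunctivitis"],
    ["red_eye", "discharge", "conjunctivitis"],
    ["blurred_near_vision", "presbyopia"],
    ["red_eye", "itching", "conjunctivitis"],
]
freq = frequent_itemsets(records, min_support=0.5)
# confidence of the rule red_eye -> conjunctivitis:
# support(both) / support(red_eye)
conf = freq[("conjunctivitis", "red_eye")] / freq[("red_eye",)]
print(round(conf, 2))  # → 1.0
```

    Each frequent symptom-to-disease itemset whose confidence clears a threshold becomes a candidate IF-THEN rule for the knowledge base, which the ophthalmologists then validate.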