62 research outputs found

    Clustering Arabic Tweets for Sentiment Analysis

    The focus of this study is to evaluate the impact of linguistic preprocessing and similarity functions on the clustering of Arabic Twitter tweets. The experiments apply an optimized version of the standard K-Means algorithm to assign tweets to positive and negative categories. The results show that root-based stemming has a significant advantage over light stemming in all settings. The Averaged Kullback-Leibler Divergence similarity function clearly outperforms the Cosine, Pearson Correlation, Jaccard Coefficient and Euclidean functions. The combination of the Averaged Kullback-Leibler Divergence and root-based stemming achieved the highest purity of 0.764, while the second-best purity was 0.719. These results are important because they run contrary to findings for normal-sized documents, where, in many information retrieval applications, light stemming performs better than root-based stemming and the Cosine function is commonly used.
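
    As a rough illustration of the similarity function and evaluation measure named above, the sketch below computes an averaged (symmetrised, smoothed) Kullback-Leibler divergence between two term-frequency vectors and the purity of a two-cluster assignment. The smoothing constant and the toy data are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def averaged_kl_divergence(p, q, eps=1e-9):
    """Symmetrised ('averaged') KL divergence between two term distributions.
    The eps smoothing is an assumption to avoid division by zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    kl_pq = np.sum(p * np.log(p / q))   # KL(p || q)
    kl_qp = np.sum(q * np.log(q / p))   # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

def purity(cluster_labels, true_labels):
    """Purity: fraction of tweets assigned to the majority true class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

# Toy example: two tweet term-frequency vectors and a 4-tweet clustering.
print(averaged_kl_divergence([3, 0, 1], [1, 2, 1]))
print(purity([0, 0, 1, 1], [0, 0, 1, 0]))   # -> 0.75
```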

    A geographic knowledge discovery approach to property valuation

    This thesis involves an investigation of how knowledge discovery can be applied in the area of Geographic Information Science. In particular, its application in the area of property valuation is explored, in order to reveal how different spatial entities and their interactions affect property prices. This approach is entirely data driven and does not require previous knowledge of the area to which it is applied. To demonstrate this process, a prototype system has been designed and implemented. It employs association rule mining and associative classification algorithms to uncover any existing inter-relationships and perform the valuation. Various algorithms that perform the above tasks have been proposed in the literature. The algorithm developed in this work is based on the Apriori algorithm. It has, however, been extended with an implementation of a ‘Best Rule’ classification scheme based on the Classification Based on Associations (CBA) algorithm. For the modelling of geographic relationships a graph-theoretic approach has been employed. Graphs have been widely used as modelling tools within the geography domain, primarily for the investigation of network-type systems. In the current context, the graph reflects topological and metric relationships between the spatial entities, depicting general spatial arrangements. An efficient graph search algorithm has been developed, based on the Dijkstra shortest path algorithm, that enables the investigation of relationships between spatial entities beyond first-degree connectivity. A case study with data from three central London boroughs has been performed to validate the methodology and algorithms, and to demonstrate their effectiveness for computer-aided property valuation. In addition, through the case study, the influence of location on the value of properties in those boroughs has been examined. The results are encouraging, as they demonstrate the effectiveness of the proposed methodology and algorithms, provided that the data is appropriately pre-processed and of high quality.
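
    To make the graph-search step concrete, here is a minimal Dijkstra sketch over a weighted graph of spatial entities, which surfaces relationships beyond first-degree connectivity. The adjacency structure and entity names are hypothetical, and this is a generic Dijkstra implementation rather than the thesis's extended algorithm.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` over a weighted graph of spatial
    entities, given as {node: [(neighbour, edge_weight), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

# Hypothetical spatial-entity graph: a property, a park and two stations (metres).
graph = {
    "property_1": [("park_A", 120.0), ("station_X", 300.0)],
    "park_A": [("station_Y", 450.0)],
    "station_X": [("station_Y", 200.0)],
    "station_Y": [],
}
print(dijkstra(graph, "property_1"))  # reveals second-degree links such as station_Y
```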

    Water filtration by using apple and banana peels as activated carbon

    A water filter is an important device for reducing the contaminants in raw water. Activated carbon from charcoal is used to absorb the contaminants. Fruit peels are a suitable alternative carbon source that can substitute for charcoal. The main goal of this study is to determine the role of apple and banana peel powder as activated carbon in a water filter. The peels are dried and blended into a powder so that they can absorb the contaminants. The observation consists of comparing the results for raw water before and after filtering. After filtering the raw water, the pH reading was 6.8, which is within the normal range, and the recorded turbidity was 658 NTU. As for the colour, the water became clearer compared to the raw water. This study found that fruit peels such as banana and apple are an effective substitute for charcoal as a natural absorbent.

    Predictive trend mining for social network analysis

    This thesis describes research work within the theme of trend mining as applied to social network data. Trend mining is a type of temporal data mining that provides insight into how information changes over time. In the context of the work described in this thesis, the focus is on how information contained in social networks changes with time. The work described proposes a number of data mining based techniques directed at mechanisms not only to detect change, but also to support the analysis of change, with respect to social network data. To this end a trend mining framework is proposed to act as a vehicle for evaluating the ideas presented in this thesis. The framework is called the Predictive Trend Mining Framework (PTMF). It is designed to support "end-to-end" social network trend mining and analysis. The work described in this thesis is divided into two elements: Frequent Pattern Trend Analysis (FPTA) and Prediction Modeling (PM). For evaluation purposes three social network datasets have been considered: Great Britain Cattle Movement, Deeside Insurance and Malaysian Armed Forces Logistic Cargo. The evaluation indicates that a sound mechanism for identifying and analysing trends, and for using this trend knowledge for prediction purposes, has been established.
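
    As a small illustration of the frequent pattern trend idea, the sketch below tracks the support of a single itemset across a sequence of temporal epochs, producing a trend series. The epochs and transactions are hypothetical stand-ins and do not reflect the PTMF's actual input format or algorithms.

```python
def pattern_trend(epochs, pattern):
    """Support of the itemset `pattern` in each temporal epoch, giving a trend series.
    Epochs are lists of transactions (each transaction an iterable of items)."""
    pattern = frozenset(pattern)
    return [sum(1 for t in transactions if pattern <= set(t)) / len(transactions)
            for transactions in epochs]

# Three toy monthly epochs of item transactions.
epochs = [
    [("a", "b"), ("a", "c"), ("a", "b", "c")],
    [("a", "b"), ("b", "c"), ("a", "b")],
    [("b", "c"), ("c",), ("a", "c")],
]
print(pattern_trend(epochs, {"a", "b"}))  # -> [0.666..., 0.666..., 0.0]
```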

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova.
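
    A minimal sketch of the first-order Markovian propagation idea, using a Gaussian mixture as a generic stand-in for the spatial probabilistic model: the mixture fitted to the particle positions at one time step initialises the fit at the next. The mixture model, component count and toy snapshots are assumptions, not the paper's actual model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def propagate_model(snapshots, n_components=3, seed=0):
    """First-order Markovian propagation: the mixture fitted to the particle
    positions at time t initialises the fit at time t+1."""
    models, previous = [], None
    for particles in snapshots:                 # particles: (n_points, n_dims)
        if previous is None:
            gm = GaussianMixture(n_components=n_components, random_state=seed)
        else:
            gm = GaussianMixture(
                n_components=n_components,
                weights_init=previous.weights_,
                means_init=previous.means_,
                precisions_init=previous.precisions_,
                random_state=seed,
            )
        gm.fit(particles)
        models.append(gm)
        previous = gm
    return models

# Two toy "snapshots" of a slowly drifting particle cloud.
rng = np.random.default_rng(1)
snapshots = [rng.normal(loc=shift, size=(500, 3)) for shift in (0.0, 0.5)]
models = propagate_model(snapshots)
print(models[-1].means_.round(2))
```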

    Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery

    Evaluating transactional payment behaviour offers a competitive advantage in the modern payment ecosystem, not only for confirming the presence of good credit applicants or unlocking the cross-selling potential between the respective product and service portfolios of financial institutions, but also for precisely ruling out bad credit applicants in transactional payment streams. In a diagnostic test for analysing payment behaviour, I have used a hybrid approach comprising a combination of supervised and unsupervised learning algorithms to discover behavioural patterns. Supervised learning algorithms can compute a range of credit scores and cross-sell candidates, although the applied methods only discover limited behavioural patterns across the payment streams. Moreover, the performance of the applied supervised learning algorithms varies across the different data models, and their optimisation is inversely related to the pre-processed dataset. Subsequently, the research experiments conducted suggest that the Two-Class Decision Forest is an effective algorithm to determine both the cross-sell candidates and the creditworthiness of customers. In addition, a deep-learning model using a neural network has been considered, offering a meaningful interpretation of future payment behaviour through categorised payment transactions, in particular by providing additional deep insights through graph-based visualisations. However, the research shows that unsupervised learning algorithms play a central role in evaluating the transactional payment behaviour of customers: discovering associations using market basket analysis based on previous payment transactions, finding frequent transaction categories, and developing interesting rules when each transaction category is performed on the same payment stream. The research also reveals that transactional payment behaviour analysis is multifaceted in the financial industry, both for assessing the diagnostic ability of promotion candidates and for classifying bad credit applicants from among the entire customer base. The developed predictive models can also be used to estimate the credit risk of any credit applicant based on his/her transactional payment behaviour profile, combined with deep insights from the categorised payment transaction analysis. The research study provides a full review of the performance characteristics of the different developed data models. Thus, the demonstrated data science approach is a possible proof of how machine learning models can be turned into cost-sensitive data models.
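
    As an illustration of the market basket analysis step on categorised payment transactions, here is a minimal pairwise association rule sketch (support and confidence only). The transaction categories, thresholds and rule format are assumptions for illustration and do not reproduce the models developed in the thesis.

```python
from itertools import combinations
from collections import Counter

def pairwise_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Support of single categories and category pairs, and confidence of
    A -> B rules, over sets of categorised payment transactions."""
    n = len(transactions)
    item_counts, pair_counts = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    rules = []
    for pair, count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, round(support, 2), round(confidence, 2)))
    return rules

# Hypothetical categorised payment transactions.
payments = [
    {"groceries", "fuel"}, {"groceries", "insurance"},
    {"groceries", "fuel", "utilities"}, {"fuel", "utilities"},
]
print(pairwise_rules(payments))  # e.g. ('fuel', 'groceries', 0.5, 0.67)-style rules
```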

    An Evaluation of the Use of Diversity to Improve the Accuracy of Predicted Ratings in Recommender Systems

    The diversity versus accuracy trade-off has become an important area of research within recommender systems as online retailers attempt to better serve their customers and gain a competitive advantage through an improved customer experience. This dissertation attempted to evaluate the use of diversity measures in predictive models as a means of improving predicted ratings. The research literature outlines a number of influencing factors such as personality, taste, mood and social networks, in addition to approaches that address the diversity challenge post recommendation. A number of models were applied, including DecisionStump, Linear Regression, J48 Decision Tree and Naive Bayes. Various evaluation metrics such as precision, recall, ROC area, mean squared error and correlation coefficient were used to evaluate the model types. The results were below a benchmark selected during the literature review. The experiment did not demonstrate that diversity measures as inputs improve the accuracy of predicted ratings. However, the evaluation results for the model without diversity measures were also low and comparable to those with diversity measures, indicating that further research in this area may be worthwhile. While the experiment conducted did not clearly demonstrate that the inclusion of diversity measures as inputs improves the accuracy of predicted ratings, the approaches to data extraction, pre-processing, and model selection could inform further research. Areas of further research identified within this paper may also add value for those interested in this topic.
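
    As a rough sketch of the kind of model comparison described above, the code below trains scikit-learn analogues of the listed models (a depth-1 tree standing in for DecisionStump, a full decision tree for J48, Gaussian Naive Bayes, and logistic regression in place of Linear Regression for a classification-style comparison) with and without an extra diversity-style feature, and compares cross-validated ROC AUC. The synthetic data and the choice of analogues are assumptions, not the dissertation's setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: base rating features plus one "diversity" column.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(400, 5))
diversity = rng.uniform(size=(400, 1))
y = (X_base[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)   # liked / not liked

models = {
    "stump (DecisionStump analogue)": DecisionTreeClassifier(max_depth=1),
    "tree (J48 analogue)": DecisionTreeClassifier(),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    for label, X in (("without diversity", X_base),
                     ("with diversity", np.hstack([X_base, diversity]))):
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name:32s} {label:18s} ROC AUC = {auc:.3f}")
```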

    A design science framework for research in health analytics

    Data analytics provides the ability to systematically identify patterns and insights from a variety of data as organizations pursue improvements in their processes, products, and services. Analytics can be classified based on their ability to explore, explain, predict, and prescribe. When applied to the field of healthcare, analytics presents a new frontier for business intelligence. In 2013 alone, the Centers for Medicare and Medicaid Services (CMS) reported that the national health expenditure was $2.9 trillion, representing 17.4% of the total United States GDP. The Patient Protection and Affordable Care Act of 2010 (ACA) requires all hospitals to implement electronic medical record (EMR) technologies by year 2014 (Patient Protection and Affordable Care Act, 2010). Moreover, the ACA makes healthcare processes and outcomes more transparent by making related data readily available for research. Enterprising organizations are employing analytics and analytical techniques to find patterns in healthcare data (I. R. Bardhan & Thouin, 2013; Hansen, Miron-Shatz, Lau, & Paton, 2014). The goal is to assess the cost and quality of care and identify opportunities for improvement for organizations as well as the healthcare system as a whole. Yet, there remains a need for research to systematically understand, explain, and predict the sources and impacts of the widely observed variance in the cost and quality of care available. This is a driving motivation for research in healthcare. This dissertation conducts a design-theoretic examination of the application of advanced data analytics in healthcare. Heart failure is the number one cause of death and the biggest contributor to healthcare costs in the United States. An exploratory examination of the application of predictive analytics is conducted in order to understand the cost and quality of care provided to heart failure patients. The specific research question addressed is: How can we improve and expand upon our understanding of the variances in the cost of care and the quality of care for heart failure? Using state-level data from the State Health Plan of North Carolina, a standard readmission model was assessed as a baseline measure for prediction, and advanced analytics were compared to this baseline. This dissertation demonstrates that advanced analytics can improve readmission predictions as well as expand understanding of the profile of a patient readmitted for heart failure. Implications are assessed for academics and practitioners.
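
    As a rough sketch of comparing a standard readmission model against more advanced analytics, the code below fits a logistic regression baseline and a gradient-boosted tree model on synthetic data and compares their AUC on a held-out set. The features, data and the choice of gradient boosting as the "advanced" model are assumptions for illustration, not the dissertation's actual State Health Plan analysis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical heart-failure features (age, prior admissions, length of stay, ...).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))
readmitted = (0.8 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(size=1000) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, readmitted, test_size=0.3, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)            # standard readmission model
advanced = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)   # "advanced analytics" stand-in

for name, model in (("baseline logistic", baseline), ("gradient boosting", advanced)):
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```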