996 research outputs found

    Data mining reduction methods and performances of rules

    In data mining, the accuracy of a model is associated with the strength of its rules. However, most machine learning techniques produce a large number of rules, and the consequence is a much longer processing time. This study examines rules with different numbers of attributes in terms of performance, measured as percentage accuracy. The research adopts the Knowledge Discovery in Databases (KDD) methodology for analysis and applies various data mining techniques in the experiments. A dataset of 50 hardware companies, containing 31 attributes and 400 records, is used. In summary, the results show that the Genetic Algorithm produced the highest number of rules, followed by Johnson’s Algorithm and Holte’s 1R. The best classifier for extracting rules in this study is VOT (Voting of Object Tracking). In terms of rule performance, the best results come from rules with 30 attributes, followed by rules with 1 intersection attribute and lastly rules with 3 intersection attributes. Among the three sets of attributes, the 3 intersection attributes are considered the ones that can be used as predictor attributes
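
    Holte’s 1R, one of the rule learners named in this abstract, can be illustrated with a minimal sketch. The function name `one_r`, the toy records, and the attribute names are hypothetical and are not taken from the study or its dataset. For each attribute, the sketch maps every observed value to its most frequent class and keeps the attribute whose one-level rule set is most accurate on the training records.

```python
from collections import Counter, defaultdict


def one_r(records, target):
    """Return (attribute, rules, training_accuracy) for the best 1R rule set.

    `records` is a list of dicts sharing the same keys; `target` is the key
    holding the class label. Ties between attributes keep the first one seen.
    """
    best = None
    attributes = [name for name in records[0] if name != target]
    for attr in attributes:
        # Count class labels for every value this attribute takes.
        counts = defaultdict(Counter)
        for row in records:
            counts[row[attr]][row[target]] += 1
        # One rule per value: predict the majority class for that value.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        accuracy = correct / len(records)
        if best is None or accuracy > best[2]:
            best = (attr, rules, accuracy)
    return best


# Hypothetical toy data, for illustration only.
toy = [
    {"size": "small", "sector": "retail", "profitable": "no"},
    {"size": "small", "sector": "telecom", "profitable": "no"},
    {"size": "large", "sector": "retail", "profitable": "yes"},
    {"size": "large", "sector": "telecom", "profitable": "yes"},
    {"size": "large", "sector": "retail", "profitable": "yes"},
]
best_attr, best_rules, best_acc = one_r(toy, "profitable")
print(best_attr, best_rules, best_acc)
```

    On this toy data the single attribute `size` separates the classes perfectly, so 1R selects it. Real data would need a held-out test set to measure the percentage accuracy the abstract refers to.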

    Stock Prediction Based on Social Media Data via Sentiment Analysis: a Study on Reddit

    With the development of the internet and information technology, online text data has become available and accessible for research in many fields, including stock prediction. Social media, being one of the biggest content generators on the internet, is a rich data resource for text mining and stock prediction. It has a large capacity, high data density, and fast information spread. This thesis analyzes the relationship between stock-related text on social media (Reddit) and the price changes of the corresponding stocks. In the analysis, sentiment analysis is first applied to extract individual users’ emotions and opinions about the stocks. The extracted features are then analyzed via descriptive statistics and predictive analysis, using the Pearson correlation coefficient and machine learning models. The predictive analysis is designed to examine the dependence between the social media text data and stock price changes by evaluating the performance of predictions. Four indicators are used in the evaluation: “prediction accuracy on price change direction” and three indicators from simulated algorithmic trading experiments based on the prediction results, namely “total profit with trading strategy for single stock”, “daily profit efficiency of trading strategy” and “total profit with Portfolio trading strategy”. Compared with a Buy and Hold (B&H) baseline strategy, the predictions show good results in terms of “daily profit efficiency” and “total profit with Portfolio trading strategy”. The online forum text from Reddit is therefore shown to be correlated with future stock price changes, and it might be used to make more profit than the B&H strategy by incorporating its information into portfolio trading strategies
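
    Two of the measures named in this abstract, the Pearson correlation coefficient and the prediction accuracy on price change direction, can be sketched as follows. The function names and the numbers are hypothetical and are not data from the thesis. The sketch assumes a daily sentiment score is compared with the next day's price change, and that the sign of the sentiment score is used as the predicted direction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def direction_accuracy(predicted, actual):
    """Fraction of days on which predicted and actual changes share a sign."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical daily sentiment scores and next-day price changes (percent).
sentiment = [0.2, -0.1, 0.4, -0.3, 0.1]
next_day_change = [0.4, -0.2, 0.8, -0.6, -0.2]

r = pearson(sentiment, next_day_change)
acc = direction_accuracy(sentiment, next_day_change)
print(f"pearson r = {r:.3f}, direction accuracy = {acc:.2f}")
```

    A positive correlation on historical data does not by itself imply a profitable strategy, which is why the abstract also evaluates simulated trading against a Buy and Hold baseline.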

    Stock market prediction using machine learning classifiers and social media, news

    Accurate stock market prediction is of great interest to investors; however, stock markets are driven by volatile factors such as microblogs and news, which make it hard to predict a stock market index from historical data alone. The enormous stock market volatility emphasizes the need to effectively assess the role of external factors in stock prediction. Stock markets can be predicted using machine learning algorithms on information contained in social media and financial news, as this data can change investors’ behavior. In this paper, we use algorithms on social media and financial news data to discover the impact of this data on stock market prediction accuracy for ten subsequent days. To improve the performance and quality of predictions, feature selection and spam tweet reduction are performed on the data sets. Moreover, we perform experiments to find stock markets that are difficult to predict and those that are more influenced by social media and financial news. We compare the results of different algorithms to find a consistent classifier. Finally, to achieve maximum prediction accuracy, deep learning is used and some classifiers are ensembled. Our experimental results show that the highest prediction accuracies of 80.53% and 75.16% are achieved using social media and financial news, respectively. We also show that New York and Red Hat stock markets are hard to predict, that New York and IBM stocks are more influenced by social media, and that London and Microsoft stocks are more influenced by financial news. The random forest classifier is found to be consistent, and the highest accuracy of 83.22% is achieved by its ensemble

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its wide public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining

    Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns

    © 2016. Background and objective: Critical care patient events like sepsis or septic shock in intensive care units (ICUs) are dangerous complications which can cause multiple organ failures and eventual death. Preventive prediction of such events will allow clinicians to stage effective interventions for averting these critical complications. Methods: It is widely understood that the physiological conditions of patients, on variables such as blood pressure and heart rate, show gradual changes over a certain period of time prior to the occurrence of septic shock. This work investigates the performance of a novel machine learning approach for the early prediction of septic shock. The approach combines highly informative sequential patterns extracted from multiple physiological variables and captures the interactions among these patterns via coupled hidden Markov models (CHMM). In particular, the patterns are extracted from three non-invasive waveform measurements: the mean arterial pressure levels, the heart rates and the respiratory rates of septic shock patients from a large clinical ICU dataset called MIMIC-II. Evaluation and results: For baseline estimations, SVM and HMM models are employed on the continuous time series data for the given patients, using MAP (mean arterial pressure), HR (heart rate), and RR (respiratory rate). Single channel patterns based HMM (SCP-HMM) and multi-channel patterns based coupled HMM (MCP-HMM) are compared against the baseline models using 5-fold cross validation accuracies over multiple rounds. In particular, the results of MCP-HMM are statistically significant, with a p-value of 0.0014, in comparison to the baseline models. Our experiments demonstrate a strong competitive accuracy in the prediction of septic shock, especially when the interactions between the multiple variables are coupled by the learning model. Conclusions: The novelty of the approach stems from the integration of sequence-based physiological pattern markers with the sequential CHMM model to learn dynamic physiological behavior, as well as from the coupling of such patterns to build powerful risk stratification models for septic shock patients

    ActivityNET: Neural networks to predict public transport trip purposes from individual smart card data and POIs

    Predicting trip purpose from comprehensive and continuous smart card data is beneficial for transport and city planners in investigating travel behaviors and urban mobility. Here, we propose a framework, ActivityNET, using Machine Learning (ML) algorithms to predict passengers’ trip purpose from Smart Card (SC) data and Points-of-Interest (POIs) data. The feasibility of the framework is demonstrated in two phases. Phase I focuses on extracting activities from individuals’ daily travel patterns from smart card data and combining them with POIs using the proposed “activity-POIs consolidation algorithm”. Phase II feeds the extracted features into an Artificial Neural Network (ANN) with multiple scenarios and predicts trip purpose under primary activities (home and work) and secondary activities (entertainment, eating, shopping, child drop-offs/pick-ups and part-time work) with high accuracy. As a case study, the proposed ActivityNET framework is applied in Greater London and illustrates a robust competence to predict trip purpose. The promising outcomes demonstrate that the cost-effective framework offers high predictive accuracy and valuable insights into transport planning

    Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey

    Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language