30 research outputs found

    Efficient Text Classification with Linear Regression Using a Combination of Predictors for Flu Outbreak Detection

    Get PDF
Early prediction of disease outbreaks and seasonal epidemics such as influenza may reduce their impact on daily lives. Today, the web can be used for disease surveillance. Search engines and social networking sites can be used to track disease trends more quickly than government agencies such as the Centers for Disease Control and Prevention (CDC). Social networking sites (SNSs) are widely used by diverse demographic populations; thus, SNS data can be used effectively to track disease outbreaks and provide necessary warnings. Although the data generated by microblogging sites is valuable for real-time analysis and outbreak prediction, its volume is huge. One of the main challenges in analyzing this volume of data is therefore to find an approach that is both accurate and efficient. Many studies report only the accuracy of different machine learning approaches, regardless of analysis time. Current SNS-based flu detection and prediction frameworks apply conventional machine learning approaches that require lengthy training and testing, which is not optimal for new outbreaks with new signs and symptoms. The aim of this study is to propose an efficient and accurate framework that uses SNS data to track disease outbreaks and provide early warnings, even for the newest outbreaks. The presented framework consists of three main modules: text classification, mapping, and linear regression for weekly flu rate predictions. The text classification module uses sentiment analysis features and predefined keyword occurrences. Various classifiers, including FastText and six conventional machine learning algorithms, are evaluated to identify the most efficient and accurate one for the proposed framework. The text classifiers have been trained and tested using a pre-labeled dataset of flu-related and unrelated Twitter postings. The selected text classifier is then used to classify over 8,400,000 tweet documents. The flu-related documents are mapped on a weekly basis by a mapping module. Lastly, the mapped results are passed, together with historical CDC data, to a linear regression module for weekly flu rate predictions. The evaluation of flu tweet classification shows that FastText, together with the extracted features, achieves accurate results, with an F-measure of 89.9%, in addition to its efficiency. FastText has therefore been chosen as the classification module to work with the other modules in the proposed framework, including the linear regression module, for flu trend predictions. The prediction results are compared with recent CDC data as the ground truth and show a strong correlation of 96.2%.
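As a rough illustration of the classification step described above, the following sketch trains a supervised FastText model on pre-labeled tweets using the fasttext Python package; the file name, label scheme, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of supervised FastText tweet classification, assuming
# a training file in FastText's __label__ format; all names here are
# illustrative, not taken from the study.
import fasttext

# train.txt lines look like: "__label__flu feeling feverish and achy all day"
model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

labels, probs = model.predict("down with the flu, fever and chills")
print(labels[0], probs[0])  # e.g. ('__label__flu', 0.97)
```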

    Layered performance modelling and evaluation for cloud topic detection and tracking based big data applications

    No full text
“Big Data”, best characterized by its three features of “Variety”, “Volume”, and “Velocity”, is revolutionizing nearly every aspect of our lives, from enterprises to consumers and from science to government. A fourth characteristic, “Value”, is delivered through smart data analytics over Big Data. One such Big Data analytics application, considered in this thesis, is Topic Detection and Tracking (TDT). The characteristics of Big Data bring unprecedented challenges: data too large for traditional devices to process and store (volume), too fast for traditional methods to scale to (velocity), and heterogeneous (variety). In recent times, cloud computing has emerged as a practical and technical solution for processing Big Data. However, when deploying Big Data analytics applications such as TDT in the cloud (cloud-based TDT), the challenge is to cost-effectively orchestrate and provision cloud resources to meet performance Service Level Agreements (SLAs). Although limited work exists on performance modeling of cloud-based TDT applications, none of these methods can be directly applied to guarantee their performance SLAs. For instance, the current literature lacks a systematic, reliable, and accurate methodology to measure, predict, and ultimately guarantee the performance of TDT applications. Furthermore, existing performance models fail to consider the end-to-end complexity of TDT applications and focus only on individual processing components (e.g., MapReduce). To tackle this challenge, this thesis develops a layered performance model of cloud-based TDT applications that takes into account Big Data characteristics, the data and event flow across myriad cloud software and hardware resources, and diverse SLA considerations. In particular, we propose and develop models that capture, in detail and with great accuracy, the factors that play a pivotal role in the performance of cloud-based TDT applications, identify the ways in which these factors affect performance, and determine the dependencies between them. Further, we have developed models to predict the performance of cloud-based TDT applications under the uncertainty imposed by Big Data characteristics. The model developed in this thesis is designed to be generic, allowing its application to other cloud-based data analytics applications. We have demonstrated the feasibility, efficiency, validity, and prediction accuracy of the proposed models via experimental evaluations using a real-world flu detection use case on the Apache Hadoop MapReduce, HDFS, and Mahout frameworks.
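To make the layered-model idea concrete, here is a minimal sketch that sums per-layer processing time for a given data volume; the layer names and throughput figures are invented for exposition and stand in for the thesis's calibrated, far richer model.

```python
# Illustrative layered end-to-end latency estimate for a cloud TDT
# pipeline; the layers and throughputs below are assumptions for
# exposition, not measurements from the thesis.
def predict_latency(data_gb: float, layer_throughput_gbps: dict) -> float:
    """Sum per-layer processing times for a given input volume."""
    return sum(data_gb / gbps for gbps in layer_throughput_gbps.values())

layers = {  # hypothetical per-layer throughputs (GB/s)
    "hdfs_ingest": 0.8,
    "mapreduce_tokenize": 0.5,
    "mahout_clustering": 0.2,
}
print(f"{predict_latency(50.0, layers):.1f} s end-to-end")  # 50 GB batch
```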

    Preliminary Flu Outbreak Prediction Using Twitter Posts Classification and Linear Regression With Historical Centers for Disease Control and Prevention Reports: Prediction Framework Study

    Get PDF
Background: Social networking sites (SNSs) such as Twitter are widely used by diverse demographic populations. The amount of data within SNSs has created an efficient resource for real-time analysis. Thus, data from SNSs can be used effectively to track disease outbreaks and provide necessary warnings. Current SNS-based flu detection and prediction frameworks apply conventional machine learning approaches that require lengthy training and testing, which is not the optimal solution for new outbreaks with new signs and symptoms. Objective: The objective of this study was to propose an efficient and accurate framework that uses data from SNSs to track disease outbreaks and provide early warnings, even for the newest outbreaks. Methods: We presented a framework of outbreak prediction that included 3 main modules: text classification, mapping, and linear regression for weekly flu rate predictions. The text classification module used the features of sentiment analysis and predefined keyword occurrences. Various classifiers, including FastText (FT) and 6 conventional machine learning algorithms, were evaluated to identify the most efficient and accurate one for the proposed framework. The text classifiers were trained and tested using a prelabeled dataset of flu-related and unrelated Twitter postings. The selected text classifier was then used to classify over 8,400,000 tweet documents. The flu-related documents were then mapped on a weekly basis using a mapping module. Finally, the mapped results were passed together with historical Centers for Disease Control and Prevention (CDC) data to a linear regression module for weekly flu rate predictions. Results: The evaluation of flu tweet classification showed that FT, together with the extracted features, achieved accurate results with an F-measure value of 89.9% in addition to its efficiency. Therefore, FT was chosen as the classification module to work together with the other modules in the proposed framework, including a regression-based estimator, for flu trend predictions. The estimator was evaluated using several regression models. Regression results showed that the linear regression–based estimator achieved the highest accuracy as measured by Pearson correlation. Thus, the linear regression model was used for the module of weekly flu rate estimation. The prediction results were compared with the available recent data from CDC as the ground truth and showed a strong correlation of 96.29%. Conclusions: The results demonstrated the efficiency and the accuracy of the proposed framework, which can be used even for new outbreaks with new signs and symptoms. The classification results demonstrated that the FT-based framework improves both the accuracy and the efficiency of flu surveillance systems that use unstructured data such as data from SNSs.
https://doi.org/10.2196/1238
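A minimal sketch of the linear regression estimator and its Pearson-correlation evaluation described in the Methods and Results; the feature choice (this week's flu-tweet volume plus the previous week's CDC rate) and all numbers are illustrative assumptions, not the study's data.

```python
# Toy sketch of the weekly flu-rate regression step, assuming weekly
# flu-tweet counts have already been produced by the classifier and
# aligned with historical CDC rates; values are invented.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

tweet_counts = np.array([1200.0, 1850, 2400, 3100, 2900, 2200, 1500])
cdc_rates = np.array([1.1, 1.6, 2.2, 2.9, 2.7, 2.0, 1.4])

# Features: current week's flu-tweet volume and the previous week's CDC rate.
X = np.column_stack([tweet_counts[1:], cdc_rates[:-1]])
y = cdc_rates[1:]

model = LinearRegression().fit(X, y)
r, _ = pearsonr(model.predict(X), y)  # correlation against CDC ground truth
print(f"Pearson correlation: {r:.3f}")
```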

    Reinventing the Social Scientist and Humanist in the Era of Big Data

    Get PDF
This book explores the big data evolution by interrogating the notion that big data is a disruptive innovation that appears to be challenging existing epistemologies in the humanities and social sciences. Exploring various (controversial) facets of big data such as ethics, data power, and data justice, the book attempts to clarify the trajectory of the epistemology of (big) data-driven science in the humanities and social sciences.

    Big Data Processing Attribute Based Access Control Security

    Get PDF
The purpose of this research is to analyze the security of next-generation big data processing (BDP) systems and examine the feasibility of applying advanced security features to meet the needs of modern multi-tenant, multi-level data analysis. The research methodology was to survey the status of security mechanisms in BDP systems and identify areas that require further improvement. Access control (AC) security services were identified as a priority area, specifically Attribute-Based Access Control (ABAC). The exemplar BDP system analyzed is the Apache Hadoop ecosystem. We created data generation software and analysis programs, and posted the detailed experiment configuration on GitHub. Overall, our research indicates that significant security configuration is required before a BDP system such as Hadoop can be used in an operational environment. We believe the tools are available to achieve a secure system with ABAC using Apache Ranger and Apache Atlas. However, these systems are immature and require verification by an independent third party. We identified the following specific actions for overall improvement: consistent provisioning of security services through a data analyst workstation, a common backplane of security services, and a management console. These areas are partially satisfied in the current Hadoop ecosystem; continued AC improvement through the open-source community and rigorous independent testing should further address the remaining security challenges. Robust security will enable further use of distributed, clustered BDP systems such as Apache Hadoop and Hadoop-like systems to meet future government and business requirements.
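To illustrate the ABAC idea at the heart of this analysis, the sketch below evaluates a request against attribute-based rules; this is a generic illustration of the model, not the Apache Ranger or Atlas API, and all attribute names are hypothetical.

```python
# Generic ABAC evaluation sketch: a request is permitted only if its
# subject/resource/environment attributes satisfy every condition of
# some policy rule. Attribute names and policies are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Rule:
    # Each condition maps an attribute name to its set of allowed values.
    conditions: dict = field(default_factory=dict)

    def permits(self, attrs: dict) -> bool:
        return all(attrs.get(k) in allowed
                   for k, allowed in self.conditions.items())

policy = [
    Rule({"role": {"analyst"},
          "classification": {"public", "internal"},
          "tenant_match": {True}}),
    Rule({"role": {"admin"}}),
]

request = {"role": "analyst", "classification": "internal", "tenant_match": True}
print(any(rule.permits(request) for rule in policy))  # True -> access granted
```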

    Computational Methods for Medical and Cyber Security

    Get PDF
Over the past decade, computational methods, including machine learning (ML) and deep learning (DL), have grown exponentially in their development of solutions across various domains, especially medicine, cybersecurity, finance, and education. While these applications of machine learning algorithms have proven beneficial in various fields, many shortcomings have also been highlighted, such as the lack of benchmark datasets, the inability to learn from small datasets, the cost of architecture, adversarial attacks, and imbalanced datasets. On the other hand, new and emerging algorithms, such as deep learning, one-shot learning, continuous learning, and generative adversarial networks, have successfully solved various tasks in these fields. Therefore, applying these new methods to life-critical missions is crucial, as is measuring the success of these less-traditional algorithms when used in these fields.

    Geospatial Data Science to Identify Patterns of Evasion

    Get PDF
University of Minnesota Ph.D. dissertation. January 2018. Major: Computer Science. Advisor: Shashi Shekhar. 1 computer file (PDF); x, 153 pages.
Over the last decade, there has been significant growth in the availability of cheap raw spatial data in the form of GPS trajectories, activity/event locations, temporally detailed road networks, satellite imagery, etc. These data are collected, often around the clock, from location-aware applications, sensor technologies, etc., and represent an unprecedented opportunity to study our economic, social, and natural systems and their interactions. For example, finding hotspots (areas with an unusually high concentration of activities/events) from activity/event locations plays a crucial role in epidemiology, since it may help public health officials prevent the further spread of an infectious disease. In order to extract useful information from these datasets, many geospatial data tools have been proposed in recent years. However, these tools are often used as a "black box", where a trial-and-error strategy is applied with multiple approaches from different scientific disciplines (e.g., statistics, mathematics, and computer science) to find the best solution, with little or no consideration of the actual phenomena being investigated. Hence, the results may be biased, or important information may be missed. To address this problem, we need geospatial data science with a stronger scientific foundation to understand the actual phenomena, develop reliable and trustworthy models, and extract information through a scientific process. Thus, my thesis takes a wide-lens perspective on geospatial data science, considering it a transdisciplinary field comprising statistics, mathematics, and computer science. This approach aims to reduce redundant work across disciplines and to define the scientific boundaries of geospatial data science, distinguishing it from a black box that claims to solve every possible geospatial problem. In my proposed approaches, I used ideas from those three disciplines: spatial scan statistics from statistical science to reduce chance patterns in the output and provide statistical robustness; mathematical definitions of the geometric shapes of patterns, which maintain correctness and completeness; and computational approaches (along with a prune-and-refine framework and dynamic programming ideas) to scale up to large spatial datasets. In addition, the proposed approaches incorporate domain-specific geographic theories (e.g., routine activity theory in criminology) for applicability in domains interested in specific patterns that occur due to actual phenomena. The proposed techniques have been applied to real-world disease and crime datasets, and the evaluations confirmed that they outperform the current state of the art, such as density-based clustering approaches and circular hotspot detection methods.
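For a flavour of the spatial scan statistic mentioned above, the following sketch scores circular candidate zones with a Kulldorff-style Poisson log-likelihood ratio; the synthetic points, fixed radius, and uniform-population assumption are illustrative simplifications, not the dissertation's method.

```python
# Circular spatial-scan sketch: score each candidate circle by a
# Poisson log-likelihood ratio; synthetic data with one injected cluster.
import numpy as np

def poisson_llr(c_in, c_total, frac_pop_in):
    """Log-likelihood ratio for a candidate zone (higher = stronger hotspot)."""
    e_in = c_total * frac_pop_in  # expected cases inside, uniform population
    if e_in == 0 or c_in <= e_in:
        return 0.0
    c_out, e_out = c_total - c_in, c_total - e_in
    return c_in * np.log(c_in / e_in) + c_out * np.log(c_out / e_out)

rng = np.random.default_rng(0)
pts = rng.uniform(0, 10, size=(500, 2))                    # background events
pts = np.vstack([pts, rng.normal([7, 7], 0.4, (80, 2))])   # injected cluster

radius, best = 1.0, (0.0, None)
for center in pts:  # candidate circles centered at event locations
    inside = np.linalg.norm(pts - center, axis=1) <= radius
    llr = poisson_llr(inside.sum(), len(pts), np.pi * radius**2 / 100.0)
    if llr > best[0]:
        best = (llr, center)
print(f"best LLR {best[0]:.1f} near {best[1].round(1)}")
```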

    Big Data mining and machine learning techniques applied to real world scenarios

    Get PDF
Data mining techniques allow the extraction of valuable information from heterogeneous and possibly very large data sources, which can be either structured or unstructured. Unstructured data, such as text files, social media, and mobile data, are far more abundant than structured data and grow at a higher rate. Their high volume and the inherent ambiguity of natural language make unstructured data very hard to process and analyze. Appropriate text representations are therefore required in order to capture word semantics as well as to preserve statistical information, e.g., word counts. In Big Data scenarios, scalability is also a primary requirement. Data mining and machine learning approaches should take advantage of large-scale data, exploiting abundant information while avoiding the curse of dimensionality. The goal of this thesis is to enhance text understanding in the analysis of big data sets, introducing novel techniques that can be employed for the solution of real-world problems. The presented Markov methods temporarily achieved the state of the art on well-known Amazon review corpora for cross-domain sentiment analysis, before being outperformed by deep approaches in the analysis of large data sets. A noise detection method for the identification of relevant tweets leads to 88.9% accuracy in daily prediction of the Dow Jones Industrial Average, the best result in the literature based on social networks. Dimensionality reduction approaches are used in combination with LinkedIn users' skills to perform job recommendation. A framework based on deep learning and Markov Decision Processes is designed to model job transitions and recommend pathways towards a given career goal. Finally, parallel primitives for vendor-agnostic implementation of Big Data mining algorithms are introduced to foster multi-platform deployment, code reuse, and optimization.
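As a hedged sketch of the kind of Markov method mentioned for cross-domain sentiment analysis, the following toy classifier compares class-conditional word-bigram likelihoods with add-one smoothing; the corpus, smoothing scheme, and vocabulary size are illustrative assumptions, not the thesis's formulation.

```python
# First-order Markov text classifier: score a document under each
# class's word-transition model and pick the higher likelihood.
# Toy corpus and add-one smoothing are for illustration only.
from collections import defaultdict
import math

def train(docs):
    """Count word-bigram transitions across the documents of one class."""
    counts = defaultdict(lambda: defaultdict(int))
    for words in docs:
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def log_prob(words, counts, vocab_size):
    """Smoothed log-likelihood of the word sequence under the model."""
    lp = 0.0
    for a, b in zip(words, words[1:]):
        row = counts[a]
        lp += math.log((row[b] + 1) / (sum(row.values()) + vocab_size))
    return lp

pos = train([["great", "sound", "quality"], ["great", "battery", "life"]])
neg = train([["poor", "sound", "quality"], ["stopped", "working", "fast"]])
doc, V = ["great", "sound"], 10
print("pos" if log_prob(doc, pos, V) > log_prob(doc, neg, V) else "neg")
```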

    Language change and evolution in Online Social Networks

    Get PDF
Language is in constant flux, whether through the creation of new terms or the changing meanings of existing words. Language change happens through complex reinforcing interactions between individuals and the social structures in which they exist. There has been much research into language change and evolution, though it has involved manual processes that are both time-consuming and costly. However, with the growth in popularity of online social networks (OSNs), researchers have, for the first time, access to fine-grained records of language and user interactions that not only contain data on the creation of language innovations but also reveal the inter-user and inter-community dynamics that influence their adoption and rejection. Access to these OSN datasets means that language change and evolution can now be assessed and modelled through the application of computational and machine-learning-based methods. This thesis therefore looks at how one can detect and predict language change in OSNs, as well as the factors on which language change depends. The answer to this over-arching question lies in three core components: first, detecting the innovations; second, modelling the individual user adoption process; and third, looking at collective adoption across a network of individuals. For the first question, we operationalise traditional language acceptance heuristics (used to detect the emergence of new words) into three classes of computational time-series measures capturing variation in frequency, form, and/or meaning. The grounded methods are applied to two OSNs, with results demonstrating the ability to detect language change across both networks. By additionally applying the methods to communities within each network, e.g. geographical regions on Twitter and subreddits on Reddit, the results indicate that language variation and change can depend on community membership. The second question focuses on the process of users adopting language innovations in relation to other users with whom they are in contact. By modelling influence between users as a function of past innovation cascades, we compute a global activation threshold at which users adopt new terms, dependent on their exposure to them from their neighbours. Additionally, by testing the user interaction networks through random shuffles, we show that the time at which a user adopts a term depends on the local structure; however, a large part of the influence comes from sources external to the observed OSN. The final question looks at how the speakers of a language are embedded in social networks, and how the networks' resulting structures and dynamics influence language usage and adoption patterns. We show that language innovations diffuse across a network in a predictable manner, which can be modelled using structural, grammatical, and temporal measures, with a logistic regression model predicting the vitality of the diffusion. With regard to network structure, we show how innovations that manifest across structural holes and weak ties diffuse deeper into the given network. Beyond network influence, our results demonstrate that the grammatical context through which innovations emerge also plays an essential role in diffusion dynamics; this indicates that the adoption of new words is enabled by a complex interplay of both network and linguistic factors.
The three questions together answer the over-arching question, showing that one can indeed model language change and forecast user and community adoption of language innovations. Additionally, we show that grounded models and methods can be applied within a scalable computational framework. However, this is a challenging process, heavily influenced by underlying processes that are not recorded within the data from the OSNs.
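To illustrate the global-activation-threshold idea from the second question, here is a toy simulation in which a user adopts an innovation once a threshold fraction of neighbours has adopted it; the network (networkx's karate club graph), seed users, and threshold value are stand-ins for the thesis's empirical cascades.

```python
# Toy exposure-threshold adoption model on a stand-in social network;
# the graph, seeds, and threshold are illustrative assumptions.
import networkx as nx

def simulate_adoption(g, seeds, threshold=0.3):
    """Spread adoption until no further user crosses the exposure threshold."""
    adopted, changed = set(seeds), True
    while changed:
        changed = False
        for node in g:
            if node in adopted:
                continue
            nbrs = list(g.neighbors(node))
            if nbrs and sum(n in adopted for n in nbrs) / len(nbrs) >= threshold:
                adopted.add(node)
                changed = True
    return adopted

g = nx.karate_club_graph()  # stand-in user interaction network
print(len(simulate_adoption(g, seeds={0, 33})), "of", g.number_of_nodes())
```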