
    Prediction of user behaviour on the web

    The Web has become a ubiquitous environment for human interaction, communication, and data sharing. As a result, large amounts of data are produced. This data can be utilised to build predictive models of user behaviour in order to support business decisions. However, the fast pace of modern businesses is putting pressure on industry to deliver faster and better decisions. This thesis addresses this challenge by proposing a novel methodology for efficient prediction of user behaviour. The problems concerned are: (i) modelling user behaviour on the Web, (ii) choosing and extracting features from data generated by user behaviour, and (iii) choosing a Machine Learning (ML) set-up for efficient prediction. First, a novel Time-Varying Attributed Graph (TVAG) is introduced, and a TVAG-based model for modelling user behaviour on the Web is proposed. TVAGs capture temporal properties of user behaviour through the time-varying features of their nodes and edges. Second, the proposed model allows features to be extracted for further ML predictions. However, extracting the features and building the model may be an unacceptably hard and lengthy process. Thus, a guideline for efficient feature extraction from the TVAG-based model is proposed. Third, a method for choosing an ML set-up to build an accurate and fast predictive model is proposed and evaluated. Finally, a deep learning architecture for predicting user behaviour on the Web is proposed and evaluated. To sum up, the main contribution to knowledge of this work is the development of a methodology for fast and efficient prediction of user behaviour on the Web. The methodology is evaluated on datasets from several Web platforms, namely Stack Exchange, Twitter, and Facebook.
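The TVAG itself is defined formally in the thesis; purely as a rough sketch of the idea, the Python snippet below (all class and method names, such as `TVAG` and `snapshot`, are hypothetical) stores node and edge attributes keyed by timestamp, so that the feature values in effect at any point in time can be read off for later ML feature extraction.

```python
from collections import defaultdict

class TVAG:
    """Minimal time-varying attributed graph sketch: nodes and edges
    carry attribute values indexed by timestamp."""

    def __init__(self):
        self.node_attrs = defaultdict(dict)   # node -> {t: {attr: value}}
        self.edge_attrs = defaultdict(dict)   # (u, v) -> {t: {attr: value}}
        self.adj = defaultdict(set)           # node -> neighbours

    def set_node_attrs(self, node, t, **attrs):
        self.node_attrs[node].setdefault(t, {}).update(attrs)

    def add_edge(self, u, v, t, **attrs):
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.edge_attrs[(u, v)].setdefault(t, {}).update(attrs)

    def snapshot(self, t):
        """Attributes of all nodes/edges observed up to time t (latest value wins)."""
        nodes = {n: {k: v for ts in sorted(a) if ts <= t for k, v in a[ts].items()}
                 for n, a in self.node_attrs.items()}
        edges = {e: {k: v for ts in sorted(a) if ts <= t for k, v in a[ts].items()}
                 for e, a in self.edge_attrs.items()}
        return nodes, edges

# Toy example: one user answers another's question two hours later.
g = TVAG()
g.set_node_attrs("user_1", t=0, reputation=120)
g.set_node_attrs("user_2", t=0, reputation=45)
g.add_edge("user_1", "user_2", t=7200, interaction="answer")
```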

    When Things Matter: A Data-Centric View of the Internet of Things

    With the recent advances in radio-frequency identification (RFID), low-cost wireless sensor devices, and Web technologies, the Internet of Things (IoT) approach has gained momentum in connecting everyday objects to the Internet and facilitating machine-to-human and machine-to-machine communication with the physical world. While IoT offers the capability to connect and integrate both digital and physical entities, enabling a whole new class of applications and services, several significant challenges need to be addressed before these applications and services can be fully realized. A fundamental challenge centers around managing IoT data, typically produced in dynamic and volatile environments, which is not only extremely large in scale and volume but also noisy and continuous. This article surveys the main techniques and state-of-the-art research efforts in IoT from a data-centric perspective, including data stream processing, data storage models, complex event processing, and searching in IoT. Open research issues for IoT data management are also discussed.
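The survey discusses data stream processing at a conceptual level; as a toy illustration of the continuous, windowed computation it refers to (not anything taken from the article itself), the sketch below maintains a sliding-window average over a stream of IoT sensor readings.

```python
from collections import deque

def sliding_window_average(readings, window_seconds=60):
    """Yield (timestamp, mean of readings seen within the last `window_seconds`).

    `readings` is an iterable of (timestamp, value) pairs in time order,
    e.g. temperature samples from a sensor.
    """
    window = deque()   # (timestamp, value) pairs currently inside the window
    total = 0.0
    for t, v in readings:
        window.append((t, v))
        total += v
        # Evict readings that have fallen out of the window.
        while window and window[0][0] < t - window_seconds:
            _, old_v = window.popleft()
            total -= old_v
        yield t, total / len(window)

# Example: noisy samples arriving every 10 seconds.
stream = [(i * 10, 20.0 + (i % 5)) for i in range(30)]
for ts, avg in sliding_window_average(stream):
    print(ts, round(avg, 2))
```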

    Development and Applications of Similarity Measures for Spatial-Temporal Event and Setting Sequences

    Similarity or distance measures between data objects are applied frequently in many fields or domains, such as geography, environmental science, biology, economics, computer science, linguistics, logic, business analytics, and statistics. One area where similarity measures are particularly important is the analysis of spatiotemporal event sequences and associated environs or settings. This dissertation focuses on developing a framework of modeling, representation, and new similarity measure construction for sequences of spatiotemporal events and corresponding settings, which can be applied to different event data types and used in different areas of data science. The first core part of this dissertation presents a matrix-based spatiotemporal event sequence representation that unifies punctual and interval-based representations of events. This framework supports different event data types and provides support for data mining and for sequence classification and clustering. The similarity measure is based on a modified Jaccard index with temporal order constraints and accommodates different event data types. This approach is demonstrated through simulated data examples, and the performance of the similarity measures is evaluated with a k-nearest neighbor (k-NN) classification test on synthetic datasets. These similarity measures are incorporated into a clustering method and their usefulness is successfully demonstrated in a case study analysis of event sequences extracted from space-time series of a water quality monitoring system. The dissertation further proposes a new similarity measure for event setting sequences, which involve the space and time in which events occur. While similarity measures for spatiotemporal event sequences have been studied, the settings and setting sequences have not yet been considered. In modeling event setting sequences, spatial and temporal scales are considered to define the bounds of the setting, and dynamic variables are incorporated along with static variables. Using a matrix-based representation and an extended Jaccard index, new similarity measures are developed to allow for the use of all variable data types. With these similarity measures coupled with other multivariate statistical analysis approaches, results from a case study involving setting sequences and pollution event sequences associated with the same monitoring stations support the hypothesis that more similar spatial-temporal settings or setting sequences may generate more similar events or event sequences. To test the scalability of the STES similarity measure on a larger dataset and its application in a different field, this dissertation compares and contrasts the prospective space-time scan statistic with the STES similarity approach for identifying COVID-19 hotspots. The COVID-19 pandemic has highlighted the importance of detecting hotspots or clusters of COVID-19 to provide decision makers at various levels with better information for managing the distribution of human and technical resources as the outbreak in the USA continues to grow. The prospective space-time scan statistic has been used to help identify emerging disease clusters, yet results from this approach can encounter limitations imposed by the spatial constraints of the scanning window.
    The STES-based approach adapted for this pandemic context computes the similarity of evolving normalized COVID-19 daily cases by county and clusters these to identify counties with similarly evolving COVID-19 case histories. This dissertation analyzes the spread of COVID-19 within the continental US over four periods beginning in late January 2020, using the COVID-19 datasets maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE). Results of the two approaches complement each other and, taken together, can aid in tracking the progression of the pandemic. Overall, the dissertation highlights the importance of developing similarity measures for analyzing spatiotemporal event sequences and associated settings, which can be applied to different event data types and used for data mining, sequence classification, and clustering.
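The dissertation's modified Jaccard measure is defined precisely in the text itself; the sketch below is only a simplified illustration of the general idea, assuming a binary event-type-by-time-bin matrix representation and counting events as shared only when they fall in the same time bin (all names and the toy water-quality events are hypothetical).

```python
import numpy as np

def event_matrix(sequence, event_types, n_bins):
    """Binary matrix M[i, j] = 1 if event type i occurs in time bin j."""
    m = np.zeros((len(event_types), n_bins), dtype=int)
    index = {e: i for i, e in enumerate(event_types)}
    for event, t_bin in sequence:
        m[index[event], t_bin] = 1
    return m

def temporal_jaccard(a, b):
    """Jaccard index computed bin-by-bin, so temporal order matters:
    events only count as shared if they occur in the same time bin."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else inter / union

types = ["rain", "high_turbidity", "low_oxygen"]
s1 = [("rain", 0), ("high_turbidity", 1), ("low_oxygen", 2)]
s2 = [("rain", 0), ("high_turbidity", 2)]
m1, m2 = event_matrix(s1, types, 4), event_matrix(s2, types, 4)
print(temporal_jaccard(m1, m2))   # shared 'rain' in bin 0 only -> 0.25
```

Pairwise similarities of this kind can then be turned into distances (1 - similarity) and passed to a k-NN classifier or a clustering routine that accepts a precomputed distance matrix.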

    Developing Leading and Lagging Indicators to Enhance Equipment Reliability in a Lean System

    With increasing equipment complexity, failure rates have become a critical metric because of unplanned maintenance in production environments. Unplanned maintenance in a manufacturing process creates downtime and decreases equipment reliability. Equipment failures result in lost revenue, which has encouraged maintenance practitioners to look for ways to convert unplanned maintenance into planned maintenance. Efficient failure prediction models are being developed to learn about failures in advance; predicted failures can then reduce system downtime and improve throughput. The goal of this thesis is to predict failure in centrifugal pumps using machine learning models such as random forest, stochastic gradient boosting, and extreme gradient boosting. For accurate prediction, historical sensor measurements were transformed into leading and lagging indicators that explain the failure patterns in the equipment. The best subset of indicators was selected by filtering with a random forest and used in the developed model. The models output a probability of failure before the failure occurs, and appropriate evaluation metrics were used to select the most accurate model. The proposed methodology was illustrated with two case studies: first, on centrifugal pump asset performance data provided by Meridium, Inc., and second, on aircraft turbine engine data from the NASA prognostics data repository. The automated methodology was shown to develop and identify appropriate leading and lagging failure indicators in both cases and to facilitate machine learning model development.
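The thesis' actual indicator construction and model selection are described in the text; the sketch below only illustrates the general pattern under assumed, hypothetical column names (`vibration`, `pressure`, `failure`): derive lagging indicators from sensor history, define a leading target (failure within a short horizon), and fit a random forest that outputs a failure probability.

```python
# Minimal sketch, not the thesis' pipeline: lag/rolling features + random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def build_indicators(df, sensor_cols, lags=(1, 3, 6), horizon=6):
    out = df.copy()
    for col in sensor_cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)            # lagging indicator
            out[f"{col}_roll{lag}"] = out[col].rolling(lag).mean()  # smoothed history
    # Leading target: does a failure occur within the next `horizon` readings?
    out["fails_soon"] = (
        out["failure"].rolling(horizon).max().shift(-horizon).fillna(0).astype(int)
    )
    return out.dropna()

# Hypothetical usage with a CSV of hourly pump readings:
# df = pd.read_csv("pump_sensors.csv")                      # vibration, pressure, failure columns
# data = build_indicators(df, ["vibration", "pressure"])
# X = data.drop(columns=["failure", "fails_soon"])
# y = data["fails_soon"]
# model = RandomForestClassifier(n_estimators=300).fit(X, y)
# p_fail = model.predict_proba(X)[:, 1]                     # probability of failure before it occurs
```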

    Report on the Information Retrieval Festival (IRFest2017)

    The Information Retrieval Festival took place in April 2017 in Glasgow. The focus of the workshop was to bring together IR researchers from the various Scottish universities and beyond in order to facilitate more awareness, increased interaction, and reflection on the status of the field and its future. The program included an industry session, research talks, demos, and posters, as well as two keynotes. The first keynote was delivered by Prof. Jaana Kekäläinen, who provided a historical, critical reflection on realism in Interactive Information Retrieval experimentation, while the second keynote was delivered by Prof. Maarten de Rijke, who argued for more Artificial Intelligence usage in IR solutions and deployments. The workshop was followed by a "Tour de Scotland" in which delegates were taken from Glasgow to Aberdeen for the European Conference on Information Retrieval (ECIR 2017).

    Cheating in online gaming spreads through observation and victimization

    Antisocial behavior can be contagious, spreading from individual to individual and rippling through social networks. Moreover, it can spread not only through third-party influence from observation, just like innovations or individual behavior do, but also through direct experience, via “pay-it-forward” retaliation. Here, we distinguish between the effects of observation and victimization for the contagion of antisocial behavior by analyzing large-scale digital trace data. We study the spread of cheating in more than a million matches of an online multiplayer first-person shooter game, in which up to 100 players compete individually or in teams against strangers. We identify event sequences in which a player who observes or is killed by a certain number of cheaters starts cheating, and we evaluate the extent to which these sequences would appear if we preserved the team and interaction structure but assumed alternative gameplay scenarios. The results reveal that social contagion is only likely to exist for those who both observe and experience cheating, suggesting that third-party influence and “pay-it-forward” reciprocity interact positively. In addition, the effect is present only for those who both observe and experience more than once, suggesting that cheating is more likely to spread after repeated or multi-source exposure. Approaching online games as models of social systems, we use the findings to discuss strategies for targeted interventions to stem the spread of cheating and antisocial behavior more generally in online communities, schools, organizations, and sports.
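The authors' null-model construction is specific to their data; the sketch below is a loose, hypothetical illustration of the general approach of comparing observed "victimized, then starts cheating" sequences against randomized alternative scenarios that preserve the match structure (all field names are assumptions, not the paper's schema).

```python
import random

def count_contagion(matches, k=2):
    """matches: list of dicts with 'kills' as (killer, victim) pairs,
    'cheaters' as the set of cheating players, and 'new_cheaters' as the
    set of players who start cheating after the match."""
    n = 0
    for m in matches:
        exposure = {}
        for killer, victim in m["kills"]:
            if killer in m["cheaters"]:
                exposure[victim] = exposure.get(victim, 0) + 1
        n += sum(1 for p, c in exposure.items() if c >= k and p in m["new_cheaters"])
    return n

def null_distribution(matches, k=2, trials=1000):
    """Reassign cheater labels at random within each match, keeping the kill network fixed."""
    counts = []
    for _ in range(trials):
        shuffled = []
        for m in matches:
            players = {p for kill in m["kills"] for p in kill}
            fake = set(random.sample(sorted(players), min(len(m["cheaters"]), len(players))))
            shuffled.append({**m, "cheaters": fake})
        counts.append(count_contagion(shuffled, k))
    return counts

# observed = count_contagion(matches)
# null = null_distribution(matches)
# p-value ~ fraction of null counts >= observed
```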

    Mining sequences in distributed sensors data for energy production.

    Brief Overview of the Problem: The Environmental Protection Agency (EPA), a government-funded agency, holds both legislative and judicial powers for emissions monitoring in the United States. The agency crafts laws based on self-made regulations to force companies to operate within the limits of the law, resulting in environmentally safe operation. Specifically, power companies operate electric generating facilities under guidelines drawn up and enforced by the EPA. Acid rain and other harmful factors require that electric generating facilities report hourly emissions recorded via a Supervisory Control and Data Acquisition (SCADA) system. SCADA is a control and reporting system, present in all power plants, consisting of sensors and control mechanisms that monitor all equipment within the plants. The data recorded by a SCADA system is collected by the EPA and allows it to enforce proper plant operation relating to emissions. This data includes many generating-unit- and power-plant-specific details, including hourly generation. This hourly generation (termed grossunitload by the EPA) is the actual hourly average output of the generator on a per-unit basis. The questions to be answered are: do any of these units operate in tandem, and do any of the units start, stop, or change operation as a result of another's change in generation? These questions will be answered for the period April 2002 through April 2003 for facilities that operate pipeline natural-gas-fired generating units. Purpose of Research: The research conducted has dual uses if fruitful. First, a local model relating generating units would be highly valuable to energy traders. Betting that a plant will operate a unit based on another's current characteristics would be sensationally profitable to energy traders. This profitability varies with fuel type; for instance, if the price of coal is extremely high due to shortages, the value of knowing a semi-operating characteristic of two generating units is highly valuable. Second, this known characteristic can also be used in regulation and operational modeling, which is of great importance to government agencies. If regulatory committees can be aware of past (or current) similarities between power producers, they may be able to avoid a power struggle in a region caused by greedy traders or companies. Setting aside profit motives, the Department of Energy may use something similar to generate a model of power grid generation availability based on previous data for reliability purposes. Type of Problem: The problem tackled within this Master's thesis is one of multiple time series pattern recognition. This field is expansive and well studied; therefore, the research performed benefits from previously known techniques. The author has chosen to experiment with conventional techniques such as correlation, principal component analysis, and k-means clustering for feature and eventually pattern extraction. For the primary analysis performed, the author chose a conventional sequence discovery algorithm. The sequence discovery algorithm has no prior knowledge of space limitations; therefore, it searches over the entire space, resulting in an expensive but complete process. Prior to sequence discovery, the author applies a uniform coding schema to the raw data, an adaptation of a coding schema presented by Keogh. This coding and discovery process is termed USD, or Uniform Sequence Discovery.
    The data is high-dimensional, extremely dynamic, and sporadic with regard to magnitude. The energy market that demands power generation is profit driven and somewhat reliability driven. The obvious factors are more reliability based; for instance, to keep system frequency at 60 Hz, units may operate in an idle state, resulting in a constant or very low value for a period of time (idle time). Also, to avoid large frequency swings on the power grid, companies are require
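The USD coding schema and the sequence discovery algorithm are defined in the thesis itself; as a loose illustration of the two ingredients named above, the sketch below discretizes two hourly generation series into a small symbol alphabet (in the spirit of Keogh's coding) and then looks for symbol subsequences that occur in both series. All names and numbers are hypothetical, not the thesis' USD procedure.

```python
import numpy as np

def uniform_code(series, alphabet="abcd"):
    """Map a numeric series onto equal-width bins labelled by letters."""
    lo, hi = np.min(series), np.max(series)
    edges = np.linspace(lo, hi, len(alphabet) + 1)[1:-1]
    return "".join(alphabet[np.searchsorted(edges, v)] for v in series)

def shared_subsequences(code_a, code_b, length=4):
    """Symbol subsequences of a given length that occur in both coded series."""
    grams_a = {code_a[i:i + length] for i in range(len(code_a) - length + 1)}
    grams_b = {code_b[i:i + length] for i in range(len(code_b) - length + 1)}
    return grams_a & grams_b

# Toy hourly generation profiles for two units (MW).
unit1 = uniform_code([0, 0, 50, 80, 80, 75, 20, 0, 0, 55, 82, 78])
unit2 = uniform_code([5, 0, 48, 79, 81, 70, 25, 5, 0, 60, 80, 77])
print(shared_subsequences(unit1, unit2))
```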