131 research outputs found

    A lossless online Bayesian classifier.

    We are living in a world progressively driven by data. Besides the issue that big data cannot be stored entirely in main memory, as traditional offline learning methods require, the problem of learning from data that can only be collected over time is also very prevalent. Consequently, there is a need for online methods that can handle sequentially arriving data and offer the same accuracy as offline methods. In this paper, we introduce a new lossless online Bayesian-based classifier which processes the arriving data one sample at a time and discards each sample right after use. The lossless property of our proposed method guarantees that it reaches the same prediction performance as its offline counterpart regardless of the incremental training order. Experimental results demonstrate its superior performance over many well-known state-of-the-art online learning methods in the literature.
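    The abstract does not spell out the classifier's internals, but one way an online Bayesian classifier can be lossless regardless of training order is to keep only order-independent sufficient statistics (class and feature-value counts). The sketch below is a minimal illustration of that idea for a categorical naive Bayes; the class name, update logic, and toy data are assumptions for illustration, not the paper's actual method.

```python
# Illustrative sketch (not the paper's algorithm): an online categorical
# naive Bayes whose entire state is a set of counts. Because counts are
# order-independent, training one sample at a time and discarding it
# yields the same model as batch training on the full dataset.
from collections import defaultdict
import math

class OnlineNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)     # N(y)
        self.feature_counts = defaultdict(int)   # N(feature index, value, y)
        self.total = 0

    def partial_fit(self, x, y):
        """Update counts with one sample; the sample can then be discarded."""
        self.class_counts[y] += 1
        self.total += 1
        for i, v in enumerate(x):
            self.feature_counts[(i, v, y)] += 1

    def predict(self, x):
        """Return the class with the highest Laplace-smoothed posterior."""
        best, best_score = None, -math.inf
        for y, n_y in self.class_counts.items():
            score = math.log(n_y / self.total)
            for i, v in enumerate(x):
                score += math.log((self.feature_counts[(i, v, y)] + 1) / (n_y + 2))
            if score > best_score:
                best, best_score = y, score
        return best

# Training order does not matter: both models end up with identical counts.
a, b = OnlineNaiveBayes(), OnlineNaiveBayes()
data = [((1, 0), "spam"), ((0, 0), "ham"), ((1, 1), "spam")]
for x, y in data:
    a.partial_fit(x, y)
for x, y in reversed(data):
    b.partial_fit(x, y)
assert a.predict((1, 0)) == b.predict((1, 0))
```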

    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge amounts of georeferenced data streams arrive daily at data stream management systems that are deployed to serve highly scalable and dynamic applications. There are innumerable ways in which those streams can be exploited to gain deep insights in various domains. Decision makers require an interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness; those are the two predominant factors that greatly impact the overall quality of service. This requires data stream management systems to be attuned to those factors in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, that comprises several subsystems covering those workloads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, thus relieving users at the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee, leaving logistic handling to the underlying layers. We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
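    As a rough illustration of the quality-of-service idea described above, the sketch below shows a user-facing query specification carrying a latency goal that a toy optimizer maps to plan parameters such as parallelism and batch interval. All names, thresholds, and the heuristic itself are invented for illustration; they are not SpatialDSMS APIs or algorithms.

```python
# Hypothetical sketch: a quality goal attached to a stream query is compiled
# into plan parameters. Numbers and names are invented for illustration.
from dataclasses import dataclass

@dataclass
class QualityGoal:
    max_latency_ms: int        # user-facing quality objective

@dataclass
class QueryPlan:
    parallelism: int           # knobs the toy optimizer controls
    batch_interval_ms: int

def compile_with_qos(expected_rate_per_s: float, goal: QualityGoal) -> QueryPlan:
    # Toy heuristic: assume each worker sustains ~10k events/s within the
    # latency budget, and cap the micro-batch interval by the latency goal.
    workers = max(1, int(expected_rate_per_s // 10_000) + 1)
    return QueryPlan(parallelism=workers,
                     batch_interval_ms=min(goal.max_latency_ms // 2, 500))

plan = compile_with_qos(expected_rate_per_s=45_000,
                        goal=QualityGoal(max_latency_ms=1_000))
print(plan)   # QueryPlan(parallelism=5, batch_interval_ms=500)
```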

    Darknet Traffic Analysis: A Systematic Literature Review

    The primary objective of an anonymity tool is to protect the anonymity of its users through the implementation of strong encryption and obfuscation techniques. As a result, it becomes very difficult to monitor and identify users' activities on these networks. Moreover, such systems have strong defensive mechanisms to protect users against potential risks, including the extraction of traffic characteristics and website fingerprinting. However, the strong anonymity feature also functions as a refuge for those involved in illicit activities who aim to avoid being traced on the network. As a result, a substantial body of research has been undertaken to examine and classify encrypted traffic using machine learning techniques. This paper presents a comprehensive examination of the existing approaches utilized for the categorization of anonymous traffic as well as encrypted network traffic inside the darknet. It also analyzes methods that apply machine learning techniques to darknet traffic in order to monitor and identify attacks inside the darknet. (Comment: 35 pages, 13 figures)
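    As a minimal illustration of the flow-based machine learning classification these surveys cover, the sketch below feeds synthetic statistical flow features (no payload inspection) to a standard classifier. The features, labels, and model choice are placeholders, not a method from any surveyed paper.

```python
# Minimal sketch of flow-based traffic classification: statistical flow
# features fed to an off-the-shelf classifier. Data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Columns: mean packet size, flow duration (s), packets/s, mean inter-arrival (ms)
X = rng.normal(loc=[800, 30, 50, 20], scale=[200, 10, 15, 5], size=(1000, 4))
y = rng.integers(0, 2, size=1000)          # 0 = benign, 1 = suspicious (toy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")   # ~0.5 on random labels
```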

    On the Use of Vision and Range Data for Scene Understanding

    The availability of large-scale datasets facilitates the training of very deep neural networks. Deep neural networks performing visual tasks have been boosted by a large number of labeled 2D images as well as synthesized 2D images. On the other hand, Light Detection and Ranging (LIDAR) sensors, which use laser pulses to determine accurate distances (ranges) from objects to the sensor, have gained a wide range of applications in robotics, especially in autonomous driving. The resulting 3D lidar point clouds enable perceptual tasks such as 3D object detection, semantic segmentation, and panoptic segmentation. These tasks are all essential for holistic scene understanding and play important roles in robotics, such as in the perceptual systems of driverless vehicles. Despite the large amount of training data and powerful deep neural networks, performing these visual tasks still suffers from several challenges: 1) long-tail distribution of object categories and viewpoints; 2) self-occlusion and occlusion between objects; 3) small appearances; 4) appearance variance; and 5) demand for real-time performance. These challenges degrade the performance of the algorithms and pose threats in safety-critical conditions, especially in autonomous driving scenarios. This dissertation investigates these challenges using both vision and range data to train deep learning models for scene understanding and proposes solutions to improve the robustness of the algorithms. The first part of the dissertation focuses on using synthesized 2D images to tackle the long-tail distribution challenge in 2D image understanding. The second part extends the deep learning pipeline to lidar point clouds and focuses on addressing the challenges exclusive to learning from lidar point clouds.
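    As a small, self-contained example of working with range data, the sketch below projects a lidar point cloud into a bird's-eye-view occupancy grid, a common preprocessing step before detection or segmentation networks. The grid resolution, ranges, and random point cloud are assumptions for illustration; this is not the dissertation's pipeline.

```python
# Illustrative sketch (not the dissertation's method): project a lidar point
# cloud into a bird's-eye-view occupancy grid.
import numpy as np

def bev_occupancy(points: np.ndarray, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                  cell=0.1) -> np.ndarray:
    """points: (N, 3) array of x, y, z coordinates in the sensor frame (meters)."""
    nx = round((x_range[1] - x_range[0]) / cell)
    ny = round((y_range[1] - y_range[0]) / cell)
    grid = np.zeros((nx, ny), dtype=np.uint8)

    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    kept = points[mask]

    # Map metric coordinates to grid indices and mark occupied cells.
    ix = np.clip(((kept[:, 0] - x_range[0]) / cell).astype(int), 0, nx - 1)
    iy = np.clip(((kept[:, 1] - y_range[0]) / cell).astype(int), 0, ny - 1)
    grid[ix, iy] = 1
    return grid

cloud = np.random.default_rng(0).uniform([-10, -50, -2], [80, 50, 1], size=(100_000, 3))
print(bev_occupancy(cloud).shape)   # (700, 800)
```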

    Toward a Collaborative Platform for Hybrid Designs Sharing a Common Cohort

    This doctoral thesis binds together four included papers into a thematic whole and is simultaneously an independent work proposing a platform facilitating epidemiological research. Population-based prospective cohort studies typically recruit a relatively large group of participants representative of a studied population and follow them over years or decades. This group of participants is called a cohort. As part of the study, the participants may be asked to answer extensive questionnaires, undergo medical examinations, donate blood samples, and participate in several rounds of follow-ups. The collected data can also include information from other sources, such as health registers. In prospective cohort studies, the participants initially do not have the investigated diagnoses, but statistically, a certain percentage will be diagnosed with a disease yearly. The studies enable the researchers to investigate how those who got a disease differ from those who did not. Often, many new studies can be nested within a cohort study. Data for a subgroup of the cohort is then selected and analyzed. A new study combined with an existing cohort is said to have a hybrid design. When a research group uses the same cohort as a basis for multiple new studies, these studies often have similarities in the workflow for designing the study and the analysis. The thesis shows the potential for developing a platform that encourages the reuse of work from previous studies and systematizes the study design workflows to enhance time efficiency and reduce the risk of errors. However, the study data are subject to strict acts and regulations pertaining to privacy and research ethics. Therefore, the data must be stored and accessed within a secured IT environment where researchers log in to conduct analyses, with minimal possibilities to install analytics software not already provided by default. Further, transferring the data from the secured IT environment to a local computer or a public cloud is prohibited. Nevertheless, researchers can usually upload and run script files, e.g., written in R and run in RStudio. A consequence is that researchers, who often have limited software engineering skills, may rely mainly on self-written code for their analyses, possibly developed unsystematically, with a high risk of errors and of reinventing solutions already solved in preceding studies within the group. The thesis makes a case for a platform providing collaboration software as a service (SaaS) that addresses the challenges of the described research context and proposes its architecture and design. Its main characteristic, and contribution, is the separation of concerns between the SaaS, which operates independently of the data, and the secured IT environment where the data can be accessed and analyzed. The platform lets researchers define the data analysis for a study using the cloud-based software; the definition is then automatically transformed into an executable version, represented as source code in a scripting language already supported by the secure environment where the data resides. The author has not found systems solving the same problem similarly. However, the work is informed by cloud computing, workflow management systems, data analysis pipelines, low-code, no-code, and model-driven development.
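    A hypothetical sketch of the platform's core transformation idea: an analysis defined outside the secure environment as a declarative specification and then turned into an R script that can be uploaded and run where the data lives. The specification fields and the generated code are invented for illustration and are not the platform's actual format.

```python
# Hypothetical sketch: a declarative analysis specification is transformed
# into an R script for execution inside the secure environment. All field
# names and the generated code are invented for illustration.
analysis_spec = {
    "dataset": "cohort_followup.csv",     # file already inside the secure environment
    "outcome": "diagnosis",
    "predictors": ["age", "bmi", "smoking"],
    "model": "logistic_regression",
}

def to_r_script(spec: dict) -> str:
    formula = f'{spec["outcome"]} ~ {" + ".join(spec["predictors"])}'
    return "\n".join([
        f'data <- read.csv("{spec["dataset"]}")',
        f'model <- glm({formula}, data = data, family = binomial())',
        'summary(model)',
    ])

print(to_r_script(analysis_spec))   # R code the researcher would upload and run
```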

    Semantic-guided predictive modeling and relational learning within industrial knowledge graphs

    The ubiquitous availability of data in today’s manufacturing environments, mainly driven by the extended usage of software and built-in sensing capabilities in automation systems, enables companies to embrace more advanced predictive modeling and analysis in order to optimize processes and the usage of equipment. While the potential insight gained from such analysis is high, it often remains untapped, since integrating and analyzing data silos from different production domains requires high manual effort and is therefore not economical. Addressing these challenges, digital representations of production equipment, so-called digital twins, have emerged, leading the way to semantic interoperability across systems in different domains. From a data modeling point of view, digital twins can be seen as industrial knowledge graphs, which are used as the semantic backbone of manufacturing software systems and data analytics. Due to the prevalent, historically grown, and scattered manufacturing software system landscape, which comprises numerous proprietary information models, data sources are highly heterogeneous. Therefore, there is an increasing need for semi-automatic support in data modeling, enabling end-user engineers to model their domain and maintain a unified semantic knowledge graph across the company. Once data modeling and integration are done, further challenges arise, since there has been little research on how knowledge graphs can contribute to the simplification and abstraction of statistical analysis and predictive modeling, especially in manufacturing. In this thesis, new approaches for modeling and maintaining industrial knowledge graphs with a focus on the application of statistical models are presented. First, concerning data modeling, we discuss requirements from several existing standard information models and analytic use cases in the manufacturing and automation system domains and derive a fragment of the OWL 2 language that is expressive enough to cover the required semantics for a broad range of use cases. The prototypical implementation enables domain end-users, i.e., engineers, to extend the base ontology model with intuitive semantics. Furthermore, it supports efficient reasoning and constraint checking via translation to rule-based representations. Based on these models, we propose an architecture for the end-user-facilitated application of statistical models using ontological concepts and ontology-based data access paradigms. In addition, we present an approach for domain-knowledge-driven preparation of predictive models in terms of feature selection and show how schema-level reasoning in the OWL 2 language can be employed for this task within knowledge graphs of industrial automation systems. A production cycle time prediction model in an example application scenario serves as a proof of concept and demonstrates that axiomatized domain knowledge about features can give competitive performance compared to purely data-driven approaches. In the case of high-dimensional data with small sample sizes, we show that graph kernels of domain ontologies can provide additional information on the degree of variable dependence. Furthermore, a special application of feature selection in graph-structured data is presented, and we develop a method that incorporates domain constraints derived from meta-paths in knowledge graphs into a branch-and-bound pattern enumeration algorithm.
Lastly, we discuss the maintenance of facts in large-scale industrial knowledge graphs, focusing on latent variable models for the automated population and completion of missing facts. State-of-the-art approaches cannot deal with time-series data in the form of events, which naturally occur in industrial applications. Therefore, we present an extension for learning knowledge graph embeddings in conjunction with data in the form of event logs. Finally, we design several use case scenarios of missing information and evaluate our embedding approach on data coming from a real-world factory environment. We draw the conclusion that industrial knowledge graphs are a powerful tool that can be used by end-users in the manufacturing domain for data modeling and model validation. They are especially suitable for facilitating the application of statistical models in conjunction with background domain knowledge by providing information about features upfront. Furthermore, relational learning approaches showed great potential to semi-automatically infer missing facts and provide recommendations to production operators on how to keep stored facts in sync with the real world.
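    As a generic illustration of the latent variable models referred to above, the sketch below scores knowledge graph triples in the TransE style, where a triple (head, relation, tail) is plausible if head + relation lies close to tail in the embedding space. The toy entities are invented, the embeddings are untrained, and this is not the thesis's event-log extension.

```python
# Illustrative sketch of TransE-style knowledge graph embedding scoring,
# a generic example rather than the thesis's model.
import numpy as np

rng = np.random.default_rng(0)
entities = ["robot_arm_1", "conveyor_2", "cell_A"]       # toy industrial entities
relations = ["located_in", "feeds"]
dim = 16

E = {e: rng.normal(size=dim) for e in entities}          # entity embeddings
R = {r: rng.normal(size=dim) for r in relations}         # relation embeddings

def score(head: str, relation: str, tail: str) -> float:
    """Lower is better: distance between the translated head and the tail."""
    return float(np.linalg.norm(E[head] + R[relation] - E[tail]))

# Rank candidate tails for a missing fact (robot_arm_1, located_in, ?).
candidates = sorted((e for e in entities if e != "robot_arm_1"),
                    key=lambda t: score("robot_arm_1", "located_in", t))
print(candidates[0])   # best-scoring completion under the (untrained) toy embeddings
```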

    Data Mining

    The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computing, and even business analytics. This book presents basic concepts, ideas, and research in data mining.

    Evaluating and Improving Internet Load Balancing with Large-Scale Latency Measurements

    Load balancing is used in the Internet to distribute load across resources at different levels, from global load balancing that distributes client requests across servers at the Internet level to path-level load balancing that balances traffic across load-balanced paths. These load balancing algorithms generally work under certain assumptions about performance similarity. Specifically, global load balancing divides the Internet address space into client aggregations and assumes that clients in the same aggregation have similar performance to the same server; load-balanced paths are generally selected for load balancing as if they have similar performance. However, performance similarity is typically approximated through similarity in path properties, e.g., topology and hop count, which do not necessarily lead to similar performance; as a result, performance between clients in the same aggregation and between load-balanced paths can differ significantly. This dissertation evaluates and improves global and path-level load balancing in terms of performance similarity. We achieve this with large-scale latency measurements, which not only allow us to systematically identify and evaluate the performance issues of Internet load balancing at scale, but also enable us to develop data-driven approaches to improve performance. Specifically, this dissertation consists of three parts. First, we study the issues of existing client aggregations for global load balancing and then design AP-atoms, a data-driven client aggregation learned from passive large-scale latency measurements. Second, we show that the latency imbalance between load-balanced paths, previously deemed insignificant, is now both significant and prevalent. We present Flipr, a network prober that actively collects large-scale latency measurements to characterize the latency imbalance issue. Lastly, we design another network prober, Congi, that can detect congestion at scale, and use Congi to study the congestion imbalance problem at scale. We demonstrate that both latency and congestion imbalance can greatly affect the performance of various applications.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168012/1/yibo_1.pd
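    As a simplified illustration of the latency-imbalance measurements discussed above, the sketch below compares per-path median RTTs collected over two load-balanced paths between the same endpoints. The sample values are hypothetical, and the comparison is a generic illustration rather than Flipr's or Congi's actual methodology.

```python
# Simplified sketch: quantify latency imbalance between load-balanced paths
# by comparing per-path median RTTs. Sample values are hypothetical.
import statistics

# Hypothetical RTT samples in milliseconds, keyed by load-balanced path ID.
rtt_samples = {
    "path_a": [31.2, 30.8, 32.1, 31.5, 30.9],
    "path_b": [44.7, 45.3, 43.9, 46.1, 44.2],
}

medians = {path: statistics.median(samples) for path, samples in rtt_samples.items()}
imbalance_ms = max(medians.values()) - min(medians.values())
print(f"per-path medians: {medians}")
print(f"latency imbalance: {imbalance_ms:.1f} ms")   # large gap => imbalanced paths
```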