9 research outputs found

    Minimizing User Involvement for Learning Human Mobility Patterns from Location Traces

    No full text
    Utilizing trajectories for modeling human mobility often involves extracting descriptive features for each individual, a procedure that relies heavily on expert knowledge. In this work, our objective is to minimize human involvement and exploit the power of the community to learn 'features' for individuals from their location traces. We propose a probabilistic graphical model that learns a distribution of latent concepts, named motifs, from anonymized sequences of user locations. To handle variation in user activity levels, our model learns motif distributions from the sequence-level location co-occurrence of all users. To handle the large variation in location popularity, our model uses an asymmetric prior conditioned on per-sequence features. We evaluate the new representation in a link prediction task and compare our results to those of baseline approaches.
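
    The paper's exact graphical model is not given in this abstract, but the idea of learning per-user motif distributions from sequence-level location co-occurrence resembles a topic model over location sequences. A minimal sketch, treating each user's trace as a document and using scikit-learn's LatentDirichletAllocation as a generic stand-in (the asymmetric, per-sequence prior described above is not reproduced):

    # Minimal sketch: motif-like latent features from location traces via a
    # topic model. This is a generic stand-in, NOT the paper's model: it
    # ignores the asymmetric per-sequence prior described in the abstract.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical anonymized traces: one space-separated sequence of
    # location IDs per user.
    traces = [
        "loc12 loc07 loc12 loc33 loc07",
        "loc33 loc33 loc45 loc12",
        "loc07 loc45 loc45 loc45 loc33",
    ]

    # Sequence-level location co-occurrence counts (bag of locations per user).
    vectorizer = CountVectorizer(token_pattern=r"\S+")
    counts = vectorizer.fit_transform(traces)

    # Learn K latent "motifs"; each user's motif mixture becomes a feature
    # vector that could feed a downstream link prediction task.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    user_features = lda.fit_transform(counts)  # shape: (n_users, n_motifs)
    print(user_features)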

    Efficient Estimation of Dynamic Density Functions with an Application to Outlier Detection

    No full text
    In this paper, we propose a new method, named KDE-Track, to estimate the dynamic density over data streams; it is based on the conventional and widely used Kernel Density Estimation (KDE) method. KDE-Track can efficiently estimate the density with linear complexity by interpolating on a kernel model that is incrementally updated as streaming data arrive. Both theoretical analysis and experimental validation show that KDE-Track outperforms traditional KDE and the baseline method Cluster-Kernels in estimation accuracy on complex density structures in data streams, computing time, and memory usage. KDE-Track is also shown to promptly capture the changing density of synthetic and real-world data. In addition, KDE-Track is used to accurately detect outliers in sensor data and is compared with two existing methods developed for detecting outliers and cleaning sensor data.
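
    The abstract does not spell out KDE-Track's update rule, but the core idea of maintaining kernel sums at fixed grid points and answering density queries by interpolation can be sketched as follows; the grid bounds, bandwidth, and Gaussian kernel here are illustrative assumptions, not the paper's parameterization:

    # Minimal sketch of KDE over a grid with interpolation: kernel sums are
    # kept at fixed grid points and updated incrementally per arriving sample,
    # so a density query costs O(grid size) instead of O(stream length).
    import numpy as np

    class GridKDE:
        def __init__(self, lo=0.0, hi=10.0, n_grid=101, bandwidth=0.5):
            self.grid = np.linspace(lo, hi, n_grid)
            self.bandwidth = bandwidth
            self.kernel_sum = np.zeros(n_grid)  # running kernel sums at grid points
            self.count = 0

        def update(self, x):
            """Incremental update: add one streaming sample's kernel to the grid."""
            z = (self.grid - x) / self.bandwidth
            self.kernel_sum += np.exp(-0.5 * z**2) / (self.bandwidth * np.sqrt(2 * np.pi))
            self.count += 1

        def density(self, x):
            """Estimate f(x) by linear interpolation between grid points."""
            return np.interp(x, self.grid, self.kernel_sum / max(self.count, 1))

    # Usage: feed a stream, then query the estimated density anywhere.
    kde = GridKDE()
    rng = np.random.default_rng(0)
    for sample in rng.normal(5.0, 1.0, size=1000):
        kde.update(sample)
    print(kde.density(5.0))  # near the N(5, 1) peak of ~0.4, smoothed by the kernel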

    KDE-Track: An Efficient Dynamic Density Estimator for Data Streams

    No full text

    Manipulation Detection in Cryptocurrency Markets: An Anomaly and Change Detection Based Approach

    No full text
    As a financial asset, cryptocurrencies have innovated the financial industry in several ways. However, the lack of regulation and transparency in cryptocurrency markets is hindering the industry from reaching its full potential. There is a need for extensive technical analysis of cryptocurrency market data to detect possible market manipulation attempts. Anomaly detection techniques can reveal abnormal activities in the market and provide insights into manipulation attempts. In this study, a robust unsupervised anomaly detection tool (ADT) is developed for this purpose. Experiments show that ADT outperforms a set of existing methods in detecting anomalies, both in features extracted from cryptocurrency exchange data and on a set of benchmark data sets.
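
    The abstract does not describe ADT's internals, so as a generic illustration of the workflow (unsupervised detection over features extracted from exchange data), here is a sketch using scikit-learn's IsolationForest as a stand-in detector; the rolling-window return and volume features are assumptions for demonstration only:

    # Illustrative stand-in for an unsupervised anomaly detector over market
    # features. IsolationForest is NOT the paper's ADT; the feature choices
    # below are assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Hypothetical per-minute exchange data: price and traded volume.
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "price": 100 + np.cumsum(rng.normal(0, 0.1, 1000)),
        "volume": rng.lognormal(2, 0.3, 1000),
    })
    df.loc[500, "volume"] *= 50  # inject an abnormal volume spike

    # Simple features: log returns and volume deviation from a rolling mean.
    feats = pd.DataFrame({
        "ret": np.log(df["price"]).diff(),
        "vol_dev": df["volume"] / df["volume"].rolling(60, min_periods=1).mean(),
    }).dropna()

    detector = IsolationForest(contamination=0.01, random_state=0)
    labels = detector.fit_predict(feats)  # -1 marks suspected anomalies
    print(feats.index[labels == -1].tolist())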

    Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery

    No full text
    Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema's meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher, which leverages word embeddings to find objects that are semantically related. We introduce coherent groups, a technique to combine word embeddings that works better than other state-of-the-art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups in finding those links.
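
    The coherent-group combination is specific to the paper, but the basic semantic-matching step (relating two schema elements by the similarity of their word embeddings) can be sketched as below; the toy embedding table and the plain averaging of token vectors are assumptions standing in for real pre-trained embeddings and for coherent groups:

    # Minimal sketch of embedding-based semantic matching between schema
    # elements. The tiny hand-made embedding table and plain averaging are
    # stand-ins for pre-trained embeddings and the coherent-group technique.
    import numpy as np

    # Hypothetical word vectors (in practice, from a pre-trained model).
    EMB = {
        "employee": np.array([0.9, 0.1, 0.0]),
        "staff":    np.array([0.8, 0.2, 0.1]),
        "salary":   np.array([0.1, 0.9, 0.2]),
        "payroll":  np.array([0.2, 0.8, 0.3]),
    }

    def embed(name):
        """Average the embeddings of a schema name's tokens (e.g. 'staff_payroll')."""
        vecs = [EMB[t] for t in name.lower().split("_") if t in EMB]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Propose a link between two columns if their name embeddings are similar.
    sim = cosine(embed("employee_salary"), embed("staff_payroll"))
    print(f"similarity: {sim:.3f}")  # high similarity suggests a semantic link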

    Building Data Civilizer Pipelines with an Advanced Workflow Engine

    No full text
    In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-to-end data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute, and retrofit data preparation pipelines composed of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
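
    The workflow engine itself is not detailed in this abstract; as a generic illustration of authoring and executing a data preparation pipeline as a DAG of services, here is a minimal sketch (the stage names and the topological-execution strategy are assumptions, not Data Civilizer's API):

    # Minimal sketch of a DAG-based pipeline runner in the spirit of a data
    # preparation workflow engine. Stage names are hypothetical; a real engine
    # would add retries, retrofitting of pipelines, and service adapters.
    from graphlib import TopologicalSorter  # Python 3.9+

    def discover():     return ["raw_table_a", "raw_table_b"]
    def clean(tables):  return [t + "_clean" for t in tables]
    def match(tables):  return list(zip(tables, tables))
    def report(pairs):  print("consolidated:", pairs)

    # Pipeline DAG: each stage lists the stages it depends on.
    DAG = {"discover": set(), "clean": {"discover"},
           "match": {"clean"}, "report": {"match"}}
    STAGES = {"discover": discover, "clean": clean,
              "match": match, "report": report}

    # Execute stages in dependency order, threading each stage's output onward.
    results = {}
    for name in TopologicalSorter(DAG).static_order():
        args = [results[d] for d in sorted(DAG[name])]
        results[name] = STAGES[name](*args)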

    A Demo of the Data Civilizer System

    No full text
    Finding relevant data for a specific task from the numerous data sources available in any organization is a daunting task. This is not only because of the number of possible data sources where the data of interest resides, but also because the data is scattered all over the enterprise and is typically dirty and inconsistent. In practice, data scientists routinely report that the majority (more than 80%) of their effort is spent finding, cleaning, integrating, and accessing data of interest to the task at hand. We propose to demonstrate Data Civilizer to ease the pain of analyzing data "in the wild". Data Civilizer is an end-to-end big data management system with components for data discovery, data integration and stitching, data cleaning, and querying data from a large variety of storage engines, running in large enterprises.