3,042 research outputs found

    Data Mining Techniques for Complex User-Generated Data

    Nowadays, the amount of collected information is continuously growing in a variety of domains. Data mining techniques are powerful instruments for analyzing these large data collections effectively and extracting hidden, useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, covering user behavior, user-generated content, user exploitation of available services, and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as large dataset cardinality and dimensionality, variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, extracting useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterized by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.

    DLOREAN: Dynamic Location-aware Reconstruction of multiway Networks

    This paper presents a method for learning time-varying higher-order interactions based on node observations, with application to short-term traffic forecasting based on traffic flow sensor measurements. We incorporate domain knowledge into the design of a new damped periodic kernel which leverages traffic flow patterns towards better structure learning. We introduce location-based regularization for learning models with desirable geographical properties (short-range or long-range interactions). We show, using experiments on synthetic and real data, that our approach performs better than static methods for reconstruction of multiway interactions, as well as time-varying methods which recover only pairwise interactions. Further, we show on real traffic data that our model is useful for short-term traffic forecasting, improving over the state of the art.
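The abstract above does not give the form of its damped periodic kernel. As a rough illustration only, the sketch below assumes one plausible form: a standard periodic kernel multiplied by an exponential decay, so that observations at the same time of day on nearby days weigh more than those on distant days. The function name, the 24-hour period, and the `length_scale` and `damping` parameters are all hypothetical and are not taken from the paper.

```python
import math


def damped_periodic_kernel(t1, t2, period=24.0, length_scale=1.0, damping=72.0):
    """Assumed form (not the paper's): periodic kernel times exponential decay.

    t1, t2 are timestamps in hours. All parameters are illustrative.
    """
    lag = abs(t1 - t2)
    # Periodic term: equals 1 when the lag is a whole number of periods.
    periodic = math.exp(-2.0 * math.sin(math.pi * lag / period) ** 2 / length_scale ** 2)
    # Damping term: shrinks the weight of observations far away in time.
    decay = math.exp(-lag / damping)
    return periodic * decay


def kernel_weights(target_time, observation_times, **kernel_params):
    """Normalized weights that could reweight observations around target_time."""
    raw = [damped_periodic_kernel(target_time, t, **kernel_params) for t in observation_times]
    total = sum(raw)
    return [w / total for w in raw]
```

Under these assumptions, an observation exactly one day before the target time receives more weight than one twelve hours before it, and less weight than the target time itself, which is the qualitative behavior a daily traffic pattern would call for.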

    A Semi-Supervised Feature Engineering Method for Effective Outlier Detection in Mixed Attribute Data Sets

    Outlier detection is one of the crucial tasks in data mining and can lead to the discovery of valuable and meaningful information within the data. An outlier is a data point that is notably dissimilar from other data points in the data set. As such, methods for outlier detection play an important role in identifying and removing outliers, thereby increasing the performance and accuracy of prediction systems. Outlier detection is used in many areas, such as financial fraud detection, disease prediction, and network intrusion detection. Traditional outlier detection methods rely on distance measures to estimate the similarity between points and are confined to data sets that are purely continuous or purely categorical. These methods, though effective, do not elucidate the relationship between outliers and known clusters/classes in the data set. We refer to this relationship as the context for any reported outlier. Alternative outlier detection methods establish the context of a reported outlier using the underlying contextual beliefs of the data. Contextual beliefs are the established relationships between the attributes of the data set. Various recent studies explore contextual beliefs to determine outlier behavior. However, these methods do not scale to situations where the data points and their respective contexts are sparse. Thus, the outliers reported by these methods tend to lose meaning. Another limitation of these methods is that they assume all features are equally important and neither consider nor determine subspaces among the features for identifying outliers. Furthermore, determining subspaces is computationally expensive, as the number of possible subspaces increases with increasing dimensionality. This makes searching through all the possible subspaces impractical.
    In this thesis, we propose a Hybrid Bayesian Network approach that captures the underlying contextual beliefs to detect meaningful outliers in mixed attribute data sets. Hybrid Bayesian Networks use their probability distributions to encode the information in the data, and outliers are those points which violate this information. To deal with sparse contexts, we use an angle-based similarity method, which is then combined with the joint probability distributions of the Hybrid Bayesian Network in a robust manner. For subspace selection, we employ a feature engineering method consisting of two-stage feature selection, using the Maximal Information Coefficient and the Markov blankets of Hybrid Bayesian Networks to select highly correlated feature subspaces. The proposed method was tested on a real-world medical record data set. The results indicate that the algorithm was able to identify meaningful outliers successfully. Moreover, we compare the performance of our algorithm with existing baseline outlier detection algorithms. We also present a detailed analysis of the outliers reported by our method and demonstrate its efficiency when handling data points with sparse contexts.
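The abstract above mentions an angle-based similarity method without specifying it. The sketch below is a simplified stand-in in the spirit of angle-based outlier detection, not the thesis's method: for each point it computes the variance of the cosines of the angles formed with every pair of other points. A point far from the rest sees all other points in roughly the same direction, so its variance is small. The function names are hypothetical, and the sketch assumes purely continuous attributes, leaving out the Hybrid Bayesian Network component entirely.

```python
import math
from itertools import combinations


def angle_variance(point, others):
    """Variance of cosines of angles at `point` over all pairs of other points."""
    cosines = []
    for a, b in combinations(others, 2):
        va = [x - p for x, p in zip(a, point)]
        vb = [x - p for x, p in zip(b, point)]
        norm_a = math.sqrt(sum(x * x for x in va))
        norm_b = math.sqrt(sum(x * x for x in vb))
        if norm_a == 0.0 or norm_b == 0.0:
            continue  # skip duplicates of `point`; the angle is undefined
        dot = sum(x * y for x, y in zip(va, vb))
        cosines.append(dot / (norm_a * norm_b))
    if not cosines:
        return 0.0
    mean = sum(cosines) / len(cosines)
    return sum((c - mean) ** 2 for c in cosines) / len(cosines)


def angle_scores(data):
    """One score per point; a LOW score suggests an outlier."""
    return [angle_variance(p, data[:i] + data[i + 1:]) for i, p in enumerate(data)]


# Illustrative data: a small cluster plus one distant point (index 5).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = angle_scores(points)
most_outlying = scores.index(min(scores))
```

This brute-force version examines every pair of points for every point, so it is cubic in the number of points and suits only small illustrative data sets.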

    A Survey of Location Prediction on Twitter

    Locations, e.g., countries, states, cities, and points of interest, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the worldwide coverage of its users and the real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts have been spent on dealing with the new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim to offer an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure

    Framework for Contextual Outlier Identification using Multivariate Analysis approach and Unsupervised Learning

    The majority of existing commercial video surveillance applications capture only event frames, and the accuracy of those captures is poor. We reviewed existing systems and found that no current research technique offers context-based scene identification of outliers. We therefore present a framework that uses an unsupervised learning approach to precisely identify outliers in given video frames with respect to the contextual information of the scene. The proposed system uses a matrix decomposition method based on multivariate analysis to balance faster response time against higher accuracy in detecting an abnormal event/object as an outlier. Following an analytical methodology, the proposed system applies a blocking operation followed by sparsity to perform detection. The study outcome shows that the proposed system offers higher accuracy than the existing system, with faster response time.