178,102 research outputs found

    On the role of pre and post-processing in environmental data mining

    Get PDF
    The quality of discovered knowledge is highly depending on data quality. Unfortunately real data use to contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex is the reality to be analyzed, the higher the risk of getting low quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results, depend not only on the quality of the results themselves, but on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of the pre and post processing in the whole process of Knowledge Discovery in environmental systems is discussed

    Transforming Graph Representations for Statistical Relational Learning

    Full text link
    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed

    Taming Wild High Dimensional Text Data with a Fuzzy Lash

    Full text link
    The bag of words (BOW) represents a corpus in a matrix whose elements are the frequency of words. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address sparsity and high-dimensionality issues. Among different strategies to develop DR method, Unsupervised Feature Transformation (UFT) is a popular strategy to map all words on a new basis to represent BOW. The recent increase of text data and its challenges imply that DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse BOW matrix to provide a lower-dimensional representation of documents instead of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces superior performance and features to Principal Components Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy

    Finding the different patterns in buildings data using bag of words representation with clustering

    Full text link
    The understanding of the buildings operation has become a challenging task due to the large amount of data recorded in energy efficient buildings. Still, today the experts use visual tools for analyzing the data. In order to make the task realistic, a method has been proposed in this paper to automatically detect the different patterns in buildings. The K Means clustering is used to automatically identify the ON (operational) cycles of the chiller. In the next step the ON cycles are transformed to symbolic representation by using Symbolic Aggregate Approximation (SAX) method. Then the SAX symbols are converted to bag of words representation for hierarchical clustering. Moreover, the proposed technique is applied to real life data of adsorption chiller. Additionally, the results from the proposed method and dynamic time warping (DTW) approach are also discussed and compared

    A multilabel fuzzy relevance clustering system for malware attack attribution in the edge layer of cyber-physical networks

    Get PDF
    The rapid increase in the number of malicious programs has made malware forensics a daunting task and caused users’ systems to become in danger. Timely identification of malware characteristics including its origin and the malware sample family would significantly limit the potential damage of malware. This is a more profound risk in Cyber-Physical Systems (CPSs), where a malware attack may cause significant physical damage to the infrastructure. Due to limited on-device available memory and processing power in CPS devices, most of the efforts for protecting CPS networks are focused on the edge layer, where the majority of security mechanisms are deployed. Since the majority of advanced and sophisticated malware programs are combining features from different families, these malicious programs are not similar enough to any existing malware family and easily evade binary classifier detection. Therefore, in this article, we propose a novel multilabel fuzzy clustering system for malware attack attribution. Our system is deployed on the edge layer to provide insight into applicable malware threats to the CPS network. We leverage static analysis by utilizing Opcode frequencies as the feature space to classify malware families. We observed that a multilabel classifier does not classify a part of samples. We named this problem the instance coverage problem. To overcome this problem, we developed an ensemble-based multilabel fuzzy classification method to suggest the relevance of a malware instance to the stricken families. This classifier identified samples of VirusShare, RansomwareTracker, and BIG2015 with an accuracy of 94.66%, 94.26%, and 97.56%, respectively
    corecore