
    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high-throughput methodologies offer researchers the ability to understand complex systems via high-dimensional and multi-relational data. One example is the realm of molecular biology, where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high-dimensional and multi-relational data allows for unprecedentedly detailed analysis, but also presents challenges in accounting for all the variability. High-dimensional data often have a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high-dimensional and multi-relational data, we developed three feature selection and cross-clustering methods: 1) the infinite relational model with feature selection (FIRM), which incorporates the rich information of multi-relational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to the Cross Dirichlet Process Mixture (CDPM) and to cross-clustering; and 3) a randomized approximation (RBHCC) based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM to categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.
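    To make the idea of a cluster count that is unknown a priori concrete, the following minimal sketch (illustrative only, not the thesis's FIRM/BHCC implementation) samples a partition from the Chinese restaurant process, the prior over partitions that underlies Dirichlet process mixtures such as the CDPM:

```python
import random

def sample_crp_partition(n_items, alpha=1.0, seed=0):
    """Draw a partition from the Chinese restaurant process (CRP).
    The number of clusters is not fixed in advance; it grows with the
    data, which is the hallmark of Bayesian nonparametric clustering."""
    rng = random.Random(seed)
    assignments = []       # cluster index assigned to each item
    cluster_sizes = []     # current occupancy of each cluster
    for i in range(n_items):
        # Item i joins existing cluster k with probability
        # size_k / (i + alpha), or opens a new cluster with
        # probability alpha / (i + alpha).
        weights = cluster_sizes + [alpha]
        r = rng.uniform(0, sum(weights))
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # a brand-new cluster appears
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp_partition(10, alpha=1.0))
# The number of distinct labels in the output is itself random.
```

    Larger values of the concentration parameter alpha favor partitions with more clusters; a full mixture model would additionally attach a likelihood to each cluster.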

    Extração de conhecimento a partir de fontes semi-estruturadas [Knowledge extraction from semi-structured sources]

    The increasing number of small, cheap devices full of sensing capabilities leads to an untapped source of data that can be explored to improve and optimize multiple systems, from small-scale home automation to large-scale applications such as agriculture monitoring, traffic flow and industrial maintenance prediction. Yet hand in hand with this growth goes the increasing difficulty of collecting, storing and organizing all these new data. The lack of standard context representation schemes is one of the main struggles in this area. Furthermore, conventional methods for extracting knowledge from data rely on standard representations or a priori relations. These a priori relations add latent information to the underlying model, in the form of context representation schemes, table relations, or even ontologies. Nonetheless, these relations are created and maintained by human users. While feasible for small-scale scenarios or specific areas, this becomes increasingly difficult to maintain when considering the potential dimension of IoT and M2M scenarios. This thesis addresses the problem of storing and organizing context information from IoT/M2M scenarios in a meaningful way, without imposing a representation scheme or requiring a priori relations. This work proposes a d-dimensional organization model, optimized for IoT/M2M data. The model relies on machine learning features to identify similar context sources. These features are then used to learn relations between data sources automatically, providing the foundations for automatic knowledge extraction, on which machine learning, or even conventional methods, can rely to extract knowledge from a potentially relevant dataset. During this work, two different machine learning techniques were tackled: semantic similarity and stream similarity. Semantic similarity estimates the similarity between concepts (in textual form). This thesis proposes an unsupervised learning method for semantic features based on distributional profiles, without requiring any specific corpus. This allows the organizational model to organize data based on concept similarity instead of string matching. Another advantage is that the learning method does not require input from users, making it ideal for massive IoT/M2M scenarios. Stream similarity metrics estimate the similarity between two streams of data. Although such methods have been extensively researched for DNA sequencing, they commonly rely on variants of the longest common subsequence. This PhD proposes a generative model for stream characterization, specially optimized for IoT/M2M data. The model can be used to generate statistically sound data streams and to estimate the similarity between streams. This is then used by the context organization model to identify context sources with similar stream patterns. The work proposed in this thesis was extensively discussed, developed and published in several international publications.
    The multiple contributions in projects and collaborations with fellow colleagues, where parts of the work developed were used successfully, support the claim that although the context organization model (and the subsequent similarity features) were optimized for IoT/M2M data, they can potentially be extended to deal with any kind of context information in a wide array of applications. The present study was developed in the scope of the Smart Green Homes Project [POCI-01-0247-FEDER-007678], a co-promotion between Bosch Termotecnologia S.A. and the University of Aveiro, financed by Portugal 2020 under the Competitiveness and Internationalization Operational Program and by the European Regional Development Fund.
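    As an illustration of the semantic-similarity component, the sketch below compares two concepts by the cosine of their distributional profiles (sentence-level co-occurrence counts). The toy corpus and terms are invented for the example, and the thesis's actual unsupervised method is more elaborate:

```python
from collections import Counter
from math import sqrt

def distributional_profile(term, sentences):
    """Co-occurrence profile of a term: counts of the words that appear
    in the same sentence. A crude stand-in for richer distributional
    features learned without a specific corpus."""
    profile = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        if term in words:
            profile.update(w for w in words if w != term)
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the thermostat reports room temperature every minute",
    "the outdoor sensor logs temperature and humidity",
    "the humidity sensor is installed in the greenhouse",
]
sim = cosine(distributional_profile("thermostat", corpus),
             distributional_profile("sensor", corpus))
print(f"similarity(thermostat, sensor) = {sim:.2f}")  # nonzero despite no string match
```

    The point of such profiles is that "thermostat" and "sensor" come out related through shared context words like "temperature", which plain string matching would miss.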

    Trading Indistinguishability-based Privacy and Utility of Complex Data

    The collection and processing of complex data, like structured data or infinite streams, facilitates novel applications. At the same time, it raises privacy requirements on the part of the data owners. Consequently, data administrators use privacy-enhancing technologies (PETs) to sanitize the data; these are frequently based on indistinguishability-based privacy definitions. When engineering PETs, a well-known challenge is the privacy-utility trade-off. Although the literature is aware of a number of trade-offs, there are still combinations of involved entities, privacy definition, type of data and application for which valuable trade-offs are missing. In this thesis, for two important groups of applications processing complex data, we study (a) which indistinguishability-based privacy and utility requirements are relevant, (b) whether existing PETs solve the trade-off sufficiently, and (c) propose novel PETs that extend the state of the art substantially in terms of methodology as well as achieved privacy or utility. Overall, we provide four contributions divided into two parts. In the first part, we study applications that analyze structured data with distance-based mining algorithms. We reveal that an essential utility requirement is the preservation of the pair-wise distances of the data items. Consequently, we propose distance-preserving encryption (DPE), together with a general procedure for engineering respective PETs by leveraging existing encryption schemes. As proof of concept, we apply it to SQL log mining, which is useful for database performance tuning. In the second part, we study applications that monitor query results over infinite streams. To this end, w-event differential privacy is the state of the art. Here, PETs use mechanisms that typically add noise to query results. First, we study state-of-the-art mechanisms with respect to the utility they provide. Conducting the largest benchmark to date that fulfills requirements derived from the limitations of prior experimental studies, we contribute new insights into the strengths and weaknesses of existing mechanisms. One of the most unexpected, yet explainable, results is a baseline supremacy: one of the two baseline mechanisms delivers high or even the best utility. A natural follow-up question is whether baseline mechanisms already provide reasonable utility, so, second, we perform a case study from the area of electricity grid monitoring, revealing two results. First, achieving reasonable utility is only possible under weak privacy requirements. Second, the utility measured with application-specific utility metrics decreases faster than the sanitization error, the utility metric used in most studies, suggests. As a third contribution, we propose a novel differential privacy-based privacy definition called Swellfish privacy. It allows tuning utility beyond incremental w-event mechanism design by supporting time-dependent privacy requirements. Both formally and by experiments, we prove that it increases utility significantly. In total, our thesis contributes substantially to the research field and reveals directions for future research.
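    For intuition about the noise-adding mechanisms discussed above, the following sketch implements the simple uniform-allocation baseline for w-event differential privacy: the budget eps is split evenly across each window of w timestamps, so every release is perturbed with Laplace noise of scale sensitivity * w / eps. This is an illustrative reconstruction of a textbook baseline, not one of the thesis's novel mechanisms, and the example stream is invented:

```python
import random

def uniform_baseline(stream, eps=1.0, w=10, sensitivity=1.0, seed=0):
    """Sanitize a stream of query results under w-event differential
    privacy by giving each timestamp an equal budget share eps/w."""
    rng = random.Random(seed)
    scale = sensitivity * w / eps   # Laplace scale per release
    noisy = []
    for true_value in stream:
        # A Laplace(0, scale) sample as the difference of two exponentials.
        noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        noisy.append(true_value + noise)
    return noisy

readings = [120, 118, 131, 125, 140]    # e.g. per-interval consumption counts
print(uniform_baseline(readings, eps=1.0, w=5))
```

    The utility cost is visible in the scale: for a fixed eps, a longer window w forces proportionally more noise onto every single release, which illustrates the kind of privacy-utility trade-off studied above.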

    Inferring Anomalies from Data using Bayesian Networks

    Existing studies on data mining have largely focused on the design of measures and algorithms to identify outliers in large, high-dimensional categorical and numeric databases. However, not much stress has been placed on the interestingness of the reported outliers. One way to ascertain the interestingness and usefulness of a reported outlier is to make use of domain knowledge. In this thesis, we present measures to discover outliers based on background knowledge, represented by a Bayesian network. Using causal relationships between attributes encoded in the Bayesian framework, we demonstrate that meaningful outliers, i.e., outliers which encode important or new information, are those which violate causal relationships encoded in the model. Depending upon the nature of the data, several approaches are proposed to identify and explain anomalies using Bayesian knowledge. Outliers are often identified as data points which are "rare", "isolated", or "far away from their nearest neighbors". We show that these characteristics may not be an accurate way of describing interesting outliers. Through a critical analysis of several existing outlier detection techniques, we show why there is a mismatch between outliers as entities described by these characteristics and "real" outliers as identified using the Bayesian approach. We show that the Bayesian approaches presented in this thesis have better accuracy in mining genuine outliers while keeping a low false-positive rate compared to traditional outlier detection techniques.
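    A toy version of the core idea, with the network structure and probabilities invented for illustration, might look as follows: observations that violate a causal relationship encoded in the network receive low likelihood and are flagged as interesting outliers.

```python
# Toy Bayesian network: Rain -> WetGrass.
P_RAIN = {True: 0.2, False: 0.8}
P_WET_GIVEN_RAIN = {                      # P(WetGrass | Rain)
    True:  {True: 0.95, False: 0.05},
    False: {True: 0.10, False: 0.90},
}

def joint_prob(rain, wet):
    """Likelihood of one observation under the network."""
    return P_RAIN[rain] * P_WET_GIVEN_RAIN[rain][wet]

def bn_outliers(records, threshold=0.05):
    """Flag records whose likelihood under the network falls below the
    threshold: they contradict the encoded causal relationship."""
    return [r for r in records if joint_prob(*r) < threshold]

data = [(True, True), (False, False), (True, False), (False, True)]
print(bn_outliers(data))
# -> [(True, False)]: rain but dry grass, p = 0.2 * 0.05 = 0.01
```

    Note that (True, False) is flagged not because it is far from its neighbors, but because it contradicts the causal link Rain -> WetGrass, matching the notion of a meaningful outlier described above.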

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its large public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures.
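    As a minimal illustration of pooling evidence across measurement sources (a generic technique chosen for the example, not the thesis's specific probabilistic models), Stouffer's z-score combination shows how weak but consistent signals from several platforms can become jointly detectable:

```python
import math

def stouffer_combine(z_scores, weights=None):
    """Combine per-source z-scores into one overall z-score:
    Z = sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    num = sum(w * z for w, z in zip(weights, z_scores))
    den = math.sqrt(sum(w * w for w in weights))
    return num / den

# A hypothetical gene with weak evidence on three measurement platforms:
print(round(stouffer_combine([1.5, 1.7, 1.4]), 2))
# -> 2.66, i.e. clearer joint evidence than any single source provides
```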

    Quality and interestingness of association rules derived from data mining of relational and semi-structured data

    Deriving useful and interesting rules from a data mining system is an essential and important task. Problems such as the discovery of random and coincidental patterns, or patterns with no significant value, and the generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness of rules generated by data mining algorithms is actively and constantly being examined and developed. As data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach. It is important to establish the ways in which the existing statistical measures and constraint parameters can be effectively utilized, and the sequence of their usage.

    In this thesis, a systematic way is presented to evaluate the association rules discovered by frequent, closed and maximal itemset mining algorithms, and by a frequent subtree mining algorithm, including rules based on induced, embedded and disconnected subtrees. With reference to frequent subtree mining, a new direction is additionally explored based on utilizing the DSM approach, which is capable of preserving all information from a tree-structured database in a flat data format, consequently enabling the direct application of a wider range of data mining analyses/techniques to tree-structured data. Implications of this approach were investigated, and it was found that basing rules on disconnected subtrees can be useful in terms of increasing the accuracy and the coverage rate of the rule set.

    A strategy that combines data mining and statistical measurement techniques, such as sampling, redundancy and contradiction checks, and correlation and regression analysis, is developed to evaluate the rules. This framework is then applied to real-world datasets that represent diverse characteristics of data/items. Empirical results show that, with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively valuable high-accuracy rules. Moreover, the results reveal important characteristics of, and differences between, mining frequent, closed or maximal itemsets, and mining frequent subtrees, including rules based on induced, embedded and disconnected subtrees, as well as the impact of the confidence measure on the prediction and classification task.
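    To make the statistical evaluation concrete, the sketch below computes the standard support, confidence and lift measures for a candidate association rule; the toy baskets are invented, and the framework described above layers further checks (sampling, redundancy and contradiction tests, correlation and regression analysis) on top of such measures:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.
    Lift close to (or below) 1 suggests the rule may be coincidental."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(rule_metrics(baskets, {"bread"}, {"milk"}))
# support 0.4, confidence ~0.67, lift ~0.83: below-1 lift flags a weak rule
```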

    Unsupervised learning on social data
