75 research outputs found
Learning by Fusing Heterogeneous Data
It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data.
Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model the epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways.
A central topic of our Thesis is also the analysis of large data compendia as predictions about certain phenomena, such as associations between diseases and involvement of genes in a certain phenotype, are only possible when dealing with lots of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model up to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets
Representation Learning for Natural Language Processing
This open access book provides an overview of the recent advances in representation learning theory, algorithms and applications for natural language processing (NLP). It is divided into three parts. Part I presents the representation learning techniques for multiple language entries, including words, phrases, sentences and documents. Part II then introduces the representation techniques for those objects that are closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques, and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented can also benefit other related domains such as machine learning, social network analysis, semantic Web, information retrieval, data mining and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing
Embedding Approaches for Relational Data
​Embedding methods for searching latent representations of the data are very important tools for unsupervised and supervised machine learning as well as information visualisation. Over the years, such methods have continually progressed towards the ability to capture and analyse the structure and latent characteristics of larger and more complex data. In this thesis, we examine the problem of developing efficient and reliable embedding methods for revealing, understanding, and exploiting the different aspects of the relational data. We split our work into three pieces, where each deals with a different relational data structure. In the first part, we are handling with the weighted bipartite relational structure. Based on the relational measurements between two groups of heterogeneous objects, our goal is to generate low dimensional representations of these two different types of objects in a unified common space. We propose a novel method that models the embedding of each object type symmetrically to the other type, subject to flexible scale constraints and weighting parameters. The embedding generation relies on an efficient optimisation despatched using matrix decomposition. And we have also proposed a simple way of measuring the conformity between the original object relations and the ones re-estimated from the embeddings, in order to achieve model selection by identifying the optimal model parameters with a simple search procedure. We show that our proposed method achieves consistently better or on-par results on multiple synthetic datasets and real world ones from the text mining domain when compared with existing embedding generation approaches. In the second part of this thesis, we focus on the multi-relational data, where objects are interlinked by various relation types. Embedding approaches are very popular in this field, they typically encode objects and relation types with hidden representations and use the operations between them to compute the positive scalars corresponding to the linkages' likelihood score. In this work, we aim at further improving the existing embedding techniques by taking into account the multiple facets of the different patterns and behaviours of each relation type. To the best of our knowledge, this is the first latent representation model which considers relational representations to be dependent on the objects they relate in this field. The multi-modality of the relation type over different objects is effectively formulated as a projection matrix over the space spanned by the object vectors. Two large benchmark knowledge bases are used to evaluate the performance with respect to the link prediction task. And a new test data partition scheme is proposed to offer a better understanding of the behaviour of a link prediction model. In the last part of this thesis, a much more complex relational structure is considered. In particular, we aim at developing novel embedding methods for jointly modelling the linkage structure and objects' attributes. Traditionally, link prediction task is carried out on either the linkage structure or the objects' attributes, which does not aware of their semantic connections and is insufficient for handling the complex link prediction task. Thus, our goal in this work is to build a reliable model that can fuse both sources of information to improve the link prediction problem. The key idea of our approach is to encode both the linkage validities and the nodes neighbourhood information into embedding-based conditional probabilities. Another important aspect of our proposed algorithm is that we utilise a margin-based contrastive training process for encoding the linkage structure, which relies on a more appropriate assumption and dramatically reduces the number of training links. In the experiments, our proposed method indeed improves the link prediction performance on three citation/hyperlink datasets, when compared with those methods relying on only the nodes' attributes or the linkage structure, and it also achieves much better performances compared with the state-of-arts
Recommended from our members
Appropriate, accessible and appealing probabilistic graphical models
Appropriate - Many multivariate probabilistic models either use independent distributions or dependent Gaussian distributions. Yet, many real-world datasets contain count-valued or non-negative skewed data, e.g. bag-of-words text data and biological sequencing data. Thus, we develop novel probabilistic graphical models for use on count-valued and non-negative data including Poisson graphical models and multinomial graphical models. We develop one generalization that allows for triple-wise or k-wise graphical models going beyond the normal pairwise formulation. Furthermore, we also explore Gaussian-copula graphical models and derive closed-form solutions for the conditional distributions and marginal distributions (both before and after conditioning). Finally, we derive mixture and admixture, or topic model, generalizations of these graphical models to introduce more power and interpretability.
Accessible - Previous multivariate models, especially related to text data, often have complex dependencies without a closed form and require complex inference algorithms that have limited theoretical justification. For example, hierarchical Bayesian models often require marginalizing over many latent variables. We show that our novel graphical models (even the k-wise interaction models) have simple and intuitive estimation procedures based on node-wise regressions that likely have similar theoretical guarantees as previous work in graphical models. For the copula-based graphical models, we show that simple approximations could still provide useful models; these copula models also come with closed-form conditional and marginal distributions, which make them amenable to exploratory inspection and manipulation. The parameters of these models are easy to interpret and thus may be accessible to a wide audience.
Appealing - High-level visualization and interpretation of graphical models with even 100 variables has often been difficult even for a graphical model expert---despite visualization being one of the original motivators for graphical models. This difficulty is likely due to the lack of collaboration between graphical model experts and visualization experts. To begin bridging this gap, we develop a novel "what if?" interaction that manipulates and leverages the probabilistic power of graphical models. Our approach defines: the probabilistic mechanism via conditional probability; the query language to map text input to a conditional probability query; and the formal underlying probabilistic model. We then propose to visualize these query-specific probabilistic graphical models by combining the intuitiveness of force-directed layouts with the beauty and readability of word clouds, which pack many words into valuable screen space while ensuring words do not overlap via pixel-level collision detection. Although both the force-directed layout and the pixel-level packing problems are challenging in their own right, we approximate both simultaneously via adaptive simulated annealing starting from careful initialization. For visualizing mixture distributions, we also design a meaningful mapping from the properties of the mixture distribution to a color in the perceptually uniform CIELUV color space. Finally, we demonstrate our approach via illustrative visualizations of several real-world datasets.Computer Science
Object Recognition
Vision-based object recognition tasks are very familiar in our everyday activities, such as driving our car in the correct lane. We do these tasks effortlessly in real-time. In the last decades, with the advancement of computer technology, researchers and application developers are trying to mimic the human's capability of visually recognising. Such capability will allow machine to free human from boring or dangerous jobs
Dissémination de l’information et dynamique des opinions dans les réseaux sociaux
Our aim in this Ph. D. thesis is to study the diffusion of information as well as the opinion dynamics of users in social networks. Information diffusion models explore the paths taken by information being transmitted through a social network in order to understand and analyze the relationships between users in such network, leading to a better comprehension of human relations and dynamics. This thesis is based on both sides of information diffusion: first by developing mathematical theories and models to study the relationships between people and information, and in a second time by creating tools to better exploit the hidden patterns in these relationships. The theoretical tools developed in this thesis are opinion dynamics models and information diffusion models, where we study the information flow from users in social networks, and the practical tools developed in this thesis are a novel community detection algorithm and a novel trend detection algorithm. We start by introducing an opinion dynamics model in which agents interact with each other about several distinct opinions/contents. In our framework, agents do not exchange all their opinions with each other, they communicate about randomly chosen opinions at each time. We show, using stochastic approximation algorithms, that under mild assumptions this opinion dynamics algorithm converges as time increases, whose behavior is ruled by how users choose the opinions to broadcast at each time. We develop next a community detection algorithm which is a direct application of this opinion dynamics model: when agents broadcast the content they appreciate the most. Communities are thus formed, where they are defined as groups of users that appreciate mostly the same content. This algorithm, which is distributed by nature, has the remarkable property that the discovered communities can be studied from a solid mathematical standpoint. In addition to the theoretical advantage over heuristic community detection methods, the presented algorithm is able to accommodate weighted networks, parametric and nonparametric versions, with the discovery of overlapping communities a byproduct with no mathematical overhead. In a second part, we define a general framework to model information diffusion in social networks. The proposed framework takes into consideration not only the hidden interactions between users, but as well the interactions between contents and multiple social networks. It also accommodates dynamic networks and various temporal effects of the diffusion. This framework can be combined with topic modeling, for which several estimation techniques are derived, which are based on nonnegative tensor factorization techniques. Together with a dimensionality reduction argument, this techniques discover, in addition, the latent community structure of the users in the social networks. At last, we use one instance of the previous framework to develop a trend detection algorithm designed to find trendy topics in a social network. We take into consideration the interaction between users and topics, we formally define trendiness and derive trend indices for each topic being disseminated in the social network. These indices take into consideration the distance between the real broadcast intensity and the maximum expected broadcast intensity and the social network topology. The proposed trend detection algorithm uses stochastic control techniques in order calculate the trend indices, is fast and aggregates all the information of the broadcasts into a simple one-dimensional process, thus reducing its complexity and the quantity of necessary data to the detection. To the best of our knowledge, this is the first trend detection algorithm that is based solely on the individual performances of topicsLa dissémination d'information explore les chemins pris par l'information qui est transmise dans un réseau social, afin de comprendre et modéliser les relations entre les utilisateurs de ce réseau, ce qui permet une meilleur compréhension des relations humaines et leurs dynamique. Même si la priorité de ce travail soit théorique, en envisageant des aspects psychologiques et sociologiques des réseaux sociaux, les modèles de dissémination d'information sont aussi à la base de plusieurs applications concrètes, comme la maximisation d'influence, la prédication de liens, la découverte des noeuds influents, la détection des communautés, la détection des tendances, etc. Cette thèse est donc basée sur ces deux facettes de la dissémination d'information: nous développons d'abord des cadres théoriques mathématiquement solides pour étudier les relations entre les personnes et l'information, et dans un deuxième moment nous créons des outils responsables pour une exploration plus cohérente des liens cachés dans ces relations. Les outils théoriques développés ici sont les modèles de dynamique d'opinions et de dissémination d'information, où nous étudions le flot d'informations des utilisateurs dans les réseaux sociaux, et les outils pratiques développés ici sont un nouveau algorithme de détection de communautés et un nouveau algorithme de détection de tendances dans les réseaux sociau
Non-Convex and Geometric Methods for Tomography and Label Learning
Data labeling is a fundamental problem of mathematical data analysis in which each data point is assigned exactly one single label (prototype) from a finite predefined set. In this thesis we study two challenging extensions, where either the input data cannot be observed directly or prototypes are not available beforehand.
The main application of the first setting is discrete tomography. We propose several non-convex variational as well as smooth geometric approaches to joint image label assignment and reconstruction from indirect measurements with known prototypes. In particular, we consider spatial regularization of assignments, based on the KL-divergence, which takes into account the smooth geometry of discrete probability distributions endowed with the Fisher-Rao (information) metric, i.e. the assignment manifold. Finally, the geometric point of view leads to a smooth flow evolving on a Riemannian submanifold including the tomographic projection constraints directly into the geometry of assignments. Furthermore we investigate corresponding implicit numerical schemes which amount to solving a sequence of convex problems.
Likewise, for the second setting, when the prototypes are absent, we introduce and study a smooth dynamical system for unsupervised data labeling which evolves by geometric integration on the assignment manifold. Rigorously abstracting from ``data-label'' to ``data-data'' decisions leads to interpretable low-rank data representations, which themselves are parameterized by label assignments. The resulting self-assignment flow simultaneously performs learning of latent prototypes in the very same framework while they are used for inference. Moreover, a single parameter, the scale of regularization in terms of spatial context, drives the entire process. By smooth geodesic interpolation between different normalizations of self-assignment matrices on the positive definite matrix manifold, a one-parameter family of self-assignment flows is defined. Accordingly, the proposed approach can be characterized from different viewpoints such as discrete optimal transport, normalized spectral cuts and combinatorial optimization by completely positive factorizations, each with additional built-in spatial regularization
Recommended from our members
Composing Deep Learning and Bayesian Nonparametric Methods
Recent progress in Bayesian methods largely focus on non-conjugate models featured with extensive use of black-box functions: continuous functions implemented with neural networks. Using deep neural networks, Bayesian models can reasonably fit big data while at the same time capturing model uncertainty. This thesis targets at a more challenging problem: how do we model general random objects, including discrete ones, using random functions? Our conclusion is: many (discrete) random objects are in nature a composition of Poisson processes and random functions}. Thus, all discreteness is handled through the Poisson process while random functions captures the rest complexities of the object. Thus the title: composing deep learning and Bayesian nonparametric methods.
This conclusion is not a conjecture. In spacial cases such as latent feature models , we can prove this claim by working on infinite dimensional spaces, and that is how Bayesian nonparametric kicks in. Moreover, we will assume some regularity assumptions on random objects such as exchangeability. Then the representations will show up magically using representation theorems. We will see this two times throughout this thesis.
One may ask: when a random object is too simple, such as a non-negative random vector in the case of latent feature models, how can we exploit exchangeability? The answer is to aggregate infinite random objects and map them altogether onto an infinite dimensional space. And then assume exchangeability on the infinite dimensional space. We demonstrate two examples of latent feature models by (1) concatenating them as an infinite sequence (Section 2,3) and (2) stacking them as a 2d array (Section 4).
Besides, we will see that Bayesian nonparametric methods are useful to model discrete patterns in time series data. We will showcase two examples: (1) using variance Gamma processes to model change points (Section 5), and (2) using Chinese restaurant processes to model speech with switching speakers (Section 6).
We also aware that the inference problem can be non-trivial in popular Bayesian nonparametric models. In Section 7, we find a novel solution of online inference for the popular HDP-HMM model
Recommended from our members
Nonlinear opinion models and other networked systems
Networks play a critical role in many physical, biological, and social systems. In this thesis, we investigate tools to model and analyze networked systems. We first examine some of the ways in which we can model social dynamics that take place on networks. We then study two recently developed data-analysis methods that employ a network framework and explore new ways in which they can be used to find meaningful signals in large data sets. In the first half of the thesis, we study opinion dynamics on networks. We begin by examining a class of opinion models, known as coevolving voter models (CVM), that couple the mechanisms of opinion formation and changing social connections. We then propose a version of CVMs that incorporates nonlinearity. In our models, we assume that individuals strive to achieve harmony and avoid disagreement, both by changing their social connections to reflect their opinions and by changing their opinions to reflect their social connections. By taking a minimalist approach to modeling social dynamics, we hope to gain a deeper understanding of how these two mechanisms can give rise to social phenomena such as the ``majority illusion''. Comparing several versions of CVMs, we find that seemingly small changes in update rules can lead to strikingly different behaviors. A particularly interesting feature of our nonlinear CVMs is that, under certain conditions, the opinion state that is held initially by a minority of the nodes can effectively spread to almost every node in a network if the minority nodes view themselves as the majority. We then discuss an ongoing project that involves another class of opinion models called bounded-confidence models. Specifically, we examine extensions of bounded-confidence models on hypergraphs and discuss some preliminary findings. In the second half of the thesis, we study problems in data analysis. We begin by considering topological structures as a tool to study integrated circuit (IC) devices. In particular, we examine a problem in the design and manufacturing of IC devices using topological data analysis (TDA), which is based on network structures called simplicial complexes. Failures in IC devices generally occur near the tolerance limits of photolithography systems, such as at the minimum separation distance between adjacent electronic components. However, for complex arrangements of electronic components, simply ensuring minimal separation is insufficient to guarantee that one can manufacture an IC design accurately and reliably. We apply tools from TDA to compare data from IC designs. Without inputting domain knowledge, we are able to infer several results about the IC design-manufacturing process. Finally, we discuss an ongoing project in the analysis of network data. Specifically, we explore applications of a recently developed algorithm called network dictionary learning (NDL) and discuss problems of network reconstruction and denoising using NDL on both synthetic and real-world networks
- …