29 research outputs found
A review of multi-instance learning assumptions
Multi-instance (MI) learning is a variant of inductive machine learning, where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, and hence many MI learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. Although all early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain, this "standard MI assumption" is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped and alternative assumptions are considered instead. However, it is often not clearly stated which particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.
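For readers unfamiliar with the term, the standard MI assumption is usually stated as: a bag is positive if and only if at least one of its instances is positive. The following is a minimal Python sketch of that rule only; the names `bag_label_standard_mi` and `toy_instance_classifier`, the threshold, and the toy bags are hypothetical illustrations and do not come from the paper.

```python
from typing import Callable, Sequence

Instance = Sequence[float]


def bag_label_standard_mi(
    bag: Sequence[Instance],
    is_positive_instance: Callable[[Instance], bool],
) -> bool:
    """Label a bag under the standard MI assumption.

    The bag is positive if and only if at least one instance is positive.
    An empty bag is therefore labeled negative.
    """
    return any(is_positive_instance(x) for x in bag)


def toy_instance_classifier(x: Instance) -> bool:
    # Hypothetical instance-level concept: first feature above 0.5.
    return x[0] > 0.5


# Toy bags (made-up values): one contains a positive instance, one does not.
positive_bag = [[0.1, 0.2], [0.9, 0.4]]
negative_bag = [[0.1, 0.2], [0.3, 0.4]]

print(bag_label_standard_mi(positive_bag, toy_instance_classifier))
print(bag_label_standard_mi(negative_bag, toy_instance_classifier))
```

Relaxed MI assumptions of the kind the review surveys would replace the `any(...)` rule with some other relationship between instance-level and bag-level labels.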
Causal Discovery for Relational Domains: Representation, Reasoning, and Learning
Many domains are currently experiencing the growing trend to record and analyze massive, observational data sets with increasing complexity. A commonly made claim is that these data sets hold potential to transform their corresponding domains by providing previously unknown or unexpected explanations and enabling informed decision-making. However, only knowledge of the underlying causal generative process, as opposed to knowledge of associational patterns, can support such tasks.
Most methods for traditional causal discovery (the development of algorithms that learn causal structure from observational data) are restricted to representations that require limiting assumptions on the form of the data. Causal discovery has almost exclusively been applied to directed graphical models of propositional data that assume a single type of entity with independence among instances. However, most real-world domains are characterized by systems that involve complex interactions among multiple types of entities. Many state-of-the-art methods in statistics and machine learning that address such complex systems focus on learning associational models, and they are oftentimes mistakenly interpreted as causal. The intersection between causal discovery and machine learning in complex systems is small.
The primary objective of this thesis is to extend causal discovery to such complex systems. Specifically, I formalize a relational representation and model that can express the causal and probabilistic dependencies among the attributes of interacting, heterogeneous entities. I show that the traditional method for reasoning about statistical independence from model structure fails to accurately derive conditional independence facts from relational models. I introduce a new theory (relational d-separation) and a novel, lifted representation (the abstract ground graph) that supports a sound, complete, and computationally efficient method for algorithmically deriving conditional independencies from probabilistic models of relational data. The abstract ground graph representation also presents causal implications that enable the detection of causal direction for bivariate relational dependencies without parametric assumptions. I leverage these implications and the theoretical framework of relational d-separation to develop a sound and complete algorithm, the relational causal discovery (RCD) algorithm, that learns causal structure from relational data.
Data Mining for Enhanced Operations Management Decision Making: Applications in Health Care
Data mining involves the extraction of new knowledge from large data sets. Despite the growing research interest in data mining, however, integrating this extra knowledge into the subsequent decision-making processes has received little attention. Within the context of operations management, this integration can occur in two different ways: by providing inputs for an optimization procedure and by analyzing the output of an optimization procedure. In this dissertation, I will begin by introducing a database exploration technique, which is used to improve the drug discovery process of a pharmaceutical company (Samorani et al., 2011). The same procedure is also applied to a mental health clinic's database to predict whether patients will show up at their scheduled appointments. The knowledge obtained with this procedure is then used to improve patient scheduling procedures (Samorani and LaGanga, 2011). I will finally discuss how data mining can be used to learn useful information about the structure of a problem (Samorani and Laguna, 2012).
Modeling Complex Networks For (Electronic) Commerce
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
Relational clustering models for knowledge discovery and recommender systems
Cluster analysis is a fundamental research field in Knowledge Discovery and Data Mining
(KDD). It aims at partitioning a given dataset into some homogeneous clusters so as
to reflect the natural hidden data structure. Various heuristic or statistical approaches
have been developed for analyzing propositional datasets. Nevertheless, in relational
clustering the existence of multi-type relationships will greatly degrade the performance
of traditional clustering algorithms. This issue motivates us to find more effective algorithms
to conduct the cluster analysis upon relational datasets. In this thesis we
comprehensively study the idea of Representative Objects for approximating data distribution
and then design a multi-phase clustering framework for analyzing relational
datasets with high effectiveness and efficiency.
The second task considered in this thesis is to provide some better data models for
people as well as machines to browse and navigate a dataset. The hierarchical taxonomy
is widely used for this purpose. Compared with manually created taxonomies, automatically
derived ones are more appealing because of their low creation/maintenance cost
and high scalability. Up to now, taxonomy generation techniques have mainly been used
to organize document corpora. We investigate the possibility of applying them to relational
datasets and then propose some algorithmic improvements. Another non-trivial
problem is how to assign suitable labels for the taxonomic nodes so as to credibly summarize
the content of each node. Unfortunately, this field has not been investigated
sufficiently to the best of our knowledge, and so we attempt to fill the gap by proposing
some novel approaches.
The final goal of our cluster analysis and taxonomy generation techniques is
to improve the scalability of recommender systems that are developed to tackle the
problem of information overload. Recent research in recommender systems integrates
the exploitation of domain knowledge to improve the recommendation quality, which
however reduces the scalability of the whole system at the same time. We address this
issue by applying the automatically derived taxonomy to preserve the pair-wise similarities
between items, and then modeling the user visits by another hierarchical structure.
Experimental results show that the computational complexity of the recommendation
procedure can be greatly reduced and the system scalability thus improved.
Learning Instance Weights in Multi-Instance Learning
Multi-instance (MI) learning is a variant of supervised machine learning, where each learning example contains a bag of instances instead of just a single feature vector. MI learning has applications in areas such as drug activity prediction, fruit disease management and image classification.
This thesis investigates the case where each instance has a weight value determining the level of influence that it has on its bag's class label. This is a more general assumption than most existing approaches use, and thus is more widely applicable. The challenge is to accurately estimate these weights in order to make predictions at the bag level.
An existing approach known as MILES is retroactively identified as an algorithm that uses instance weights for MI learning, and is evaluated using a variety of base learners on benchmark problems. New algorithms for learning instance weights for MI learning are also proposed and rigorously evaluated on both artificial and real-world datasets. The new algorithms are shown to achieve better root mean squared error rates than existing approaches on artificial data generated according to the algorithms' underlying assumptions. Experimental results also demonstrate that the new algorithms are competitive with existing approaches on real-world problems.
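One way to picture the instance-weight assumption described in this abstract is a bag-level score formed as a weighted average of instance-level scores. The sketch below is only an illustration under that assumption: it is not the thesis's algorithms and not MILES, and the function names, the 0.5 threshold, and the example numbers are all hypothetical.

```python
from typing import Sequence


def bag_score(instance_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of instance-level scores (hypothetical aggregation rule)."""
    if len(instance_scores) != len(weights):
        raise ValueError("each instance needs exactly one weight")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * s for w, s in zip(weights, instance_scores)) / total


def predict_bag(
    instance_scores: Sequence[float],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Predict a positive bag when the weighted score reaches the threshold."""
    return bag_score(instance_scores, weights) >= threshold


# Same two instances, different weights: the heavier instance dominates the bag label.
scores = [1.0, 0.0]
print(predict_bag(scores, weights=[3.0, 1.0]))
print(predict_bag(scores, weights=[1.0, 3.0]))
```

In this picture the learning problem is estimating the weights, which the sketch simply takes as given.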
Menetelmiä jälleenkuvausten louhintaan (Methods for redescription mining)
In scientific investigations, data often differ in nature. For instance, they might originate from distinct sources or be cast over separate terminologies. In order to gain insight into the phenomenon of interest, a natural task is to identify the correspondences that exist between these different aspects.
This is the motivating idea of redescription mining, the data analysis task studied in this thesis. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions.
A practical example in biology consists of finding geographical areas that admit two characterizations, one in terms of their climatic profile and one in terms of the occupying species. Discovering such redescriptions can improve our understanding of the influence of climate on species distribution. Besides biology, applications of redescription mining can be envisaged in medicine or sociology, among other fields.
Previously, redescription mining was restricted to propositional queries over Boolean attributes. However, many conditions, such as the aforementioned climate, cannot be expressed naturally in this limited formalism. In this thesis, we consider more general query languages and propose algorithms to find the corresponding redescriptions, making the task relevant to a broader range of domains and problems.
Specifically, we start by extending redescription mining to non-Boolean attributes. In other words, we propose an algorithm to handle nominal and real-valued attributes natively. We then extend redescription mining to the relational setting, where the aim is to find corresponding connection patterns that relate almost the same object tuples in a network.
We also study approaches for selecting high-quality redescriptions to be output by the mining process. The first approach relies on an interface for mining and visualizing redescriptions interactively and allows analysts to tailor the selection of results to their needs. The second approach, rooted in information theory, is a compression-based method for mining small sets of associations from two-view datasets.
In summary, we take redescription mining outside the Boolean world and show its potential as a powerful exploratory method relevant in a broad range of domains.
Scientific research data are often compiled from sources that use different terminologies. Identifying the correspondences and connections between these different perspectives is a natural way to approach the phenomenon under study.
This thesis examines a data analysis method that aims at exactly this: redescription mining. The goal of redescriptions is, on the one hand, to describe the same thing in alternative ways and, on the other, to identify things that have several different descriptions.
Redescription mining has potential applications in biology, medicine, and sociology, among other fields. In biology, for example, one can search for geographical areas that can be characterized in two alternative ways: either by describing the area's climate or by describing the species living there. For example, Scandinavia and the Baltic region have similar temperature and precipitation conditions, and the moose is a species common to both. Finding such redescriptions can help in understanding the effects of climate on species distribution. In medicine, redescriptions can be used to find connections between patients' background information and their symptoms and diagnoses, which in turn may help to better understand the diseases themselves.
Previously, redescription mining has been limited to Boolean variables and propositional descriptions. Many things, such as climate type, cannot be described naturally with such restricted formalisms. This thesis therefore broadens the applicability of redescriptions. It presents the first algorithm for finding redescriptions in data whose attributes are real-valued, and addresses for the first time the search for redescriptions in relational data, where objects refer to one another.
In addition, the thesis examines methods for selecting the highest-quality redescriptions from a set of redescriptions. These methods include both an interactive user interface for mining and visualizing redescriptions and a parameter-free method, based on information theory, for selecting the best descriptions.
As a whole, the thesis thus extends redescription mining from Boolean variables to other kinds of data and demonstrates the method's potential in a wide range of application areas.
Méthodes pour la fouille de redescriptions (Methods for redescription mining)
When a phenomenon is analyzed scientifically, the available data are often of different natures. Among other things, they may come from different sources or use different terminologies.
Discovering correspondences between these different aspects provides a natural way to better understand the phenomenon under study.
This is the guiding idea of redescription mining, the data analysis method studied in this thesis. Redescription mining aims to find different ways of describing the same things and, vice versa, to find things that share several descriptions.
An example in biology consists of identifying geographical areas that can be characterized in two ways: in terms of their climatic conditions on the one hand, and in terms of the animal species living there on the other. The European regions of Scandinavia and the Baltic, for example, have similar temperature and precipitation conditions, and the moose is a species common to both regions. Identifying such redescriptions can potentially help elucidate the influence of climate on the distribution of animal species.
To take another example, redescription mining could be applied in medicine to relate patients' histories, symptoms, and diagnoses, with the aim of improving our understanding of diseases.
Previously, redescription mining used only propositional queries over Boolean variables. However, many conditions, such as the climate mentioned above, cannot be expressed in this restricted formalism. In this thesis, we propose an algorithm for directly constructing redescriptions with real-valued variables. We then introduce redescriptions involving links between objects, that is, based on relational queries. We also study approaches for selecting high-quality redescriptions, either by using an interface that allows interactive mining and visualization of redescriptions, or via a parameter-free method motivated by principles of information theory.
In summary, we extend redescription mining beyond the Boolean world and show that it is a powerful data exploration method relevant to a wide variety of domains.
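As a rough illustration of what a redescription is, the sketch below pairs a climate query with a species query and compares the sets of regions each one selects. It scores the pair with Jaccard similarity, a common accuracy measure in the redescription mining literature; that choice is an assumption here and is not stated in the abstract. The regions, attribute names, and values are invented toy data, and `support` and `jaccard` are hypothetical helper names.

```python
from typing import Callable, Dict, Set

Region = Dict[str, object]

# Toy data (invented): each region has a climate view and a species view.
regions: Dict[str, Region] = {
    "north": {"mean_temp_c": 2.0, "species": {"moose", "fox"}},
    "coast": {"mean_temp_c": 4.0, "species": {"moose"}},
    "hills": {"mean_temp_c": 3.0, "species": {"fox"}},
    "south": {"mean_temp_c": 15.0, "species": {"fox"}},
}


def support(objects: Dict[str, Region], query: Callable[[Region], bool]) -> Set[str]:
    """Return the names of the objects that satisfy a query."""
    return {name for name, attrs in objects.items() if query(attrs)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One query per view: a real-valued climate condition and a species condition.
climate_support = support(regions, lambda r: r["mean_temp_c"] < 5.0)
species_support = support(regions, lambda r: "moose" in r["species"])

# A score near 1 means the two queries describe almost the same regions.
print(sorted(climate_support), sorted(species_support))
print(jaccard(climate_support, species_support))
```

Here the climate query selects three regions and the species query two of them, so the pair is only an approximate redescription.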