29 research outputs found

    A review of multi-instance learning assumptions

    Multi-instance (MI) learning is a variant of inductive machine learning, where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, hence many MI learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. Although all early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain, this ‘standard MI assumption’ is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped and alternative assumptions are considered instead. However, it is often not clearly stated which particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.
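    The abstract above names the standard MI assumption without spelling it out. As it is usually stated in the MI literature, a bag is positive if and only if at least one of its instances is positive. The sketch below illustrates that rule next to one possible relaxed, count-based alternative. The function names, the `is_active` instance-level concept, and the toy bags are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

Instance = Sequence[float]
Bag = Sequence[Instance]


def standard_mi_label(bag: Bag, instance_is_positive: Callable[[Instance], bool]) -> bool:
    """Standard MI assumption: positive iff at least one instance is positive."""
    return any(instance_is_positive(x) for x in bag)


def count_based_label(
    bag: Bag, instance_is_positive: Callable[[Instance], bool], min_positive: int
) -> bool:
    """Illustrative relaxed assumption: positive iff at least
    `min_positive` instances are positive (hypothetical example)."""
    return sum(1 for x in bag if instance_is_positive(x)) >= min_positive


def is_active(x: Instance) -> bool:
    """Hypothetical instance-level concept: first feature above 0.5."""
    return x[0] > 0.5


bag_a = [(0.1, 0.2), (0.9, 0.3)]  # one instance satisfies is_active
bag_b = [(0.1, 0.2), (0.4, 0.8)]  # no instance satisfies is_active

print(standard_mi_label(bag_a, is_active))
print(standard_mi_label(bag_b, is_active))
print(count_based_label(bag_a, is_active, min_positive=2))
```

    With `min_positive=1` the count-based rule reduces to the standard assumption, which is one way to see the standard assumption as a special case of a broader family.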

    Causal Discovery for Relational Domains: Representation, Reasoning, and Learning

    Many domains are currently experiencing the growing trend to record and analyze massive, observational data sets with increasing complexity. A commonly made claim is that these data sets hold potential to transform their corresponding domains by providing previously unknown or unexpected explanations and enabling informed decision-making. However, only knowledge of the underlying causal generative process, as opposed to knowledge of associational patterns, can support such tasks. Most methods for traditional causal discovery—the development of algorithms that learn causal structure from observational data—are restricted to representations that require limiting assumptions on the form of the data. Causal discovery has almost exclusively been applied to directed graphical models of propositional data that assume a single type of entity with independence among instances. However, most real-world domains are characterized by systems that involve complex interactions among multiple types of entities. Many state-of-the-art methods in statistics and machine learning that address such complex systems focus on learning associational models, and they are oftentimes mistakenly interpreted as causal. The intersection between causal discovery and machine learning in complex systems is small. The primary objective of this thesis is to extend causal discovery to such complex systems. Specifically, I formalize a relational representation and model that can express the causal and probabilistic dependencies among the attributes of interacting, heterogeneous entities. I show that the traditional method for reasoning about statistical independence from model structure fails to accurately derive conditional independence facts from relational models. 
I introduce a new theory (relational d-separation) and a novel, lifted representation (the abstract ground graph) that supports a sound, complete, and computationally efficient method for algorithmically deriving conditional independencies from probabilistic models of relational data. The abstract ground graph representation also presents causal implications that enable the detection of causal direction for bivariate relational dependencies without parametric assumptions. I leverage these implications and the theoretical framework of relational d-separation to develop a sound and complete algorithm, the relational causal discovery (RCD) algorithm, that learns causal structure from relational data.

    Modeling Complex Networks For (Electronic) Commerce

    NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research

    Transforming Graph Representations for Statistical Relational Learning

    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed

    Relational clustering models for knowledge discovery and recommender systems

    Cluster analysis is a fundamental research field in Knowledge Discovery and Data Mining (KDD). It aims at partitioning a given dataset into homogeneous clusters so as to reflect the natural hidden data structure. Various heuristic or statistical approaches have been developed for analyzing propositional datasets. Nevertheless, in relational clustering the existence of multi-type relationships greatly degrades the performance of traditional clustering algorithms. This issue motivates us to find more effective algorithms for cluster analysis of relational datasets. In this thesis we comprehensively study the idea of Representative Objects for approximating data distribution and then design a multi-phase clustering framework for analyzing relational datasets with high effectiveness and efficiency. The second task considered in this thesis is to provide better data models for people as well as machines to browse and navigate a dataset. The hierarchical taxonomy is widely used for this purpose. Compared with manually created taxonomies, automatically derived ones are more appealing because of their low creation/maintenance cost and high scalability. Up to now, taxonomy generation techniques have mainly been used to organize document corpora. We investigate the possibility of applying them to relational datasets and then propose some algorithmic improvements. Another non-trivial problem is how to assign suitable labels to the taxonomic nodes so as to credibly summarize the content of each node. To the best of our knowledge, this field has not been investigated sufficiently, and so we attempt to fill the gap by proposing some novel approaches. The final goal of our cluster analysis and taxonomy generation techniques is to improve the scalability of recommender systems that are developed to tackle the problem of information overload.
Recent research in recommender systems integrates the exploitation of domain knowledge to improve the recommendation quality, which, however, reduces the scalability of the whole system. We address this issue by applying the automatically derived taxonomy to preserve the pair-wise similarities between items, and then modeling the user visits by another hierarchical structure. Experimental results show that the computational complexity of the recommendation procedure can be greatly reduced and thus the system scalability improved.

    Learning Instance Weights in Multi-Instance Learning

    Multi-instance (MI) learning is a variant of supervised machine learning, where each learning example contains a bag of instances instead of just a single feature vector. MI learning has applications in areas such as drug activity prediction, fruit disease management and image classification. This thesis investigates the case where each instance has a weight value determining the level of influence that it has on its bag's class label. This is a more general assumption than most existing approaches use, and thus is more widely applicable. The challenge is to accurately estimate these weights in order to make predictions at the bag level. An existing approach known as MILES is retroactively identified as an algorithm that uses instance weights for MI learning, and is evaluated using a variety of base learners on benchmark problems. New algorithms for learning instance weights for MI learning are also proposed and rigorously evaluated on both artificial and real-world datasets. The new algorithms are shown to achieve better root mean squared error rates than existing approaches on artificial data generated according to the algorithms' underlying assumptions. Experimental results also demonstrate that the new algorithms are competitive with existing approaches on real-world problems
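    To make the idea of instance weights concrete, the sketch below shows one simple way a per-instance weight could control how much each instance contributes to a bag-level prediction: a weighted average of instance-level scores. This is only an illustration under assumed names (`weighted_bag_score`, the toy scores and weights). It is not the thesis's algorithm and not MILES, and it takes the weights as given rather than learning them.

```python
from typing import Sequence


def weighted_bag_score(instance_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine instance-level scores into one bag-level score.

    Each weight sets how strongly its instance influences the bag.
    Assumes non-negative weights with a positive sum (illustrative choice).
    """
    if len(instance_scores) != len(weights):
        raise ValueError("need exactly one weight per instance")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * s for w, s in zip(weights, instance_scores)) / total


# Hypothetical bag: instance-level scores in [0, 1] with their weights.
scores = [1.0, 0.0]
print(weighted_bag_score(scores, [3.0, 1.0]))  # first instance dominates
print(weighted_bag_score(scores, [1.0, 1.0]))  # uniform weights: plain mean
```

    Uniform weights recover an unweighted average, so a weighted scheme of this kind contains the equal-influence case as a special instance.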

    Menetelmiä jälleenkuvausten louhintaan (Methods for Redescription Mining)

    In scientific investigations, data often differ in nature. For instance, they might originate from distinct sources or be cast over separate terminologies. In order to gain insight into the phenomenon of interest, a natural task is to identify the correspondences that exist between these different aspects. This is the motivating idea of redescription mining, the data analysis task studied in this thesis. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions. A practical example in biology consists in finding geographical areas that admit two characterizations, one in terms of their climatic profile and one in terms of the occupying species. Discovering such redescriptions can improve our understanding of the influence of climate on species distribution. Besides biology, applications of redescription mining can be envisaged in medicine or sociology, among other fields. Previously, redescription mining was restricted to propositional queries over Boolean attributes. However, many conditions, such as the aforementioned climate, cannot be expressed naturally in this limited formalism. In this thesis, we consider more general query languages and propose algorithms to find the corresponding redescriptions, making the task relevant to a broader range of domains and problems. Specifically, we start by extending redescription mining to non-Boolean attributes. In other words, we propose an algorithm to handle nominal and real-valued attributes natively. We then extend redescription mining to the relational setting, where the aim is to find corresponding connection patterns that relate almost the same object tuples in a network. We also study approaches for selecting high quality redescriptions to be output by the mining process. 
The first approach relies on an interface for mining and visualizing redescriptions interactively and allows the analyst to tailor the selection of results to their needs. The second approach, rooted in information theory, is a compression-based method for mining small sets of associations from two-view datasets. In summary, we take redescription mining outside the Boolean world and show its potential as a powerful exploratory method relevant in a broad range of domains.
    Finnish abstract (translated): Scientific research data are often gathered from sources that use different terminologies. Identifying the correspondences and connections between these different viewpoints is a natural way to approach the phenomenon under study. This thesis examines a data analysis method that aims at exactly this: redescription mining. The goal of redescriptions is, on the one hand, to describe the same thing in alternative ways and, on the other, to identify things that have several different descriptions. Redescription mining has potential applications in biology, medicine, and sociology, among other fields. In biology, for example, one can look for geographical areas that can be characterized in two alternative ways: by describing the area's climate or by describing the species living there. For instance, Scandinavia and the Baltic region have similar temperature and precipitation conditions, and the moose is a species common to both areas. Finding such redescriptions can help in understanding the effects of climate on species distribution. In medicine, redescriptions can reveal connections between patients' background information and their symptoms and diagnoses, which may in turn help to better understand the diseases themselves. Previously, redescription mining has been limited to Boolean variables and propositional descriptions. Many things, such as climate type, cannot be described naturally with such restricted formalisms. The thesis therefore extends the applicability of redescriptions. It presents the first algorithm for finding redescriptions in data with real-valued attributes and, for the first time, addresses the search for redescriptions in relational data, where objects refer to one another. In addition, the thesis examines methods for selecting the highest-quality redescriptions from a set. These include an interactive user interface for mining and visualizing redescriptions and a parameter-free method, based on information theory, for selecting the best descriptions. Overall, the thesis extends redescription mining from Boolean variables to other kinds of data and demonstrates the method's potential in many application areas.
    French abstract (translated): Methods for redescription mining. In the scientific analysis of a phenomenon, the available data are often of different natures. Among other things, they may come from different sources or use different terminologies. Discovering correspondences between these different aspects provides a natural way to better understand the phenomenon under study. This is the guiding idea of redescription mining, the data analysis method studied in this thesis. Redescription mining aims to find various ways of describing the same things and, vice versa, to find things that share several descriptions. An example in biology is determining geographical areas that can be characterized in two ways, in terms of their climatic conditions on the one hand and in terms of the animal species living there on the other. The European regions of Scandinavia and the Baltic, for example, have similar temperature and precipitation conditions, and the moose is a species common to both regions. Identifying such redescriptions can potentially help elucidate the influence of climate on the distribution of animal species. To take another example, redescription mining could be applied in medicine to relate patients' histories, symptoms, and diagnoses, with the aim of improving our understanding of diseases. Previously, redescription mining used only propositional queries over Boolean variables. However, many conditions, such as the climate mentioned above, cannot be expressed in this restricted formalism. In this thesis, we propose an algorithm for constructing redescriptions directly with real-valued variables. We then introduce redescriptions involving links between objects, that is, based on relational queries. We also study approaches for selecting high-quality redescriptions, either using an interface that enables interactive mining and visualization of redescriptions, or via a parameter-free method motivated by principles of information theory. In summary, we extend redescription mining beyond the Boolean world and show that it is a powerful data exploration method relevant to a wide variety of domains.

    Modeling complex networks for electronic commerce
