From-Below Boolean Matrix Factorization Algorithm Based on MDL
Over the past few years, Boolean matrix factorization (BMF) has become an
important direction in data analysis. The minimum description length (MDL)
principle has been successfully adapted in BMF for model order selection.
Nevertheless, a BMF algorithm that performs well with respect to the standard
measures used in BMF has been missing. In this paper, we propose a novel
from-below Boolean matrix factorization algorithm based on formal concept
analysis. The algorithm uses the MDL principle as the criterion for factor
selection. In various experiments, we show that the proposed algorithm
outperforms existing state-of-the-art BMF algorithms from several standpoints.
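As a rough sketch of the from-below idea (not the paper's algorithm; the cost function and the column-seeded candidate generation below are simplified assumptions), each factor is a formal concept, i.e., a maximal all-ones rectangle of the input matrix, so the factorization never covers a 0; factors are added greedily while an MDL-flavored cost keeps decreasing:

```python
import numpy as np

def mdl_cost(X, A, B):
    """Crude MDL-style cost: description length of the factors (counted as
    their number of 1s) plus the number of entries of X they get wrong."""
    reconstruction = (A @ B) > 0
    error = np.sum(X != reconstruction)
    return np.sum(A) + np.sum(B) + error

def from_below_bmf(X, max_factors=20):
    """Greedy from-below factorization: every factor is a formal concept
    (a maximal all-ones rectangle), so it covers only 1s of X; factors are
    kept only while they reduce the MDL-style cost."""
    n, m = X.shape
    A = np.zeros((n, 0), dtype=int)
    B = np.zeros((0, m), dtype=int)
    covered = np.zeros_like(X)
    for _ in range(max_factors):
        best, best_gain = None, 0
        for j in range(m):                      # seed a concept from each column
            rows = X[:, j] == 1
            if not rows.any():
                continue
            cols = X[rows].min(axis=0) == 1     # intent: columns shared by those rows
            rows = X[:, cols].min(axis=1) == 1  # extent: rows having all those columns
            gain = np.sum(np.outer(rows, cols) & (covered == 0) & (X == 1))
            if gain > best_gain:
                best, best_gain = (rows, cols), gain
        if best is None:
            break
        rows, cols = best
        A2 = np.column_stack([A, rows.astype(int)])
        B2 = np.vstack([B, cols.astype(int)[None, :]])
        if mdl_cost(X, A2, B2) >= mdl_cost(X, A, B):
            break                               # a further factor no longer pays off
        A, B = A2, B2
        covered |= np.outer(rows, cols).astype(int)
    return A, B
```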
ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, identifying the disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates by exploiting the wealth of information available about the genes in various databases.
Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help identify new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.
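ProDiGe itself integrates multiple sources of gene information and shares knowledge across diseases; the snippet below is only a minimal sketch of the positive-unlabeled core, assuming plain feature vectors and a bagging-style scheme in which random subsamples of the unlabeled genes serve as provisional negatives. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def pu_bagging_scores(X_pos, X_unlabeled, n_rounds=30, seed=0):
    """Rank unlabeled genes by averaging the out-of-bag decision values of
    classifiers trained on the known disease genes (positives) versus random
    subsamples of the unlabeled pool used as provisional negatives."""
    rng = np.random.default_rng(seed)
    n_u = len(X_unlabeled)
    scores, counts = np.zeros(n_u), np.zeros(n_u)
    k = min(len(X_pos), n_u)        # draw as many pseudo-negatives as positives
    for _ in range(n_rounds):
        idx = rng.choice(n_u, size=k, replace=False)
        X = np.vstack([X_pos, X_unlabeled[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(k)])
        clf = SVC().fit(X, y)
        out = np.setdiff1d(np.arange(n_u), idx)    # score only held-out genes
        if len(out) == 0:
            continue
        scores[out] += clf.decision_function(X_unlabeled[out])
        counts[out] += 1
    return scores / np.maximum(counts, 1)   # highest scores = top candidates
```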
Sequential recommendation with metric models based on frequent sequences
Modeling user preferences (long-term history) and user dynamics (short-term
history) is of the greatest importance for building efficient sequential
recommender systems. The challenge lies in successfully combining the user's
whole history and their recent actions (sequential dynamics) to provide
personalized recommendations. Existing methods capture the sequential dynamics
of a user with fixed-order Markov chains (usually first-order chains)
regardless of the user, which limits both the impact of the user's past on the
recommendation and the ability to adapt the length of that past to the user
profile. In this article, we propose to use frequent sequences to identify the
most relevant part of the user history for the recommendation. The most salient
items are then used in a unified metric model that embeds items based on user
preferences and sequential dynamics. Extensive experiments demonstrate that our
method outperforms state-of-the-art approaches, especially on sparse datasets.
We show that considering sequences of varying lengths improves the
recommendations, and we also emphasize that these sequences provide
explanations for the recommendations.
Comment: 25 pages, 6 figures, submitted to DAMI (under review)
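Purely as an illustration (the paper's model learns the embeddings; here they are assumed given, and all names are hypothetical), frequent sequences can pick out the salient recent part of a history, which then contributes to a distance-based score:

```python
import numpy as np
from collections import Counter

def frequent_sequences(histories, max_len=3, min_support=5):
    """Count contiguous subsequences of length 1..max_len over all user
    histories and keep those reaching the support threshold."""
    counts = Counter()
    for h in histories:
        for L in range(1, max_len + 1):
            for i in range(len(h) - L + 1):
                counts[tuple(h[i:i + L])] += 1
    return {s for s, c in counts.items() if c >= min_support}

def salient_suffix(history, freq_seqs, max_len=3):
    """Longest recent suffix of the history that is a frequent sequence:
    the part of the past deemed relevant for the next recommendation.
    Note how the length adapts to the user, unlike a fixed-order chain."""
    for L in range(min(max_len, len(history)), 0, -1):
        if tuple(history[-L:]) in freq_seqs:
            return tuple(history[-L:])
    return tuple(history[-1:])          # fall back to the last item

def score(candidate, suffix, item_emb, user_emb):
    """Metric-model score: smaller distances to the user's long-term
    embedding and to the salient items mean a better candidate."""
    d_user = np.linalg.norm(item_emb[candidate] - user_emb)
    d_seq = np.mean([np.linalg.norm(item_emb[candidate] - item_emb[i])
                     for i in suffix])
    return -(d_user + d_seq)
```

The matched suffix doubles as an explanation: the items it contains are the ones the recommendation can be attributed to.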
Comprehensible and Robust Knowledge Discovery from Small Datasets
Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as exactly as possible and (2) comes in an easily comprehensible form. Decision trees and subgroup discovery methods provide knowledge summaries in the form of hyperrectangles; these are considered easy to understand.
To demonstrate the importance of a comprehensible data summary, we study Decentral Smart Grid Control ("Dezentrale intelligente Netzsteuerung"), a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system carried out so far was limited to considering identical participants and therefore did not reflect reality sufficiently well. We run a large number of simulations with varying input values and apply decision trees to the resulting data. The resulting comprehensible data summaries allowed us to gain new insights into the behavior of Decentral Smart Grid Control.
Decision trees make it possible to describe the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but rather in finding regions that lead to a certain output (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to achieve stable and accurate output. The data acquisition process, however, is often costly. Our main contribution is the improvement of subgroup discovery from datasets with few observations.
Subgroup discovery in simulated data is known as scenario discovery. A frequently used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS works much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average.
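A minimal sketch of this two-stage idea, assuming a random forest as the intermediate model and a caller-supplied sampler for the (known) input distribution; the peeling step only gestures at PRIM, whose full procedure also involves pasting and a coverage-density trade-off:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def reds_augment(X_sim, y_sim, sample_inputs, n_new=100_000):
    """Stage 1 of the REDS idea: fit an intermediate model on the scarce
    simulation runs, then label a large fresh sample drawn from the known
    input distribution; scenario discovery then runs on the abundant data."""
    model = RandomForestRegressor().fit(X_sim, y_sim)
    X_new = sample_inputs(n_new)            # caller supplies the input sampler
    return X_new, model.predict(X_new)

def peel_once(X, y, box, alpha=0.05):
    """One PRIM-style peeling step (heavily simplified): drop the
    alpha-fraction slice, from either end of one dimension, whose removal
    leaves the highest mean output inside the box."""
    in_box = np.all([(X[:, k] >= lo) & (X[:, k] <= hi)
                     for k, (lo, hi) in enumerate(box)], axis=0)
    Xb, yb = X[in_box], y[in_box]
    best = None
    for j in range(X.shape[1]):
        for side, q in (("low", np.quantile(Xb[:, j], alpha)),
                        ("high", np.quantile(Xb[:, j], 1 - alpha))):
            keep = Xb[:, j] >= q if side == "low" else Xb[:, j] <= q
            if keep.any() and (best is None or yb[keep].mean() > best[0]):
                best = (yb[keep].mean(), j, side, q)
    if best is not None:
        _, j, side, q = best
        lo, hi = box[j]
        box[j] = (max(lo, q), hi) if side == "low" else (lo, min(hi, q))
    return box
```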
With simulated data, one has perfect knowledge of the input distribution, which is a prerequisite of REDS. To make REDS applicable to real measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We evaluated the resulting method experimentally in combination with different methods for generating data. We did this for PRIM and for BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
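When the input distribution must itself be estimated from real measurements, a kernel density estimate is one straightforward stand-in (a sketch only; the thesis evaluates several data generation methods, and the choice of KDE here is an assumption):

```python
from scipy.stats import gaussian_kde

def sample_like(X_measured, n_new=100_000):
    """Draw synthetic inputs from a kernel density estimate fitted to the
    measured inputs, standing in for the known input distribution that
    REDS relies on in the simulation setting."""
    kde = gaussian_kde(X_measured.T)     # scipy expects shape (n_dims, n_points)
    return kde.resample(n_new).T
```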
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Communities
One of the major hurdles preventing the full exploitation of information from
online communities is the widespread concern regarding the quality and
credibility of user-contributed content. Prior works in this domain operate on
a static snapshot of the community, make strong assumptions about the
structure of the data (e.g., relational tables), or consider only shallow
features for text classification.
To address the above limitations, we propose probabilistic graphical models
that can leverage the joint interplay between multiple factors in online
communities (user interactions, community dynamics, and textual content) to
automatically assess the credibility of user-contributed online content, and
the expertise of users and their evolution, with user-interpretable
explanations. To this end, we devise new models based on Conditional Random
Fields for different settings, such as incorporating partial expert knowledge
for semi-supervised learning, and handling discrete labels as well as numeric
ratings for fine-grained analysis. This enables applications such as extracting
reliable side effects of drugs from user-contributed posts in health forums,
and identifying credible content in news communities.
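The models in the thesis are task-specific CRFs; purely as an illustration of the underlying idea, here is brute-force MAP inference in a toy pairwise CRF where unary scores encode per-post evidence and edges encourage posts that share context to agree on a binary credibility label (all names and scores are hypothetical):

```python
import itertools

def map_credibility(n_posts, unary, edges, coupling=1.0):
    """Brute-force MAP inference in a toy pairwise CRF over binary
    credibility labels: unary[i][l] scores post i taking label l (from
    per-post evidence such as text and author expertise), and each edge
    (i, j) adds `coupling` when the linked posts agree."""
    best, best_score = None, float("-inf")
    for labels in itertools.product((0, 1), repeat=n_posts):
        s = sum(unary[i][labels[i]] for i in range(n_posts))
        s += sum(coupling for i, j in edges if labels[i] == labels[j])
        if s > best_score:
            best, best_score = labels, s
    return best

# e.g. three posts, the middle one undecided on its own, chained edges:
# map_credibility(3, [(-1.0, 2.0), (0.0, 0.0), (2.0, -1.0)], [(0, 1), (1, 2)])
```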
Online communities are dynamic, as users join and leave, adapt to evolving
trends, and mature over time. To capture this dynamics, we propose generative
models based on Hidden Markov Model, Latent Dirichlet Allocation, and Brownian
Motion to trace the continuous evolution of user expertise and their language
model over time. This allows us to identify expert users and credible content
jointly over time, improving state-of-the-art recommender systems by explicitly
considering the maturity of users. This also enables applications such as
identifying helpful product reviews, and detecting fake and anomalous reviews
with limited information.Comment: PhD thesis, Mar 201
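The thesis combines these generative ingredients; the stand-alone sketch below shows only the Brownian-motion component, tracking a single user's latent expertise as a random walk observed through noisy per-post quality ratings (a scalar Kalman filter; the parameters q and r are illustrative):

```python
import numpy as np

def filter_expertise(ratings, times, q=0.1, r=1.0):
    """Scalar Kalman filter for a latent expertise trajectory modeled as
    Brownian motion: expertise drifts between posts with variance q per
    unit time, and each rating is a noisy reading (variance r) of the
    current expertise."""
    mu, var = 0.0, 1.0                  # prior on the initial expertise
    prev_t, trajectory = times[0], []
    for y, t in zip(ratings, times):
        var += q * (t - prev_t)         # Brownian drift over the time gap
        gain = var / (var + r)          # trust in the new rating
        mu += gain * (y - mu)
        var *= 1.0 - gain
        trajectory.append(mu)
        prev_t = t
    return np.array(trajectory)
```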