Feature selection in high-dimensional dataset using MapReduce
This paper describes a distributed MapReduce implementation of the minimum
Redundancy Maximum Relevance algorithm, a popular feature selection method in
bioinformatics and network inference problems. The proposed approach handles
both tall/narrow and wide/short datasets. We further provide an open source
implementation based on Hadoop/Spark, and illustrate its scalability on
datasets involving millions of observations or features.
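As a rough illustration of the selection rule that the abstract above refers to, the following is a minimal single-machine sketch of greedy minimum Redundancy Maximum Relevance selection. It is not the distributed MapReduce implementation the paper describes. As a simplifying assumption, absolute Pearson correlation stands in for mutual information, and the names `pearson` and `mrmr` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def mrmr(features, target, k):
    """Greedily pick up to k feature names from a dict {name: column}.

    Assumption: |Pearson correlation| is used as a proxy for mutual information.
    Each step picks the feature maximising relevance to the target minus the
    mean redundancy with the features already selected.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a distributed setting, the relevance and redundancy scores are the quantities that would be computed in parallel over partitions of the data. How the paper partitions tall/narrow versus wide/short datasets is not stated in the abstract and is not modelled here.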
OpenTED Browser: Insights into European Public Spendings
We present the OpenTED browser, a Web application for interactively
browsing public spending data related to public procurements in the European
Union. The application relies on Open Data recently published by the European
Commission and the Publications Office of the European Union, from which we
imported a curated dataset of 4.2 million contract award notices spanning the
period 2006-2015. The application is designed to easily filter notices and
visualise relationships between public contracting authorities and private
contractors. The simple design makes it possible, for example, to quickly find information
about who the biggest suppliers of local governments are, and the nature of the
contracted goods and services. We believe the tool, which we make Open Source,
is a valuable resource for journalists, NGOs, analysts and
citizens seeking information on public procurement, from large scale
trends to local municipal developments.
Comment: ECML, PKDD, SoGood workshop 201
Study of meta-analysis strategies for network inference using information-theoretic approaches
© 2017 IEEE.
Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene expression data has accumulated in public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has therefore naturally become a standard procedure in modern computational biology. Indeed, such analysis is usually more robust than traditional approaches focused on individual datasets, which typically suffer from experimental bias and a small number of samples.
To date, there are two main strategies for this problem: the first ("data merging") merges all datasets together and then infers a GRN, whereas the second ("networks ensemble") infers a GRN from each dataset separately and then aggregates them using ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking.
In this paper, we evaluate the performance of the meta-analysis approaches mentioned above with a systematic set of experiments based on in silico benchmarks. Furthermore, we present a new meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating the matrices of pairwise measures from every dataset, then extracting the network from the meta-matrix.
Peer Reviewed. Postprint (author's final draft
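The two-step idea described above (aggregate the pairwise matrices, then extract a network from the meta-matrix) can be sketched as follows. This is only an illustration under stated assumptions: Pearson correlation is used as the pairwise measure, the aggregation rule is an element-wise mean, and the network is extracted by simple thresholding. The abstract does not specify the authors' aggregation or extraction rules, and all function names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def pairwise_matrix(genes):
    """Matrix of pairwise measures for one dataset (genes: list of expression profiles)."""
    n = len(genes)
    return [[pearson(genes[i], genes[j]) for j in range(n)] for i in range(n)]


def aggregate(matrices):
    """Step 1 (assumed rule): element-wise mean of the per-dataset matrices."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


def extract_network(meta, threshold):
    """Step 2 (assumed rule): keep undirected edges whose |score| reaches the threshold."""
    n = len(meta)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(meta[i][j]) >= threshold
    ]
```

Note that the datasets may have different numbers of samples, since only the gene-by-gene matrices are combined. The genes must, however, be listed in the same order in every dataset.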
From dependency to causality: a machine learning approach
The relationship between statistical dependency and causality lies at the
heart of all statistical approaches to causal inference. Recent results in the
ChaLearn cause-effect pair challenge have shown that causal directionality can
be inferred with good accuracy also in Markov indistinguishable configurations
thanks to data driven approaches. This paper proposes a supervised machine
learning approach to infer the existence of a directed causal link between two
variables in multivariate settings with n>2 variables. The approach relies on
the asymmetry of some conditional (in)dependence relations between the members
of the Markov blankets of two variables causally connected. Our results show
that supervised learning methods may be successfully used to extract causal
information on the basis of asymmetric statistical descriptors also for
n>2 variate distributions.
Comment: submitted to JML
Relevance of different prior knowledge sources for inferring gene interaction networks
When inferring networks from high-throughput genomic data, one of the main challenges is the subsequent validation of these networks. In the best case scenario, the true network is partially known from previous research results published in structured databases or research articles. Traditionally, inferred networks are validated against these known interactions. Whenever the recovery rate is gauged to be high enough, subsequent high scoring but unknown inferred interactions are deemed good candidates for further experimental validation. Such a validation framework therefore depends strongly on the quantity and quality of published interactions and presents serious pitfalls: (1) known interactions for the studied problem might be sparse; (2) quantitatively comparing different inference algorithms is not trivial; and (3) the use of these known interactions for validation prevents their integration in the inference procedure. The latter is particularly relevant as it has recently been shown that integration of priors during network inference significantly improves the quality of inferred networks. To overcome these problems when validating inferred networks, we recently proposed a data-driven validation framework based on single gene knock-down experiments. Using this framework, we were able to demonstrate the benefits of integrating prior knowledge and expression data. In this paper we used this framework to assess the quality of different sources of prior knowledge on their own and in combination with different genomic data sets in colorectal cancer. We observed that most prior sources lead to significant F-scores. Furthermore, their integration with genomic data leads to a significant increase in F-scores, especially for priors extracted from full text PubMed articles, known co-expression modules and genetic interactions.
Lastly, we observed that the results are consistent for three different data sets: experimental knock-down data and two human tumor data sets.
minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information
On the Impact of Entropy Estimation on Transcriptional Regulatory Network Inference Based on Mutual Information
A churn prediction dataset from the telecom sector: a new benchmark for uplift modeling
Uplift modeling, also known as individual treatment effect (ITE) estimation,
is an important approach for data-driven decision making that aims to identify
the causal impact of an intervention on individuals. This paper introduces a
new benchmark dataset for uplift modeling focused on churn prediction, coming
from a telecom company in Belgium, Orange Belgium. Churn, in this context,
refers to customers terminating their subscription to the telecom service. This
is the first publicly available dataset offering the possibility to evaluate
the efficiency of uplift modeling on the churn prediction problem. Moreover,
its unique characteristics make it more challenging than the few other public
uplift datasets.
Comment: 8 pages, 2 figures, 5 tables, post-proceedings of the ECML PKDD 2023
Workshop on Uplift Modeling and Causal Machine Learning for Operational
Decision Makin
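To make the notion of uplift in the abstract above concrete, here is a minimal sketch that estimates, per customer segment, how much a retention action reduces the churn rate. It assumes that treatment was assigned at random and that each record is a hypothetical `(segment, treated, churned)` triple. It is not the benchmark's data schema nor any method from the paper, and `segment_uplift` is an illustrative name.

```python
from collections import defaultdict


def segment_uplift(records):
    """Estimate uplift per segment as churn_rate(control) - churn_rate(treated).

    records: iterable of (segment, treated, churned) triples (hypothetical schema).
    Segments missing either a treated or a control group are skipped, because
    the difference in churn rates cannot be estimated for them.
    """
    # counts[segment][treated] = [number of customers, number who churned]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, churned in records:
        cell = counts[segment][bool(treated)]
        cell[0] += 1
        cell[1] += int(churned)

    uplift = {}
    for segment, groups in counts.items():
        (n_treated, churn_treated) = groups[True]
        (n_control, churn_control) = groups[False]
        if n_treated == 0 or n_control == 0:
            continue
        uplift[segment] = churn_control / n_control - churn_treated / n_treated
    return uplift
```

A positive value means the action appears to reduce churn in that segment. Uplift models generalise this idea by predicting the effect from individual customer features rather than from coarse segments.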
Adversarial Learning in Real-World Fraud Detection: Challenges and Perspectives
The data economy relies on data-driven systems, and complex machine learning
applications are fueled by them. Unfortunately, machine learning
models are exposed to fraudulent activities and adversarial attacks, which
threaten their security and trustworthiness. In the last decade or so,
research interest in adversarial machine learning has grown significantly,
revealing how learning applications could be severely impacted by effective
attacks. Although early results of adversarial machine learning indicate the
huge potential of the approach in specific domains such as image processing,
there is still a gap in both the research literature and practice regarding how
to generalize adversarial techniques to other domains and applications. Fraud
detection is a critical defense mechanism for the data economy, as it is for other
applications, and it poses several challenges for machine learning. In
this work, we describe how attacks against fraud detection systems differ from
other applications of adversarial machine learning, and propose a number of
interesting directions to bridge this gap.