Search CORE

2,146 research outputs found

Systematic comparison of ranking aggregation methods for gene lists in experimental results

Author: Baillie J Kenneth
Cole Joby
Dockrell David H
Gutmann Michael U
Law Andy
Parkinson Nicholas
Regan Tim
Russell Clark D
Wang Bo
Publication venue: 'Oxford University Press (OUP)'
Publication date: 12/09/2022

MOTIVATION: A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. The gene lists resulting from a group of studies answering the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. Evaluating a ranking aggregation method on a specific type of data before using it is necessary to establish its reliability, since the properties of a dataset can influence the performance of an algorithm. Such evaluation on gene lists is usually based on a simulated database because of the lack of a known truth for real data. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists. RESULTS: In this study, a group of existing methods and their variations that are suitable for meta-analysis of gene lists are compared using simulated and real data. Simulated data were used to explore the performance of the aggregation methods under conditions emulating common scenarios in real genomic data, with varying heterogeneity of quality, noise level and a mix of unranked and ranked data using 20 000 possible entities. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (non-small cell lung cancer) and bacteria (macrophage apoptosis) was performed. We summarize the results of our evaluation in a simple flowchart to select a ranking aggregation method, and in an automated implementation using the meta-analysis by information content (MAIC) algorithm to infer heterogeneity of data quality across input datasets. AVAILABILITY AND IMPLEMENTATION: The code for simulated data generation and for running the edited versions of the algorithms is available at https://github.com/baillielab/comparison_of_RA_methods. Code to perform an optimal selection of methods based on the results of this review, using the MAIC algorithm to infer the characteristics of an input dataset, can be downloaded here: https://github.com/baillielab/maic. An online service for running MAIC: https://baillielab.net/maic. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PubMed Central

Edinburgh Research Explorer

Exploiting the ensemble paradigm for stable feature selection: A case study on high-dimensional genomic data

Author: Angioni Marta
DESSI NICOLETTA
PES BARBARA
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

Ensemble classification is a well-established approach that involves fusing the decisions of multiple predictive models. A similar “ensemble logic” has been recently applied to challenging feature selection tasks aimed at identifying the most informative variables (or features) for a given domain of interest. In this work, we discuss the rationale of ensemble feature selection and evaluate the effects and the implications of a specific ensemble approach, namely the data perturbation strategy. Basically, it consists of combining multiple selectors that exploit the same core algorithm but are trained on different perturbed versions of the original data. The real potential of this approach, still an object of debate in the feature selection literature, is here investigated in conjunction with different kinds of core selection algorithms (both univariate and multivariate). In particular, we evaluate the extent to which the ensemble implementation improves the overall performance of the selection process, in terms of predictive accuracy and stability (i.e., robustness with respect to changes in the training data). Furthermore, we measure the impact of the ensemble approach on the final selection outcome, i.e. on the composition of the selected feature subsets. The results obtained on ten public genomic benchmarks provide useful insight into both the benefits and the limitations of such an ensemble approach, paving the way to the exploration of new and wider ensemble schemes.
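
The data perturbation strategy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bootstrap resampling, the toy univariate selector (`top_k_by_mean_difference`), the vote-counting aggregation and all parameter names are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def top_k_by_mean_difference(rows, labels, k):
    """Hypothetical core selector: rank features by |mean(class 1) - mean(class 0)|."""
    n_features = len(rows[0])
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]

    def score(j):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(n_features), key=score, reverse=True)[:k]


def ensemble_select(rows, labels, k, n_rounds=50, seed=0):
    """Run the same selector on bootstrap resamples and keep the k most-voted features."""
    rng = random.Random(seed)
    votes = Counter()
    n = len(rows)
    for _ in range(n_rounds):
        # Perturb the data: draw n samples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # skip resamples that contain only one class
        sample_rows = [rows[i] for i in idx]
        votes.update(top_k_by_mean_difference(sample_rows, sample_labels, k))
    # Sort by vote count (descending), breaking ties by feature index.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:k]]
```

Counting how often each feature is selected across resamples is only one possible way to combine the selectors; the abstract does not say which aggregation the authors used.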

Archivio istituzionale della ricerca - Università di Cagliari

Fast and Robust Rank Aggregation against Model Misspecification

Author: Chen Weijie
Niu Gang
Pan Yuangang
Sugiyama Masashi
Tsang Ivor W.
Publication date: 29/05/2019

In rank aggregation, preferences from different users are summarized into a total order under the homogeneous data assumption. Thus, model misspecification arises and rank aggregation methods take some noise models into account. However, they all rely on certain noise model assumptions and cannot handle agnostic noises in the real world. In this paper, we propose CoarsenRank, which rectifies the underlying data distribution directly and aligns it to the homogeneous data assumption without involving any noise model. To this end, we define a neighborhood of the data distribution over which Bayesian inference of CoarsenRank is performed, and therefore the resultant posterior enjoys robustness against model misspecification. Further, we derive a tractable closed-form solution for CoarsenRank making it computationally efficient. Experiments on real-world datasets show that CoarsenRank is fast and robust, achieving consistent improvement over baseline methods

arXiv.org e-Print Archive

Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism

Author: Ferraro Petrillo Umberto
Guerra Concettina
Joshi Sarang
Lu Yinquan
Palini Francesco
Rossignac Jarek
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2020

We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.

Archivio della ricerca- Università di Roma La Sapienza

Analysis of Rank Aggregation Techniques for Rank Based on the Feature Selection Technique

Author: Dalal Surjeet
Sikri Alisha
Singh N. P.
Publication venue: Auricle Global Society of Education and Research
Publication date: 11/03/2023

Feature selection is the process of choosing the most important features from a group of attributes and removing the less important or redundant ones, in order to improve classification accuracy and lower future computation and data collection costs. To narrow down the features that need to be analyzed, a variety of feature selection procedures have been described in the literature. Chi-Square (CS), IG, Relief, GR, Symmetrical Uncertainty (SU), and MI are the six feature selection methods used in this study. Based on the outcomes of these six feature selection methods, the provided dataset is aggregated using four rank aggregation strategies: "rank aggregation," "Borda Count (BC) methodology," "score and rank combination," and "unified feature scoring" (UFS). These four procedures by themselves were unable to generate a clear selection rank for the features. To produce different ranks of features, this ensemble of aggregating ranks is carried out. For this, the bagging method of majority voting was applied.
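
As an illustration of one of the strategies named in this abstract, a basic Borda count over several feature rankings might look like the sketch below. It is a generic textbook variant, not the study's code: the scoring rule (n points for first place, where n is the number of distinct features, and zero for features absent from a list), the alphabetical tie-break and the function name `borda_count` are assumptions.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate ranked lists (best first) into one consensus ranking.

    Assumed scoring rule: with n distinct items overall, the item at
    position p (0-based) in a list earns n - p points from that list;
    items missing from a list earn nothing from it.
    """
    items = {item for ranking in rankings for item in ranking}
    n = len(items)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(items, key=lambda item: (-scores[item], item))
```

For example, given the three hypothetical rankings `["a", "b", "c"]`, `["b", "a", "c"]` and `["b", "c", "a"]`, feature `b` collects 8 points, `a` 6 and `c` 4, so the consensus order is `b`, `a`, `c`.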

International Journal on Recent and Innovation Trends in Computing and Communication

Inferring a consensus problem list using penalized multistage models for ordered data

Author: Boonstra Philip S
Krauss John C
Publication venue: Collection of Biostatistics Research Archive
Publication date: 23/10/2019

A patient's medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers, among other things. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider-effect. That is, to what extent does a patient's problem in the electronic medical record actually reflect a consensus communication of that patient's current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Due to the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. Physicians exhibited consensus on the highest ranked problems in the first and last case abstracts but agreement quickly deteriorates; in contrast, physicians broadly disagreed on the relevant problems for the middle and most difficult case abstract.

PubMed Central

Collection Of Biostatistics Research Archive

Heuristic ensembles of filters for accurate and reliable feature selection

Author: Aldehim Ghadah
Publication date: 01/12/2015

Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when used individually, and yet no rule exists to indicate which one should be used for a particular dataset. Thus, an ensemble method that combines the outputs of several individual feature selection methods appears to be a promising approach to address the issue and hence is investigated in this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of the feature selection. We proposed a novel heuristic ensemble of filters (HEF). It combines two types of filters: subset filters and ranking filters with a heuristic consensus algorithm in order to utilise the strength of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms the other ensemble algorithms, individual filters and the full feature set. The research on the HEF algorithm is extended in several dimensions; including more filter members, three novel schemes of mean rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weight to filters in HEF does not achieve the expected improvement in accuracy, but increases time and space complexity, and clearly decreases stability. Therefore, the core ensemble algorithm (HEF) is demonstrated to be not just simpler but also more reliable and consistent than the later more complicated and weighted ensembles. In addition, we investigated how to use data in feature selection, using ALL or PART of it. Systematic experiments with thirty five synthetic and benchmark real-world datasets were carried out

University of East Anglia digital repository

Integrating multiple immunogenetic data sources for feature extraction and mining somatic hypermutation patterns: the case of “towards analysis” in chronic lymphocytic leukaemia

Author: A Agathangelidis
AC Tan
Aliki Xochelli
Anastasia Hadzidimitriou
Andreas Agathangelidis
BT Messmer
D Scaviner
DG Schatz
EM Conlon
F Ghiotto
F Murray
F Wang
FW Alt
G Yaari
Grigorios Tsoumakas
HC Liu
I Fishel
Ioanna Chouvarda
Ioannis Kavakiotis
Ioannis Vlahavas
JM Bischof
K Stamatopoulos
Kostas Stamatopoulos
LA Sutton
M-P Lefranc
MP Lefranc
N Darzentas
Nicos Maglaveras
P Baliakas
RN Damle
RP DeConde
RW Maul
S Lin
SH Kleinstein
TJ Hamblin
V Giudicelli
X Brochet
Z Xu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and their Roles in Complex Disease

Author: Jay Jeremy J.
Publication venue: DigitalCommons@UMaine
Publication date: 01/12/2013

Vast amounts of biomedical associations are easily accessible in public resources, spanning gene-disease associations, tissue-specific gene expression, gene function and pathway annotations, and many other data types. Despite this mass of data, information most relevant to the study of a particular disease remains loosely coupled and difficult to incorporate into ongoing research. Current public databases are difficult to navigate and do not interoperate well due to the plethora of interfaces and varying biomedical concept identifiers used. Because no coherent display of data within a specific problem domain is available, finding the latent relationships associated with a disease of interest is impractical. This research describes a method for extracting the contextual relationships embedded within associations relevant to a disease of interest. After applying the method to a small test data set, a large-scale integrated association network is constructed for application of a network propagation technique that helps uncover more distant latent relationships. Together these methods are adept at uncovering highly relevant relationships without any a priori knowledge of the disease of interest. The combined contextual search and relevance methods power a tool which makes pertinent biomedical associations easier to find, easier to assimilate into ongoing work, and more prominent than currently available databases. Increasing the accessibility of current information is an important component to understanding high-throughput experimental results and surviving the data deluge

University of Maine