
97 research outputs found

Gene function finding through cross-organism ensemble learning

Author: Masseroli M.
Moro G.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and are sometimes incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from the GO annotations of an evolutionarily related, better-studied source organism. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum.
The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, focusing it on the prioritized new annotations predicted, and to complement the known annotations available.
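
The perturbation step described in the abstract above can be illustrated with a minimal sketch. This is not GeFF's implementation: the function name `perturb_annotations`, the toy gene-by-term matrix, the `fraction` value and the fixed `seed` are all assumptions made for illustration, and the supervised learner that would be trained on the resulting pairs is left out.

```python
import random

def perturb_annotations(annotations, fraction, seed=0):
    """Copy a binary gene-by-term matrix and set a random fraction of its 1s to 0.

    Hypothetical helper: returns the reduced matrix and the deleted (gene, term) cells.
    """
    rng = random.Random(seed)
    # Every known annotation is a cell holding 1
    known = [(g, t) for g, row in enumerate(annotations)
             for t, value in enumerate(row) if value == 1]
    n_delete = int(len(known) * fraction)
    deleted = rng.sample(known, n_delete)
    reduced = [row[:] for row in annotations]  # leave the input untouched
    for g, t in deleted:
        reduced[g][t] = 0
    return reduced, deleted

# Toy data (assumed): 3 genes x 4 GO terms, 8 known annotations
original = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
reduced, deleted = perturb_annotations(original, fraction=0.25)

# Features come from the reduced matrix, targets from the original one,
# so a model trained on these pairs learns to rebuild deleted annotations.
training_pairs = list(zip(reduced, original))
```

A model fitted on such pairs can then be applied to the unperturbed annotations, where any term it scores highly but that is not yet annotated becomes a candidate new annotation.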

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

Author: Domeniconi Giacomo <1986>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 12/05/2016

In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities, and so on. Nowadays, plenty of Data Mining and Machine Learning techniques are proposed in the literature, and they have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. However, some challenging issues remain when tackling a new problem: How should the problem be represented? Which approach is best among the huge quantity of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes some possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using an instance related to each object (a document, a gene, a social post, etc.) to be classified, it is proposed to use a pair of objects or an object-class pair, using the relationship between them as the label. The application of this approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and also to classify the biomedical literature based on the genomic features treated.
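
The object-class pair representation mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: `to_pair_instances`, the toy documents and the class names are hypothetical, and the features that would describe each pair in the dissertation are not shown.

```python
def to_pair_instances(objects, labels, classes):
    """Turn (object, class label) data into (object, class) pairs with a binary label.

    Hypothetical helper: the pair label is 1 when the object belongs to the class.
    """
    pairs = []
    for obj, label in zip(objects, labels):
        for cls in classes:
            pairs.append(((obj, cls), int(label == cls)))
    return pairs

# Toy data (assumed)
documents = ["gene expression in yeast", "stock prices fall"]
labels = ["biology", "finance"]
classes = ["biology", "finance", "sports"]

pair_instances = to_pair_instances(documents, labels, classes)
```

Because the learner sees pairs instead of fixed class indices, a new category only requires generating pairs that include it, which is in line with the abstract's remark about adding categories during classification.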

AMS Tesi di Dottorato

Computational Toxinology

Author: Romano Joseph Daniel
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019

Venoms are complex mixtures of biological macromolecules and other compounds that are used for predatory and defensive purposes by hundreds of thousands of known species worldwide. Throughout human history, venoms and venom components have been used to treat a vast array of illnesses, making them of great clinical, economic, and academic interest to the drug discovery and toxinology communities. In spite of major computational advances that facilitate data-driven drug discovery, most therapeutic venom effects are still discovered via tedious trial-and-error, or simply by accident. In this dissertation, I describe a body of work that aims to establish a new subdiscipline of translational bioinformatics, which I name “computational toxinology”. To accomplish this goal, I present three integrated components that span a wide range of informatics techniques: (1) VenomKB, (2) VenomSeq, and (3) VenomKB’s Semantic API. To provide a platform for structuring, representing, retrieving, and integrating venom data relevant to drug discovery, VenomKB provides a database-backed web application and knowledge base for computational toxinology. VenomKB is structured according to a fully featured ontology of venoms, and provides data aggregated from many popular web resources. VenomSeq is a biotechnology workflow that is designed to generate new high-throughput sequencing data for incorporation into VenomKB. Specifically, we expose human cells to controlled doses of crude venoms, conduct RNA-Sequencing, and build profiles of differential gene expression, which we then compare to publicly available differential expression data for known diseases and drugs with known effects, and use those comparisons to hypothesize ways that the venoms could act in a therapeutic manner. These data are then integrated into VenomKB, where they can be effectively retrieved and evaluated using existing data and known therapeutic associations.
VenomKB’s Semantic API further develops this functionality by providing an intelligent, powerful, and user-friendly interface for querying the complex underlying data in VenomKB in a way that reflects the intuitive, human-understandable meaning of those data. The Semantic API is designed to cater to the needs of advanced users as well as laypersons and bench scientists without previous expertise in computational biology and semantic data analysis. In each chapter of the dissertation, I describe how we evaluated these three components through various approaches. We demonstrate the utility of VenomKB and the Semantic API by testing a number of practical use-cases for each, designed to highlight their ability to rediscover existing knowledge as well as to suggest potential areas for future exploration. We use statistics and data science techniques to evaluate VenomSeq on 25 diverse species of venomous animals, and propose biologically feasible explanations for significant findings. In evaluating the Semantic API, I show how observations on VenomSeq data can be interpreted and placed into the context of past research by members of the larger toxinology community. Computational toxinology is a toolbox designed to be used by multiple stakeholders (toxinologists, computational biologists, and systems pharmacologists, among others) to improve the return rate of clinically significant findings from manual experimentation. It aims to achieve this goal by enabling access to data, providing means for easy validation of results, and suggesting specific hypotheses that are preliminarily supported by rigorous inferential statistics. All components of the research I describe are open-access and publicly available, to improve reproducibility and encourage widespread adoption.

Columbia University Academic Commons

An ontological approach to quantify distance between hereditary disease modules on the interactome.

Author: Caniza Vierci Horacio
Publication date: 01/02/2016

Royal Holloway - Pure

GOTA: GO term annotation of biomedical literature

Author: A Doms
A Schlicker
A Singhal
C Blaschke
D Li
DL Rubin
G Salton
Giacomo Domeniconi
Gianluca Moro
J Gobeill
J Lomax
J Rousu
K Verspoor
L Du Plessis
L Hirschman
Luciano Margara
M Ashburner
MF Porter
N Cesa-Bianchi
N Skunca
NR Silla
NS Altman
P Radivojac
Pietro Di Lena
SE Lewis
T Liu
TH Wonnacott
Y Mao
Y Tao
Z Barutcuoglu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Enhancing Biomedical Scientific Reviews Summarization with Graph-based Factual Evidence Extracted from Papers

Author: Francesco Boschi
Giacomo Frisoni
Gianluca Moro
Paolo Italiani
Publication venue: 'Scitepress'
Publication date: 01/01/2022

Combining structured knowledge and neural language models to tackle natural language processing tasks is a recent research trend that catalyzes community attention. This integration holds a lot of potential in document summarization, especially in the biomedical domain, where the jargon and the complex facts make the overarching information truly hard to interpret. In this context, graph construction via semantic parsing plays a crucial role in unambiguously capturing the most relevant parts of a document. However, current works are limited to extracting open-domain triples, failing to model real-world n-ary and nested biomedical interactions accurately. To alleviate this issue, we present EASumm, the first framework for biomedical abstractive summarization enhanced by event graph extraction (i.e., graphical representations of medical evidence learned from scientific text), relying on dual text-graph encoders. Extensive evaluations on the CDSR dataset corroborate the importance of explicit event structures, with better or comparable performance than previous state-of-the-art systems. Finally, we offer some hints to guide future research in the field

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature

Author: Moro Gianluca
Ragazzi Luca
Valgimigli Lorenzo
Publication venue: Stroudsburg, PA 18360
Publication date: 01/01/2022

Although current state-of-the-art Transformer-based solutions have succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potentially summary-relevant content, which is unacceptable in the medical domain, where every piece of information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results show that we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and show the importance of each module of our method.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Computational Proteomics Using Network-Based Strategies

Author: Goh Wen
Publication venue: Computing, Imperial College London
Publication date: 01/03/2014

This thesis examines the productive application of networks to proteomics, with a specific biological focus on liver cancer. Contemporary (shotgun) proteomics is plagued by coverage and consistency issues. These can be resolved via network-based approaches. The application of three classes of network-based approaches is examined: a traditional cluster-based approach termed Proteomics Expansion Pipeline (PEP), a generalization of PEP termed Maxlink, and a feature-based approach termed Proteomics Signature Profiling (PSP). PEP is an improvement on prevailing cluster-based approaches. It uses a state-of-the-art cluster identification algorithm as well as network-cleaning approaches to identify the critical network regions indicated by the liver cancer data set. The top PARP1-associated cluster was identified and independently validated. Maxlink allows identification of undetected proteins based on the number of links to identified differential proteins. It is more sensitive than PEP due to more relaxed requirements. Here, the novel roles of ARRB1/2 and ACTB are identified and discussed in the context of liver cancer. Both PEP and Maxlink are unable to deal with consistency issues. PSP is the first method able to deal with both coverage and consistency issues, and is termed feature-based since the network-based clusters it uses are predicted independently of the data. It is also capable of using real complexes or predicted pathway subnets. By combining pathways and complexes, a novel basis of liver cancer progression implicating nucleotide pool imbalance aggravated by mutations of key DNA repair complexes was identified. Finally, comparative evaluations suggested that pure network-based methods are vastly outperformed by feature-based network methods utilizing real complexes. This indicates that the quality of current networks is insufficient to provide strong biological rigor for data analysis, and that they should be carefully evaluated before further validations. Open Access
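
The link-counting idea attributed to Maxlink in the abstract above can be sketched in a few lines. This is a simplified illustration, not the thesis's implementation: the function `maxlink_candidates`, the `min_links` threshold and the toy network with placeholder protein names are assumptions, and any filtering or normalization the actual method applies is omitted.

```python
def maxlink_candidates(network, differential, min_links=2):
    """Rank proteins outside `differential` by their number of differential neighbours.

    `network` maps each protein to its interaction partners (hypothetical format).
    """
    scores = {}
    for protein, neighbours in network.items():
        if protein in differential:
            continue  # only undetected proteins are candidates
        links = len(set(neighbours) & differential)
        if links >= min_links:
            scores[protein] = links
    # Most links first, ties broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy interaction network (assumed); A, B and C were identified as differential
network = {
    "A": ["B", "C", "X"],
    "B": ["A", "X", "Y"],
    "C": ["A", "X"],
    "X": ["A", "B", "C"],
    "Y": ["B", "Z"],
    "Z": ["Y"],
}
differential = {"A", "B", "C"}

candidates = maxlink_candidates(network, differential)
```

Here X links to all three differential proteins and is kept, while Y has a single link and falls below the assumed threshold. Lowering `min_links` relaxes the requirement and admits more candidates.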

Spiral - Imperial College Digital Repository

Functional profiling of genome-scale experiments: new approaches leading to a systemic analysis

Author: Mínguez Paniagua Pablo
Publication date: 01/01/2008

Unpublished doctoral thesis. Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular. Date of defense: 31-10-200

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

From condition-specific interactions towards the differential complexome of proteins

Author: Will Thorsten
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2020

While capturing the transcriptomic state of a cell is a comparatively simple effort with modern sequencing techniques, mapping protein interactomes and complexomes in a sample-specific manner is currently not feasible on a large scale. To understand crucial biological processes, however, knowledge of the physical interplay between proteins can be more interesting than their mere expression. In this thesis, we present and demonstrate four software tools that unlock the cellular wiring in a condition-specific manner and promise a deeper understanding of what happens upon cell fate transitions. PPIXpress makes it possible to exploit the abundance of existing expression data to generate specific interactomes, which can even consider alternative splicing events when protein isoforms can be related to the presence of causative protein domain interactions of an underlying model. As an addition to this work, we developed the convenient differential analysis tool PPICompare to determine rewiring events and their causes within the inferred interaction networks between grouped samples. Furthermore, we present a new implementation of the combinatorial protein complex prediction algorithm DACO that features a significantly reduced runtime. This improvement facilitates the application of the method to a large number of samples, and the resulting sample-specific complexes can ultimately be assessed quantitatively with our novel differential protein complex analysis tool CompleXChange.

Universaar
