Large-scale automated protein function prediction
Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything it does or that happens to it. The Gene Ontology (GO) is a structured vocabulary that captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years, scientists have been able to annotate a large number of proteins with GO categories that reflect their functionality. However, experimentally determining protein function is a highly resource-intensive task, and a large fraction of proteins remain unannotated. Recently, a plethora of automated methods have emerged, and their reasonable success in computationally determining protein function from a variety of data sources (sequence or structure similarity, or various kinds of biological network data) has established automated function prediction (AFP) as an important problem in bioinformatics.

In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. However, because annotations accumulate over time, we identify AFP as a combination of two sub-tasks: making predictions on previously annotated proteins and making predictions on previously unannotated proteins. In our first project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method that our lab previously developed, and two other popular methods, binary SVMs and guilt by association, find it hard to achieve on these two tasks the level of accuracy suggested by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We develop GOstruct 2.0 by proposing improvements that allow the model to use a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements.

Although the biomedical literature is a very informative resource for identifying protein function, most AFP methods do not take advantage of the large amount of information it contains. In our second project, we conduct the first comprehensive evaluation of the effectiveness of literature data for AFP. Specifically, we extract co-mentions of protein-GO term pairs and bag-of-words features from the literature and explore their effectiveness in predicting protein function. Our results show that literature features are very informative of protein function, though with room for improvement. To improve the quality of automatically extracted co-mentions, we formulate the classification of co-mentions as a supervised learning problem and propose a novel method based on graph kernels. Experimental results indicate the feasibility of using this co-mention classifier as a complementary tool that aids the bio-curators responsible for maintaining databases such as the Gene Ontology. This is the first study of the problem of protein-function relation extraction from biomedical text.
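The co-mention features at the heart of the second project lend themselves to a short sketch. Below is a minimal, hypothetical example of extracting protein-GO term co-mentions by plain dictionary lookup; the protein names, GO terms, and sentence are invented, and the actual pipeline described above uses far more sophisticated extraction plus a graph-kernel classifier.

```python
# Minimal sketch of protein-GO co-mention extraction (hypothetical
# dictionaries and sentence; the thesis pipeline is more sophisticated
# and feeds a graph-kernel co-mention classifier).

PROTEIN_NAMES = {"CDC28", "TP53"}                      # assumed lookup table
GO_TERMS = {"GO:0004672": "protein kinase activity",   # assumed GO subset
            "GO:0006915": "apoptotic process"}

def extract_comentions(sentence):
    """Return (protein, GO id) pairs whose names co-occur in a sentence."""
    found_proteins = [p for p in PROTEIN_NAMES if p in sentence]
    found_terms = [go_id for go_id, name in GO_TERMS.items()
                   if name in sentence.lower()]
    return [(p, t) for p in found_proteins for t in found_terms]

sentence = "CDC28 exhibits protein kinase activity during mitosis."
print(extract_comentions(sentence))   # [('CDC28', 'GO:0004672')]
```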
The recently developed Human Phenotype Ontology (HPO), which is very similar to GO, is a standardized vocabulary for describing the phenotypic abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations, but researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. It is therefore important to predict gene-HPO term associations using accurate computational methods. In our third project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature-mining data.
Biomedical entities recognition in Spanish combining word embeddings
Named Entity Recognition (NER) is an important task in the field of Natural Language Processing that is
used to extract meaningful knowledge from textual documents. The goal of NER is to identify text
fragments that refer to specific entities.
In this thesis we aim to address the task of NER in the Spanish biomedical domain. In this domain
entities can refer to drug, symptom and disease names and offer valuable knowledge to health experts.
For this purpose, we propose a model based on neural networks and employ a combination of word
embeddings. In addition, we generate new domain- and language-specific embeddings to test their
effectiveness. Finally, we show that the combination of different word embeddings as input to the neural
network improves state-of-the-art results in the applied scenarios.
Thesis, Univ. Jaén, Departamento de Informática. Defended 22 April 2021.
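As an illustration of the embedding-combination idea, the following is a minimal PyTorch sketch, not the thesis model: two embedding tables stand in for a general-domain and a domain-specific pre-trained space, their vectors are concatenated per token, and a BiLSTM produces per-token tag scores. All dimensions, the vocabulary size, and the tag count are invented.

```python
import torch
import torch.nn as nn

# Sketch of combining word embeddings by concatenation before a BiLSTM
# tagger (illustrative dimensions; the thesis combines general and
# domain-specific Spanish embeddings).

class CombinedEmbeddingTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, dim_general=300, dim_domain=200):
        super().__init__()
        # two embedding tables stand in for two pre-trained spaces
        self.general = nn.Embedding(vocab_size, dim_general)
        self.domain = nn.Embedding(vocab_size, dim_domain)
        self.bilstm = nn.LSTM(dim_general + dim_domain, 128,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 128, n_tags)

    def forward(self, token_ids):
        # concatenate the two embedding spaces per token
        x = torch.cat([self.general(token_ids), self.domain(token_ids)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)        # per-token tag scores (e.g., BIO labels)

model = CombinedEmbeddingTagger(vocab_size=10000, n_tags=7)
scores = model(torch.randint(0, 10000, (2, 12)))   # batch of 2 sentences
print(scores.shape)                                 # torch.Size([2, 12, 7])
```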
Cell Type Classification Via Deep Learning On Single-Cell Gene Expression Data
Single-cell sequencing is a recently developed, revolutionary technology that enables researchers to obtain genomic, transcriptomic, or multi-omics information through gene expression analysis. Compared to traditional sequencing methods, it offers the advantage of analyzing highly heterogeneous cell-type information, and it is gaining popularity in the biomedical area. This analysis can also aid early diagnosis and drug development for tumors and cancer cell types. In the gene expression profiling workflow, identifying cell types is an important task, but it faces many challenges, including the curse of dimensionality, sparsity, batch effects, and overfitting. These challenges can be mitigated by feature selection, which picks out the most relevant features and reduces the feature dimension. In this work, a recurrent neural network-based feature selection model is proposed to extract relevant features from high-dimensional, low-sample-size data, and a deep learning-based gene embedding model is proposed to reduce the sparsity of single-cell data for cell type identification. The proposed frameworks have been implemented with different recurrent neural network architectures and evaluated on real-world microarray datasets and single-cell RNA-seq data, where they outperform other feature selection models.

A semi-supervised model is also implemented using the same gene embedding workflow, since labeling data is cumbersome, time-consuming, and requires manual effort and domain expertise. Different ratios of labeled data are therefore used in the experiments to validate the concept. Experimental results show that the proposed semi-supervised approach performs encouragingly even with a limited amount of labeled data. In addition, a graph attention-based autoencoder model is studied that learns latent features by incorporating prior knowledge into gene expression data for cell type classification.
Index Terms — Single-Cell Gene Expression Data, Gene Embedding, Semi-Supervised Model, Prior Knowledge Incorporation, Gene-Gene Interaction Network, Deep Learning, Graph Autoencoder
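As a rough illustration of the gene-embedding idea in the abstract above, here is a minimal autoencoder sketch that maps sparse expression profiles to a dense low-dimensional space. The dimensions are invented, and the thesis models use recurrent and graph-attention architectures rather than this plain multilayer perceptron.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch for embedding sparse single-cell expression
# profiles into a dense low-dimensional space (illustrative dimensions;
# the thesis uses recurrent and graph-attention architectures).

class GeneEmbedder(nn.Module):
    def __init__(self, n_genes=2000, dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(),
                                     nn.Linear(256, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_genes))

    def forward(self, x):
        z = self.encoder(x)          # dense cell embedding
        return self.decoder(z), z

model = GeneEmbedder()
cells = torch.rand(64, 2000)         # toy batch of expression profiles
recon, embedding = model(cells)
loss = nn.functional.mse_loss(recon, cells)   # reconstruction objective
print(embedding.shape)               # torch.Size([64, 32])
```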
Representation learning on relational data
Humans use information about relationships or interactions between objects to orient themselves in various situations. For example, we trust recommendations from our circle of friends, become friends with people with whom we already share friends, or adapt our opinions as a result of interactions with other people.
In many Machine Learning applications, we also know about relationships, which carry essential information for the use case.
Recommendations in social media, scene understanding in computer vision, and traffic prediction are a few examples where relationships play a crucial role in the application.
In this thesis, we introduce methods that take relationships into account and demonstrate their benefits for various problems.
A large number of problems in which relationship information plays a central role can be approached by modeling the data as a graph and formulating the task as a prediction problem on that graph.
In the first part of the thesis, we tackle the problem of node classification from various directions. We start with unsupervised learning approaches, which differ in the assumptions they make about the meaning of relationships in the graph.
For some applications, such as social networks, it is a feasible assumption that densely connected nodes are similar. On the other hand, if we want to predict passenger traffic for an airport based on its flight connections, similar nodes are not necessarily positioned close to each other in the graph and are more likely to have comparable neighborhood patterns.
Furthermore, we introduce novel methods for classification and regression in a semi-supervised setting, where labels of interest are known for a fraction of the nodes. We use the known prediction targets and information about how nodes connect to learn the meaning of relationships and their effect on the final prediction.
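As a concrete baseline for the semi-supervised setting just described, the following is a minimal sketch of classical label propagation on a graph. The adjacency matrix and labels are toy values, and the methods introduced in the thesis are learned variants rather than this fixed scheme.

```python
import numpy as np

# Sketch of semi-supervised node classification by label propagation
# (a classical baseline; the thesis introduces learned variants that
# infer the meaning of relationships from the known labels).

def propagate_labels(adj, labels, n_classes, n_iter=50):
    """adj: (n, n) adjacency matrix; labels: -1 marks unlabeled nodes."""
    n = adj.shape[0]
    # row-normalize the adjacency so each step averages over neighbors
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    walk = adj / deg
    y = np.zeros((n, n_classes))
    known = labels >= 0
    y[known, labels[known]] = 1.0
    for _ in range(n_iter):
        y = walk @ y
        y[known] = 0.0
        y[known, labels[known]] = 1.0     # clamp the known labels
    return y.argmax(axis=1)

adj = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]])
labels = np.array([0, -1, -1, 1])          # two labeled, two unlabeled
print(propagate_labels(adj, labels, n_classes=2))
```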
In the second part of the thesis, we deal with the problem of graph matching. Our first use-case is the alignment of different geographical maps, where the focus lies on the real-life setting. We introduce a robust method that can learn to ignore the noise in the data.
Next, our focus moves to the field of Entity Alignment in different Knowledge Graphs.
We analyze the process of manual data annotation and propose a setting and algorithms to accelerate this labor-intensive process.
Furthermore, we point out several shortcomings in the empirical evaluations, make suggestions on how to improve them, and extensively analyze existing approaches for the task.
The next part of the thesis is dedicated to the research direction dealing with automatic extraction and search of arguments, known as Argument Mining. We propose a novel approach for identifying arguments and demonstrate how it can make use of relational information. We apply our method to identify arguments in peer-reviews for scientific publications and show that arguments are essential for the decision process. Furthermore, we address the problem of argument search and introduce a novel approach that retrieves relevant and original arguments for the user's queries.
Finally, we propose an approach for subspace clustering, which can deal with large datasets and assign new objects to the found clusters. Our method learns the relationships between objects and performs the clustering on the resulting graph.
Few-shot Claim Verification for Automated Fact Checking
In an era characterized by the rapid expansion of online information and the widespread dissemination of misinformation, automated fact-checking has emerged as an essential area of research. As digital platforms continue to proliferate, the necessity for accurate and efficient fact-checking mechanisms is attracting increasing interest. Automated fact-checking systems address two main tasks: claim detection and claim validation. Claim detection involves identifying sentences or text snippets containing assertions potentially subject to fact-checking. Claim validation, a multifaceted endeavor, encompasses evidence retrieval and claim verification. During evidence retrieval, relevant information that may support or refute a given claim is obtained. Claim verification, on the other hand, entails assessing the veracity of a claim by comparing it against the available evidence. Typically framed as a natural language inference (NLI) problem, claim verification requires the model to determine whether a claim is supported, refuted, or whether there is not enough information to reach a verdict.

In this thesis, we explore challenges inherent in claim verification, with a focus on few-shot scenarios where limited labeled data and computational resources pose significant constraints. We introduce three methods tailored to these challenges: Semantic Embedding Element-wise Difference (SEED), Micro Analysis of Pairwise Language Evolution (MAPLE), and Active learning with Pattern Exploiting Training models (Active PETs). SEED, a novel vector-based approach, leverages semantic differences in claim-evidence pairs to perform claim verification in few-shot scenarios. By creating class-representative vectors, SEED enables efficient claim verification even with limited training data, and comparative evaluations against previous state-of-the-art methods demonstrate consistent improvements in few-shot settings. MAPLE harnesses a small seq2seq model and a novel semantic measure to explore the alignment between claims and evidence; using micro analysis of pairwise language evolution, it achieves significant performance improvements over state-of-the-art baselines across multiple automated fact-checking datasets. Active PETs presents an ensemble-based active learning approach for prioritizing data annotation in few-shot claim verification. By utilizing an ensemble of Pattern Exploiting Training (PET) models based on various pre-trained language models, Active PETs effectively selects unlabelled data for annotation, consistently outperforming baseline active learning methods; its integrated oversampling strategy further enhances performance, demonstrating the potential of active learning techniques in optimizing claim verification workflows.

Together, these methods represent significant advancements in claim verification research, offering scalable and practical solutions. Through extensive experimentation and comparative analysis, this thesis evaluates the effectiveness of each method on various dataset configurations and provides insights into their strengths and weaknesses. By identifying potential extensions and areas for refinement, it also lays the groundwork for future research in this critical field of artificial intelligence.
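The SEED idea lends itself to a compact sketch. In the following minimal illustration, `encode` is a placeholder for a pre-trained sentence encoder (here just a deterministic pseudo-random vector per text), and the labels and example texts are invented; the published method builds on real sentence encoders.

```python
import numpy as np

# Minimal sketch of SEED-style few-shot claim verification: represent a
# claim-evidence pair by the element-wise difference of sentence
# embeddings, average those differences per class, and label new pairs
# by the nearest class-representative vector (cosine similarity).

def encode(text, dim=384):
    """Placeholder for a pre-trained sentence encoder (assumption)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def pair_vector(claim, evidence):
    return encode(claim) - encode(evidence)   # element-wise difference

def class_representatives(labeled_pairs):
    """labeled_pairs: iterable of (claim, evidence, label) triples."""
    buckets = {}
    for claim, evidence, label in labeled_pairs:
        buckets.setdefault(label, []).append(pair_vector(claim, evidence))
    return {label: np.mean(vecs, axis=0) for label, vecs in buckets.items()}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(claim, evidence, reps):
    v = pair_vector(claim, evidence)
    return max(reps, key=lambda label: cosine(v, reps[label]))

reps = class_representatives([
    ("the earth is round", "satellite imagery shows a sphere", "SUPPORTED"),
    ("the earth is flat", "satellite imagery shows a sphere", "REFUTED"),
])
print(verify("the moon is round", "telescope images show a sphere", reps))
```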
Computational Argumentation Approaches to Improve Sensemaking and Evidence-based Reasoning in Online Deliberation Systems
Deliberation is the process by which communities identify potential solutions to a problem and, through dialogic communication, select the solution that most effectively meets their diverse requirements. Online deliberation is nowadays implemented by means of social media and online discussion platforms; however, these media present significant challenges that can be traced to inadequate support for Sensemaking processes and poor endorsement of the quality characteristics of deliberation.
This thesis investigates the integration of computational argumentation methods into online deliberation platforms as an effective way to improve participants' perception of the quality of the deliberation process, their ability to make sense of the overall process, and the health of the resulting social dynamics.
For that, two computational artefacts are proposed: (i) a Synoptical summariser of long discussions and (ii) a Scientific Argument Recommender System (SciArgRecSys).
The two artefacts are designed and developed with state-of-the-art methods (including Large Language Models, LLMs) and evaluated both intrinsically and extrinsically when deployed on a real, live platform (BCause).
Through extensive evaluation, the positive effect of both artefacts on human Sensemaking and on essential quality characteristics of deliberation, such as reciprocal Engagement, Mutual Understanding, and Social dynamics, is illustrated. In addition, these interventions are shown to effectively reduce polarisation and the formation of sub-communities while significantly enhancing the quality of the discussion by making it more coherent and diverse.
Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple <tsunami, Hawaii> from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than what is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples, as well as information extraction-specific approaches. Using these estimates, we can identify the collections worth processing and ignore the rest, for efficiency. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences (modeled as groups of words) that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.
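To make the adaptive document ranking idea concrete, here is a hedged sketch in which a simple probabilistic classifier stands in for the dissertation's learning-to-rank machinery; the documents and outcomes are invented, and in the real system the ranker would be re-fitted periodically as extraction progresses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sketch of adaptive document ranking for extraction: score documents by
# the predicted probability of yielding tuples and process the best ones
# first, re-fitting as extraction outcomes accumulate. A probabilistic
# classifier stands in for learning-to-rank here.

def rank_documents(docs, labeled_docs, labels):
    """labeled_docs/labels: documents already processed, with 1 if they
    produced tuples and 0 otherwise."""
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(labeled_docs)
    ranker = LogisticRegression().fit(X_train, labels)
    scores = ranker.predict_proba(vec.transform(docs))[:, 1]
    return sorted(zip(docs, scores), key=lambda pair: -pair[1])

seen = ["a tsunami swept the coast", "the stock market fell today"]
outcomes = [1, 0]                       # only the first yielded a tuple
unseen = ["an earthquake hit the coast of Japan", "quarterly earnings rose"]
for doc, score in rank_documents(unseen, seen, outcomes):
    print(f"{score:.2f}  {doc}")
```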
In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability
Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advancements in LBD research, previous studies contain several open problems and shortcomings that are hindering its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can further strengthen the existing understanding of the LBD workflow. In accordance with this goal, the thesis aims to enhance the LBD workflow with a view to ensuring its widespread applicability.

The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas that are closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD workflow to problems outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora, is valid across a vast range of problem settings.

Secondly, the idea of widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These 'new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature has engendered huge challenges for academics, irrespective of their domain.

This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the first systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD workflow, focusing on their role and potential impact on the information retrieval cycle of LBD.
The discovery component section aims to intermingle two research directions that have been under-investigated in the LBD literature, 'modern word embedding techniques' and the 'temporal dimension', by proposing diachronic semantic inferences. Their potential positive influence on knowledge discovery is verified through both direct and indirect uses.
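A minimal sketch of a diachronic inference of this kind follows, using toy corpora loosely modeled on Swanson's classic fish-oil and Raynaud's syndrome discovery: embeddings are trained separately on an early and a late time slice, and a term's nearest neighbours are compared across the two periods. The corpora and parameters are invented; the thesis methods are considerably richer.

```python
from gensim.models import Word2Vec

# Sketch of a diachronic semantic inference: train word embeddings on two
# time slices of a corpus and compare a term's nearest neighbours across
# periods to surface emerging linkages. Corpora here are toy stand-ins.

early = [["fish", "oil", "dietary", "supplement"],
         ["raynaud", "disease", "blood", "viscosity"]]
late = [["fish", "oil", "raynaud", "treatment"],
        ["blood", "viscosity", "fish", "oil"]]

def neighbours(corpus, term, topn=3):
    model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1, epochs=50)
    return [w for w, _ in model.wv.most_similar(term, topn=topn)]

print("early:", neighbours(early, "fish"))
print("late: ", neighbours(late, "fish"))
```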
The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-efficient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities.

Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD workflow. The thesis offers new insights which future LBD research could further explore and expand to create more efficient, widely applicable LBD models to enable broader community benefits.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
Knowledge Modelling and Learning through Cognitive Networks
One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning, and, more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, which are obtained after training over labelled data and whose interpretation is not immediately evident to the experimenter. This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into the mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot.
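As a small illustration of the kind of analysis the book collects, the following sketch builds a toy semantic network of free-association edges and computes a few of the structural measures that cognitive network science relates to memory search and learning. The edges are invented; real studies use large normed association datasets.

```python
import networkx as nx

# Toy semantic network of free-association edges, with structural
# measures often related to recall, creativity, and memory search.

G = nx.Graph()
G.add_edges_from([("dog", "cat"), ("dog", "bone"), ("cat", "mouse"),
                  ("mouse", "cheese"), ("bone", "skeleton")])

print(nx.degree_centrality(G))               # locally well-connected concepts
print(nx.average_clustering(G))              # tightly knit neighbourhoods
print(nx.shortest_path(G, "dog", "cheese"))  # associative path in recall
```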