603 research outputs found

    Generating Effective Sentence Representations: Deep Learning and Reinforcement Learning Approaches

    Natural language processing (NLP) is one of the most important technologies of the information age. Understanding complex language utterances is also a crucial part of artificial intelligence. Many natural language applications are powered by machine learning models performing a large variety of underlying tasks. Recently, deep learning approaches have obtained very high performance across many NLP tasks. To achieve this level of performance, it is crucial for computers to have an appropriate representation of sentences. The tasks addressed in the thesis are best approached with shallow semantic representations. These representations are vectors embedded in a semantic space. We present a variety of novel deep learning approaches applied to NLP for generating effective sentence representations in this space. These semantic representations can be either general or task-specific. We focus on learning task-specific sentence representations for tasks that often overlap considerably. We design a set of general-purpose and task-specific sentence encoders combining both word-level semantic knowledge and word- and sentence-level syntactic information. For the former, we perform an intelligent amalgamation of word vectors using modern deep learning modules. For the latter, we use word-level knowledge, such as parts of speech, spelling, and suffix features, and sentence-level information drawn from natural language parse trees, which provide the hierarchical structure of a sentence together with the grammatical relations between the words. Further expertise is added with reinforcement learning, which guides a machine learning model through a reward-penalty game. Rather than striving only for good performance, we always try to design models that are more transparent and explainable. We provide an intuitive explanation of the design of each model and of how the model makes a decision.
Our extensive experiments show that these models achieve competitive performance compared with the currently available state-of-the-art generalized and task-specific sentence encoders. All but one of the tasks dealt with English-language texts. The multilingual semantic similarity task required creating a multilingual corpus, for which we provide a novel semi-supervised approach that creates artificial negative samples when only positive samples are available.

    Gene Regulatory Networks: Modeling, Intervention and Context

    Biological systems are complex in many dimensions as endless transportation and communication networks all function simultaneously. Our ability to intervene within both healthy and diseased systems is tied directly to our ability to understand and model core functionality. Progress in increasingly accurate and thorough high-throughput measurement technologies has provided a deluge of data from which we may attempt to infer a representation of the true genetic regulatory system. A gene regulatory network model, if accurate enough, may allow us to perform hypothesis testing in the form of computational experiments. Of great importance to modeling accuracy is the acknowledgment of biological contexts within the models, i.e. recognizing the heterogeneous nature of the true biological system and the data it generates. This marriage of engineering, mathematics and computer science with systems biology creates a cycle of progress between computer simulation and lab experimentation, rapidly translating interventions and treatments for patients from the bench to the bedside. This dissertation will first discuss the landscape for modeling the biological system, explore the identification of targets for intervention in Boolean network models of biological interactions, and explore context specificity both in new graphical depictions of models embodying context-specific genomic regulation and in novel analysis approaches designed to reveal embedded contextual information. Overall, the dissertation will explore a spectrum of biological modeling with the goal of therapeutic intervention, using both formal and informal notions of biological context, in a way that will enable future work to have an even greater impact in terms of direct patient benefit on an individualized level. Dissertation/Thesis, Ph.D. Computer Science 201

    Text Mining for Pathway Curation

    Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, three curators found 55.6% to 79.6% of PEDL+ extractions useful for their work. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension.
Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.

    Sparks of Large Audio Models: A Survey and Outlook

    This survey paper provides a comprehensive overview of recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and wide range of sources (from human voices to musical instruments and environmental sounds), poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, Foundational Audio Models such as SeamlessM4T have recently started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to cope with the rapid development in this area, we will continually update the repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models. Comment: work in progress, Repo URL: https://github.com/EmulationAI/awesome-large-audio-model

    2006 Abstract Booklet

    Complete Schedule of Events for the 8th Annual Undergraduate Research Conference at Minnesota State University, Mankato

    Discovering lesser known molecular players and mechanistic patterns in Alzheimer's disease using an integrative disease modelling approach

    The convergence of exponentially advancing technologies is driving medical research towards life-changing discoveries. In contrast, repeated failures of high-profile drugs to combat Alzheimer's disease (AD) have made it one of the least successful therapeutic areas. This pattern of failure has provoked researchers to grapple with their beliefs about Alzheimer's aetiology. The growing realisation that Amyloid-β and tau are not 'the' but rather 'one of the' factors necessitates the reassessment of pre-existing data to add new perspectives. To enable a holistic view of the disease, integrative modelling approaches are emerging as a powerful technique. Combining data at different scales and modes could considerably increase the predictive power of the integrative model by filling biological knowledge gaps. However, the reliability of the derived hypotheses largely depends on the completeness, quality, consistency, and context-specificity of the data. Thus, there is a need for agile methods and approaches that efficiently interrogate and utilise existing public data. This thesis presents the development of novel approaches and methods that address intrinsic issues of data integration and analysis in AD research. It aims to prioritise lesser-known AD candidates using highly curated and precise knowledge derived from integrated data. Much of the emphasis is put on quality, reliability, and context-specificity. This thesis showcases the benefit of integrating well-curated and disease-specific heterogeneous data in a semantic web-based framework for mining actionable knowledge. Furthermore, it introduces the challenges encountered while harvesting information from literature and transcriptomic resources. A state-of-the-art text-mining methodology is developed to extract miRNAs and their regulatory roles in diseases and genes from the biomedical literature.
To enable meta-analysis of biologically related transcriptomic data, a highly curated metadata database has been developed, which explicates annotations specific to human and animal models. Finally, to corroborate common mechanistic patterns (embedded with novel candidates) across large-scale AD transcriptomic data, a new approach to generating gene regulatory networks has been developed. The work presented here has demonstrated its capability to identify testable mechanistic hypotheses containing previously unknown or emerging knowledge from public data in two major publicly funded projects for Alzheimer's disease, Parkinson's disease and epilepsy.

    Purposive variation in recordkeeping in the academic molecular biology laboratory

    This thesis presents an investigation into the role played by laboratory records in the disciplinary discourse of academic molecular biology laboratories. The motivation behind this study stems from two areas of concern. Firstly, the laboratory record has received comparatively little attention as a linguistic genre in spite of its central role in the daily work of laboratory scientists. Secondly, laboratory records have become a focus for technologically driven change through the advent of computing systems that aim to support a transition away from the traditional paper-based approach towards electronic recordkeeping. Electronic recordkeeping raises the potential for increased sharing of laboratory records across laboratory communities. However, the uptake of electronic laboratory notebooks has been, and remains, markedly low in academic laboratories. The investigation employs a multi-perspective research framework combining ethnography, genre analysis, and reading protocol analysis in order to evaluate both the organizational practices and linguistic practices at work in laboratory recordkeeping, and to examine these practices from the viewpoints of both producers and consumers of laboratory records. Particular emphasis is placed on assessing variation in the practices used by different scientists when keeping laboratory records, and on assessing the types of articulation work used to achieve mutual intelligibility across laboratory members. The findings of this investigation indicate that the dominant viewpoint held by laboratory staff other than principal investigators conceptualized laboratory records as a personal resource rather than a community archive. Readers other than the original author relied almost exclusively on the recontextualization of selected information from laboratory records into ‘public genres’ such as laboratory talks, research articles, and progress reports as the preferred means of accessing the information held in the records. 
The consistent use of summarized forms of recording experimental data rendered most laboratory records both unreliable and of limited usability in the records management sense: they did not form full and accurate descriptions that could support future organizational activities. These findings offer a counterpoint to other studies, notably a number of studies undertaken as part of technology developments for electronic recordkeeping, that report sharing of laboratory records or assume a ‘cyberbolic’ view of laboratory records as a shared resource.

    The discovery of novel recessive genetic disorders in dairy cattle : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Animal Science at AL Rae Centre of Genetics and Breeding, Massey University, Palmerston North, New Zealand

    The selection of desirable characteristics in livestock has resulted in the transmission of advantageous genetic variants for generations. The advent of artificial insemination has accelerated the propagation of these advantageous genetic variants and led to tremendous advances in animal productivity. However, this intensive selection has led to the rapid uptake of deleterious alleles as well. Recently, a recessive mutation in the GALNT2 gene was found to dramatically impair growth and production traits in dairy cattle, causing small calf syndrome. The research presented here seeks to further investigate the presence and impact of recessive mutations in dairy cattle. A primary aim of genetics is to identify causal variants and understand how they act to manipulate a phenotype. As datasets have expanded, larger analyses are now possible and statistical methods to discover causal mutations have become commonplace. One such method, the genome-wide association study (GWAS), presents considerable exploratory utility in identifying quantitative trait loci (QTL) and causal mutations. GWAS have predominantly focused on identifying additive genetic effects, assuming that each allele at a locus acts independently of the other, whereas non-additive effects, including dominant, recessive, and epistatic effects, have been neglected. Here, we developed a single-locus non-additive GWAS model intended for the detection of dominant and recessive genetic mechanisms. We applied our non-additive GWAS model to growth, developmental, and lactation phenotypes in dairy cattle. We identified several candidate causal mutations that are associated with moderate to large deleterious recessive disorders of animal welfare and production. These mutations included premature-stop (MUS81, ITGAL, LRCH4, RBM34), splice-disrupting (FGD4, GALNT2), and missense (PLCD4, MTRF1, DPF2, DOCK8, SLC25A4, KIAA0556, IL4R) variants, and these occur at surprisingly high frequencies in cattle.
We further investigated these candidates for anatomical, molecular, and metabolic phenotypes to understand how these disorders might manifest. In some cases, these mutations were analogous to disorder-causing mutations in other species; these included Coffin-Siris syndrome (DPF2), Charcot-Marie-Tooth disease (FGD4), a congenital disorder of glycosylation (GALNT2), hyper-immunoglobulin E syndrome (DOCK8), Joubert syndrome (KIAA0556), and mitochondrial disease (SLC25A4). These discoveries demonstrate that deleterious recessive mutations exist in dairy cattle at remarkably high frequencies and that we are able to detect these disorders through modern genotyping and phenotyping capabilities. These are important findings that can be used to improve the health and productivity of dairy cattle in New Zealand and internationally.
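
    The distinction the abstract above draws between additive and non-additive (dominant, recessive) single-locus effects can be illustrated with a minimal Python sketch. This is not the thesis's model: the function names (`encode`, `slope`, `effect_estimates`) and the toy genotype and phenotype values are hypothetical, and a plain least-squares slope stands in for whatever association statistic the authors used. The sketch only shows how the same genotype counts (0, 1 or 2 copies of the alternate allele) are recoded under each assumption.

```python
from statistics import mean

MODELS = ("additive", "dominant", "recessive")

def encode(genotype, model):
    """Recode an alternate-allele count (0, 1 or 2) as a regression covariate."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    if model == "additive":    # each allele copy contributes equally
        return float(genotype)
    if model == "dominant":    # one copy is enough for the full effect
        return 1.0 if genotype >= 1 else 0.0
    if model == "recessive":   # effect appears only with two copies
        return 1.0 if genotype == 2 else 0.0
    raise ValueError(f"unknown model: {model}")

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def effect_estimates(genotypes, phenotypes):
    """Estimated effect size at one locus under each genotype coding."""
    return {
        m: slope([encode(g, m) for g in genotypes], phenotypes)
        for m in MODELS
    }

# Invented toy data: only animals with two copies show a reduced phenotype.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_phenotypes = [10.0, 10.0, 10.0, 10.0, 4.0, 4.0]
estimates = effect_estimates(toy_genotypes, toy_phenotypes)
```

    On this invented data the recessive coding attributes the whole phenotype drop to homozygous animals, whereas the additive coding spreads it across allele copies, which is the kind of signal an additive-only scan can understate.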

    Biochemistry students' difficulties with the symbolic and visual language used in molecular biology.

    Thesis (Ph.D.), University of KwaZulu-Natal, Pietermaritzburg, 2007. This study reports on recurring difficulties experienced by undergraduate students in understanding and interpreting certain symbolism, nomenclature, terminology, shorthand notation, models and other visual representations employed in the field of Molecular Biology to communicate information. Based on teaching experience and guidelines set out by a four-level methodological framework, data on various topic-related difficulties were obtained by inductive analyses of students’ written responses to specifically designed, free-response and focused probes. In addition, interviews, think-aloud exercises and student-generated diagrams were used to collect information. Both unanticipated and recurring difficulties were compared with scientifically correct propositional knowledge, categorized and subsequently classified. Students were adept at providing the meaning of the symbol “Δ” in various scientific contexts; however, some failed to recognize its use to depict the deletion of a leucine biosynthesis gene in the form Δ leu. “Hazard to leucine”, “change to leucine” and “abbreviation for isoleucine” were some of the erroneous interpretations of this polysemic symbol. Investigations of these definitions suggest a constructivist approach to knowledge construction and the inappropriate transfer of knowledge from prior mental schemata. The symbol “::” was poorly differentiated by students in its use to indicate gene integration or transposition and in tandem gene fusion. Idiosyncratic perceptions emerged suggesting that it is, for example, a proteinaceous component linking genes in a chromosome, or the centromere itself associated with the mitotic spindle, or “electrons” between genes, in the same way that it is symbolically shown in Lewis dot diagrams, which illustrate covalent bonding between atoms.
In an oligonucleotide shorthand notation, some students used valency to differentiate the trivalent phosphite form of the phosphorus atom from the pentavalent phosphodiester group, yet the concept of valency was poorly understood. Because of the visual form of a shorthand notation for the 3′,5′-phosphodiester link in DNA, the valency was read incorrectly. VSEPR theory and the Octet Rule were misunderstood or forgotten when trying to explain the valency of the phosphorus atom in synthetic oligonucleotide intermediates. Plasmid functional domains were generally well understood, although restriction mapping appeared to be a cognitively demanding task. Rote learning and substitution of definitions were evident in the explanation of promoter and operator functions. The concept of gene expression posed difficulties for many students, who believed that genes contain the entity they encode. Transcription and translation of in tandem gene fusions were poorly explained by some students, as was the effect of plasmid conformation on transformation and gene expression. With regard to the selection of transformants or the hybridoma, some students could not engage in reasoning or lateral thinking, as protoconcepts and domain-specific information were poorly understood. A failure to integrate and reason with factual information on phenotypic traits, media components and biochemical pathways was evident in written and oral presentations. DNA-strand nomenclature and associated function were problematic for some students, as they failed to differentiate the coding strand from the template strand and were prone to interchange the labelling of these. A substitution of labels with those characterizing DNA replication intermediates demonstrated erroneous information transfer. DNA replication models posed difficulties in integrating molecular mechanisms and detail with line drawings, coupled with inaccurate illustrations of sequential replication features.
Finally, a remediation model is presented, demonstrating a shift in assessment score dispersion from a range of 0 - 4.5 to 4 - 9 when learners are guided metacognitively to work with domain-specific or critical knowledge from an information bank. The present work shows that varied forms of symbolism can present students with complex learning difficulties, as the underlying information depicted by these is understood in a superficial way. It is imperative that future studies focus on the standardization of symbol use, perhaps governed by a convention that determines the manner in which threshold information on symbol use is disseminated, coupled with innovative teaching strategies that facilitate an improved understanding of the use of symbolic representations in Molecular Biology. As Molecular Biology advances, it is likely that experts will continue to use new and diverse forms of symbolic representations to explain their findings. The explanation of futuristic science is likely to develop a symbolic language that will impose great teaching challenges and unimaginable learning difficulties on new-generation teachers and learners, respectively.