
    Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes

    Long-document summarization is difficult for current generative transformer-based models because of the broad context they must process and understand. Detecting long-range dependencies remains challenging for today's state-of-the-art solutions, which usually require model expansion at the cost of an unsustainable demand for computing and memory capacity. This paper introduces Emma, a novel efficient memory-enhanced transformer-based architecture. By segmenting a lengthy input into multiple text fragments, our model stores previous chunks and compares the current chunk with them, gaining the capability to read and comprehend the context of the whole document with a fixed amount of GPU memory. This method enables the model to deal with theoretically infinitely long documents, using less than 18 GB and 13 GB of memory for training and inference, respectively. We conducted extensive performance analyses and show that Emma achieves competitive results on two datasets from different domains while consuming significantly less GPU memory than competitors, even in low-resource settings.
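
The idea of reading a document chunk by chunk with a bounded memory can be illustrated with a minimal sketch. It is not Emma's architecture: the abstract does not say how chunks are stored or compared, so the function names, the `memory_slots` parameter, and the placeholder "model step" (which only counts the tokens in view) are assumptions made for illustration.

```python
from collections import deque


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive fragments of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def process_document(tokens, chunk_size, memory_slots):
    """Process chunks in order while remembering at most memory_slots previous chunks.

    Returns, for each step, how many tokens were in view (memory + current chunk).
    A real model would run attention over this context instead of counting it.
    """
    memory = deque(maxlen=memory_slots)  # oldest chunk is evicted automatically
    tokens_in_view = []
    for chunk in split_into_chunks(tokens, chunk_size):
        context = list(memory) + [chunk]  # hypothetical stand-in for the model input
        tokens_in_view.append(sum(len(c) for c in context))
        memory.append(chunk)
    return tokens_in_view


# Toy document of 10 "tokens", read 4 at a time with one remembered chunk.
views = process_document(list(range(10)), chunk_size=4, memory_slots=1)
```

Whatever the document length, the context seen at each step never exceeds `(memory_slots + 1) * chunk_size` tokens, which is the property that keeps memory use fixed in this simplified setting.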

    Gene function finding through cross-organism ensemble learning

    Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and are sometimes incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from the GO annotations of another source organism that is evolutionarily related and better studied. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single-model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, by focusing it on the prioritized new annotations predicted, and to complement the known annotations available.
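
The perturbation step described in this abstract (delete a random fraction of known annotations, then learn to rebuild the originals) can be sketched as follows. This is a toy illustration rather than GeFF's implementation: the gene and term identifiers are hypothetical, and the choice of features (all terms of the reduced set) is an assumption, since the abstract does not specify the representation.

```python
import random


def perturb_annotations(annotations, drop_fraction, seed=0):
    """Randomly delete a fraction of known (gene, term) annotation pairs.

    `annotations` is a set of (gene, term) tuples; a reduced subset is returned.
    """
    rng = random.Random(seed)
    known = sorted(annotations)  # fixed order so the seed gives reproducible output
    n_drop = round(drop_fraction * len(known))
    dropped = set(rng.sample(known, n_drop))
    return annotations - dropped


def build_training_set(annotations, reduced, genes, terms, target_term):
    """One example per gene: features from the reduced set, label from the original set.

    Assumption: every term in the reduced set is used as a binary feature.
    """
    features = [[int((g, t) in reduced) for t in terms] for g in genes]
    labels = [int((g, target_term) in annotations) for g in genes]
    return features, labels


# Toy data with hypothetical gene and term identifiers.
genes = ["g1", "g2", "g3", "g4"]
terms = ["GO:A", "GO:B", "GO:C"]
original = {
    ("g1", "GO:A"), ("g1", "GO:B"), ("g2", "GO:A"), ("g2", "GO:C"),
    ("g3", "GO:B"), ("g3", "GO:C"), ("g4", "GO:A"), ("g4", "GO:C"),
}
reduced = perturb_annotations(original, drop_fraction=0.25)
X, y = build_training_set(original, reduced, genes, terms, target_term="GO:B")
```

Any supervised classifier could then be fitted on `X` and `y`. Pairs that such a model scores highly but that are absent from the original set would be the candidate new annotations.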

    Multi-language transfer learning for low-resource legal case summarization

    Analyzing and evaluating legal case reports are labor-intensive tasks for judges and lawyers, who usually base their decisions on report abstracts, legal principles, and commonsense reasoning. Summarizing legal documents is therefore time-consuming and requires considerable human expertise. Moreover, public legal corpora for specific languages are almost unavailable. This paper proposes a transfer learning approach with extractive and abstractive techniques to cope with the lack of labeled legal summarization datasets, namely a low-resource scenario. In particular, we conducted extensive multi- and cross-language experiments. The proposed work outperforms the state-of-the-art results of extractive summarization on the Australian Legal Case Reports dataset and sets a new baseline for abstractive summarization. Finally, syntactic and semantic metrics have been used to evaluate the accuracy and the factual consistency of the machine-generated legal summaries.

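    Discovering New Biological Functions of Genes through Cross-Organism Transfer Learning with Autoencoder Neural Networks (original title: Individuazione di Nuove Funzioni Biologiche di Geni mediante Transfer Learning Inter-Organismo con Reti Neurali Autoencoder)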

    Genomics is the discipline that studies the structure, sequence, function, and evolution of the genome, that is, the genetic information contained in the DNA present in the cells of a particular species. With enormous quantities of data available, it becomes difficult for humans to analyze all of it and draw inferences from it. For this reason, since the earliest years of research in this branch of biology, computer science has been fundamental for processing and visualizing the data it produces, to the point that genomics can be said to rest on bioinformatics. As the data collected by research grew, the need arose to catalog it in an orderly way within very large databases. Many projects have been launched with this aim; one of the main ones is Gene Ontology, which maintains and develops a controlled vocabulary, and annotates and disseminates data about genes and their functions. This is where machine learning comes into play: thanks to the data collected so far on the biological functions performed by gene products, and to methods introduced by machine learning such as supervised algorithms and neural networks, it is now possible to support research by directing scientists toward the analyses and experiments to try, which are useful for discovering and confirming new correlations between genes and biological functions. Moreover, since all organisms are related from an evolutionary standpoint, and therefore have more or less similar genomes, knowledge learned from one of them can be transferred to another: this is what is meant by "Cross-Organism Transfer Learning". This thesis uses neural networks to transfer this knowledge, proposing two different methodologies: one uses the autoencoder structure, while the other builds a separate classifier for each value to predict using deep learning.

    Cross-organism learning method to discover new gene functionalities

    Background: Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In recent years, several databases of such annotations have become available; yet these valuable annotations are incomplete, include errors, and only some of them represent highly reliable human-curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Methods: Here, we propose a novel cross-organism learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict unknown gene annotations with good accuracy. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organism learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. Results: We tested and compared our method with the equivalent single-organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of perturbing the available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted.

    Big Data mining and machine learning techniques applied to real world scenarios

    Data mining techniques allow the extraction of valuable information from heterogeneous and possibly very large data sources, which can be either structured or unstructured. Unstructured data, such as text files, social media, and mobile data, are far more abundant than structured data and grow at a higher rate. Their high volume and the inherent ambiguity of natural language make unstructured data very hard to process and analyze. Appropriate text representations are therefore required in order to capture word semantics as well as to preserve statistical information, e.g. word counts. In Big Data scenarios, scalability is also a primary requirement. Data mining and machine learning approaches should take advantage of large-scale data, exploiting abundant information and avoiding the curse of dimensionality. The goal of this thesis is to enhance text understanding in the analysis of big data sets, introducing novel techniques that can be employed to solve real-world problems. The presented Markov methods temporarily achieved the state of the art on well-known Amazon reviews corpora for cross-domain sentiment analysis, before being outperformed by deep approaches in the analysis of large data sets. A noise detection method for the identification of relevant tweets leads to 88.9% accuracy in the Dow Jones Industrial Average daily prediction, which is the best result in the literature based on social networks. Dimensionality reduction approaches are used in combination with LinkedIn users' skills to perform job recommendation. A framework based on deep learning and Markov Decision Processes is designed with the purpose of modeling job transitions and recommending pathways towards a given career goal. Finally, parallel primitives for vendor-agnostic implementation of Big Data mining algorithms are introduced to foster multi-platform deployment, code reuse and optimization.