11 research outputs found

    Gene function finding through cross-organism ensemble learning

    Background: Structured biological information about genes and proteins is a valuable resource for improving the discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, the features and functions of the genes and proteins of many organisms. However, such valuable annotations are not always reliable and are sometimes incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from the GO annotations of an evolutionarily related and better-studied source organism.
    Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of the known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm on the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations of an organism from their perturbed version, also effectively predicts new unknown annotations for that organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single-model technique. We tested the method on the GO annotations of five different organisms: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/.
    Conclusions: Our novel cross-organism ensemble learning method provides reliably predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both for speeding up annotation curation, by focusing it on the prioritized predicted new annotations, and for complementing the known annotations available.
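    The perturb-and-rebuild idea described above can be summarised in a few lines. Below is a minimal sketch, assuming a toy binary gene x GO-term matrix and scikit-learn's MLPRegressor as the supervised learner; all names and data are illustrative assumptions, not GeFF's actual implementation.

```python
# A minimal sketch of the perturbation-based training idea (illustrative, not GeFF's code).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic gene x GO-term annotation matrix (1 = known annotation).
A = (rng.random((200, 50)) < 0.15).astype(float)

# Perturbation: randomly delete a fraction of the known annotations.
delete_fraction = 0.2
mask = (A == 1) & (rng.random(A.shape) < delete_fraction)
A_perturbed = A.copy()
A_perturbed[mask] = 0.0

# Train a supervised model to rebuild the original annotations
# from their perturbed version.
model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(A_perturbed, A)

# Scores for currently-unannotated pairs act as likelihoods of new annotations.
scores = model.predict(A)
candidate_scores = np.where(A == 0, scores, -np.inf)
top = np.unravel_index(np.argsort(candidate_scores, axis=None)[-10:], A.shape)
print(list(zip(*top)))  # ten highest-ranked (gene, GO-term) candidates
```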

    Cross-Organism Annotation Prediction through Deep Learning Algorithms

    Studying how genes and proteins influence the lives of humans and other species is paramount. To do so, it is necessary to know which functional properties are specific to each gene or protein. The association between a gene or protein and a functional property is called an annotation. An annotation can be 0 or 1, where 1 means that the gene or protein contributes to the activation of a certain functional property. Functional properties are referred to by terms, which are strings belonging to ontologies. The aim of this work is to predict novel gene annotations for little-known species such as Bos taurus. To predict such annotations, a model built using deep learning is used. This model is trained on well-known species such as Mus musculus or Homo sapiens. Every predicted annotation has an associated likelihood that expresses how close the prediction is to a 0 or a 1. The final accuracy can be evaluated by fixing a likelihood value, so that all the considered annotations have a likelihood greater than or equal to the fixed one. The obtained accuracy is quite high, but not high enough for professional use, although it offers a useful starting point for future research.
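    The likelihood-thresholded evaluation described above can be sketched as follows; the arrays, the confidence definition and the function name are illustrative assumptions, not the thesis's code.

```python
# Illustrative sketch: keep only predictions whose confidence meets a threshold,
# then measure accuracy on that subset. Data below are made-up placeholders.
import numpy as np

likelihood = np.array([0.95, 0.10, 0.80, 0.55, 0.99, 0.30])  # model outputs in [0, 1]
truth = np.array([1, 0, 1, 0, 1, 1])                          # known annotations

def accuracy_at_threshold(likelihood, truth, tau):
    # Outputs close to 0 or 1 are confident; rescale distance from 0.5 to [0, 1]
    # and keep predictions whose confidence is at least tau (an assumption here).
    confidence = 2 * np.abs(likelihood - 0.5)
    keep = confidence >= tau
    preds = (likelihood[keep] >= 0.5).astype(int)
    return (preds == truth[keep]).mean(), keep.sum()

for tau in (0.0, 0.5, 0.9):
    acc, n = accuracy_at_threshold(likelihood, truth, tau)
    print(f"tau={tau}: accuracy={acc:.2f} over {n} predictions")
```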

    Metric for selecting the number of topics in the LDA Model

    The latest technological trends are driving a vast and growing amount of textual data. Topic modeling is a useful tool for extracting information from large corpora of text. A topic model is based on a corpus of documents: it discovers the topics that permeate the corpus and assigns documents to those topics. The Latent Dirichlet Allocation (LDA) model is the main, or most popular, of the probabilistic topic models. The LDA model is conditioned by three parameters: two Dirichlet hyperparameters (α and β) and the number of topics (K). Determining the parameter K is extremely important and not extensively explored in the literature, mainly due to the intensive computation and long processing time involved. Most topic modeling methods implicitly assume that the number of topics is known in advance, thus treating it as an exogenous parameter. This burdens the researcher and leaves the technique prone to subjectivity. The quality of the insights offered by LDA is quite sensitive to the value of the parameter K, and an excess of subjectivity in its choice might undermine the confidence managers place in the technique's results, thus hindering its adoption by firms. The main objective of this dissertation is to develop a metric that identifies the ideal value of the parameter K of the LDA model, allowing an adequate representation of the corpus within a tolerable processing time. We apply the proposed metric alongside existing metrics to two datasets. Experiments show that the proposed method selects a number of topics similar to that of other metrics, but with better performance in terms of processing time. Although each metric has its own method for determining the number of topics, some results are similar for the same database, as evidenced in the study. Our metric is superior when considering the processing time. Experiments show this method is effective.
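    For context, a common baseline for choosing K is to fit LDA at several candidate values and compare topic coherence. The sketch below shows that generic procedure (not the dissertation's proposed metric), using gensim on a toy corpus.

```python
# Generic K-selection baseline: compare topic coherence across candidate K values.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["gene", "expression", "cell"],
         ["bus", "transit", "card"],
         ["gene", "protein", "cell"],
         ["transit", "network", "bus"]] * 25  # toy stand-in corpus

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 4, 8):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=5)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")
    print(f"K={k}: coherence={cm.get_coherence():.3f}")
```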

    Focusing learning with asymmetric sparsity

    Modern data sets often suffer from the problem of having measurements from very few samples. The small sample size makes modeling such data sets very difficult, as models easily overfit the data. Many approaches have been taken to alleviate the problem. One such approach is multi-task learning, a subfield of statistical machine learning in which multiple data sets are modeled simultaneously. More generally, multiple learning tasks may be learnt simultaneously to achieve better performance in each. Another approach to the problem of having too few samples is to prevent overfitting by constraining the model with suitable assumptions. Traditional multi-task methods treat all learning tasks and data sets equally, even though we are usually mostly interested in learning one of them. This thesis is a case study on promoting predictive performance on a specific data set of interest in a multi-task setting by constraining the models for the learning tasks unevenly: the model for the data set of interest is made more sparse than the models for the secondary data sets. To study the new approach, the research question is limited to the very specific and popular family of so-called topic models using Bayesian nonparametric priors. A new model is presented which enables us to study the effects of asymmetric sparsity. These effects are studied by applying the new model to real data and toy data. Subtle beneficial effects of asymmetric sparsity are observed on toy data, and the new model performs comparably to existing state-of-the-art methods on real data.
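    The core idea of uneven sparsity across tasks can be illustrated outside the topic-model setting. The toy sketch below simply applies a stronger L1 penalty to the primary task than to the secondary ones; scikit-learn's Lasso stands in for the thesis's Bayesian nonparametric model, and the multi-task coupling is omitted.

```python
# Toy illustration of asymmetric sparsity: the primary task gets a stronger
# L1 penalty, so its learned model is sparser than the secondary tasks' models.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))             # few samples, many features
w = np.zeros(100)
w[:5] = 1.0                                # sparse ground truth shared by all tasks
tasks = {f"secondary_{i}": X @ w + rng.normal(scale=0.5, size=40) for i in range(3)}
tasks["primary"] = X @ w + rng.normal(scale=0.5, size=40)

# Asymmetric sparsity: larger alpha (stronger L1) on the task of interest.
alphas = {name: (0.5 if name == "primary" else 0.05) for name in tasks}
for name, y in tasks.items():
    model = Lasso(alpha=alphas[name]).fit(X, y)
    print(f"{name}: {np.count_nonzero(model.coef_)} non-zero coefficients")
```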

    Understanding public transit patterns with open geodemographics to facilitate public transport planning

    Plentiful studies have discussed the potential applications of contactless smart card data, from understanding interchange patterns to transit network analysis and user classification. However, the incomplete and anonymous nature of smart card data inherently limits the interpretation and understanding of the findings, which further limits planning implementations. Geodemographics, as ‘an analysis of people by where they live’, can be utilised as a promising supplement that provides contextual information for transport planning. This paper develops a methodological framework that integrates personalised smart card data with open geodemographics so as to pursue a better understanding of travellers’ behaviours. It adopts a text mining technique, latent Dirichlet allocation modelling, to extract transit patterns from the personalised smart card data, and then uses open geodemographics derived from census data to enhance the interpretation of the patterns. Moreover, it presents the Night Tube as an example to illustrate the framework’s potential usefulness in public transport planning.
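    The pattern-extraction step might look like the following minimal sketch, where each card holder's trips form a "document" of invented station-time tokens and scikit-learn's LDA stands in for the paper's pipeline.

```python
# Illustrative sketch: LDA over traveller "documents" of station-time tokens.
# All tokens and travellers are invented placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

travellers = [
    "kingscross_am oxford_am kingscross_pm oxford_pm",
    "brixton_night camden_night brixton_night",
    "kingscross_am oxford_am oxford_pm",
    "camden_night brixton_night camden_night",
] * 50

counts = CountVectorizer().fit(travellers)
X = counts.transform(travellers)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Inspect the most probable tokens per extracted transit pattern.
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"pattern {k}:", [vocab[i] for i in top])
```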

    Graphical Model approaches for Biclustering

    In many scientific areas, it is crucial to group (cluster) a set of objects based on a set of observed features. This operation is widely known as clustering, and it has been exploited in the most diverse scenarios, ranging from Economics to Biology passing through Psychology. Going a step further, there exist contexts where it is crucial to group objects and simultaneously identify the features that allow such objects to be distinguished from the others. In gene expression analysis, for instance, the identification of subsets of genes showing a coherent pattern of expression in subsets of objects/samples can provide crucial information about active biological processes. Such information, which cannot be retrieved by classical clustering approaches, can be extracted with so-called biclustering, a class of approaches which aim at simultaneously clustering both rows and columns of a given data matrix (where each row corresponds to a different object/sample and each column to a different feature). The problem of biclustering, also known as co-clustering, has recently been exploited in a wide range of scenarios such as Bioinformatics, market segmentation, data mining, text analysis and recommender systems. Many approaches have been proposed to address the biclustering problem, each one characterized by different properties such as interpretability, effectiveness or computational complexity. A recent trend involves the exploitation of sophisticated computational models (Graphical Models) to face the intrinsic complexity of biclustering and to retrieve very accurate solutions. Graphical Models represent the decomposition of a global objective function into a set of smaller/local functions defined over subsets of variables. The advantage of using Graphical Models lies in the fact that the graphical representation can highlight useful hidden properties of the considered objective function; moreover, the analysis of smaller local problems can be handled with less computational effort. Due to the difficulties in obtaining a representative and solvable model, and since biclustering is a complex and challenging problem, there exist few promising approaches in the literature based on Graphical Models facing biclustering. This thesis is set in the above-mentioned scenario and investigates the exploitation of Graphical Models to face the biclustering problem. We explored different types of Graphical Models, in particular Factor Graphs and Bayesian Networks. We present three novel algorithms (with extensions) and evaluate these techniques using available benchmark datasets. All the models have been compared with state-of-the-art competitors, and the results show that Factor Graph approaches lead to solid and efficient solutions for datasets of contained dimensions, whereas Bayesian Networks can manage huge datasets, with the drawback that setting the parameters can be non-trivial. As another contribution of the thesis, we widen the range of biclustering applications by studying the suitability of these approaches to some Computer Vision problems where biclustering has never been adopted before. Summarizing, with this thesis we provide evidence that Graphical Model techniques can have a significant impact in the biclustering scenario. Moreover, we demonstrate that biclustering techniques are versatile and can produce effective solutions in the most diverse fields of application.
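    As a concrete illustration of what biclustering does, the sketch below uses scikit-learn's spectral biclustering on synthetic checkerboard data; this is a standard stand-in method, not the Factor Graph or Bayesian Network algorithms contributed by the thesis.

```python
# Biclustering in a nutshell: rows (samples) and columns (features) are
# clustered simultaneously. Spectral biclustering is used as a stand-in.
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

data, rows, cols = make_checkerboard(shape=(120, 120), n_clusters=(3, 3),
                                     noise=5, random_state=0)
model = SpectralBiclustering(n_clusters=(3, 3), random_state=0).fit(data)

# Each sample and each feature now carries a bicluster label.
print("row cluster sizes:", np.bincount(model.row_labels_))
print("column cluster sizes:", np.bincount(model.column_labels_))
```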

    On Measuring Social Dynamics of Online Social Media

    Due to the complex nature of human behaviour and to our inability to directly measure thoughts and feelings, social psychology has long struggled for empirical grounding for its theories and models. Traditional techniques involving groups of people in controlled environments are limited to small numbers and may not be a good analogue for real social interactions in natural settings, due to their controlled and artificial nature. Their application as a foundation for simulation of social processes suffers similarly. The proliferation of online social media offers new opportunities to observe social phenomena “in the wild” that have only just begun to be realised. To date, analysis of social media data has been largely focussed on specific, commercially relevant goals (such as sentiment analysis) that are of limited use to social psychology, and the dynamics critical to an understanding of social processes are rarely addressed or even present in collected data. This thesis addresses such shortfalls by: (i) presenting a novel data collection strategy and system for rich dynamic data from communities operating on Twitter; (ii) a data set encompassing longitudinal dynamic information over two and a half years from the online pro-ana (pro-anorexia) movement; and (iii) two approaches to identifying active social psychological processes in collections of online text and network metadata: an approach linking traditional psychometric studies with topic models, and an algorithm combining community detection in user networks with topic models of the social media text they generate, enabling identification of community-specific topic usage.
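    A minimal sketch of the approach in (iii) might look as follows: detect communities in a user network, then summarise each community's text (a per-community topic model would slot in where the word counts are). The graph and posts are synthetic placeholders, not the thesis's data or algorithm.

```python
# Community detection on a user network plus per-community text summaries.
from collections import Counter
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                        # stand-in user network
posts = {n: f"user {n} talks about topic {n % 3}" for n in G.nodes}

for i, community in enumerate(greedy_modularity_communities(G)):
    # Count words across the community's posts; a topic model could replace this.
    words = Counter(w for n in community for w in posts[n].split())
    print(f"community {i} ({len(community)} users):", words.most_common(3))
```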

    Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

    In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities, and so on. Nowadays plenty of Data Mining and Machine Learning techniques have been proposed in the literature, and they have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. Nevertheless, some challenging issues remain when tackling a new problem: how should the problem be represented? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there different domains from which knowledge can be borrowed? This dissertation proposes possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using an instance for each object (a document, a gene, a social post, etc.) to be classified, it is proposed to use a pair of objects, or an object-class pair, using the relationship between them as the label. The application of this approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and to classify the biomedical literature based on the genomic features treated.
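    The pair-based reformulation can be sketched as follows: each (document, candidate-category) pair becomes one binary instance whose label says whether the pair matches, so adding a category only adds pairs rather than a new output dimension. The features and category prototypes below are deliberately trivial placeholders, not the dissertation's representation.

```python
# Sketch of the object-class pair reformulation: one binary instance per
# (document, category) pair instead of one multi-class instance per document.
import numpy as np
from sklearn.linear_model import LogisticRegression

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
labels = [0, 0, 1, 1]                      # true category of each document
categories = {0: np.array([1.0, 0.0]),    # a prototype vector per category
              1: np.array([0.0, 1.0])}

# Build pair instances; the binary label says whether the pair matches.
X_pairs, y_pairs = [], []
for doc, lab in zip(docs, labels):
    for cat, proto in categories.items():
        X_pairs.append(np.concatenate([doc, proto]))
        y_pairs.append(int(cat == lab))

clf = LogisticRegression().fit(np.array(X_pairs), y_pairs)

# Classify a new document by scoring it against every category prototype.
new_doc = np.array([0.2, 0.8])
scores = {c: clf.predict_proba(np.concatenate([new_doc, p]).reshape(1, -1))[0, 1]
          for c, p in categories.items()}
print(max(scores, key=scores.get), scores)
```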

    Biologically-aware Latent Dirichlet Allocation (BaLDA) for the Classification of Expression Microarray
