214 research outputs found

    Extraction and Classification of App Features from App Reviews

    The number of bioinformatics tools grows every year, which makes finding a suitable tool for a given task increasingly difficult. To organize tool descriptions and make them searchable, various keyword ontologies are used. Descriptions are currently annotated manually, which is time-consuming and not always accurate. We propose a new annotation method that, using only the free-text description of a tool, automatically suggests one or more annotation labels from the ontology. The method applies modern natural language processing techniques such as latent Dirichlet allocation and word2vec. A first comparison of the labels produced by our algorithm with manual annotations shows promising results.

    A framework for evaluating automatic image annotation algorithms

    Several Automatic Image Annotation (AIA) algorithms introduced recently have been reported to outperform previous models. However, each has been evaluated using different descriptors, different collections or parts of collections, or "easy" settings. This renders their results non-comparable, and we show that collection-specific properties, rather than the models themselves, are responsible for the high reported performance measures. In this paper we introduce a framework for the evaluation of image annotation models, which we use to evaluate two state-of-the-art AIA algorithms. Our findings reveal that a simple Support Vector Machine (SVM) approach using global MPEG-7 features outperforms state-of-the-art AIA models across several collection settings. These models appear to depend heavily on the set of features and the data used, and it is easy to exploit collection-specific properties, such as tag popularity (especially in the commonly used Corel 5K dataset), and still achieve good performance.

    Probit Normal Correlated Topic Models

    The logistic normal distribution has recently been adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a probit normal alternative approach to modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even in the case of very small topical spaces. We circumvent the inefficiency of multinomial probit estimation by adapting the diagonal orthant multinomial probit to the topic model context, which allows our topic modelling scheme to handle corpora with a large number of latent topics. A further important benefit of our method is that, unlike the logistic normal model, whose non-conjugacy requires sophisticated sampling schemes, our approach exploits the natural conjugacy inherent in the auxiliary formulation of the probit model to achieve greater simplicity. Applying the proposed scheme to the well-known Associated Press corpus not only uncovers a large number of meaningful topics but also captures compellingly intuitive correlations among certain topics. In addition, the approach lends itself to further scalability thanks to various existing high-performance algorithms and architectures capable of handling millions of documents.
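
    The contrast between the two link functions can be written out as follows. This is only a sketch: the abstract gives no equations, so the notation is assumed here. It takes \eta_d as the latent Gaussian vector of document d, K as the number of topics, \theta_{dk} as the probability of topic k in document d, and \Phi as the standard normal CDF. The probit form shown is the commonly stated diagonal orthant construction, in which auxiliary variables z_{dk} \sim \mathcal{N}(\eta_{dk}, 1) are restricted so that exactly one coordinate is positive; the paper's own adaptation may differ in detail.

```latex
% Logistic normal link (correlated topic model)
\eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad
\theta_{dk} = \frac{\exp(\eta_{dk})}{\sum_{j=1}^{K} \exp(\eta_{dj})}

% Diagonal orthant probit link (assumed standard form)
\theta_{dk} =
  \frac{\bigl(1 - \Phi(-\eta_{dk})\bigr) \prod_{j \neq k} \Phi(-\eta_{dj})}
       {\sum_{l=1}^{K} \bigl(1 - \Phi(-\eta_{dl})\bigr) \prod_{j \neq l} \Phi(-\eta_{dj})}
```

    Conditional on the observed topic, each z_{dk} in this construction is a univariate truncated normal, which is the source of the conjugate, easy-to-sample updates the abstract refers to.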

    Hierarchical Multiclass Topic Modelling with Prior Knowledge

    This thesis introduces CascadeLDA, a new multi-label document classification technique. Rather than focusing on discriminative modelling techniques, CascadeLDA extends a baseline generative model by incorporating two types of prior information. First, knowledge from a labeled training dataset is used to guide the generative model. Second, the implicit tree structure of the labels is exploited to emphasise discriminative features between closely related labels. By splitting the classification problem into an ensemble of smaller problems, comparable out-of-sample results are achieved about 25 times faster than with the baseline model. In this thesis, CascadeLDA is applied to datasets of academic abstracts and full academic papers. The model is used to assist authors in tagging their newly published articles.

    Topic Uncovering and Image Annotation via Scalable Probit Normal Correlated Topic Models

    Uncovering latent topics has been an active research area for more than a decade and continues to receive contributions from many disciplines, including computer science, information science and statistics. Since the introduction of Latent Dirichlet Allocation in 2003, many intriguing extensions have been proposed. One such extension is the logistic normal correlated topic model, which not only uncovers the hidden topics of a document but also extracts meaningful topical relationships among a large number of topics. In this model, the logistic normal distribution is adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this thesis, we propose a probit normal alternative approach to modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even in the case of very small topical spaces. We circumvent the inefficiency of multinomial probit estimation by adapting the Diagonal Orthant Multinomial Probit (DO-Probit) to the topic model context, which allows our topic modelling scheme to handle corpora with a large number of latent topics. In addition, we extend our model to image annotation by developing an efficient collapsed Gibbs sampling scheme. Furthermore, we employ various high-performance computing techniques, such as memory-aware MapReduce, a SparseLDA implementation, vectorization and block sampling, as well as numerical efficiency strategies, to make sampling fast and efficient.

    Learning Object Categories From Internet Image Searches

    In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, as required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search results; and second, to recognize objects in other image data sets.
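
    To make the pLSA idea concrete, the following is a minimal sketch of the standard pLSA EM updates on a bag-of-visual-words count matrix, followed by a reranking step that sorts images by their posterior weight on one chosen topic. It is not the paper's implementation: the toy `counts` matrix, the function names `plsa` and `rerank`, and the choice of `topic=0` as the "object" topic are all assumptions made for illustration (the abstract does not say how the relevant topic is selected).

```python
import math
import random


def normalize(row):
    """Scale non-negative numbers to sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a document-by-word count matrix (list of lists).

    Returns (p_z_d, p_w_z, history) where p_z_d[d][z] = P(z | d),
    p_w_z[z][w] = P(w | z), and history holds the log-likelihood
    evaluated at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialisation breaks the symmetry between topics.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    history = []
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_lik += n * math.log(p_w_d)
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    resp = n * joint[z] / p_w_d
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        history.append(log_lik)
    return p_z_d, p_w_z, history


def rerank(p_z_d, topic):
    """Return document indices sorted by decreasing P(topic | d)."""
    return sorted(range(len(p_z_d)),
                  key=lambda d: p_z_d[d][topic], reverse=True)


# Hypothetical toy data: 6 "images" described by counts of 6 visual words.
counts = [
    [5, 4, 3, 0, 0, 1],
    [4, 5, 2, 1, 0, 0],
    [6, 3, 4, 0, 1, 0],
    [3, 4, 5, 0, 0, 0],
    [0, 1, 0, 5, 4, 6],
    [1, 0, 0, 4, 6, 5],
]
p_z_d, p_w_z, history = plsa(counts, n_topics=2)
order = rerank(p_z_d, topic=0)  # topic 0 is an arbitrary illustrative choice
```

    In practice the topic used for reranking would have to be chosen by some external criterion, for example a small validation set; here it is fixed only to keep the example short.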