3 research outputs found

    An empirical evaluation of document embeddings and similarity metrics for scientific articles

    Comparing documents has a wide range of applications in several fields, including article or patent search, bibliography recommendation systems, and visualization of document collections. One of the key tasks these problems have in common is the evaluation of a similarity metric. Many such metrics have been proposed in the literature, and deep learning techniques have recently gained a lot of popularity. However, it is difficult to analyze how those metrics perform against each other. In this paper, we present a systematic empirical evaluation of several of the most popular similarity metrics when applied to research articles. We analyze the results of those metrics in two ways: with a synthetic test that uses scientific papers and Ph.D. theses, and in a real-world scenario where we evaluate their ability to cluster papers from different areas of research. This research was funded by Project TIN2017-88515-C2-1-R, funded by Ministerio de Economía y Competitividad, under MCIN/AEI/10.13039/501100011033/FEDER “A way to make Europe”. Peer Reviewed. Postprint (published version).

    Visual analysis of research paper collections using normalized relative compression

    The analysis of research paper collections is an interesting topic that can give insight into whether a research area is stalled on the same problems or produces a great amount of novelty every year. Previous research has addressed similar tasks by analyzing keywords or reference lists, with different degrees of human intervention. In this paper, we demonstrate how, using Normalized Relative Compression together with a set of automated data-processing tasks, we can successfully compare research articles and document collections visually. We also achieve very similar results with Normalized Conditional Compression, which can be applied with a regular compressor. With our approach, we can group papers from different disciplines, analyze how a conference evolves across its editions, or see how the profile of a researcher changes over time. We provide a set of tests that validate our technique and show that it performs better for these tasks than other previously proposed techniques. Peer Reviewed. Postprint (published version).
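The abstract above notes that Normalized Conditional Compression can be applied with a regular compressor. The following is a minimal sketch of that idea, not the paper's implementation. It assumes the common approximation C(x|y) ≈ C(yx) − C(y) and normalizes by C(x); the exact normalization used in the paper is not given in this abstract. The function names `compressed_size` and `conditional_compression_score` are hypothetical, and `zlib` stands in for whatever compressor the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def conditional_compression_score(x: bytes, y: bytes) -> float:
    """Approximate cost of describing x given y, normalized by the cost of x alone.

    Assumption (not from the source): C(x|y) is approximated as C(yx) - C(y),
    and the result is divided by C(x). Lower values suggest x is more similar to y.
    """
    conditional = max(0, compressed_size(y + x) - compressed_size(y))
    return conditional / compressed_size(x)


paper_a = (
    b"Compression-based measures estimate how much information one document "
    b"shares with another. A reference model is built from one text and used "
    b"to encode a second text. Fewer bits mean the two texts are more alike."
)
paper_b = (
    b"Volcanic eruptions release ash, gas and molten rock. Geologists monitor "
    b"seismic tremors near the crater to forecast activity weeks in advance."
)

same = conditional_compression_score(paper_a, paper_a)
different = conditional_compression_score(paper_a, paper_b)
print(f"score against itself: {same:.3f}")
print(f"score against unrelated text: {different:.3f}")
```

A general-purpose compressor such as zlib has a limited history window (32 KB for DEFLATE), so this approximation degrades on long documents; that is one practical reason a dedicated relative compressor may be preferred.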

    Sistema de pesquisa automática de sequências de ADN aproximadas e não contíguas (Automatic search system for approximate and non-contiguous DNA sequences)

    Master's in Electronics and Telecommunications Engineering. The ability to search for DNA sequences that are similar to others contained in a larger sequence, such as a chromosome, plays a very important role in the study of organisms and of the possible connections between different species. Even though several techniques and algorithms for sequence search already exist, this problem is still open to the development of new tools that improve on existing ones. This thesis proposes a solution for sequence search based on data compression, or, more specifically, on finite-context models, which yields a measure of similarity between a reference and a target. The method uses finite-context models to build a statistical model of the reference sequence and then estimates the number of bits needed to encode the target sequence using the reference model. In this work we studied the method described above, initially under controlled conditions and, finally, in a study of DNA regions of the modern human genome that cannot be found in ancient DNA (or can only be found with a high degree of dissimilarity).
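The method described above can be illustrated with a small sketch: count order-k contexts in the reference, then sum the estimated code length of each target symbol under that frozen model. This is an illustration under stated assumptions, not the thesis's implementation. The context order `k`, the smoothing parameter `alpha`, the alphabet `ACGT`, and the function names are all hypothetical choices; the first `k` target symbols are skipped because they have no full context.

```python
import math
from collections import defaultdict


def build_reference_model(reference: str, k: int):
    """Count how often each symbol follows each length-k context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(reference)):
        counts[reference[i - k:i]][reference[i]] += 1
    return counts


def estimated_bits(target: str, counts, k: int, alphabet: str = "ACGT",
                   alpha: float = 1.0) -> float:
    """Estimated bits to encode target using only the (frozen) reference model."""
    bits = 0.0
    for i in range(k, len(target)):
        symbol_counts = counts.get(target[i - k:i], {})
        total = sum(symbol_counts.values())
        # Additive smoothing so unseen symbols still get a nonzero probability.
        p = (symbol_counts.get(target[i], 0) + alpha) / (total + alpha * len(alphabet))
        bits += -math.log2(p)
    return bits


def normalized_bits(target: str, counts, k: int, alphabet: str = "ACGT") -> float:
    """Bits per symbol divided by log2 of the alphabet size (assumed normalization)."""
    n = len(target) - k
    return estimated_bits(target, counts, k, alphabet) / (n * math.log2(len(alphabet)))


k = 3
model = build_reference_model("ACGT" * 50, k)
similar = normalized_bits("ACGT" * 10, model, k)
dissimilar = normalized_bits("TGCA" * 10, model, k)
print(f"similar target: {similar:.3f}")
print(f"dissimilar target: {dissimilar:.3f}")
```

A target region that the reference model predicts well costs few bits, while a region with contexts never seen in the reference costs close to the maximum of log2 of the alphabet size per symbol, which is what lets the measure flag regions absent from the reference.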