729 research outputs found

Efficient Taxonomic Similarity Joins with Adaptive Overlap Constraint

Authors: Lu Jiaheng; Xu Pengfei
Publication venue: ACM
Publication date: 15/08/2018

A similarity join aims to find all similar pairs between two collections of records. Established approaches usually deal with syntactic differences such as typos and abbreviations, but neglect the semantic relations between words. Such relations, however, are helpful for obtaining high-quality join results. In this paper, we leverage taxonomy knowledge (i.e., a set of IS-A hierarchical relations) to define a similarity measure that finds semantically similar records from two datasets. Based on this measure, we develop a similarity join algorithm within the prefix filtering framework to prune irrelevant pairs effectively. Our technical contribution here is an algorithm that judiciously selects critical parameters in a prefix filter to maximise its filtering power, supported by an estimation technique and a Monte Carlo simulation process. Empirical experiments show that our proposed methods exhibit high efficiency and scalability, outperforming the state of the art by a large margin. Peer reviewed.
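
The prefix filtering framework mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain Jaccard similarity over token sets and a fixed prefix length, which is the textbook form of the filter; it is not the paper's taxonomy-aware measure or its adaptive parameter selection. The function names (`jaccard`, `prefix_filter_join`) are hypothetical.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(left, right, threshold):
    """Return (i, j, sim) for all pairs with Jaccard(left[i], right[j]) >= threshold.

    Assumes 0 < threshold <= 1. Empty records are never reported.
    """
    # Global token order: rare tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for rec in list(left) + list(right):
        for tok in set(rec):
            freq[tok] += 1

    def canon(rec):
        return sorted(set(rec), key=lambda t: (freq[t], t))

    def prefix(tokens):
        # A pair can reach the threshold only if it shares at least
        # ceil(threshold * |x|) tokens with x, so two qualifying records must
        # share a token among their first |x| - ceil(threshold * |x|) + 1 tokens.
        need = ceil(threshold * len(tokens) - 1e-9)  # guard against float noise
        return tokens[: len(tokens) - need + 1]

    # Filter step: inverted index over the prefixes of the right collection.
    right_canon = [canon(r) for r in right]
    index = defaultdict(set)
    for j, toks in enumerate(right_canon):
        for tok in prefix(toks):
            index[tok].add(j)

    results = []
    for i, rec in enumerate(left):
        toks = canon(rec)
        candidates = set()
        for tok in prefix(toks):
            candidates |= index.get(tok, set())
        # Verification step: compute the exact similarity for candidates only.
        for j in sorted(candidates):
            sim = jaccard(toks, right_canon[j])
            if sim >= threshold:
                results.append((i, j, sim))
    return results
```

With `left = [["apple", "pie", "recipe"], ["deep", "learning", "model"]]`, `right = [["apple", "pie", "recipes"], ["recipe", "apple", "pie"], ["learning", "model"]]` and a threshold of 0.6, the join reports the pairs (0, 1) and (1, 2); the pair (0, 0) passes the filter but is rejected during verification.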

arXiv.org e-Print Archive

Crossref

Helsingin yliopiston digitaalinen arkisto

Efficient Approximate String Matching with Synonyms and Taxonomies

Author: Xu Pengfei
Publication venue: University of Helsinki Libraries
Publication date: 19/02/2021

Strings are ubiquitous. When collected from various sources, strings are often inconsistent, which means that they can express the same or a similar meaning in different forms, for example because of typographical mistakes. Finding similar strings in such inconsistent datasets has been researched extensively in past years under an umbrella problem called approximate string matching. This thesis aims to enhance the quality of approximate string matching by detecting similar strings using their meanings in addition to typographical errors. Specifically, this thesis focuses on utilising synonyms and taxonomies, since both are commonly available knowledge sources. This research uses each type of knowledge to address either a selection or a join task, where the first task aims to find strings similar to a given string, and the second task is to find pairs of strings that are similar. The desired output is either all strings similar to a given extent (i.e., all-match) or the top-k most similar strings. The first contribution of this thesis is to address the top-k selection problem considering synonyms. Here, we propose algorithms with different optimisation goals: to minimise the space cost, to maximise the selection speed, or to maximise the selection speed under a space constraint. We model the last goal as a variant of a 0/1 knapsack problem and propose an efficient solution based on the branch and bound paradigm. Next, this thesis solves the top-k join problem considering taxonomy relations. Three algorithms, two based on sorted lists and one based on tries, are proposed, in which we use pre-computations to accelerate list scans or use predictions to eliminate unnecessary trie accesses. Experiments show that the trie-based algorithm has a very fast response time on a vast dataset. The third contribution of this thesis is to deal with the all-match join problem considering taxonomy relations.
To this end, we identify the shortcoming of the standard prefix filtering principle and propose an adaptive filtering algorithm that can be tuned to minimise join time. We also design a sampling-based estimation procedure that suggests the best parameter in a short time with high accuracy. Lastly, this thesis researches the all-match join task by integrating typographical errors, synonyms, and taxonomies simultaneously. Key contributions here include a new unified similarity measure that employs multiple measures, as well as a non-trivial approximation algorithm with a tight theoretical guarantee. We furthermore propose two prefix filtering principles, a fast heuristic and an accurate one based on dynamic programming, to minimise join time.
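
The branch and bound paradigm that this abstract applies to a 0/1 knapsack variant can be sketched as follows. This is the textbook 0/1 knapsack with a fractional-relaxation upper bound, assuming positive weights; the thesis's actual variant (selection speed under a space constraint) is not specified in the abstract and is not reproduced here. The function name `knapsack_branch_and_bound` is hypothetical.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Return (best_value, chosen_indices) for the 0/1 knapsack problem.

    Assumes every weight is positive. Items are explored in order of
    decreasing value density, and a branch is pruned when an optimistic
    bound cannot beat the best solution found so far.
    """
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)

    def bound(k, value, room):
        # Upper bound: fill the remaining room greedily and allow a
        # fraction of the first item that no longer fits.
        for i in order[k:]:
            if weights[i] <= room:
                room -= weights[i]
                value += values[i]
            else:
                return value + values[i] * room / weights[i]
        return value

    best_value, best_set = 0, []

    def explore(k, value, room, chosen):
        nonlocal best_value, best_set
        if value > best_value:
            best_value, best_set = value, chosen[:]
        if k == len(order) or bound(k, value, room) <= best_value:
            return  # leaf reached, or this branch cannot improve the best
        i = order[k]
        if weights[i] <= room:  # branch 1: take item i
            chosen.append(i)
            explore(k + 1, value + values[i], room - weights[i], chosen)
            chosen.pop()
        explore(k + 1, value, room, chosen)  # branch 2: skip item i

    explore(0, 0, capacity, [])
    return best_value, sorted(best_set)
```

For example, `knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)` returns `(220, [1, 2])`. In the setting of the abstract, values and weights would presumably correspond to speed gains and space costs, but that mapping is an assumption.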

Helsingin yliopiston digitaalinen arkisto

Towards a unified framework for string similarity joins

Authors: Lu Jiaheng; Xu Pengfei
Publication date: 01/07/2019

A similarity join aims to find all similar pairs between two collections of records. Established algorithms utilise different similarity measures, either syntactic or semantic, to quantify the similarity between two records. However, when records are similar through a mixture of syntactic and semantic relations, a single measure is inadequate to reveal the real similarity between records, and hence cannot yield high-quality join results. In this paper, we study a unified framework to find similar records by combining multiple similarity measures. To achieve this goal, we first develop a new similarity framework that unifies three existing kinds of similarity measures: syntactic (typographic) similarity, synonym-based similarity, and taxonomy-based similarity. We then prove that finding the maximum unified similarity between two strings is NP-hard in general, and furthermore develop an approximation algorithm that runs in polynomial time with a non-trivial approximation guarantee. To support efficient string joins based on our unified similarity measure, we adopt the filter-and-verification framework and propose a new signature structure, called pebble, which can be adapted to handle multiple similarity measures simultaneously. The salient feature of our approach is that it can judiciously select the best pebble signatures and the overlap thresholds to maximise the filtering power. Extensive experiments show that our methods are capable of finding similar records having mixed types of similarity relations, while exhibiting high efficiency and scalability for similarity joins. The implementation can be downloaded at https://github.com/HY-UDBMS/AU-Join. Peer reviewed.
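
The three kinds of similarity named in this abstract can be illustrated at the level of a single token pair. The sketch below is a simplification under stated assumptions: syntactic similarity is taken as Jaccard over q-grams, synonym similarity as exact membership in a synonym set, taxonomy similarity as the depth of the lowest common ancestor divided by the larger node depth, and the three are combined by taking the maximum. None of these choices is confirmed by the abstract, and the sketch does not implement the paper's unified measure, its approximation algorithm, or pebble signatures. All names are hypothetical, and the `parent` map is assumed to be acyclic.

```python
def qgrams(s, q=2):
    """Set of q-grams of a string; a string shorter than q is its own gram."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}


def gram_similarity(a, b, q=2):
    """Syntactic similarity: Jaccard over q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def ancestors(node, parent):
    """Path from a node up to its root in a child -> parent map."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def taxonomy_similarity(a, b, parent):
    """Depth of the lowest common ancestor over the larger node depth."""
    pa, pb = ancestors(a, parent), ancestors(b, parent)
    common = set(pa) & set(pb)
    if not common:
        return 0.0
    # The root has depth 1; a node at index k of the path has depth len - k.
    lca_depth = max(len(pa) - pa.index(n) for n in common)
    return lca_depth / max(len(pa), len(pb))


def token_similarity(a, b, parent, synonyms, q=2):
    """Best score among the three measures for one token pair."""
    if a == b:
        return 1.0
    synonym_score = 1.0 if frozenset((a, b)) in synonyms else 0.0
    return max(gram_similarity(a, b, q),
               synonym_score,
               taxonomy_similarity(a, b, parent))
```

With a toy taxonomy `{"espresso": "coffee", "latte": "coffee", "coffee": "drink", "tea": "drink"}`, the tokens "espresso" and "latte" share no 2-grams but meet at "coffee" (depth 2 of 3), so only the taxonomy measure relates them, whereas "coffee" and "cofee" are related only by the q-gram measure.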

Helsingin yliopiston digitaalinen arkisto

Emergent patterns in protein, microbial and mutualistic systems

Author: Pascual García Alberto
Publication date: 01/01/2015

Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Faculty of Sciences, Department of Molecular Biology. Date of defence: 17-04-2015.
In this thesis we analyse emergent patterns in complex biological systems. We say that these patterns emerge because they result from behaviours of the system that are difficult to explain starting from a microscopic description. These behaviours are strongly dependent on the interactions between elements, and thus our research focuses on the identification and evaluation of interaction networks. In particular, we have analysed interactions that may reflect the response of the system to long-term conditions, whose analysis may be compatible with an evolutionary interpretation. The methodological and conceptual framework needed for the development of our research is complex. This is the reason why the first part of the thesis is devoted to clarifying the epistemological approach we have followed. In subsequent chapters, we present our research results, which have been developed around three systems with notable differences among them. The first system considers a representative subset of all the protein structures known to date. We develop a method that objectively demonstrates the existence of structural protein classes known as folds, defining conserved interaction patterns between amino acids. We go deeper into the evolutionary interpretation of this result by investigating the role of protein function in structural conservation and divergence. Second, we analyse high-throughput sequencing experiments recording the presence of bacterial taxa in different environments. From these data we infer aggregation and segregation patterns suggesting that bacterial mutualistic interactions are very relevant; their functional role is explored in more detail by analysing the bacterial assembly process in a group of infants during their development. Last, we have considered mutualistic communities of plants and pollinators. We predict the structural stability of this system by defining two magnitudes: the effective interspecific competition and the propagation of perturbations. These magnitudes rationalize the relative effect of competition versus mutualism and, in particular, of the different mutualistic networks on the structural stability, which we show plays a central role in sustaining biodiversity.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Novel approaches for large-scale phylogenetics and applications in the context of the amphibian tree of life

Author: Siu Ting Salvatierra Karen
Publication date: 01/01/2014

In this thesis, I addressed some problems associated with large-scale phylogenetic analyses by tackling issues related to missing data and the careful handling and addition of novel data in large-scale reconstructions, presenting an application of this approach in the context of amphibian phylogenetics. I developed a method (called “Concatabominations”) building on the original Safe Taxonomic Reduction method (Wilkinson 1995) as an alternative approach to the issue of identifying rogue taxa. The safe removal of rogue taxa due to missing data can potentially reduce the terraces in tree space search and improve resolution in the final consensus tree. From a pragmatic point of view, the new method can help in targeting taxa that require further sampling during research design. Novel sequence data for the rediscovered Ericabatrachus baleensis allowed me to explore its placement in the amphibian tree of life. I tested the inclusion of novel data using a backbone alignment from a previous work (de novo analysis) and a backbone phylogenetic tree (constrained analysis), after careful curation of the gene partitions to include in an analysis. I found that constrained phylogenetic inference using a previously accepted tree seems to be a practical solution to the rapid phylogenetic placement of a taxon in cases of well-supported relationships. However, a de novo analysis might ensure an optimal alignment and avoid risks introduced when adding new data. Finally, I investigated the evolutionary relationships of the three lineages of extant amphibians (Anura, Caudata and Gymnophiona) using an independent source of evidence: miRNAs, recently used to help resolve difficult phylogenetic problems. The analyses yielded a high number of shared miRNAs using the Xenopus tropicalis genome, contrasting with a lower number of miRNAs discovered using the Axolotl transcriptome. This suggests that validating miRNAs without genomic data is not ideal. Nevertheless, in spite of the limitations, I was able to find two potential novel miRNAs: one supporting the monophyly of Lissamphibia, and another supporting the Batrachia hypothesis. Overall, I hope the work developed in this thesis contributes new insights into large-scale phylogenetics and in particular into amphibian phylogenetics.

MURAL - Maynooth University Research Archive Library

Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment

Authors: Lopes Marta B.; ML4Microbiome
Publication venue: Frontiers Media SA
Publication date: 19/02/2021

Funding: COST Action CA18131; Cierva Grant IJC2019-042188-I (LM-Z); Estonian Research Council grant PUT 1371.
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., the compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host phenotypes based on taxonomy-informed feature selection, to establish an association between the microbiome and disease states, is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, to infer host phenotypes to predict diseases, and to use microbial communities to stratify patients by state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here relate mostly to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) an automated publication search through the digital libraries of the three major publishers using a natural language processing (NLP) toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on a learning-to-rank approach.

Repositório da Universidade Nova de Lisboa

University of Bergen

HAL Descartes

REPISALUD

NORA - Norwegian Open Research Archives

Diposit Digital de la Universitat de Barcelona

Hal-Diderot

Fondo Bibliográfico Digital Institucional