1,760 research outputs found

    An edit script for taxonomic classifications

    BACKGROUND: The NCBI taxonomy provides one of the most powerful ways to navigate sequence databases, but currently users are forced to formulate queries according to a single taxonomic classification. Given that there is no universal agreement on the classification of organisms, providing a single classification places constraints on the questions biologists can ask. However, maintaining multiple classifications is burdensome in the face of a constantly growing NCBI classification. RESULTS: In this paper, we present a solution to the problem of generating modifications of the NCBI taxonomy, based on the computation of an edit script that summarises the differences between two classification trees. Our algorithms find the shortest possible edit script based on the identification of all shared subtrees, and take time only quasi-linear in the size of the trees because classification trees have unique node labels. CONCLUSION: These algorithms have recently been implemented, and the software is freely available for download from
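As a rough illustration of why unique node labels make tree comparison cheap, the following sketch diffs two classification trees stored as child-to-parent maps. It is a naive node-level diff, not the shortest-edit-script algorithm of the paper: the function name `edit_script`, the operation names and the toy taxa are all assumptions made for this example.

```python
def edit_script(old_parent, new_parent):
    """Naive diff of two trees given as {node_label: parent_label} maps.

    Assumes every label is unique within a tree (the root maps to None),
    so each node can be looked up directly instead of being searched for.
    """
    ops = []
    # Labels present only in the new tree are insertions.
    for node in sorted(new_parent.keys() - old_parent.keys()):
        ops.append(("insert", node, new_parent[node]))
    # Labels present only in the old tree are deletions.
    for node in sorted(old_parent.keys() - new_parent.keys()):
        ops.append(("delete", node, old_parent[node]))
    # Shared labels whose parent changed are moves.
    for node in sorted(old_parent.keys() & new_parent.keys()):
        if old_parent[node] != new_parent[node]:
            ops.append(("move", node, new_parent[node]))
    return ops


# Hypothetical toy classification, for illustration only.
old = {"root": None, "A": "root", "B": "root", "a1": "A", "b1": "B"}
new = {"root": None, "A": "root", "B": "root", "a1": "B", "c1": "A"}
for op in edit_script(old, new):
    print(op)
```

Because a moved subtree keeps its internal parent links, only its top node appears as a `move`; unchanged nodes produce no operations at all.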

    The taxonomic name resolution service : an online tool for automated standardization of plant names

    © The Author(s), 2013. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in BMC Bioinformatics 14 (2013): 16, doi:10.1186/1471-2105-14-16. The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this ‘names problem’ has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science. The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets. We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes.
The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/. BJE was supported by NSF grant DBI 0850373 and TR by CSIRO Marine and Atmospheric Research, Australia. BB and BJE acknowledge early financial support from Conservation International and TEAM, who funded the development of early prototypes of taxonomic name resolution. The iPlant Collaborative (http://www.iplantcollaborative.org) is funded by a grant from the National Science Foundation (#DBI-0735191).
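The idea of correcting misspelled names by fuzzy matching against a reference list can be sketched with Python's standard `difflib` module. This is only an illustration of the general technique: it does not use the TNRS code or its web service, and the function name `resolve_name`, the similarity cutoff and the three-name reference list are assumptions.

```python
import difflib


def resolve_name(name, reference_names, cutoff=0.8):
    """Return the closest reference name, or None if nothing is similar enough.

    Matching is case-insensitive; `cutoff` is an arbitrary similarity
    threshold between 0 and 1 chosen for this example.
    """
    lookup = {ref.lower(): ref for ref in reference_names}
    matches = difflib.get_close_matches(
        name.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Hypothetical reference list, for illustration only.
reference = ["Quercus alba", "Quercus rubra", "Pinus strobus"]
print(resolve_name("Quercus albba", reference))
print(resolve_name("Zea mays", reference))
```

A production resolver would also have to parse authorities, handle synonyms and use family names to separate homonyms, as the abstract describes; none of that is attempted here.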

    Biodiversity informatics - entomological data processing, analysis and visualization

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2019. This work focuses on the digitization, processing and analysis of data from natural history collections using biodiversity informatics tools. Data were used from the insect collections of the Museu Nacional de História Natural e da Ciência (MNHNC) and the Instituto de Investigação Científica Tropical (IICT), Universidade de Lisboa. In 2014, a dataset with 30 535 records from the MNHNC insect collection was published in the Global Biodiversity Information Facility (GBIF). Since then, new records have been digitized and new data have been added, such as new taxonomic identifications, among others. Currently, the catalogue of the MNHNC insect collection includes 39 139 validated records, which correspond to about 98% of the total and refer to 79 885 specimens. So that this updated dataset could be published, data cleaning tools were applied to detect and correct errors, and records were georeferenced so that the data can be located on a map from their coordinates. Regarding data cleaning and homogenization, all fields were cleaned and formatted according to the standards of the DarwinCore metadata model. This process included verifying taxonomic identifications to detect synonymies and spelling errors, changing the format of dates and times, and applying a controlled vocabulary to the remaining fields. In parallel with this process, tools were tested to speed up digitization in two different phases: transcription and georeferencing of data from specimen labels.
Five georeferencing tools that provide Application Programming Interfaces (APIs), which can be used to georeference records automatically from locality names, were tested: Google Maps, MapQuest, GeoNames, OpenStreetMap and GEOLocate. Of these, Google Maps produced the best results, with 57.6% of results at a distance of 1000 m or less from the correct location. A citizen science project was also developed and tested on the Zooniverse platform, with two strands: one for transcribing data from photographs of labelled specimens, aimed at the general public, and one for taxonomic identification of specimens from photographs, aimed at specialists in the taxonomy of the respective group. The first strand resulted in the successful transcription of the data for all 130 specimens made available. The second resulted in the identification of the 61 specimens that had no previous identification, and in the verification of the identifications of the remaining 69 specimens. It is therefore concluded that citizen science projects would be a good way of accelerating the digitization project, provided that adequate error verification and correction methods are implemented. In order to cover all steps of the digitization process more completely, the horsefly collections (Diptera: Tabanidae) of the IICT and the MNHNC were selected. This group is of special interest because it includes important vectors of disease transmission to humans and livestock, and because it is widely distributed throughout the world. The IICT horsefly collection is particularly important because it was, for the most part, compiled and studied by J. A. Travassos Santos Dias, a specialist in this group who published extensive works based on the specimens in the collection. Both collections include type specimens of species described by Travassos Santos Dias and other authors.
Despite its importance, the information associated with the specimens of the IICT/MNHNC collections had not yet been digitized. In this work, all specimens were photographed and their data transcribed, resulting in a dataset with 1 666 specimens. Records were georeferenced whenever possible. The specimens in the collection were collected between 1899 and 2018, mostly in Portugal, but also in São Tomé and Príncipe, Guinea-Bissau, Mozambique, Spain and other countries. For better visualization of the geographic distribution of the specimens, distribution maps were created, using R, for the best-represented species in the collections. The publication of this dataset on the GBIF platform will be an asset for the study of the distribution of this group, owing to its broad geographic and temporal coverage, as well as the fact that most specimens (85.1%) are identified to species or subspecies. This work is based on data records associated with the insect collections from the Museu Nacional de História Natural e da Ciência (MNHNC) and Instituto de Investigação Científica Tropical (IICT), Universidade de Lisboa. In 2014 a dataset with 30 535 records was published in the Global Biodiversity Information Facility (GBIF). Since then, the data have been improved and new records acquired. Currently, the collection catalogue includes 39 139 validated records, corresponding to 79 885 specimens, with much more to be added from collections donated by private collectors or unprocessed samples. The data for these specimens were cleaned, formatted, geocoded and published on the GBIF. During this work, different APIs were tested to allow automated geocoding of sampling locations. Google Maps achieved the best results, with 57.6% of results within 1000 m of the correct location. A citizen science project was developed and tested to accelerate the digitization process, including two workflows with different objectives.
One was focused on the transcription of specimen label data, which resulted in the data for 130 specimens being successfully transcribed. The other was focused on the taxonomic identification of specimens from photographs, directed to specialists in the respective group’s taxonomy, which resulted in 61 new identifications and the verification of identifications for the remaining 69 specimens. The MNHNC and IICT collections contain collections of horseflies (Order Diptera, Family Tabanidae), which are of particular importance due to their size and the completeness of the associated data. Horseflies are widely distributed worldwide and are important vectors in the transmission of diseases to humans and cattle. The IICT collection includes a sub-collection which was compiled and studied by J. A. Travassos Santos Dias, a prominent specialist in this group. The specimens in these collections were photographed, all the associated data were transcribed, taxonomic identifications were verified and records were geocoded, resulting in a dataset of 1666 specimens. These specimens were collected between 1899 and 2018, mainly in Portugal, but also in São Tomé and Príncipe, Guinea-Bissau, Mozambique, Spain and other countries. To better understand the distribution of this group, distribution maps were made for the most well-represented species in the collections.
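The geocoding accuracy measure reported in this abstract (the share of results within 1000 m of the correct location) can be computed along the following lines. This is a minimal sketch that assumes great-circle (haversine) distance on a spherical Earth; the thesis does not state here how distances were measured, and the function names and coordinates below are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fraction_within(pairs, threshold_m=1000.0):
    """Share of (predicted, reference) coordinate pairs no more than threshold_m apart."""
    hits = sum(
        1 for predicted, truth in pairs
        if haversine_m(*predicted, *truth) <= threshold_m
    )
    return hits / len(pairs)


# Made-up (lat, lon) pairs: the first is an exact hit, the second is roughly 9 km off.
pairs = [
    ((38.7180, -9.1500), (38.7180, -9.1500)),
    ((38.7200, -9.1500), (38.8000, -9.1500)),
]
print(fraction_within(pairs))
```

Running the same function over the results of each geocoding service would give directly comparable percentages for the tools being evaluated.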

    ClassyFire: automated chemical classification with a comprehensive, computable taxonomy

    Additional file 5. Use cases. Text-based search on the ClassyFire web server. (A) Building the query. (B) Sparteine, one of the returned compounds

    Antibiotic resistant bacteria in water environments in Louisville, Kentucky.

    Antibiotic resistant bacterial strains are an increasing problem, particularly in clinical health care settings. As a result, bacterial infections are becoming increasingly challenging to treat, with more cases becoming life threatening. Aquatic environments facilitate microbial diversity and the transfer of genetic elements and thus may serve as a reservoir for antibiotic resistant microbes. Human misuse of antibiotics may further facilitate the spread of resistance in water environments. With little known about the bacterial communities in local water environments, this study aimed to learn more about these populations through the following aims: 1) identify the microbial community composition from water environments around Louisville, KY; and 2) examine whether the communities were resistant to two clinically used antibiotics, vancomycin and colistin. In this study, water sites were sampled and sorted into 4 categories: agricultural waters, commercial drains, natural waters, and wastewaters. In total, 155 single colony isolates resistant to vancomycin and colistin were identified through 16S sequencing. Whole community metagenomics analysis characterized the bacterial composition of 87 communities from the initial sample collection. Community diversity and the relationship between diversity and income were analyzed. One of the most striking results was the presence of Ochrobactrum sp. in 78 of the 87 communities. Two of the most prevalent genera, Ochrobactrum and Pseudochrobactrum, were characterized by assessing relative antibiotic resistance profiles and were found to be tolerant to high doses of a spectrum of antibiotics. Finally, a representative Ochrobactrum sp. isolate was tested for its ability to confer antibiotic resistance to a susceptible recipient bacterium. This Ochrobactrum sp. isolate was unable to transfer colistin resistance to another bacterial species, Pseudomonas aeruginosa, despite repeated efforts.
The results indicate that there is a large diversity of microbes resistant to vancomycin and colistin, though the ability of these microbes to transfer this resistance remains to be seen.

    Simple identification tools in FishBase

    Simple identification tools for fish species were included in the FishBase information system from its inception. Early tools made use of the relational model and characters like fin ray meristics. Soon pictures and drawings were added as a further help, similar to a field guide. Later came the computerization of existing dichotomous keys, again in combination with pictures and other information, and the ability to restrict possible species by country, area, or taxonomic group. Today, www.FishBase.org offers four different ways to identify species. This paper describes these tools with their advantages and disadvantages, and suggests various options for further development. It explores the possibility of a holistic and integrated computer-aided strategy.

    From trees to descriptions and identification tools

    PARADISEC (Pacific And Regional Archive for Digital Sources in Endangered Cultures), Australian Partnership for Sustainable Repositories, Ethnographic E-Research Project and Sydney Object Repositories for Research and Teaching


    Microbiome profiling by Illumina sequencing of combinatorial sequence-tagged PCR products

    We developed a low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region. Amplified PCR products are sequenced using an Illumina paired-end protocol to generate millions of overlapping reads. Combinatorial sequence tagging can be used to examine hundreds of samples with far fewer primers than is required when sequence tags are incorporated at only a single end. The number of reads generated permitted saturating or near-saturating analysis of samples of the vaginal microbiome. The large number of reads allowed an in-depth analysis of errors, and we found that PCR-induced errors composed the vast majority of non-organism derived species variants, an observation that has significant implications for sequence clustering of similar high-throughput data. We show that the short reads are sufficient to assign organisms to the genus or species level in most cases. We suggest that this method will be useful for the deep sequencing of any short nucleotide region that is taxonomically informative; these include the V3 and V5 regions of the bacterial 16S rRNA genes and the eukaryotic V9 region that is gaining popularity for sampling protist diversity. Comment: 28 pages, 13 figures
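The primer saving from combinatorial tagging follows from simple counting: pairing each of f forward tags with each of r reverse tags labels f × r samples using only f + r tagged primers. The sketch below illustrates that arithmetic; the tag sequences, sample names and function names are hypothetical, and the single-end count assumes one tagged primer per sample plus one shared untagged primer.

```python
from itertools import product


def assign_tag_pairs(forward_tags, reverse_tags, samples):
    """Give each sample a unique (forward_tag, reverse_tag) combination."""
    pairs = list(product(forward_tags, reverse_tags))
    if len(samples) > len(pairs):
        raise ValueError("not enough tag combinations for the samples")
    return dict(zip(samples, pairs))


def primer_counts(n_forward, n_reverse):
    """Primers needed to label n_forward * n_reverse samples.

    Returns (combinatorial, single_end). The single-end figure assumes one
    tagged primer per sample plus one shared untagged primer.
    """
    n_samples = n_forward * n_reverse
    return n_forward + n_reverse, n_samples + 1


# Hypothetical 4-base tags, for illustration only.
forward = ["ACGT", "TGCA", "GATC"]
reverse = ["CCAA", "GGTT", "AACC", "TTGG"]
samples = [f"sample_{i}" for i in range(12)]

assignment = assign_tag_pairs(forward, reverse, samples)
print(assignment["sample_0"])
print(primer_counts(len(forward), len(reverse)))
```

The gap widens quickly with scale: under the same assumptions, 20 forward and 20 reverse tags cover 400 samples with 40 primers.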