22 research outputs found

    A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

    In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment; however, the dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community. Owing to its biological nature, the dataset exhibits a characteristic long-tailed class imbalance. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.

    A reference library for Canadian invertebrates with 1.5 million barcodes, voucher specimens, and DNA samples

    The synthesis of this dataset was enabled by funding from the Canada Foundation for Innovation, from Genome Canada through Ontario Genomics, from NSERC, and from the Ontario Ministry of Research, Innovation and Science in support of the International Barcode of Life project. It was also enabled by philanthropic support from the Gordon and Betty Moore Foundation and from Ann McCain Evans and Chris Evans. The release of the data on GGBN was supported by a GGBN – Global Genome Initiative Award, and we thank G. Droege, L. Loo, K. Barker, and J. Coddington for their support. Our work depended heavily on the analytical capabilities of the Barcode of Life Data Systems (BOLD, www.boldsystems.org). We also thank colleagues at the CBG for their support, including S. Adamowicz, S. Bateson, E. Berzitis, V. Breton, V. Campbell, A. Castillo, C. Christopoulos, J. Cossey, C. Gallant, J. Gleason, R. Gwiazdowski, M. Hajibabaei, R. Hanner, K. Hough, P. Janetta, A. Pawlowski, S. Pedersen, J. Robertson, D. Roes, K. Seidle, M. A. Smith, B. St. Jacques, A. Stoneham, J. Stahlhut, R. Tabone, J. Topan, S. Walker, and C. Wei. For bioblitz-related assistance, we are grateful to D. Ireland, D. Metsger, A. Guidotti, J. Quinn and other members of Bioblitz Canada and Ontario Bioblitz. For our work in Canada’s national parks, we thank S. Woodley and J. Waithaka for their lead role in organizing permits, and the many Parks Canada staff who facilitated specimen collections, including M. Allen, D. Amirault-Langlais, J. Bastick, C. Belanger, C. Bergman, J.-F. Bisaillon, S. Boyle, J. Bridgland, S. Butland, L. Cabrera, R. Chapman, J. Chisholm, B. Chruszcz, D. Crossland, H. Dempsey, N. Denommee, T. Dobbie, C. Drake, J. Feltham, A. Forshner, K. Forster, S. Frey, L. Gardiner, P. Giroux, T. Golumbia, D. Guedo, N. Guujaaw, S. Hairsine, E. Hansen, C. Harpur, S. Hayes, J. Hofman, S. Irwin, B. Johnston, V. Kafa, N. Kang, P. Langan, P. Lawn, M. Mahy, D. Masse, D. Mazerolle, C. McCarthy, I. McDonald, J. McIntosh, C. McKillop, V. Minelga, C. Ouimet, S. Parker, N. Perry, J. Piccin, A. Promaine, P. Roy, M. Savoie, D. Sigouin, P. Sinkins, R. Sissons, C. Smith, R. Smith, H. Stewart, G. Sundbo, D. Tate, R. Tompson, E. Tremblay, Y. Troutet, K. Tulk, J. Van Wieren, C. Vance, G. Walker, D. Whitaker, C. White, R. Wissink, C. Wong, and Y. Zharikov. For our work near Canada’s ports in Vancouver, Toronto, Montreal, and Halifax, we thank R. Worcester, A. Chreston, M. Larrivee, and T. Zemlak, respectively. Many other organizations improved coverage in the reference library by providing access to specimens – they included the Canadian National Collection of Insects, Arachnids and Nematodes, Smithsonian Institution’s National Museum of Natural History, the Canadian Museum of Nature, the University of Guelph Insect Collection, the Royal British Columbia Museum, the Royal Ontario Museum, the Pacific Forestry Centre, the Northern Forestry Centre, the Lyman Entomological Museum, the Churchill Northern Studies Centre, and rare Charitable Research Reserve. We also thank the many taxonomic specialists who identified specimens, including A. Borkent, B. Brown, M. Buck, C. Carr, T. Ekrem, J. Fernandez Triana, C. Guppy, K. Heller, J. Huber, L. Jacobus, J. Kjaerandsen, J. Klimaszewski, D. Lafontaine, J-F. Landry, G. Martin, A. Nicolai, D. Porco, H. Proctor, D. Quicke, J. Savage, B. C. Schmidt, M. Sharkey, A. Smith, E. Stur, A. Tomas, J. Webb, N. Woodley, and X. Zhou. We also thank K. Kerr and T. Mason for facilitating collections at Toronto Zoo and D. Iles for servicing the trap at Wapusk National Park. This paper contributes to the University of Guelph’s Food from Thought research program supported by the Canada First Research Excellence Fund.
    The Barcode of Life Data System (BOLD; www.boldsystems.org) [8] was used as the primary workbench for creating, storing, analyzing, and validating the specimen and sequence records and the associated data resources [48]. The BOLD platform has a private, password-protected workbench for the steps from specimen data entry to data validation (see details in Data Records), and a public data portal for the release of data in various formats. The latter is accessible through an API (http://www.boldsystems.org/index.php/resources/api?type=webservices) that can also be controlled through R [75] with the package ‘bold’ [76]. Peer reviewed.

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
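
    To put the quoted accuracy in concrete terms, the sketch below multiplies out the two figures given in the abstract (2.85 billion nucleotides, roughly 1 error event per 100,000 bases). The function name `expected_error_count` is hypothetical and introduced only for illustration; the result is an order-of-magnitude estimate, not a figure reported by the authors.

```python
def expected_error_count(genome_length_bases: int, bases_per_error: int) -> int:
    """Rough number of error events implied by a '1 per N bases' error rate."""
    # Integer division keeps the arithmetic exact for these round figures.
    return genome_length_bases // bases_per_error


# Figures quoted in the abstract for Build 35.
BUILD_35_LENGTH = 2_850_000_000   # 2.85 billion nucleotides
BASES_PER_ERROR = 100_000         # about 1 event per 100,000 bases

if __name__ == "__main__":
    print(expected_error_count(BUILD_35_LENGTH, BASES_PER_ERROR))
```

    Under these assumptions the finished sequence would carry on the order of tens of thousands of residual error events across the whole euchromatic genome.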

    Impact of variation in body size (mg) and specimen age on the mean length of barcode sequences recovered from specimens in 66 families of Lepidoptera sampled from ANIC.

    Each circle represents a different family while the mean specimen age is the average for all species analyzed in a family. The mean body mass (mg) for species in a family is shown by colouration.

    Impact of specimen age on the percentage of specimens yielding either a barcode compliant (≥487 bp) sequence or a partial barcode sequence.

    Results are those for 12,031 specimens of Lepidoptera from the first ANIC batch. These results reflect the overall sequence length after amplification of 307 bp, 407 bp, 609 bp and 658 bp amplicons followed by failure tracking for 164 bp, 189 bp, and 259 bp amplicons.
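
    The caption's distinction between barcode compliant and partial sequences can be sketched as a simple length check. This is a minimal illustration that uses only the ≥487 bp threshold stated in the caption; the names `classify_recovery` and `percent_compliant` and the "no sequence" category are assumptions made here, and any further criteria the authors may have applied are not modelled.

```python
from typing import Iterable

# Threshold quoted in the caption; everything else here is illustrative.
BARCODE_COMPLIANT_MIN_BP = 487


def classify_recovery(sequence_length_bp: int) -> str:
    """Label a recovered sequence by length alone."""
    if sequence_length_bp >= BARCODE_COMPLIANT_MIN_BP:
        return "barcode compliant"
    if sequence_length_bp > 0:
        return "partial"
    return "no sequence"  # assumed category for specimens with no recovery


def percent_compliant(lengths_bp: Iterable[int]) -> float:
    """Percentage of specimens whose sequence meets the compliance threshold."""
    lengths = list(lengths_bp)
    if not lengths:
        return 0.0
    compliant = sum(1 for n in lengths if classify_recovery(n) == "barcode compliant")
    return 100.0 * compliant / len(lengths)
```

    Grouping specimens by age and calling `percent_compliant` on each group would give the kind of percentage the caption describes.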