73 research outputs found

    Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities

    Scholars in interdisciplinary fields like the Digital Humanities are increasingly interested in semantic annotation of specialized corpora. Yet under-resourced languages, imperfect or noisily structured data, and user-specific classification tasks make it difficult to meet their needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus, we propose an active learning solution for named entity recognition, attempting to maximize a custom model’s improvement per additional unit of manual annotation. Our system robustly handles any domain or user-defined label set and requires no external resources, enabling quality named entity recognition for Humanities corpora where such resources are not available. Evaluating on typologically disparate languages and datasets, we reduce required annotation by 20-60% and greatly outperform a competitive active learning baseline.
    Funding: New York University–Paris Sciences Lettres Global Alliance grant; National Endowment for the Humanities grant, award HAA-256078-17; Computational Approaches to Modeling Language lab at New York University Abu Dhabi.

    Analysis of BAC end sequences in oak, a keystone forest tree species, providing insight into the composition of its genome

    Background: One of the key goals of oak genomics research is to identify genes of adaptive significance. This information may help to improve the conservation of adaptive genetic variation and the management of forests to increase their health and productivity. Deep-coverage large-insert genomic libraries are a crucial tool for attaining this objective. We report herein the construction of a BAC library for Quercus robur, its characterization and an analysis of BAC end sequences. Results: The EcoRI library generated consisted of 92,160 clones, 7% of which had no insert. Levels of chloroplast and mitochondrial contamination were below 3% and 1%, respectively. Mean clone insert size was estimated at 135 kb. The library represents 12 haploid genome equivalents, and the likelihood of finding a particular oak sequence of interest is greater than 99%. Genome coverage was confirmed by PCR screening of the library with 60 unique genetic loci sampled from the genetic linkage map. In total, about 20,000 high-quality BAC end sequences (BESs) were generated by sequencing 15,000 clones. Roughly 5.88% of the combined BAC end sequence length corresponded to known retroelements, while ab initio repeat detection methods identified 41 additional repeats. Collectively, characterized and novel repeats account for roughly 8.94% of the genome. Further analysis of the BESs revealed 1,823 putative genes, suggesting at least 29,340 genes in the oak genome. BESs were aligned with the genome sequences of Arabidopsis thaliana, Vitis vinifera and Populus trichocarpa. One putative collinear microsyntenic region encoding an alcohol acyl transferase protein was observed between oak and chromosome 2 of V. vinifera. Conclusions: This BAC library provides a new resource for genomic studies, including SSR marker development, physical mapping, comparative genomics and genome sequencing. BES analysis provided insight into the structure of the oak genome. These sequences will be used in the assembly of a future genome sequence for oak.

    Genetic mapping of EST-derived simple sequence repeats (EST-SSRs) to identify QTL for leaf morphological characters in a Quercus robur full-sib family

    The availability of genomic resources such as expressed sequence tag-derived simple sequence repeat (EST-SSR) markers in adaptive genes with high transferability across related species allows the construction of genetic maps and the comparison of genome structure and quantitative trait loci (QTL) positions. In the present study, genetic linkage maps were constructed for both parents of a Quercus robur × Q. robur ssp. slavonica full-sib pedigree. A total of 182 markers (61 AFLPs, 23 nuclear SSRs, 98 EST-SSRs) and 172 markers (49 AFLPs, 21 nSSRs, 101 EST-SSRs, 1 isozyme) were mapped on the female and male linkage maps, respectively. The total map length and average marker spacing were 1,038 and 5.7 cM for the female map and 998.5 and 5.8 cM for the male map. A total of 68 nuclear SSRs and EST-SSRs segregating in both parents made it possible to define homologous linkage groups (LG) between the two parental maps. QTL for leaf morphological traits were mapped on all 12 LG at a chromosome-wide level and on 6 LG at a genome-wide level. The phenotypic effects explained by each single QTL ranged from 4.0% for leaf area to 15.8% for the number of intercalary veins. QTL clusters for leaf characters that discriminate between Q. robur and Quercus petraea were mapped reproducibly on three LG, and some putative candidate genes, among potentially many others, were identified on LG3 and LG5. Genetic linkage maps based on EST-SSRs can be valuable tools for the identification of genes involved in adaptive trait variation and for comparative mapping. © 2013 Springer-Verlag Berlin Heidelberg

    A linkage disequilibrium perspective on the genetic mosaic of speciation in two hybridizing Mediterranean white oaks

    We analyzed the genetic mosaic of speciation in two hybridizing Mediterranean white oaks from the Iberian Peninsula (Quercus faginea Lamb. and Quercus pyrenaica Willd.). The two species show ecological divergence in flowering phenology, leaf morphology and composition, and in their basic or acidic soil preferences. Ninety expressed sequence tag-simple sequence repeats (EST-SSRs) and eight nuclear SSRs were genotyped in 96 trees from each species. Genotyping was designed in two steps. First, we used 69 markers evenly distributed over the 12 linkage groups (LGs) of the oak linkage map to confirm the species genetic identity of the sampled genotypes, and searched for differentiation outliers. Then, we genotyped 29 additional markers from the chromosome bins containing the outliers and repeated the multilocus scans. We found one or two additional outliers within four saturated bins, thus confirming that outliers are organized into clusters. Linkage disequilibrium (LD) was extensive, even for loosely linked and for independent markers. Consequently, score tests for association between two-marker haplotypes and the 'species trait' showed a broad genomic divergence, although substantial variation across the genome and within LGs was also observed. We discuss the influence of several confounding effects on neutrality tests and review the evolutionary processes leading to extensive LD. Finally, we examine how LD analyses within regions that contain outlier clusters and quantitative trait loci can help to identify regions of divergence and/or genomic hitchhiking in the light of predictions from ecological speciation theory.