
    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition

    Named entity recognition identifies common classes of entities in text, but these entity labels are generally sparse, limiting utility to downstream tasks. In this work we present ScienceExamCER, a densely-labeled semantic classification corpus of 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels, including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. Semantic class labels are drawn from a manually constructed fine-grained typology of 601 classes generated through a data-driven analysis of 4,239 science exam questions. We show that an off-the-shelf BERT-based named entity recognition model modified for multi-label classification achieves an F1 score of 0.85 on this task, suggesting strong utility for downstream tasks in science-domain question answering that require densely-labeled semantic classification.
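The multi-label modification mentioned above can be sketched in a few lines: instead of the usual softmax-and-argmax that assigns one class per token, each class gets an independent sigmoid score, and every class above a threshold is kept. This is a minimal illustration under assumed names; the class labels and logit values are invented for the example and this is not the authors' code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_decode(token_logits, labels, threshold=0.5):
    """For each token, keep every class whose sigmoid score reaches the
    threshold, rather than a single argmax class."""
    decoded = []
    for logits in token_logits:
        kept = [lab for lab, z in zip(labels, logits)
                if sigmoid(z) >= threshold]
        decoded.append(kept)
    return decoded

# Hypothetical per-token logits over three illustrative class labels.
labels = ["TAXONOMIC_GROUP", "MERONYM", "PROPERTY"]
logits = [[2.0, -1.0, 0.3],    # token 1: two classes pass the threshold
          [-2.0, -1.5, -0.5]]  # token 2: no class passes
print(multilabel_decode(logits, labels))
# → [['TAXONOMIC_GROUP', 'PROPERTY'], []]
```

A token can thus receive zero, one, or several labels, which is what allows nearly every content word to carry fine-grained annotations.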

    The Processing of Lexical Sequences

    Psycholinguistics has traditionally been defined as the study of how we process units of language such as letters, words and sentences. But what about other units? This dissertation concerns itself with short lexical sequences called n-grams, longer than words but shorter than most sentences. N-grams can be phrases (such as the 3-gram "the great divide") or just fragments (such as the 4-gram "means nothing to a"). Words are often thought to be the universal, atomic building block of longer lexical sequences, but n-grams are equally capable of carrying meaning and being combined to create any sentence. Are n-grams more than just the sum of their parts (the sum of their words)? How do language users process n-grams when they are asked to read them or produce them? Using evidence that I have gathered, I will address these and other questions with the goal of better understanding n-gram processing.
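The n-grams described above are simply contiguous runs of tokens; a short sketch of how such sequences are enumerated from a sentence (illustrative only, not part of the dissertation):

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The fragment "means nothing to a" surfaces among the 4-grams here.
tokens = "it means nothing to a stranger".split()
print(ngrams(tokens, 4))
# → ['it means nothing to', 'means nothing to a', 'nothing to a stranger']
```

Note that most n-grams extracted this way are fragments rather than well-formed phrases, which is exactly the distinction the dissertation draws.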

    DARIAH and the Benelux


    The Reflection and Reification of Racialized Language in Popular Media

    This work highlights specific lexical items that have become racialized in specific contextual applications and tests how these words are cognitively processed. It presents the results of a visual world (Huettig et al. 2011) eye-tracking study designed to determine the perception and application of racialized (Coates 2011) adjectives. To select the racialized adjectives objectively, I developed a corpus of popular media sources designed specifically to suit my research question, collecting publications from digital media outlets such as Sports Illustrated, USA Today, and Fortune by scraping articles featuring specific search terms from their websites. This experiment seeks to aid in the demarcation of socially salient groups whose application of racialized adjectives to racialized images is near instantaneous, or at least less questioned. As growing social movements confront the significant marks that unconscious assumptions leave on American society, revealing how and where these lexical assignments arise and thrive allows us to interrogate the forces that build and reify such biases. Future research should attempt to address the harmful semiotics these lexical choices sustain.

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. 
Definitions can be retrieved for up to 78% of terms, and parent-child relations for up to 54%. No other validated system achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands the request using the structure and terminology of the ontology. The machine classification employed in Go3R distinguishes documents related to alternative methods from those that are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
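The ontology-driven query expansion described above can be sketched as follows: a query term is broadened with its synonyms and, recursively, with all narrower (child) terms in the hierarchy. The function and the toy ontology fragment are invented for illustration and do not reflect the actual Go3R implementation.

```python
def expand_query(term, synonyms, children):
    """Expand a query term with its synonyms and, recursively, with all
    narrower (child) terms and their synonyms from the ontology."""
    expanded = {term}
    expanded.update(synonyms.get(term, []))
    for child in children.get(term, []):
        expanded |= expand_query(child, synonyms, children)
    return expanded

# Toy ontology fragment (invented terms, not from the real Go3R ontology).
children = {"in vitro method": ["cell culture assay"]}
synonyms = {"in vitro method": ["non-animal method"],
            "cell culture assay": ["cell-based assay"]}
print(sorted(expand_query("in vitro method", synonyms, children)))
```

Expanding queries this way is what lets a search for a broad term also retrieve documents that mention only its narrower descendants or their synonyms.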