930 research outputs found

    State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"

    Several Networks of Excellence have been set up in the framework of the European FP5 research program. Among these Networks of Excellence, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization was identified as one of the key topics, and the WG1 working group was created within the NEMIS project to carry out a detailed survey of techniques associated with the text mining process and to identify the relevant research topics in related research areas. In this document we present the results of this comprehensive survey. The report includes a description of the current state of the art and practice, a roadmap for follow-up research in the identified areas, and recommendations for anticipated technological development in the domain of text mining. Comment: 54 pages, Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS).

    Sound structure and sound change: A modeling approach

    Research in linguistics, as in most other scientific domains, is usually approached in a modular way – narrowing the domain of inquiry in order to allow for increased depth of study. This is necessary and productive for a topic as wide-ranging and complex as human language. However, precisely because language is a complex system, tied to perception, learning, memory, and social organization, the assumption of modularity can also be an obstacle to understanding language at a deeper level. This book examines the consequences of enforcing non-modularity along two dimensions: the temporal, and the cognitive. Along the temporal dimension, synchronic and diachronic domains are linked by the requirement that sound changes must lead to viable, stable language states. Along the cognitive dimension, sound change and variation are linked to speech perception and production by requiring non-trivial transformations between acoustic and articulatory representations. The methodological focus of this work is on computational modeling. By formalising and implementing theoretical accounts, modeling can expose theoretical gaps and covert assumptions. To do so, it is necessary to formally assess the functional equivalence of specific implementational choices, as well as their mapping to theoretical structures. This book applies this analytic approach to a series of implemented models of sound change. As theoretical inconsistencies are discovered, possible solutions are proposed, incrementally constructing a set of sufficient properties for a working model. Because internal theoretical consistency is enforced, this model corresponds to an explanatorily adequate theory. And because explicit links between modules are required, this is a theory, not only of sound change, but of many aspects of phonological competence. 
The book highlights two aspects of modeling work that receive relatively little attention: the formal mapping from model to theory, and the scalability of demonstration models. Focusing on these aspects of modeling makes it clear that any specific theory of sound change is impossible without a more general theory of language: of the relationship between perception and production, the relationship between phonetics and phonology, the learning of linguistic units, and the nature of underlying representations. Theories of sound change that do not explicitly address these aspects of language are making tacit, untested assumptions about their properties. Addressing so many aspects of language may seem to complicate the linguist's task. However, as this book shows, it actually helps impose boundary conditions of ecological validity that reduce the theoretical search space.

    A Dependency Parsing Approach to Biomedical Text Mining

    Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. More specifically, the work described here aims to extract information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach combines full parsing (syntactic analysis of the entire structure of sentences) with machine learning, aiming to develop reliable methods that can be generalized to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications. Transferred from Doria.

    Prosody leaks into the memories of words

    The average predictability (aka informativity) of a word in context has been shown to condition word duration (Seyfarth, 2014). All else being equal, words that tend to occur in more predictable environments are shorter than words that tend to occur in less predictable environments. One account of the informativity effect on duration is that the acoustic details of probabilistic reduction are stored as part of a word's mental representation. Other research has argued that predictability effects are tied to prosodic structure in integral ways. With the aim of assessing a potential prosodic basis for informativity effects in speech production, this study extends past work in two directions: it investigates informativity effects in another large language, Mandarin Chinese, and broadens the study beyond word duration to additional acoustic dimensions, pitch and intensity, known to index prosodic prominence. The acoustic information of content words was extracted from a large telephone conversation speech corpus with over 400,000 tokens and 6,000 word types spoken by 1,655 individuals and analyzed for the effect of informativity using frequency statistics estimated from a 431 million word subtitle corpus. Results indicated that words with low informativity have shorter durations, replicating the effect found in English. In addition, informativity had significant effects on maximum pitch and intensity, two phonetic dimensions related to prosodic prominence. Extending this interpretation, these results suggest that predictability is closely linked to prosodic prominence, and that the lexical representation of a word includes phonetic details associated with its average prosodic prominence in discourse. In other words, the lexicon absorbs prosodic influences on speech production. Comment: 41 pages, 1 figure
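To make the notion of informativity concrete, the sketch below computes it on a toy corpus. It assumes a common formalisation, informativity(w) = -Σ_c P(c | w) · log2 P(w | c), and takes the single preceding word as the context c. The function name `informativity`, the bigram context, and the toy sentence are illustrative assumptions, not the estimation procedure used in the study.

```python
import math
from collections import Counter, defaultdict

def informativity(tokens):
    """Return {word: informativity in bits}, using the preceding word as context.

    Assumption: context = the single preceding token (a bigram model),
    estimated by raw relative frequencies with no smoothing.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))  # (context, word) counts
    context_totals = Counter(tokens[:-1])       # how often each token acts as a context
    word_totals = Counter(tokens[1:])           # how often each token follows a context
    scores = defaultdict(float)
    for (context, word), n in bigrams.items():
        p_word_given_context = n / context_totals[context]
        p_context_given_word = n / word_totals[word]
        # Average the word's surprisal over the contexts it occurs in.
        scores[word] -= p_context_given_word * math.log2(p_word_given_context)
    return dict(scores)

# Hypothetical toy corpus, for illustration only.
tokens = "the cat sat on the mat the cat ran".split()
scores = informativity(tokens)
for word, bits in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word}: {bits:.3f} bits")
```

Words that are fully predictable from their context (here "on", which always follows "sat") receive an informativity of zero, while words that follow a less constraining context (here "mat") score higher.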

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. 
Definitions can be retrieved for up to 78% of terms, and child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not, with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.

    Sound structure and sound change

    Research in linguistics, as in most other scientific domains, is usually approached in a modular way – narrowing the domain of inquiry in order to allow for increased depth of study. This is necessary and productive for a topic as wide-ranging and complex as human language. However, precisely because language is a complex system, tied to perception, learning, memory, and social organization, the assumption of modularity can also be an obstacle to understanding language at a deeper level. This book examines the consequences of enforcing non-modularity along two dimensions: the temporal, and the cognitive. Along the temporal dimension, synchronic and diachronic domains are linked by the requirement that sound changes must lead to viable, stable language states. Along the cognitive dimension, sound change and variation are linked to speech perception and production by requiring non-trivial transformations between acoustic and articulatory representations