39,520 research outputs found

    Using distributional similarity to organise biomedical terminology

    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
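
As a rough illustration of the idea in this abstract, the sketch below compares terms by the syntactic contexts they occur in. It assumes cosine similarity as the measure (the abstract only says "a variety of different measures"), and the terms, relations, and counts are invented toy data, not taken from GENIA or from Pro3Gres output.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse context-count vectors."""
    # Counter returns 0 for contexts that b has never seen.
    dot = sum(count * b[ctx] for ctx, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (relation, word) contexts per term; all counts are made up.
contexts = {
    "IL-2": Counter({("subj_of", "activate"): 3, ("obj_of", "express"): 2}),
    "IL-4": Counter({("subj_of", "activate"): 2, ("obj_of", "express"): 3}),
    "T cell": Counter({("obj_of", "activate"): 4, ("subj_of", "express"): 1}),
}


def nearest_neighbour(term: str, contexts: dict) -> str:
    """Return the other term whose context vector is most similar."""
    others = (t for t in contexts if t != term)
    return max(others, key=lambda t: cosine_similarity(contexts[term], contexts[t]))


print(nearest_neighbour("IL-2", contexts))
```

Under this reading, a term's semantic type could be guessed from the type of its nearest neighbours, but how the paper actually maps similarity to GENIA ontology classes is not stated in the abstract.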

    Fat-tailed fluctuations in the size of organizations: the role of social influence

    Organizational growth processes have consistently been shown to exhibit a fatter-than-Gaussian growth-rate distribution in a variety of settings. Long periods of relatively small changes are interrupted by sudden changes on all size scales. Extreme events of this kind can have important consequences for the development of biological and socio-economic systems. Existing models do not derive this aggregated pattern from agent actions at the micro level. We develop an agent-based simulation model on a social network. We take our departure in a model by Schwarzkopf et al. on a scale-free network. We reproduce the fat-tailed pattern out of internal dynamics alone, and also find that it is robust with respect to network topology. Thus, the social network and the local interactions are a prerequisite for generating the pattern, but not the network topology itself. We further extend the model with a parameter $\delta$ that weights the relative fraction of an individual's neighbours belonging to a given organization, representing a contextual aspect of social influence. In the lower limit of this parameter, the fraction is irrelevant and choice of organization is random. In the upper limit of the parameter, the largest fraction quickly dominates, leading to a winner-takes-all situation. We recover the real pattern as an intermediate case between these two extremes. Comment: 15 pages, 4 figures
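
The two limits described for the weighting parameter can be illustrated with a small sketch. The functional form used here, choice probability proportional to the neighbour fraction raised to the power `delta`, is purely an assumption made for illustration; the abstract does not give the actual rule, and the function names and example data are hypothetical.

```python
import random


def choice_weights(neighbour_orgs, organizations, delta):
    """Unnormalised choice weights: (fraction of neighbours in org) ** delta.

    ASSUMPTION: this power-law form is illustrative only, not the paper's rule.
    neighbour_orgs must be non-empty.
    """
    total = len(neighbour_orgs)
    fractions = [neighbour_orgs.count(org) / total for org in organizations]
    # In Python 0.0 ** 0 == 1.0, so delta = 0 gives equal weights (random choice).
    return [f ** delta for f in fractions]


def choose_organization(neighbour_orgs, organizations, delta, rng):
    """Draw one organization according to the delta-weighted fractions."""
    weights = choice_weights(neighbour_orgs, organizations, delta)
    return rng.choices(organizations, weights=weights)[0]


# Hypothetical neighbourhood: 6 neighbours in A, 3 in B, 1 in C.
organizations = ["A", "B", "C"]
neighbours = ["A"] * 6 + ["B"] * 3 + ["C"]
rng = random.Random(0)

for delta in (0, 1, 50):
    draws = [choose_organization(neighbours, organizations, delta, rng) for _ in range(1000)]
    print(delta, {org: draws.count(org) for org in organizations})
```

With `delta = 0` the neighbour fractions play no role, while a large `delta` concentrates almost all choices on the largest fraction, mirroring the random and winner-takes-all limits named in the abstract.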

    Taxonomy for Humans or Computers? Cognitive Pragmatics for Big Data

    Criticism of big data has focused on showing that more is not necessarily better, in the sense that data may lose their value when taken out of context and aggregated together. The next step is to incorporate an awareness of pitfalls for aggregation into the design of data infrastructure and institutions. A common strategy minimizes aggregation errors by increasing the precision of our conventions for identifying and classifying data. As a counterpoint, we argue that there are pragmatic trade-offs between precision and ambiguity that are key to designing effective solutions for generating big data about biodiversity. We focus on the importance of theory-dependence as a source of ambiguity in taxonomic nomenclature and hence a persistent challenge for implementing a single, long-term solution to storing and accessing meaningful sets of biological specimens. We argue that ambiguity does have a positive role to play in scientific progress as a tool for efficiently symbolizing multiple aspects of taxa and mediating between conflicting hypotheses about their nature. Pursuing a deeper understanding of the trade-offs and synthesis of precision and ambiguity as virtues of scientific language and communication systems then offers a productive next step for realizing sound, big biodiversity data services.

    Ontologies and Information Extraction

    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of text are interpreted with respect to a predefined partial domain model. This report shows that, depending on the nature and the depth of the interpretation needed to extract the information, more or less knowledge must be involved. The report is mainly illustrated with examples from biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which is becoming a major application domain for IE.

    Mean-field methods in evolutionary duplication-innovation-loss models for the genome-level repertoire of protein domains

    We present a combined mean-field and simulation approach to different models describing the dynamics of classes formed by elements that can appear, disappear or copy themselves. These models, related to a paradigmatic duplication-innovation model known as the Chinese Restaurant Process, are devised to reproduce the scaling behavior observed in the genome-wide repertoire of protein domains of all known species. In view of these data, we discuss the qualitative and quantitative differences of the alternative model formulations, focusing in particular on the roles of element loss and of the specificity of empirical domain classes. Comment: 10 figures, 2 tables
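
For readers unfamiliar with the Chinese Restaurant Process named in this abstract, here is a minimal simulation of the standard one-parameter version only. It covers innovation (a new class appears) and duplication (an existing class grows) but not the element loss or class specificity that the paper's models add; the parameter name `alpha` and the function name are choices made for this sketch.

```python
import random


def chinese_restaurant_process(n_elements, alpha, seed=None):
    """Simulate the standard CRP and return the list of class sizes.

    alpha > 0 is the concentration parameter: larger values favour innovation.
    """
    rng = random.Random(seed)
    class_sizes = []
    for n in range(n_elements):
        # n elements are already placed. Innovation happens with probability
        # alpha / (n + alpha), which equals 1 for the very first element.
        if rng.random() < alpha / (n + alpha):
            class_sizes.append(1)
        else:
            # Duplication: pick an existing class with probability
            # proportional to its current size.
            idx = rng.choices(range(len(class_sizes)), weights=class_sizes)[0]
            class_sizes[idx] += 1
    return class_sizes


sizes = chinese_restaurant_process(n_elements=1000, alpha=2.0, seed=0)
print(len(sizes), "classes; largest has", max(sizes), "elements")
```

Because larger classes are more likely to be copied, a few classes grow very large while most stay small, which is the kind of skewed class-size distribution such duplication-innovation models are meant to capture.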