31 research outputs found

    CORLEONE - Core Linguistic Entity Online Extraction

    This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant for the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented.

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field. A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain).
This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Implicit Entity Networks: A Versatile Document Model

    The time in which we live is often referred to as the Information Age. However, it can also aptly be characterized as an age of constant information overload. Nowhere is this more present than on the Web, which serves as an endless source of news articles, blog posts, and social media messages. Of course, this overload is even greater in professions that handle the creation or extraction of information and knowledge, such as journalists, lawyers, researchers, clerks, or medical professionals. The volume of available documents and the interconnectedness of their contents are both a blessing and a curse for the contemporary information consumer. On the one hand, they provide near limitless information, but on the other hand, their consumption and comprehension require an amount of time that many of us cannot spare. As a result, automated extraction, aggregation, and summarization techniques have risen in popularity, even though they are a long way from being comprehensive. When we, as humans, are faced with an overload of information, we tend to look for patterns that bring order into the chaos. In news, we might identify familiar political figures or celebrities, whereas we might look for expressive symptoms in medicine, or precedential cases in law. In other words, we look for known entities as reference points, and then explore the content along the lines of their relations to other entities. Unfortunately, this approach is not reflected in current document models, which do not provide a similar focus on entities. As a direct result, the retrieval of entity-centric knowledge and relations from a flood of textual information becomes more difficult than it has to be, and the inclusion of external knowledge sources is impeded. In this thesis, we introduce implicit entity networks as a comprehensive document model that addresses this shortcoming and provides a holistic representation of document collections and document streams.
Based on the premise of modelling the cooccurrence relations between terms and entities as first-class citizens, we investigate how the resulting network structure facilitates efficient and effective entity-centric search, and demonstrate the extraction of complex entity relations, as well as their summarization. We show that the implicit network model is fully compatible with dynamic streams of documents. Furthermore, we introduce document aggregation methods that are sensitive to the context of entity mentions, and can be used to distinguish between different entity relations. Beyond the relations of individual entities, we introduce network topics as a novel and scalable method for the extraction of topics from collections and streams of documents. Finally, we combine the insights gained from these applications in a versatile hypergraph document model that bridges the gap between unstructured text and structured knowledge sources.

    A teachable semi-automatic web information extraction system based on evolved regular expression patterns

    This thesis explores Web Information Extraction (WIE) and how it has been used in decision making and to support businesses in their daily operations. The research focuses on a WIE system based on Genetic Programming (GP) with an extensible model to enhance the automatic extractor. The system uses a human as a teacher to identify and extract relevant information from semi-structured HTML webpages. Regular expressions, which have been chosen as the pattern matching tool, are automatically generated based on the training data to provide an improved grammar and lexicon. This particularly benefits the GP system, which may need to extend its lexicon in the presence of new tokens in the web pages. These tokens allow the GP method to produce new extraction patterns for new requirements.

    Prehistoric dwelling : circular structures in north and central Britain c 2500 BC - AD 500

    EThOS - Electronic Theses Online Service (United Kingdom)

    Ciaran Carson

    Ciaran Carson is one of the most challenging and inventive of contemporary Irish writers, exhibiting verbal brilliance, formal complexity, and intellectual daring across a remarkably varied body of work. This study considers the full range of his oeuvre, in poetry, prose, and translations, and discusses the major themes to which he returns, including: memory and history, narrative, language and translation, mapping, violence, and power. It argues that the singularity of Carson’s writing is to be found in his radical imaginative engagements with ideas of space and place. The city of Belfast, in particular, occupies a crucially important place in his texts, serving as an imaginative focal point around which his many other concerns are constellated. The city, in all its volatile mutability, is an abiding frame of reference and a reservoir of creative impetus for Carson’s imagination. Accordingly, the book adopts an interdisciplinary approach that draws upon geography, urbanism, and cultural theory as well as literary criticism. It provides both a stimulating and thorough introduction to Carson’s work, and a flexible critical framework for exploring literary representations of space.

    Machine Learning

    Machine Learning can be defined in various ways, all relating to a scientific domain concerned with the design and development of theoretical and implementation tools for building systems with some human-like intelligent behavior. More specifically, machine learning addresses the ability to improve automatically through experience.

    The taberna structures of Roman Britain

    The aim of this thesis is to explain how the shops (tabernae) of Roman Britain related to society. Buildings of a more humble nature, including tabernae, have frequently been overlooked in favour of the more ornate public buildings and villas. This thesis proposes to redress this imbalance, as it is believed that retailing and manufacture were among the most crucial features of Roman society. Varied sources have been used to aid this hypothetical reconstruction, including the excavated archaeological remains, the extant remains from other parts of the empire, and the ancient literary sources. Although these provide a wealth of information, they are by themselves limited in what they can reveal about their society. Anthropological and geographical studies have proved an immensely useful tool to illuminate other aspects of society. These were approached with great circumspection and examined in relation to the archaeological evidence. Using all this information, the thesis attempts to describe and explain the major factors that helped to create the form and geographical pattern of retail establishments in Roman Britain. It is argued that the tabernae were more responsive to, and give a more accurate picture of, the social and economic climate of Roman Britain than any other building type. It appears that the Romano-British community was well catered for in life's necessities, with a wide variety of merchandise supplied by tabernae. The development of tabernae is difficult to summarise, as more than any other building type they were subject to a multitude of varied and individual circumstances, but it can be demonstrated that a thriving and competitive retailing community existed in the major settlements of Roman Britain.

    Neolithic building technology and the social context of construction practices: the case of northern Greece

    This thesis addresses building technology and the social implications of house construction contributing to the understanding of past societies. The spatiotemporal context of the study is the Neolithic period (ca. 6600/6500–3300/3200 cal BC) in northern Greece (Macedonia and Thrace). All available evidence from various excavations in the region is assembled and synthesised. The principal house types (semi-subterranean structures and above-ground dwellings) and their technological characteristics in terms of materials and techniques are discussed. In addition, the building remains from the late Middle/Late Neolithic settlement of Avgi (Kastoria, Greece) are thoroughly examined. Their study highlights the potentials of a detailed, micro-scale investigation and puts forth a methodology for the technological analysis of house rubble in the form of fire-hardened daub. The data deriving from both the survey of dwelling remains in northern Greece and the case study are examined within their wider sociocultural context. The technological repertoire of the region, although indicating the sharing of a common ‘architectural vocabulary’, reveals alternative chaînes opératoires and variability in different stages of the building process. Variability and patterning are more pronounced during the later stages of the Neolithic. The distribution of architectural choices does not suggest the existence of established and region-wide shared architectural traditions. However, the circulation of specific techniques and conceptions points to the operation of overlapping networks of technological and social interaction. At the site-specific scale, sameness and standardisation in building technology are the prominent themes. Nevertheless, different trends towards standardisation or variability are observed and are approached in terms of social interaction and intra-community dynamics. What is more, domestic architecture is not necessarily static in the long term. 
Change occurs and is often associated with the transformation of these dynamics. Occasional evidence of intra-site variability in building techniques and the more pronounced anchoring into space during the later stages of the Neolithic period are considered as a result of the changing relationship between social units and the community. The appearance of stone and mud(brick) architecture in Late Neolithic central Macedonia is approached in these terms.