1,373 research outputs found

    Management of Scientific Images: An approach to the extraction, annotation and retrieval of figures in the field of High Energy Physics

    The information environment of the first decade of the twenty-first century is unprecedented. The physical barriers that have limited access to knowledge are disappearing as traditional methods of accessing information are replaced or enhanced by computer-based systems. Digital systems can manage much larger document collections, confronting information users with a flood of documents related to their topic of interest. This new situation has created an incentive to develop data mining techniques and more efficient search engines that can limit search results to a small subset of the most relevant ones. However, most current search engines work with textual descriptions. These descriptions can be extracted either from the content or from external sources. Retrieval based on the non-textual content of documents is a subject of ongoing research. In particular, image retrieval and unravelling the information contained in images are attracting great interest in the scientific community. Digital libraries hold a special position among the systems that facilitate access to knowledge. They act as repositories of documents that share some common characteristics (for example, belonging to the same area of knowledge or being published by the same institution) and as such contain documents considered of interest to a particular group of users. In addition, they provide retrieval functionality over the collections they manage. Scientific publications are normally the smallest units managed by scientific digital libraries. However, the process of scientific creation involves other types of artifacts, among them figures and datasets.
Figures play a particularly important role in the scientific publication process. They represent data in a graphical form that reveals patterns in large datasets and conveys complex ideas in an easily understandable way. Existing digital library systems provide access to figures, but only as part of the files in which the entire publication is serialized. The aim of this thesis is to propose a set of methods and techniques that turn figures into first-class products of the scientific publication process, so that researchers can get the most benefit when searching and reviewing existing literature. The proposed methods and techniques are aimed at facilitating the acquisition, semantic annotation and search of figures contained in scientific publications. To demonstrate the completeness of the research, the proposed theories are illustrated with examples from the field of Particle Physics (also known as High Energy Physics). Where more detailed proposals were needed, the work concentrates on the figures that appear most frequently in Particle Physics publications: the scientific graphs known as plots. The prototypes developed for this thesis have been partially integrated into the Invenio digital library software (http://invenio-software.org) as well as into INSPIRE, one of the largest digital libraries in Particle Physics, maintained through the collaboration of major laboratories and research centres such as CERN, SLAC, DESY and Fermilab.

    Report on the 2015 NSF Workshop on Unified Annotation Tooling

    On March 30 and 31, 2015, an international group of twenty-three researchers with expertise in linguistic annotation convened in Sunny Isles Beach, Florida to discuss problems with and potential solutions for the state of linguistic annotation tooling. The participants comprised 14 researchers from the U.S. and 9 from outside the U.S., with 7 countries and 4 continents represented, and hailed from fields and specialties including computational linguistics, artificial intelligence, speech processing, multi-modal data processing, clinical and medical natural language processing, linguistics, documentary linguistics, sign-language linguistics, corpus linguistics, and the digital humanities. The motivating problem of the workshop was the balkanization of annotation tooling: even though linguistic annotation requires sophisticated tool support to efficiently generate high-quality data, the landscape of tools for the field is fractured, incompatible, inconsistent, and lacks key capabilities. The overall goal of the workshop was to chart the way forward, centering on five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And (5) How should we ensure longevity and sustainability of the solution? I surveyed the participants before their arrival, which provided significant raw material for ideas, and the workshop discussion itself resulted in the identification of ten specific classes of problems and five sets of most-needed capabilities. Importantly, we identified annotation project managers in computational linguistics as the key recipients and users of any solution, thereby succinctly addressing questions about the scope and audience of potential solutions. We discussed management and sustainability of potential solutions at length.
The participants agreed on sixteen recommendations for future work. This technical report contains a detailed discussion of all these topics, a point-by-point review of the discussion in the workshop as it unfolded, detailed information on the participants and their expertise, and the summarized data from the surveys.

    Ontology-Driven Semantic Annotations for Multiple Engineering Viewpoints in Computer Aided Design

    Engineering design involves a series of data-handling activities, including capturing, storing, retrieving and manipulating data. The same applies throughout the entire product lifecycle (PLC). Unfortunately, a closed-loop knowledge and information management system has not been implemented for the PLC. As part of product lifecycle management (PLM) approaches, computer-aided design (CAD) systems are extensively used from the embodiment and detail design stages in mechanical engineering. However, current CAD systems cannot handle semantically rich information, and therefore cannot represent, manage and use knowledge shared among multidisciplinary engineers or integrate various tools and services with distributed data and knowledge. To address these challenges, a general-purpose semantic annotation approach based on CAD systems in the mechanical engineering domain is proposed, which contributes to knowledge management and reuse, data interoperability and tool integration. In present-day PLM systems, annotation approaches are embedded in software applications and use diverse data and anchor representations, making them static, inflexible and difficult to incorporate with external systems. This research argues that it is possible to take a generalised approach to annotation, with formal annotation content structures and anchoring mechanisms described using general-purpose ontologies. In this way viewpoint-oriented annotation may readily be captured, represented and incorporated into PLM systems together with existing annotations in a common framework, and the knowledge collected or generated from multiple engineering viewpoints may be reasoned with to derive additional knowledge to enable downstream processes. Knowledge can therefore be propagated and evolved through the PLC. Within this framework, a knowledge modelling methodology has also been proposed for developing knowledge models in various situations.
In addition, a prototype system has been designed and developed to evaluate the core contributions of the proposed concept. Following an evaluation plan, cost estimation and finite element analysis have been used as case studies to validate the usefulness, feasibility and generality of the proposed framework, and the results of this evaluation are discussed. In conclusion, the presented research has met its original aim and objectives and can be improved further. Finally, some research directions are suggested.

    GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora

    Ground-truth datasets are essential for the training and evaluation of any automated algorithm. As such, gold-standard annotated corpora underlie most advances in natural language processing (NLP). However, only a few relatively small (geo-)annotated datasets are available for geoparsing, i.e., the automatic recognition and geolocation of place references in unstructured text. The creation of geoparsing corpora that include both the recognition of place names in text and the matching of those names to toponyms in a geographic gazetteer (a process we call geo-annotation) is a laborious, time-consuming and expensive task. The field lacks efficient geo-annotation tools to support corpus building and lacks design guidelines for the development of such tools. Here, we present the iterative design of GeoAnnotator, a web-based, semi-automatic and collaborative visual analytics platform for geo-annotation. GeoAnnotator facilitates collaborative, multi-annotator creation of large corpora of geo-annotated text by producing computationally generated pre-annotations that human annotators can then improve. The resulting corpora can be used in improving and benchmarking geoparsing algorithms as well as various other spatial language-related methods. Further, the iterative design process and the resulting design decisions can be used in annotation platforms tailored for other application domains of NLP.

    Community-driven & Work-integrated Creation, Use and Evolution of Ontological Knowledge Structures


    From Texts to Prerequisites. Identifying and Annotating Propaedeutic Relations in Educational Textual Resources

    Prerequisite Relations (PRs) are dependency relations established between two distinct concepts, expressing which piece(s) of information a student has to learn first in order to understand a certain target concept. Such relations are among the most fundamental in Education, playing a crucial role not only in the acquisition of new knowledge but also in novel applications of Artificial Intelligence to distance and e-learning. Indeed, resources annotated with such information could be used to develop automatic systems able to acquire and organize the knowledge embodied in educational resources, possibly fostering educational applications personalized to, e.g., students' needs and prior knowledge. The present thesis discusses the issues and challenges of identifying PRs in educational textual materials, with the purpose of building a shared understanding of the relation among the research community. To this aim, we present a methodology for dealing with prerequisite relations as established in educational textual resources, which aims at providing a systematic approach for uncovering PRs in textual materials, both when manually annotating and when automatically extracting the PRs. The fundamental principles of our methodology guided the development of a novel framework for PR identification which comprises three components, each tackling a different task: (i) an annotation protocol (PREAP), reporting the set of guidelines and recommendations for building PR-annotated resources; (ii) an annotation tool (PRET), supporting the creation of manually annotated datasets reflecting the principles of PREAP; (iii) an automatic PR learning method based on machine learning (PREL).
The main novelty of our methodology and framework lies in the fact that we propose to uncover PRs from textual resources relying solely on the content of the instructional material: unlike other works, rather than creating de-contextualised PRs, we acknowledge the presence of a PR between two concepts only if it emerges from the way they are presented in the text. By doing so, we anchor relations to the text while modelling the knowledge structure entailed in the resource. As an original contribution of this work, we explore whether the linguistic complexity of the text influences the task of manual identification of PRs. To this aim, we investigate the interplay between text and content in educational texts through a crowd-sourcing experiment on concept sequencing. Our methodology values the content of educational materials, as it incorporates the evidence acquired from this investigation, which suggests that PR recognition is highly influenced by the way in which concepts are introduced in the resource and by the complexity of the texts. The thesis reports a case study dealing with every component of the PR framework, which produced a novel manually labelled PR-annotated dataset.
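As a minimal illustration of what a set of prerequisite relations encodes, the sketch below treats PRs as edges of a directed graph and derives one admissible concept sequence from them. This is not PREAP, PRET or PREL; the concept names, the `pr_pairs` list and the `learning_order` helper are hypothetical and chosen only for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical PR annotations as (prerequisite, target) pairs.
pr_pairs = [
    ("variable", "loop"),
    ("variable", "function"),
    ("loop", "recursion"),
    ("function", "recursion"),
]

def learning_order(pairs):
    """Return one ordering in which every prerequisite precedes its target."""
    graph = {}
    for prerequisite, target in pairs:
        # TopologicalSorter expects a mapping: node -> set of predecessors.
        graph.setdefault(target, set()).add(prerequisite)
        graph.setdefault(prerequisite, set())
    # Raises graphlib.CycleError if the annotations contain a cycle.
    return list(TopologicalSorter(graph).static_order())

order = learning_order(pr_pairs)
print(order)
```

A cycle in the annotations (two concepts each marked as a prerequisite of the other) raises `graphlib.CycleError`, which could serve as a simple consistency check on a manually annotated dataset.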

    Supporting Internet Search by Search-Log Publishing

    This thesis is part of ongoing collective research whose broader aim is primarily to improve Internet search support so that complex, time-consuming and often exploratory search tasks can be carried out faster and more efficiently. The main research problem of the work is the development of a new type of framework for logging search tasks and sharing them on the Internet, as an alternative to existing browser plug-in based methods. This was a complex engineering task in which the author had to carry out a variety of tasks related to programming, planning, and the integration and configuration of system components. The goal was successfully achieved. The thesis proposes a proxy-based method for logging users' search behaviour that is also easily adaptable to different web browsers and operating systems. The solution was compared with earlier similar systems. The method arose from a real need to find a more easily maintainable and portable replacement for previously developed software, a plug-in for the Mozilla Firefox web browser that had to be fixed after every new browser release. The implementation consists of two major components. The first and technically more complex one, the system for compiling and sharing search task logs, resides in a VirtualBox virtual machine. The second is a WordPress-based search log repository which, in addition to publishing user-annotated logs, also allows simple searches over them. The systems have been thoroughly tested but have not yet been used in user studies related to Internet search. The author is aware that such interest exists both within the University of Tartu and at one foreign partner university. The locally hosted system for compiling and sharing search tasks consists of three equally important sub-components: a search task logger implemented in Python; a web interface, built mainly with PHP and HTML, which among other things lets the user switch the logger on and off and manually edit and supplement all data related to the search task; and a Privoxy web proxy server specially configured for this purpose. The work gives a thorough overview of existing software, scientific publications and theoretical foundations relating to the research problem. Compared with existing methods, the author's proxy-based framework for logging and sharing search tasks differs mainly in two respects. First, the method ensures platform and browser independence while also being very stable. Second, the freedom given to users to define and annotate their own search tasks is an important new milestone. The final chapter discusses future prospects and open problems. One of them is an architecture modified from the proposed one that would allow laboratory experiments to be run with less effort and time. The Internet search logging system can be developed further by adding support for more JavaScript events. The search log repository, still fairly rudimentary, offers plenty of opportunities for future extensions.
The main research problem of my thesis was engineering a new type of search task logging and publishing framework that would provide a better alternative to existing browser plug-in based methods. Right from the start, the proxy-based search task reporting system has been a complex engineering challenge involving code written in multiple programming languages, interactions planned across many software modules (some of which were already large existing projects themselves), and a Linux operating system configured to ease the set-up process for the user. These design decisions were made to ensure that the solution is reliable, extensible and maintainable in the future.
My research goal was completed successfully. In my thesis, I proposed a proxy-based method for logging user search behaviour across different browsers and operating systems. I also compared it with an existing plug-in based Search Logger for Mozilla Firefox and other similar solutions. The idea of developing a proxy-based search task logging and publishing solution arose out of necessity, because the existing logging solution had significant problems with maintainability. The logs created by my solution are subsequently annotated by the user and made publicly available on a dedicated Internet blog called the Search Task Repository. Users can search against the already annotated and published Internet search logs. Ideally this would reduce the complexity of search tasks for users, which in turn saves time. User studies to confirm this are still pending, but there is confirmed interest from Tartu researchers as well as from one foreign university in using my solution in their search experiments. The proposed solution comprises two large units: the search task repository and the search task logging and publishing unit. The search task repository is a remote component, essentially a fairly simple WordPress blog, which enables search stories to be published automatically over the XML-RPC protocol, search queries to be served, and search task logs to be displayed to the searcher. My logging system is configured as a VirtualBox virtual machine. It is much more complex, consisting of three sub-components: the main Web interface, the search task logger, and the Privoxy Web proxy specially configured for my needs. Logging can be started and stopped at the user's will in the main Web interface. What is more, this sub-component also gives users full control over what gets published online by providing editing and annotation functionality for all search task data, both implicitly and explicitly logged.
My thesis gave a comprehensive theoretical overview of the state of the art, explaining basic related concepts in Information Retrieval and recent developments in Exploratory Search and search task logging systems. In contrast with existing browser plug-in based search task logging methods, my proposed proxy-based approach ensures platform and browser independence while also being very stable. By giving searchers the opportunity to freely define and annotate their own search tasks, my search support solution sets a new standard. In the final chapter, I conducted a thorough analysis of future work and presented my own vision of the future opportunities for this search support methodology. A modified architecture for more convenient laboratory experiments was outlined as an important task for the future. In conclusion, my proxy-based search task logging, editing and publishing framework can be extended further to log more JavaScript events. The search task repository is a large open area with many opportunities for future extensions.
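The publishing step described above (an annotated search task sent to a WordPress blog over XML-RPC) might look roughly like the following sketch. It assumes WordPress's standard `wp.newPost` XML-RPC method; the abstract does not say which XML-RPC method the thesis actually uses, and the `build_post` and `publish` helpers, the post layout and the example URLs are hypothetical.

```python
import xmlrpc.client

def build_post(task_title, annotation, visited_urls):
    """Assemble a WordPress post struct from an annotated search task.

    The layout of the post body is a hypothetical choice for this sketch.
    """
    lines = [annotation, "", "Visited pages:"]
    lines += [f"- {url}" for url in visited_urls]
    return {
        "post_type": "post",
        "post_status": "draft",  # stays a draft until the user decides to publish
        "post_title": task_title,
        "post_content": "\n".join(lines),
    }

def publish(endpoint, username, password, post, blog_id=0):
    """Send the post with WordPress's wp.newPost method; returns the new post ID."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.wp.newPost(blog_id, username, password, post)

# Build a post locally; no network access happens until publish() is called.
post = build_post(
    "Find proxy-based logging tools",
    "Compared three candidate tools.",
    ["https://example.org/a", "https://example.org/b"],
)
print(post["post_content"])
```

Calling `publish` with the blog's `xmlrpc.php` endpoint and valid credentials would perform the actual upload. Building the post separately from sending it mirrors the abstract's point that the user edits and annotates the log before anything goes online.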

    Role of Semantic web in the changing context of Enterprise Collaboration

    In order to compete with the global giants, enterprises are concentrating on their core competencies and collaborating with organizations that complement their skills and core activities. The current trend is to develop temporary alliances of independent enterprises, in which companies can come together to share skills, core competencies and resources. However, knowledge sharing and communication among multidisciplinary companies is a complex and challenging problem. In a collaborative environment, the meaning of knowledge is drastically affected by the context in which it is viewed and interpreted, which makes it necessary to treat both the structure and the semantics of the data stored in enterprise repositories. Keeping the present market and technological scenario in mind, this research aims to propose tools and techniques that can enable companies to assimilate distributed information resources and achieve their business goals.

    Linguistic Approaches for Early Measurement of Functional Size from Software Requirements

    The importance of early effort estimation, resource allocation and overall quality control in a software project has led the industry to formulate several functional size measurement (FSM) methods that are based on the knowledge gathered from software requirements documents. The main objective of this research is to develop a comprehensive methodology to facilitate and automate early measurement of a software's functional size from its requirements document written in unrestricted natural language. For the purpose of this research, we have chosen to use the FSM method developed by the Common Software Measurement International Consortium (COSMIC) and adopted as an international standard by the International Organization for Standardization (ISO). This thesis presents a methodology to measure the COSMIC size objectively from various textual forms of functional requirements and also builds conceptual measurement models to establish traceability links between the output measurements and the input requirements. Our research investigates the feasibility of automating every major phase of this methodology with natural language processing and machine learning approaches. The thesis provides a step-by-step validation and demonstration of the implementation of this innovative methodology. It describes the details of empirical experiments conducted to validate the methodology with practical samples of textual requirements collected from both industry and academia. Analysis of the results shows that each phase of our methodology can successfully be automated and, in most cases, leads to an accurate measurement of functional size.
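For orientation, the final arithmetic of a COSMIC measurement can be sketched as follows: each data movement of type Entry, Exit, Read or Write contributes one COSMIC Function Point (CFP), and the size of a functional process is the count of its data movements. The sketch assumes the data movements have already been identified; extracting them from natural-language requirements is the hard part the thesis addresses and is not shown here. The process names and the `cosmic_size` helper are hypothetical.

```python
MOVEMENT_TYPES = {"Entry", "Exit", "Read", "Write"}

def cosmic_size(movements):
    """Return the size in CFP of one functional process (1 CFP per data movement)."""
    unknown = set(movements) - MOVEMENT_TYPES
    if unknown:
        raise ValueError(f"unknown data movement types: {sorted(unknown)}")
    return len(movements)

# Hypothetical functional processes with already-identified data movements.
processes = {
    "Register customer": ["Entry", "Read", "Write", "Exit"],
    "View order": ["Entry", "Read", "Exit"],
}

sizes = {name: cosmic_size(moves) for name, moves in processes.items()}
total = sum(sizes.values())
print(sizes, total)
```

Keeping the per-process sizes alongside the total preserves a simple form of the traceability the abstract mentions, since each number can be traced back to the listed movements.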