39 research outputs found

    Collaborative Development and Evaluation of Text-processing Workflows in a UIMA-supported Web-based Workbench

    Challenges in creating comprehensive text-processing workflows include a lack of interoperability between individual components coming from different providers and/or a requirement imposed on end users to know programming techniques to compose such workflows. In this paper we demonstrate Argo, a web-based system that addresses these issues in several ways. It supports the widely adopted Unstructured Information Management Architecture (UIMA), which handles the problem of interoperability; it provides a web browser-based interface for developing workflows by drawing diagrams composed of a selection of available processing components; and it provides novel user-interactive analytics such as the annotation editor, which constitutes a bridge between automatic processing and manual correction. These features extend the target audience of Argo to users with limited or no technical background. Here, we focus specifically on the construction of advanced workflows, involving multiple branching and merging points, to facilitate various comparative evaluations. Together with the use of user-collaboration capabilities supported in Argo, we demonstrate several use cases including visual inspections, comparisons of multiple processing segments or complete solutions against a reference standard, inter-annotator agreement, and shared task mass evaluations. Ultimately, Argo emerges as a one-stop workbench for defining, processing, editing and evaluating text processing tasks.

    Bringing named entity recognition on Drupal content management system

    Published in "8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014)". Content management systems and frameworks (CMS/F) play a key role in Web development. They support common Web operations and provide a number of optional modules to implement customized functionalities. Given the increasing demand for text mining (TM) applications, it seems logical that CMS/F extend their offer of TM modules. In this regard, this work contributes to Drupal CMS/F with modules that support customized named entity recognition and enable the construction of domain-specific document search engines. Implementation relies on well-recognized Apache Information Retrieval and TM initiatives, namely Apache Lucene, Apache Solr and Apache Unstructured Information Management Architecture (UIMA). As proof of concept, we present here the development of a Drupal CMS/F that retrieves biomedical articles and performs automatic recognition of organism names to enable further organism-driven document screening.

    Report on the 2015 NSF Workshop on Unified Annotation Tooling

    On March 30 & 31, 2015, an international group of twenty-three researchers with expertise in linguistic annotation convened in Sunny Isles Beach, Florida to discuss problems with and potential solutions for the state of linguistic annotation tooling. The participants comprised 14 researchers from the U.S. and 9 from outside the U.S., with 7 countries and 4 continents represented, and hailed from fields and specialties including computational linguistics, artificial intelligence, speech processing, multi-modal data processing, clinical & medical natural language processing, linguistics, documentary linguistics, sign-language linguistics, corpus linguistics, and the digital humanities. The motivating problem of the workshop was the balkanization of annotation tooling, namely, that even though linguistic annotation requires sophisticated tool support to efficiently generate high-quality data, the landscape of tools for the field is fractured, incompatible, inconsistent, and lacks key capabilities. The overall goal of the workshop was to chart the way forward, centering on five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And, (5) How should we ensure longevity and sustainability of the solution? I surveyed the participants before their arrival, which provided significant raw material for ideas, and the workshop discussion itself resulted in the identification of ten specific classes of problems and five sets of most-needed capabilities. Importantly, we identified annotation project managers in computational linguistics as the key recipients and users of any solution, thereby succinctly addressing questions about the scope and audience of potential solutions. We discussed management and sustainability of potential solutions at length. The participants agreed on sixteen recommendations for future work. This technical report contains a detailed discussion of all these topics, a point-by-point review of the discussion in the workshop as it unfolded, detailed information on the participants and their expertise, and the summarized data from the surveys.

    D3.1. Architecture and design of the platform

    This document aims to establish the requirements and the technological basis and design of the PANACEA platform. These are the main goals of the document:
    - Survey the different technological approaches that can be used in PANACEA.
    - Specify some guidelines for the metadata.
    - Establish the requirements for the platform.
    - Make a Common Interface proposal for the tools.
    - Propose a format for the data to be exchanged by the tools (Travelling Object).
    - Choose the technologies that will be used to develop the platform.
    - Propose a workplan.

    The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium*

    Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not need to transfer entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security are discussed. As a result, we improved interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies.

    Natural Language Processing: Integration of Automatic and Manual Analysis

    There is a current trend to combine natural language analysis with research questions from the humanities. This requires an integration of automatic analysis with manual analysis, e.g. to develop a theory behind the analysis, to test the theory against a corpus, to generate training data for automatic analysis based on machine learning algorithms, and to evaluate the quality of the results from automatic analysis. Manual analysis is traditionally the domain of linguists, philosophers, and researchers from other humanities disciplines, who are often not expert programmers. Automatic analysis, on the other hand, is traditionally done by expert programmers, such as computer scientists and more recently computational linguists. It is important to bring these communities, their tools, and data closer together, to produce analysis of a higher quality with less effort. However, promising cooperations involving manual and automatic analysis, e.g. for the purpose of analyzing a large corpus, are hindered by many problems:
    - No comprehensive set of interoperable automatic analysis components is available.
    - Assembling automatic analysis components into workflows is too complex.
    - Automatic analysis tools, exploration tools, and annotation editors are not interoperable.
    - Workflows are not portable between computers.
    - Workflows are not easily deployable to a compute cluster.
    - There are no adequate tools for the selective annotation of large corpora.
    - In automatic analysis, annotation type systems are predefined, but manual annotation requires customizability.
    - Implementing new interoperable automatic analysis components is too complex.
    - Workflows and components are not sufficiently debuggable and refactorable.
    - Workflows that change dynamically via parametrization are not readily supported.
    - The user has no control over workflows that rely on expert skills from a different domain, undocumented knowledge, or third-party infrastructures, e.g. web services.
    In cooperation with researchers from the humanities, we develop innovative technical solutions and designs to facilitate the use of automatic analysis and to promote the integration of manual and automatic analysis. To address these issues, we set foundations in four areas:
    - Usability is improved by reducing the complexity of the APIs for building workflows and creating custom components, improving the handling of resources required by such components, and setting up auto-configuration mechanisms.
    - Reproducibility is improved through a concept for self-contained, portable analysis components and workflows combined with a declarative modeling approach for dynamic parametrized workflows, which helps avoid unnecessary auxiliary manual steps in automatic workflows.
    - Flexibility is achieved by providing an extensive collection of interoperable automatic analysis components. We also compare annotation type systems used by different automatic analysis components to locate design patterns that allow for customization when used in manual analysis tasks.
    - Interactivity is achieved through a novel "annotation-by-query" process combining corpus search with annotation in a multi-user scenario. The process is supported by a web-based tool.
    We demonstrate the adequacy of our concepts through examples which represent whole classes of research problems. Additionally, we integrated all our concepts into existing open-source projects, or we implemented and published them within new open-source projects.

    Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

    No abstract available
