207 research outputs found

    Framework for Knowledge Discovery in Educational Video Repositories

    The ease of creating digital content, coupled with technological advancements, allows institutions and organizations to further embrace distance learning. Teaching materials also receive attention, because it is difficult for students to find adequate didactic material: doing so requires considerable effort and knowledge of both the material and the repository. This work presents a framework that enables automatic metadata generation for materials available in educational video repositories. Each module of the framework works autonomously and can be used in isolation, complemented by another technique, or replaced by an approach better suited to the field of use, such as repositories with other types of media or other content.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    The Interpretation of Tables in Texts

    Learning for text mining : tackling the cost of feature and knowledge engineering.

    Over the last decade, the state of the art in text mining has moved towards the adoption of machine learning as the main paradigm at the heart of approaches. Despite significant advances, machine-learning-based text mining solutions remain costly to design, develop and maintain for real-world problems. An important component of this cost (feature engineering) concerns the effort required to understand which features or characteristics of the data can be successfully exploited in inducing a predictive model of the data. Another important component of the cost (knowledge engineering) has to do with the effort involved in creating labelled data and in eliciting knowledge about the mining systems and the data itself. I present a series of approaches, methods and findings aimed at reducing the cost of creating and maintaining document classification and information extraction systems. They address the following questions: Which classes of features lead to improved classification accuracy in the document classification and entity extraction tasks? How can the number of labelled examples needed to train machine-learning-based document classification and information extraction systems be reduced, so as to relieve domain experts of this costly task? How can knowledge about these systems and the data that they manipulate be represented effectively, in order to make systems interoperable and results replicable? I provide the reader with the background information necessary to understand the above questions and the contributions to the state of the art contained herein. The contributions include: the identification of novel classes of features for the document classification task which exploit the multimedia nature of documents and lead to improved classification accuracy; a novel approach to domain adaptation for text categorization which outperforms standard supervised and semi-supervised methods while requiring considerably less supervision; and a well-founded formalism for declaratively specifying text and multimedia mining systems.

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.