2,086 research outputs found

    On Learning Web Information Extraction Rules with TANGO

    The research on Enterprise Systems Integration focuses on proposals to support business processes by re-using existing systems. Wrappers help re-use web applications that provide a user interface only: they emulate a human user who interacts with the application and extracts the information of interest in a structured format. In this article, we present TANGO, our proposal to learn rules that extract information from semi-structured web documents with high precision and recall, which is a must in the context of Enterprise Systems Integration. It relies on an open catalogue of features that helps map the input documents into a knowledge base in which every DOM node is represented by means of HTML, DOM, CSS, relational, and user-defined features. A procedure with many variation points then learns extraction rules from that knowledge base; the variation points include heuristics that range from how to select a condition to how to simplify the resulting rules. We also provide a systematic method to help re-configure our proposal. Our exhaustive experimentation proves that it beats others regarding effectiveness and is efficient enough for practical purposes. Our proposal was devised to be as configurable as possible, which helps adapt it to particular web sites and evolve it when necessary. Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
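
    As a rough illustration of the feature-mapping idea, the sketch below turns every DOM element of a page into a flat feature record; the concrete feature names and the extract_features helper are assumptions made for illustration, not TANGO's actual feature catalogue or rule learner.

        # Hypothetical sketch: map DOM elements to feature records (not TANGO's actual catalogue).
        from bs4 import BeautifulSoup  # third-party HTML parser

        def extract_features(node, depth):
            """Represent one DOM element as a flat feature dictionary."""
            return {
                "tag": node.name,                                    # HTML feature
                "depth": depth,                                      # DOM feature
                "css_class": " ".join(node.get("class", [])),        # CSS feature
                "n_children": len(node.find_all(recursive=False)),   # relational feature
                "text_length": len(node.get_text(strip=True)),       # user-defined feature
            }

        def build_knowledge_base(html):
            """Visit the DOM depth-first and collect one feature record per element."""
            soup = BeautifulSoup(html, "html.parser")
            records = []

            def visit(node, depth=0):
                records.append(extract_features(node, depth))
                for child in node.find_all(recursive=False):
                    visit(child, depth + 1)

            visit(soup.html or soup)
            return records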

    Automatic Genre Classification of Latin Music Using Ensemble of Classifiers

    This paper presents a novel approach to automatic music genre classification based on ensemble learning. Feature vectors are extracted from three 30-second music segments taken from the beginning, middle and end of each music piece, and an individual classifier is trained for each segment. During classification, the outputs provided by the classifiers are combined with the aim of improving genre classification accuracy. Experiments carried out on a dataset containing 600 music samples from two Latin genres (Tango and Salsa) show that the features extracted from the middle and end segments provide better results than those from the beginning segment. Furthermore, the proposed ensemble method provides better accuracy than any single classifier applied to an individual segment.
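
    To make the ensemble step concrete, the following sketch assumes feature vectors have already been extracted from the beginning, middle and end segments of each track; the choice of SVMs and the fusion by probability averaging are illustrative assumptions, not necessarily the classifiers or combination rule used in the paper.

        # Hypothetical sketch: one classifier per segment, outputs fused by averaging probabilities.
        import numpy as np
        from sklearn.svm import SVC

        def train_segment_classifiers(X_begin, X_middle, X_end, y):
            """Train one classifier per 30-second segment."""
            return [SVC(probability=True).fit(X, y) for X in (X_begin, X_middle, X_end)]

        def predict_genre(classifiers, segments):
            """segments: feature matrices for the beginning, middle and end of the test tracks."""
            probs = np.mean(
                [clf.predict_proba(X) for clf, X in zip(classifiers, segments)], axis=0
            )
            return classifiers[0].classes_[np.argmax(probs, axis=1)]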

    Smart Photos

    Recent technological leaps have been a great catalyst for changing how people interact with the world around them. Specifically, the field of Augmented Reality has led to many software and hardware advances that form a digital intermediary between humans and their environment. As of now, Augmented Reality is available to the select few with the means of obtaining Google Glass, Oculus Rifts, and other relatively expensive platforms. Be that as it may, the tech industry's current goal has been the integration of this technology into the public's smartphones and everyday devices. One inhibitor of this goal is the difficulty of finding an Augmented Reality application whose usage could satisfy an everyday need or attraction. Augmented Reality presents our world in a unique perspective that can be found nowhere else in the natural world. However, visual impact is weak without substance or meaning. The best technology is invisible, and what makes a good product is its ability to fill a void in a person's life. The most important researchers in this field are those who have been augmenting the tasks that most would consider mundane, such as overlaying nutritional information directly onto a meal [4]. In the same vein, we hope to incorporate Augmented Reality into everyday life by unlocking the full potential of a technology often believed to have already reached its peak. The humble photograph, a classic invention and unwavering enhancement to the human experience, captures moments in space and time and compresses them into a single permanent state. These two-dimensional assortments of pixels give us a physical representation of the memories we form in specific periods of our lives. We believe this representation can be further enhanced in what we like to call a Smart Photo. The idea behind a Smart Photo is to unlock the full potential in the way that people can interact with photographs. The same notion is explored in the field of Virtual Reality with inventions such as 3D movies, which provide a special appeal that ordinary 2D films cannot: the 3D technology places the viewer inside the film's environment. We intend to bridge this seemingly mutually exclusive dichotomy by processing 2D photos alongside their 3D counterparts.

    On validating web information extraction proposals

    Many people who have to make informed decisions in today’s always-on culture use information extractors to feed their systems with information that comes from human-friendly documents. Unfortunately, many proposals that validate information extractors have deficiencies that make it difficult to perform homogeneous comparisons, confirm or refute performance hypotheses, or draw unbiased conclusions. Consequently, it is very difficult to select the best-performing proposal on a sound basis. The state-of-the-art validation method overcomes many deficiencies in the previous proposals, but still overlooks the following issues: completeness of the validation datasets, that is, whether they provide a complete set of annotations or not; structure of the information, that is, whether they check the structure of the record instances extracted or just the attribute instances; and, finally, how extractions and annotations are matched. The decisions made regarding the previous issues have an impact on the effectiveness results. In this article, we have exhaustively analysed the literature and we have also highlighted the main weaknesses to tackle. We present a guideline and a method to compute the effectiveness, which complements and enhances the state-of-the-art validation method. Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138137
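
    The matching issue highlighted above can be made concrete with a tiny scorer: the sketch below computes precision, recall and F1 from sets of extracted and annotated attribute instances under an exact-match criterion, which is one possible matching choice rather than the method the article prescribes.

        # Hypothetical sketch: effectiveness under an exact-match criterion between
        # extracted and annotated (document, attribute, value) triples.
        def effectiveness(extracted, annotated):
            true_positives = len(extracted & annotated)
            precision = true_positives / len(extracted) if extracted else 0.0
            recall = true_positives / len(annotated) if annotated else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            return precision, recall, f1

    Switching to a partial-match criterion would change all three figures for the same extractor, which is precisely why the matching policy needs to be reported alongside the results.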

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
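
    One way to picture the RSS-plus-HTML idea is the sketch below, which reads item titles and links from an RSS 2.0 feed and then looks for the smallest HTML element containing each title; the element names follow the RSS 2.0 standard, while the locate_post heuristic is an illustrative assumption rather than the methodology proposed in the report.

        # Hypothetical sketch: pair feed items with the HTML regions that contain them.
        import xml.etree.ElementTree as ET
        from bs4 import BeautifulSoup

        def rss_items(rss_xml):
            """Yield (title, link) pairs from an RSS 2.0 feed."""
            root = ET.fromstring(rss_xml)
            for item in root.iter("item"):
                yield item.findtext("title"), item.findtext("link")

        def locate_post(html, title):
            """Return the smallest element whose text contains the feed title."""
            soup = BeautifulSoup(html, "html.parser")
            candidates = [el for el in soup.find_all(True) if title and title in el.get_text()]
            return min(candidates, key=lambda el: len(el.get_text()), default=None)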

    Computer Assisted Language Learning Based on Corpora and Natural Language Processing : The Experience of Project CANDLE

    This paper describes Project CANDLE, an ongoing 3-year project that uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation check program, MUST, was developed to detect V-N miscollocations and suggest adequate collocates in students' writings, based on the hypothesis of L1 interference and on the BNC and the bilingual Sinorama Corpus. Other computational scaffoldings are under development. It is hoped that this project will help intermediate learners in Taiwan enhance their English proficiency with effective pedagogical approaches and versatile language reference tools.
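
    As a toy illustration of the V-N miscollocation check, the sketch below flags verb-noun pairs that are not attested in a reference table and suggests verbs that are; the table and the suggestion strategy are placeholders, not the resources or algorithm behind MUST.

        # Hypothetical sketch: flag unattested verb-noun pairs and suggest attested verbs.
        COLLOCATIONS = {"make": {"decision", "mistake"}, "take": {"photo", "break"}}  # toy stand-in

        def check_collocation(verb, noun):
            if noun in COLLOCATIONS.get(verb, set()):
                return None  # attested collocation, nothing to report
            suggestions = [v for v, nouns in COLLOCATIONS.items() if noun in nouns]
            return {"miscollocation": (verb, noun), "suggested_verbs": suggestions}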

    Pattern Learning for Detecting Defect Reports and Improvement Requests in App Reviews

    Online reviews are an important source of feedback for understanding customers, yet raw reviews rarely translate directly into actionable insights. In this study, we target that gap by classifying reviews as defect reports and requests for improvement. Unlike traditional classification methods based on expert rules, we reduce the manual labour by employing a supervised system that is capable of learning lexico-semantic patterns through genetic programming. Additionally, we experiment with a distantly-supervised SVM that makes use of noisy labels generated by the patterns. Using a real-world dataset of app reviews, we show that the automatically learned patterns outperform the manually created ones while requiring less manual effort to be generated. The distantly-supervised SVM models are not far behind the pattern-based solutions, showing the usefulness of this approach when the amount of annotated data is limited. Comment: Accepted for publication in the 25th International Conference on Natural Language & Information Systems (NLDB 2020), DFKI Saarbrücken, Germany, June 24-26, 2020
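
    The distant-supervision step can be sketched as follows: a few seed patterns assign noisy labels to unlabelled reviews, and those labels train a text classifier. The seed patterns, label set and TF-IDF/linear-SVM pipeline below are illustrative assumptions, not the lexico-semantic patterns learned by the genetic-programming system described above.

        # Hypothetical sketch: noisy labels from seed patterns train an SVM text classifier.
        import re
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        SEED_PATTERNS = {
            "defect_report": re.compile(r"\b(crash(es|ed)?|freezes?|error)\b", re.I),
            "improvement_request": re.compile(r"\b(please add|would be (nice|great)|wish)\b", re.I),
        }

        def noisy_label(review):
            """Assign a noisy class label based on the first matching seed pattern."""
            for label, pattern in SEED_PATTERNS.items():
                if pattern.search(review):
                    return label
            return "other"

        def train_distant_svm(reviews):
            """Train an SVM on pattern-generated labels instead of manual annotations."""
            labels = [noisy_label(r) for r in reviews]
            model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
            return model.fit(reviews, labels)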

    Fireground location understanding by semantic linking of visual objects and building information models

    This paper presents an outline for improved localization and situational awareness in fire emergency situations, based on semantic technology and computer vision techniques. The novelty of our methodology lies in the semantic linking of video object recognition results from visual and thermal cameras with Building Information Models (BIM). The current limitations and possibilities of certain building information streams in the context of fire safety or fire incident management are addressed in this paper. Furthermore, our data management tools match higher-level semantic metadata descriptors of BIM with deep-learning-based visual object recognition and classification networks. Based on these matches, estimations can be generated of camera, object and event positions in the BIM model, transforming it from a static source of information into a rich, dynamic data provider. Previous work has already investigated the possibilities of linking BIM and low-cost point sensors for fireground understanding, but these approaches did not take into account the benefits of video analysis and recent developments in semantics and feature learning research. Finally, the strengths of the proposed approach compared to the state of the art are its (semi-)automatic workflow, its generic and modular setup and its multi-modal strategy, which allow it to automatically create situational awareness, improve localization and facilitate overall fire understanding.
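
    A minimal sketch of the linking idea, assuming a handful of BIM elements exported with a room attribute and an object detector that returns class labels: the detected classes vote for the room whose BIM inventory they match best. The records, detection format and voting heuristic are illustrative placeholders, not the paper's pipeline.

        # Hypothetical sketch: detected object classes vote for the most plausible BIM room.
        from collections import Counter

        BIM_ELEMENTS = [  # toy stand-in for elements exported from a BIM model
            {"type": "door", "id": "D-102", "room": "corridor-1"},
            {"type": "fire_extinguisher", "id": "FE-07", "room": "corridor-1"},
            {"type": "window", "id": "W-33", "room": "office-2"},
        ]

        def estimate_room(detected_labels):
            """Return the room whose BIM elements best match the detected object classes."""
            votes = Counter(
                element["room"]
                for label in detected_labels
                for element in BIM_ELEMENTS
                if element["type"] == label
            )
            return votes.most_common(1)[0][0] if votes else None

        # e.g. estimate_room(["door", "fire_extinguisher"]) -> "corridor-1"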

    ForecastTKGQuestions: A Benchmark for Temporal Question Answering and Forecasting over Temporal Knowledge Graphs

    Question answering over temporal knowledge graphs (TKGQA) has recently attracted increasing interest. TKGQA requires temporal reasoning techniques to extract the relevant information from temporal knowledge bases. The only existing TKGQA dataset, i.e., CronQuestions, consists of temporal questions based on facts from a fixed time period, where a temporal knowledge graph (TKG) spanning the same period can be fully used for answer inference, allowing TKGQA models to use even future knowledge to answer questions about past facts. In real-world scenarios, however, it is also common that, given the knowledge up to now, we wish TKGQA systems to answer questions about the future. As humans constantly seek plans for the future, building TKGQA systems that answer such forecasting questions is important. Nevertheless, this has remained unexplored in previous research. In this paper, we propose a novel task: forecasting question answering over temporal knowledge graphs. We also propose a large-scale TKGQA benchmark dataset, i.e., ForecastTKGQuestions, for this task. It includes three types of questions, i.e., entity prediction, yes-no, and fact reasoning questions. For every forecasting question in our dataset, QA models can only access the TKG information before the timestamp annotated in the given question for answer inference. We find that state-of-the-art TKGQA methods perform poorly on forecasting questions, and that they are unable to answer yes-no questions and fact reasoning questions. To this end, we propose ForecastTKGQA, a TKGQA model that employs a TKG forecasting module for future inference, to answer all three types of questions. Experimental results show that ForecastTKGQA outperforms recent TKGQA methods on the entity prediction questions, and it also shows great effectiveness in answering the other two types of questions. Comment: Accepted to ISWC 202
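
    The forecasting constraint can be pictured with the sketch below, which filters a TKG to the facts observable before the question's timestamp and answers an entity-prediction question with a naive most-recent-fact baseline; the (subject, relation, object, time) quadruple format is a common TKG convention, and the baseline is a placeholder rather than ForecastTKGQA itself.

        # Hypothetical sketch: restrict answer inference to facts before the question timestamp.
        def observable_facts(tkg, question_timestamp):
            """Keep only the quadruples a forecasting QA model is allowed to see."""
            return [(s, r, o, t) for (s, r, o, t) in tkg if t < question_timestamp]

        def answer_entity_prediction(tkg, subject, relation, question_timestamp):
            """Naive baseline: the object most recently linked to the subject via the relation."""
            history = [
                (t, o) for (s, r, o, t) in observable_facts(tkg, question_timestamp)
                if s == subject and r == relation
            ]
            return max(history, default=(None, None))[1]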