
    Harvesting Entities from the Web Using Unique Identifiers -- IBEX

    In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting from a simple extraction of identifiers and names from Web pages, we show how the properties of unique identifiers can be used to filter out noise and clean up the extraction result over the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
    Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A. Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting Entities from the Web Using Unique Identifiers. WebDB workshop, 201
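    The core idea — extract candidate identifiers with patterns, then exploit the identifiers' built-in structure (such as checksums) to filter out noise — can be sketched as follows. This is only an illustrative stand-in, not the IBEX system itself; the regular expressions, function names, and the restriction to ISBN-13 and DOI are assumptions for the example.

    ```python
    import re

    # Simplified patterns for two identifier types (assumed for illustration).
    ISBN13_RE = re.compile(r"\b97[89][-\s]?(?:\d[-\s]?){9}\d\b")
    DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b")

    def isbn13_valid(isbn: str) -> bool:
        """ISBN-13 checksum: digits weighted 1,3,1,3,... must sum to a multiple of 10."""
        digits = [int(c) for c in isbn if c.isdigit()]
        if len(digits) != 13:
            return False
        return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

    def harvest(text: str) -> dict:
        """Extract candidate identifiers; keep only ISBNs that pass the checksum."""
        isbns = [m for m in ISBN13_RE.findall(text) if isbn13_valid(m)]
        dois = DOI_RE.findall(text)
        return {"isbn": isbns, "doi": dois}
    ```

    The checksum step is what makes identifier harvesting unusually robust: a random 13-digit string passes the ISBN-13 test only one time in ten, so structural validation alone removes most spurious matches before any corpus-level cleaning.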

    Automatic extraction of knowledge from web documents

    A large amount of the digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents, drawn from multiple sources, in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system, which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. This knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper.
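    The ontology-driven pattern is: the ontology defines the slots to fill, an extractor fills them per document, and facts from multiple documents are merged. A minimal sketch of that flow, assuming a toy three-slot schema with regex extractors (Artequakt's actual pipeline uses full NLP tooling, not regexes; all names here are hypothetical):

    ```python
    import re

    # Toy ontology schema: slot name -> extraction pattern (assumed for illustration).
    ONTOLOGY_PATTERNS = {
        "name": re.compile(r"^([A-Z][a-z]+(?: [A-Z][a-z]+)+)"),
        "birth_year": re.compile(r"born(?: in)? (\d{4})"),
        "death_year": re.compile(r"died(?: in)? (\d{4})"),
    }

    def extract_facts(sentence: str) -> dict:
        """Fill ontology slots from one sentence; unmatched slots are simply absent."""
        facts = {}
        for slot, pattern in ONTOLOGY_PATTERNS.items():
            m = pattern.search(sentence)
            if m:
                facts[slot] = m.group(1)
        return facts

    def merge(documents: list) -> dict:
        """Merge facts across documents, keeping the first value found per slot."""
        merged = {}
        for doc in documents:
            for slot, value in extract_facts(doc).items():
                merged.setdefault(slot, value)
        return merged
    ```

    The merge step is where the multi-source aspect matters: no single document need mention every slot, and a downstream generator can render the merged record as a biography.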

    Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples

    Machine Learning has been a big success story during the AI resurgence. One particular standout success relates to learning from massive amounts of data. In spite of early assertions of the unreasonable effectiveness of data, there is increasing recognition of the value of utilizing knowledge whenever it is available or can be created purposefully. In this paper, we discuss the indispensable role of knowledge for deeper understanding of content where (i) large amounts of training data are unavailable, (ii) the objects to be recognized are complex (e.g., implicit entities and highly subjective content), and (iii) applications need to use complementary or related data in multiple modalities/media. What brings us to the cusp of rapid progress is our ability to (a) create relevant and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP techniques. Using diverse examples, we seek to foretell unprecedented progress in our ability for deeper understanding and exploitation of multimodal data and continued incorporation of knowledge in learning techniques.
    Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI). arXiv admin note: substantial text overlap with arXiv:1610.0770

    Backward recalculation of seasonal series affected by economic crisis: a Model-Based-Link method for the case of Turkish GDP

    When attempting to deal with the recalculation process, it is hard to answer the question “Does the recalculated series include economic events and seasonal behaviours of the past?”. This paper discusses some alternative backward recalculation methods and presents their application to the Turkish Gross Domestic Product (GDP) series, together with the results. Using comparative analysis, it is shown that ordinary ARIMA forecasts and signal extraction methods are not successful at taking past events into account in the backward recalculated series. A new method, named Model-based-link, is then proposed in order to take past economic events and seasonal patterns into account when a series is backward recalculated. A first application of this new method is run on the quarterly series of the Turkish GDP. In addition, it is shown that the Model-based-link method can be extended to data sets of different frequencies (i.e. annual data). Consequently, it can be claimed that comparable recalculated quarterly and annual Turkish GDP series for forthcoming data are obtained. The paper is structured as follows: section 1 introduces the reader to the state of the art in the current literature; section 2 defines the information set to be backward recalculated and presents some statistics on the data, while section 3 presents the main methodological aspects of classical methods compared to the methodological scheme of the Model-based-link that can be used for the recalculation process. Section 4 presents the results of the methods mentioned in the previous section, and section 5 discusses the extension of the Model-based-link method to monthly data and includes an application for annual data; section 6 concludes. Finally, section 7 presents topics for discussion and challenges for the continuation of the analysis.
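    The classical baseline the paper argues against can be made concrete: backcasting a series amounts to reversing time and forecasting, and the simplest such forecaster just repeats the earliest observed seasonal pattern. The sketch below shows this naive baseline only (the function name and seasonal-naive choice are assumptions for illustration); the Model-based-link method itself is not reproduced here.

    ```python
    def backcast_seasonal_naive(series, periods, season=4):
        """Backward-extend a series by copying the value one season ahead.

        Reversing time turns backcasting into ordinary forecasting; any model
        (e.g. ARIMA) can replace this seasonal-naive rule. For each new value
        prepended at time t, we copy the value at t + season.
        """
        extended = list(series)
        for _ in range(periods):
            # Before insertion, the value one season ahead of the new point
            # sits at index season - 1 of the current list.
            extended.insert(0, extended[season - 1])
        return extended
    ```

    A rule like this (or a reversed-series ARIMA forecast) mechanically repeats recent seasonal shape backwards, which is exactly why it cannot reflect past economic events such as crises — the motivation for a model-based linking approach instead.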