22 research outputs found

    Automatic extraction of facts, relations, and entities for web-scale knowledge base population

    Get PDF
    Equipping machines with knowledge, through the construction of machine-readable knowledge bases, presents a key asset for semantic search, machine translation, question answering, and other formidable challenges in artificial intelligence. However, human knowledge predominantly resides in books and other natural language texts, so knowledge bases must be extracted and synthesized from natural language text. When the source of text is the Web, extraction methods must cope with ambiguity, noise, scale, and updates. The goal of this dissertation is to develop knowledge base population methods that address the aforementioned characteristics of Web text. The dissertation makes three contributions. The first contribution is a method for mining high-quality facts at scale, through distributed constraint reasoning and a pattern representation model that is robust against noisy patterns. The second contribution is a method for mining a large, comprehensive collection of relation types beyond those commonly found in existing knowledge bases. The third contribution is a method for extracting facts from dynamic Web sources such as news articles and social media, where one of the key challenges is the constant emergence of new entities. All methods have been evaluated through experiments involving Web-scale text collections.
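The first contribution's core idea of discounting noisy patterns can be illustrated with a minimal sketch. This is only a toy stand-in, not the dissertation's method (which uses distributed constraint reasoning over a richer pattern model): here a fact inherits a support score from the pattern that produced it, so one-off noisy patterns yield low-confidence facts. The patterns and relation names are illustrative assumptions.

```python
import re

# Illustrative surface patterns mapped to relation names (assumptions,
# not taken from the dissertation).
PATTERNS = {
    r"(\w+) was born in (\w+)": "bornIn",
    r"(\w+) is the capital of (\w+)": "capitalOf",
}

def extract_facts(sentences):
    """Return (fact, support) pairs, where support counts the distinct
    facts matched by the same pattern: a crude noise-robustness signal."""
    scored = []
    for pattern, relation in PATTERNS.items():
        matches = []
        for sentence in sentences:
            for m in re.finditer(pattern, sentence):
                matches.append((m.group(1), relation, m.group(2)))
        support = len(set(matches))  # distinct facts this pattern produced
        scored.extend((fact, support) for fact in matches)
    return scored
```

A pattern that fires on many distinct entity pairs accumulates support, while a spurious pattern matching a single sentence scores low.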

    A Hybrid Scavenger Grid Approach to Intranet Search

    Get PDF
    According to a 2007 global survey of 178 organisational intranets, 3 out of 5 organisations are not satisfied with their intranet search services. However, as intranet data collections become large, effective full-text intranet search services are needed more than ever before. To provide an effective full-text search service based on current information retrieval algorithms, organisations have to deal with the need for greater computational power. Hardware architectures that can scale to large data collections and can be obtained and maintained at a reasonable cost are needed. Web search engines address scalability and cost-effectiveness by using large-scale centralised cluster architectures. The scalability of cluster architectures is evident in the ability of Web search engines to respond to millions of queries within a few seconds while searching very large data collections. Though more cost-effective than high-end supercomputers, cluster architectures still have relatively high acquisition and maintenance costs. Where information retrieval is not the core business of an organisation, a cluster-based approach may not be economically viable. A hybrid scavenger grid is proposed as an alternative architecture: it consists of a combination of dedicated resources and dynamic resources in the form of idle desktop workstations. From the dedicated resources, the architecture gets predictability and reliability, whereas from the dynamic resources it gets scalability. An experimental search engine was deployed on a hybrid scavenger grid and evaluated. Test results showed that the resources of the grid can be organised to deliver the best performance by using the optimal number of machines and scheduling the optimal combination of tasks that the machines perform. A system efficiency and cost-effectiveness comparison of a grid and a multi-core machine showed that for workloads of modest to large sizes, the grid architecture delivers better throughput per unit cost than the multi-core machine, at a system efficiency that is comparable to that of the multi-core machine. The study has shown that a hybrid scavenger grid is a feasible search engine architecture that is cost-effective and scales to medium- to large-scale data collections.
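The cost-effectiveness metric compared in the study can be sketched as throughput per unit cost. All figures below are illustrative assumptions, not measurements from the paper; the point is only the shape of the comparison.

```python
# Hypothetical comparison: a cheap scavenger grid vs. a faster but more
# expensive multi-core machine. Numbers are made up for illustration.

def throughput_per_cost(docs_indexed: float, hours: float, cost: float) -> float:
    """Documents indexed per hour, normalised by acquisition cost."""
    return (docs_indexed / hours) / cost

# Grid: slower per machine, but built largely from idle workstations.
grid = throughput_per_cost(docs_indexed=1_000_000, hours=10, cost=2_000)
# Multi-core: faster wall-clock time, higher acquisition cost.
multicore = throughput_per_cost(docs_indexed=1_000_000, hours=8, cost=5_000)
```

Under these assumed numbers the grid wins on throughput per unit cost even though the multi-core machine finishes sooner, mirroring the trade-off the abstract describes.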

    On Type-Aware Entity Retrieval

    Full text link
    Today, the practice of returning entities from a knowledge base in response to search queries has become widespread. One of the distinctive characteristics of entities is that they are typed, i.e., assigned to some hierarchically organized type system (type taxonomy). The primary objective of this paper is to gain a better understanding of how entity type information can be utilized in entity retrieval. We perform this investigation in an idealized "oracle" setting, assuming that we know the distribution of target types of the relevant entities for a given query. We perform a thorough analysis of three main aspects: (i) the choice of type taxonomy, (ii) the representation of hierarchical type information, and (iii) the combination of type-based and term-based similarity in the retrieval model. Using a standard entity search test collection based on DBpedia, we find that type information proves most useful when using large type taxonomies that provide very specific types. We provide further insights on the extensional coverage of entities and on the utility of target types.
    Comment: Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR '17), 2017
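The third aspect, combining type-based and term-based similarity, can be sketched as a simple linear interpolation. The example entities, scores, and the interpolation weight are assumptions for illustration; the paper analyzes this combination far more systematically.

```python
# Hedged sketch: mix a term-based retrieval score with a type-based
# similarity to the query's target type distribution.

def combined_score(term_score: float, type_score: float, lam: float = 0.5) -> float:
    """Linear interpolation of term-based and type-based similarity."""
    return (1 - lam) * term_score + lam * type_score

# Hypothetical query "capital of Norway" with target type City:
# (term_score, type_score) per candidate entity.
candidates = {
    "Oslo": (0.8, 1.0),          # matches the target type
    "Oslo (album)": (0.9, 0.0),  # stronger term match, wrong type
}
ranked = sorted(candidates, key=lambda e: combined_score(*candidates[e]),
                reverse=True)
```

Even with a weaker term score, the type-matching entity ranks first once type evidence is mixed in, which is the intuition the oracle study quantifies.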

    Dynamic Role Allocation for Small Search Engine Clusters

    Get PDF
    Search engines facilitate efficient discovery of information in large information environments such as the Web. As the amount of information rapidly increases, search engines require greater computational resources. Similarly, as the user base increases, search engines need to handle increasing numbers of user requests. Existing solutions to these scalability problems are often designed for large computer clusters. This paper presents a flexible solution that is also deployable on small clusters. The solution is based on the allocation and dynamic re-adjustment of indexing and querying roles among cluster nodes in order to optimize cluster utilisation. By allocating cluster machines to the job that requires the most computational power, indexing and querying may both realize performance gains, while neither overwhelms the limited resources available. A prototype system was built and tested on a small cluster using a dataset of over 100 000 Web pages from the uct.ac.za domain. Initial results confirm an improved system resource utilisation, which warrants further investigation.
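The dynamic re-adjustment idea can be sketched as reallocating nodes in proportion to pending work. This heuristic and all names are assumptions for illustration, not the paper's actual allocation algorithm.

```python
# Hypothetical role allocator: split nodes between indexing and querying
# in proportion to the pending workload on each side, while always keeping
# at least one node on each role so neither service starves.

def allocate_roles(num_nodes: int, pending_index_jobs: int, pending_queries: int):
    total = pending_index_jobs + pending_queries
    if total == 0:
        indexers = num_nodes // 2  # idle cluster: split evenly
    else:
        indexers = round(num_nodes * pending_index_jobs / total)
        indexers = max(1, min(num_nodes - 1, indexers))
    return {"indexing": indexers, "querying": num_nodes - indexers}
```

Re-running the allocator periodically as queue lengths change gives the dynamic re-adjustment behaviour the abstract describes.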

    IntentsKB: A Knowledge Base of Entity-Oriented Search Intents

    Full text link
    We address the problem of constructing a knowledge base of entity-oriented search intents. Search intents are defined on the level of entity types, each comprising a high-level intent category (property, website, service, or other), along with a cluster of query terms used to express that intent. These machine-readable statements can be leveraged in various applications, e.g., for generating entity cards or query recommendations. By structuring service-oriented search intents, we take one step towards making entities actionable. The main contribution of this paper is a pipeline of components we develop to construct a knowledge base of entity intents. We evaluate performance both component-wise and end-to-end, and demonstrate that our approach is able to generate high-quality data.
    Comment: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), 2018. 4 pages. 2 figures
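The shape of the statements such a knowledge base holds can be sketched as follows. The intent categories (property, website, service, other) come from the abstract; the keyword-cue classifier and all cue words are illustrative stand-ins for the paper's pipeline components.

```python
# Hypothetical sketch: emit an intent statement for an entity type, i.e.
# an intent category plus the cluster of query terms expressing it.

SERVICE_CUES = {"book", "buy", "download", "watch"}   # assumed cue words
WEBSITE_CUES = {"official", "site", "homepage"}       # assumed cue words

def classify_intent(query_terms):
    """Toy stand-in for the pipeline's intent classifier."""
    terms = set(query_terms)
    if terms & SERVICE_CUES:
        return "service"
    if terms & WEBSITE_CUES:
        return "website"
    return "property"

def intent_statement(entity_type, query_terms):
    return {
        "type": entity_type,
        "category": classify_intent(query_terms),
        "terms": sorted(set(query_terms)),
    }
```

A statement like the one produced for ("movie", ["watch", "online"]) is the kind of machine-readable record that can then drive entity cards or query recommendations.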

    A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias

    Get PDF
    The automatic detection of bias in news articles can have a high impact on society because undiscovered news bias may influence the political opinions, social views, and emotional feelings of readers. While various analyses and approaches to news bias detection have been proposed, large data sets with rich bias annotations on a fine-grained level are still missing. In this paper, we firstly aggregate the aspects of news bias in related works by proposing a new annotation schema for labeling news bias. This schema covers the overall bias, as well as the bias dimensions (1) hidden assumptions, (2) subjectivity, and (3) representation tendencies. Secondly, we propose a methodology based on crowdsourcing for obtaining a large data set for news bias analysis and identification. We then use our methodology to create a dataset consisting of more than 2,000 sentences annotated with 43,000 bias and bias dimension labels. Thirdly, we perform an in-depth analysis of the collected data. We show that the annotation task is difficult with respect to bias and specific bias dimensions. While crowdworkers' labels of representation tendencies correlate with experts' bias labels for articles, subjectivity and hidden assumptions do not correlate with experts' bias labels and, thus, seem to be less relevant when creating data sets with crowdworkers. The experts' article labels better match the inferred crowdworkers' article labels than the crowdworkers' sentence labels. The crowdworkers' countries of origin seem to affect their judgements. In our study, non-Western crowdworkers tend to annotate more bias, either directly or in the form of bias dimensions (e.g., subjectivity), than Western crowdworkers do.
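One step the abstract mentions, inferring article-level labels from crowdworkers' sentence labels, can be sketched with a majority-vote aggregation. The threshold and the aggregation rule below are assumptions for illustration; the paper's actual inference procedure may differ.

```python
from collections import Counter

# Hedged sketch: majority vote per sentence, then flag the article as
# biased if the share of biased sentences exceeds an assumed threshold.

def majority_label(labels):
    """Most frequent label among one sentence's crowd annotations."""
    return Counter(labels).most_common(1)[0][0]

def article_biased(sentence_label_sets, threshold=0.3):
    votes = [majority_label(labels) for labels in sentence_label_sets]
    biased_share = votes.count("biased") / len(votes)
    return biased_share > threshold
```

This kind of aggregation is what lets crowdworker sentence labels be compared against expert article labels, as the analysis in the paper does.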

    An Approach to Better System Resource Utilization for Search Engine Clusters

    Get PDF
    Better system resource utilization for search engine clusters can result in significant benefits. By allocating cluster machines to the job that requires the most computational power, indexing and querying both realize performance gains. In this paper we discuss an approach to better system resource utilization, which was tested by implementing it in a cluster-based search engine. We test the approach on 100 000 webpages from the uct.ac.za domain. Our results show the benefits of enhanced system resource utilization in a search engine cluster.

    Automatic Extraction of Facts, Relations, and Entities for the Construction of Web-Scale Knowledge Bases

    No full text