    How Good Is Popularity? Summary Grading in Crowdsourcing

    ABSTRACT In this paper, we applied a crowdsourcing approach to automated summary scoring based on crowd-generated summaries, which we call wild summaries; in contrast, the gold-standard summaries produced by one or more experts are called expert summaries. The innovation of our study is to compute LSA (Latent Semantic Analysis) similarities between the target summary and the wild summaries rather than the expert summaries. We called this method CLSAS, i.e., crowdsourcing-based LSA similarity. We evaluated CLSAS by comparing it with other approaches: Coh-Metrix language and discourse features and LIWC psychometric word measures. Results showed that CLSAS alone explained 19% of the variance in human summary scores, which was equivalent to the variance explained by dozens of language and discourse features and/or the word features. Results also showed that adding language and/or word features to CLSAS yielded only small additional gains in correlation. The findings imply that the crowdsourcing-based LSA similarity approach is a promising method for automated summary assessment.
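
    The abstract does not specify an implementation, but the core idea of CLSAS (scoring a target summary by its LSA similarity to crowd-written "wild" summaries) can be sketched as follows. This is a minimal illustration under assumed names (`clsas_score`, toy data) and standard scikit-learn components, not the authors' actual pipeline.

    ```python
    # Minimal sketch of a crowdsourcing-based LSA similarity score (CLSAS-style).
    # Assumptions: a list of crowd-written "wild" summaries and one target summary
    # to grade; names and parameters here are illustrative, not from the paper.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    def clsas_score(target_summary, wild_summaries, n_components=50):
        """Score a target summary by its mean LSA similarity to the wild summaries."""
        corpus = wild_summaries + [target_summary]
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
        # LSA = truncated SVD over the term-document matrix; cap the rank for tiny corpora
        k = min(n_components, tfidf.shape[1] - 1, len(corpus) - 1)
        lsa = TruncatedSVD(n_components=k).fit_transform(tfidf)
        target_vec = lsa[-1:].reshape(1, -1)   # last row is the target summary
        wild_vecs = lsa[:-1]                   # remaining rows are the wild summaries
        # Mean cosine similarity between the target and every wild summary
        return float(cosine_similarity(target_vec, wild_vecs).mean())

    # Toy usage example
    wild = [
        "The study tests a new grading method.",
        "A crowdsourced approach is used to grade summaries.",
        "Summaries are scored automatically using similarity.",
    ]
    print(clsas_score("The paper grades summaries with a crowd-based method.", wild))
    ```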

    Web knowledge bases

    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems, enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated KBs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks.
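
    The abstract's link-based disambiguation idea, using inbound web links as (anchor text, entity) evidence for Named Entity Linking, can be illustrated with a very small sketch. The data, function name (`link_most_common`), and the simple "most common link target" rule below are assumptions for illustration only, not the thesis's actual models.

    ```python
    # Minimal sketch of link-derived entity disambiguation: estimate how often an
    # anchor text links to each KB entity, then resolve a mention to the most
    # frequently linked entity. Training pairs here are toy, illustrative data.
    from collections import Counter, defaultdict

    # Hypothetical (anchor text, linked entity) pairs harvested from inbound web links
    link_pairs = [
        ("jaguar", "Jaguar_Cars"),
        ("jaguar", "Jaguar_Cars"),
        ("jaguar", "Jaguar_(animal)"),
        ("apple", "Apple_Inc."),
        ("apple", "Apple_(fruit)"),
        ("apple", "Apple_Inc."),
    ]

    # Commonness model: counts approximating P(entity | anchor text)
    anchor_to_entities = defaultdict(Counter)
    for anchor, entity in link_pairs:
        anchor_to_entities[anchor.lower()][entity] += 1

    def link_most_common(mention):
        """Resolve a mention to the entity most often linked with that anchor text."""
        counts = anchor_to_entities.get(mention.lower())
        return counts.most_common(1)[0][0] if counts else None

    print(link_most_common("Jaguar"))  # -> Jaguar_Cars (the most common link target)
    ```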