Generating Concise and Readable Summaries of XML Documents

Author: Ifrim Georgiana
Kumar Kondreddi Sarath
Ramanath Maya
Publication date: 01/01/2009

XML has become the de facto standard for data representation and exchange, resulting in large-scale repositories and warehouses of XML data. In order for users to understand and explore these large collections, a summarized, bird's-eye view of the available data is a necessity. In this paper, we are interested in semantic XML document summaries which present the "important" information available in an XML document to the user. In the best case, such a summary is a concise replacement for the original document itself. At the other extreme, it should at least help the user make an informed choice as to the relevance of the document to their needs. We address the two main issues which arise in producing such meaningful and concise summaries: i) which tags or text units are important and should be included in the summary, and ii) how to generate summaries of different sizes. We conduct user studies with different real-life datasets and show that our methods are useful and effective in practice.

arXiv.org e-Print Archive

MPG.PuRe

A Novel ILP Framework for Summarizing Content with High Lexical Variety

Author: Almeida
Celikyilmaz
Conroy
DIANE LITMAN
FEI LIU
Goodfellow
Li
Li
Luo
Luo
Luo
Martins
Mazumder
Mosteller
Narayan
Qian
Ren
Tarnpradab
Wang
WENCAN LUO
Wilson
Xiong
ZITAO LIU
Publication date: 25/07/2018

Summarizing content contributed by individuals can be challenging, because people make different lexical choices even when describing the same events. However, there remains a significant need to summarize such content. Examples include student responses to post-class reflective questions, product reviews, and news articles published by different news agencies about the same events. The high lexical diversity of these documents hinders a system's ability to identify salient content and reduce summary redundancy. In this paper, we overcome this issue by introducing an integer linear programming-based summarization framework. It incorporates a low-rank approximation to the sentence-word co-occurrence matrix to intrinsically group semantically similar lexical items. We conduct extensive experiments on datasets of student responses, product reviews, and news documents. Our approach compares favorably to a number of extractive baselines as well as a neural abstractive summarization system. Finally, the paper sheds light on when and why the proposed framework is effective at summarizing content with high lexical variety. Comment: Accepted for publication in the journal of Natural Language Engineering, 201
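
To illustrate why a low-rank approximation of a sentence-word co-occurrence matrix can group related words, here is a minimal sketch. It is not the paper's method: it swaps in a plain rank-1 approximation computed by power iteration, and the function name `rank_one_approximation` and the toy matrix are assumptions made for illustration only.

```python
import math


def rank_one_approximation(matrix, iterations=100):
    """Approximate `matrix` (a list of equal-length rows) by a rank-1 matrix.

    Uses power iteration on A^T A to estimate the dominant right singular
    vector v, then returns (A v) v^T. Illustrative only.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Start from a uniform unit vector over the word dimension.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # An all-zero matrix is its own rank-1 approximation.
            return [[0.0] * n_cols for _ in range(n_rows)]
        v = [x / norm for x in w]
    # A v equals the singular value times the left singular vector.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical data: rows are sentences, columns are three words.
# Each sentence uses two of the three words.
cooccurrence = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]
smoothed = rank_one_approximation(cooccurrence)
```

In this toy case every zero entry of `cooccurrence` receives a positive weight in `smoothed`, so a sentence gets partial credit for a word it never used but that co-occurs with its own words elsewhere. That is the general intuition behind using a low-rank matrix in place of raw counts.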

arXiv.org e-Print Archive

Crossref

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Movie Popularity Classification based on Inherent Movie Attributes using C4.5, PART and Correlation Coefficient

Author: Ahmed Tanvir
Asad Khalid Ibnal
Rahman Md. Saiedur
Publication date: 01/05/2012

The abundance of movie data across the internet makes it an obvious candidate for machine learning and knowledge discovery. However, most research is directed towards bi-polar classification of movies or the generation of movie recommendation systems based on reviews given by viewers on various internet sites. Classification of movie popularity based solely on the attributes of a movie (e.g., actor, actress, director rating, language, country, and budget) has received less attention because of the large number of attributes associated with each movie and their differences in dimension. In this paper, we propose a classification scheme for pre-release movie popularity based on inherent attributes using the C4.5 and PART classifier algorithms, and we define the relation between attributes of post-release movies using the correlation coefficient. Comment: 6 page
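
The abstract does not say which correlation coefficient is used. As a minimal sketch, assuming the common Pearson coefficient, the relation between two numeric movie attributes could be measured as follows. The function name and the sample values are hypothetical and are not taken from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and centered squares.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up attribute values for four movies (illustration only).
budgets_in_millions = [10.0, 20.0, 30.0, 40.0]
ratings = [5.0, 6.0, 7.0, 8.0]
correlation = pearson_correlation(budgets_in_millions, ratings)
```

A value near 1 indicates that the two attributes rise together, a value near -1 that one falls as the other rises, and a value near 0 that there is no linear relation.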

arXiv.org e-Print Archive

VBN

STRATEGIST: a program that models strategy-driven and content-driven inference behavior

Author: Eiselt Kurt P.
Granger Richard H.
Holbrook Jennifer K.
Publication venue: eScholarship, University of California
Publication date: 01/01/1983

In the course of understanding a text, different readers use different inference strategies to guide their choice of interpretations of the events in the text. This is in contrast to previous computer models of understanding, which all use content-driven inference. The separate strategies are theorized to be composed of the same component inference processes, but of different rules for application of the processes. The use of different strategies occasionally results in different results of new experimental data and a working computer program, called STRATEGIST, that models both strategy-driven and content-driven inference behavior. The rules which make up two of these strategies are presented.

eScholarship - University of California

BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking

Author: C Luo
DM Blei
J Leskovec
J Leskovec
J-M Fourneau
LA Barroso
T Rabl
Z Jia
Publication date: 26/02/2014

Data generation is a key issue in big data benchmarking that aims to generate application-specific data sets to meet the 4V requirements of big data. Specifically, big data generators need to generate scalable data (Volume) of different types (Variety) under controllable generation rates (Velocity) while keeping the important characteristics of raw data (Veracity). This gives rise to various new challenges in designing generators efficiently and successfully. To date, most existing techniques can only generate limited types of data and support specific big data systems such as Hadoop. Hence we develop a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity. The effectiveness of BDGS is demonstrated by developing six data generators covering three representative data types (structured, semi-structured and unstructured) and three data sources (text, graph, and table data).

arXiv.org e-Print Archive

Crossref

TimeMachine: Timeline Generation for Knowledge-Base Entities

Author: Baeza-Yates R. A.
Dasgupta A.
Do Q. X.
Graus D.
Krause A.
Lin C.-Y.
Lin H.
Ling X.
Minoux M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/06/2015

We present a method called TIMEMACHINE to generate a timeline of events and relations for entities in a knowledge base. For example, for an actor, such a timeline should show the most important professional and personal milestones and relationships, such as works, awards, collaborations, and family relationships. We develop three orthogonal timeline quality criteria that an ideal timeline should satisfy: (1) it shows events that are relevant to the entity; (2) it shows events that are temporally diverse, so they distribute along the time axis, avoiding visual crowding and allowing for easy user interaction, such as zooming in and out; and (3) it shows events that are content diverse, so they contain many different types of events (e.g., for an actor, it should show movies and marriages and awards, not just movies). We present an algorithm to generate such timelines for a given time period and screen size, based on submodular optimization and web co-occurrence statistics, with provable performance guarantees. A series of user studies using Mechanical Turk shows that all three quality criteria are crucial to produce quality timelines and that our algorithm significantly outperforms various baseline and state-of-the-art methods. Comment: To appear at ACM SIGKDD KDD'15. 12pp, 7 fig. With appendix. Demo and other info available at http://cs.stanford.edu/~althoff/timemachine
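
The three criteria above can be sketched as a greedy selection under a simple submodular objective. This is not the paper's algorithm or objective: the scoring function, the weights, the five-year time buckets, and the sample events are all assumptions chosen for illustration, and no web co-occurrence statistics are used.

```python
def timeline_score(selected, type_weight=1.0, time_weight=1.0, bucket_years=5):
    """Score a set of events: relevance plus rewards for distinct types and time buckets.

    A sum of relevances plus two coverage terms is monotone submodular,
    which is the property that makes greedy selection a reasonable heuristic.
    """
    relevance = sum(event["relevance"] for event in selected)
    distinct_types = {event["type"] for event in selected}
    distinct_buckets = {event["year"] // bucket_years for event in selected}
    return relevance + type_weight * len(distinct_types) + time_weight * len(distinct_buckets)


def greedy_timeline(events, k):
    """Pick up to k events, each time adding the one with the largest score gain."""
    selected = []
    remaining = list(events)
    while remaining and len(selected) < k:
        base = timeline_score(selected)
        best = max(remaining, key=lambda e: timeline_score(selected + [e]) - base)
        selected.append(best)
        remaining.remove(best)
    # Present the timeline in chronological order.
    return sorted(selected, key=lambda e: e["year"])


# Hypothetical events for an actor (illustration only).
events = [
    {"title": "Movie A", "type": "movie", "year": 2001, "relevance": 0.9},
    {"title": "Movie B", "type": "movie", "year": 2002, "relevance": 0.8},
    {"title": "Award", "type": "award", "year": 2003, "relevance": 0.5},
    {"title": "Marriage", "type": "marriage", "year": 2010, "relevance": 0.4},
]
timeline = greedy_timeline(events, k=3)
```

With these made-up values, the second movie is left out even though it is more relevant than the award or the marriage, because it adds neither a new event type nor a new time bucket.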

arXiv.org e-Print Archive

Crossref