Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
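The abstract names text categorization as one of the tasks that compression handles adaptively. The sketch below illustrates the general compression-as-similarity idea, with gzip standing in for an adaptive compressor and nearest-neighbour classification by normalized compression distance; it is a minimal illustration of the principle, not the paper's own algorithm.

```python
import gzip

def compressed_size(s: str) -> int:
    """Compressed size of a string in bytes (gzip stands in for the
    adaptive compressor; the idea is compressor-agnostic)."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def categorize(doc: str, labelled: dict[str, list[str]]) -> str:
    """Assign doc to the category whose examples compress best together with it."""
    return min(
        labelled,
        key=lambda label: min(ncd(doc, example) for example in labelled[label]),
    )

if __name__ == "__main__":
    training = {
        "sport": ["the team won the match after extra time",
                  "the striker scored twice in the second half"],
        "finance": ["shares fell sharply after the earnings report",
                    "the central bank raised interest rates again"],
    }
    print(categorize("the goalkeeper saved a late penalty", training))
```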
The Cerevoice Blizzard Entry 2007: Are Small Database Errors Worse than Compression Artifacts?
In commercial systems the memory footprint of unit selection systems is often a key issue. This is especially true for PDAs and other embedded devices. In this year's Blizzard entry, CereProc® gave itself the criterion that the full-database system entered would have a smaller memory footprint than either of the two smaller-database entries. This was accomplished by applying Speex speech compression to the full-database entry. In turn, the set of small-database techniques used to improve the quality of the small-database systems in last year's entry was extended. Finally, for all systems, two quality-control methods were applied to the underlying database to improve the match of the lexicon and transcription to the underlying data. Results suggest that the mild audio-quality artifacts introduced by lossy compression have almost as much impact on MOS-perceived quality as the concatenation errors introduced by sparse data in the smaller systems with bulked diphones. Index Terms: speech synthesis, unit selection.
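As a purely illustrative companion to the compression step described above, the following sketch batch-transcodes a directory of WAV unit files to Speex. It assumes an ffmpeg build with libspeex enabled; the entry's actual encoder settings and tooling are not given in the abstract.

```python
# Hypothetical batch compression of a unit-selection voice database with Speex.
# Assumes an ffmpeg binary built with libspeex; paths and quality are illustrative.
import pathlib
import subprocess

def compress_database(wav_dir: str, out_dir: str, quality: int = 5) -> None:
    """Transcode every WAV unit file to Speex (in an Ogg container)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(pathlib.Path(wav_dir).glob("*.wav")):
        target = out / (wav.stem + ".ogg")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(wav),
             "-c:a", "libspeex", "-q:a", str(quality), str(target)],
            check=True,
        )

if __name__ == "__main__":
    compress_database("voice_db/wav", "voice_db/speex")
```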
Real-time data analysis at the LHC: present and future
The Large Hadron Collider (LHC), which collides protons at an energy of 14 TeV, produces hundreds of exabytes of data per year, making it one of the largest sources of data in the world today. At present it is not possible even to transfer most of this data from the four main particle detectors at the LHC to "offline" data facilities, much less to permanently store it for future processing. For this reason the LHC detectors are equipped with real-time analysis systems, called triggers, which process this volume of data and select the most interesting proton-proton collisions. The LHC experiment triggers reduce the data produced by the LHC by a factor of between 1,000 and 100,000, to tens of petabytes per year, allowing its economical storage and further analysis. The bulk of this data reduction is performed by custom electronics which ignores most of the data in its decision making, and is therefore unable to exploit the most powerful known data analysis strategies. I cover the present status of real-time data analysis at the LHC, before explaining why future upgrades of the LHC experiments will increase the volume of data which can be sent off the detector and into off-the-shelf data processing facilities (such as CPU or GPU farms) to tens of exabytes per year. This development will simultaneously enable a vast expansion of the physics programme of the LHC's detectors and make it mandatory to develop and implement a new generation of real-time multivariate analysis tools in order to fully exploit this new potential of the LHC. I explain what work is ongoing in this direction and motivate why more effort is needed in the coming years. Comment: Contribution to the proceedings of the HEPML workshop, NIPS 2014. 20 pages, 5 figures.
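To make the trigger idea concrete, the toy sketch below applies a simple multivariate selection to synthetic events and keeps only the small fraction passing a threshold, giving a reduction factor in the 1/1,000 to 1/100,000 range quoted above. The feature names, score, and threshold are illustrative only and bear no relation to the experiments' real trigger menus.

```python
# Toy trigger: keep only the small fraction of events whose multivariate
# score passes a threshold. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def generate_events(n: int) -> np.ndarray:
    """Fake events: columns are transverse momentum (GeV) and an isolation variable."""
    pt = rng.exponential(scale=5.0, size=n)        # steeply falling spectrum
    isolation = rng.uniform(0.0, 1.0, size=n)
    return np.column_stack([pt, isolation])

def trigger_decision(events: np.ndarray) -> np.ndarray:
    """A simple linear multivariate score; events above threshold are kept."""
    pt, isolation = events[:, 0], events[:, 1]
    score = 0.8 * (pt / 50.0) + 0.2 * isolation
    return score > 0.9

if __name__ == "__main__":
    events = generate_events(1_000_000)
    kept = trigger_decision(events)
    n_kept = int(kept.sum())
    print(f"kept {n_kept} of {len(events)} events "
          f"(reduction factor ~1/{len(events) // max(n_kept, 1)})")
```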
Real-time image streaming over a low-bandwidth wireless camera network
In this paper we describe the recent development of a low-bandwidth wireless camera sensor network. We propose a simple, yet effective, network architecture which allows multiple cameras to be connected to the network and to synchronize their communication schedules. Image compression of greater than 90% is performed at each node on a local DSP coprocessor, resulting in nodes using 1/8th the energy compared to streaming uncompressed images. We briefly introduce the Fleck wireless node and the DSP/camera sensor, and then outline the network architecture and compression algorithm. The system is able to stream color QVGA images over the network to a base station at up to 2 frames per second.
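For a rough feel of the per-frame savings quoted above, the sketch below JPEG-encodes a synthetic QVGA frame and reports the size reduction. The Fleck nodes run their own compression algorithm on a DSP coprocessor; JPEG via Pillow is only a stand-in here, and the frame content is synthetic.

```python
# Illustrative per-frame compression of a QVGA (320x240) RGB image with JPEG.
from io import BytesIO

import numpy as np
from PIL import Image

def compress_frame(frame: np.ndarray, quality: int = 40) -> bytes:
    """Encode one QVGA RGB frame (240x320x3, uint8) as JPEG."""
    buf = BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

if __name__ == "__main__":
    # Synthetic smooth test frame; a real node would grab frames from the camera sensor.
    ramp = np.linspace(0, 255, 320, dtype=np.uint8)
    channel = np.broadcast_to(ramp, (240, 320)).copy()
    frame = np.stack([channel, channel[::-1], channel], axis=-1)
    raw_bytes = frame.nbytes                      # 230,400 bytes uncompressed
    jpeg_bytes = len(compress_frame(frame))
    print(f"raw: {raw_bytes} B  jpeg: {jpeg_bytes} B  "
          f"saved: {100 * (1 - jpeg_bytes / raw_bytes):.1f}%")
```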
Induction of Word and Phrase Alignments for Automatic Document Summarization
Current research in automatic single-document summarization is dominated by two effective, yet naive, approaches: summarization by sentence extraction, and headline generation via bag-of-words models. While successful in some tasks, neither of these models is able to adequately capture the large set of linguistic devices utilized by humans when they produce summaries. One possible explanation for the widespread use of these models is that good techniques have been developed to extract appropriate training data for them from existing document/abstract and document/headline corpora. We believe that future progress in automatic summarization will be driven both by the development of more sophisticated, linguistically informed models and by more effective leveraging of document/abstract corpora. In order to open the doors to simultaneously achieving both of these goals, we have developed techniques for automatically producing word-to-word and phrase-to-phrase alignments between documents and their human-written abstracts. These alignments make explicit the correspondences that exist in such document/abstract pairs, and create a potentially rich data source from which complex summarization algorithms may learn. This paper describes experiments we have carried out to analyze the ability of humans to perform such alignments, and, based on these analyses, we describe experiments for creating them automatically. Our model for the alignment task is based on an extension of the standard hidden Markov model, and learns to create alignments in a completely unsupervised fashion. We describe our model in detail and present experimental results that show that our model is able to learn to reliably identify word- and phrase-level alignments in a corpus of document/abstract pairs.
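The decoding step of an HMM-style word alignment can be illustrated with a small Viterbi search. The toy below uses a fixed word-match emission score and a transition that penalises large jumps in document position; the paper's model is a richer extension of the HMM trained unsupervised, so this is only a minimal illustration, not the authors' method.

```python
# Toy Viterbi word alignment between a document and its abstract.
import math

def viterbi_align(doc: list[str], summ: list[str]) -> list[int]:
    """For each summary word, return the index of the aligned document word."""
    def emission(d: str, s: str) -> float:
        return 0.0 if d.lower() == s.lower() else math.log(1e-3)

    def transition(prev: int, cur: int) -> float:
        return -0.5 * abs(cur - prev - 1)   # prefer short, monotone jumps

    n = len(doc)
    scores = [emission(doc[j], summ[0]) for j in range(n)]
    backptrs = []
    for word in summ[1:]:
        new_scores, row = [], []
        for j in range(n):
            k = max(range(n), key=lambda p: scores[p] + transition(p, j))
            new_scores.append(scores[k] + transition(k, j) + emission(doc[j], word))
            row.append(k)
        scores, backptrs = new_scores, backptrs + [row]

    # Trace the best path back through the stored backpointers.
    j = max(range(n), key=lambda p: scores[p])
    path = [j]
    for row in reversed(backptrs):
        j = row[j]
        path.append(j)
    return list(reversed(path))

if __name__ == "__main__":
    document = "the cat sat on the mat and then the cat slept".split()
    abstract = "the cat slept".split()
    print(list(zip(abstract, viterbi_align(document, abstract))))
```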
To dash or to dawdle: verb-associated speed of motion influences eye movements during spoken sentence comprehension
In describing motion events, verbs of manner provide information about the speed of agents or objects in those events. We used eye tracking to investigate how inferences about this verb-associated speed of motion would influence the time course of attention to a visual scene that matched an event described in language. Eye movements were recorded as participants heard spoken sentences with verbs that implied a fast (“dash”) or slow (“dawdle”) movement of an agent towards a goal. These sentences were heard whilst participants concurrently looked at scenes depicting the agent and a path which led to the goal object. Our results indicate a mapping of events onto the visual scene consistent with participants mentally simulating the movement of the agent along the path towards the goal: when the verb implies a slow manner of motion, participants look more often and for longer along the path to the goal; when the verb implies a fast manner of motion, participants tend to look earlier at the goal and less along the path. These results reveal that event comprehension in the presence of a visual world involves establishing and dynamically updating the locations of entities in response to linguistic descriptions of events.
A Novel ILP Framework for Summarizing Content with High Lexical Variety
Summarizing content contributed by individuals can be challenging, because people make different lexical choices even when describing the same events. Nevertheless, there remains a significant need to summarize such content. Examples include student responses to post-class reflective questions, product reviews, and news articles published by different news agencies about the same events. The high lexical diversity of these documents hinders a system's ability to effectively identify salient content and reduce summary redundancy. In this paper, we overcome this issue by introducing an integer linear programming-based summarization framework. It incorporates a low-rank approximation to the sentence-word co-occurrence matrix to intrinsically group semantically similar lexical items. We conduct extensive experiments on datasets of student responses, product reviews, and news documents. Our approach compares favorably to a number of extractive baselines as well as to a neural abstractive summarization system. The paper finally sheds light on when and why the proposed framework is effective at summarizing content with high lexical variety. Comment: Accepted for publication in the journal Natural Language Engineering, 201
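A minimal sketch of the two ingredients described above: a low-rank approximation of the sentence-word co-occurrence matrix (via truncated SVD) to group semantically similar lexical items into latent "concepts", and a small coverage ILP (via PuLP and the bundled CBC solver) that selects sentences under a length budget. The objective, threshold, and budget are hypothetical and simplified relative to the paper's formulation.

```python
# Illustrative low-rank + ILP extractive summarizer; all parameters are toy choices.
import numpy as np
import pulp
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

def summarize(sentences: list[str], budget_words: int = 20, rank: int = 3) -> list[str]:
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(sentences)                      # sentences x words
    svd = TruncatedSVD(n_components=min(rank, X.shape[1] - 1))
    concepts = np.abs(svd.fit_transform(X))               # sentences x concepts
    lengths = [len(s.split()) for s in sentences]
    n_sent, n_conc = concepts.shape

    prob = pulp.LpProblem("summary", pulp.LpMaximize)
    pick = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(n_sent)]
    cover = [pulp.LpVariable(f"c{j}", cat="Binary") for j in range(n_conc)]

    # Maximize the total weight of covered concepts.
    weight = concepts.max(axis=0)
    prob += pulp.lpSum(float(weight[j]) * cover[j] for j in range(n_conc))
    # A concept counts as covered only if some picked sentence expresses it.
    for j in range(n_conc):
        expressers = [i for i in range(n_sent)
                      if concepts[i, j] > 0.1 * concepts[:, j].max()]
        prob += cover[j] <= pulp.lpSum(pick[i] for i in expressers)
    # Summary length budget.
    prob += pulp.lpSum(lengths[i] * pick[i] for i in range(n_sent)) <= budget_words

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for i, s in enumerate(sentences) if pick[i].value() == 1]

if __name__ == "__main__":
    responses = [
        "The proof of the convergence theorem was the most confusing part.",
        "I did not follow the convergence proof in lecture.",
        "The homework on gradient descent was clear and helpful.",
        "Gradient descent examples made the homework easy to finish.",
    ]
    print(summarize(responses))
```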