14,644 research outputs found
Word-to-Word Models of Translational Equivalence
Parallel texts (bitexts) have properties that distinguish them from other
kinds of parallel data. First, most words translate to only one other word.
Second, bitext correspondence is noisy. This article presents methods for
biasing statistical translation models to reflect these properties. Analysis of
the expected behavior of these biases in the presence of sparse data predicts
that they will result in more accurate models. The prediction is confirmed by
evaluation with respect to a gold standard -- translation models that are
biased in this fashion are significantly more accurate than a baseline
knowledge-poor model. This article also shows how a statistical translation
model can take advantage of various kinds of pre-existing knowledge that might
be available about particular language pairs. Even the simplest kinds of
language-specific knowledge, such as the distinction between content words and
function words, is shown to reliably boost translation model performance on
some tasks. Statistical models that are informed by pre-existing knowledge
about the model domain combine the best of both the rationalist and empiricist
traditions
Constrained structure of ancient Chinese poetry facilitates speech content grouping
Ancient Chinese poetry is constituted by structured language that deviates from ordinary language usage [1, 2]; its poetic genres impose unique combinatory constraints on linguistic elements [3]. How does the constrained poetic structure facilitate speech segmentation when common linguistic [4, 5, 6, 7, 8] and statistical cues [5, 9] are unreliable to listeners in poems? We generated artificial Jueju, which arguably has the most constrained structure in ancient Chinese poetry, and presented each poem twice as an isochronous sequence of syllables to native Mandarin speakers while conducting magnetoencephalography (MEG) recording. We found that listeners deployed their prior knowledge of Jueju to build the line structure and to establish the conceptual flow of Jueju. Unprecedentedly, we found a phase precession phenomenon indicating predictive processes of speech segmentationāthe neural phase advanced faster after listeners acquired knowledge of incoming speech. The statistical co-occurrence of monosyllabic words in Jueju negatively correlated with speech segmentation, which provides an alternative perspective on how statistical cues facilitate speech segmentation. Our findings suggest that constrained poetic structures serve as a temporal map for listeners to group speech contents and to predict incoming speech signals. Listeners can parse speech streams by using not only grammatical and statistical cues but also their prior knowledge of the form of language
SeLeCT: a lexical cohesion based news story segmentation system
In this paper we compare the performance of three distinct approaches to lexical cohesion based text segmentation. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast our segmentation task requires the discovery of topical units of text i.e., distinct news stories from broadcast news programmes. Our approach to news story segmentation (the SeLeCT system) is based on an analysis of lexical cohesive strength between textual units using a linguistic technique called lexical chaining. We evaluate the relative performance of SeLeCT with respect to two other cohesion based segmenters: TextTiling and C99. Using a recently introduced evaluation metric WindowDiff, we contrast the segmentation accuracy of each system on both "spoken" (CNN news transcripts) and "written" (Reuters newswire) news story test sets extracted from the TDT1 corpus
Rank-frequency relation for Chinese characters
We show that the Zipf's law for Chinese characters perfectly holds for
sufficiently short texts (few thousand different characters). The scenario of
its validity is similar to the Zipf's law for words in short English texts. For
long Chinese texts (or for mixtures of short Chinese texts), rank-frequency
relations for Chinese characters display a two-layer, hierarchic structure that
combines a Zipfian power-law regime for frequent characters (first layer) with
an exponential-like regime for less frequent characters (second layer). For
these two layers we provide different (though related) theoretical descriptions
that include the range of low-frequency characters (hapax legomena). The
comparative analysis of rank-frequency relations for Chinese characters versus
English words illustrates the extent to which the characters play for Chinese
writers the same role as the words for those writing within alphabetical
systems.Comment: To appear in European Physical Journal B (EPJ B), 2014 (22 pages, 7
figures
Speech and hand transcribed retrieval
This paper describes the issues and preliminary work involved
in the creation of an information retrieval system that will
manage the retrieval from collections composed of both speech
recognised and ordinary text documents. In previous work, it
has been shown that because of recognition errors, ordinary
documents are generally retrieved in preference to recognised
ones. Means of correcting or eliminating the observed bias is
the subject of this paper. Initial ideas and some preliminary
results are presented
- ā¦