1,037 research outputs found
Satellite Workshop On Language, Artificial Intelligence and Computer Science for Natural Language Processing Applications (LAICS-NLP): Discovery of Meaning from Text
This paper proposes a novel method to disambiguate important words from a collection of documents. The
hypothesis that underlies this approach is that there is a
minimal set of senses that are significant in characterizing a context. We extend Yarowsky's one sense
per discourse [13] further to a collection of related
documents rather than a single document. We perform
distributed clustering on a set of features representing
each of the top ten categories of documents in the
Reuters-21578 dataset. Groups of terms that have a
similar term distributional pattern across documents were
identified. WordNet-based similarity measurement was
then computed for terms within each cluster. An
aggregation of the associations in WordNet that was
employed to ascertain term similarity within clusters has
provided a means of identifying clusters' root senses
Investigating the build-up of precedence effect using reflection masking
The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874–884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of the auditory signal processing, such an approach represents a bottom-up approach to the buildup of precedence. Three conditioner configurations measuring a possible buildup of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5–15 ms. No buildup of reflection suppression was observed for any of the conditioner configurations. Buildup of template (decrease in RMT for two of the conditioners), on the other hand, was found to be delay dependent. For five of six listeners, with reflection delay=2.5 and 15 ms, RMT decreased relative to the baseline. For 5- and 10-ms delay, no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a buildup of reflection suppression. This confirms suggestions that higher level processing is involved in PE buildup. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels
Concept and entity grounding using indirect supervision
Extracting and disambiguating entities and concepts is a crucial step toward understanding natural language text. In this thesis, we consider the problem of grounding concepts and entities mentioned in text to one or more knowledge bases (KBs). A well-studied scenario of this problem is the one in which documents are given in English and the goal is to identify concept and entity mentions, and find the corresponding entries the mentions refer to in Wikipedia. We extend this problem in two directions: First, we study identifying and grounding entities written in any language to the English Wikipedia. Second, we investigate using multiple KBs which do not contain rich textual and structural information Wikipedia does.
These more involved settings pose a few additional challenges beyond those addressed in the standard English Wikification problem. Key among them is that no supervision is available to facilitate training machine learning models. The first extension, cross-lingual Wikification, introduces problems such as recognizing multilingual named entities mentioned in text, translating non-English names into English, and computing word similarity across languages. Since it is impossible to acquire manually annotated examples for all languages, building models for all languages in Wikipedia requires exploring indirect or incidental supervision signals which already exist in Wikipedia. For the second setting, we need to deal with the fact that most KBs do not contain the rich information Wikipedia has; consequently, the main supervision signal used to train Wikification rankers does not exist anymore. In this thesis, we show that supervision signals can be obtained by carefully examining the redundancy and relations between multiple KBs. By developing algorithms and models which harvest these incidental signals, we can achieve better performance on these tasks
Universal and language-specific processing : the case of prosody
A key question in the science of language is how speech processing can be influenced by both language-universal and language-specific mechanisms (Cutler, Klein, & Levinson, 2005). My graduate research aimed to address this question by adopting a crosslanguage approach to compare languages with different phonological systems. Of all components of linguistic structure, prosody is often considered to be one of the most language-specific dimensions of speech. This can have significant implications for our understanding of language use, because much of speech processing is specifically tailored to the structure and requirements of the native language. However, it is still unclear whether prosody may also play a universal role across languages, and very few comparative attempts have been made to explore this possibility. In this thesis, I examined both the production and perception of prosodic cues to prominence and phrasing in native speakers of English and Mandarin Chinese. In focus production, our research revealed that English and Mandarin speakers were alike in how they used prosody to encode prominence, but there were also systematic language-specific differences in the exact degree to which they enhanced the different prosodic cues (Chapter 2). This, however, was not the case in focus perception, where English and Mandarin listeners were alike in the degree to which they used prosody to predict upcoming prominence, even though the precise cues in the preceding prosody could differ (Chapter 3). Further experiments examining prosodic focus prediction in the speech of different talkers have demonstrated functional cue equivalence in prosodic focus detection (Chapter 4). Likewise, our experiments have also revealed both crosslanguage similarities and differences in the production and perception of juncture cues (Chapter 5). Overall, prosodic processing is the result of a complex but subtle interplay of universal and language-specific structure
Phonation Types in Marathi: An Acoustic Investigation
This dissertation presents a comprehensive instrumental acoustic analysis of phonation type distinctions in Marathi, an Indic language with numerous breathy voiced sonorants and obstruents. Important new facts about breathy voiced sonorants, which are crosslinguistically rare, are established: male and female speakers cue breathy phonation in sonorants differently, there is an abundance of trading relations, and--critically--phonation type distinctions are not cued as well by sonorants as by obstruents. Ten native speakers (five male, five female) were recorded producing Marathi words embedded in a carrier sentence. Tokens included plain and breathy voiced stops, affricates, nasals, laterals, rhotics, and approximants before the vowels [a] and [e]. Measures reported for consonants and subsequent vowels include duration, F0, Cepstral Peak Prominence (CPP), and corrected H1-H2*, H1-A1*, H1-A2*, and H1-A3* values. As expected, breathy voice is associated with decreased CPP and increased spectral values. A strong gender difference is revealed: low-frequency measures like H1-H2* cue breathy phonation more reliably in male speech, while CPP--which provides information about the aspiration noise included in the signal--is a more reliable cue in female speech. Trading relations are also reported: time and again, where one cue is weak or absent another cue is strong or present, underscoring the importance of including both genders and multiple vowel contexts when testing phonation type differences. Overall, the cues that are present for obstruents are not necessarily mirrored by sonorants. These findings are interpreted with reference to Dispersion Theory (Flemming 1995; Liljencrants & Lindblom 1972; Lindblom 1986, 1990).
While various incarnations of Dispersion Theory focus on different aspects of perceptual and auditory distinctiveness, a basic claim is that one requirement for phonological contrasts is that they must be perceptually distinct: contrasts that are subject to great confusability are phonologically disfavored. The proposal, then, is that the typology of breathy voiced sonorants is due in part to the fact that they are not well differentiated acoustically. Breathy voiced sonorants are crosslinguistically rare because they do not make for strong phonemic contrasts
Cross-Lingual and Cross-Chronological Information Access to Multilingual Historical Documents
In this chapter, we present our work in realizing information access across different languages and periods. Nowadays, digital collections of historical documents have to handle materials written in many different languages in different time periods. Even in a particular language, there are significant differences over time in terms of grammar, vocabulary and script. Our goal is to develop a method to access digital collections in a wide range of periods from ancient to modern. We introduce an information extraction method for digitized ancient Mongolian historical manuscripts for reducing labour-intensive analysis. The proposed method performs computerized analysis on Mongolian historical documents. Named entities such as personal names and place names are extracted by employing a support vector machine. The extracted named entities are utilized to create a digital edition that reflects an ancient Mongolian historical manuscript written in traditional Mongolian script. The Text Encoding Initiative guidelines are adopted to encode the named entities, transcriptions and interpretations of ancient words. A web-based prototype system is developed for utilizing digital editions of ancient Mongolian historical manuscripts as scholarly tools. The proposed prototype has the capability to display and search traditional Mongolian text and its transliteration in Latin letters along with the highlighted named entities and the scanned images of the source manuscript