174 research outputs found

    Generating Disambiguating Paraphrases for Use in Crowdsourced Judgments of Meaning

    Adapting statistical parsers to new domains requires annotated data, which is expensive and time-consuming to collect. Using crowdsourced annotation data as a “silver standard” is a step towards a more viable solution. To facilitate the collection of this data, we have developed a system for creating semantic disambiguation tasks for use in crowdsourced judgments of meaning. In the system described here, these tasks are generated automatically using surface realizations of structurally ambiguous parse trees, along with minimal use of forced parse structure changes. NSF grant IIS-1319318. Academic Major: Computer and Information Science

    The head-modifier principle and multilingual term extraction

    Advances in Language Engineering may depend on theoretical principles originating from linguistics, since both share a common object of enquiry: natural language structures. We outline an approach to term extraction that rests on theoretical claims about the structure of words. We use the structural properties of compound words to elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy. The theoretical claims revolve around the head-modifier principle, which determines the formation of a major class of compounds. Significantly, it has been suggested that the principle operates in languages other than English. To demonstrate the extendibility of our approach beyond English, we present a case study of term extraction in Chinese, a language whose written form is the vehicle of communication for over 1.3 billion language users, and which therefore has great significance for the development of language engineering technologies.

    Processing subject-object ambiguities in Dutch

    Various clause types in Dutch and German are at least temporarily ambiguous with respect to the order of subject and object. A number of previous studies regarding the processing of such subject-object ambiguities have reported a preference for a subject-object interpretation. This order preference has generally been attributed to a syntactic generalization, that is, a generalization which abstracts away from specific properties of the NPs and the verb in the clause. The results of the present experiments suggest, however, that the syntactic subject-object preference is not as strong as has previously been assumed: the discourse-related properties of the NPs also play a role in determining order preferences. First, the subject-object preference for main clauses is much weaker when the first NP is a wh-phrase than when it is a non-deictic definite NP; second, embedded wh-questions may even show an object-subject preference when the second NP is a pronoun. However, whether this non-structural information has an effect, and to what extent, depends on other factors, such as the manner of disambiguation (case, number information), and the point of disambiguation. In Chapter 2 an overview was given of the current literature on subject-object ambiguities in Dutch and German. With only a few exceptions, a preference for the subject-object order has been found, even in cases where plausibility or contextual information favored an object-subject interpretation. Several syntactic accounts of this order preference were discussed. Next, it was argued that information which is not purely structural in nature may also affect the order preference. Several predictions were formulated that were experimentally investigated in Chapters 3 and 4. In Chapter 3 the processing of subject-object and object-subject main clauses was investigated.
Declarative clauses in which the first NP was a non-deictic definite NP were compared with wh-questions in which the first NP was a which-N (welke-N) phrase. Object-subject declaratives impose more restrictions on the discourse context than subject-object declaratives do. Subject- and object-initial wh-questions do not differ in this respect. In addition, subject- and object-initial declaratives have been claimed to differ in terms of phrase structure in a way subject- and object-initial wh-phrases do not. A weaker subject-object preference was therefore expected for the wh-clauses compared to the declaratives. Self-paced reading times (Experiment 1) showed a subject-object preference starting immediately at the disambiguating auxiliary. The difference between wh-clauses and declaratives with respect to the order preference became apparent one word later, suggesting that the nature of the first NP affects ambiguity resolution somewhat later than the overall syntactic subject-object bias. In Chapter 4 the impact of the discourse-related properties of the second NP was investigated using temporarily ambiguous embedded wh-questions. Pronouns differ from non-pronominal definite NPs in the frequency with which they are used in the subject position. This is related to the discourse-status of the elements they refer to. Non-pronominal definite NPs can either refer to given information or introduce new entities into the discourse. In contrast, pronouns are generally used to refer to given entities in the discourse which are salient. Given, salient entities are generally also the topic of discussion. The prototypical position for a topic is the subject position. Pronouns therefore bias towards a subject interpretation. This bias is much weaker for definite NPs, especially if sentences are presented in absence of a discourse context and the definite NP is taken to introduce new entities. A pronoun in second position thus introduces a bias for the object-subject order.
This bias is in competition with the syntactic bias for the subject-object order. If the discourse-related properties of the NPs are taken into account during the processing of order ambiguities, a weak subject-object preference, or even a preference for an object-subject order is expected if the second NP is a pronoun. First, an off-line completion study (Experiment 2) was conducted, showing that the syntactic preference for the subject-object order also holds for embedded wh-clauses. Next, three experiments were carried out on wh-clauses in which the second NP was a case-marked pronoun. An off-line questionnaire study (Experiment 3) showed that people choose the nominative form (object-subject interpretation) more often than the accusative form (subject-object order). The preference for an object-subject order was replicated in two on-line studies. Self-paced grammaticality decision times (Experiment 4) and self-paced reading times (Experiment 5) showed an increase for the subject-object order relative to the object-subject order starting at or immediately after the disambiguating pronoun. No preference for the subject-object order was seen. The object-subject preference was partially replicated in Experiment 6. In this experiment, the wh-clauses were disambiguated by number information at the finite auxiliary in penultimate position. The pronoun itself was ambiguous. In addition, the length of the ambiguous region was manipulated: either one or six words separated the second NP pronoun from the disambiguating auxiliary. Again, an object-subject preference was found, but only in the conditions with a long ambiguous region. In the short conditions, subject-object clauses were responded to faster than object-subject clauses, but this difference was not significant. Finally, in Experiment 7, wh-clauses containing a case-ambiguous pronoun were compared with clauses containing a non-pronominal definite NP in second position.
This time, four words separated the second NP from the disambiguating auxiliary. The clauses with a definite NP showed a tendency for a subject-object preference. No preference for either order was found for clauses containing the ambiguous pronoun, in contrast to the object-subject preference found in conditions with a case-marked pronoun (Experiments 3-5) or in conditions in which the ambiguous region was six words in length (Experiment 6). These results suggest that the discourse-related properties of the NPs can indeed have an effect on order preference. However, the time course and strength of this effect depend on other factors such as the manner and point of disambiguation. In Chapter 5 the frequencies of occurrence of subject and object-initial wh-clauses were investigated in a sample of written Dutch texts. Collapsing across the various types of predicates, the subject-initial order is the most frequent. However, when counts are restricted to transitive and ditransitive predicates only, the object-subject order is the most frequent. The nature of the second NP appears to be of influence: the object-subject order is significantly more frequent in clauses containing a pronoun than in clauses containing a definite NP or an indefinite NP. These frequency data are interesting in the light of frequency-based theories of sentence processing. These theories predict a correspondence between processing difficulty and frequency: the most frequent solution to the ambiguity should elicit the least processing difficulties. An important issue in this respect is the grain-size problem: which categories can be distinguished in terms of frequency, and on the basis of which information? The present data suggest that a grain-size according to which transitive welke-questions are treated as one, separate class cannot be correct for the following reason. For transitive welke-questions in general, the object-subject order is the most frequent.
Transitive welke-clauses containing a definite NP, however, showed a reading time advantage for the subject-object order (Experiment 6). Tabulating frequencies separately for welke-questions containing a definite NP will not solve this problem: the object-subject order for such clauses is still more frequent than the subject-object order, in spite of the reversed parsing preference. A possible solution is to assume either that grain-size is even finer, or that grain-size is not fixed, but that several levels of abstraction are taken into consideration during processing. The results of the experiments and the corpus study were summarized in Chapter 6. The data suggest that not only syntactic and discourse-related preferences play a role in determining the order preference, but that the manner and point of disambiguation are also of importance. Which order is ultimately preferred, the strength of this preference and the development of the preference over time are determined by the interplay of these and other factors. It was shown that these factors do not contribute equally strongly; rather, some factors or combinations of factors are stronger than others. Finally, four current theories of sentence processing were discussed. Garden-path theories and constraint-based theories account for the Dutch data most readily. These two approaches differ with respect to the modularity of syntactic processing: according to garden-path theories an initial, informationally encapsulated syntactic stage of processing can be distinguished; non-syntactic information may affect processing only somewhat later. According to constraint-based theories, all kinds of information are made use of immediately when available. Future research should be directed at constructing quantitative models which capture the relative impact of various sources of information.
Only then can precise predictions be made which can be used to decide between garden-path and constraint-based approaches to sentence processing.

    A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions

    In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'. In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on the building of a large idiom corpus, which consists of developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches. In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.

    Processing Coordinated Verb Phrases: The Relevance of Lexical-Semantic, Conceptual, and Contextual Information towards Establishing Verbal Parallelism.

    This dissertation examines the influence of lexical-semantic representations, conceptual similarity, and contextual fit on the processing of coordinated verb phrases. The study integrates information gleaned from current linguistic theory with current psycholinguistic approaches to examining the processing of coordinated verb phrases. It has been claimed that in coordinated phrases, one conjunct may influence the processing of a second conjunct if they are sufficiently similar. For example, the likelihood of adopting an intransitive analysis for the optionally transitive verb of a subordinated clause in sentences like "Although the pirate ship sank the nearby British vessel did not send out lifeboats" may be increased if the ambiguous verb ("sank") is coordinated with a preceding, intransitively biased verb ("halted and sank"). Similarly, processing of the second conjunct may be facilitated when coordinated with a similar first conjunct. Such effects, and others in this vein, have often been designated “parallelism effects.” However, notions of similarity underlying such effects have long been ill-defined. Many existing studies rely on relatively shallow features like syntactic category information or argument structure generalizations, such as transitive or intransitive, as a basis for structural comparison. But it may be that deeper levels of lexical-semantic representation and more varied, semantic or conceptual sources of information are also relevant to establishing similarity between conjuncts. In addition, little has been done to integrate parallelism effects into theories of the processing architecture underlying such effects, particularly for studies involving syntactic ambiguity resolution.
Using two word-by-word reading and three eyetracking-while-reading experiments, I investigate what contribution detailed lexical-semantic representations, as well as conceptual and contextual information, make towards establishing parallel coordination in the online processing of coordinated verb phrases. The five studies demonstrate that parallelism effects are indeed sensitive to deeper representational information, conceptual similarity, and contextual fit. Furthermore, by controlling for deeper representational information, it is demonstrated that expected facilitatory patterns arising from coordination of similar conjuncts may be disrupted. Implications for the architecture of the processing system are discussed, and it is argued that constraint-based/competition models of processing best accommodate the pattern of results. Ph.D., Linguistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/78841/1/damont_1.pd

    The reanalysis and interpretation of garden-path sentences by native speakers and second language learners

    This dissertation examines factors (verb bias and plausibility) that influence reanalysis processes in native and non-native processing of English and Mandarin garden-path sentences (Chapters 2 and 3) and the relationship between the amount of reanalysis and final interpretation of such sentences (Chapter 4). Verb bias refers to the likelihood of a particular verb taking a particular argument structure, such as a direct object (DO) or a sentential complement (SC). Previous research has demonstrated that native speakers of English are able to use verb bias information fast enough to generate predictions about the upcoming syntactic structure and that verb bias plays a larger role than plausibility in this predictive process (e.g., Garnsey, Pearlmutter, Myers, & Lotocky, 1997). However, little is known about the relative importance of verb bias and plausibility in second language sentence processing. A prevailing view in the L2 psycholinguistic literature claims that L2 learners underuse structural cues during real time processing, and that to compensate, they rely predominantly on lexical-semantic cues (Clahsen & Felser, 2006). What has not been considered on this view is the use of lexically-associated structural cues, such as verb bias. Since such information is both lexical and structural, it is unclear whether L2 learners would be able to use these cues in real-time processing. In two self-paced reading experiments, Chapter 2 compared L1-Mandarin speakers of L2 English and L1-Korean speakers of L2 English with native English speakers on the resolution of temporary DO/SC ambiguity in sentences. Results showed that, similar to native speakers, both L2 groups were able to use the verb bias cue to predict the likely type of following structure, but were unable to use the plausibility cue predictively when the verb bias cue was present, challenging the view that L2 learners rely more on plausibility than syntax during parsing.
While substantial research has been conducted on verb bias effects in English, few studies have examined such effects in other languages, especially in languages that have been found to rely more on plausibility than structural information, such as Mandarin (Su, 2001a, 2001b, 2004). In one self-paced reading experiment, Chapter 3 compared the relative contributions of verb bias and plausibility in processing Mandarin sentences that bore a surface-level resemblance to English sentences with temporary DO/SC ambiguity. Since Mandarin allows null subjects, such a structure is temporarily ambiguous between an embedded clause and a blended structure, in which the object of the first clause is also the subject of the second clause. Results showed that verb bias trumped plausibility in Mandarin, such that readers made use of verb bias cues to anticipate the following structure and were only sensitive to plausibility information when verb bias allowed it, contrary to the claim that Mandarin relies heavily on plausibility in sentence comprehension. In Chapters 2 and 3, reading time (RT) at the disambiguating region in sentences was used as the diagnostic in determining the effects of verb bias and plausibility, based on the assumption that RT at the disambiguation reflects the amount of reanalysis work. In two self-paced reading and two event-related brain potential (ERP) experiments, Chapter 4 demonstrated that RT and ERP on-line measures at the disambiguation might not primarily reflect reanalysis, since both RTs and the amplitudes of the P600 and N400 ERP components were found to be unrelated to the accuracy of the final interpretation of garden-path sentences, as measured by responses to post-sentence questions, thus calling into question traditional assumptions about the meaning of these measures. The original prediction was that more time/effort spent reanalyzing at the disambiguation would lead to more success in question responses.
Instead, whenever there was any trend toward a relationship between the online measures and question responses, it was opposite to the predicted direction, i.e., when more time/effort was spent on the disambiguation, questions tended to be answered less accurately. Chapter 4 thus proposed that the RTs and ERP component amplitudes at the disambiguation may reflect the amount of confusion about and/or competition between different possible interpretations, rather than or in addition to any reanalysis triggered there. Overall, this dissertation examined the reanalysis processes at the disambiguation in garden-path sentences in both native and non-native sentence processing and the link between the reanalysis processes and the final interpretation in native sentence processing. It paved the way for conducting similar research on the final interpretation of garden-path sentences by L2 learners.

    Knowledge base integration in biomedical natural language processing applications

    With the progress of natural language processing in the biomedical field, the lack of annotated data due to regulations and expensive labor remains an issue. In this work, we study the potential of knowledge bases for biomedical language processing to compensate for the shortage of annotated data. Accordingly, we experiment with the integration of a rigorous biomedical knowledge base, the Unified Medical Language System, in three different biomedical natural language processing applications: text simplification, conversational agents for medication adherence, and automatic evaluation of medical students' chart notes. In the first task, we take as a use case simplifying medication instructions to enhance medication adherence among patients. Given the lack of an appropriate parallel corpus, the Unified Medical Language System provided simpler synonyms for an unsupervised system we devise, and we show a positive impact on comprehension through a human subjects study. As for the second task, we devise an unsupervised system to automatically evaluate chart notes written by medical students. The purpose of the system is to speed up the feedback process and enhance the educational experience. With the lack of training corpora, utilizing the Unified Medical Language System proved to enhance the accuracy of evaluation after integration into the baseline system. For the final task, the Unified Medical Language System was used to augment the training data of a conversational agent that educates patients on their medications. As part of the educational procedure, the agent needed to assess the comprehension of the patients by evaluating their answers to predefined questions. Starting with a small seed set of paraphrases of acceptable answers, the Unified Medical Language System was used to artificially augment the original small seed set via synonymy. 
Results did not show an increase in the quality of system output after knowledge base integration, because the majority of errors resulted from mishandling of counts and negations. We later demonstrate the importance of a (lacking) entity linking system to perform optimal integration of biomedical knowledge bases, and we offer a first stride towards solving that problem, along with conclusions on proper training setup and processes for automatic collection of an annotated dataset for biomedical word sense disambiguation.

    Combining linguistics and statistics for high-quality limited domain English-Chinese machine translation

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 86-87). Second language learning is a compelling activity in today's global markets. This thesis focuses on critical technology necessary to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrated into a translation game system. In the language understanding component, the TINA parser is enhanced with bottom-up and long distance constraint features. The results showed that with these features, the Chinese grammar ran ten times faster and covered 15% more of the test set. In the machine translation component, a method combining linguistic and statistical systems is introduced. The English-Chinese translation is done via an intermediate language, "Zhonglish", where the English-Zhonglish translation is accomplished by a parse-and-paraphrase paradigm using hand-coded rules, mainly for structural reconstruction. Zhonglish-Chinese translation is accomplished by a standard phrase-based statistical machine translation system, which mostly handles word sense disambiguation and lexicon mapping. We evaluated the system on an independent test set from the IWSLT travel-domain spoken language corpus. Substantial improvements were achieved for GIZA alignment crossover: we obtained a 45% decrease in crossovers compared to a traditional phrase-based statistical MT system. Furthermore, the BLEU score improved by 2 points. Finally, a framework of the translation game system is described, and the feasibility of integrating the components to produce reference translations and to automatically assess students' translations is verified. By Yushi Xu. S.M.