1,163 research outputs found

    Unified model of phrasal and sentential evidence for information extraction

    Journal Article. Information Extraction (IE) systems that extract role fillers for events typically consider only the local context surrounding a phrase when deciding whether to extract it. Often, however, role fillers occur in clauses that are not directly linked to an event word. We present a new model for event extraction that jointly considers both the local context around a phrase and the wider sentential context in a probabilistic framework. Our approach uses a sentential event recognizer and a plausible role-filler recognizer that is conditioned on event sentences. We evaluate our system on two IE data sets and show that our model performs well in comparison to existing IE systems that rely on local phrasal context alone.
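
The joint-scoring idea in this abstract can be sketched in a few lines: a candidate phrase is extracted only when the product of a sentential event probability and a conditional role-filler probability clears a threshold. The cue words, component "recognizers", and threshold below are invented for illustration; the paper's actual recognizers are trained probabilistic models.

```python
def event_prob(sentence: str) -> float:
    """Toy sentential event recognizer: keyword evidence only."""
    cues = {"attacked": 0.9, "bombing": 0.95, "exploded": 0.9}
    return max((p for cue, p in cues.items() if cue in sentence.lower()),
               default=0.1)

def role_filler_prob(phrase: str, sentence: str) -> float:
    """Toy role-filler recognizer conditioned on the event sentence."""
    # e.g. phrases following "by" are plausible perpetrators
    return 0.8 if f"by {phrase.lower()}" in sentence.lower() else 0.2

def extract(phrase: str, sentence: str, threshold: float = 0.5) -> bool:
    """Extract iff the joint (sentential x phrasal) score is high enough."""
    joint = event_prob(sentence) * role_filler_prob(phrase, sentence)
    return joint >= threshold

print(extract("the rebels", "The embassy was attacked by the rebels."))  # True
print(extract("the rebels", "The rebels held a press conference."))      # False
```

The point of the factorization is that a plausible role filler in a non-event sentence (second call) is suppressed by the low sentential score.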

    Hybridity in MT: experiments on the Europarl corpus

    (Way & Gough, 2005) demonstrate that their Marker-based EBMT system is capable of outperforming a word-based SMT system trained on reasonably large data sets. (Groves & Way, 2005) take this a stage further and demonstrate that while the EBMT system also outperforms a phrase-based SMT (PBSMT) system, a hybrid 'example-based SMT' system incorporating marker chunks and SMT sub-sentential alignments is capable of outperforming both baseline translation models for French-English translation. In this paper, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline system of (Way & Gough, 2005). Using the Europarl (Koehn, 2005) training and test sets, we show that this time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in (Groves & Way, 2005), to create a hybrid 'example-based SMT' system outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality.
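
The data-merging step behind such hybrid 'example-based SMT' systems can be sketched as pooling sub-sentential alignments from the marker-based EBMT system with SMT phrase pairs and re-estimating relative frequencies over the combined counts. The phrase pairs and counts below are invented; the real systems estimate these from aligned corpora.

```python
from collections import Counter

def merge_tables(ebmt_pairs, smt_pairs):
    """Pool (source, target) pair counts from two systems and
    re-estimate P(target | source) over the merged counts."""
    counts = Counter(ebmt_pairs) + Counter(smt_pairs)
    totals = Counter()
    for (src, _tgt), c in counts.items():
        totals[src] += c
    return {(src, tgt): c / totals[src] for (src, tgt), c in counts.items()}

# Toy counts: marker-chunk pairs from EBMT, phrase pairs from SMT.
ebmt = {("la commission", "the commission"): 3, ("en outre", "furthermore"): 2}
smt = {("la commission", "the commission"): 5, ("la commission", "the committee"): 2}

table = merge_tables(ebmt, smt)
print(round(table[("la commission", "the commission")], 2))  # 0.8
```

Pairs attested by both systems accumulate evidence, which is one way merged data can outperform either source table on its own.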

    An exploratory study using the predicate-argument structure to develop methodology for measuring semantic similarity of radiology sentences

    Indiana University-Purdue University Indianapolis (IUPUI). The amount of information produced as electronic free text in healthcare is increasing to levels that humans cannot process in support of their professional practice. Information extraction (IE) is a sub-field of natural language processing with the goal of reducing unstructured free text to structured data. Pertinent to IE is an annotated corpus that frames how IE methods should create the logical expression necessary for processing the meaning of text. Most annotation approaches seek to maximize meaning and knowledge by chunking sentences into phrases and mapping these phrases to a knowledge source to create a logical expression. However, these studies consistently have problems addressing semantics, and none has addressed the issue of semantic similarity (or synonymy) to achieve data reduction. A successful methodology for data reduction depends on a framework that can represent currently popular phrasal methods of IE but also fully represent the sentence. This study explores and reports on the benefits, problems, and requirements of using the predicate-argument statement (PAS) as that framework. The text from which PAS structures are formed is a convenience sample from a prior study: ten synsets of 100 unique sentences from radiology reports deemed by domain experts to mean the same thing.
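
The underlying idea of comparing sentences through predicate-argument structures rather than surface strings can be sketched as below. The hand-built triples and the Jaccard measure are assumptions for illustration only; the thesis derives PAS from actual radiology sentences with domain-expert judgments.

```python
def pas_similarity(pas_a: set, pas_b: set) -> float:
    """Jaccard overlap of (predicate, role, argument) triples."""
    if not pas_a and not pas_b:
        return 1.0
    return len(pas_a & pas_b) / len(pas_a | pas_b)

# Toy PAS for "No evidence of pneumothorax." (negated finding)
s1 = {("show", "ARG1", "pneumothorax"), ("show", "ARGM-NEG", "not")}
# Toy PAS for a bare, non-negated assertion of the same finding.
s3 = {("show", "ARG1", "pneumothorax")}

print(pas_similarity(s1, s1))  # 1.0
print(pas_similarity(s1, s3))  # 0.5
```

Structured comparison lets the negation contribute to the distance, which a bag-of-words match over the surface strings would miss.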

    Doctor of Philosophy

    dissertation. Events are one important type of information throughout text. Event extraction is an information extraction (IE) task that involves identifying entities and objects (mainly noun phrases) that represent important roles in events of a particular type. However, the extraction performance of current event extraction systems is limited because they mainly consider local context (mostly isolated sentences) when making each extraction decision. My research aims to improve both the coverage and accuracy of event extraction by explicitly identifying event contexts before extracting individual facts. First, I introduce new event extraction architectures that incorporate discourse information across a document to seek out and validate pieces of event descriptions within the document. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a unified discourse-guided approach that includes a structured sentence classifier to sequentially read a story and determine which sentences contain event information based on both the local and preceding contexts. Experimental results on two distinct event domains show that, compared to previous event extraction systems, TIER can find more event information while maintaining good extraction accuracy, and LINKER can further improve extraction accuracy. Finding documents that describe a specific type of event is also highly challenging because of the wide variety and ambiguity of event expressions. In this dissertation, I present a multifaceted event recognition approach that uses event-defining characteristics (facets), in addition to event expressions, to effectively resolve the complexity of event descriptions. I also present a novel bootstrapping algorithm that automatically learns event expressions as well as facets of events with minimal human supervision. Experimental results show that the multifaceted event recognition approach can effectively identify documents that describe a particular type of event and make event extraction systems more precise.
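
A bootstrapping loop of the kind described can be condensed to: match documents with the current seed expressions, count candidate facet phrases that co-occur with them, and promote the most frequent candidates for the next round. The toy corpus, seeds, candidate pool, and promotion rule are all invented; this sketch learns only facets, whereas the dissertation's algorithm learns expressions as well.

```python
from collections import Counter

def bootstrap(docs, seed_expressions, candidate_facets, rounds=2, top_k=1):
    """Promote the facet candidates that co-occur most often with
    documents matched by the seed event expressions."""
    learned = set(seed_expressions)
    facets = set()
    for _ in range(rounds):
        # Documents that contain any known event expression.
        event_docs = [d for d in docs if any(e in d for e in learned)]
        # Count candidate facets appearing in those documents.
        tally = Counter(f for d in event_docs for f in candidate_facets if f in d)
        facets |= {f for f, _ in tally.most_common(top_k)}
    return facets

docs = [
    "a car bomb exploded near the market, killing twelve",
    "the blast injured dozens of civilians",
    "the stock market rallied after the announcement",
]
print(bootstrap(docs, {"exploded"}, {"killing", "injured", "rallied"}))  # {'killing'}
```

Note that "rallied" never gets promoted: it only co-occurs with the irrelevant sense of "market", never with a seed expression, which is how the loop keeps supervision minimal.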

    Short answers in Scottish Gaelic and their theoretical implications

    This article presents an analysis of a novel short answer strategy in Scottish Gaelic, called the Verb-Answer, which differs from standard fragment answers in allowing us to directly observe some of the clausal structure in which it is embedded. It is shown that the Verb-Answer is identical to the fragment answer in virtually all other respects, demanding a unified analysis, and it is demonstrated that pursuing a unified analysis is problematic for Direct Interpretation approaches to short answers, but straightforward for the Silent Structure approach of Morgan (1973) and Merchant (2004). The extended typology of short answer strategies therefore provides an argument in favour of the latter approach to elliptical phenomena.

    Hybrid data-driven models of machine translation

    Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correspondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred. In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOH phrase-based decoder (Koehn, 2004a) and employing standard phrasal extraction heuristics (Koehn et al., 2003). In addition, we describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived. Making use of training and testing data taken from a French-English translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrase-based data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSMT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.
On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments making use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Microsystems experiments, to create a hybrid 'example-based SMT' system outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid data-driven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality. Following on from these findings, we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the state-of-the-art PHARAOH SMT decoder (Koehn, 2004a).
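
The effect of adding an SMT target language model to the EBMT variants can be sketched as log-linear rescoring: each candidate translation's score becomes its translation score plus a weighted target-LM score. The bigram log-probabilities, candidates, and weight below are invented for illustration.

```python
BIGRAM_LOGP = {  # toy English bigram log-probabilities (assumed values)
    ("<s>", "the"): -0.5, ("the", "committee"): -1.0, ("the", "commission"): -0.7,
    ("committee", "met"): -1.2, ("commission", "met"): -1.1, ("met", "</s>"): -0.4,
}

def lm_score(words, backoff=-5.0):
    """Sum bigram log-probabilities with a crude constant backoff."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(BIGRAM_LOGP.get(bg, backoff) for bg in zip(tokens, tokens[1:]))

def rescore(candidates, lm_weight=0.5):
    """candidates: list of (translation, translation_log_score).
    Pick the candidate maximizing the combined log-linear score."""
    return max(candidates,
               key=lambda c: c[1] + lm_weight * lm_score(c[0].split()))

cands = [("the committee met", -2.0), ("the commission met", -2.1)]
print(rescore(cands)[0])  # the commission met
```

Here the LM overturns the translation model's slight preference, which is the mechanism by which a target language model can improve every EBMT variant it is added to.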

    Isomorphy and Syntax-Prosody Relations in English

    abstract: This dissertation investigates the precise degree to which prosody and syntax are related. One possibility is that the syntax-prosody mapping is one-to-one ("isomorphic") at an underlying level (Chomsky & Halle 1968, Selkirk 1996, 2011, Ito & Mester 2009). This predicts that prosodic units should preferably match up with syntactic units. It is also possible that the mapping between these systems is entirely non-isomorphic, with prosody being influenced by factors from language perception and production (Wheeldon & Lahiri 1997, Lahiri & Plank 2010). In this work, I argue that both perspectives are needed in order to address the full range of phonological phenomena that have been identified in English and related languages, including word-initial lenition/flapping, word-initial segment deletion, and vowel reduction in function words, as well as patterns of pitch accent assignment, final-pronoun constructions, and the distribution of null complementizer allomorphs. In the process, I develop models for both isomorphic and non-isomorphic phrasing. The former is cast within a Minimalist syntactic framework of Merge/Label and Bare Phrase Structure (Chomsky 2013, 2015), while the latter is characterized by a stress-based algorithm for the formation of phonological domains, following Lahiri & Plank (2010).

    Grammar Deconstructed: Constructions and the Curious Case of the Comparative Correlative

    Comparative correlatives, like "the longer you stay out in the rain, the colder you'll get," are prolific in the world's languages (i.e., there is no evidence of a language that lacks comparative correlatives). Despite this observation, the data do not present a readily apparent syntax. What is the relationship between the two clauses? What is the main verb? What is English's "the," which obligatorily appears at the start of each clause? This thesis reviews prior analyses of comparative correlatives, both syntactic and semantic (Fillmore, 1987; McCawley, 1988, 1998; Beck, 1997; Culicover & Jackendoff, 1999; Borsley, 2003, 2004; den Dikken, 2005; Abeillé, Borsley & Espinal, 2007; Lin, 2007). A formal syntactic analysis of comparative correlatives is presented which accounts for their syntactic behaviour across several languages. Most notably, it challenges the assumption that constructions are essential primitives for the successful derivation and interpretation of the data (Fillmore, 1987; McCawley, 1988; Culicover & Jackendoff, 1999; Borsley, 2003, 2004; Abeillé, Borsley & Espinal, 2007). The analysis is framed within the goals of the Minimalist Program (Chomsky 1993, 1995a), specifically with respect to endocentricity and Bare Phrase Structure (Chomsky 1995b). Crosslinguistically, the first clause is subordinate to the second clause, the main clause. A'-movement (e.g., topicalization, wh-movement, focus) out of each clause proceeds successive-cyclically and, in the case of the subordinate clause, via sideward movement (Nunes 1995, 2004; Hornstein, 2001). In English, the word the that obligatorily appears at the start of each clause is a Force0. This provides an explanation for the ban on Subject-Aux Inversion (SAI) in the entire expression.
The degree phrases present in each clause of a comparative correlative crosslinguistically contain a quantifier phrase in Spec,DegP; this quantifier is phonetically null in English. The thesis concludes by presenting conceptual arguments against constructions as primitives in the grammar. Bare Phrase Structure (BPS) (Chomsky, 1995b) is included in the system by virtue of virtual conceptual necessity (VCN). Since constructions do not meet the criteria of VCN, their existence would compromise the principles of BPS. Further, when applied carefully, BPS renders constructions undefinable.

    Discovering multiword expressions

    In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.
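
One of the collocational methods this survey covers can be sketched as ranking candidate two-word expressions by pointwise mutual information (PMI) over a tokenized corpus. The tiny corpus below is invented, and real MWE discovery uses much larger counts with frequency thresholds.

```python
import math
from collections import Counter

def pmi_bigrams(sentences):
    """PMI of adjacent word pairs: log2( P(w1,w2) / (P(w1) P(w2)) )."""
    words = Counter(w for s in sentences for w in s)
    pairs = Counter(bg for s in sentences for bg in zip(s, s[1:]))
    n_w = sum(words.values())
    n_p = sum(pairs.values())
    return {bg: math.log2((c / n_p) /
                          ((words[bg[0]] / n_w) * (words[bg[1]] / n_w)))
            for bg, c in pairs.items()}

corpus = [
    ["he", "kicked", "the", "bucket"],
    ["she", "kicked", "the", "bucket"],
    ["he", "kicked", "the", "ball"],
    ["the", "ball", "rolled"],
]
scores = pmi_bigrams(corpus)
print(max(scores, key=scores.get))  # ('ball', 'rolled')
```

The winner here is the hapax pair ("ball", "rolled"), illustrating PMI's well-known bias toward rare co-occurrences; practical MWE discovery therefore combines association scores with frequency cut-offs, as the surveyed methods do.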