72 research outputs found

    A constraint-based approach to noun phrase coreference resolution in German newspaper text

    In this paper, we investigate the usefulness of a wide range of features for the resolution of nominal coreference, both as hard constraints (i.e. completely removing elements from the list of possible candidates) and as soft constraints (where an accumulation of violations makes it less likely that a candidate is chosen as the antecedent). We present a state-of-the-art system based on such constraints and weights estimated with a maximum entropy model, using lexical information to resolve cases of coreferent bridging.

    Decorrelation and shallow semantic patterns for distributional clustering of nouns and verbs

    Distributional approximations to lexical semantics are useful not only in helping to create lexical semantic resources (Kilgarriff et al., 2004; Snow et al., 2006), but also when applied directly in tasks that benefit from large-coverage semantic knowledge, such as coreference resolution (Poesio et al., 1998; Gasperin and Vieira, 2004; Versley, 2007), word sense disambiguation (McCarthy et al., 2004) or semantic role labeling (Gordon and Swanson, 2007). We present a model built from Web-based corpora using both shallow patterns for grammatical and semantic relations and a window-based approach, with singular value decomposition used to decorrelate the feature space, which is otherwise too heavily influenced by the skewed topic distribution of Web corpora.

    SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference


    Optimization issues in machine learning of coreference resolution


    Un corpus pour optimiser l’identification automatique des chaînes de référence

    We present a multi-genre corpus study aimed at automatically identifying reference chains. Reference chains are linguistic markers that signal topic continuation or topic shift in discourse. The study is part of a project to develop a system for automatic topic detection that optimizes document indexing in a search engine. The search engine uses topic indexing and also takes document genre into account to provide the user with relevant documents related to their query. From a Natural Language Processing perspective, we use a corpus of five genres (newspaper articles, editorials, novels, European laws, public reports) to study reference chains. We define five criteria to compare reference chains according to textual genre: the average length of the reference chains (number of mentions), the average distance between two mentions of a reference chain, the grammatical category preferred across all mentions of the reference chains, the grammatical class of the first mentions of the reference chains, and the correspondence between the first mention of a reference chain and the sentence topic (preverbal element). The corpus analysis reveals differences across genres in the linguistic material found in reference chains. We use these properties in our computation of reference chains to configure our system according to genre. We then discuss the results.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis. In the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.
    Egyptian Government

    Error propagation

