5 research outputs found

    Semi-supervised never-ending learning in rhetorical relation identification

    Some languages do not have enough labeled data to obtain good discourse parsing, especially in the relation identification step, and the additional use of unlabeled data is a plausible solution. A workflow is presented that uses a semi-supervised learning approach. Instead of using only a pre-defined additional set of unlabeled data, texts obtained from the web are continuously added. This approach achieves near-human performance (0.79) in intrasentential rhetorical relation identification. An experiment for English also shows improvement using a similar workflow.
    São Paulo Research Foundation (FAPESP) (grant ♯2014/11632); Natural Sciences and Engineering Research Council of Canada; University of Toront

    In this paper, we provide a brief description of the multidisciplinary domain of research called Natural Language Processing (NLP), which aims at enabling the computer to deal with natural languages. In accordance with this description, NLP is conceived as “human language engineering or technology”. Therefore, NLP requires a consistent description of linguistic facts on every linguistic level: morphological, syntactic, semantic, and even the level of pragmatics and discourse. In addition to the linguistically motivated conception of NLP, we emphasize the origin of this research field, the place occupied by NLP inside a multidisciplinary scenario, and its objectives and challenges. Finally, we provide some remarks on the automatic processing of the Brazilian Portuguese language. Key words: natural language processing, human language technology, computational linguistics, linguistics, natural language

    NILC-Metrix : assessing the complexity of written and spoken language in Brazilian Portuguese

    This paper presents and makes publicly available NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted to BP some metrics from the Coh-Metrix tool, which computes metrics related to cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to that of the metrics of Coh-Metrix v3.0, to facilitate comparisons between metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution to each task.

    Review and Evaluation of DiZer – An Automatic Discourse Analyzer for Brazilian Portuguese

    No full text