
    Predicting the Effectiveness of Self-Training: Application to Sentiment Classification

    Full text link
    The goal of this paper is to investigate the connection between the performance gain that can be obtained by self-training and the similarity between the corpora used in this approach. Self-training is a semi-supervised technique designed to increase the performance of machine learning algorithms by automatically classifying instances of a task and adding these as additional training material to the same classifier. In the context of language processing tasks, this training material is mostly an (annotated) corpus. Unfortunately, self-training does not always lead to a performance increase, and whether it will is largely unpredictable. We show that the similarity between corpora can be used to identify those setups for which self-training can be beneficial. We consider this research a step in the process of developing a classifier that is able to adapt itself to each new test corpus that it is presented with.
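
    As a rough illustration of the self-training loop described above (a minimal sketch, not the authors' setup; the classifier, features, and the 0.8 confidence threshold are assumptions), the following Python snippet retrains a sentiment classifier on its own high-confidence predictions:

        # Minimal self-training sketch: label unlabeled texts with the current
        # classifier and add high-confidence predictions back to the training
        # data before retraining.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        labeled_texts = ["great movie", "terrible plot", "loved it", "awful acting"]
        labels = np.array([1, 0, 1, 0])
        unlabeled_texts = ["really enjoyable film", "boring and bad", "what a mess"]

        vectorizer = TfidfVectorizer()
        X_labeled = vectorizer.fit_transform(labeled_texts).toarray()
        X_unlabeled = vectorizer.transform(unlabeled_texts).toarray()

        clf = LogisticRegression()
        for _ in range(3):                            # a few self-training rounds
            clf.fit(X_labeled, labels)
            if len(X_unlabeled) == 0:
                break
            probs = clf.predict_proba(X_unlabeled)
            confident = probs.max(axis=1) >= 0.8      # assumed confidence threshold
            if not confident.any():
                break
            # self-labeled instances become additional training material
            X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
            labels = np.concatenate([labels, probs[confident].argmax(axis=1)])
            X_unlabeled = X_unlabeled[~confident]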

    Word-Graph Construction Techniques for Context Analysis

    Get PDF
    A Nomo-Word Graph Construction Analysis Method (NWGC-AM) is used to separate the corresponding construction phrases into essential and non-essential citation groups. The graph resemblance metrics used in this work are Nomo Maximum Common Sub-graph edge resemblance (NMCS-NR), Maximum Common Subgraph Directed Edge Resemblance (MCS-DER), and Maximum Common Subgraph Undirected Edges Resemblance (MCS-UER). The tests included five distinct classifiers: Random Forest, Naive Bayes, K-Nearest Neighbors (KNN), Decision Trees, and Support Vector Machines (SVM). The annotated dataset used for the studies comprised 361 citations. The Decision Tree classifier exhibits superior performance, attaining an accuracy rate of 0.98.
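
    The exact metric definitions are not given here, but an MCS-style undirected edge resemblance between two word graphs can be sketched as follows (the overlap-of-edge-sets approximation, the normalization by the larger edge count, and the toy graphs are assumptions, not the paper's formulation):

        # Approximate MCS undirected-edge resemblance: intersect the edge sets,
        # assuming nodes are word labels shared across graphs, and normalize by
        # the larger graph's edge count. A directed variant would compare
        # ordered edge tuples instead of frozensets.
        import networkx as nx

        def edge_resemblance(g1: nx.Graph, g2: nx.Graph) -> float:
            common = set(map(frozenset, g1.edges())) & set(map(frozenset, g2.edges()))
            denom = max(g1.number_of_edges(), g2.number_of_edges())
            return len(common) / denom if denom else 0.0

        g_a = nx.Graph([("method", "proposed"), ("proposed", "novel"), ("novel", "approach")])
        g_b = nx.Graph([("method", "proposed"), ("proposed", "baseline")])
        print(edge_resemblance(g_a, g_b))   # one shared edge out of three -> 0.33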

    Annotated Corpus for Citation Context Analysis

    Get PDF
    In this paper, we present a corpus composed of 85 scientific articles annotated with 2092 citations analyzed using context analysis. We obtained high inter-annotator agreement; therefore, we assure reliability and reproducibility of the annotation, which was performed independently by three coders. We applied this corpus to classify citations according to qualitative criteria using a medium-granularity categorization scheme, enriched by annotated keywords and labels to obtain high granularity. The annotation schema handles three dimensions: PURPOSE, POLARITY, and ASPECTS. Citation purpose defines the function classification (use, critique, comparison, and background), with more specific classes established using keywords: Based on, Supply, Useful, Contrast, Acknowledge, Corroboration, Debate, Weakness, and Hedges. Citation aspects complement the citation characterization: concept, method, data, tool, task, among others. Polarity has three levels: Positive, Negative, and Neutral. We developed the schema and annotated the corpus focusing on applications for citation influence assessment, but we suggest that applications such as summary generation and information retrieval could also use this annotated corpus because the scheme is organized into clearly defined general dimensions.
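
    A minimal sketch of what one annotated citation might look like under this three-dimensional schema (field names and example values are illustrative assumptions, not the corpus's actual file format):

        # One hypothetical citation record covering PURPOSE, POLARITY and ASPECTS.
        from dataclasses import dataclass, field

        @dataclass
        class CitationAnnotation:
            context: str                 # the citing sentence(s)
            purpose: str                 # use, critique, comparison, background
            purpose_keyword: str         # e.g. Based on, Contrast, Weakness, Hedges
            polarity: str                # Positive, Negative, Neutral
            aspects: list = field(default_factory=list)  # concept, method, data, tool, task, ...

        example = CitationAnnotation(
            context="We adopt the evaluation protocol of [12].",
            purpose="use",
            purpose_keyword="Based on",
            polarity="Neutral",
            aspects=["method"],
        )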

    Disciplinary Difference in Citation Opinion Expressions

    Get PDF
    This study examines academic opinion expressions in citation context. We first developed an annotation schema to annotate three aspects of each academic opinion expressed in a citation statement: rhetorical purpose, content aspect, and opinion polarity. We then annotated two samples: a natural science sample consisting of biomedical journal articles, and an engineering sample consisting of conference papers in the natural language processing field. A comparison of the annotations on the two samples showed disciplinary differences in citation opinion expressions. The result contributes to the understanding of academic opinion expressions in citation context and the development of automated citation opinion analysis tools to assist researchers' literature search and navigation.

    CORWA: A Citation-Oriented Related Work Annotation Dataset

    Full text link
    Academic research is an exploratory activity to discover new solutions to problems. By its nature, academic papers perform literature reviews to distinguish their novelty from prior work. In natural language processing, this literature review is usually conducted in the "Related Work" section. The task of related work generation aims to automatically generate the related work section given the rest of the research paper and a list of papers to cite. Prior work on this task has focused on the sentence as the basic unit of generation, neglecting the fact that related work sections consist of variable-length text fragments derived from different information sources. As a first step toward a linguistically-motivated related work generation framework, we present a Citation Oriented Related Work Annotation (CORWA) dataset that labels different types of citation text fragments from different information sources. We train a strong baseline model that automatically tags the CORWA labels on massive unlabeled related work section texts. We further suggest a novel framework for human-in-the-loop, iterative, abstractive related work generation. (Comment: Accepted by NAACL 2022)

    A Correlation Study of Co-opinion and Co-citation Similarity Measures

    Get PDF
    Co-citation forms a relational document network. Co-citation-based measures are found to be effective in retrieving relevant documents. However, they are far from ideal and need further enhancement. The co-opinion concept was proposed and tested in previous research and found to be effective in retrieving relevant documents. The present study explores the correlation between opinion (dis)similarity measures and traditional co-citation-based ones, including the Citation Proximity Index (CPI), co-citedness, and co-citation context similarity. The results show significant, though weak to medium, correlations between the variables. The correlations are direct for the co-opinion measure, while being inverse for the opinion distance. Accordingly, the two groups of measures are revealed to represent some similar aspects of the document relation. Moreover, the weakness of the correlations implies that there are different dimensions represented by the two groups.
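
    A hedged sketch of the kind of correlation analysis described, using Spearman's rank correlation over document pairs (the values below are invented toy numbers, not the study's data):

        # Correlate a co-opinion similarity measure and an opinion distance with
        # co-citedness across document pairs; the study reports direct correlations
        # for the former and inverse correlations for the latter.
        from scipy.stats import spearmanr

        co_opinion_similarity = [0.82, 0.40, 0.65, 0.10, 0.55]   # per document pair
        opinion_distance      = [0.18, 0.60, 0.35, 0.90, 0.45]
        co_citedness          = [12, 3, 7, 1, 5]

        rho, p = spearmanr(co_opinion_similarity, co_citedness)
        print(f"co-opinion vs co-citedness: rho={rho:.2f}, p={p:.3f}")

        rho, p = spearmanr(opinion_distance, co_citedness)
        print(f"opinion distance vs co-citedness: rho={rho:.2f}, p={p:.3f}")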