
    Combining textual features with sentence embeddings

    Get PDF
    Doctoral dissertation (Ph.D.), Department of Linguistics, College of Humanities, Graduate School of Seoul National University, August 2021. 박수지. The goal of this dissertation is to develop a language model for predicting the quality of Korean news articles. The need for news quality prediction has recently grown with the spread of fake news and similar content, yet state-of-the-art natural language processing techniques have not yet been applied to the task. To overcome this limitation, the dissertation develops an SBERT model that represents sentence meaning and examines whether the linguistic features of articles can be used to improve quality classification. Both a machine-learning model using textual features of articles, such as readability and cohesion, and a transfer-learning model using contextual features automatically extracted by SBERT outperformed the deep-learning results of previous studies; performance improved further when the SBERT training data was expanded and refined, and when textual and contextual features were used together. From this we conclude that linguistic features play an important role in article quality and that SBERT, a state-of-the-art NLP technique, can make a practical contribution to extracting and exploiting such features.
    Contents:
    1 Introduction
    2 Literature Review
    2.1 Background
    2.1.1 Text Classification
    2.1.1.1 Initial Studies
    2.1.1.2 News Classification
    2.1.2 Text Quality Assessment
    2.2 News Quality Prediction Task
    2.2.1 News Data
    2.2.1.1 Online vs. Offline
    2.2.1.2 Expert-rated vs. User-rated
    2.2.2 Prediction Methods
    2.2.2.1 Manually Engineered Features vs. Automatically Extracted Features
    2.2.2.2 Machine Learning vs. Deep Learning
    2.3 Instruments and Techniques
    2.3.1 Sentence and Document Embeddings
    2.3.1.1 Static Embeddings
    2.3.1.2 Contextual Embeddings
    2.3.2 Fusion Models
    2.4 Summary
    3 Methods
    3.1 Data from Choi, Shin, and Kang (2021)
    3.1.1 News Corpus
    3.1.2 Quality Levels
    3.1.3 Journalism Values
    3.2 Linguistic Features
    3.2.1 Justification of Using Linguistic Features Only
    3.2.2 Two Types of Linguistic Features
    3.2.2.1 Textual Features
    3.2.2.2 Contextual Features
    3.3 Summary
    4 Ordinal Logistic Regression Models with Textual Features
    4.1 Textual Features
    4.1.1 Coh-Metrix
    4.1.2 KOSAC Lexicon
    4.1.3 K-LIWC
    4.1.4 Others
    4.2 Ordinal Logistic Regression
    4.3 Results
    4.3.1 Feature Selection
    4.3.2 Impacts on Quality Evaluation
    4.4 Discussion
    4.4.1 Effect of Cosine Similarity by Issue
    4.4.2 Effect of Quantitative Evidence
    4.4.3 Effect of Sentiment
    4.5 Summary
    5 Deep Transfer Learning Models with Contextual Features
    5.1 Contextual Features from SentenceBERT
    5.1.1 Necessity of Sentence Embeddings
    5.1.2 KR-SBERT
    5.2 Deep Transfer Learning
    5.3 Results
    5.3.1 Measures of Multiclass Classification
    5.3.2 Performances of News Quality Prediction Models
    5.4 Discussion
    5.4.1 Effect of Data Size
    5.4.2 Effect of Data Augmentation
    5.4.3 Effect of Data Refinement
    5.5 Summary
    6 Fusion Models Combining Textual Features with Contextual Sentence Embeddings
    6.1 Model Fusion
    6.1.1 Feature-level Fusion: Concatenation
    6.1.2 Logit-level Fusion: Interpolation
    6.2 Results
    6.2.1 Optimization of the Presentational Attribute Model
    6.2.2 Performances of News Quality Prediction Models
    6.3 Discussion
    6.3.1 Effects of Fusion
    6.3.2 Comparison with Choi et al. (2021)
    6.4 Summary
    7 Conclusion
    References
    A List of Words Used for Textual Feature Extraction
    A.1 Coh-Metrix Features
    A.2 Predicate Type Features
    B Codes Used in Chapter 4
    B.1 Python Code for Textual Feature Extraction
    C Results of VIF Test and Brant Test
    C.1 VIF Test in R
    C.2 Brant Test in R
    D Codes Used in Chapter 6
    D.1 Python Code for Feature-Level Fusion
    D.2 Python Code for Logit-Level Fusion
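    To make the combination concrete, here is a minimal sketch of feature-level fusion in Python: hand-crafted textual features (e.g., readability and cohesion scores) are concatenated with mean-pooled Sentence-BERT embeddings and fed to a generic classifier. The model name, feature values, and labels below are illustrative placeholders, and plain logistic regression stands in for the dissertation's ordinal regression and transfer-learning models.

        # Feature-level fusion sketch: concatenate hand-crafted textual features
        # with mean-pooled SBERT sentence embeddings and train a classifier.
        # Model name and toy data are placeholders, not the dissertation's setup.
        import numpy as np
        from sentence_transformers import SentenceTransformer
        from sklearn.linear_model import LogisticRegression

        articles = [
            ["First sentence of article one.", "Second sentence of article one."],
            ["First sentence of article two.", "Second sentence of article two."],
        ]
        textual_features = np.array([[0.62, 0.41], [0.35, 0.58]])  # e.g. readability, cohesion
        quality_labels = np.array([2, 0])                          # ordinal quality levels

        encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder model

        def embed_article(sentences):
            # Encode each sentence and mean-pool into one document vector.
            return encoder.encode(sentences).mean(axis=0)

        contextual_features = np.vstack([embed_article(a) for a in articles])
        fused = np.hstack([textual_features, contextual_features])  # feature-level fusion
        clf = LogisticRegression(max_iter=1000).fit(fused, quality_labels)
        print(clf.predict(fused))

    The logit-level alternative listed in the table of contents would instead train separate models on the two feature sets and interpolate their class probabilities.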

    Towards Interoperability in E-health Systems: a three-dimensional approach based on standards and semantics

    Get PDF
    Proceedings of HEALTHINF 2009 (International Conference on Health Informatics), Porto, Portugal, January 14-17, 2009, part of BIOSTEC (International Joint Conference on Biomedical Engineering Systems and Technologies). The interoperability problem in eHealth can only be addressed by combining standards and technology. However, these alone do not suffice; an appropriate framework that articulates such a combination is required. In this paper, we adopt a three-dimensional (information, terminology, and inference) approach for such a framework, based on OWL as the formal language for terminological and ontological health resources, SNOMED CT as the lexical backbone for all such resources, and the CEN 13606 standard for representing EHRs. Based on that framework, we propose a novel way of creating and supporting networks of clinical terminologies. Additionally, we propose a number of software modules to semantically process and exploit EHRs, including NLP-based search and inference, which can support medical applications in heterogeneous and distributed eHealth systems. This work has been funded as part of the Spanish nationally funded projects ISSE (FIT-350300-2007-75) and CISEP (FIT-350301-2007-18). We also acknowledge the EU project NeOn (IST-2005-027595).
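    As a toy illustration of using SNOMED CT as a lexical backbone for EHR content (not the paper's actual OWL/CEN 13606 implementation), the sketch below links a hypothetical EHR entry to a SNOMED CT concept URI with rdflib; the example namespace, record identifier, and concept code are assumptions made purely for illustration.

        # Toy sketch: annotate a hypothetical EHR entry with a SNOMED CT concept URI.
        # Uses rdflib; the EHR namespace, record id, and concept code are placeholders.
        from rdflib import Graph, Namespace, Literal

        EX = Namespace("http://example.org/ehr/")    # hypothetical EHR namespace
        SCT = Namespace("http://snomed.info/id/")    # SNOMED CT concept URIs

        g = Graph()
        entry = EX["record-001/problem-1"]
        g.add((entry, EX.hasText, Literal("Patient reports high blood pressure")))
        g.add((entry, EX.hasCode, SCT["38341003"]))  # illustrative concept code
        print(g.serialize(format="turtle"))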

    BeeSpace Navigator: exploratory analysis of gene function using semantic indexing of biological literature

    Get PDF
    With the rapid decrease in the cost of genome sequencing, the classification of gene function is becoming a primary problem. Such classification has been performed by human curators who read the biological literature to extract evidence. BeeSpace Navigator is prototype software for exploratory analysis of gene function using the biological literature. The software supports an automatic analogue of the curator process to extract functions, with a simple interface intended for all biologists. Since extraction is done on selected collections that are semantically indexed into conceptual spaces, the curation can be task-specific. Biological literature containing references to gene lists from expression experiments can be analyzed to extract concepts that are computational equivalents of a classification such as Gene Ontology, yielding discriminating concepts that differentiate gene mentions from other mentions. The functions of individual genes can be summarized from sentences in the biological literature, producing results that resemble an automatically computed model organism database entry. Statistical frequency analysis based on literature phrase extraction generates offline semantic indexes to support these gene function services. The website with BeeSpace Navigator is free and open to all; there is no login requirement at www.beespace.illinois.edu for version 4. Materials from the 2010 BeeSpace Software Training Workshop are available at www.beespace.illinois.edu/bstwmaterials.php.
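    The statistical frequency analysis mentioned above can be illustrated with a simple smoothed log-odds comparison of phrase counts between a target collection (e.g., papers mentioning a gene list) and a background collection. The phrase counts below are invented, and this generic scoring is a sketch of the idea rather than BeeSpace's actual indexing pipeline.

        # Generic sketch: rank phrases that discriminate a target collection
        # from a background collection using smoothed log-odds (toy counts).
        import math
        from collections import Counter

        target = Counter({"foraging behavior": 42, "odor learning": 17, "field study": 5})
        background = Counter({"foraging behavior": 60, "odor learning": 12, "field study": 300})

        def log_odds(phrase, smoothing=0.5):
            # Relative rate of the phrase in the target vs. the background collection.
            t = (target[phrase] + smoothing) / sum(target.values())
            b = (background[phrase] + smoothing) / sum(background.values())
            return math.log(t / b)

        for phrase in sorted(target, key=log_odds, reverse=True):
            print(f"{phrase}: {log_odds(phrase):.2f}")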

    Franco-Chinese symposium on quantitative remote sensing in agronomy and the environment: review and prospects for collaboration. Mission report (26-30 March 2000)

    Full text link
    This report presents the main results of a remote sensing symposium bringing together research teams from INRA, CIRAD, and the University of Lille and their Chinese counterparts from the Institute of Remote Sensing Applications (IRSA) of the Chinese Academy of Sciences (CAS) and the National Satellite Meteorological Center (NSMC). Prospects for a collaboration programme are presented, with two major axes corresponding to two levels of approach: regional, and local for precision agriculture. (Author's abstract)

    A Coherent Unsupervised Model for Toponym Resolution

    Full text link
    Toponym Resolution, the task of assigning a location mention in a document to a geographic referent (i.e., latitude/longitude), plays a pivotal role in analyzing location-aware content. However, the ambiguities of natural language and the huge number of possible interpretations for toponyms constitute insurmountable hurdles for this task. In this paper, we study the problem of toponym resolution with no additional information other than a gazetteer and no training data. We demonstrate that a dearth of sufficiently large annotated datasets makes supervised methods less capable of generalizing. Our proposed method estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. We explore the interactions between multiple interpretations of mentions and the relationships between different toponyms in a document to build a model that finds the most coherent resolution. Our model is evaluated on three news corpora, two from the literature and one collected and annotated by us; we then compare our method to the state-of-the-art unsupervised and supervised techniques. We also examine three commercial products: Reuters OpenCalais, Yahoo! YQL Placemaker, and Google Cloud Natural Language API. The evaluation shows that our method outperforms the unsupervised technique as well as Reuters OpenCalais and Google Cloud Natural Language API on all three corpora; our method also shows performance close to that of the state-of-the-art supervised method and outperforms it when the test data has 40% or more toponyms that are not seen in the training data. Comment: 9 pages (+1 page of references), WWW '18: Proceedings of the 2018 World Wide Web Conference.
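    A rough sketch of the geographic-coherence intuition (not the authors' actual model): each toponym's gazetteer candidates are considered jointly, and the assignment whose interpretations lie closest together geographically is kept. The two-entry gazetteer and coordinates below are made up for illustration.

        # Toy coherence-based disambiguation: among all combinations of gazetteer
        # candidates, keep the assignment whose interpretations lie closest together.
        # The gazetteer and coordinates are illustrative placeholders.
        import math
        from itertools import combinations, product

        GAZETTEER = {
            "Paris":      [(48.86, 2.35), (33.66, -95.56)],   # France vs. Texas
            "Versailles": [(48.80, 2.13), (38.05, -84.73)],   # France vs. Kentucky
        }

        def spread(points):
            # Total pairwise distance; lower means a geographically tighter reading.
            return sum(math.dist(a, b) for a, b in combinations(points, 2))

        def resolve(toponyms):
            candidates = [GAZETTEER[t] for t in toponyms]
            best = min(product(*candidates), key=spread)
            return dict(zip(toponyms, best))

        print(resolve(["Paris", "Versailles"]))  # both resolve to the French locations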

    Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding

    Full text link
    In clinical radiology reports, doctors capture important information about the patient's health status. They convey their observations from raw medical imaging data about the inner structures of a patient. As such, formulating reports requires medical experts to possess wide-ranging knowledge about anatomical regions and their normal, healthy appearance, as well as the ability to recognize abnormalities. This explicit grasp of both the patient's anatomy and its appearance is missing in current medical image-processing systems, as such annotations are especially difficult to gather. This renders the models narrow experts, e.g., for identifying specific diseases. In this work, we recover this missing link by adding human anatomy into the mix, enabling the association of content in medical reports with its occurrence in the associated imagery (medical phrase grounding). To exploit anatomical structures in this scenario, we present a sophisticated automatic pipeline to gather and integrate human bodily structures from computed tomography datasets, which we incorporate in our PAXRay: A Projected dataset for the segmentation of Anatomical structures in X-Ray data. Our evaluation shows that methods that take advantage of anatomical information benefit heavily in visually grounding radiologists' findings, as our anatomical segmentations allow for up to 50% (absolute) better grounding results on the OpenI dataset compared to commonly used region proposals. The PAXRay dataset is available at https://constantinseibold.github.io/paxray/. Comment: 33rd British Machine Vision Conference (BMVC 2022).
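    A small sketch of the kind of comparison implied above, on invented data: a reference box is derived from a binary anatomical segmentation mask, and a predicted grounding region is scored against it with intersection-over-union. The mask and boxes are placeholders, not PAXRay annotations.

        # Toy sketch: derive a bounding box from a binary anatomical mask and score
        # a predicted grounding box against it with intersection-over-union (IoU).
        import numpy as np

        mask = np.zeros((8, 8), dtype=bool)
        mask[2:6, 1:5] = True                      # pretend this is a lung segmentation

        def mask_to_box(m):
            ys, xs = np.where(m)
            return (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)  # x0, y0, x1, y1

        def iou(a, b):
            ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
            ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter)

        anatomy_box = mask_to_box(mask)            # reference box from the segmentation
        proposal_box = (0, 1, 4, 7)                # hypothetical region proposal
        print(round(iou(proposal_box, anatomy_box), 3))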

    Software Evolution for Industrial Automation Systems. Literature Overview

    Get PDF

    “Describing Without Identifying”: The Phenomenological Role Of Gender in Cataloging Practices

    Get PDF
    This dissertation explores the gendering practices of visual information catalogers. The work aims to understand how catalogers perceive gender when describing persons within visual information. The qualitative study deployed queer interpretative phenomenological analysis to understand how catalogers think broadly about describing identity. The infused queer theoretical tenets helped reveal that, while participants may not directly name gender as challenging, the conflation of gender into cisnormative monoliths (assuming every person's gender matches their sex assigned at birth) or silence around gender produces telling opinions concerning nonbinary gender. The research also utilized a Think Aloud exercise wherein participants undertook in-the-moment cataloging of three moving images. One image represented "neutral" cisgender identities, and two clips represented subversions of gender binaries. Thirteen catalogers were interviewed, and the data produced noteworthy findings. The small sample size reflects the qualitative methodological priority on participants' intimate, lived experiences rather than an aim for generalizability. Catalogers describe work with visual information as inherently challenging, since describing anything without context requires caution. Catalogers also noted hesitance around describing humans given societal complexities around identities like race and gender. Nevertheless, participants during the Think Aloud exercise relied on gendering as descriptive shorthand (pronouns, male/female labels) and only reflected on these presumptions when engaging with the footage whose contents challenged gender binaries. The implications suggest a need for inclusivity training for catalogers around contemporary notions of gender. Further, given the impact of the gender nonconforming footage on catalogers' perceived practices, another implication suggests value in increased access to and representation of gender diverse materials within cultural heritage.