
    Combining Text Classification and Fact Checking to Detect Fake News

Due to its widespread presence in social and news media, fake news has become an emerging research topic attracting attention worldwide. In news media and social media, information spreads at high speed but without guarantees of accuracy, so detection mechanisms must be able to assess news quickly enough to combat the spread of fake news, which can have a negative impact on individuals and society. Detecting fake news is therefore important, and it is also a technically challenging problem. The challenge addressed here is to use text classification to combat fake news: determining appropriate text classification methods and evaluating how well these methods distinguish between fake and non-fake news. Machine learning is helpful for building artificial intelligence systems based on tacit knowledge because it can help us solve complex problems using real-world data. For this reason, I propose that integrating text classification with fact checking of check-worthy statements can help detect fake news. I used text processing and three classifiers, Passive Aggressive, Naïve Bayes, and Support Vector Machine, to classify the news data. Text classification mainly focuses on extracting various features from texts and then incorporating these features into the classification. A major challenge in this area is the lack of an efficient method to distinguish fake news from non-fake news, owing to the scarcity of suitable corpora. I applied the three machine learning classifiers to two publicly available datasets, and experimental analysis on these datasets shows encouraging and improved performance. Simple classification alone is not sufficiently accurate for detecting fake news because the classification methods are not specialized for it, so I added a system that checks the news in depth, sentence by sentence. Fact checking is a multi-step process that begins with the extraction of check-worthy statements. Identifying check-worthy statements is a subtask of the fact checking process, and automating it would reduce the time and effort required to fact check a statement. In this thesis I propose an approach that classifies statements as check-worthy or not check-worthy while also taking into account the context around a statement. This work shows that including context makes a significant contribution to classification, while at the same time using more general features to capture information from sentences. The aim of this part of the work is to propose an approach that automatically identifies check-worthy statements for fact checking, including the context around a statement. The results are analyzed by examining which features contribute most to classification, and also how well the approach performs. For this work, a dataset was created by consulting different fact checking organizations; it contains debates and speeches in the domain of politics, and the capability of the approach is evaluated in this domain. The approach starts by extracting sentence and context features from the sentences and then classifies the sentences based on these features. The feature set and context features were selected after several experiments, based on how well they differentiate check-worthy statements. Fact checking has received increasing attention since the 2016 United States presidential election, so much so that many efforts have been made to develop a viable automated fact checking system.
I introduce a web-based approach to fact checking that compares the full news text and headline with known facts such as names, locations, and places. The challenge is to develop an automated application that takes claims directly from mainstream news media websites and fact checks the news after applying the classification and fact checking components. For fact checking, a dataset was constructed that contains 2146 news articles labelled fake, non-fake, and unverified. I include forty mainstream news media sources to compare the results, as well as Wikipedia for double verification. This work shows that a combination of text classification and fact checking makes a considerable contribution to the detection of fake news, while also using more general features to capture information from sentences.
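
As a rough illustration of the classification component described above, the sketch below trains one of the three classifiers mentioned (Passive Aggressive) on TF-IDF features. It is not the thesis's actual pipeline; the dataset path and the "text"/"label" column names are assumptions made for the example.

    # Minimal text-classification sketch: TF-IDF features fed into a
    # Passive Aggressive classifier. The CSV path and the "text"/"label"
    # columns are assumed for illustration, not taken from the thesis.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("news.csv")  # assumed columns: "text", "label" (fake / non-fake)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42)

    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    clf = PassiveAggressiveClassifier(max_iter=50)
    clf.fit(X_train_vec, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))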

    A Survey on Semantic Processing Techniques

Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We review relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing in the published version due to publication policies; please contact Prof. Erik Cambria for details.
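
As a small illustration of two of the low-level tasks surveyed (named entity recognition and word sense disambiguation), the sketch below uses off-the-shelf tools rather than the methods reviewed in the survey; it assumes the spaCy English model and the NLTK tokenizer and WordNet data have been downloaded.

    # Illustrative off-the-shelf runs of two surveyed tasks: named entity
    # recognition (spaCy) and knowledge-based word sense disambiguation
    # (NLTK's Lesk algorithm). Assumes "en_core_web_sm" and the NLTK
    # "punkt" and "wordnet" resources are already installed.
    import spacy
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The survey was published in Information Fusion in 2024.")
    for ent in doc.ents:                           # named entity recognition
        print(ent.text, ent.label_)

    sentence = "She sat on the bank of the river and watched the water."
    sense = lesk(word_tokenize(sentence), "bank")  # word sense disambiguation
    print(sense, "-", sense.definition() if sense else "no sense found")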

    Personal Ways of Interacting with Multimedia Content (Persönliche Wege der Interaktion mit multimedialen Inhalten)

Today the world of multimedia is almost completely device- and content-centered. It focuses its energy nearly exclusively on technical issues such as computing power, network specifics, or content and device characteristics and capabilities. In most multimedia systems, the presentation of multimedia content and the basic controls for playback are the main issues, and so a very passive user experience, comparable to that of traditional TV, is most often provided. In the face of recent developments and changes in the realm of multimedia and mass media, this "traditional" focus seems outdated. The increasing use of multimedia content on mobile devices, along with the continuous growth in the amount and variety of content available, makes an urgent reorientation of this domain necessary. Given the increasingly difficult situation faced by users of such systems, it is only logical that these individuals be brought to the center of attention. In this thesis we respond to these trends and developments by applying to multimedia systems concepts and mechanisms first introduced in the domain of user-centrism. Central to the concept of user-centrism is that devices should provide users with an easy way to access services and applications. Thus, the current challenge is to combine mobility, additional services, and easy access in a single, user-centric approach. This thesis presents a framework for introducing and supporting several of the key concepts of user-centrism in multimedia systems. Additionally, a new definition of a user-centric multimedia framework has been developed and implemented. To satisfy the user's need for mobility and flexibility, our framework enables seamless media and service consumption. The main aim of session mobility is to help people cope with the increasing number of different devices in use: using a mobile agent system, multimedia sessions can be transferred between different devices in a context-sensitive way, and the use of the international standard MPEG-21 guarantees extensibility and the integration of content adaptation mechanisms. Furthermore, a concept is presented that allows for individualized and personalized selection and addresses the need to find appropriate content, all in an easy and intuitive way. Especially in the realm of television, the demand that such systems cater to the needs of the audience is constantly growing. Our approach combines content-filtering methods, state-of-the-art classification techniques, and mechanisms well known from the areas of information retrieval and text mining, all utilized for the generation of recommendations in a promising new way; concepts from the area of collaborative tagging systems are also used. An extensive experimental evaluation yielded several interesting findings and proves the applicability of our approach. In contrast to the "lean-back" experience of traditional media consumption, interactive media services offer a way to enable the active participation of the audience. Thus, we present a concept that enables the use of interactive media services on mobile devices in a personalized way. Finally, a use case for enriching TV with additional content and services demonstrates the feasibility of this concept.
Today's world of media and multimedia content is almost exclusively content- and device-oriented. Various systems and developments primarily focus on the way content is presented and on technical specifics, which are usually device-dependent. The growing amount and variety of multimedia content and the increased use of mobile devices urgently require a rethinking of how multimedia systems and frameworks are designed. Instead of holding on to the rather rigid and passive concepts known from the TV domain, the user should move into the focus of multimedia concepts. To help users cope with this increasingly complex and difficult situation, a change in the basic paradigm of media consumption is necessary, and focusing on the user counteracts the situation described. This thesis draws on concepts from the field of user-centrism, transfers them to the media domain, and applies them towards a more user-specific and user-oriented design. The focus is on the TV domain, although most of the concepts can also be transferred to general media use. A framework is presented that supports the most important concepts of user-centrism in the multimedia domain. To accommodate the trend towards mobile media use, the framework enables the use of multimedia services and content across the boundaries of different devices and networks (session mobility). By using a mobile agent platform in combination with the MPEG-21 standard, a new and flexibly extensible approach to the mobility of user sessions was realized. In view of the constantly growing amount of content and services, this thesis also presents a concept for simple, individualized selection and discovery of interesting content and services in a context-specific way. Concepts and methods of content-based filtering, state-of-the-art classification mechanisms, and methods from the field of text mining are employed in a new way in a multimedia recommendation system; in addition, Web 2.0 methods are integrated as a tag-based collaborative component. A comprehensive evaluation demonstrated both the feasibility and the added value of this component. Our iTV component enables more active participation in media consumption: it supports the provision and use of interactive services on mobile devices alongside media consumption. The feasibility of this concept was demonstrated with a scenario that enriches TV programmes with interactive services.
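
As a loose illustration of the content-filtering side of such a recommender (not the thesis's actual implementation), the sketch below builds a simple user profile from liked TV programme descriptions and ranks the remaining items by cosine similarity; all item names, descriptions, and the "liked" set are invented for the example.

    # Minimal content-based recommendation sketch: average the liked items'
    # TF-IDF vectors into a user profile, then rank other programmes by
    # cosine similarity to that profile. All data below is invented.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    items = {
        "news_tonight": "daily news politics world events",
        "cook_it": "cooking recipes kitchen food",
        "goal_replay": "football highlights sports league",
        "star_gazing": "astronomy science documentary space",
    }
    liked = ["star_gazing"]  # items the user has consumed and rated positively

    vec = TfidfVectorizer()
    matrix = vec.fit_transform(items.values())      # one row per programme description
    index = {name: i for i, name in enumerate(items)}

    profile = np.asarray(matrix[[index[n] for n in liked]].mean(axis=0))
    scores = cosine_similarity(profile, matrix).ravel()

    for name in sorted(items, key=lambda n: -scores[index[n]]):
        if name not in liked:
            print(f"{name}: {scores[index[name]]:.2f}")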

    Digital Urban Infrastructure and Mobile Bodies: A Study of QR Code Production Practices and the Configuration of Space during the COVID-19 Pandemic (디지털 도시 인프라와 모바일 신체: 코로나19 유행병 시기 QR코드 생산 활동과 공간 구성에 대한 연구)

Doctoral dissertation -- Seoul National University, Graduate School of Environmental Studies, Department of Environmental Planning, February 2022 (advisor: 전상인). This thesis studies the assemblage of the Electronic Entry Register (EER) as digital infrastructure during the COVID-19 pandemic in the city of Seoul. The Electronic Entry Register is a spatial planning strategy that the South Korean government developed to control the circulation of mobile bodies in response to the pandemic. This case study adopts assemblage thinking to reveal how the EER came into being. It particularly highlights the data-producing human actors by adopting a posthumanist approach, bringing them forward as one of the main actors in materialising this assemblage. Examining the development of the EER revealed that assembling the 'circulatory conduit' (Deleuze & Guattari, 1997) depended largely on creating a population of docile bodies (Foucault, 2020) who were willing to, and capable of, producing the right kinds of data. To this end, the South Korean government chose to simulate the national population on commercial mobile apps, which raises the question of whether the task of creating a networked population is too often taken for granted in smart city discourse. Three critical dimensions in the production of digital infrastructure are proposed: urban screens, posthuman performances, and the leveraging effects of digital technology. The data-producing mobile bodies became the most critical actor in assembling the EER. Field research conducted at EER sites across the city of Seoul revealed that the mobile phone numbers intimately entangled with the mobile bodies (Barns, 2020) became the most critical 'dividual' (Deleuze, 1992) indicating those bodies. The illegibility of the QR codes and the invisibility embedded in the processing of digital data alienated the very producers, producing a sense of alienation accompanied by feelings of anxiety, doubt, and powerlessness. Findings on their differentiated posthuman bodies and their sense of alienation indicated that they were anything but the homogeneous 'smart citizens' often imagined in smart city discourse. Lastly, the thesis discusses the spatialities entailed in QR-codified urban space in two dimensions: the spatial order embedded in the EER and the spatial shifts experienced by citizens. The spatial order embedded in the EER is discussed as 'fragmented circulation', 'data-based public space', and 'invisible enclosure'. The spatial shifts encountered by citizens are discussed as 'collapsed linearity', 'liquid boundaries', and 'reproduction of digital speed'. The core element in mobilising this urban assemblage was the data-producing docile bodies moving across urban space with smartphones as their prostheses. As Lefebvre (2013) asserts, time-space is produced through practice, and these bodies reproduced digital speed onto the urban landscape. This case study highlights digital mediation in urban space as it emerges through the body-smartphone, and proposes that the study of digitally mediated cities, including smart city discourse, could more productively take the posthuman body as a valid unit of analysis.
This study examines the formation of the Electronic Entry Register, built by the South Korean government to control the circulation of the population in response to the COVID-19 pandemic, as an urban assemblage. Adopting a posthumanist perspective, it pays particular attention to the actors who embody the smartphone as a prosthesis and produce digital data. The study shows that creating a population that produces the data demanded by the government at the right time was the core of the development of the Electronic Entry Register, and that this was achieved through various forms of public communication. The government also embedded the QR code function in corporate mobile platforms, substituting companies' 'online customers' for the state's 'online population'. This process attests to how difficult it can be to form a networked field connecting the entire population and to have it produce data on an everyday basis, and it points to the lack of a clear premise about building an online population in smart city discourse. The study also identifies urban screens, data producers, and leveraging effects as key components of digital infrastructure. Docile bodies (Foucault, 2020) that operate smartphones while on the move and produce knowledge became the core driving force assembling the Electronic Entry Register as an urban assemblage. Field research on the Electronic Entry Register with citizens of Seoul confirmed that the QR code operates as the most important 'dividual' (Deleuze, 1992) indexing the mobile body. As symbolised by the QR code pattern, which is illegible to the human eye, the invisibility of digital data alienated the data producers throughout the whole process of production, collection, computation, and use, which was also expressed as anxiety and powerlessness. The data producers observed in this study were far from the political subjects represented as 'smart citizens' in smart city discourse. Finally, the thesis discusses the spatiality of QR-codified urban space in two respects: the spatial order embedded in the Electronic Entry Register and the changed structure of urban space. The spatial order embedded in the Electronic Entry Register is discussed as 'fragmented circulation', 'data-based public space', and 'invisible enclosure'; the experienced shifts in urban spatiality are presented as 'collapsed linearity', 'liquid boundaries', and 'reproduction of digital speed'. This study examines the Electronic Entry Register, built by the South Korean government for COVID-19 quarantine, as a case of constructing digital infrastructure and shows that creating citizens who produce digital data was central to its development. As Lefebvre (2013) asserted that time-space is produced through practice, bodies that embody the digital reproduced digital speed in urban space, and this case study captured how this reorganisation of urban space takes place through the body-smartphone. Accordingly, the study proposes the posthuman, who embodies digital devices as prostheses, as a valid unit of analysis for the study of urban space, and suggests that research on digital cities carries considerable implications for urban planning.
Table of contents: Introduction (QR Codifying Practice during the Covid-19 Pandemic in Seoul; Research Objective and Questions); Theoretical Background (Urban Assemblage; Digital Infrastructure; Mobile Dispositif; Assembling the Electronic Entry Register); Methodology (Research Design; Assembling / Structuring / Entrapping; Assembled / Altering / Empowering); Developing Digital Urban Infrastructure (Prototyping and Building Ecosystem; Creating Data-Producing Citizens; Networking Population on Commercial Platforms; Core Components of Digital Infrastructure); Data-Producing Mobile Bodies (Mobile Phone Numbers as Identification of Mobile Bodies; Relationship with Digital Data; Differentiated Posthuman Bodies); Digitally Mediated Urban Space (Spatial Order Intrinsic in the EER: Fragmented Circulation, Data-based Public Space, Invisible Enclosure; Spatialities Experienced by Citizens: Collapsed Linearity, Liquid Boundaries, Reproduction of Digital Speed); Conclusion.

    Broad-Coverage Automatic Event Analysis of General-Domain Estonian Texts (Eesti keele üldvaldkonna tekstide laia kattuvusega automaatne sündmusanalüüs)

With the large-scale digitisation of texts and the ever wider spread of digital text creation, enormous amounts of natural language text have become, and are becoming, machine-readable. Machine-readability has the potential to make these masses of text easier for people to manage, for example by enabling applications such as automatic summarisation and question answering over texts, but unfortunately current automatic analysis does not reach genuine understanding of the texts' content. It is hypothesised that event analysis brings us closer to automatic analysis that understands the content of texts: since many texts have a narrative structure and can be interpreted as "descriptions of events", extracting events from texts and representing them in a formal form should provide a basis for building various language technology applications that require "text understanding". This thesis investigates the extent to which event analysis of Estonian texts can be treated as an automatic linguistic analysis task that covers an open set of events and general-domain texts. The problem is approached from a perspective that is new in the context of automatic analysis of Estonian, focusing on the temporal semantics of events. The thesis adapts the TimeML annotation framework to Estonian and creates an automatic temporal expression tagger based on the framework, as well as a text corpus with temporal semantic annotation (event mentions, temporal expressions, and temporal relations); based on the corpus, the agreement of human annotators in identifying event mentions and temporal relations is analysed, and finally, the possibilities of extending time-centred event analysis to generic event analysis are studied using the example of resolving coreference between event-denoting expressions. The thesis offers guidelines for further developing the annotation of temporal semantics and event structure of texts, and the language resources created in this work enable both experimenting with concrete end-user applications (e.g. automatic answering of temporal questions) and further developing automatic annotation tools.
Due to massive-scale digitalisation processes and a switch from traditional means of written communication to digital written communication, vast amounts of human language text are becoming machine-readable. Machine-readability holds a potential for easing human effort in searching and organising large text collections, allowing applications such as automatic text summarisation and question answering. However, current tools for automatic text analysis do not reach the text understanding required for making these applications generic. It is hypothesised that automatic analysis of events in texts leads us closer to this goal, as many texts can be interpreted as stories or narratives that are decomposable into events. This thesis explores event analysis as a broad-coverage, general-domain automatic language analysis problem for Estonian, and provides an investigation starting from time-oriented event analysis and tending towards generic event analysis. We adapt the TimeML framework to Estonian, and create an automatic temporal expression tagger and a news corpus manually annotated for temporal semantics (event mentions, temporal expressions, and temporal relations) for the language; we analyse the consistency of human annotation of event mentions and temporal relations, and, finally, provide a preliminary study on event coreference resolution in Estonian news. The current work also makes suggestions on how future research can improve Estonian event and temporal semantic annotation, and the language resources developed in this work will allow future experimentation with end-user applications (such as automatic answering of temporal questions) as well as provide a basis for developing automatic semantic analysis tools.
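
As a rough illustration of what a rule-based temporal expression tagger does (not the tagger developed in the thesis, which targets Estonian and the full TimeML scheme), the sketch below marks a few simple date patterns in English text with TIMEX3-style tags; the patterns and example sentence are invented.

    # Toy rule-based temporal expression tagger: wraps a few simple date/time
    # patterns in TIMEX3-style tags, in the spirit of TimeML annotation.
    # The patterns and example sentence are invented for illustration only.
    import re

    PATTERNS = [
        r"\b\d{1,2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b",
        r"\b(yesterday|today|tomorrow)\b",
        r"\b\d{4}\b",                      # bare years
    ]
    TIMEX_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)

    def tag_timex(text):
        """Wrap recognised temporal expressions in <TIMEX3> ... </TIMEX3> tags."""
        return TIMEX_RE.sub(lambda m: f"<TIMEX3>{m.group(0)}</TIMEX3>", text)

    print(tag_timex("The corpus was annotated on 12 March 2015 and revised today."))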

    Knowledge extraction and popularity modeling using social media


    Methods for improving entity linking and exploiting social media messages across crises

Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools are available for different types of documents and domains; however, the entity linking literature has shown that the quality of a tool varies across corpora and depends on the specific characteristics of the corpus it is applied to. Moreover, the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications. In the first part of this thesis I explore an approximation of how difficult entity mentions are to link and frame it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can help identify critical cases as part of a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I propose a solution that exploits the results of distinct entity linking tools on the same corpus by leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments. An important component of most entity linking tools is the probability that a mention links to an entity in a reference knowledge base, and the computation of this probability is usually done over a static snapshot of the reference KB. However, an entity's popularity is temporally sensitive and may change due to short-term events; these changes may then be reflected in a KB, and EL tools can produce different results for a given mention at different times. I investigated how this prior probability changes over time, and the overall disambiguation performance, using KBs from different time periods.
The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms in the world, enabling people to share their opinions and post short messages about any subject on a daily basis. I first present an approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events can enable professionals involved in crisis management to better estimate damage using only the relevant information posted on social media channels, and to act immediately. I also performed an analysis of Twitter messages posted during the COVID-19 pandemic: I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and provide an analysis of the debate around it, using topic modeling, sentiment analysis, and hashtag recommendation techniques to provide insights into the online discussion of the COVID-19 pandemic.
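
As a minimal sketch of the per-mention combination idea described above (not the thesis's actual system), the code below trains a meta-classifier that predicts, from simple surface features of a mention, which of two entity linking tools to trust; the features, training examples, and tool labels are all invented placeholders.

    # Minimal sketch of combining entity linking (EL) tools on a per-mention basis:
    # a meta-classifier predicts, from simple mention features, which tool's answer
    # to trust. Features, labels, and tool outputs are invented placeholders.
    from sklearn.linear_model import LogisticRegression

    def mention_features(mention, context):
        """Cheap surface features of a mention and its surrounding context."""
        return [
            len(mention.split()),      # mention length in tokens
            len(context.split()),      # amount of surrounding context
            float(mention.istitle()),  # capitalisation hint
        ]

    # Training data: for each mention, which tool (0 or 1) produced the correct link.
    train_mentions = [
        ("Paris", "Paris Hilton attended the gala"),
        ("Paris", "flights from Paris to Rome"),
    ]
    best_tool = [1, 0]  # e.g. tool 1 handled the ambiguous person mention better

    X = [mention_features(m, c) for m, c in train_mentions]
    selector = LogisticRegression().fit(X, best_tool)

    # At prediction time, route each new mention to the tool the selector prefers.
    new_mention = ("Paris", "the Treaty of Paris was signed in 1783")
    tool_choice = selector.predict([mention_features(*new_mention)])[0]
    print("use tool", tool_choice, "for", new_mention[0])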

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Slot Filling

Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter on SF system performance. We contribute an analysis of typical SF recall loss and find that a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique and find that only 39% of results are reachable using a typical feature space. We expect this graph-based technique to be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is of major concern for propagation. While some conflicts are caused by a lack of sufficient disambiguating context (we explore adding additional contextual features to address this), many of these conflicts are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult: using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not detail how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotator world knowledge and thresholds for making probabilistic inference. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
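
As a generic illustration of framing a task as label propagation (not the thesis's graph representation, which models multiple types of label interaction), the sketch below propagates correct/incorrect labels from a few labelled candidate slot fills to unlabelled ones over a feature-similarity graph using scikit-learn; the features and labels are invented.

    # Toy semi-supervised label propagation over candidate slot fills.
    # Each row is a (query entity, candidate fill) pair described by invented
    # features; label -1 marks unlabelled candidates whose labels are inferred
    # from nearby labelled points in feature space.
    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    # Invented feature vectors, e.g. [pattern score, sentence distance, NER confidence]
    X = np.array([
        [0.9, 0.1, 0.8],   # labelled: correct fill   -> 1
        [0.2, 0.9, 0.3],   # labelled: incorrect fill -> 0
        [0.8, 0.2, 0.7],   # unlabelled
        [0.3, 0.8, 0.2],   # unlabelled
    ])
    y = np.array([1, 0, -1, -1])   # -1 = unlabelled

    model = LabelSpreading(kernel="rbf", gamma=5.0)
    model.fit(X, y)
    print(model.transduction_)     # inferred labels for all points, incl. unlabelled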