1,081 research outputs found

    Automating the anonymisation of textual corpora

    A huge amount of new textual data is created every day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without first undergoing proper anonymisation. Automating this task is challenging and often requires labelling in-domain data from scratch, since annotated anonymised corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and release a new Spanish corpus of annotated, anonymised spontaneous dialogue data, and (ii) to exploit this resource to investigate techniques for automating sensitive data identification in a setting where no annotated data from the target domain are initially available. To the first end, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, so the result is readable, natural text, and it contains annotations for eleven anonymisation categories as well as for linguistic and extra-linguistic phenomena such as code-switching, laughter, repetitions, and mispronunciations.
Next, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier while annotating as little data as possible. To improve this setting, Knowledge Transfer techniques from another small, already annotated anonymisation corpus are explored for seed selection and query selection strategies. Results show that the proposed seed selection methods yield the best seeds on which to initialise the base learner's training, and that combining the source and target classifiers' uncertainties as a query strategy improves the Active Learning process, resulting in steeper learning curves and top classifier performance in fewer iterations.
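
The query strategy described in this abstract, which combines the uncertainties of a source classifier and a target classifier, could look roughly like the sketch below. The abstract does not say how the two uncertainties are combined, so the weighted sum of prediction entropies, the `weight` parameter, and the function names are all illustrative assumptions, not the thesis's method.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(source_probs, target_probs, batch_size, weight=0.5):
    """Return indices of the pool items with the highest combined uncertainty.

    source_probs / target_probs hold one class distribution per pool item,
    from a classifier trained on the source corpus and from one trained on
    the target labels gathered so far. `weight` is a hypothetical mixing
    parameter (assumption: a simple weighted sum of the two entropies).
    """
    scores = []
    for idx, (src, tgt) in enumerate(zip(source_probs, target_probs)):
        combined = weight * entropy(src) + (1.0 - weight) * entropy(tgt)
        scores.append((combined, idx))
    # Highest combined uncertainty first; ties broken by lower index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:batch_size]]


# Toy pool of three items with two classes each (made-up probabilities).
source_probs = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
target_probs = [[0.8, 0.2], [0.55, 0.45], [0.95, 0.05]]
print(select_queries(source_probs, target_probs, batch_size=2))
```

In a pool-based loop, the returned items would be sent for annotation, added to the target training set, and the target classifier retrained before the next round.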

    Explainable methods for knowledge graph refinement and exploration via symbolic reasoning

    Knowledge Graphs (KGs) have applications in many domains such as Finance, Manufacturing, and Healthcare. While recent efforts have created large KGs, their content is far from complete and sometimes includes invalid statements. Therefore, it is crucial to refine the constructed KGs to enhance their coverage and accuracy via KG completion and KG validation. It is also vital to provide human-comprehensible explanations for such refinements, so that humans can trust the KG quality. Enabling KG exploration, by search and browsing, is also essential for users to understand the value and limitations of a KG for downstream applications. However, the large size of KGs makes KG exploration very challenging. While the type taxonomy of KGs is a useful asset along these lines, it remains insufficient for deep exploration. In this dissertation we tackle these challenges of KG refinement and KG exploration by combining logical reasoning over the KG with other techniques such as KG embedding models and text mining. Through this combination, we introduce methods that provide human-understandable output. The dissertation makes the following contributions:
    • For KG completion, we present ExRuL, a method for revising Horn rules by adding exception conditions to the rule bodies. The extended rules can infer new facts and thus close gaps in the KG. Experiments with large KGs show that this method substantially reduces errors in the derived facts and yields user-friendly explanations.
    • With RuLES, we present a rule learning method based on probabilistic representations of missing facts. It iteratively extends the rules induced from a KG by combining neural KG embeddings with information from text corpora, and uses new rule quality metrics during rule generation. Experiments show that RuLES considerably improves the quality of the learned rules and their predictions.
    • To support KG validation, we present ExFaKT, a framework for constructing explanations for candidate facts. The method uses rules to transform candidates into a set of statements that are easier to find and to validate or refute. The output of ExFaKT is a set of semantic evidence for candidate facts, extracted from text corpora and the KG. Experiments show that these transformations clearly improve the yield and quality of the discovered explanations. The generated explanations support both manual KG validation by curators and automatic validation.
    • To support KG exploration, we present ExCut, a method for computing informative entity clusters with explanations using KG embeddings and automatically induced rules. A cluster explanation consists of a combination of relations between the entities that identify the cluster. ExCut improves cluster quality and cluster explainability at the same time by iteratively interleaving the learning of embeddings and rules. Experiments show that ExCut computes high-quality clusters and that the cluster explanations are informative for users.
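
To illustrate what an exception-aware rule does at inference time, here is a minimal sketch that applies one hand-written rule to a toy set of triples. The rule (spouses live in the same place, unless the person is a researcher), the entity names, and the `apply_rule` helper are hypothetical examples chosen for illustration. The sketch only applies a given rule and does not reproduce how the dissertation learns rules or exceptions.

```python
def apply_rule(triples, exceptions):
    """Infer livesIn(X, Z) from marriedTo(X, Y) and livesIn(Y, Z),
    skipping every X listed in `exceptions` (the rule's exception condition).
    """
    lives_in = {(s, o) for s, p, o in triples if p == "livesIn"}
    married = {(s, o) for s, p, o in triples if p == "marriedTo"}
    inferred = set()
    for x, y in married:
        if x in exceptions:
            continue  # exception fires: do not predict for this entity
        for person, place in lives_in:
            # Only add facts that are missing from the graph.
            if person == y and (x, place) not in lives_in:
                inferred.add((x, "livesIn", place))
    return inferred


# Toy knowledge graph (made-up entities).
triples = {
    ("ann", "marriedTo", "bob"), ("bob", "livesIn", "berlin"),
    ("cid", "marriedTo", "dee"), ("dee", "livesIn", "oslo"),
    ("cid", "type", "researcher"),
}
researchers = {s for s, p, o in triples if p == "type" and o == "researcher"}

print(apply_rule(triples, exceptions=set()))        # plain Horn rule
print(apply_rule(triples, exceptions=researchers))  # rule with exception
```

With the exception in place, the rule no longer predicts a residence for `cid`, which is the kind of error reduction in derived facts that the abstract attributes to exception-aware rules.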

    Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps

    Concept maps can be used to concisely represent important information and bring structure into large document collections. Therefore, we study a variant of multi-document summarization that produces summaries in the form of concept maps. However, suitable evaluation datasets for this task are currently missing. To close this gap, we present a newly created corpus of concept maps that summarize heterogeneous collections of web documents on educational topics. It was created using a novel crowdsourcing approach that allows us to efficiently determine important elements in large document collections. We release the corpus along with a baseline system and proposed evaluation protocol to enable further research on this variant of summarization. Comment: Published at EMNLP 201

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, we also present cheminformatics approaches for mapping the extracted chemical names into chemical structures and annotating them, together with text mining applications for linking chemistry with biological information. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of the data makes it very difficult to focus on the most appropriate information. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets: multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data in order to deliver fine-grained knowledge to the user.
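
As a small illustration of the duplicate detection step mentioned in this abstract, the sketch below flags near-duplicate documents by comparing sets of word shingles with Jaccard similarity. This is one common generic technique, not necessarily one the survey covers. The shingle size `k`, the `threshold` value, and the function names are assumptions made for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_duplicates(docs, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose shingle sets overlap enough."""
    sets = [shingles(d, k) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up documents: the first two differ only in capitalisation.
docs = [
    "text mining extracts useful features from large document collections",
    "Text mining extracts useful features from large document collections",
    "duplicate detection removes repeated records before analysis",
]
print(find_duplicates(docs))
```

The pairwise comparison is quadratic in the number of documents, so it is only suitable for small collections.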
