De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models
Electronic Medical Records (EMRs) contain clinical narrative text that is of
great potential value to medical researchers. However, this information is
mixed with Personally Identifiable Information (PII) that presents risks to
patient and clinician confidentiality. This paper presents an end-to-end
de-identification framework to automatically remove PII from hospital discharge
summaries. Our corpus included 600 hospital discharge summaries which were
extracted from the EMRs of two principal referral hospitals in Sydney,
Australia. Our end-to-end de-identification framework consists of three
components: 1) Annotation: labelling of PII in the 600 hospital discharge
summaries using five pre-defined categories: person, address, date of birth,
identification number, phone number; 2) Modelling: training six named entity
recognition (NER) deep learning base-models on balanced and imbalanced
datasets; and evaluating ensembles that combine all six base-models, the three
base-models with the best F1 scores and the three base-models with the best
recall scores respectively, using token-level majority voting and stacking
methods; and 3) De-identification: removing PII from the hospital discharge
summaries. Our results showed that the ensemble model combined using the
stacking Support Vector Machine (SVM) method on the three base-models with the
best F1 scores achieved excellent results, with an F1 score of 99.16% on the test
set of our corpus. We also evaluated the robustness of our modelling component
on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses the
token-level majority voting method on all six base-models, achieved the highest
F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64%
at binary token-level matching compared to two state-of-the-art methods. The
framework provides a robust solution to de-identifying clinical narrative text
safely.
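The paper's full pipeline is not reproduced here, but the token-level majority voting it describes can be sketched in a few lines of Python; the BIO-style labels and the three-model setup below are illustrative assumptions, not the authors' configuration:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine token-level label sequences from several base models.

    predictions: list of label sequences, one per base model, where each
    sequence tags the same tokenised document. Returns the per-token
    majority label (ties go to the first model that produced the label).
    """
    voted = []
    for token_labels in zip(*predictions):
        label, _ = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted

# Example: three base models tag the tokens "John was admitted on 03/05".
model_a = ["B-PERSON", "O", "O", "O", "B-DATE"]
model_b = ["B-PERSON", "O", "O", "O", "O"]
model_c = ["O",        "O", "O", "O", "B-DATE"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['B-PERSON', 'O', 'O', 'O', 'B-DATE']
```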
GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
In this paper we present GumDrop, Georgetown University's entry at the DISRPT
2019 Shared Task on automatic discourse unit segmentation and connective
detection. Our approach relies on model stacking, creating a heterogeneous
ensemble of classifiers, which feed into a metalearner for each final task. The
system encompasses three trainable component stacks: one for sentence
splitting, one for discourse unit segmentation and one for connective
detection. The flexibility of each ensemble allows the system to generalize
well to datasets of different sizes and with varying levels of homogeneity.
Comment: Proceedings of Discourse Relation Parsing and Treebanking (DISRPT2019)
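As a rough illustration of model stacking in general (not GumDrop's actual component stacks or features), heterogeneous base classifiers can feed a metalearner via scikit-learn's `StackingClassifier`; the choice of estimators and the binary boundary-detection framing below are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Heterogeneous base classifiers feed a metalearner, which makes the
# final per-token decision (e.g. "discourse-unit boundary" vs. "not").
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=10)),
        ("forest", RandomForestClassifier(n_estimators=100)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold base predictions avoid leaking training labels
)
# X: per-token feature vectors, y: 0/1 boundary labels (assumed shapes).
# stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```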
Transfer learning: bridging the gap between deep learning and domain-specific text mining
Inspired by the success of deep learning techniques in Natural Language Processing (NLP), this dissertation tackles domain-specific text mining problems for which generic deep learning approaches fail. More specifically, the domain-specific problems are: (1) success prediction in crowdfunding, (2) variant identification in biomedical literature, and (3) text data augmentation for low-resource domains.
In the first part, transfer learning is applied from a multimodal perspective to project success prediction in crowdfunding. Although the information in a project profile spans several modalities, such as text, images, and metadata, most existing prediction approaches leverage only the text. It is therefore promising to exploit the images in project profiles and examine how they contribute to success prediction. A neural network scheme that combines information learned from the different modalities is designed and evaluated for project success prediction.
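A minimal late-fusion sketch of such a multimodal scheme, in PyTorch, might concatenate text and image embeddings before a small classifier; the dimensions, layer sizes, and class names below are illustrative assumptions, not the dissertation's architecture:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical late fusion: concatenate text and image embeddings,
    then classify project success (all names here are illustrative)."""

    def __init__(self, text_dim=768, image_dim=2048, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 2),  # success vs. failure
        )

    def forward(self, text_emb, image_emb):
        # text_emb: (batch, text_dim), e.g. from a pretrained language model
        # image_emb: (batch, image_dim), e.g. from a pretrained CNN
        return self.fuse(torch.cat([text_emb, image_emb], dim=-1))

model = MultimodalFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))  # -> (4, 2)
```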
In the second part, transfer learning is combined with deep learning techniques to solve genomic variant Named Entity Recognition (NER) in biomedical literature. Most advanced generic NER algorithms fail here because the available training corpus is small. However, these generic deep learning algorithms can learn from a canonical corpus without any feature engineering. This work builds an end-to-end deep learning approach that transfers domain-specific knowledge into those generic NER algorithms, addressing the challenge of low-resource training while requiring neither hand-crafted features nor post-processing rules.
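One common way to realise this kind of transfer, sketched here under the assumption of a Hugging Face transformers setup (the model name and label set are hypothetical, not the dissertation's), is to fine-tune a generically pretrained encoder for token classification:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-Variant", "I-Variant"]  # hypothetical variant tag set

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

# Generic pretraining supplies the transferable linguistic knowledge;
# the small annotated biomedical corpus then only has to teach the
# variant-specific labels (e.g. via the standard transformers Trainer).
inputs = tokenizer("The BRCA1 c.68_69delAG variant was observed.",
                   return_tensors="pt")
per_token_logits = model(**inputs).logits  # (1, seq_len, len(labels))
```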
For the last part, transfer learning with knowledge distillation and active learning is used to solve text augmentation for low-resource domains. Most recent text augmentation methods rely heavily on large external resources. This work is dedicated to solving the text augmentation problem adaptively and consistently with minimal resources for token-level tasks such as NER. The solution also assures the reliability of machine labels for noisy data and enhances training consistency under noisy labels.
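The standard soft-label distillation loss gives a flavour of the knowledge-distillation ingredient; this is a generic sketch following Hinton et al. (2015), not the dissertation's exact formulation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation for token-level tasks (a generic sketch).

    Both tensors have shape (tokens, num_labels). The teacher's softened
    distribution supervises the student, one way to transfer knowledge
    when gold labels are scarce or noisy.
    """
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in Hinton et al. (2015)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

loss = distillation_loss(torch.randn(10, 5), torch.randn(10, 5))
```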
Each part is evaluated on its own domain-specific benchmarks. Experimental results demonstrate the effectiveness of the proposed methods and indicate promising potential for transfer learning in domain-specific applications.
Multi-Engine Approach for Named Entity Recognition in Bengali
PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200
Neural approaches to sequence labeling for information extraction
An important aspect of artificial intelligence (AI) is the interpretation of human language expressed in textual (written) form: natural language processing (NLP) matters because textual information is useful for many applications. Yet understanding it (so-called natural language understanding, NLU) is challenging, given the unstructured form of text, whose meaning is often ambiguous and context-dependent. In this dissertation we introduce solutions to shortcomings of related work in handling fundamental natural language processing tasks, such as named entity recognition (i.e. identifying the entities that occur in a sentence) and relation extraction (identifying relations between entities). Starting from a specific problem (namely identifying the structure of a house from a textual real-estate listing), we incrementally build a complete (automated) solution for the above tasks, based on neural network architectures. Our solutions are generally applicable across application domains and languages. We additionally consider the task of identifying relevant sub-events during an event (e.g. a goal during a football match) in Twitter streams. More specifically, we formulate this problem as the labelling of word sequences (comparable to named entity recognition), exploiting the chronological relation between successive tweets.
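A toy example of that sequence-labelling formulation (the tag set and the span-extraction helper are illustrative, not the thesis system):

```python
# Casting in-event detection as word-sequence labelling with BIO tags,
# analogous to named entity recognition.
tweet = ["Goal", "by", "Messi", "in", "the", "89th", "minute", "!"]
tags  = ["B-GOAL", "O", "B-PLAYER", "O", "O", "B-TIME", "I-TIME", "O"]

def extract_spans(tokens, tags):
    """Collect (label, text) spans from a BIO-tagged token sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(extract_spans(tweet, tags))
# -> [('GOAL', 'Goal'), ('PLAYER', 'Messi'), ('TIME', '89th minute')]
```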
Information Extraction from Unstructured Text for the Biodefense Knowledge Center
The Bio-Encyclopedia at the Biodefense Knowledge Center (BKC) is being constructed to allow early detection of emerging biological threats to homeland security. It requires highly structured information extracted from a variety of data sources. However, the quantity of new and vital information available from everyday sources cannot be assimilated by hand, and reliable high-throughput information extraction techniques are therefore needed. In support of the BKC, Lawrence Livermore National Laboratory and Oak Ridge National Laboratory, together with the University of Utah, are developing an information extraction system built around the bioterrorism domain. This paper reports two important pieces of our effort integrated in the system: key phrase extraction and semantic tagging. Whereas the two key phrase extraction technologies developed during the course of the project help identify relevant texts, our state-of-the-art semantic tagging system can pinpoint phrases related to emerging biological threats. We are also enhancing and tailoring the Bio-Encyclopedia by augmenting semantic dictionaries and extracting details of important events, such as suspected disease outbreaks. Some of these technologies have already been applied to large corpora of free-text sources vital to the BKC mission, including ProMED-mail, PubMed abstracts, and the DHS's Information Analysis and Infrastructure Protection (IAIP) news clippings. To address the challenges involved in incorporating such large amounts of unstructured text, the overall system focuses on precise extraction of the most relevant information for inclusion in the BKC.
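The BKC extractors themselves are not detailed in the abstract, but a minimal TF-IDF sketch conveys the general idea behind key phrase extraction for relevance filtering; the example documents and parameters below are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Suspected anthrax outbreak reported near the livestock market.",
    "Officials confirm avian influenza cases in migratory birds.",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Rank candidate phrases per document by TF-IDF weight.
terms = vectorizer.get_feature_names_out()
for row, doc in zip(tfidf.toarray(), docs):
    top = sorted(zip(row, terms), reverse=True)[:3]
    print(doc, "->", [term for score, term in top if score > 0])
```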
Cyber Security
This open access book constitutes the refereed proceedings of the 18th China Annual Conference on Cyber Security, CNCERT 2022, held in Beijing, China, in August 2022. The 17 papers presented were carefully reviewed and selected from 64 submissions. The papers are organized according to the following topical sections: data security; anomaly detection; cryptocurrency; information security; vulnerabilities; mobile internet; threat intelligence; text recognition.