Discovery of sensitive data with natural language processing
Protecting sensitive data is a growing concern of increasing importance, driven especially
by the directives and laws imposed by the European Union. Efforts to build automatic systems
are ongoing, but in most cases the underlying processes are still manual or semi-automatic.
In this work, we developed a component that extracts and classifies sensitive data from
unstructured text in European Portuguese. The objective was to create a system that allows
organizations to understand their data and comply with legal and security requirements. We
studied a hybrid approach to Named Entity Recognition for the Portuguese language, combining
rule-based and lexicon-based models, machine learning algorithms, and neural networks. The
rule-based and lexicon-based approaches were used only for a set of specific classes. For the
remaining entity classes, the SpaCy and Stanford NLP tools were tested, two statistical models
(Conditional Random Fields and Random Forest) were implemented, and, finally, a Bidirectional
LSTM approach was experimented with. The best results were achieved with the Stanford NER
model (86.41%) from the Stanford NLP tool. Among the statistical models, Conditional Random
Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we
achieved an F1-score of 83.01%. The corpora used for training and
testing were the HAREM Golden Collection, SIGARRA News Corpus, and DataSense NER Corpus.
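For the rule- and lexicon-based classes mentioned above, extraction typically reduces to pattern matching. A minimal sketch in Python, with invented patterns (EMAIL, PHONE_PT, NIF) standing in for the component's actual rule set, which is not reproduced here:

```python
import re

# Hypothetical patterns for a few sensitive-data classes; the thesis's
# real rules and lexicons are more elaborate than these.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    # Portuguese mobile numbers: optional +351 prefix, 9xx xxx xxx
    "PHONE_PT": re.compile(r"\b(?:\+351\s?)?9\d{2}\s?\d{3}\s?\d{3}\b"),
    "NIF": re.compile(r"\b\d{9}\b"),  # Portuguese tax number: 9 digits
}

def extract_sensitive(text):
    """Return (label, match) pairs for every rule that fires."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group()))
    return hits
```

Statistical and neural models would then cover the remaining entity classes on top of this rule layer.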
A study on reusing resources of speech synthesis for closely-related languages
This thesis describes research on building a text-to-speech (TTS) framework that can accommodate the lack of linguistic information in under-resourced languages by reusing existing resources from another language. It describes the adaptation process required when such limited resources are used. The main natural languages involved in this research are Malay and Iban.
The thesis includes a study on grapheme-to-phoneme mapping and the substitution of phonemes. A set of substitution matrices is presented, showing phoneme confusion in terms of perception among respondents. The experiments conducted studied intelligibility as well as perception based on the context of utterances.
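A substitution matrix of this kind can be applied as a simple lookup: for each phoneme absent from the source-language inventory, choose the replacement that listeners most often perceive as the same sound. A hedged sketch, with invented phonemes and scores rather than the matrices derived in the thesis:

```python
# Hypothetical perception-based substitution matrix: keys are phonemes
# missing from the source voice, values map candidate replacements to a
# confusion score in [0, 1] (higher = more often perceived as identical).
# All values here are invented for illustration.
SUBSTITUTION = {
    "ʔ": {"k": 0.62, "t": 0.21, "h": 0.09},
    "ə": {"a": 0.48, "e": 0.35, "o": 0.05},
}

def best_substitute(phoneme):
    """Pick the replacement with the highest perception score,
    or keep the phoneme unchanged if no substitution is needed."""
    candidates = SUBSTITUTION.get(phoneme)
    if not candidates:
        return phoneme
    return max(candidates, key=candidates.get)
```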
A study of phonetic prosody is then presented and compared with the Klatt duration model, to determine whether a cross-language duration model exists. A comparative study of an Iban native speaker against an Iban polyglot TTS built from Malay resources is then presented, to confirm that Malay prosody can be used to generate synthesized Iban speech.
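The Klatt duration model referenced above modifies only the compressible part of a segment's duration: contextual rules scale the gap between a phoneme's inherent and minimum durations. A sketch of that rule as it is commonly stated, with illustrative millisecond values:

```python
def klatt_duration(inherent_ms, min_ms, factors):
    """Klatt-style duration rule: DUR = MINDUR + (INHDUR - MINDUR) * PRCNT,
    where PRCNT is the product of the percentage factors contributed by
    each contextual rule (100 = no change). Values are illustrative."""
    prcnt = 1.0
    for f in factors:
        prcnt *= f / 100.0
    return min_ms + (inherent_ms - min_ms) * prcnt
```

For example, two rules each halving the compressible part of a 140 ms vowel with a 60 ms minimum yield 60 + 80 * 0.25 = 80 ms.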
The central hypothesis of this thesis is that natural-sounding speech can be produced by using the resources of a closely related language. The aim of this research was to show that, by adhering to the characteristics of the indigenous language, it is possible to build a polyglot synthesized speech system even with insufficient speech resources.
Knowledge graph construction using named entity normalization techniques
Doctoral dissertation -- Seoul National University Graduate School: Department of Industrial Engineering, College of Engineering, February 2023. Advisor: 조성준.

Text mining aims to extract information from documents to derive valuable insights. A knowledge graph provides richer information drawn from various documents. Past literature has responded to such needs by building technology trees or concept networks from the bibliographic information of documents, or by relying on text mining techniques to extract keywords and/or phrases. In this paper, we propose a framework for building a knowledge graph using named entities. The knowledge graph construction framework in this paper satisfies the following conditions: (1) extracting named entities in their completed form; (2) building datasets that can be used to train and evaluate named entity normalization models in domains such as finance and technical documents, in addition to bioinformatics, where existing NEN research has been active; (3) creating a better-performing named entity normalization model; and (4) constructing the knowledge graph by grouping named entities that appear in various forms but share the same meaning.

Chapter 1 Introduction
Chapter 2 Literature review
2.1 Named entity normalization dataset
2.2 Named entity normalization
2.3 Knowledge graph construction
Chapter 3 Dictionary construction for named entity normalization
3.1 Background
3.2 Dictionary construction methods
3.2.1 Finance named entity normalization dataset
3.2.2 Patent named entity normalization dataset
3.3 Chapter summary
Chapter 4 Named entity normalization model using edge weight updating neural network
4.1 Background
4.2 Proposed model
4.2.1 Ground truth entity graph construction
4.2.2 Similarity-based entity graph construction
4.2.3 Edge weight updating neural network training
4.2.4 Edge weight updating neural network inferencing
4.3 Experiment results
4.3.1 Datasets
4.3.2 Experiment settings: named entity normalization in bioinformatics
4.3.3 Experiment settings: named entity normalization in finance
4.4 Results
4.4.1 Quantitative analysis: bioinformatics
4.4.2 Quantitative analysis: finance
4.4.3 Qualitative analysis
4.5 Chapter summary
Chapter 5 Building knowledge graph using named entity recognition and normalization models
5.1 Background
5.2 Proposed model
5.2.1 Named entity normalization
5.2.2 Construction of the semiconductor-related patent knowledge graph
5.3 Experiment results
5.3.1 Comparison models
5.3.2 Parameter settings
5.4 Results
5.4.1 Quantitative evaluations
5.4.2 Qualitative evaluations
5.4.3 Knowledge graph visualization and exemplary investigation
5.5 Chapter summary
Chapter 6 Conclusion
6.1 Contributions
6.2 Future work
Bibliography
Abstract (in Korean)
Acknowledgements
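Condition (4) of the framework above, grouping surface forms that share a meaning, can be illustrated with a naive string-similarity baseline; this sketch is not the edge-weight-updating neural network the dissertation proposes, only a minimal stand-in for the grouping step:

```python
from difflib import SequenceMatcher

def normalize_entities(mentions, threshold=0.8):
    """Group mention strings that likely refer to the same entity:
    link two mentions when their SequenceMatcher ratio meets the
    threshold, then return the connected components as groups."""
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            a, b = mentions[i].lower(), mentions[j].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())
```

Each resulting group would then become a single node in the knowledge graph, however its members are spelled in the source documents.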
An Approach for the Automatic Generation of On-line Information Systems Based on the Integration of Natural Language Processing and Adaptive Hypermedia Techniques
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 29-05-200