Russian word sense induction by clustering averaged word embeddings
The paper reports our participation in the shared task on word sense
induction and disambiguation for the Russian language (RUSSE-2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th
for the bts-rnc and active-dict datasets (containing mostly polysemous words)
among all 19 participants.
The method we employed was extremely naive: we represented contexts of
ambiguous words as averaged word embedding vectors, using off-the-shelf
pre-trained distributional models. These vector representations were then
clustered with mainstream clustering techniques, producing groups
corresponding to the senses of the ambiguous words. As a side result, we show
that word embedding models trained on small but balanced corpora can be
superior to those trained on large but noisy data, not only in intrinsic
evaluation but also in downstream tasks like word sense induction.
Comment: Proceedings of the 24th International Conference on Computational
Linguistics and Intellectual Technologies (Dialogue-2018)
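The approach described above can be sketched end to end in a few lines. In this sketch the toy hand-made vectors and the minimal k-means stand in for the pre-trained distributional models and the mainstream clustering techniques the paper uses; all names and data are ours, for illustration only.

```python
def average_vector(context, embeddings):
    """Represent a context as the mean of its word vectors (known words only)."""
    vecs = [embeddings[w] for w in context if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def kmeans(points, k, iters=10):
    """A minimal k-means: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-d "embeddings" standing in for a pre-trained model.
emb = {"river": [1.0, 0.0], "water": [0.9, 0.1],
       "money": [0.0, 1.0], "loan": [0.1, 0.9]}
contexts = [["river", "water"], ["money", "loan"],
            ["water", "river"], ["loan", "money"]]
points = [average_vector(c, emb) for c in contexts]
labels = kmeans(points, k=2)  # each cluster = one induced sense
```

With real data the embeddings would come from a pre-trained model and the clustering from an off-the-shelf library, but the shape of the method is exactly this: average, then cluster.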
WiSeBE: Window-based Sentence Boundary Evaluation
Sentence Boundary Detection (SBD) has been a major research topic since
Automatic Speech Recognition transcripts have been used for further Natural
Language Processing tasks like Part of Speech Tagging, Question Answering or
Automatic Summarization. But what about evaluation? Are standard evaluation
metrics like precision, recall, F-score or classification error, and, more
importantly, evaluating an automatic system against a single reference, enough
to conclude how well an SBD system performs given the final application of
the transcript? In this paper we propose Window-based Sentence Boundary
Evaluation (WiSeBE), a semi-supervised metric for evaluating Sentence Boundary
Detection systems based on multi-reference (dis)agreement. We evaluate and
compare the performance of different SBD systems over a set of YouTube
transcripts using WiSeBE and standard metrics. This double evaluation shows
why WiSeBE is a more reliable metric for the SBD task.
Comment: In Proceedings of the 17th Mexican International Conference on
Artificial Intelligence (MICAI), 201
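The multi-reference idea can be illustrated with a toy score. This is our own simplification, not the exact WiSeBE formula: each hypothesized boundary is credited with the fraction of human references that place a boundary within a small token window, and the credits are averaged.

```python
def window_agreement(hyp, refs, w=1):
    """Toy multi-reference window score (illustrative, not WiSeBE itself).

    hyp:  list of hypothesized boundary positions (token indices)
    refs: list of sets, one set of boundary positions per human reference
    w:    window half-width in tokens
    """
    if not hyp:
        return 0.0
    scores = []
    for b in hyp:
        # Fraction of references with a boundary within +/- w tokens of b.
        agree = sum(any(abs(b - r) <= w for r in ref) for ref in refs)
        scores.append(agree / len(refs))
    return sum(scores) / len(scores)
```

A boundary that all annotators agree on scores 1.0; a boundary only some references support scores proportionally less, which is exactly the kind of disagreement a single-reference metric cannot express.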
Development of Graph from D-matrix based on Ontological Text Mining Method
The fault dependency matrix (D-matrix) is a diagnostic model that captures fault data for a system and its causal relationships at the hierarchical system level. It consists of the dependencies and relationships between observable failure modes and the symptoms associated with a system. Constructing such a D-matrix fault detection model is a time-consuming task. In this paper, we propose a system that uses an ontology-based text mining method to automatically construct the D-matrix by mining hundreds of repair verbatims (typically written in unstructured text) collected during diagnosis episodes. We first construct a fault diagnosis ontology, and then apply text mining techniques to identify dependencies among failure modes and observable symptoms. The D-matrix is represented as a graph, so that analysis becomes easier and faulty parts are more easily detectable. The proposed method will be implemented as a prototype tool and validated using real-life data collected from the automobile domain.
DOI: 10.17762/ijritcc2321-8169.15055
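To make the graph representation concrete, here is a minimal sketch of turning a D-matrix into an adjacency-list graph and querying it for candidate faults. The failure modes, symptoms and matrix entries are invented for illustration; they are not from the paper's automobile data.

```python
# Rows are failure modes, columns are observable symptoms; a 1 means
# "this failure mode produces this symptom".
failure_modes = ["battery_dead", "alternator_fault"]
symptoms = ["no_start", "dim_lights", "warning_light"]
d_matrix = [
    [1, 1, 0],  # battery_dead     -> no_start, dim_lights
    [0, 1, 1],  # alternator_fault -> dim_lights, warning_light
]

# Adjacency-list graph built from the matrix.
graph = {fm: {s for s, bit in zip(symptoms, row) if bit}
         for fm, row in zip(failure_modes, d_matrix)}

def candidate_faults(observed):
    """Failure modes whose symptom sets cover all observed symptoms."""
    return [fm for fm, ss in graph.items() if set(observed) <= ss]
```

Once the matrix is a graph, diagnosis reduces to graph queries: which failure modes reach the observed symptoms, which symptom would best discriminate between the remaining candidates, and so on.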
Period disambiguation with MaxEnt model
This paper presents our recent work on period disambiguation, the kernel problem in sentence boundary identification, with the maximum entropy (MaxEnt) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how the context window, the feature space and lexical information such as abbreviated and sentence-initial words affect learning performance. Such lexical information can be acquired automatically from a training corpus by a learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information eliminates 93.52% of the remaining errors of the baseline MaxEnt model, achieving an F-score of 99.8227%.
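The feature side of such a model can be sketched as follows. The abbreviation and sentence-initial word lists and the window size here are illustrative placeholders; in the paper they are acquired automatically from the training corpus, and the features feed a MaxEnt classifier rather than being used directly.

```python
# Illustrative lexical lists; the paper learns these from the corpus.
ABBREVIATIONS = {"Dr.", "Mr.", "etc.", "U.S."}
SENT_INITIAL = {"The", "He", "It", "However"}

def period_features(tokens, i, window=2):
    """Features for the period-bearing token i (a candidate boundary)."""
    feats = {}
    # Context-window features: surrounding tokens up to `window` away.
    for d in range(1, window + 1):
        feats[f"prev_{d}"] = tokens[i - d] if i - d >= 0 else "<S>"
        feats[f"next_{d}"] = tokens[i + d] if i + d < len(tokens) else "</S>"
    # Lexical features: is the token an abbreviation, and does a typical
    # sentence-initial word follow?
    feats["prev_is_abbrev"] = tokens[i] in ABBREVIATIONS
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</S>"
    feats["next_is_sent_initial"] = nxt in SENT_INITIAL
    return feats
```

For a token like "Dr." the abbreviation feature fires and the classifier can learn that the period is unlikely to end a sentence, which is precisely the error class the paper's extended feature space removes.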
An Ontology-Based Text Mining Method to Develop the D-Matrix
In this paper, we demonstrate an ontology-based text mining method for developing and updating a D-matrix by automatically extracting information from a large number of repair verbatims (written in unstructured text) collected during the diagnosis stages. The fault dependency (D) matrix is a systematic diagnostic model used to capture system-level diagnostic data, including the dependencies between observable symptoms and the failure modes associated with a system. Constructing a D-matrix is a time-consuming process: developing it from first principles and updating it with domain information is labor-intensive, and augmenting it over time as new symptoms and failure modes are observed is a difficult task. In this methodology, we first develop a fault diagnosis ontology that includes the concepts and relationships commonly seen in the fault diagnosis field. We then use a text mining algorithm that exploits this ontology to identify basic items, such as parts, symptoms, failure modes and conditions, from the unstructured repair verbatims. The proposed technique is implemented as a prototype tool and validated using real-life information collected from the automobile domain.
Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All
Collective entity disambiguation aims to jointly resolve multiple mentions by
linking them to their associated entities in a knowledge base. Previous works
are primarily based on the underlying assumption that entities within the same
document are highly related. However, the extent to which these mentioned
entities are actually connected in reality is rarely studied and therefore
raises interesting research questions. For the first time, we show that the
semantic relationships between the mentioned entities are in fact less dense
than expected. This could be attributed to several reasons such as noise, data
sparsity and knowledge base incompleteness. As a remedy, we introduce MINTREE,
a new tree-based objective for the entity disambiguation problem. The key
intuition behind MINTREE is the concept of coherence relaxation which utilizes
the weight of a minimum spanning tree to measure the coherence between
entities. Based on this new objective, we design a novel entity disambiguation
algorithm, which we call Pair-Linking. Instead of considering all the given
mentions, Pair-Linking iteratively selects a pair with the highest confidence
at each step for decision making. Via extensive experiments, we show that our
approach is not only more accurate but also surprisingly faster than many
state-of-the-art collective linking algorithms.
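The greedy pair-selection loop can be sketched as follows. The candidate sets and the confidence lookup table are invented for illustration; in the paper, the confidence of a pair combines local mention-entity similarity with pairwise coherence (the MINTREE tree-based objective).

```python
def pair_linking(mentions, candidates, score):
    """Iteratively commit the mention-entity pair with highest confidence.

    mentions:      list of mention ids
    candidates:    dict mapping mention id -> list of candidate entities
    score(m1, e1, m2, e2): confidence that linking m1->e1 and m2->e2 is right
    """
    assigned = {}
    while len(assigned) < len(mentions):
        best = None
        for i, m1 in enumerate(mentions):
            for m2 in mentions[i + 1:]:
                if m1 in assigned and m2 in assigned:
                    continue  # both decisions already committed
                e1s = [assigned[m1]] if m1 in assigned else candidates[m1]
                e2s = [assigned[m2]] if m2 in assigned else candidates[m2]
                for e1 in e1s:
                    for e2 in e2s:
                        s = score(m1, e1, m2, e2)
                        if best is None or s > best[0]:
                            best = (s, m1, e1, m2, e2)
        _, m1, e1, m2, e2 = best
        assigned[m1] = e1
        assigned[m2] = e2
    return assigned

# Invented example: three mentions, pairwise confidences in a lookup table.
scores = {(0, "A", 1, "C"): 0.9, (1, "C", 2, "D"): 0.8, (0, "B", 2, "E"): 0.7}
result = pair_linking([0, 1, 2],
                      {0: ["A", "B"], 1: ["C"], 2: ["D", "E"]},
                      lambda m1, e1, m2, e2: scores.get((m1, e1, m2, e2), 0.1))
```

The most confident pair (mentions 0 and 1) is linked first, and its decisions then anchor the remaining choices; this is why the method needs only a pair, not all mentions, at each step.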
Chinese named entity recognition using lexicalized HMMs
This paper presents a lexicalized HMM-based approach to Chinese named entity recognition (NER). To tackle the problem of unknown words, we unify unknown word identification and NER as a single tagging task on a sequence of known words. To do this, we first employ a known-word bigram-based model to segment a sentence into a sequence of known words, and then apply uniformly lexicalized HMMs to assign each known word a proper hybrid tag that indicates both its pattern in forming an entity and the category of the formed entity. Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs. As a result, the performance of the system can be improved without losing efficiency in training and tagging. We have tested our system on different public corpora. The results show that lexicalized HMMs can substantially improve NER performance over standard HMMs. The results also indicate that character-based tagging (viz. tagging based on pure single-character words) is comparable to, and can even outperform, the corresponding known-word-based tagging when a lexicalization technique is applied.
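The HMM tagging step can be sketched with a small log-space Viterbi decoder. All probabilities, tags and words below are invented toy data; in the paper they are estimated from training corpora, and lexicalization further conditions the model on the words themselves.

```python
import math

def viterbi(words, tags, start, trans, emit):
    """Most probable tag sequence under an HMM (log-space Viterbi)."""
    def lg(x):
        return math.log(x)
    # Initialization: start probability times emission of the first word.
    V = [{t: lg(start[t]) + lg(emit[t].get(words[0], 1e-9)) for t in tags}]
    back = []
    for w in words[1:]:
        col, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + lg(trans[p][t]))
            col[t] = V[-1][prev] + lg(trans[prev][t]) + lg(emit[t].get(w, 1e-9))
            bp[t] = prev
        V.append(col)
        back.append(bp)
    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Toy hybrid tags: position pattern (B/I) plus category (PER), and O.
tags = ["B-PER", "I-PER", "O"]
start = {"B-PER": 0.5, "I-PER": 1e-9, "O": 0.5}
trans = {"B-PER": {"B-PER": 1e-9, "I-PER": 0.9, "O": 0.1},
         "I-PER": {"B-PER": 0.1, "I-PER": 0.4, "O": 0.5},
         "O":     {"B-PER": 0.4, "I-PER": 1e-9, "O": 0.6}}
emit = {"B-PER": {"Zhang": 0.9}, "I-PER": {"San": 0.9}, "O": {"said": 0.9}}
```

Decoding ["Zhang", "San", "said"] under these toy parameters yields a two-word person name followed by an outside token, illustrating how the hybrid tags encode both entity boundary and category in one pass.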
Indonesian Sentence Boundary Detection using Deep Learning Approaches
Detecting sentence boundaries is one of the crucial pre-processing steps in natural language processing: the border between one sentence and the next can be ambiguous, and because there are multiple separators and dynamic sentence patterns, relying on a full stop at the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split each sentence from an Indonesian news document, so there is no need to define any handcrafted features or rules. As in Part-of-Speech Tagging and Named Entity Recognition, we use sequence labeling to determine sentence boundaries. Two labels are used: O for a non-boundary token and E for the last token of a sentence. To do this, we use the Bi-LSTM approach, which has been widely used in sequence labeling, with pre-trained Indonesian embeddings, as in previous studies, and show that the approach works for Indonesian text. This study achieved an F1-score of 98.49 percent, a significant improvement over previous studies.
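The O/E labeling scheme is easy to picture: E marks the last token of each sentence and O everything else, so recovering sentences from per-token labels (produced by the Bi-LSTM in the paper; hard-coded here for illustration) is a simple cut after each E.

```python
def split_by_labels(tokens, labels):
    """Recover sentences from O/E token labels: cut after each E."""
    sentences, current = [], []
    for tok, lab in zip(tokens, labels):
        current.append(tok)
        if lab == "E":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final E
        sentences.append(current)
    return sentences
```

This reduction to sequence labeling is what lets the same Bi-LSTM machinery used for POS tagging and NER handle sentence boundary detection unchanged.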
D4.1. Technologies and tools for corpus creation, normalization and annotation
The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) which consists of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.
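The TPC-style chaining of processing modules can be sketched as a pipeline of stages that each consume and enrich a document record. The stage implementations below are placeholders of our own, not PANACEA's actual tools.

```python
def sentence_split(doc):
    """Placeholder splitter: cut on full stops."""
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def pos_tag(doc):
    """Placeholder tagger: every token gets the dummy tag 'X'."""
    doc["pos"] = [[(w, "X") for w in s.split()] for s in doc["sentences"]]
    return doc

def run_pipeline(doc, stages):
    """Apply each processing stage in order, threading the document through."""
    for stage in stages:
        doc = stage(doc)
    return doc

out = run_pipeline({"text": "A b. C d."}, [sentence_split, pos_tag])
```

The point of the structure is that each module only depends on the fields earlier stages added, so components like lemmatization or parsing can be slotted in as further stages.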