643 research outputs found
Exploiting semantic annotations for open information extraction: an experience in the biomedical domain
The increasing amount of unstructured text published on the Web is demanding new tools and methods to automatically process and extract relevant information. Traditional information extraction has focused on harvesting domain-specific, pre-specified relations, which usually requires manual labor and heavy machinery; especially in the biomedical domain, the main efforts have been directed toward the recognition of well-defined entities such as genes or proteins, which constitutes the basis for extracting the relationships between the recognized entities. The intrinsic features and scale of the Web demand new approaches able to cope with the diversity of documents, where the number of relations is unbounded and not known in advance. This paper presents a scalable method for the extraction of domain-independent relations from text that exploits the knowledge in the semantic annotations. The method is not geared to any specific domain (e.g., protein–protein interactions and drug–drug interactions) and does not require any manual input or deep processing. Moreover, the method uses the extracted relations to compute groups of abstract semantic relations characterized by their signature types and synonymous relation strings. This constitutes a valuable source of knowledge when constructing formal knowledge bases, as we enable seamless integration of the extracted relations with the available knowledge resources through the process of semantic annotation. The proposed approach has successfully been applied to a large text collection in the biomedical domain and the results are very encouraging.The work was supported by the CICYT project TIN2011-24147 from the Spanish Ministry of Economy and Competitiveness (MINECO)
Ekstraksi Relasi Antar Entitas di Bahasa Indonesia Menggunakan Neural Network
Dengan perkembangan zaman yang begitu pesat, berdampak pada perkembangan data pula. Salah satu bentuk data yang paling banyak saat ini berupa data tekstual seperti artikel sederhana maupun dokumen lain yang terdapat di internet. Agar data tekstual tersebut dapat dimengerti dan dimanfaatkan dengan baik oleh manusia, maka perlu di proses dan disederhanakan agar menjadi informasi yang ringkas dan jelas. Oleh karena itu, semakin berkembang pula penelitian dalam bidang Information Extraction (IE) dan salah satu contoh penelitian di IE adalah Relation Extraction (RE). Penelitian RE sudah banyak dilakukan terutama pada Bahasa Inggris dimana resourcenya sudah termasuk banyak. Metode yang digunakan pun bermacam-macam seperti kernel, tree kernel, support vector machine, long short-term memory, convulution recurrent neural network, dan lain sebagainya. Pada penelitian kali ini adalah penelitian RE pada Bahasa Indonesia dengan menggunakan metode convulution recurrent neural network yang sudah dipergunakan untuk RE Bahasa Inggris. Dataset yang digunakan pada penelitian ini adalah dataset Bahasa Indonesia yang berasal dari file xml wikipedia. File xml wikipedia ini kemudian diproses sehingga menghasilkan dataset seperti yang digunakan pada CRNN dalam Bahasa inggris yaitu dalam format SemEval-2 Task 8. Uji coba dilakukan dengan berbagai macam perbandingan data training dan testing yaitu 80:20, 70:30, dan 60:40. Selain itu, parameter pooling untuk CRNN yang digunakan ada dua macam yaitu ‘att’ dan ‘max’. Dari uji coba yang dilakukan, hasil yang didapatkan adalah bervariasi mulai dari mendekati maupun lebih baik bila dibandingkan dengan CRNN dengan menggunakan dataset Bahasa inggris sehingga dapat disimpulkan bahwa dengan CRNN ini bisa digunakan untuk proses RE pada Bahasa Indonesia apabila dataset yang digunakan sesuai dengan penelitian sebelumnya
Active Relation Discovery: Towards General and Label-aware Open Relation Extraction
Open Relation Extraction (OpenRE) aims to discover novel relations from open
domains. Previous OpenRE methods mainly suffer from two problems: (1)
Insufficient capacity to discriminate between known and novel relations. When
extending conventional test settings to a more general setting where test data
might also come from seen classes, existing approaches have a significant
performance decline. (2) Secondary labeling must be performed before practical
application. Existing methods cannot label human-readable and meaningful types
for novel relations, which is urgently required by the downstream tasks. To
address these issues, we propose the Active Relation Discovery (ARD) framework,
which utilizes relational outlier detection for discriminating known and novel
relations and involves active learning for labeling novel relations. Extensive
experiments on three real-world datasets show that ARD significantly
outperforms previous state-of-the-art methods on both conventional and our
proposed general OpenRE settings. The source code and datasets will be
available for reproducibility.Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessibl
Exploratory Search on Mobile Devices
The goal of this thesis is to provide a general framework (MobEx) for exploratory search especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction well suited for exploratory search on mobile devices and creating the MobEx framework. These core processing elements, combined with a multitouch - able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been finally implemented in a research prototype. The initial information request, in form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform an ondemand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction task. By on-demand or ad-hoc we mean the components are capable to perform their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relation ships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potentials and limitations of the framework and the prototype
Unsupervised learning of relation detection patterns
L'extracció d'informació és l'à rea del processament de llenguatge natural l'objectiu de la qual és l'obtenir dades
estructurades a partir de la informació rellevant continguda en fragments textuals.
L'extracció d'informació requereix una quantitat considerable de coneixement lingüÃstic. La especificitat d'aquest
coneixement suposa un inconvenient de cara a la portabilitat dels sistemes, ja que un canvi d'idioma, domini o estil té un
cost en termes d'esforç humà . Durant dècades, s'han aplicat tècniques d'aprenentatge automà tic per tal de superar aquest
coll d'ampolla de portabilitat, reduint progressivament la supervisió humana involucrada. Tanmateix, a mida que augmenta
la disponibilitat de grans col·leccions de documents, esdevenen necessà ries aproximacions completament nosupervisades
per tal d'explotar el coneixement que hi ha en elles.
La proposta d'aquesta tesi és la d'incorporar tècniques de clustering a l'adquisició de patrons per a extracció d'informació,
per tal de reduir encara més els elements de supervisió involucrats en el procés En particular, el treball se centra en el
problema de la detecció de relacions. L'assoliment d'aquest objectiu final ha requerit, en primer lloc, el considerar les
diferents estratègies en què aquesta combinació es podia dur a terme; en segon lloc, el desenvolupar o adaptar algorismes
de clustering adequats a les nostres necessitats; i en tercer lloc, el disseny de procediments d'adquisició de patrons que
incorporessin la informació de clustering.
Al final d'aquesta tesi, havÃem estat capaços de desenvolupar i implementar una aproximació per a l'aprenentatge de
patrons per a detecció de relacions que, utilitzant tècniques de clustering i un mÃnim de supervisió humana, és competitiu i
fins i tot supera altres aproximacions comparables en l'estat de l'art.Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant
information contained in textual fragments.
Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge supposes a
drawback on the portability of the systems, as a change of language, domain or style demands a costly human effort.
Machine learning techniques have been applied for decades so as to overcome this portability bottleneck¿progressively
reducing the amount of involved human supervision. However, as the availability of large document collections increases,
completely unsupervised approaches become necessary in order to mine the knowledge contained in them.
The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to
further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation
detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this
combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third,
devising pattern learning procedures which incorporated clustering information.
By the end of this thesis, we had been able to develop and implement an approach for learning of relation detection patterns
which, using clustering techniques and minimal human supervision, is competitive and even outperforms other comparable
approaches in the state of the art.Postprint (published version
A Semi-Supervised Information Extraction Framework for Large Redundant Corpora
The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extract information from human- readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query which generally avoid the lim- itations and complexity of most Information Extractions systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system which extracts information from large corpora with a high degree of accuracy
- …