NATURAL LANGUAGE PROCESSING TECHNIQUES APPLIED TO THE TEXT MINING PROCESS: PRELIMINARY RESULTS OF A SYSTEMATIC MAPPING
Text mining is an activity that aims to discover knowledge in unstructured (textual) data. In addition to its own algorithms, the process draws on well-known, consolidated techniques, among them Natural Language Processing (NLP), which has improved the results obtained and justified the computational effort required. Objective: This study aimed to identify and evaluate the NLP techniques available for mining textual databases, and to discuss these techniques on the basis of the experiences published in this context. Method: We applied a systematic mapping study, whose purpose is to identify, evaluate and interpret the available and relevant studies on a given research question through a rigorous and reliable review process. Results: We analyzed 24 studies applying 11 different NLP techniques to text mining; among these techniques, ontology was the most recurrent and was presented as the most efficient over the years.
Developing a Dataset for Technology Structure Mining
Conference paper
This paper describes steps that have been taken to construct a
development dataset for the task of Technology Structure Mining. We have
defined the proposed task as the process of mapping a scientific corpus
into a labeled digraph named a Technology Structure Graph as described
in the paper. The generated graph expresses the domain semantics in
terms of interdependencies between pairs of technologies that are named
(introduced) in the target scientific corpus. The dataset comprises a
set of sentences extracted from the ACL Anthology Corpus. Each sentence
is annotated with at least two technologies in the domain of Human
Language Technology and the interdependence between them. The
annotations - technology mark-up and their interdependencies - are
expressed at two layers: lexical and termino-conceptual. Lexical
representation of technologies comprises varying lexicalizations of a
technology. However, at the termino-conceptual layer all these lexical
variations refer to the same concept. We have adopted the same approach
for representing semantic relations: at the lexical layer, a semantic
relation is a predicate, i.e. it is defined based on the sentence surface
structure, whereas at the termino-conceptual layer semantic relations
are classified into conceptual relations, either taxonomic or
non-taxonomic. Moreover, the contexts that interdependencies are
extracted from are classified into five groups based on the linguistic
criteria and syntactic structure that are identified by the human
annotators. The dataset initially comprises 482 sentences. We hope
this effort results in a benchmark that can be used for the technology
structure mining task as defined in the paper.
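
The two-layer annotation and the resulting labeled digraph described in this abstract could be represented roughly as follows. This is only a minimal sketch under stated assumptions, not the dataset's actual schema: the class names, field names, the numbering of context groups, and the example sentences are all hypothetical and invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """Conceptual relation classes named in the abstract."""
    TAXONOMIC = "taxonomic"
    NON_TAXONOMIC = "non-taxonomic"


@dataclass(frozen=True)
class TechnologyMention:
    surface: str  # lexical layer: the wording used in the sentence
    concept: str  # termino-conceptual layer: the shared concept label


@dataclass(frozen=True)
class Annotation:
    sentence: str
    source: TechnologyMention
    target: TechnologyMention
    predicate: str               # lexical-layer relation (surface predicate)
    relation_type: RelationType  # termino-conceptual-layer relation
    context_group: int           # hypothetical numbering 1..5 of the five groups


def build_graph(annotations):
    """Collapse lexical variants into one labeled digraph over concepts.

    Keys are (source concept, target concept) edges; values are the sets of
    conceptual relation types observed for that edge.
    """
    graph = defaultdict(set)
    for a in annotations:
        graph[(a.source.concept, a.target.concept)].add(a.relation_type)
    return dict(graph)


# Invented example sentences, not taken from the dataset.
annotations = [
    Annotation(
        sentence="Statistical MT systems rely on word alignment.",
        source=TechnologyMention("Statistical MT", "statistical machine translation"),
        target=TechnologyMention("word alignment", "word alignment"),
        predicate="rely on",
        relation_type=RelationType.NON_TAXONOMIC,
        context_group=1,
    ),
    Annotation(
        sentence="SMT uses word alignments.",
        source=TechnologyMention("SMT", "statistical machine translation"),
        target=TechnologyMention("word alignments", "word alignment"),
        predicate="uses",
        relation_type=RelationType.NON_TAXONOMIC,
        context_group=1,
    ),
]

graph = build_graph(annotations)
```

In this sketch the two sentences use different lexicalizations and different predicates, yet they contribute a single edge at the termino-conceptual layer, which is the distinction between the two layers that the abstract draws.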