234 research outputs found
A Proficient Method For High Eminence And Cohesive Relevant Phrase Mining
A phrase is a natural unit that carries semantics, context, and significance. Visualizing the phrases associated with each topic is an important way to explore and interpret unstructured corporate text in topic modeling. Topical phrase mining is usually twofold: phrase mining and topic modeling. For phrase mining, existing methods suffer from order sensitivity and improper segmentation, which often yield phrases of low quality. For topic modeling, standard topic models do not fully account for phrase boundaries, which can undermine coherence. In addition, existing methods often lose domain terminology because the effect of the topical distribution across domains is disregarded. In this article, we propose an effective approach for mining high-quality, coherent topical phrases. A high-quality phrase must meet requirements of frequency, phraseness, completeness, and appropriateness. To improve both phrase consistency and topical cohesion, we combine a quality-guaranteed phrase mining procedure, a novel topic model that incorporates phrase constraints, and a novel text clustering method into an iterative framework. Efficient algorithm designs for carrying out these methods effectively are also presented.
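To make the quality requirements concrete, the sketch below scores bigram candidates by raw frequency and a PMI-style phraseness measure. It is a minimal illustration under assumed thresholds; the completeness and appropriateness checks, and the paper's actual scoring functions, are not modeled here.

```python
from collections import Counter
from math import log

def phrase_candidates(tokens, min_count=5):
    """Toy scorer for bigram phrase candidates: a frequency requirement plus a
    PMI-style phraseness score (co-occurring more often than chance)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:                      # frequency requirement
            continue
        pmi = log((c / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scores[" ".join((w1, w2))] = pmi       # phraseness proxy
    return sorted(scores.items(), key=lambda kv: -kv[1])
```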
Autoentity: automated entity detection from massive text corpora
Entity detection is one of the fundamental tasks in Natural Language Processing and Information Retrieval. Most existing methods rely on human-annotated data and hand-crafted linguistic features, which makes it hard to apply the model to an emerging domain. In this paper, we propose a novel automated entity detection framework, called AutoEntity, that performs automated phrase mining to create entity mention candidates and enforces lexico-syntactic rules to select entity mentions from candidates. Our experiments on real-world datasets in different domains and multiple languages have demonstrated the effectiveness and robustness of the proposed method.
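As an illustration of lexico-syntactic filtering over phrase candidates, the sketch below keeps candidates shaped like noun phrases. It assumes spaCy with its small English model is available; the specific rules are placeholders, not AutoEntity's actual rule set.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: the small English model is installed

def looks_like_entity_mention(candidate: str) -> bool:
    """Illustrative rule: the candidate must end in a noun or proper noun and
    must not contain verbs or pronouns."""
    doc = nlp(candidate)
    pos = [t.pos_ for t in doc]
    return bool(pos) and pos[-1] in {"NOUN", "PROPN"} and not {"VERB", "PRON"} & set(pos)

candidates = ["entity detection", "propose a novel", "information retrieval"]
mentions = [c for c in candidates if looks_like_entity_mention(c)]  # drops "propose a novel"
```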
Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text
We advance the state of the art in biomolecular interaction extraction with three contributions: (i) we show that deep Abstract Meaning Representations (AMR) significantly improve the accuracy of a biomolecular interaction extraction system compared to a baseline that relies solely on surface- and syntax-based features; (ii) in contrast with previous approaches that infer relations on a sentence-by-sentence basis, we expand our framework to enable consistent predictions over sets of sentences (documents); (iii) we further modify and expand a graph kernel learning framework to enable concurrent exploitation of automatically induced AMR (semantic) and dependency-structure (syntactic) representations. Our experiments show that our approach yields interaction extraction systems that are more robust in environments where there is a significant mismatch between training and test conditions.
Comment: Appearing in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).
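The concurrent use of semantic and syntactic graphs can be pictured with a toy kernel that compares edge-label histograms of the AMR and dependency graphs and interpolates the two scores. This is a didactic stand-in, not the graph kernel learning framework described in the paper.

```python
from collections import Counter

def edge_label_kernel(g1, g2):
    """Toy kernel: dot product of edge-label histograms; each graph is a list of
    (head, dependent, label) triples."""
    h1 = Counter(label for _, _, label in g1)
    h2 = Counter(label for _, _, label in g2)
    return sum(h1[k] * h2[k] for k in h1)

def combined_kernel(amr_a, dep_a, amr_b, dep_b, alpha=0.5):
    """Interpolates the semantic (AMR) and syntactic (dependency) kernels."""
    return alpha * edge_label_kernel(amr_a, amr_b) + (1 - alpha) * edge_label_kernel(dep_a, dep_b)
```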
Corpus-based extraction and identification of Portuguese Multiword Expressions
This presentation reports the methodology followed and the results attained in an ongoing project aimed at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50-million-word corpus compiled for this project, then statistically interpreted using lexical association measures, and are undergoing a manual validation process. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, like collocations. We aim to achieve two main objectives with this resource: to build on the large set of data on different types of MW expressions to revise existing typologies of collocations and to integrate them into a larger theory of MW units; and to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.
This article presents the methodology followed and the results obtained in a project whose objective is the construction of a large database of multiword expressions of the Portuguese language. These multiword expressions were automatically extracted from a balanced corpus of 50 million words, statistically interpreted with the help of lexical association measures, and then manually verified. The lexical database covers different types of multiword expressions with different degrees of cohesion, ranging from almost total fixedness to groups of words that preferentially occur together, such as collocations. The large set of data in this resource will allow a revision of the typologies of multiword units in Portuguese and the evaluation of different lexical association measures.
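Two lexical association measures commonly used for this kind of extraction are pointwise mutual information and the Dice coefficient; the sketch below computes both from bigram counts. The counts in the example are invented for illustration, and the project's exact measure set may differ.

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information of a bigram: joint count c_xy, unigram
    counts c_x and c_y, corpus size n."""
    return log2((c_xy * n) / (c_x * c_y))

def dice(c_xy, c_x, c_y):
    """Dice coefficient: twice the joint count over the sum of unigram counts."""
    return 2 * c_xy / (c_x + c_y)

# Hypothetical counts from a 50-million-word corpus
print(pmi(c_xy=1200, c_x=40000, c_y=15000, n=50_000_000))  # ~6.6
print(dice(c_xy=1200, c_x=40000, c_y=15000))               # ~0.044
```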
Constructing and modeling text-rich information networks: a phrase mining-based approach
A lot of digital ink has been spilled on "big data" over the past few years, a phenomenon often characterized by an explosion of information. Most of this surge owes its origin to unstructured data in the wild, like text, images and video, as compared to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in the form of web documents, business reviews, news, social posts, etc. Meanwhile, textual data and structured entities often come intertwined, such as authors/posters, document categories and tags, and document-associated geo locations. Against this background, a core research challenge presents itself: how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, construct an organized heterogeneous information network, and develop powerful modeling mechanisms on such a network. We name it a text-rich information network, since it is an integrated representation of both structured and unstructured textual data.
To thoroughly develop the construction and modeling paradigm, this dissertation will focus on forming a scalable data-driven framework and propose a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities.
We will first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost the performance.
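The rectification idea can be sketched as counting a phrase only when a segmentation of the text keeps it intact; the greedy longest-match segmenter below is a simplification of SegPhrase+'s actual segmentation model and is shown only to convey the intuition.

```python
def rectified_counts(tokens, phrase_vocab, max_len=4):
    """Greedy longest-match segmentation: a candidate phrase is credited only
    when it survives segmentation as a whole unit."""
    counts = {p: 0 for p in phrase_vocab}
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i:i + n])
            if cand in phrase_vocab:
                counts[cand] += 1
                i += n
                break
        else:
            i += 1          # no known multi-word phrase starts here; move on
    return counts

docs = ["support vector machine learning".split(), "machine learning model".split()]
vocab = {("support", "vector", "machine"), ("machine", "learning")}
print([rectified_counts(d, vocab) for d in docs])
```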
Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text by providing a consistent and interpretable document representation. As a critical step in constructing the network, it uses the quality phrases generated in the previous step as candidates. From these candidates, a set of keyphrases is extracted to represent a particular document, with their strengths inferred through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes.
Lastly, a general and scalable framework, Tensor2vec, is proposed to complement traditional data mining mechanisms, which cannot readily solve these problems when the organized heterogeneous network has nodes of different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization and recommendation problems by making use of higher-order link information and projecting multi-typed nodes into a shared low-dimensional vector space such that node proximity can be easily computed and accurately predicted.
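Once nodes of all types are embedded in a shared low-dimensional space, relevance search reduces to nearest-neighbor lookup under cosine similarity. The sketch below assumes the embeddings have already been learned; the Tensor2vec training objective itself is not shown.

```python
import numpy as np

def top_k_relevant(query_vec, node_vecs, node_ids, k=5):
    """Cosine-similarity relevance search over a matrix of node embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    m = node_vecs / np.linalg.norm(node_vecs, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(-sims)[:k]
    return [(node_ids[i], float(sims[i])) for i in top]

# Example: four nodes of mixed types embedded in a 3-dimensional space
vecs = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]])
print(top_k_relevant(np.array([1.0, 0.0, 0.0]), vecs, ["paper1", "paper2", "author1", "venue1"], k=2))
```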
The integration of machine translation and translation memory
We design and evaluate several models for integrating Machine Translation (MT) output into a Translation Memory (TM) environment to facilitate the adoption of MT technology in the localization industry.
We begin with integration on the segment level via translation recommendation and translation reranking. Given an input to be translated, our translation recommendation model compares the output from the MT and the TM systems and presents the better one to the post-editor. Our translation reranking model combines k-best lists from both systems and generates a new list according to estimated post-editing effort. We perform both automatic and human evaluation on these models. When measured against the consensus of human judgement, the recommendation model obtains 0.91 precision at 0.93 recall, and the reranking model obtains 0.86 precision at 0.59 recall. The high precision of these models indicates that they can be integrated into TM environments without the risk of deteriorating the quality of the post-editing candidate, and can thereby preserve TM assets and the established cost estimation methods associated with TMs.
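The reranking step can be pictured as merging the two k-best lists and sorting by a model of post-editing effort; in the sketch below, effort_model is a placeholder for the trained estimator described in the thesis, not its actual implementation.

```python
def rerank(mt_kbest, tm_kbest, effort_model):
    """Merge k-best lists from the MT and TM systems and order hypotheses by
    estimated post-editing effort (lower is better)."""
    merged = [(hyp, effort_model(hyp)) for hyp in mt_kbest + tm_kbest]
    return [hyp for hyp, _ in sorted(merged, key=lambda pair: pair[1])]

# Usage with a dummy effort estimate standing in for the trained model
ranked = rerank(["MT hyp 1", "MT hyp 2"], ["TM match 1"], effort_model=lambda h: len(h))
```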
We then explore methods for a deeper integration of translation memory and machine translation on the sub-segment level. We predict whether phrase pairs derived from fuzzy matches can be used to constrain the translation of an input segment. Using a series of novel linguistically motivated features, our constraints lead both to more consistent translation output and to improved translation quality, reflected by a 1.2-point improvement in BLEU score and a 0.72-point reduction in TER score, both statistically significant (p < 0.01).
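Fuzzy matching against the TM is typically a normalized edit-distance-style similarity between the input segment and TM source segments; the word-level ratio below is a common approximation, not necessarily the matcher used in the thesis, and the threshold is an assumption.

```python
from difflib import SequenceMatcher

def fuzzy_score(input_seg: str, tm_source: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 is an exact match."""
    return SequenceMatcher(None, input_seg.split(), tm_source.split()).ratio()

def best_fuzzy_match(input_seg, tm_entries, threshold=0.7):
    """Return the best (score, source, target) TM entry above the threshold;
    phrase pairs aligned from such a match could then constrain MT decoding."""
    scored = [(fuzzy_score(input_seg, src), src, tgt) for src, tgt in tm_entries]
    best = max(scored, default=(0.0, None, None), key=lambda t: t[0])
    return best if best[0] >= threshold else (0.0, None, None)
```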
In sum, we present our work in three aspects: 1) translation recommendation and translation reranking models that make high-quality MT output accessible in the TM environment; 2) a sub-segment translation memory and machine translation integration model that improves both translation consistency and translation quality; and 3) a human evaluation pipeline to validate the effectiveness of our models with human judgements.
DEXTER: A workbench for automatic term extraction with specialized corpora
Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms. Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P. Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering, 24(2), 163-198. https://doi.org/10.1017/S1351324917000365
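A simple way to illustrate the corpus-comparison step is the "weirdness" ratio of a word's relative frequency in the domain corpus to its relative frequency in a general corpus: values near 1 flag general-language (stopword-like) items, large values flag likely terms. DEXTER's actual salience/relevance/cohesion metric and its use of the IATE database are more elaborate than this sketch, and the counts below are invented.

```python
def weirdness(word, domain_counts, general_counts, domain_size, general_size):
    """Ratio of relative frequencies (domain corpus vs. general corpus)."""
    dom = domain_counts.get(word, 0) / domain_size
    gen = (general_counts.get(word, 0) + 1) / general_size   # +1 avoids division by zero
    return dom / gen

domain = {"angioplasty": 150, "the": 9000}
general = {"angioplasty": 3, "the": 6_000_000}
print(weirdness("angioplasty", domain, general, 100_000, 100_000_000))  # large: likely term
print(weirdness("the", domain, general, 100_000, 100_000_000))          # ~1.5: stopword-like
```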
A Graph-Based Approach for the Summarization of Scientific Articles
Automatic text summarization is one of the prominent applications in the field of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary which contains important, diverse and coherent information, i.e., a summary should be self-contained. The approaches for text summarization are conventionally extractive: they select a subset of sentences from an input document for the summary. In this thesis, we introduce a novel graph-based extractive summarization approach.
With the progressive advancement of research in the various fields of science, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation in selecting scientific articles as our dataset. This newly formed dataset contains scientific articles from the PLOS Medicine journal, which is a high-impact journal in the field of biomedicine.
The summarization of scientific articles is a single-document summarization task. It is a complex task for several reasons: the important information in a scientific article is scattered all over it, and scientific articles contain a great deal of redundant information. In our approach, we deal with three important factors of summarization: importance, non-redundancy and coherence. To deal with these factors, we use graphs, as they alleviate data sparsity problems and are computationally less complex.
We employ a bipartite graph representation exclusively for the summarization task. We represent input documents through a bipartite graph that consists of sentence nodes and entity nodes. This bipartite graph representation contains entity transition information, which is beneficial for selecting the relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document. The ranks are treated as relevance scores of the sentences, which are used further in our approach.
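A minimal version of this representation and ranking step, using networkx with HITS hub scores standing in for the ranking algorithm (the thesis's algorithm may differ in its details):

```python
import networkx as nx

def rank_sentences(sent_entities):
    """sent_entities maps a sentence id to the entities it mentions; sentences
    connected to central entities receive higher relevance scores."""
    g = nx.Graph()
    for sid, entities in sent_entities.items():
        for ent in entities:
            g.add_edge(("S", sid), ("E", ent))   # bipartite: sentence vs. entity nodes
    hubs, _authorities = nx.hits(g, max_iter=1000)
    return {sid: hubs[("S", sid)] for sid in sent_entities}

print(rank_sentences({0: ["p53", "apoptosis"], 1: ["apoptosis"], 2: ["p53", "BRCA1"]}))
```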
Scientific articles contain a considerable amount of redundant information; for example, the Introduction and Methodology sections contain similar information regarding the motivation and the approach. In our approach, we ensure that the summary contains sentences which are non-redundant.
Although a summary should contain the important and non-redundant information of the input document, its sentences should also be connected to one another so that the summary becomes coherent, understandable and simple to read. If we do not ensure that a summary is coherent, its sentences may not be properly connected, which leads to an obscure summary. Until now, only a few summarization approaches have taken coherence into account. In our approach, we address coherence in two different ways: by using a graph measure and by using structural information. We employ outdegree as the graph measure and coherence patterns for the structural information.
We use integer programming as an optimization technique to select the best subset of sentences for a summary. The sentences are selected on the basis of relevance, diversity and coherence measures. The computation of these measures is tightly integrated and handled simultaneously.
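A compact way to show the optimization step is a small integer linear program: binary variables select sentences, the objective rewards relevance and penalizes selecting redundant pairs, and a length budget bounds the summary. The sketch below uses the PuLP solver and is a simplified stand-in for the thesis's ILP, which additionally encodes coherence.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def select_sentences(relevance, lengths, redundant_pairs, budget=100, penalty=0.5):
    """relevance[i]: relevance score of sentence i; lengths[i]: its word count;
    redundant_pairs: pairs (i, j) of sentences that convey similar content."""
    n = len(relevance)
    x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(n)]
    y = {(i, j): LpVariable(f"y{i}_{j}", cat=LpBinary) for i, j in redundant_pairs}
    prob = LpProblem("summary_selection", LpMaximize)
    prob += lpSum(relevance[i] * x[i] for i in range(n)) - penalty * lpSum(y.values())
    prob += lpSum(lengths[i] * x[i] for i in range(n)) <= budget       # length budget
    for (i, j), y_ij in y.items():
        prob += x[i] + x[j] - y_ij <= 1        # y_ij is forced to 1 if both are selected
    prob.solve()
    return [i for i in range(n) if x[i].value() == 1]

print(select_sentences([3.0, 2.5, 1.0], [60, 50, 40], [(0, 1)], budget=100))
```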
We use human judgements to evaluate the coherence of summaries. We compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset. Our approach performs considerably better than other systems on this dataset. We also apply our approach to the standard DUC 2002 dataset to compare the results with recent state-of-the-art systems. The results show that our graph-based approach outperforms other systems on DUC 2002. In conclusion, our approach is robust, i.e., it works on both scientific and news articles, and it has the further advantage of being semi-supervised.
- …