357 research outputs found
COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres
In this paper, we present a Text Summarisation tool, compendium, capable of generating the most common types of summaries. Regarding the input, single- and multi-document summaries can be produced; as the output, the summaries can be extractive or abstractive-oriented; and finally, concerning their purpose, the summaries can be generic, query-focused, or sentiment-based. The proposed architecture for compendium is divided into various stages, making a distinction between core and additional stages. The former constitute the backbone of the tool and are common to the generation of any type of summary, whereas the latter are used for enhancing the capabilities of the tool. The main contributions of compendium with respect to state-of-the-art summarisation systems are that (i) it specifically deals with the problem of redundancy by means of textual entailment; (ii) it combines statistical and cognitive-based techniques for determining relevant content; and (iii) it proposes an abstractive-oriented approach for facing the challenge of abstractive summarisation. The evaluation performed in different domains and textual genres, comprising traditional texts as well as texts extracted from the Web 2.0, shows that compendium is very competitive and appropriate to be used as a tool for generating summaries. This research has been supported by the project "Desarrollo de Técnicas Inteligentes e Interactivas de Minería de Textos" (PROMETEO/2009/119) and the project with reference ACOMP/2011/001 from the Valencian Government, as well as by the Spanish Government (grant no. TIN2009-13391-C04-01).
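The redundancy-handling idea in contribution (i), dropping a candidate sentence when the summary built so far already entails it, can be sketched with a crude lexical-overlap proxy for entailment. This is a minimal illustration only: the threshold and overlap measure are assumptions, and compendium itself uses a full textual-entailment engine rather than word overlap.

```python
import re

def entails(premise, hypothesis, threshold=0.75):
    """Crude entailment proxy: fraction of hypothesis words already
    covered by the premise. A real textual-entailment engine would use
    syntactic and semantic evidence instead of token overlap."""
    p = set(re.findall(r"\w+", premise.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    if not h:
        return True
    return len(p & h) / len(h) >= threshold

def drop_redundant(ranked_sentences):
    """Greedily keep sentences not entailed by any sentence kept so far."""
    summary = []
    for sent in ranked_sentences:
        if not any(entails(kept, sent) for kept in summary):
            summary.append(sent)
    return summary

sents = [
    "The president visited Paris on Monday.",
    "The president visited Paris.",
    "Markets rose sharply after the announcement.",
]
print(drop_redundant(sents))
```

The second sentence is filtered out because every one of its words is covered by the first, while the third survives on low overlap.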
MultiGBS: A multi-layer graph approach to biomedical summarization
Automatic text summarization methods generate a shorter version of the input
text to assist the reader in gaining a quick yet informative gist. Existing
text summarization methods generally focus on a single aspect of text when
selecting sentences, causing the potential loss of essential information. In
this study, we propose a domain-specific method that models a document as a
multi-layer graph to enable multiple features of the text to be processed at
the same time. The features we used in this paper are word similarity, semantic
similarity, and co-reference similarity, which are modelled as three different
layers. The unsupervised method selects sentences from the multi-layer graph
based on the MultiRank algorithm and the number of concepts. The proposed
MultiGBS algorithm employs UMLS and extracts the concepts and relationships
using different tools such as SemRep, MetaMap, and OGER. Extensive evaluation
by ROUGE and BERTScore shows increased F-measure values.
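The core idea of ranking sentences over a multi-layer similarity graph can be sketched by averaging the layer adjacency matrices and running a PageRank-style power iteration. This is a simplified stand-in: MultiGBS uses the MultiRank algorithm with UMLS-derived layers, and the uniform layer weights, damping factor, and toy similarity values below are illustrative assumptions.

```python
def rank_sentences(layers, damping=0.85, iters=50):
    """Rank n sentences given a list of n x n similarity layers."""
    n = len(layers[0])
    # Average the adjacency matrices of all layers into one graph.
    combined = [[sum(layer[i][j] for layer in layers) / len(layers)
                 for j in range(n)] for i in range(n)]
    # Row-normalise to obtain transition probabilities.
    for i in range(n):
        s = sum(combined[i]) or 1.0
        combined[i] = [w / s for w in combined[i]]
    # Damped power iteration, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n +
                  damping * sum(scores[j] * combined[j][i] for j in range(n))
                  for i in range(n)]
    return scores

# Three toy 3-sentence layers: word, semantic, and co-reference similarity.
word  = [[0, .8, .1], [.8, 0, .2], [.1, .2, 0]]
sem   = [[0, .6, .3], [.6, 0, .1], [.3, .1, 0]]
coref = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
scores = rank_sentences([word, sem, coref])
print(scores)
```

Sentences strongly connected across several layers (here the first two) accumulate the highest scores and would be selected for the summary.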
Knowledge representation and text mining in biomedical, healthcare, and political domains
Knowledge representation and text mining can be employed to discover new knowledge and develop services by using the massive amounts of text gathered by modern information systems. The applied methods should take into account the domain-specific nature of knowledge. This thesis explores knowledge representation and text mining in three application domains.
Biomolecular events can be described very precisely and concisely with appropriate representation schemes. Protein-protein interactions are commonly modelled in biological databases as binary relationships, whereas the complex relationships used in text mining are rich in information. The experimental results of this thesis show that complex relationships can be reduced to binary relationships and that it is possible to reconstruct complex relationships from mixtures of linguistically similar relationships. This encourages the extraction of complex relationships from the scientific literature even if binary relationships are required by the application at hand. The experimental results on cross-validation schemes for pair-input data help to understand how existing knowledge regarding dependent instances (such as those concerning protein-protein pairs) can be leveraged to improve the generalisation performance estimates of learned models.
Healthcare documents and news articles contain knowledge that is more difficult to model than biomolecular events and tend to have larger vocabularies than biomedical scientific articles. This thesis describes an ontology that models patient education documents and their content in order to improve the availability and quality of such documents. The experimental results of this thesis also show that the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures are a viable option for the automatic evaluation of textual patient record summarisation methods and that the area under the receiver operating characteristic curve can be used in large-scale sentiment analysis. The sentiment analysis of Reuters news corpora suggests that the Western mainstream media portrays China negatively in politics-related articles but not in general, which provides new evidence to consider in the debate over the image of China in the Western media.
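The ROUGE measures mentioned above are n-gram recall scores between a system summary and a reference. A minimal ROUGE-N recall computation can be sketched as follows; this is deliberately simplified (no stemming, stopword handling, or multi-reference jackknifing, unlike the official toolkit):

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: fraction of reference n-grams matched in the
    candidate, with clipped counts as in the original definition."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat"))
```

Here five of the six reference unigrams are recovered, giving a recall of 5/6; a full evaluation would average such scores over a corpus of reference summaries.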
Towards generic relation extraction
A vast amount of usable electronic data is in the form of unstructured text. The relation
extraction task aims to identify useful information in text (e.g., PersonW works
for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational
database that can be more effectively used for querying and automated reasoning.
However, adapting conventional relation extraction systems to new domains
or tasks requires significant effort from annotators and developers. Furthermore, previous
adaptation approaches based on bootstrapping start from example instances of
the target relations, thus requiring that the correct relation type schema be known in
advance. Generic relation extraction (GRE) addresses the adaptation problem by applying
generic techniques that achieve comparable accuracy when transferred, without
modification of model parameters, across domains and tasks.
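The recoding step described above, from free text into a queryable relational store, can be sketched with a single hand-written surface pattern and an in-memory SQLite table. Both the pattern and the schema are illustrative assumptions; the point of GRE is precisely to avoid hand-coding such extractors per relation type.

```python
import re
import sqlite3

# One illustrative surface pattern for a "works for" relation,
# mirroring the PersonW-works-for-OrganisationX example above.
PATTERN = re.compile(r"(?P<person>[A-Z]\w+) works for (?P<org>[A-Z]\w+)")

def extract_and_store(texts):
    """Extract (person, org) pairs and recode them as relational tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE works_for (person TEXT, org TEXT)")
    for text in texts:
        for m in PATTERN.finditer(text):
            conn.execute("INSERT INTO works_for VALUES (?, ?)",
                         (m.group("person"), m.group("org")))
    return conn

conn = extract_and_store(["Alice works for Initech. Bob works for Globex."])
print(conn.execute(
    "SELECT person, org FROM works_for ORDER BY person").fetchall())
```

Once the relation instances sit in a table, querying and automated reasoning reduce to ordinary SQL, which is exactly the payoff the paragraph above describes.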
Previous work on GRE has relied extensively on various lexical and shallow syntactic
indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency
information. I also introduce a dimensionality reduction step into the GRE
relation characterisation sub-task, which serves to capture latent semantic information
and leads to significant improvements over an unreduced model. Comparison of dimensionality
reduction techniques suggests that latent Dirichlet allocation (LDA), a
probabilistic generative approach, successfully incorporates a larger and more interdependent
feature set than a model based on singular value decomposition (SVD) and
performs as well as or better than SVD in all experimental settings. Finally, I
introduce multi-document summarisation as an extrinsic test bed for GRE and present
results which demonstrate that the relative performance of GRE models is consistent
across tasks and that the GRE-based representation leads to significant improvements
over a standard baseline from the literature.
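The dimensionality-reduction step for relation characterisation can be illustrated with a truncated SVD over a small relation-mention-by-feature count matrix. The matrix below is toy data and the feature labels are assumptions; the thesis compares this SVD-style reduction against an LDA model over a richer feature set.

```python
import numpy as np

# Toy relation-mention x context-feature count matrix. Rows are mentions
# of two relation types; columns are illustrative lexical features.
X = np.array([
    [2, 1, 0, 0],   # "works for" mention A
    [1, 2, 0, 0],   # "works for" mention B
    [0, 0, 2, 1],   # "encodes" mention A
    [0, 0, 1, 2],   # "encodes" mention B
], dtype=float)

# Truncated SVD: keep the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]   # low-dimensional mention representations

def cos(a, b):
    """Cosine similarity between two reduced representations."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mentions of the same relation end up closer in the latent space
# than mentions of different relations.
print(cos(reduced[0], reduced[1]), cos(reduced[0], reduced[2]))
```

In the reduced space the two "works for" mentions become maximally similar while remaining orthogonal to the "encodes" mentions, which is the kind of latent structure the characterisation sub-task exploits for clustering relation instances.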
Taken together, the experimental results 1) show that GRE can be improved using
dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE
for the content selection step of extractive summarisation and 3) validate the GRE
claim of modification-free adaptation for the first time with respect to both domain and
task. This thesis also introduces data sets derived from publicly available corpora for
the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
- ā¦