75 research outputs found
PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
Towards More Human-Like Text Summarization: Story Abstraction Using Discourse Structure and Semantic Information.
PhD Thesis. With the massive amount of textual data being produced every day,
the ability to effectively summarise text documents is becoming increasingly
important. Automatic text summarization entails the selection
and generalisation of the most salient points of a text in order
to produce a summary. Approaches to automatic text summarization
can fall into one of two categories: abstractive or extractive approaches.
Extractive approaches involve the selection and concatenation
of spans of text from a given document. Research in automatic
text summarization began with extractive approaches, scoring and
selecting sentences based on the frequency and proximity of words.
In contrast, abstractive approaches are based on a process of interpretation,
semantic representation, and generalisation. This is closer
to the processes that psycholinguistics tells us that humans perform
when reading, remembering and summarizing. However, in the sixty
years since its inception, the field has largely remained focused on
extractive approaches.
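The frequency-based sentence scoring described above can be illustrated with a minimal sketch. It is not the method of any system in this thesis: the stopword list, the naive sentence splitter, and the function names are assumptions made for illustration, and only word frequency (not word proximity) is modelled.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Document-wide word frequencies drive the sentence scores.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

The output is a concatenation of spans copied from the input, which is exactly the property that ties extractive summaries to the surface text of the document.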
This thesis aims to answer the following questions. Does knowledge
about the discourse structure of a text aid the recognition of
summary-worthy content? If so, which specific aspects of discourse
structure provide the greatest benefit? Can this structural information
be used to produce abstractive summaries, and are these more
informative than extractive summaries? To thoroughly examine these
questions, they are each considered in isolation, and as a whole, on
the basis of both manual and automatic annotations of texts. Manual
annotations facilitate an investigation into the upper bounds of
what can be achieved by the approach described in this thesis. Results
based on automatic annotations show how this same approach
is impacted by the current performance of imperfect preprocessing
steps, and indicate its feasibility.
Extractive approaches to summarization are intrinsically limited
by the surface text of the input document, in terms of both content
selection and summary generation. Beginning with a motivation
for moving away from these commonly used methods of producing
summaries, I set out my methodology for a more human-like
approach to automatic summarization which examines the benefits of
using discourse-structural information. The potential benefit of this
is twofold: moving away from a reliance on the wording of a text
in order to detect important content, and generating concise summaries
that are independent of the input text. The importance of
discourse structure to signal key textual material has previously been
recognised; however, it has seen little applied use in the field of automatic
summarization. A consideration of evaluation metrics also features
significantly in the proposed methodology. These play a role in
both preprocessing steps and in the evaluation of the final summary
product. I provide evidence which indicates a disparity between the
performance of coreference resolution systems as indicated by their
standard evaluation metrics, and their performance in extrinsic tasks.
Additionally, I point out a range of problems for the most commonly
used metric, ROUGE, and suggest that at present summary evaluation
should not be automated.
To illustrate the general solutions proposed to the questions raised
in this thesis, I use Russian Folk Tales as an example domain. This
genre of text has been studied in depth and, most importantly, it has a
rich narrative structure that has been recorded in detail. The rules of
this formalism are suitable for the narrative structure reasoning system
presented as part of this thesis. The specific discourse-structural elements
considered cover the narrative structure of a text, coreference
information, and the story-roles fulfilled by different characters.
The proposed narrative structure reasoning system produces high-level
interpretations of a text according to the rules of a given formalism.
For the example domain of Russian Folktales, a system is implemented
which constructs such interpretations of a tale according to
an existing set of rules and restrictions. I discuss how this process of
detecting narrative structure can be transferred to other genres, and
a key factor in the success of this process: how constrained the
rules of the formalism are. The system enumerates all possible interpretations
according to a set of constraints, meaning a less restricted rule
set leads to a greater number of interpretations.
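The relationship between rule strictness and the number of interpretations can be shown with a toy sketch. The label inventory, the candidate labels per event, and both constraints below are hypothetical and are not taken from the thesis or its formalism; the sketch only demonstrates exhaustive enumeration under a set of constraints.

```python
from itertools import product

# Hypothetical ordered inventory of narrative labels (an assumption).
ORDER = ["departure", "struggle", "victory", "return"]
RANK = {label: i for i, label in enumerate(ORDER)}


def enumerate_interpretations(candidates, constraints):
    """Yield every labelling (one label per event) that satisfies all constraints."""
    for labelling in product(*candidates):
        if all(check(labelling) for check in constraints):
            yield labelling


def in_order(labelling):
    """Constraint: labels must follow the inventory order (repeats allowed)."""
    ranks = [RANK[label] for label in labelling]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def no_repeats(labelling):
    """Constraint: each label may be used at most once."""
    return len(set(labelling)) == len(labelling)


# Candidate labels for three consecutive events in a hypothetical tale.
candidates = [
    ["departure", "struggle"],
    ["struggle", "victory"],
    ["victory", "return", "struggle"],
]

unconstrained = list(enumerate_interpretations(candidates, []))
ordered = list(enumerate_interpretations(candidates, [in_order]))
strict = list(enumerate_interpretations(candidates, [in_order, no_repeats]))
print(len(unconstrained), len(ordered), len(strict))  # 12 10 4
```

Each added constraint prunes the candidate space, so a tightly specified formalism yields few interpretations while a loose one yields many.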
For the example domain, sentence-level discourse-structural annotations
are then used to predict summary-worthy content. The results
of this study are analysed in three parts. First, I examine the relative
utility of individual discourse features and provide a qualitative
discussion of these results. Second, the predictive abilities of these
features are compared when they are manually annotated to when
they are annotated with varying degrees of automation. Third, these
results are compared to the predictive capabilities of classic extractive
algorithms. I show that discourse features can be used to more
accurately predict summary-worthy content than classic extractive algorithms.
This holds true for automatically obtained annotations, but
with a much clearer difference when using manual annotations.
The classifiers learned in the prediction of summary-worthy sentences
are subsequently used to inform the production of both extractive
and abstractive summaries to a given length. A human-based
evaluation is used to compare these summaries, as well as the outputs
of a classic extractive summarizer. I analyse the impact of knowledge
about discourse structure, obtained both manually and automatically,
on summary production. This allows for some insight into the knock-on
effects on summary production that can occur from inaccurate discourse
information (narrative structure and coreference information).
My analyses show that even given inaccurate discourse information,
the resulting abstractive summaries are considered more informative
than their extractive counterparts. With human-level knowledge
about discourse structure, these results are even clearer.
In conclusion, this research provides a framework which can be
used to detect the narrative structure of a text, and shows its potential
to provide a more human-like approach to automatic summarization.
I show the limit of what is achievable with this approach both
when manual annotations are obtainable, and when only automatic
annotations are feasible. Nevertheless, this thesis supports the suggestion
that the future of summarization lies with abstractive and not
extractive techniques.
Computational Argumentation Approaches to Improve Sensemaking and Evidence-based Reasoning in Online Deliberation Systems
Deliberation is the process through which communities identify potential solutions for a problem and select the solution that most effectively meets their diverse requirements through dialogic communication. Online deliberation is implemented nowadays by means of social media and online discussion platforms; however, these media present significant challenges and issues that can be traced to inadequate support for Sensemaking processes and poor endorsement of the quality characteristics of deliberation.
This thesis investigates integrating computational argumentation methods in online deliberation platforms as an effective way to improve participants' perception of the quality of the deliberation process and their way of making sense of the overall process, and to produce healthier social dynamics.
For that, two computational artefacts are proposed: (i) a Synoptical summariser of long discussions and (ii) a Scientific Argument Recommender System (SciArgRecSys).
The two artefacts are designed and developed with state-of-the-art methods (with the use of Large Language Models - LLMs) and evaluated intrinsically and extrinsically when deployed in a real live platform (BCause).
Through extensive evaluation, the positive effect of both artefacts on human Sensemaking and on essential quality characteristics of deliberation, such as reciprocal Engagement, Mutual Understanding, and Social dynamics, is illustrated. In addition, it has been demonstrated that these interventions effectively reduce polarisation and the formation of sub-communities while significantly enhancing the quality of the discussion by making it more coherent and diverse.
Certified Public Accountant Education and Ethical Decision-Making Preparedness: A Phenomenological Study Exploring the Connection
The purpose of this phenomenological study was not to evaluate efficacy of ethical training for current accounting programs, but to understand the perceptions of Certified Public Accountants (CPAs) regarding the influence of accounting ethical education requirements on their ethical decision-making preparedness. The study used pre-interview surveys, face-to-face interviews, and timeline drawings to acquire data in four principal areas: CPA ethical decision-making preparedness, accounting program ethics educational requirements, accounting program ethics learning activities, and CPA perceptions of necessary changes to accounting program ethics requirements. Results from Virginia CPA perceptions indicate the definition of ethical decision-making as doing the right thing at all times, ethical decision-making preparedness as the education and experience to determine the right course of action, case studies as most effective ethics learning activities, lack of required ethics education in accounting programs, as well as increased stand-alone ethics courses and ethics incorporated in all accounting courses as necessary changes to current accounting programs. Other emergent themes from Virginia CPA perceptions include CPA reluctance to participate in context-based research, CPA Firm efforts to encourage CPAs to violate guidelines, and CPAs walking away from the industry to avoid pressure to violate guidelines.
Automatic text summarisation using linguistic knowledge-based semantics
Text summarisation is the reduction of a text document to a short substitute summary. Since the field's inception, almost all summarisation research to date has involved identifying and extracting the most important document or cluster segments, an approach known as extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface-level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both semantic analysis of the text and equipping computers with external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variation in CatVar, to improve the summary quality. Such improvements are accomplished through sentence-level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness while using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), combined with background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005, 2006 (DUC 2002, DUC 2005, DUC 2006) Corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers' performance. Results of our systems showed their effectiveness as compared to related state-of-the-art summarisation methods and baselines.
Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance.
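For readers unfamiliar with ROUGE, the recall-oriented n-gram overlap at its core can be sketched as follows. This is a simplified illustration, not the official ROUGE toolkit and not the evaluation setup of the thesis: it assumes a single reference summary, whitespace tokenisation, and no stemming or stopword removal, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also occur in the candidate.

    Matches are clipped: an n-gram counts at most as often as it
    appears in the candidate summary.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams matched
```

Because the score rewards shared wording with the reference, it naturally favours summaries that reuse the source vocabulary.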
Great expectations: unsupervised inference of suspense, surprise and salience in storytelling
Stories interest us not because they are a sequence of mundane and predictable events but because they have drama and tension. Crucial to creating dramatic and exciting stories are surprise and suspense. Likewise, certain events are key to the plot and more important than others. Importance is referred to as salience. Inferring suspense, surprise and salience is highly challenging for computational systems. It is difficult because all these elements require a strong comprehension of the characters and their motivations, places, changes over time, and the cause/effect of complex interactions.
Recent advances in machine learning (often called deep learning) have substantially improved performance on many language-related tasks, including story comprehension and story writing. Most of these systems rely on supervision; that is, huge numbers of people need to tag large quantities of data to tell the systems what to learn. An example would be tagging which events are suspenseful. This is highly inflexible and costly.
Instead, the thesis trains a series of deep learning models solely by reading stories, a self-supervised (or unsupervised) approach. Narrative theory methods (rules and procedures) are applied to the knowledge built into the deep learning models to directly infer suspense, surprise, and salience in stories. Extensions add memory and external knowledge from story plots and from Wikipedia to infer salience on novels such as Great Expectations and plays such as Macbeth. Other work adapts the models as a planning system for generating new stories.
The thesis finds that applying narrative theory to deep learning models can produce judgements that align with those of the typical reader. In follow-up work, the insights could help improve computer models for tasks such as automatic story writing and assistance with writing, summarising or editing stories. Moreover, the approach of applying narrative theory to the inherent qualities built into a system that teaches itself (self-supervised) by reading books, watching videos, or listening to audio is much cheaper and more adaptable to other domains and tasks. Progress is swift in improving self-supervised systems. As such, the thesis's relevance is that applying domain expertise with these systems may be a more productive approach in many areas of interest for applying machine learning.
Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021
The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the edition of 2020, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 represented the first moment for the Italian research community of Computational Linguistics to meet in person after more than one year of full/partial lockdown.