Adapting Automatic Summarization to New Sources of Information
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages – documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disfluent translations of non-English documents, and it generalizes across languages.
Detecting Sarcasm in Multimodal Social Platforms
Sarcasm is a peculiar form of sentiment expression, where the surface
sentiment differs from the implied sentiment. The detection of sarcasm in
social media platforms has been applied in the past mainly to textual
utterances where lexical indicators (such as interjections and intensifiers),
linguistic markers, and contextual information (such as user profiles, or past
conversations) were used to detect the sarcastic tone. However, modern social
media platforms allow users to create multimodal messages where audiovisual
content is integrated with the text, so analyzing any one mode in isolation
gives only a partial picture. In our work, we first study the relationship between the textual and
visual aspects in multimodal posts from three major social media platforms,
i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to
quantify the extent to which images are perceived as necessary by human
annotators. Moreover, we propose two different computational frameworks to
detect sarcasm that integrate the textual and visual modalities. The first
approach exploits visual semantics trained on an external dataset, and
concatenates the semantic features with state-of-the-art textual features. The
second method adapts a visual neural network initialized with parameters
trained on ImageNet to multimodal sarcastic posts. Results show the positive
effect of combining modalities for the detection of sarcasm across platforms
and methods.
Comment: 10 pages, 3 figures, final version published in the Proceedings of
ACM Multimedia 201
The Phenomenon of Syncretism in Ukrainian Linguistics
In modern linguistics, the study of complex systemic relations and language dynamism is unlikely to be complete without considering syncretism. Traditionally, transitivity phenomena are treated as a combination of different types of entities formed as a result of transformation processes, or as a reflection of intermediate, syncretic facts that characterize the language system in its synchronic aspect.
Vagueness Markers in Italian
Moving from a broad socio-pragmatic perspective, this study analyses how speakers of different ages use a class of items and constructions that codify intentional vagueness in Italian.
Items such as un po’ ‘a bit’, tipo ‘kind’, diciamo ‘let us say’, così ‘so’, e cose del genere ‘and things like that’, or cosa ‘thing’ constitute a class of linguistically heterogeneous means that often function in conversation as vagueness markers, i.e. elements by which speakers signal that their knowledge or communication is somehow only tentative, approximate and vague. Their use does not depend on language systemic factors, but is the result of a more or less conscious choice by speakers to enhance conversation for different reasons, which include facilitating the flow of conversation, signifying a vague categorization, and, eventually, being polite.
Operating at the pragmatic level, vagueness markers represent elements that are readily available to speakers’ choices and contribute to characterizing individual and generational discourse styles. Through a corpus-based analysis of listeners’ phone-ins to a Milan radio station, this study investigates how vagueness markers are used by speakers of different ages in 1976 and in 2010, and how Italian discourse styles have evolved in the last forty years.
"A little more than kin" - Quotations as a linguistic phenomenon : a study based on quotations from Shakespeare's Hamlet
Quotations "oscillate between the occasional and the conventional", as Burger/Buhofer/Sialm (1982) once succinctly put it. Developed from a PhD thesis, this book explores precisely this "oscillating" character of quotations: it discusses the nature of quotations and the relationship between common quotations and phraseology from a theoretical and an empirical perspective. Shakespeare's Hamlet was chosen as a canonical text whose frequently quoted traces can be followed across centuries. Scholarly work from various disciplines leads to an understanding of quotations as moving in a space created by the two dimensions of reference and repetition: quotations are definable by a horizontal communicative axis (reference) and a vertical, intertextual axis of manifest lineages of use (repetition). Empirically, the data led to a categorisation of quotations as verbal, thematic and onomastic, based on the question "what has been repeated: words, themes or names?" Case studies further corroborate the proposition that verbal quotations may become (almost) ordinary multi-word units if the following conditions are met: a) they lose their referential dimension, b) they develop formal and/or semantic usage patterns and/or c) they are no longer limited to their original, literary discourse.
Procedures and strategies in English-Kurdish translation of written media discourse
The present research explores translation procedures and strategies employed in current English-Kurdish translation of written media discourse. It is located within Toury’s (1995/2012) framework of Descriptive Translation Studies (DTS). The research sets out to contribute to Translation Studies, specifically the study of journalism translation. Despite the fact that translation has been an inseparable part of media and journalism activities for decades, if not centuries, the systematic study of media translation is as recent as the turn of the new millennium. This study focuses on English-Kurdish translation of written media discourse, which has remained largely under-researched.
Specifically, the study sets out to identify the patterns of translation procedures and the overall translation strategies prevalent in Kurdish translations of English journalistic texts. To do so, a composite model is formulated based on an integration of three influential taxonomies of translation procedures proposed by Vinay and Darbelnet (1958/1995), Newmark (1988) and Dickins, Hervey and Higgins (2002). The model is applied to a set of 45 journalistic texts translated from English into Kurdish, which together make up a corpus of approximately 75,000 words. A comparative analysis of ST-TT coupled pairs is carried out to identify patterns of translation procedures at the linguistic as well as cultural level. To look at the findings from a different perspective, a research questionnaire is also administered to English-Kurdish translators working in the Kurdish media. Based on the patterns of translation procedures, the overall translation strategies are then determined.
Analysis in Chapters 6, 7 and 8 leads to the conclusion that literal translation, borrowing and omission are the most frequent translation procedures at the linguistic level, and cultural borrowing, cultural redomestication and calque are the most frequent at the cultural level, keeping in mind that the notion of cultural redomestication constitutes the present study’s major contribution to Translation Studies. As for the overall strategies, semantic translation is the predominant orientation of the linguistic aspect of the translation, while foreignization is the predominant orientation of the cultural aspect of the translation.
Automatic Image Captioning with Style
This thesis connects two core topics in machine learning, vision
and language. The problem of choice is image caption generation:
automatically constructing natural language descriptions of image
content. Previous research into image caption generation has
focused on generating purely descriptive captions; I focus on
generating visually relevant captions with a distinct linguistic
style. Captions with style have the potential to ease
communication and add a new layer of personalisation.
First, I consider naming variations in image captions, and
propose a method for predicting context-dependent names that
takes into account visual and linguistic information. This method
makes use of a large-scale image caption dataset, which I also
use to explore and report naming conventions for hundreds of
animal classes. Next I propose the SentiCap
model, which relies on recent advances in artificial neural
networks to generate visually relevant image captions with
positive or negative sentiment. To balance descriptiveness and
sentiment, the SentiCap model dynamically switches between two
recurrent neural networks, one tuned for descriptive words and
one for sentiment words. As the first published model for
generating captions with sentiment, SentiCap has influenced a
number of subsequent works. I then investigate the sub-task of
modelling styled sentences without images. The specific task
chosen is sentence simplification: rewriting news article
sentences to make them easier to understand.
For this task I design a neural sequence-to-sequence model that
can work with limited training data, using novel adaptations for
word copying and sharing word embeddings. Finally, I present
SemStyle, a system for generating visually relevant image
captions in the style of an arbitrary text corpus. A shared term
space allows a neural network for vision and content planning to
communicate with a network for styled language generation.
SemStyle achieves competitive results in human and automatic
evaluations of descriptiveness and style.
As a whole, this thesis presents two complete systems for styled
caption generation that are the first of their kind and
demonstrate, for the first time, that automatic style transfer
for image captions is achievable. Contributions also include
novel ideas for object naming and sentence simplification. This
thesis opens up inquiries into highly personalised image
captions; large-scale visually grounded concept naming; and, more
generally, styled text generation with content control.