84 research outputs found
A classification scheme for annotating speech acts in a business email corpus
This paper reports on the process of manual annotation of speech acts in a corpus
of business emails, in the context of the PROBE project (PRagmatics of
Business English). The project aims to bring together corpus, computational,
and theoretical linguistics by drawing on the insights made available by the
annotated corpus. The corpus data sheds light on the linguistic and discourse
structures of speech act use in business email communication. This enhanced
linguistic description can be compared to theoretical linguistic representations
of speech act categories to assess how well traditional distinctions relate to
real-world, naturally occurring data. From a computational perspective, the
annotated data is required for the development of an automated speech act tagging
tool. Central to this research is the creation of a high-quality, manually
annotated speech act corpus, using an easily interpretable classification
scheme. We discuss the scheme chosen for the project and the training guidelines
given to the annotators, and describe the main challenges identified by the
annotators.
Identification of Informativeness in Text using Natural Language Stylometry
In this age of information overload, one experiences a rapidly growing over-abundance of written text. To assist with handling this bounty, this plethora of texts is now widely used to develop and optimize statistical natural language processing (NLP) systems. Surprisingly, the use of more fragments of text to train these statistical NLP systems may not necessarily lead to improved performance. We hypothesize that those fragments that help the most with training are those that contain the desired information. Therefore, determining informativeness in text has become a central issue in our view of NLP. Recent developments in this field have spawned a number of solutions to identify informativeness in text. Nevertheless, a shortfall of most of these solutions is their dependency on the genre and domain of the text. In addition, most of them are not efficient across the different NLP problem areas. Therefore, we attempt to provide a more general solution to this NLP problem.
This thesis takes a different approach to this problem by considering the underlying theme of a linguistic theory known as the Code Quantity Principle. This theory suggests that humans codify information in text so that readers can retrieve this information more efficiently. During the codification process, humans usually change elements of their writing ranging from characters to sentences. Examples of such elements are the use of simple words, complex words, function words, content words, syllables, and so on. This theory suggests that these elements have reasonable discriminating strength and can play a key role in distinguishing informativeness in natural language text. In another vein, stylometry is a modern method for analyzing literary style and deals largely with the aforementioned elements of writing. With this as background, we model text using a set of stylometric attributes to characterize the variations in writing style present in it. We explore their effectiveness in determining informativeness in text. To the best of our knowledge, this is the first use of stylometric attributes to determine informativeness in statistical NLP. In doing so, we use texts of different genres, viz., scientific papers, technical reports, emails and newspaper articles, that are selected from assorted domains like agriculture, physics, and biomedical science. The variety of NLP systems that have benefitted from incorporating these stylometric attributes when dealing with this varied set of texts suggests that the attributes can be regarded as an effective solution for identifying informativeness in text. In addition to the variety of text genres and domains, the potential of stylometric attributes is also explored in some NLP application areas, including biomedical relation mining, automatic keyphrase indexing, spam classification, and text summarization, where performance improvement is both important and challenging.
The success of the attributes in all these areas further highlights their usefulness.
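To make the idea of stylometric attributes concrete, the following is a minimal sketch of how a few such attributes (word length, syllable counts, function-word usage, sentence length) might be computed for a text fragment. The feature set, the tiny function-word list, and the vowel-group syllable heuristic are all illustrative assumptions; the abstract does not specify which attributes the thesis actually uses or how they are computed.

```python
import re

# Small illustrative function-word list (an assumption, not the thesis's list).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or",
    "is", "are", "that", "this", "it",
}


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def stylometric_features(text):
    """Return a few simple stylometric attributes for a text fragment."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        return {
            "avg_word_length": 0.0,
            "avg_syllables_per_word": 0.0,
            "function_word_ratio": 0.0,
            "avg_sentence_length": 0.0,
        }
    n = len(words)
    n_function = sum(1 for w in words if w.lower() in FUNCTION_WORDS)
    return {
        "avg_word_length": sum(len(w) for w in words) / n,
        "avg_syllables_per_word": sum(count_syllables(w) for w in words) / n,
        "function_word_ratio": n_function / n,
        "avg_sentence_length": n / len(sentences),
    }


features = stylometric_features("The cat sat on the mat. It is a cat.")
```

A feature dictionary of this kind could then be fed to any standard classifier as a representation of the fragment; how the thesis combines its attributes with downstream NLP systems is not described in the abstract.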
Text Classification: Exploiting the Social Network
Within the context of social networks, existing methods for document classification tasks typically only capture textual semantics while ignoring the text’s metadata, e.g., the users who exchange emails and the communication networks they form. However, some work has shown that incorporating the social network information in addition to information from language is useful for various NLP applications, including sentiment analysis, inferring user attributes, and predicting interpersonal relations.
In this thesis, we present empirical studies of incorporating social network information from the underlying communication graphs for various text classification tasks. We show different graph representations for different problems. Also, we introduce social network features extracted from these graphs. We use and extend graph embedding models for text classification.
Our contributions are as follows. First, we have annotated large datasets of emails with fine-grained business and personal labels. Second, we propose graph representations for the social networks induced from documents and users and apply them on different text classification tasks. Third, we propose social network features extracted from these structures for documents and users. Fourth, we exploit different methods for modeling the social network of communication for four tasks: email classification into business and personal, overt display of power detection in emails, hierarchical power detection in emails, and Reddit post classification.
Our main findings are: incorporating the social network information using our proposed methods improves the classification performance for all four tasks, and we beat the state-of-the-art graph embedding based model on the three email tasks; additionally, for the fourth task (Reddit post classification), we argue that simple methods with the proper representation for the task can outperform a state-of-the-art generic model.
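As a rough illustration of what social network features extracted from a communication graph can look like, the sketch below builds a directed sender-to-recipient graph from email metadata and derives simple per-user counts. The input format and the four feature names are hypothetical; the abstract does not say which graph representations or features the thesis uses.

```python
from collections import defaultdict


def communication_features(emails):
    """Compute per-user features from (sender, recipients) email metadata.

    `emails` is an iterable of (sender, list_of_recipients) pairs; this
    input format is an assumption made for the example.
    """
    sent = defaultdict(int)            # messages sent, counted per recipient
    received = defaultdict(int)        # messages received
    out_neighbors = defaultdict(set)   # distinct users written to
    in_neighbors = defaultdict(set)    # distinct users heard from

    for sender, recipients in emails:
        for recipient in recipients:
            sent[sender] += 1
            received[recipient] += 1
            out_neighbors[sender].add(recipient)
            in_neighbors[recipient].add(sender)

    users = set(sent) | set(received)
    return {
        user: {
            "sent": sent.get(user, 0),
            "received": received.get(user, 0),
            "out_degree": len(out_neighbors.get(user, ())),
            "in_degree": len(in_neighbors.get(user, ())),
        }
        for user in users
    }


features = communication_features([
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
])
```

Features like these could be concatenated with text features for a document's sender and recipients before classification; the graph embedding models mentioned in the abstract are a different, learned way of encoding the same structure.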
Keywords at Work: Investigating Keyword Extraction in Social Media Applications
This dissertation examines a long-standing problem in Natural Language Processing (NLP), keyword extraction, from a new angle. We investigate how keyword extraction can be formulated on social media data, such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and use the resulting system with success to uncover patterns in a new dataset of student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. When used on student discussions, the system uncovers new and exciting patterns.
While each of the above problems is conceptually distinct, they share two key common elements: keywords and social data. Social data can be messy, hard to interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd
A heuristic algorithm for trust-oriented service provider selection in complex social networks
In a service-oriented online social network consisting of service providers and consumers, a service consumer can search for trustworthy service providers via the social network. This requires the evaluation of the trustworthiness of a service provider along a certain social trust path from the service consumer to the service provider. However, there are usually many social trust paths between participants in social networks. Thus, a challenging problem is which social trust path is the optimal one that can yield the most trustworthy evaluation result. In this paper, we first present a novel complex social network structure and a new concept, Quality of Trust (QoT). We then model the optimal social trust path selection with multiple end-to-end QoT constraints as a Multi-Constrained Optimal Path (MCOP) selection problem, which is NP-Complete. To solve this challenging problem, we propose an efficient heuristic algorithm, H_OSTP. The results of our experiments conducted on a large real dataset of online social networks illustrate that our proposed algorithm significantly outperforms existing approaches.
8 page(s)
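The problem formulation above (pick the best path subject to several end-to-end constraints) can be sketched with an exhaustive search on a small graph. This is not H_OSTP: it is a brute-force baseline that only illustrates what a multi-constrained optimal path query asks for, and it scales exponentially. The attribute tuples on edges, the multiplicative aggregation along a path, the lower-bound constraints, and the weighted-sum utility are all assumptions made for the example; the abstract does not define the QoT attributes or the utility function.

```python
def best_trust_path(graph, source, target, constraints, weights):
    """Exhaustively search simple paths from source to target.

    graph:       dict node -> dict neighbor -> tuple of attribute values in [0, 1]
    constraints: tuple of lower bounds on each aggregated attribute
    weights:     tuple of weights used to score a feasible path
    Returns (utility, path) for the best feasible path, or None.
    All of these conventions are hypothetical.
    """
    best = None

    def dfs(node, path, aggregated):
        nonlocal best
        # Values lie in [0, 1], so products only shrink along a path:
        # a prefix that violates a lower bound can never become feasible.
        if any(a < c for a, c in zip(aggregated, constraints)):
            return
        if node == target:
            utility = sum(w * a for w, a in zip(weights, aggregated))
            if best is None or utility > best[0]:
                best = (utility, list(path))
            return
        for neighbor, attrs in graph.get(node, {}).items():
            if neighbor in path:  # keep paths simple (no repeated nodes)
                continue
            path.append(neighbor)
            dfs(neighbor, path, tuple(a * x for a, x in zip(aggregated, attrs)))
            path.pop()

    dfs(source, [source], tuple(1.0 for _ in constraints))
    return best


# Toy network: two candidate paths from a consumer to a provider.
graph = {
    "consumer": {"a": (0.9, 0.8), "b": (0.95, 0.4)},
    "a": {"provider": (0.8, 0.9)},
    "b": {"provider": (0.9, 0.9)},
}
result = best_trust_path(graph, "consumer", "provider", (0.5, 0.5), (0.5, 0.5))
```

In this toy network the route through `b` has the higher first attribute but fails the second constraint, so the search returns the route through `a`. A heuristic such as the one the paper proposes would aim to find a comparable answer without enumerating every path.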