
    Towards Population of Knowledge Bases from Conversational Sources

    With an increasing amount of data created daily, it is challenging for users to organize and discover information from massive collections of digital content (e.g., text and speech). The population of knowledge bases requires linking information from unstructured sources (e.g., news articles and web pages) to structured external knowledge bases (e.g., Wikipedia), which has the potential to advance information archiving and access, and to support knowledge discovery and reasoning. Because of the complexity of this task, knowledge base population is composed of multiple sub-tasks, including the entity linking task, defined as linking the mentions of entities (e.g., persons, organizations, and locations) found in documents to their referents in external knowledge bases, and the event task, defined as extracting related information for events that should be entered in the knowledge base. Most prior work on tasks related to knowledge base population has focused on dissemination-oriented sources written in the third person (e.g., news articles) that benefit from two characteristics: the content is written in formal language and is to some degree self-contextualized, and the entities mentioned (e.g., persons) are likely to be widely known to the public, so that rich information can be found in existing general knowledge bases (e.g., Wikipedia and DBpedia). The work proposed in this thesis focuses on tasks related to knowledge base population for conversational sources written in the first person (e.g., emails and phone recordings), which offer new challenges. One challenge is that most conversations (e.g., 68% of the person names and 53% of the organization names in Enron emails) refer to entities that are known to the conversational participants but not widely known. Thus, existing entity linking techniques relying on general knowledge bases are not appropriate.
    Another challenge is that some of the shared context between participants in first-person conversations may be implicit and thus challenging to model, increasing the difficulty, even for human annotators, of identifying the true referents. This thesis focuses on several tasks relating to the population of knowledge bases from conversational content: the population of collection-specific knowledge bases for organization entities and meetings from email collections; the entity linking task, which resolves mentions of three types of entities (person, organization, and location) found in both conversational text (emails) and speech (phone recordings) sources to multiple knowledge bases, including a general knowledge base built from Wikipedia and collection-specific knowledge bases; the meeting linking task, which links meeting-related email messages to the referenced meeting entries in the collection-specific meeting knowledge base; and speaker identification techniques to improve entity linking for phone recordings without known speakers. Following the model-based evaluation paradigm, three collections (namely, Enron emails, Avocado emails, and Enron phone recordings) are used as representations of conversational sources, new test collections are created for each task, and experiments are conducted for each task to evaluate the efficacy of the proposed methods and to provide a comparison to existing state-of-the-art systems. This work has implications for the research fields of e-discovery, scientific collaboration, speaker identification, speech retrieval, and privacy protection.
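    The collection-specific entity linking task described above can be illustrated with a minimal sketch. This is not the thesis's actual method; the knowledge base entries, entity IDs, similarity measure, and threshold below are all illustrative assumptions:

```python
from difflib import SequenceMatcher

# Hypothetical collection-specific knowledge base: entity ID -> known surface forms.
KB = {
    "org:enron_energy_services": ["Enron Energy Services", "EES"],
    "per:jeff_skilling": ["Jeff Skilling", "Jeffrey K. Skilling"],
}

def link_mention(mention, kb, threshold=0.8):
    """Return the best-matching KB entity for a mention, or None (a NIL link)."""
    best_id, best_score = None, 0.0
    for entity_id, names in kb.items():
        for name in names:
            # Normalized string similarity between the mention and a surface form.
            score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
            if score > best_score:
                best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None

print(link_mention("Jeffrey Skilling", KB))  # per:jeff_skilling
print(link_mention("Azurix", KB))            # None (no adequate candidate)
```

    Returning None for low-scoring mentions mirrors the NIL case that matters in conversational sources, where many referents (as the abstract notes, most person and organization names in Enron emails) are absent from general knowledge bases.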

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction, and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    Influence Level Prediction on Social Media through Multi-Task and Sociolinguistic User Characteristics Modeling

    Prediction of a user’s influence level on social networks has attracted a lot of attention as human interactions move online. Influential users have the ability to influence others’ behavior to achieve their own agenda. As a result, predicting users’ level of influence online can help to understand social networks, forecast trends, prevent misinformation, etc. The research on user influence in social networks has attracted much attention across multiple disciplines, from social sciences to mathematics, yet it is still not well understood. One of the difficulties is that the definition of influence is specific to a particular problem or domain, and it does not generalize well. Another challenge arises from the fact that all user interactions occur through text. Textual data limits access to non-verbal communication such as voice. These facts make the problem challenging. In this work, we define user influence level as a function of community endorsement, create a strong baseline, and develop new methods that significantly outperform our baseline by leveraging demographic and personality data. This dissertation is divided into three parts. In part one, we introduce the problem of influence level prediction, review influential research across different disciplines, and introduce our hypothesis that leverages user-centric information to improve user influence level prediction on social media. In part two, we answer the question of whether language provides sufficient information to predict user-related information. We develop new methods that achieve good results on three tasks: relationship prediction, demographic prediction, and hedge sentence detection. In part three, we introduce our dataset and a new ranking algorithm, RankDCG, to assess the performance of ranking problems, and develop new user-centric models for user influence level prediction. These models show significant improvements across eight different domains, ranging from politics and news to fitness.
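    The abstract does not define RankDCG, so as a hedged illustration here is the standard normalised discounted cumulative gain (nDCG), the family of rank-sensitive measures that RankDCG builds on; the relevance scores are invented:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: items lower in the ranking are discounted more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(predicted_relevances):
    """Normalise DCG by the ideal (descending) ordering, yielding a score in [0, 1]."""
    ideal = dcg(sorted(predicted_relevances, reverse=True))
    return dcg(predicted_relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking of users by influence scores 1.0;
# placing the most influential users last lowers the score.
print(ndcg([3, 2, 1, 0]))  # 1.0
print(ndcg([0, 1, 2, 3]))  # below 1.0
```

    A metric of this family rewards placing highly influential users near the top of the ranking, which is the evaluation need the dissertation's RankDCG is designed to address.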

    Stylistics versus Statistics: A corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails

    This thesis empirically investigates how a corpus linguistic approach can address the main theoretical and methodological challenges facing the field of forensic authorship analysis. Linguists approach the problem of questioned authorship from the theoretical position that each person has their own distinctive idiolect (Coulthard 2004: 431). However, the notion of idiolect has come under scrutiny in forensic linguistics over recent years for being too abstract to be of practical use (Grant 2010; Turell 2010). At the same time, two competing methodologies have developed in authorship analysis: on the one hand, qualitative stylistic approaches, and on the other, statistical ‘stylometric’ techniques. This study uses a corpus of over 60,000 emails and 2.5 million words written by 176 employees of the former American company Enron to tackle these issues in the contexts of both authorship attribution (identifying authors using linguistic evidence) and author profiling (predicting authors’ social characteristics using linguistic evidence). Analyses reveal that even in shared communicative contexts, and when using very common lexical items, individual Enron employees produce distinctive collocation patterns and lexical co-selections. In turn, these idiolectal elements of linguistic output can be captured and quantified by word n-grams (strings of n words). An attribution experiment is performed using word n-grams to identify the authors of anonymised email samples. Results of the experiment are encouraging, and it is argued that the approach developed here offers a means by which stylistic and statistical techniques can complement each other. Finally, quantitative and qualitative analyses are combined in the sociolinguistic profiling of Enron employees by gender and occupation. Current author profiling research is exclusively statistical in nature. However, the findings here demonstrate that when statistical results are augmented by qualitative evidence, the complex relationship between language use and author identity can be more accurately observed.
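    The word n-gram approach to attribution can be sketched as follows; the authors, texts, and Jaccard-overlap decision rule are illustrative assumptions, not the thesis's exact procedure:

```python
def word_ngrams(text, n=2):
    """Lowercased word n-grams (here bigrams) of a text, as a set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def attribute(questioned, candidates, n=2):
    """Attribute a questioned text to the candidate author whose known
    writing shares the largest Jaccard overlap of word n-grams with it."""
    q = word_ngrams(questioned, n)
    def jaccard(author_text):
        a = word_ngrams(author_text, n)
        return len(q & a) / len(q | a) if q | a else 0.0
    return max(candidates, key=lambda author: jaccard(candidates[author]))

# Hypothetical known writing samples for two authors.
known = {
    "author_a": "please give me a call to discuss the gas contract",
    "author_b": "let me know if the meeting works for your schedule",
}
print(attribute("give me a call about the contract", known))  # author_a
```

    The design reflects the abstract's claim that recurring collocations and lexical co-selections, even among very common words, carry an idiolectal signal that set overlap can quantify.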

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onward, seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn a year [1]. From a data science perspective, and hitherto less explored, this thesis demonstrates how linguistic features can drive data mining algorithms to aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-Ks (narrative sections) from firms formally indicted for FSF, juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Unlike other similar studies, this thesis uniquely takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention, and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine ‘what’ was said as opposed to ‘how’. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts of keywords unearthed in a previous content analysis study of financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine if they aid in discriminating a fraud firm from a non-fraud firm. The results derived from the battery of models built typically exceed a classification accuracy of 70%. The above process is amalgamated into a framework. The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis can aid fraud detection, and it constitutes a unique contribution to deception detection studies.
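    Word-list-based feature extraction of the kind described (tone, uncertainty, and similar categories) can be sketched as follows; the word lists here are invented for illustration and are not the customised lists used in the thesis:

```python
import re

# Illustrative (not the thesis's) word lists for two feature categories.
WORD_LISTS = {
    "negative": {"loss", "decline", "adverse", "litigation"},
    "uncertainty": {"may", "might", "approximately", "believe"},
}

def linguistic_ratios(text):
    """Per-category word frequencies, normalised by document length.
    The resulting dict can serve as a feature vector for a classifier."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in WORD_LISTS.items()}

report = "We believe revenues may decline, and litigation losses are approximately stable."
print(linguistic_ratios(report))
```

    Feature vectors like this, computed per annual report, are the kind of input that classification and clustering algorithms can use to separate fraud from non-fraud firms.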

    Examining the institutional work of sustainability reporting managers and sustainability assurance providers: An institutional work perspective

    Sustainability reporting and sustainability assurance are new accounting technologies which have been introduced to assist organisations in transitioning to a sustainable growth model. The overarching research objective guiding this study is to understand how sustainability reporting managers (SRMs) prepare sustainability reports and how sustainability assurance providers (SAPs) undertake sustainability assurance. The study draws on Lawrence and Suddaby’s (2006) typology to understand the forms of institutional work SRMs and SAPs undertake as they perform their roles and how these efforts affect the institutionalisation of sustainability reporting and sustainability assurance. Given the interpretive nature of this research, the tenets of hermeneutic theory provide the research methodology and research method guiding the investigation. The data comprise semi-structured interviews with SRMs and SAPs based in Australia and New Zealand. From the overarching research objective, three research questions are addressed. The first research question explores the supply side of the sustainability assurance market. The institutional efforts of accounting sustainability assurance providers (ASAPs) are directed at institutionalising sustainability assurance as similar to, or the same as, a traditional financial statements audit. In comparison, the institutional efforts of non-accounting sustainability assurance providers (NASAPs) are directed towards institutionalising sustainability assurance as a vehicle designed to drive sustainability within reporting organisations. The second research question explores the institutional work of SRMs as they attempt to institutionalise sustainability reporting within their organisations. SRMs play the roles of sustainability reporting champions and sustainability reporting experts. These efforts occur against the backdrop of the new GRI G4 reporting guidelines. As a result, SRMs are changing the normative foundations underlying sustainability reporting from ‘bigger is better’ to more focused, materiality-assessment-driven reporting. However, while SRMs have been successful in embedding and routinising sustainability reporting, these efforts have had a lesser immediate impact in promoting balanced sustainability reporting practices. The third research question focuses on the demand side of the sustainability assurance market. Given the voluntary nature of sustainability assurance, SAPs’ institutional efforts are aimed at achieving the dual objectives of enhancing the credibility of sustainability reports and promoting sustainability assurance as a value-added service. However, due to the voluntary nature of sustainability assurance, the efforts of SAPs have had a relatively greater impact in promoting reliable sustainability reporting and less success in promoting balanced sustainability reporting. Finally, the efforts of SAPs in promoting sustainability assurance as a value-added activity have also met with difficulties, as this study finds that the engagement suffers from diminishing returns. The contributions from this study are both practical and academic. At a practical level, the findings will prove beneficial to inexperienced SRMs. The study recommends that, given the voluntary nature of the engagement, there is a need for greater regulation designed to strengthen the position of SAPs. At an academic level, the findings build on the limited body of interpretive research examining the phenomena of sustainability reporting and sustainability assurance. Finally, the findings contribute to the literature on institutional work, building on Lawrence and Suddaby’s (2006) typology of forms of institutional work.

    The C-SPAN Archives: An Interdisciplinary Resource for Discovery, Learning, and Engagement

    The C-SPAN Archives records, indexes, and preserves all C-SPAN programming for historical, educational, and research uses. Every C-SPAN program aired since 1987, from all House and Senate sessions in the US Congress, to hearings, presidential speeches, conventions, and campaign events, totaling over 200,000 hours, is contained in the video library and is immediately and freely accessible through the database and electronic archival systems developed and maintained by staff. Whereas C-SPAN is best known as a resource for political processes and policy information, the Archives also offers rich educational research and teaching opportunities. This book provides guidance and inspiration to scholars who may be interested in using the Archives to illuminate concepts and processes in varied communication and political science subfields using a range of methodologies for discovery, learning, and engagement. Applications described range from teaching rhetoric to enhancing TV audiences’ viewing experiences. The book links to illustrative clips from the Archives to help readers appreciate the usability and richness of the source material and the pedagogical possibilities it offers. Many of the essays are authored by faculty connected with the Purdue University School of Communication, named after the founder of C-SPAN, Brian Lamb. The book is divided into four parts: Part 1 consists of an overview of the C-SPAN Archives, the technology involved in establishing and updating its online presence, and the C-SPAN copyright and use policy. Featured are the ways in which the collection is indexed and tips on how individuals can find particular materials. This section provides an essential foundation for scholars’ and practitioners’ increased use of this valuable resource. Parts 2 and 3 contain case studies describing how scholars use the Archives in their research, teaching, and engagement activities. Some case studies were first presented during a preconference at the National Communication Association (NCA) convention in November 2013, while others were invited or solicited through open calls. Part 4 explores future directions for C-SPAN Archives use as a window into American life and global politics.