1,575 research outputs found

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to publication policies; please contact Prof. Erik Cambria for details.
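One of the surveyed tasks, word sense disambiguation, can be illustrated with the classic simplified Lesk heuristic: choose the sense whose dictionary gloss overlaps most with the surrounding context. The sense inventory below is a toy stand-in, not any real lexicon, and this is only one of many approaches the survey covers.

```python
def simplified_lesk(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory for the ambiguous word "bank" (illustrative only).
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}
print(simplified_lesk("the bank accepts deposits and lends money".split(), senses))
# -> bank/finance
```

Modern systems replace the bag-of-words overlap with contextual embeddings, but the sense-selection framing is the same.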

    Keywords at Work: Investigating Keyword Extraction in Social Media Applications

    This dissertation examines a long-standing problem in Natural Language Processing (NLP) -- keyword extraction -- from a new angle. We investigate how keyword extraction can be formulated on social media data, such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and use the resulting system with success to uncover patterns in a new dataset -- student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. When applied to student discussions, the system uncovers new and exciting patterns. While each of the above problems is conceptually distinct, they share two key common elements -- keywords and social data. Social data can be messy, hard to interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd
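The dissertation's graph-based features are its own contribution, but the general idea of graph-based keyword ranking can be sketched in the spirit of TextRank: build a word co-occurrence graph over a sliding window and rank nodes with PageRank-style iterations. Everything below (window size, damping, toy text) is an illustrative assumption.

```python
from collections import defaultdict

def rank_keywords(tokens, window=2, damping=0.85, iters=30):
    # Build an undirected co-occurrence graph over a sliding window.
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    # Simple PageRank iterations: a word is important if important
    # words co-occur with it.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - damping) + damping * sum(score[n] / len(graph[n])
                                                  for n in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)

tokens = ("graph based keyword extraction ranks keyword candidates "
          "using a keyword graph").split()
print(rank_keywords(tokens)[:3])  # "keyword" ranks first here
```

Supervised variants, as in the dissertation, would feed such graph scores into a classifier alongside other features rather than rank on them directly.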

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
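The IR half of an SCR pipeline can be sketched as TF-IDF scoring of a query against automatic speech recognition transcripts. This toy index is an assumption for illustration; real SCR systems additionally cope with recognition errors, lattices, and time alignment, which the survey discusses.

```python
import math
from collections import Counter

def tfidf_scores(query, transcripts):
    """Score a text query against each (hypothesized ASR) transcript."""
    docs = [Counter(t.lower().split()) for t in transcripts]
    n = len(docs)
    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1  # smoothed idf
    terms = query.lower().split()
    return [sum(d[t] * idf(t) for t in terms) for d in docs]

transcripts = [
    "welcome to the weekly science podcast about speech recognition",
    "today we discuss cooking recipes and kitchen tips",
]
scores = tfidf_scores("speech recognition podcast", transcripts)
print(scores.index(max(scores)))  # -> 0, the first transcript matches
```

In practice the "documents" are time-stamped segments, so the returned unit is a playback position rather than a whole recording.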

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards, seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn a year [1]. From a data science perspective, hitherto less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-K (narrative sections) from firms formally indicted for FSF, juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Differently from other similar studies, this thesis uniquely takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention, and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine ‘what’ was said as opposed to ‘how’. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts of keywords unearthed by a previous content analysis study on financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine if they aid in discriminating a fraud from a non-fraud firm. The battery of models built typically exceeds a classification accuracy of 70%. The above process is amalgamated into a framework.
The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis can aid fraud detection, and constitutes a unique contribution to deception detection studies.
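Two of the feature families described, word n-grams for collocations and linguistic ratios over word categories, can be sketched as below. The word lists here are tiny illustrative stand-ins, not the thesis's customised lists.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous word n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def linguistic_ratios(tokens, categories):
    """Share of tokens falling in each category word list."""
    total = len(tokens)
    return {name: sum(t in words for t in tokens) / total
            for name, words in categories.items()}

text = "cash flow grew and cash flow margins improved".split()
print(ngrams(text, 2).most_common(1))  # -> [(('cash', 'flow'), 2)]
print(linguistic_ratios(text, {"positive_tone": {"grew", "improved"}}))
# -> {'positive_tone': 0.25}
```

Feature vectors built this way per annual report would then be fed to the classification and clustering algorithms the thesis benchmarks.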

    Domain-Focused Summarization of Polarized Debates

    Due to the exponential growth of Internet use, textual content is increasingly published in online media. Every day, more and more news content, blog posts, and scientific articles are published online, opening doors for the text summarization research community to conduct research in those areas. Whilst there are freely accessible repositories for such content, online debates, which have recently become popular, have remained largely unexplored. This thesis addresses the challenge of applying text summarization to online debates. We take the view that the task of summarizing online debates should not only focus on summarization techniques but should also look further, to presenting the summaries in formats favoured by users. In this thesis, we present how a summarization system is developed to generate online debate summaries in accordance with a designed output, called Combination 2, which combines two summaries. The primary objective of the first summary, Chart Summary, is to visualize the debate summary as a bar chart in a high-level view. The chart consists of bars conveying clusters of the salient sentences, labels showing short descriptions of the bars, and the numbers of salient sentences expressed on the two opposing sides. The other part, Side-By-Side Summary, linked to the Chart Summary, shows a more detailed summary of an online debate related to a bar clicked by a user. The development of the summarization system is divided into three processes. In the first process, we create a gold-standard dataset of online debates. The dataset contains a collection of debate comments annotated with five judgments. We develop a summarization system with key features to help identify salient sentences in the comments. The sentences selected by the system are evaluated against the annotation results, and we found that the system outperforms the baseline.
The second process begins with the generation of the Chart Summary from the salient sentences selected by the system. We propose a framework with two branches: one combines term-based clustering with a term-based labeling method, and the other combines X-means-based clustering with an MI labeling strategy. Our evaluation results indicate that the X-means clustering approach is the better alternative for clustering. In the last process, we view the generation of the Side-By-Side Summary as a contradiction detection task. We create two debate entailment datasets derived from the two clustering approaches and annotate them with Contradiction and Non-Contradiction relations. We develop a classifier and investigate combinations of features that maximize the F1 scores. Based on the proposed features, we discovered that combinations of between two and eight features yield good results.
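The clustering step can be sketched with plain k-means over toy sentence vectors. X-means, which the thesis found preferable, additionally chooses the number of clusters k with a BIC-style criterion; here k is fixed and the initialisation is naively deterministic, both assumptions for brevity.

```python
def kmeans(points, k, iters=10):
    """Cluster tuples of floats into k groups by squared distance."""
    centers = list(points[:k])  # naive deterministic initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each centre as the mean of its cluster.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl))
                   if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two well-separated groups of toy sentence vectors.
points = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 4.9)]
print([len(c) for c in kmeans(points, k=2)])  # -> [2, 2]
```

In the thesis's setting the points would be term vectors of salient sentences, and each resulting cluster becomes one bar of the Chart Summary.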

    Analysis of Family-Health-Related Topics on Wikipedia

    New concepts, terms, and topics constantly emerge, and the meanings of existing terms and topics keep changing. These phenomena occur more frequently on social media than on conventional media because social media allows a huge number of users to generate information online. Retrieving relevant results from different time periods of a fast-changing topic has become one of the most difficult challenges in the information retrieval field. Among the numerous topics discussed on social media, health-related topics are a major category that attracts increasing attention from the general public. This study investigated and explored the evolution patterns of family-health-related topics on Wikipedia. Three family-health-related topics (Child Maltreatment, Family Planning, and Women’s Health) were selected from the World Health Organization Website and their associated entries were retrieved on Wikipedia. Historical numeric and text data of the entries from 2010 to 2017 were collected from a Wikipedia data dump and the Wikipedia Web pages. Four periods were defined: 2010 to 2011, 2012 to 2013, 2014 to 2015, and 2016 to 2017. Coding, subject analysis, descriptive statistical analysis, inferential statistical analysis, an SOM approach, and an n-gram approach were employed to explore the internal characteristics and external popularity evolutions of the topics. The findings illustrate that the external popularities of the family-health-related topics declined from 2010 to 2017, although their content on Wikipedia kept increasing. The newly emerged entries had three features: specialization, summarization, and internationalization. The subjects derived from the entries became increasingly diverse during the investigated periods. Meanwhile, the developing trajectories of the subjects varied from one to another. According to their developing trajectories, the subjects were grouped into three categories: growing subjects, diminishing subjects, and fluctuating subjects.
The popularity trends of the topics were consistent among Wikipedia viewers but not among the editors; for each topic, the popularity trends among editors and viewers diverged. Child Maltreatment was the most popular of the three topics, Women’s Health the second most popular, and Family Planning the least. The implications of this study include: (1) helping health professionals and general users get a more comprehensive understanding of the investigated topics; (2) contributing to the development of health ontologies and consumer health vocabularies; (3) assisting Website designers in organizing online health information and helping them identify popular family-health-related topics; (4) providing a new approach for query recommendation in information retrieval systems; (5) supporting temporal information retrieval by presenting the temporal changes of family-health-related topics; and (6) providing a new combination of data collection and analysis methods for researchers.
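The grouping of subjects into growing, diminishing, and fluctuating trajectories can be sketched as a simple rule over per-period counts. The monotonicity rule below is an illustrative assumption; the study's actual categorization draws on its statistical and SOM analyses.

```python
def trajectory(counts):
    """Label a subject's per-period counts (e.g. the four 2010-2017
    periods) as growing, diminishing, or fluctuating."""
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    if all(d >= 0 for d in diffs):
        return "growing"
    if all(d <= 0 for d in diffs):
        return "diminishing"
    return "fluctuating"

print(trajectory([10, 14, 18, 25]))  # -> growing
print(trajectory([30, 22, 15, 9]))   # -> diminishing
print(trajectory([12, 30, 8, 20]))   # -> fluctuating
```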

    Textual Analysis of Intangible Information

    Traditionally, equity investors have relied upon the information reported in firms’ financial accounts to make their investment decisions. Due to the conservative nature of accounting standards, firms cannot value their intangible assets such as corporate culture, brand value, and reputation. Investors’ efforts to collect such information have been hampered by the voluntary nature of Corporate Social Responsibility (CSR) reporting standards, which has resulted in the publication of inconsistent, stale, and incomplete information across firms. In short, information on intangible assets is less salient to investors than accounting information because it is more costly to collect, process, and analyse. In this thesis, we design an automated approach to collect and quantify information on firms’ intangible assets by drawing upon techniques commonly adopted in the fields of Natural Language Processing (NLP) and Information Retrieval. The exploitation of unstructured data available on the Web holds promise for investors seeking to integrate a wider variety of information into their investment processes. The objectives of this research are: 1) to draw upon textual analysis methodologies to measure intangible information from a range of unstructured data sources; 2) to integrate intangible information and accounting information into an investment analysis framework; and 3) to evaluate the merits of unstructured data for the prediction of firms’ future earnings.

    Chamic and beyond: studies in mainland Austronesian languages


    Phonological issues in the production of prosody by francophone and sinophone learners of English as a second language

    A non-native accent can lead to miscommunication or to the perception of varying degrees of foreign accentedness. Prosody, now recognized as an important element of the impression of foreignness, has received relatively little attention in foreign language acquisition research. This contrasts with the growing interest in prosody as an element of the native language. In this thesis, phonological research is evaluated for its relevance to research on foreign language prosody. Two aspects of phonological theory are examined: typology and phonological organization. This choice is justified by the general presumption that prosodic foreignness arises either from a typological difference between the native language (L1) and the foreign language (L2) or from a transfer of prosodic features from the L1. The review of research on phonological typology concludes that, at this stage, no model of prosodic classification is applicable to L2 acquisition. In particular, the study shows that certain typologies, notably Pike's stress-timing/syllable-timing theory, should be set aside because they impede progress in research on the acquisition and production of foreign language prosody. The second aspect of phonological theory examined in this thesis is phonological organization. The premise is that underlying differences in prosodic organization, rather than surface phonological differences, are transferred from the L1 to the L2. In-depth analyses of North American English, French, and Standard Chinese reveal substantial phonological differences between North American English and the other two languages. Four experiments test some of these differences.
The English prosody produced by native speakers of French is analyzed in rhythmically simple and rhythmically more complex sentences. The results show that lexical stress is less problematic than supra-lexical prosodic stress. In particular, the fundamental frequency (F0) rises at the beginning and end of the accentual phrase (AP), typical of French, are shown to be a source of error in second-language English prosody. It is also shown, however, that this error, although noticed by native English speakers, does not affect their perception of stress placement. The English prosody produced by native speakers of Chinese is analyzed in terms of tone transfer and F0 peak alignment. The results indicate that Chinese speakers use Chinese tones when producing English pitch accents; more specifically, the majority of speakers use Tone 2 (the rising tone) when producing a rising pitch accent. The final experiment reveals that native Chinese speakers align the pitch accent with its corresponding stressed syllable more strictly than native speakers of North American English do. The results of this thesis yield insight into the progression of foreign language prosody performance, and the conclusions carry implications for the content and format of pronunciation teaching. AUTHOR KEYWORDS: Phonology, Phonetics, Prosodic phonology, Prosody, Rhythm, ESL, Quebec French, French of France, Chinese