11 research outputs found

    Multi-word expression-sensitive word alignment

    This paper presents a new word alignment method which incorporates knowledge about Bilingual Multi-Word Expressions (BMWEs). Our method first extracts such BMWEs in a bidirectional way for a given corpus and then starts conventional word alignment, considering the properties of BMWEs in their grouping as well as their alignment links. We give partial annotation of alignment links as prior knowledge to the word alignment process: the maximum likelihood estimate in the M-step of the IBM Models is replaced with the Maximum A Posteriori (MAP) estimate, and the prior knowledge about BMWEs is embedded in the prior of this MAP estimate. In our experiments, we saw an improvement of 0.77 BLEU points absolute in JP–EN. Except for one case, our method gave better results than the method using only BMWE grouping. Even though this paper does not directly address the issues in Cross-Lingual Information Retrieval (CLIR), it discusses an approach of direct relevance to the field. This approach could be viewed as the opposite of current trends in CLIR on semantic space that incorporate a notion of order in the bag-of-words model (e.g. co-occurrences).
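
The MAP idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes IBM Model 1 only, and it assumes the prior is a Dirichlet whose parameter is 1 plus a pseudo-count on each annotated link, so that the MAP estimate reduces to adding pseudo-counts to the expected counts before normalising. The names `ibm1_map`, `prior_links`, and `strength` are hypothetical.

```python
from collections import defaultdict


def ibm1_map(corpus, prior_links, strength=5.0, iterations=10):
    """Train t(s | e) with IBM Model 1 EM, using a MAP-style M-step.

    corpus      : list of (source_tokens, target_tokens) sentence pairs
    prior_links : set of (source_word, target_word) pairs treated as known
                  links (hypothetical stand-in for BMWE-derived annotation)
    strength    : pseudo-count added per prior link (assumed value)
    """
    src_vocab = {s for src, _ in corpus for s in src}
    tgt_vocab = {e for _, tgt in corpus for e in tgt}
    # Uniform initialisation of the translation table.
    t = {(s, e): 1.0 / len(src_vocab) for s in src_vocab for e in tgt_vocab}

    for _ in range(iterations):
        counts = defaultdict(float)
        # E-step: expected link counts under the current table.
        for src, tgt in corpus:
            for s in src:
                z = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    counts[(s, e)] += t[(s, e)] / z
        # M-step, MAP variant: the prior enters as pseudo-counts.
        # With no prior links this is the usual maximum likelihood update.
        for s, e in prior_links:
            if s in src_vocab and e in tgt_vocab:
                counts[(s, e)] += strength
        totals = defaultdict(float)
        for (s, e), c in counts.items():
            totals[e] += c
        t = {(s, e): counts[(s, e)] / totals[e]
             for s in src_vocab for e in tgt_vocab}
    return t
```

Calling `ibm1_map(corpus, set())` gives plain maximum likelihood training, which makes it easy to compare the two estimates on the same toy corpus.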

    Word association models and search strategies for discriminative word alignment

    This paper deals with core aspects of discriminative word alignment systems, namely basic word association models as well as search strategies. We compare various low-computational-cost word association models: χ² score, log-likelihood ratio and IBM Model 1. We also compare three beam-search strategies. We show that it is more flexible and accurate to let links to the same word compete together than to introduce them sequentially in the alignment hypotheses, which is the strategy followed in several systems.
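
Two of the association models named in the abstract above can be sketched from a 2×2 co-occurrence table. This is an illustration under assumptions, not the paper's code: it uses the standard Pearson χ² statistic and Dunning's log-likelihood ratio (G²), and the cell names `a`, `b`, `c`, `d` are hypothetical (`a` = sentence pairs containing both words, `b` = source word only, `c` = target word only, `d` = neither).

```python
import math


def chi_square(a, b, c, d):
    """Pearson chi-square score for a 2x2 co-occurrence table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # An empty row or column carries no association evidence.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def log_likelihood_ratio(a, b, c, d):
    """Dunning's G^2 statistic: 2 * sum(observed * ln(observed / expected))."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

Both scores are zero when the two words occur independently and grow with the strength of association, which is what makes them usable as cheap link features.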

    Investigating Frequency and Type of Lexical Collocations in Applied Linguistics Journal Articles Written in English by Iranian and Norwegian Scholars

    Master's thesis in Literacy Studies. In today’s academic world, the research interest in corpus linguistics has shifted towards word co-occurrence rather than single words. Accordingly, a great body of literature has been devoted to investigations of recurrent word combinations in academic prose using frequency and dispersion parameters. This has resulted in analyses of corpora in different fields of study to collect comprehensive lists of academic collocations. Moreover, many contrastive studies have been conducted to compare the collocations used by native and non-native speakers of English. However, to the author’s knowledge, few studies have compared the most frequent collocations in two corpora of research articles written by non-native speakers of English and published in international journals in the field of applied linguistics. To fill this gap in the literature, the current study investigated the most frequent collocations used by Iranian and Norwegian scholars in a corpus of 17 articles published in the Journal of Pragmatics through a frequency-based approach. Nine of the 17 articles, totalling 67,673 words, were written by Iranian scholars, and the other eight, comprising 64,682 words, were written by Norwegian scholars. The data of this study were collected using Collocation Extract software. The results of the study were presented in three phases. In the first phase, the 15 most frequent lexical collocations in both corpora were identified and classified under three types of lexical collocations. Adj+N collocations accounted for the largest proportion in the corpora, while Adv+Adj collocations accounted for the smallest. In the second phase, the lexical collocations of the Iranian corpus were presented, a total of 818 collocations classified under five types. According to the results, Adj+N was the most frequent type while N+V was the least frequent one. 
The lexical collocations of the Norwegian corpus were identified in the same way. A total of 462 collocations were classified under four types, among which Adj+N was the most frequent type while Adv+Adj was the least frequent one. In the third phase, frequencies of lexical collocations were compared in the two corpora. According to the obtained results, the two corpora did not differ significantly in the use of any type of lexical collocation except for the Adj+N type.

    Data-driven Communicative Behaviour Generation: A Survey

    The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human–agent interaction and human–robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human–human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.

    Social Role Transitions and Technology: Societal Change and Coping in Online Communities

    Technological and societal changes unfold in relation to one another. Many events like becoming a parent, getting divorced, or getting a medical diagnosis dictate a change in one’s social role. Social role transition can have negative consequences including stress, stigmatization, and disempowerment. Social interactions, especially communicating with allies and those facing similar conditions, can alleviate the psychological burden of these challenges. The goal of this dissertation is to understand how people use technology to cope with social role change, and how the features of different online communities provide a range of ways to make sense of their social role transition, find support, and advocate for change. In the first study (Chapter 3), I qualitatively analyze interviews with fathers and a sample of father blogs to show how fathers use do-it-yourself (DIY) language on blogs and in their online interactions as a means of redefining fatherhood. Fathers use the DIY concept to build their own father-centric online communities in order to manage some of the disadvantages associated with the lack of parenting online communities that cater to them. This new framing of fatherhood allows fathers to make sense of their new role as parents, and at the same time, to redefine the social norms around fatherhood. In Chapter 4, I study how parents use social media sites at scale using natural language processing. The focus of the analysis is on Reddit, a social media site that allows users to comment under pseudonyms. I find that parents use pseudonymous social media sites to discuss topics that might otherwise be considered too sensitive to discuss on real-name social media sites such as Facebook (e.g., breastfeeding and sleep training). This study also outlines similarities and differences in discussion topics among mothers and fathers on Reddit (e.g., mothers discussing breastfeeding and fathers discussing divorce and custody). 
In Chapter 5, I use computational and qualitative methods to study how anonymous accounts on Reddit (throwaway accounts) provide parents with varying levels of anonymity as they cope with social role changes by sharing potentially stigmatizing information (e.g., postpartum depression) or advocating for stigmatized identities (e.g., divorced fathers). Finally, based on my findings, I present design recommendations that could promote better social support on platforms beyond Reddit. PhD, Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162933/1/tawfiqam_1.pd

    European Approaches to Japanese Language and Linguistics

    In this volume, European specialists in the Japanese language present new and original research into Japanese over a wide spectrum of topics, which include descriptive, sociolinguistic, pragmatic and didactic accounts. The articles share a focus on contemporary issues and adopt new approaches to the study of Japanese that are often specific to European traditions of language study. The articles address an audience that includes both Japanese Studies and Linguistics. They are representative of the wide range of topics that are currently studied in European universities, and they address scholars and students alike.

    Connecting Documents, Words, and Languages Using Topic Models

    Topic models discover latent topics in documents and summarize documents at a high level. To improve topic models' topic quality and extrinsic performance, external knowledge is often incorporated as part of the generative story. One form of external knowledge is weighted text links that indicate similarity or relatedness between the connected objects. This dissertation 1) uncovers the latent structures in observed weighted links and integrates them into topic modeling, and 2) learns latent weighted links from other external knowledge to improve topic modeling. We consider incorporating links at three different levels: documents, words, and topics. We first look at binary document links, e.g., citation links of papers. Document links indicate topic similarity of the connected documents. Past methods model the document links separately, ignoring the entire link density. We instead uncover latent document blocks in which documents are densely connected and tend to talk about similar topics. We introduce LBH-RTM, a relational topic model with lexical weights, block priors, and hinge loss. It extracts informative topic priors from the document blocks for documents' topic generation. It predicts unseen document links with block and lexical features and hinge loss, in addition to topical features. It outperforms past methods in link prediction and gives more coherent topics. Like document links, words are also linked, but usually with real-valued weights. Word links are known as word associations and indicate the semantic relatedness of the connected words. They provide more information about word relationships in addition to the co-occurrence patterns in the training corpora. To extract and incorporate the knowledge in word associations, we introduce methods to find the most salient word pairs. The methods organize the words in a tree structure, which serves as a prior (i.e., tree prior) for tree LDA. 
The methods are straightforward but effective, yielding more coherent topics than vanilla LDA, and slightly improving the extrinsic classification performance. Weighted topic links are different. Topics are latent, so it is difficult to obtain ground-truth topic links, but learned weighted topic links could bridge the topics across languages. We introduce a multilingual topic model (MTM) that assumes each language has its own topic distributions over the words only in that language and learns weighted topic links based on word translations and words' topic distributions. It does not force the topic spaces of different languages to be aligned and is more robust than previous MTMs that do. It outperforms past MTMs in classification while still giving coherent topics on less comparable and smaller corpora

    Unrestricted Bridging Resolution

    Anaphora plays a major role in discourse comprehension and accounts for the coherence of a text. In contrast to identity anaphora which indicates that a noun phrase refers back to the same entity introduced by previous descriptions in the discourse, bridging anaphora or associative anaphora links anaphors and antecedents via lexico-semantic, frame or encyclopedic relations. In recent years, various computational approaches have been developed for bridging resolution. However, most of them only consider antecedent selection, assuming that bridging anaphora recognition has been performed. Moreover, they often focus on subproblems, e.g., only part-of bridging or definite noun phrase anaphora. This thesis addresses the problem of unrestricted bridging resolution, i.e., recognizing bridging anaphora and finding links to antecedents where bridging anaphors are not limited to definite noun phrases and semantic relations between anaphors and their antecedents are not restricted to meronymic relations. In this thesis, we solve the problem using a two-stage statistical model. Given all mentions in a document, the first stage predicts bridging anaphors by exploring a cascading collective classification model. We cast bridging anaphora recognition as a subtask of learning fine-grained information status (IS). Each mention in a text gets assigned one IS class, bridging being one possible class. The model combines the binary classifiers for minority categories and a collective classifier for all categories in a cascaded way. It addresses the multi-class imbalance problem (e.g., the wide variation of bridging anaphora and their relative rarity compared to many other IS classes) within a multi-class setting while still keeping the strength of the collective classifier by investigating relational autocorrelation among several IS classes. The second stage finds the antecedents for all predicted bridging anaphors at the same time by exploring a joint inference model. 
The approach models two mutually supportive tasks (i.e., bridging anaphora resolution and sibling anaphors clustering) jointly, on the basis of the observation that semantically/syntactically related anaphors are likely to be sibling anaphors, and hence share the same antecedent. Both components are based on rich linguistically-motivated features and discriminatively trained on a corpus (ISNotes) where bridging is reliably annotated. Our approaches achieve substantial improvements over the reimplementations of previous systems for all three tasks, i.e., bridging anaphora recognition, bridging anaphora resolution and full bridging resolution. The work is – to our knowledge – the first bridging resolution system that handles the unrestricted phenomenon in a realistic setting. The methods in this dissertation were originally presented in Markert et al. (2012) and Hou et al. (2013a; 2013b; 2014). The thesis gives a detailed exposition, carrying out a thorough corpus analysis of bridging and conducting a detailed comparison of our models to others in the literature, and also presents several extensions of the aforementioned papers