
    Automatic animacy classification for Dutch

    We present an automatic animacy classifier for Dutch that can determine the animacy status of nouns, i.e. how alive the noun's referent is (human, inanimate, etc.). Animacy is a semantic property that has been shown to play a role in human sentence processing, felicity and grammaticality. Although animacy is not marked explicitly in Dutch, we expect knowledge about animacy to be helpful for parsing, translation and other NLP tasks. Only a few animacy classifiers and animacy-annotated corpora exist internationally. For Dutch, animacy information is only available in the Cornetto lexical-semantic database. We augment this lexical information with context information from the Dutch Lassy Large treebank to create training data for an animacy classifier that uses a novel kind of context features. We use the k-nearest neighbour algorithm with distributional lexical features, e.g. how frequently the noun occurs as the subject of the verb 'to think' in a corpus, to decide on the (predominant) animacy class. The size of the Lassy Large corpus makes this possible, and the high level of detail these word-association features provide results in accurate Dutch-language animacy classification.
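    To make the abstract's setup concrete, here is a minimal sketch of a k-nearest-neighbour classifier over distributional lexical features of the kind described. The feature columns (relative frequencies of a noun in selected verb-subject and verb-object slots) and all counts are invented for illustration; the paper's actual features come from the Lassy Large treebank.

```python
# Minimal sketch of a k-NN animacy classifier over distributional
# lexical features. Feature names and toy counts are illustrative,
# not taken from the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row: one noun; each column: how often the noun fills a given
# syntactic slot in a corpus, e.g. subject-of-'denken' ('to think'),
# subject-of-'vallen' ('to fall'), object-of-'eten' ('to eat').
feature_names = ["subj_of_denken", "subj_of_vallen", "obj_of_eten"]
X_train = np.array([
    [0.42, 0.05, 0.10],   # 'leraar' (teacher) -> human
    [0.38, 0.07, 0.02],   # 'vrouw'  (woman)   -> human
    [0.00, 0.31, 0.01],   # 'steen'  (stone)   -> inanimate
    [0.01, 0.25, 0.00],   # 'tafel'  (table)   -> inanimate
])
y_train = ["human", "human", "inanimate", "inanimate"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# An unseen noun that often appears as the subject of 'to think'
# lands near the human training nouns.
print(clf.predict([[0.35, 0.06, 0.05]]))  # -> ['human']
```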

    Multilingual Animacy Classification by Sparse Logistic Regression

    This paper presents results from three experiments on automatic animacy classification in Japanese and English. We present experiments that focus on solutions to the problem of reliably classifying a large set of infrequent items using a small number of automatically extracted features. We labeled a set of Japanese nouns as ±animate on the basis of reliable, surface-obvious morphological features, producing an accurately but sparsely labeled data set. To classify these nouns, and to achieve good generalization to other nouns for which we do not have labels, we used feature vectors based on frequency counts of verb-argument relations that abstract away from item identity and into class-wide distributional tendencies of the feature set. Grouping items into suffix-based equivalence classes prior to classification increased data coverage and improved classification accuracy. For the items that occur at least once with our feature set, we obtained 95% classification accuracy. We used loanwords to transfer automatically acquired labels from English to classify items that are zero-frequency in the Japanese data set, giving increased precision on inanimate items and increased recall on animate items.
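    A hedged sketch of the core modelling idea follows: sparse (L1-regularised) logistic regression over verb-argument frequency counts. The counts and labels are invented, and the suffix-based equivalence classes and loanword-transfer steps are omitted.

```python
# Sketch of sparse logistic regression for animacy: an L1 penalty can
# drive uninformative feature weights to exactly zero, which suits a
# small set of verb-argument features. All counts are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: frequency of the noun as the argument of selected verb slots.
X = np.array([
    [12, 0, 3],   # animate noun
    [9,  1, 4],   # animate noun
    [0, 14, 0],   # inanimate noun
    [1, 11, 1],   # inanimate noun
])
y = np.array([1, 1, 0, 0])  # 1 = animate, 0 = inanimate

# The 'liblinear' solver supports L1 penalties for binary problems.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

print(clf.coef_)                  # sparse weight vector
print(clf.predict([[10, 0, 2]]))  # -> [1] (animate)
```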

    Tracing thick and thin concepts through corpora

    Philosophers and linguists currently lack the means to reliably identify evaluative concepts and measure their evaluative intensity. Using a corpus-based approach, we present a new method to distinguish evaluatively thick and thin adjectives like ‘courageous’ and ‘awful’ from descriptive adjectives like ‘narrow,’ and from value-associated adjectives like ‘sunny.’ Our study suggests that the modifiers ‘truly’ and ‘really’ frequently highlight the evaluative dimension of thick and thin adjectives, allowing them to be uniquely classified. Based on these results, we believe our operationalization may pave the way for a more quantitative approach to the study of thick and thin concepts.
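    The test the abstract describes can be phrased as a simple corpus statistic: how often is an adjective intensified by 'truly' or 'really', relative to its overall frequency? Below is a toy sketch; the mini-corpus and the threshold are invented, and the study's real corpora and cut-offs differ.

```python
# Toy version of the modifier test: adjectives that are frequently
# intensified by 'truly'/'really' are flagged as evaluative (thick or
# thin); the corpus and the 0.3 threshold are invented.
from collections import Counter

corpus = (
    "a truly courageous act . a really awful decision . "
    "a narrow street . a narrow margin . a sunny day . "
    "a truly awful mess . a courageous choice ."
).split()

adjectives = ["courageous", "awful", "narrow", "sunny"]
total = Counter(w for w in corpus if w in adjectives)
modified = Counter(
    w2 for w1, w2 in zip(corpus, corpus[1:])
    if w1 in ("truly", "really") and w2 in adjectives
)

for adj in adjectives:
    ratio = modified[adj] / total[adj]
    label = "evaluative (thick/thin)" if ratio > 0.3 else "descriptive/value-associated"
    print(f"{adj}: {ratio:.2f} -> {label}")
```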

    Varieties of Irony in Ika Natassa's Novel "Critical Eleven"

    This study aims to analyze the use of irony in the novel "Critical Eleven" by Ika Natassa using a semantic approach. Irony is a form of indirect language, used when the speaker or writer expresses one thing but implies another (Grice, 1975). Irony is a literary device often used in novels to create an additional dimension to the narrative. In Ika Natassa's novel, irony is present as a strong element in conveying messages and stirring readers' minds. In this analysis, the researcher identifies and analyzes various examples of irony present in the novel. The semantic approach helps in understanding the shift in the meaning of words or phrases used by the author to create irony in the story. Through a semantic approach, the investigator uncovers the contrast between the literal meaning and the actual meaning intended in this novel. Our semantic analysis includes an understanding of how irony is used to convey conflicting messages, create humorous effects, or convey social criticism. The results of our analysis reveal that irony in "Critical Eleven" provides a strong dimension in shaping the reader's understanding of the characters, conflicts, and themes in this novel. Irony serves as an effective means of arousing feelings, prompting reflection, and providing a different perspective in reading and analyzing literary works.

    Tracing Thick and Thin Concepts Through Corpora

    Philosophers and linguists currently lack the means to reliably identify evaluative concepts and to measure their evaluative intensity. Using a corpus-based approach, we present a new method to distinguish evaluatively thick adjectives like 'courageous' from descriptive adjectives like 'narrow', and from value-associated adjectives like 'sunny'. Our study reveals that the modifiers 'truly' and 'really' frequently highlight the evaluative dimension of thick and thin adjectives, allowing them to be uniquely classified. Based on these results, we believe the operationalization we suggest may pave the way for a more quantitative approach to the study of thick and thin concepts.

    Inducing Stereotypical Character Roles from Plot Structure

    If we are to understand stories, we must understand characters: characters are central to every narrative and drive the action forward. Critically, many stories (especially cultural ones) employ stereotypical character roles for different purposes, including communicating bundles of default characteristics and associations efficiently and easing understanding of those characters' roles in the overall narrative. These roles include ideas such as hero, villain, or victim, as well as culturally specific roles such as, for example, the donor (in Russian tales) or the trickster (in Native American tales). My thesis aims to learn these roles automatically, inducing them from data using a clustering technique. The first step of learning character roles, however, is to identify which coreference chains correspond to characters, which are defined by narratologists as animate entities that drive the plot forward. The first part of my work has focused on this character identification problem, specifically on the problem of animacy detection. Prior work treated animacy as a word-level property, and researchers developed statistical models to classify words as either animate or inanimate. I claimed this approach to the problem is ill-posed and presented a new hybrid approach for classifying the animacy of coreference chains that achieved state-of-the-art performance. The next step of my work is to develop approaches first to identify the characters and then to learn stereotypical roles with a new unsupervised clustering approach. My character identification system consists of two stages: first, I detect animate chains among the coreference chains using my existing animacy detector; second, I apply a supervised machine learning model that identifies which of those chains qualify as characters. I proposed a narratologically grounded definition of character and built a supervised machine learning model with a small set of features that achieved state-of-the-art performance. In the last step, I successfully implemented a clustering approach that uses plot and thematic information to cluster the archetypes. This work resulted in a completely new approach to understanding the structure of stories, greatly advancing the state of the art of story understanding.
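    As an illustration of the final clustering step only, here is a toy grouping of characters into role-like clusters. The characters, plot features, and cluster count are all invented; the thesis's actual pipeline (animacy detection over coreference chains, supervised character identification, then clustering with plot and thematic information) is far richer.

```python
# Toy sketch of inducing role-like groups by clustering characters
# represented as plot-feature vectors. Everything here is invented
# for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Rows: characters; columns: toy plot features, e.g. how often the
# character initiates harmful actions, helps the protagonist, or is
# acted upon by others.
characters = ["wolf", "hunter", "grandmother", "witch", "prince"]
X = np.array([
    [0.9, 0.0, 0.1],   # wolf: harms, rarely helps
    [0.1, 0.8, 0.1],   # hunter: helps
    [0.0, 0.1, 0.9],   # grandmother: acted upon
    [0.8, 0.1, 0.2],   # witch: harms
    [0.2, 0.9, 0.0],   # prince: helps
])

roles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for name, role in zip(characters, roles):
    print(name, "-> role cluster", role)
# Expected grouping: {wolf, witch}, {hunter, prince}, {grandmother},
# i.e. villain-, hero- and victim-like clusters.
```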

    Automatic detection of Estonian particle verbs with linguistic and statistical methods

    Nowadays, applications that process human languages (including Estonian) are part of everyday life, but computers' "language skills" are far from perfect. Machine translation is probably the most widely used application of natural language processing, and the worst failures of well-known systems (e.g. Google Translate) are regularly shared on social media; most such cases arise when phrases or sentences of several words are translated. For example, translation systems cannot catch the correct meaning of the particle verb alt minema (literally 'to go from under', idiomatically 'to get deceived') in the sentence "Ta läks lepinguga alt", because a word-by-word translation of the expression's components does not convey the intended meaning. In order to improve the quality of machine translation systems and of other useful applications such as fake news detection or question answering systems, computers must be able to detect such (idiomatic) multi-word expressions and their different meanings, something humans do quite easily from context. The automatic detection of multi-word expressions and their meanings is important in all languages and has therefore received much attention in computational linguistics, especially for English, yet the proposed methods had not previously been applied to Estonian multi-word expressions. This dissertation fills that gap and applies machine learning methods that have been successful for other languages to the automatic detection of one type of Estonian multi-word expression, the particle verb. Based on large textual data, the thesis demonstrates that the division of Estonian particle verbs in the traditional Estonian account into non-compositional (ainukordne: the meaning is not predictable from the meanings of the components) and compositional (korrapärane: the meaning is the sum of the meanings of the components) is not comprehensive enough. The research confirms the view, widely adopted in computational linguistics, that multi-word expressions (including particle verbs) form a continuum, with clearly compositional units at one end and units that take on a new meaning at the other. Moreover, it is shown that in addition to context, several other linguistic features, e.g. the animacy and case of the subject and object, help computers predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. Finally, the datasets and vector representations created for the dissertation are new, publicly available resources for future research.
    https://www.ester.ee/record=b5252157~S
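    To illustrate how features like subject animacy and case might feed a per-sentence compositionality classifier, here is a hedged sketch. The feature names, case values, and examples are all invented, and this is not the dissertation's actual model, which also uses context and trained embeddings.

```python
# Hypothetical sketch: can shallow features (subject animacy, the case
# of a dependent noun) separate compositional from idiomatic uses of a
# particle verb such as 'alt minema'? All features and examples below
# are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each instance: one sentence containing 'alt minema', described by
# invented shallow linguistic features of its arguments.
train_feats = [
    {"subj_animate": True,  "dep_case": "com"},  # idiomatic use
    {"subj_animate": True,  "dep_case": "com"},  # idiomatic use
    {"subj_animate": False, "dep_case": "ela"},  # compositional use
    {"subj_animate": False, "dep_case": "ela"},  # compositional use
]
train_labels = ["idiomatic", "idiomatic", "compositional", "compositional"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_feats, train_labels)

# 'Ta läks lepinguga alt': animate subject plus a comitative-marked
# noun ('lepinguga'), the configuration labelled idiomatic above.
print(model.predict([{"subj_animate": True, "dep_case": "com"}]))
```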