105 research outputs found

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatorial Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface, but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it’s easier than ever to do so: this document is accessible on the “information superhighway”. Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html. In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors’ abstracts in the web version of this report. The abstracts describe the researchers’ many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.

    Identifying nocuous ambiguity in natural language requirements

    This dissertation is an investigation into how ambiguity should be classified for authors and readers of text, and how this process can be automated. Usually, authors and readers disambiguate ambiguity, either consciously or unconsciously. However, disambiguation is not always appropriate. For instance, a linguistic construction may be read differently by different people, with no consensus about which reading is the intended one. This is particularly dangerous if they do not realise that other readings are possible. Misunderstandings may then occur. This is particularly serious in the field of requirements engineering. If requirements are misunderstood, systems may be built incorrectly, and this can prove very costly. Our research uses natural language processing techniques to address ambiguity in requirements. We develop a model of ambiguity, and a method of applying it, which represent a novel approach to the problem described here. Our model is based on the notion that human perception is the only valid criterion for judging ambiguity. If people perceive very differently how an ambiguity should be read, it will cause misunderstandings. Assigning a preferred reading to it is therefore unwise. In text, such ambiguities should be located and rewritten in a less ambiguous form; others need not be reformulated. We classify the former as nocuous and the latter as innocuous. We allow the dividing line between these two classifications to be adjustable. We term this the ambiguity threshold, and it represents a level of intolerance to ambiguity. A nocuous ambiguity can be an unacknowledged or an acknowledged ambiguity for a given set of readers. In the former case, they assign disparate readings to the ambiguity, but each is unaware that the others read it differently. In the latter case, they recognise that the ambiguity has more than one reading, but this fact may be unacknowledged by new readers. 
We present an automated approach to determine whether ambiguities in text are nocuous or innocuous. We use heuristics to distinguish ambiguities for which there is a strong consensus about how they should be read. These are innocuous ambiguities. The remaining nocuous ambiguities can then be rewritten at a later stage. We find consensus opinions about ambiguities by surveying human perceptions of them. Our heuristics try to predict these perceptions automatically. They utilise various types of linguistic information: generic corpus data, morphology and lexical subcategorisations are the most successful. We use coordination ambiguity as the test case for this research. This occurs where the scope of words such as "and" and "or" is unclear. Our research contributes to both the requirements engineering and the natural language processing literatures. Ambiguity is known to be a serious problem in requirements engineering, but has rarely been dealt with effectively and thoroughly. Our approach is an appropriate solution, and our flexible ambiguity threshold is a particularly useful concept. For instance, high ambiguity intolerance can be implemented when writing requirements for safety-critical systems. Coordination ambiguities are widespread and known to cause misunderstandings, but have received comparatively little attention. Our heuristics show that linguistic data can be used successfully to predict preferred readings of very diverse coordinations. Used in combination, these heuristics demonstrate that nocuous ambiguity can be distinguished from innocuous ambiguity under certain conditions. With appropriate ambiguity thresholds, an accuracy representing a 28% improvement over the baselines can be achieved.
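
The adjustable ambiguity threshold described in this abstract can be illustrated with a minimal sketch. It assumes that consensus is measured as the share of surveyed readers who chose the most popular reading; that measure, the function name `classify_ambiguity`, and the reading labels are illustrative assumptions, not the dissertation's actual definitions.

```python
from collections import Counter


def classify_ambiguity(judgements, threshold):
    """Label an ambiguity from a list of human reading judgements.

    Assumption (not from the source): consensus is the fraction of
    judges who picked the most frequent reading. The ambiguity is
    'innocuous' when that fraction reaches the threshold, otherwise
    'nocuous'.
    """
    if not judgements:
        raise ValueError("at least one judgement is required")
    counts = Counter(judgements)
    top_reading, top_count = counts.most_common(1)[0]
    consensus = top_count / len(judgements)
    label = "innocuous" if consensus >= threshold else "nocuous"
    return label, top_reading, consensus


# Hypothetical survey of ten readers for one coordination.
survey = ["wide"] * 6 + ["narrow"] * 4
print(classify_ambiguity(survey, threshold=0.8))  # strict setting
print(classify_ambiguity(survey, threshold=0.5))  # lenient setting
```

Under this reading, raising the threshold corresponds to higher intolerance of ambiguity: the same survey result is flagged for rewriting under the strict setting and accepted under the lenient one.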

    Non-canonical subject marking in Romanian : status and evolution of the MIHI EST construction

    This dissertation deals with the MIHI EST construction in Romanian, illustrated in (1), in which the verb fi ‘be’ combines with a dative experiencer and a state noun. This construction represents in Romanian the most natural way of expressing psychological or physiological states. It traces back to Latin, but it has disappeared from all other Romance languages, which use a HABEO structure to express this kind of state. Hence, within the Romance context the MIHI EST construction is a unique phenomenon in Romanian. (1) Mi-e foame / sete / frică [me.DAT=is hunger / thirst / fear] ‘I am hungry / thirsty / afraid’. The present study is part of a larger project that aims to measure Romanian’s tendency to non-canonical subject marking claimed in the literature. If confirmed, this tendency contradicts the hypothesis that European languages replace non-canonical structures with canonical structures. Within this comprehensive project, my dissertation contributes an in-depth analysis of the MIHI EST construction. By means of a synchronic and diachronic corpus-based study, I investigate (i) the status of the core arguments of the MIHI EST structure, i.e. the dative experiencer and the nominative state noun, traditionally analyzed as the subject, and (ii) the evolution of the MIHI EST construction from the first texts in Romanian dating from the 16th century until today. My investigation reveals that, with respect to a series of largely accepted syntactic subject criteria, the dative experiencer behaves like nominative subjects. These criteria are the following: word order, non-realization of the subject in subordinate clauses when coreferential with the subject of the main clause, movement of the subject of the subordinate clause to the position of subject of the main clause, deletion of subjects in telegraphic style, bare quantifiers in clause-initial position, and the ability to take secondary predicates.
In contrast, a thorough examination of the state noun shows that, although it is nominative-marked and triggers verb agreement, it does not behave like a syntactic subject, but shows predicate behavior. As for the evolution of the MIHI EST structure, the analysis of the data reveals that, throughout the centuries, periods of modernization alternate with periods of stabilization. In other words, periods in which new nouns are accepted in the MIHI EST structure alternate with periods in which the construction gains in stability through more frequent use of the same existing combinations. Based on these facts, I claim that the MIHI EST construction shows a certain tendency toward expansion, since in present-day Romanian it can coerce nouns coming from other semantic fields into the construction’s psychological or physiological interpretation. The question arises whether the expansion of the MIHI EST construction constitutes sufficient evidence for a propensity in Romanian toward non-canonical marking of core arguments, which would go against the tendency of the European languages toward canonical marking. Further research covering other types of predicates, such as adjectives, adverbs or verbs that occur with non-canonical subjects, is required in order to validate this claim.

    Can human association norm evaluate latent semantic analysis?

    This paper compares a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
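
To make the corpus-based side of this comparison concrete, here is a minimal sketch of the association ratio usually attributed to Church and Hanks, a pointwise mutual information score, log2(P(x, y) / (P(x) P(y))). The windowing and the plain relative-frequency estimates are simplifying assumptions; the function name `association_ratio` is hypothetical, and this is not the paper's implementation.

```python
import math
from collections import Counter


def association_ratio(tokens, window=5):
    """Score ordered word pairs by a simplified association ratio.

    Assumptions (not from the source): y counts as co-occurring with x
    when it appears among the next `window - 1` tokens after x, and all
    probabilities are raw relative frequencies over the corpus length.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair[(x, y)] += 1
    scores = {}
    for (x, y), f_xy in pair.items():
        p_xy = f_xy / n
        p_x = unigram[x] / n
        p_y = unigram[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus; real association lists need a large corpus.
corpus = "strong tea strong tea weak coffee".split()
scores = association_ratio(corpus, window=2)
for pair_key, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair_key, round(score, 3))
```

Sorting the scores for a fixed cue word gives an automatically generated association list that could be set beside the responses collected from human participants.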

    Cappadocian kinship

    Cappadocian kinship systems are very interesting from a sociolinguistic and anthropological perspective because of the mixture of inherited Greek and borrowed Turkish kinship terms. Precisely because the number of Turkish kinship terms differs from one variety to another, it is necessary to talk about Cappadocian kinship systems in the plural rather than about the Cappadocian kinship system in the singular. Although reference will be made to other Cappadocian varieties, this paper will focus on the kinship systems of Mišotika and Aksenitika, the two Central Cappadocian dialects still spoken today in several communities in Greece. Particular attention will be given to the use of borrowed Turkish kinship terms, which sometimes seem to co-exist together with their inherited Greek counterparts, e.g. mána vs. néne ‘mother’, ailfó/aelfó vs. γardáš ‘brother’ etc. In the final part of the paper some kinship terms with obscure or hitherto unknown etymology will be discussed, e.g. káka ‘grandmother’, ižá ‘aunt’, lúva ‘uncle (father’s brother)’ etc.

    Leaving no stone unturned: flexible retrieval of idiomatic expressions from a large text corpus

    Idioms are multi-word expressions whose meaning cannot always be deduced from the literal meaning of constituent words. A key feature of idioms that is central to this paper is their peculiar mixture of fixedness and variability, which poses challenges for their retrieval from large corpora using traditional search approaches. These challenges hinder insights into idiom usage, affecting users who are conducting linguistic research as well as those involved in language education. To facilitate access to idiom examples taken from real-world contexts, we introduce an information retrieval system designed specifically for idioms. Given a search query that represents an idiom, typically in its canonical form, the system expands it automatically to account for the most common types of idiom variation including inflection, open slots, adjectival or adverbial modification, and passivisation. As a by-product of query expansion, other types of idiom variation captured include derivation, compounding, negation, distribution across multiple clauses as well as other unforeseen types of variation. The system was implemented on top of Elasticsearch, an open-source, distributed, scalable, real-time search engine. Flexible retrieval of idioms is supported by a combination of linguistic pre-processing of the search queries, their translation into a set of query clauses written in a query language called Query DSL, and analysis, an indexing process that involves tokenisation and normalisation. Our system outperformed the phrase search in terms of recall and outperformed the keyword search in terms of precision. Out of the three, our approach was found to provide the best balance between precision and recall. By providing a fast and easy way of finding idioms in large corpora, our approach can facilitate further developments in fields such as linguistics, language education and natural language processing.
Keywords: information retrieval; natural language processing; corpus linguistics; multi-word expressions; idiom
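
As a rough illustration of how an idiom query might be translated into Query DSL, the sketch below builds a `match_phrase` clause with a `slop` value, which lets a few extra tokens (for example a determiner or modifier) appear between the idiom's words. The field name `text`, the helper `build_idiom_query`, and the choice of `match_phrase` are assumptions made for illustration; the abstract does not say which clauses the authors' system generates, and inflection would still depend on how the index analyser normalises tokens.

```python
import json


def build_idiom_query(content_words, field="text", slop=2):
    """Build an Elasticsearch Query DSL body for a flexible idiom search.

    Hypothetical helper: `field` is an assumed index field name, and
    `slop` is the number of position moves the phrase match tolerates,
    which allows intervening words between the idiom's content words.
    """
    if not content_words:
        raise ValueError("at least one content word is required")
    return {
        "query": {
            "match_phrase": {
                field: {
                    "query": " ".join(content_words),
                    "slop": slop,
                }
            }
        }
    }


# Canonical idiom "spill the beans" reduced to its content words.
body = build_idiom_query(["spill", "beans"], slop=2)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request; a stricter phrase search corresponds to `slop=0`, while a plain keyword search drops the ordering and proximity constraints altogether.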

    Studies in the Grammar and Lexicon of Neo-Aramaic

    The Neo-Aramaic dialects are modern vernacular forms of Aramaic, which has a documented history in the Middle East of over 3,000 years. Due to upheavals in the Middle East over the last one hundred years, thousands of speakers of Neo-Aramaic dialects have been forced to migrate from their homes or have perished in massacres. As a result, the dialects are now highly endangered. The dialects exhibit a remarkable diversity of structures. Moreover, the considerable depth of attestation of Aramaic from earlier periods provides evidence for pathways of change. For these reasons, research on Neo-Aramaic is of importance for more general fields of linguistics, in particular language typology and historical linguistics. The papers in this volume represent the full range of research that is currently being carried out on Neo-Aramaic dialects. They advance the field in numerous ways. In order to allow linguists who are not specialists in Neo-Aramaic to benefit from the papers, the examples are fully glossed.

    Nodalida 2005 - proceedings of the 15th NODALIDA conference


    K + K = 120 : Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays


    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.