
    "Is There Choice in Non-Native Voice?" Linguistic Feature Engineering and a Variationist Perspective in Automatic Native Language Identification

    Is it possible to infer the native language of an author from a non-native text? Can we perform this task fully automatically? Interest in answers to these questions led to the emergence of a research field called Native Language Identification (NLI) in the first decade of this century. The requirement to automatically identify a particular property based on some language data situates the task at the intersection of computer science and linguistics, that is, in the context of computational linguistics, which combines both disciplines. This thesis targets several relevant research questions in the context of NLI. In particular, what is the role of surface features versus more abstract linguistic cues? How can different sets of features be combined, and how can the resulting large models be optimized? Do the findings generalize across different data sets? Can we benefit from considering the task in the light of language variation theory? To approach these questions, we conduct a range of quantitative and qualitative explorations, employing different machine learning techniques. We show how linguistic insight can advance technology, and how technology can advance linguistic insight, constituting a fruitful and promising interplay.

    Native Language Identification Across Text Types: How Special Are Scientists?

    Native Language Identification (NLI) is the task of recognizing the native language of an author from text that they wrote in another language. In this paper, we investigate the generalizability of NLI models across learner corpora, and from learner corpora to a new text type, namely scientific articles. Our main results are: (a) the science corpus is not harder to model than some learner corpora; (b) it cannot profit as much as learner corpora from corpus combination via domain adaptation; (c) this pattern can be explained in terms of the respective models focusing on language transfer and topic indicators to different extents.

    Simple Yet Powerful Native Language Identification on TOEFL11

    Native language identification (NLI) is the task of determining the native language of an author based on an essay written in a second language. NLI is often treated as a classification problem. In this paper, we use the TOEFL11 data set, which contains more data, in terms of the number of essays and languages, and is less biased across prompts, i.e., topics, of essays. We demonstrate that even using word-level n-grams as features and a support vector machine (SVM) as a classifier can yield nearly 80% accuracy. We observe that the accuracy of a binary word-level n-gram representation (~80%) is much better than that of a frequency-based word-level n-gram representation (~20%). Notably, comparable results can be achieved without removing punctuation marks, suggesting a very simple baseline system for NLI.
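The contrast between the two feature representations can be sketched with the standard library alone. The tokenizer, toy essay, and vocabulary below are invented for illustration; a real system would feed such vectors to a linear SVM rather than print them.

```python
from collections import Counter

def word_ngrams(text, n=1):
    """Split text into word-level n-grams (a simple whitespace tokenizer)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_features(text, vocab, n=1):
    """Frequency-based representation: raw n-gram counts per vocabulary entry."""
    counts = Counter(word_ngrams(text, n))
    return [counts[v] for v in vocab]

def binary_features(text, vocab, n=1):
    """Binary representation: 1 if the n-gram occurs at all, 0 otherwise."""
    counts = Counter(word_ngrams(text, n))
    return [1 if counts[v] > 0 else 0 for v in vocab]

essay = "the the the cat sat on the mat"
vocab = ["the", "cat", "mat", "dog"]
print(frequency_features(essay, vocab))  # [4, 1, 1, 0]
print(binary_features(essay, vocab))     # [1, 1, 1, 0]
```

The binary vectors discard how often an n-gram recurs, which, per the abstract, is what makes them the stronger representation for this task.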

    Identifying and Disentangling Interleaved Activities of Daily Living from Sensor Data

    Activity discovery (AD) refers to the unsupervised extraction of structured activity data from a stream of sensor readings in a real-world or virtual environment. Activity discovery is part of the broader topic of activity recognition, which has potential uses in fields as varied as social work and elder care, psychology and intrusion detection. Since activity recognition datasets are both hard to come by, and very time consuming to label, the development of reliable activity discovery systems could be of significant utility to the researchers and developers working in the field, as well as to the wider machine learning community. This thesis focuses on the investigation of activity discovery systems that can deal with interleaving, which refers to the phenomenon of continuous switching between multiple high-level activities over a short period of time. This is a common characteristic of the real-world datastreams that activity discovery systems have to deal with, but it is one that is unfortunately often left unaddressed in the existing literature. As part of the research presented in this thesis, the fact that activities exist at multiple levels of abstraction is highlighted. A single activity is often a constituent element of a larger, more complex activity, and in turn has constituents of its own that are activities. Thus this investigation necessarily considers activity discovery systems that can find these hierarchies. The primary contribution of this thesis is the development and evaluation of an activity discovery system that is capable of identifying interleaved activities in sequential data. Starting from a baseline system implemented using a topic model, novel approaches are proposed making use of modern language models taken from the field of natural language processing, before moving on to more advanced language modelling that can handle complex, interleaved data. 
As well as the identification of activities, the thesis also proposes the abstraction of activities into larger, more complex activities. This allows for the construction of hierarchies of activities that more closely reflect the complex inherent structure of activities present in real-world datasets compared to other approaches. The thesis also discusses a number of important issues relating to the evaluation of activity discovery systems, and examines how existing evaluation metrics may at times be misleading. This includes highlighting the existence of differing abstraction issues in activity discovery evaluation, and suggestions for how this problem can be mitigated. Finally, alternative evaluation metrics are investigated. Naturally, this dissertation does not fully solve the problem of activity discovery, and work remains to be done. However, a number of the most pressing issues that affect real-world activity discovery systems are tackled head-on, and it is shown that useful progress can indeed be made on them. This work aims to benefit systems that are as “clean slate” as possible, and hence incorporates no domain-specific knowledge. This is perhaps somewhat of an artificial handicap to impose in this problem domain, but it does have the advantage of making this work applicable to as broad a range of domains as possible.

    EOOLT 2007 – Proceedings of the 1st International Workshop on Equation-Based Object-Oriented Languages and Tools

    Computer aided modeling and simulation of complex systems, using components from multiple application domains, such as electrical, mechanical, hydraulic, control, etc., have in recent years witnessed a significant growth of interest. In the last decade, novel equation-based object-oriented (EOO) modeling languages (e.g. Modelica, gPROMS, and VHDL-AMS) based on acausal modeling using equations have appeared. Using such languages, it has become possible to model complex systems covering multiple application domains at a high level of abstraction through reusable model components. The interest in EOO languages and tools is rapidly growing in the industry because of their increasing importance in modeling, simulation, and specification of complex systems. There exist several different EOO language communities today that grew out of different application areas (multi-body system dynamics, electronic circuit simulation, chemical process engineering). The members of these disparate communities rarely talk to each other in spite of the similarities of their modeling and simulation needs. The EOOLT workshop series aims at bringing these different communities together to discuss their common needs and goals as well as the algorithms and tools that best support them. Despite the short deadlines and the fact that this is a new, not yet well-established workshop series, there was a good response to the call-for-papers. Thirteen papers and one presentation were accepted to the workshop program. All papers were subject to reviews by the program committee, and are present in these electronic proceedings. The workshop program started with a welcome and introduction to the area of equation-based object-oriented languages, followed by paper presentations and discussion sessions after presentations of each set of related papers. On behalf of the program committee, the Program Chairmen would like to thank all those who submitted papers to EOOLT'2007.
Special thanks go to David Broman who created the web page and helped with organization of the workshop. Many thanks to the program committee for reviewing the papers. EOOLT'2007 was hosted by the Technical University of Berlin, in conjunction with the ECOOP'2007 conference.

    Quantitative determinants of prefabs: A corpus-based, experimental study of multiword units in the lexicon

    In recent years many researchers have been rethinking the 'Words and Rules' model of syntax (Pinker 1999), instead arguing that language processing relies on a large number of preassembled multiword units, or 'prefabs' (Bolinger 1976). A usage-based perspective predicts that linguistic units, including prefabs, arise via repeated use, and prefabs should thus be associated with the frequency with which words co-occur (Langacker 1987). Indeed, in several recent experiments, corpus analysis is found to be associated with behavioral measures for multiword sequences (Kapatsinski and Radicke 2009, Ellis and Simpson-Vlach 2009). This dissertation supplements such findings with two new psycholinguistic investigations of prefabs. Study 1 revisits a dictation experiment by Schmitt et al. (2004), in which participants are asked to listen to stretches of speech and repeat the input verbatim, after performing a distractor task intended to encourage reliance on prefabs. I describe the results of an updated experiment which demonstrates that participants are less likely to interrupt or partially alter high-frequency multiword sequences. Although the original study by Schmitt et al. (2004) reported null findings, the revised methodology suggests that frequency indeed plays a role in the creation of prefabs. Study 2 investigates the distribution of affix positioning errors (he go aheads) which give evidence that some multiword sequences (e.g., go ahead) are retrieved from memory as a unit. As part of this study, I describe a novel methodology which elicits the errors of interest in an experimental setting. Errors evincing holistic retrieval are induced more often among multiword sequences that are high in Mutual Dependency, a corpus measure that weighs a sequence's frequency against the frequencies of its component words.
Follow-up analyses indicate that sequence frequency is positively associated with affix errors, but only if component-word frequencies are included as variables in the model. In sum, the studies in this dissertation provide evidence that prefabricated, multiword units are associated with high frequency of a sequence, in addition to statistical measures that take component words' frequency into account. These findings provide further support for a usage-based model of the lexicon, in which linguistic units are both gradient and changeable with experience.
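The Mutual Dependency measure mentioned above can be sketched as follows, assuming the common pointwise formulation that divides the squared sequence probability by the product of the component-word probabilities (Thanopoulos et al. 2002). The corpus figures below are invented for illustration, and the dissertation's exact formula may differ.

```python
import math

def mutual_dependency(seq_count, word_counts, total_tokens):
    """Mutual Dependency of a multiword sequence:
    log2( p(sequence)^2 / (p(w1) * ... * p(wn)) ).
    High values mean the sequence is frequent relative to what its
    component words' individual frequencies would predict."""
    p_seq = seq_count / total_tokens
    p_words = 1.0
    for count in word_counts:
        p_words *= count / total_tokens
    return math.log2(p_seq ** 2 / p_words)

# Invented figures: "go ahead" occurs 50 times in a 1,000,000-token corpus;
# "go" occurs 2,000 times and "ahead" 500 times.
md = mutual_dependency(50, [2000, 500], 1_000_000)
print(round(md, 2))  # -8.64
```

Because the sequence probability is squared, the measure rewards sequences whose joint frequency is high relative to their parts, which matches the abstract's description of weighing a sequence's frequency against its component words' frequencies.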

    A journey through learner language: tracking development using POS tag sequences in large-scale learner data

    This PhD study comes at a crossroads of SLA studies and corpus linguistics methodology, using a bottom-up, data-first approach to throw light on second language development. Taking POS tag n-gram sequences as a starting point, searching the data from the outermost syntactic layer available in corpus tools, it is an investigation of grammatical development in learner language across the six proficiency levels in the 52-million-word, CEFR-benchmarked, quasi-longitudinal Cambridge Learner Corpus. It takes a mixed methods approach, first examining the frequency and distribution of POS tag sequences by level, identifying convergence and divergence, and secondly looking qualitatively at form-meaning mappings of sequences at differing levels. It seeks to observe whether there are sequences which characterise levels and which might index the transition between levels. It investigates sequence use at a lexical and functional level and explores whether this can contribute to our understanding of how a generic repertoire of learner language develops. It aims to contribute to the theoretical debate by looking critically at how current theories of language development and description might account for learner language development. It responds to the call to look at large-scale learner data, and benefits from privileged access to such longitudinal data, acknowledging the limitations of any corpus data and the need to triangulate across different datasets. It seeks to illustrate how L2 language use converges and diverges across proficiency levels and to investigate convergence and divergence between L1 and L2 usage.
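The quantitative first step (frequency and distribution of POS tag sequences by level) can be sketched with the standard library. The tag set, sentences, and level labels below are invented stand-ins for the actual CEFR-levelled corpus data and tools used in the study.

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """Sliding window of POS-tag n-grams over one tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Invented tagged sentences grouped by proficiency level.
levels = {
    "A2": [["PRON", "VERB", "NOUN"], ["PRON", "VERB", "ADJ", "NOUN"]],
    "C1": [["ADV", "PRON", "AUX", "VERB", "DET", "ADJ", "NOUN"]],
}

# Relative frequency of each trigram within a level, so distributions
# can be compared across levels despite differing corpus sizes.
for level, sentences in levels.items():
    counts = Counter(ng for sent in sentences for ng in pos_ngrams(sent))
    total = sum(counts.values())
    for ngram, count in counts.most_common(1):
        print(level, ngram, round(count / total, 3))
```

Normalising counts within each level is what makes per-level distributions comparable; convergence and divergence can then be read off from where the distributions agree or differ.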

    Mixing Methods: Practical Insights from the Humanities in the Digital Age

    The digital transformation is accompanied by two simultaneous processes: the digital humanities are challenging the humanities, their theories, methodologies and disciplinary identities, while pushing computer science to get involved in new fields. But how can qualitative and quantitative methods be usefully combined in one research project? What are the theoretical and methodological principles across all disciplinary digital approaches? This volume focusses on driving innovation and conceptualising the humanities in the 21st century. Building on the results of 10 research projects, it serves as a useful tool for designing cutting-edge research that goes beyond conventional strategies.

    Computer Vision and Architectural History at Eye Level: Mixed Methods for Linking Research in the Humanities and in Information Technology

    Information on the history of architecture is embedded in our daily surroundings, in vernacular and heritage buildings and in physical objects, photographs and plans. Historians study these tangible and intangible artefacts and the communities that built and used them. Thus valuable insights are gained into the past and the present, as they also provide a foundation for designing the future. Given that our understanding of the past is limited by the inadequate availability of data, the article demonstrates that advanced computer tools can help gain more and well-linked data from the past. Computer vision can make a decisive contribution to the identification of image content in historical photographs. This application is particularly interesting for architectural history, where visual sources play an essential role in understanding the built environment of the past, yet lack of reliable metadata often hinders the use of materials. The automated recognition contributes to making a variety of image sources usable for research.