4,012 research outputs found

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Many existing approaches to program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularizing a system; it relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering them. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined by the flow of data between identifiers: a module is represented as a dependency graph in which the nodes correspond to identifiers and the edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects and show that introducing contexts for identifiers improves the quality of the modularization of the software systems. Both context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that topics inferred by performing topic analysis on the contextual representations are more meaningful than those inferred from the plain representation of the documents. The proposed approach of introducing a context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization, and topic analysis.

    Deep Active Learning for Named Entity Recognition

    Deep learning has yielded state-of-the-art performance on many natural language processing tasks including named entity recognition (NER). However, this typically requires large amounts of labeled data. In this work, we demonstrate that the amount of labeled training data can be drastically reduced when deep learning is combined with active learning. While active learning is sample-efficient, it can be computationally expensive since it requires iterative retraining. To speed this up, we introduce a lightweight architecture for NER, viz., the CNN-CNN-LSTM model consisting of convolutional character and word encoders and a long short-term memory (LSTM) tag decoder. The model achieves nearly state-of-the-art performance on standard datasets for the task while being computationally much more efficient than the best-performing models. We carry out incremental active learning during the training process and are able to nearly match state-of-the-art performance with just 25% of the original training data.

    Spatial evolution of human dialects

    The geographical pattern of human dialects is a result of history. Here, we formulate a simple spatial model of language change which shows that the final result of this historical evolution may, to some extent, be predictable. The model shows that the boundaries of language dialect regions are controlled by a length minimizing effect analogous to surface tension, mediated by variations in population density which can induce curvature, and by the shape of coastline or similar borders. The predictability of dialect regions arises because these effects will drive many complex, randomized early states toward one of a smaller number of stable final configurations. The model is able to reproduce observations and predictions of dialectologists. These include dialect continua, isogloss bundling, fanning, the wave-like spread of dialect features from cities, and the impact of human movement on the number of dialects that an area can support. The model also provides an analytical form for Séguy's Curve giving the relationship between geographical and linguistic distance, and a generalisation of the curve to account for the presence of a population centre. A simple modification allows us to analytically characterize the variation of language use by age in an area undergoing linguistic change.

    Mind the Orthography: Revisiting the Contribution of Prereading Phonological Awareness to Reading Acquisition

    Published Online First March 21, 2022. Reading acquisition is based on a set of preliteracy skills that lay the foundation for future reading abilities. Phonological awareness (the ability to identify and manipulate the sound units of oral language) has been reported to play a central role in reading acquisition. However, current evidence is mixed with respect to its universal contribution to reading acquisition across orthographies. This longitudinal study examines the development and contribution of phonological awareness to early reading skills in Spanish, a transparent orthography. The results of a comprehensive battery of phonological awareness skills in a large sample of children (Time 1 n = 616, 296 females, mean age 5.6, from middle to high socioeconomic backgrounds; Time 2 n = 397) with no reading experience at study onset suggest that the development of phonological awareness is delayed in Spanish. Furthermore, our results show that phonological awareness does not contribute to the prediction of reading acquisition above and beyond other preliteracy skills. Letter knowledge indexes children’s ability to identify phonemes and thus takes a more central role in the prediction of early reading skills. Therefore, we underscore the need to thoughtfully address the distinctive features of the reading acquisition process across orthographies, which should be taken into account in models of reading and learning to read. This project was funded by ANII FSED_2_2015_1_120741 and ANII FSED_2_2016_1_131230 grants. Camila Zugarramurdi received a PhD Scholarship from Fundación Carolina.

    The source ambiguity problem: Distinguishing the effects of grammar and processing on acceptability judgments

    Judgments of linguistic unacceptability may theoretically arise from either grammatical deviance or significant processing difficulty. Acceptability data are thus naturally ambiguous in theories that explicitly distinguish formal and functional constraints. Here, we consider this source ambiguity problem in the context of Superiority effects: the dispreference for ordering a wh-phrase in front of a syntactically “superior” wh-phrase in multiple wh-questions, e.g., What did who buy? More specifically, we consider the acceptability contrast between such examples and so-called D-linked examples, e.g., Which toys did which parents buy? Evidence from acceptability and self-paced reading experiments demonstrates that (i) judgments and processing times for Superiority violations vary in parallel, as determined by the kind of wh-phrases they contain, (ii) judgments increase with exposure, while processing times decrease, (iii) reading times are highly predictive of acceptability judgments for the same items, and (iv) the effects of the complexity of the wh-phrases combine in both acceptability judgments and reading times. This evidence supports the conclusion that D-linking effects are likely reducible to independently motivated cognitive mechanisms whose effects emerge in a wide range of sentence contexts. This in turn suggests that Superiority effects, in general, may owe their character to differential processing difficulty.

    A Deflationary Account of Mental Representation

    Among the cognitive capacities of evolved creatures is the capacity to represent. Theories in cognitive neuroscience typically explain our manifest representational capacities by positing internal representations, but there is little agreement about how these representations function, especially with the relatively recent proliferation of connectionist, dynamical, embodied, and enactive approaches to cognition. In this talk I sketch an account of the nature and function of representation in cognitive neuroscience that couples a realist construal of representational vehicles with a pragmatic account of mental content. I call the resulting package a deflationary account of mental representation, and I argue that it avoids the problems that afflict competing accounts.