
    To Be or Not To Be?

    This squib accounts for the inconsistencies in the occurrence of genitive of negation with the Russian verb byt’ ‘to be’ and other genitive verbs by distinguishing two independent lexical entries for byt’ with a specified location, which differ in their syntactic and semantic characteristics. One is predicative and argument-taking; the other is the copula in a copular construction with a locational prepositional predicate. Sentential negation invariably assigns genitive of negation to the grammatical subject of the copular construction, whereas the subject of predicative byt’ is in the wrong syntactic configuration to receive genitive of negation and therefore receives nominative case via agreement with the finite Infl.

    Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation

    In Neural Machine Translation (and, more generally, conditional language modeling), the generation of a target token is influenced by two types of context: the source and the prefix of the target sequence. While many attempts have been made to understand the internal workings of NMT models, none of them explicitly evaluates the relative source and target contributions to a generation decision. We argue that this relative contribution can be evaluated by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying ‘conservation principle’ makes relevance propagation unique: unlike other methods, it evaluates not an abstract quantity reflecting token importance, but the proportion of each token’s influence. We extend LRP to the Transformer and conduct an analysis of NMT models which explicitly evaluates the source and target relative contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes, when varying the training objective or the amount of training data, and during the training process. We find that models trained with more data tend to rely on source information more and to have sharper token contributions, and that the training process is non-monotonic, with several stages of a different nature.
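    Once per-token relevances are available, the quantity the abstract measures reduces to a simple ratio. Below is a minimal sketch (not the authors' code) of that final step; the relevance values are invented placeholders for what a full LRP backward pass through a Transformer would produce, and the conservation principle is what licenses reading the ratio as a proportion of influence.

```python
# Minimal sketch: under LRP's conservation principle, the relevances attributed
# to all input tokens at a generation step sum to a fixed total, so the
# relative source contribution is the share of that total assigned to source
# tokens. The numbers below are made up for illustration.

def relative_source_contribution(source_relevance, target_relevance):
    """Fraction of total relevance attributed to the source tokens."""
    total = sum(source_relevance) + sum(target_relevance)
    return sum(source_relevance) / total

src_rel = [0.30, 0.25, 0.10]   # relevance of each source token
tgt_rel = [0.20, 0.15]         # relevance of each target-prefix token
print(relative_source_contribution(src_rel, tgt_rel))  # ~0.65
```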

    When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

    Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents, while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis, and lexical cohesion as the three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available than data aligned at the document level. We introduce a model suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU. Comment: ACL 2019 (camera-ready)
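    A minimal sketch of the asymmetric data set-up the abstract describes, under loudly hypothetical field names (`src`, `tgt`, `doc_id`, `pos` are our assumptions, not the paper's): every pair can train the context-agnostic baseline, while only the much smaller subset carrying document information can be grouped into the ordered contexts a context-aware model needs.

```python
from collections import defaultdict

# Hypothetical parallel data: only some pairs carry a document id.
pairs = [
    {"src": "Where is he?", "tgt": "Где он?",  "doc_id": "film_01", "pos": 0},
    {"src": "He left.",     "tgt": "Он ушёл.", "doc_id": "film_01", "pos": 1},
    {"src": "Thanks.",      "tgt": "Спасибо.", "doc_id": None,      "pos": None},
]

sentence_level = pairs  # all pairs train the context-agnostic baseline

# Group the document-level fraction by document and restore sentence order,
# so a model can condition on the preceding sentences.
by_doc = defaultdict(list)
for p in pairs:
    if p["doc_id"] is not None:
        by_doc[p["doc_id"]].append(p)
document_level = {d: sorted(ps, key=lambda p: p["pos"]) for d, ps in by_doc.items()}

print(len(sentence_level), sum(len(ps) for ps in document_level.values()))  # 3 2
```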

    Constructing Temporal Networks of OSS Programming Language Ecosystems

    One of the primary factors that encourage developers to contribute to open source software (OSS) projects is the collaborative nature of OSS development. However, the collaborative structure of these communities remains largely unclear, partly due to the enormous scale of the data to be gathered, processed, and analyzed. In this work, we utilize the World of Code dataset, which contains commit activity data for millions of OSS projects, to build collaboration networks for ten popular programming language ecosystems, containing in total over 290M commits across over 18M projects. We build a collaboration graph representation for each language ecosystem, with authors and projects as nodes, which enables various forms of social network analysis at the scale of language ecosystems. Moreover, we capture information on the ecosystems' evolution by slicing each network into 30 historical snapshots. Additionally, we calculate multiple collaboration metrics that characterize the ecosystems' states. We make the resulting dataset publicly available, including the constructed graphs and the pipeline enabling the analysis of more ecosystems. Comment: Accepted to SANER 202
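    A minimal sketch of the construction the abstract describes: a bipartite author-project graph built from commit records, sliced into historical snapshots by a cut-off date. The commit records are made up, and networkx is our assumed stand-in for the paper's actual pipeline.

```python
import networkx as nx

# Invented commit records: (author, project, time of commit).
commits = [
    {"author": "alice", "project": "proj/a", "time": 2015},
    {"author": "bob",   "project": "proj/a", "time": 2018},
    {"author": "alice", "project": "proj/b", "time": 2020},
]

def snapshot(commits, cutoff):
    """Bipartite author-project graph from commits made up to `cutoff`."""
    g = nx.Graph()
    for c in commits:
        if c["time"] <= cutoff:
            g.add_node(c["author"], kind="author")
            g.add_node(c["project"], kind="project")
            g.add_edge(c["author"], c["project"])
    return g

# Slice the network into historical snapshots (the paper uses 30 per ecosystem).
snapshots = [snapshot(commits, year) for year in (2016, 2019, 2021)]
print([g.number_of_edges() for g in snapshots])  # [1, 2, 3]
```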

    A syntactic typology of topic, focus and contrast

    In this paper we argue for a typology of various information-structural functions in terms of three privative features: [topic], [focus], and [contrast] (see also Vallduví and Vilkuna 1998, Molnár 2002, McCoy 2003, and Giusti 2006). Aboutness topics and contrastive topics share the feature [topic], new-information foci and contrastive foci share the feature [focus], and contrastive topics and contrastive foci share the feature [contrast]. This typology is supported by data from Dutch (where only contrastive elements may undergo A'-scrambling), Japanese (where aboutness topics and contrastive topics must appear sentence-initially), and Russian (where new-information foci and contrastive foci share the same underlying position). To the best of our knowledge, there are no generalizations over information-structural functions that do not share one of the features adopted here.
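    The typology is easy to make concrete. Below is a minimal sketch (our illustration, not the authors' formalism) encoding each information-structural function as a set of privative features; the natural classes a grammar can target are exactly the functions sharing a feature, matching the Dutch, Japanese, and Russian patterns cited above.

```python
# Each function is a set of privative features, as in the abstract.
functions = {
    "aboutness topic":       {"topic"},
    "contrastive topic":     {"topic", "contrast"},
    "new-information focus": {"focus"},
    "contrastive focus":     {"focus", "contrast"},
}

def natural_class(feature):
    """All functions a grammar can target by referring to one feature."""
    return {name for name, feats in functions.items() if feature in feats}

# Dutch A'-scrambling targets [contrast]; Japanese sentence-initial position
# targets [topic]; the shared Russian position targets [focus].
print(natural_class("contrast"))  # contrastive topic + contrastive focus
print(natural_class("topic"))     # aboutness topic + contrastive topic
```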