
    Morphological complexity of languages reflects the settlement history of the Americas

    Morphological complexity is widely believed to increase with sociolinguistic isolation and to decrease with language spreads and the absorption of adult L2 learner populations. However, this can be assessed only for communities with well-described histories. Morphological complexity has also been shown to be greater in higher-altitude languages, which are often sociolinguistically isolated, so we use altitude as an empirically determinable proxy for sociolinguistic isolation. Past research has surveyed only a few small regions, using measures of complexity that were family-specific and not easily generalizable. We apply several improved measures of complexity and show that the correlation holds, especially in the Andean regions of South America. We discuss the implications of the South American pattern for the settlement of the Americas and for post-settlement prehistoric population formation. Peer reviewed.
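
    The core statistical claim is a correlation between a per-language complexity score and the altitude of the speech community. As a rough illustration of the kind of rank-correlation test this implies (the data and complexity scores below are entirely hypothetical, not the authors' measures or pipeline), a minimal sketch in Python:

    ```python
    # Hypothetical per-language data: (complexity score, altitude in meters).
    # Spearman's rank correlation is a natural choice here because it makes
    # no linearity assumption about the complexity-altitude relationship.
    from scipy.stats import spearmanr

    languages = {
        "lang_a": (0.72, 3800),  # high-altitude Andean-like case
        "lang_b": (0.31, 150),   # lowland case
        "lang_c": (0.65, 2900),
        "lang_d": (0.40, 600),
        "lang_e": (0.58, 2200),
    }

    complexity = [c for c, _ in languages.values()]
    altitude = [a for _, a in languages.values()]

    rho, p_value = spearmanr(complexity, altitude)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
    ```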

    A Cheaper and Better Diffusion Language Model with Soft-Masked Noise

    Diffusion models based on iterative denoising have recently been proposed and applied to various generation tasks such as image generation. However, because they are inherently built for continuous data, existing diffusion models still have limitations in modeling discrete data such as language. For example, the commonly used Gaussian noise cannot handle discrete corruption well, and objectives defined in continuous spaces are unstable for textual data in the diffusion process, especially when the dimension is high. To alleviate these issues, we introduce a novel diffusion model for language modeling, Masked-Diffuse LM, with lower training cost and better performance, inspired by linguistic features of language. Specifically, we design a linguistically informed forward process that corrupts the text through strategic soft-masking to better noise the textual data. We also directly predict the categorical distribution with a cross-entropy loss at every diffusion step, connecting the continuous and discrete spaces in a more efficient and straightforward way. Through experiments on 5 controlled generation tasks, we demonstrate that our Masked-Diffuse LM achieves better generation quality than state-of-the-art diffusion models with better efficiency. Comment: Code is available at https://github.com/amazon-science/masked-diffusion-lm
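
    Two of the abstract's ideas lend themselves to a short sketch: a forward process that corrupts text by masking tokens instead of adding Gaussian noise, and a cross-entropy objective in token space at every diffusion step. The sketch below uses a uniform linear masking schedule and a toy denoiser as stand-ins; the paper's "strategic" soft-masking weights tokens by linguistic importance, which is abstracted away here.

    ```python
    # Sketch of soft-masked discrete diffusion. Assumptions (not from the
    # paper): a linear masking schedule and a toy denoiser network.
    import torch
    import torch.nn.functional as F

    VOCAB_SIZE, MASK_ID, T = 1000, 0, 100

    def soft_mask(tokens: torch.Tensor, t: int) -> torch.Tensor:
        """Forward process: mask each token with probability t / T."""
        masked = torch.rand(tokens.shape) < t / T
        return torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    class TinyDenoiser(torch.nn.Module):
        """Stand-in for the denoising network; predicts token logits."""
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(VOCAB_SIZE, 64)
            self.out = torch.nn.Linear(64, VOCAB_SIZE)

        def forward(self, tokens, t):
            return self.out(self.emb(tokens))  # (batch, seq, vocab)

    def step_loss(model, tokens, t):
        """Cross-entropy against the clean text at diffusion step t,
        tying the model's continuous output back to the discrete space."""
        logits = model(soft_mask(tokens, t), t)
        return F.cross_entropy(logits.view(-1, VOCAB_SIZE), tokens.view(-1))

    model = TinyDenoiser()
    clean = torch.randint(1, VOCAB_SIZE, (2, 16))  # batch of token ids
    loss = step_loss(model, clean, t=30)
    loss.backward()
    ```

    Predicting the categorical distribution directly, rather than regressing continuous embeddings, is what lets a single cross-entropy term serve as the training signal at every step.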

    Molecule Generation by Principal Subgraph Mining and Assembling

    Molecule generation is central to a variety of applications. Recent attention has focused on approaching the generation task as subgraph prediction and assembly. Nevertheless, these methods usually rely on hand-crafted or external subgraph construction, and subgraph assembly depends solely on local arrangement. In this paper, we define a novel notion, the principal subgraph, which is closely related to the informative patterns within molecules. Interestingly, our proposed merge-and-update subgraph extraction method can automatically discover frequent principal subgraphs from the dataset, which previous methods cannot. Moreover, we develop a two-step subgraph assembly strategy, which first predicts a set of subgraphs in a sequence-wise manner and then assembles all generated subgraphs globally into the final output molecule. Built upon a graph variational auto-encoder, our model is demonstrated to be effective in terms of several evaluation metrics and efficiency, compared with state-of-the-art methods on distribution learning and (constrained) property optimization tasks. Comment: Accepted by NeurIPS 2022
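
    The merge-and-update extraction reads like a BPE-style loop over molecular fragments: repeatedly count neighboring fragment pairs across the corpus and merge the most frequent pair into a larger fragment. Below is a minimal sketch under a strong simplifying assumption: molecules are linearized as atom sequences rather than graphs, whereas the paper's method merges adjacent subgraphs node-wise on the molecular graph itself.

    ```python
    # BPE-style analogue of merge-and-update fragment extraction.
    # Assumption (not the paper's setting): molecules are flat atom
    # sequences, so "adjacent" means sequence-adjacent.
    from collections import Counter

    def extract_fragments(molecules: list[list[str]], num_merges: int):
        merges = []
        for _ in range(num_merges):
            # Count every neighboring fragment pair across the corpus
            pairs = Counter()
            for mol in molecules:
                pairs.update(zip(mol, mol[1:]))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            # Update: rewrite each molecule using the merged fragment
            for i, mol in enumerate(molecules):
                out, j = [], 0
                while j < len(mol):
                    if j + 1 < len(mol) and (mol[j], mol[j + 1]) == best:
                        out.append(mol[j] + mol[j + 1])
                        j += 2
                    else:
                        out.append(mol[j])
                        j += 1
                molecules[i] = out
        return merges

    mols = [list("CCOCC"), list("CCNCC"), list("CCOC")]
    print(extract_fragments(mols, num_merges=3))
    ```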