5 research outputs found
Morphological complexity of languages reflects the settlement history of the Americas
Morphological complexity is widely believed to increase with sociolinguistic isolation, and to decrease with language spreads and absorption of L2 adult learner populations. However, this can be assessed only for communities with well-described histories. Morphological complexity has also been shown to be greater in higher-altitude languages, which are often sociolinguistically isolated, so we use altitude as an empirically determinable proxy for sociolinguistic isolation. In past research, only a few small locations have been surveyed, and the measures of complexity used were family-specific and not easily generalizable. We apply several improved measures of complexity and show that the correlation holds, especially in the Andean regions of South America. We discuss the implications of the South American pattern for the settlement of the Americas and post-settlement prehistoric population formation. Peer reviewed
A Cheaper and Better Diffusion Language Model with Soft-Masked Noise
Diffusion models based on iterative denoising have recently been
proposed and leveraged in various generation tasks such as image generation.
However, because they were designed for continuous data, existing diffusion
models still have limitations in modeling discrete data, e.g., languages.
For example, the commonly used Gaussian noise cannot handle discrete
corruption well, and objectives in continuous spaces are unstable for
textual data in the diffusion process, especially when the dimension is high. To
alleviate these issues, we introduce a novel diffusion model for language
modeling, Masked-Diffuse LM, with lower training cost and better performance,
inspired by linguistic features in languages. Specifically, we design a
linguistically informed forward process that adds corruption to the text through
strategic soft-masking to better noise the textual data. We also directly
predict the categorical distribution with a cross-entropy loss function in every
diffusion step to connect the continuous space and discrete space in a more
efficient and straightforward way. Through experiments on 5 controlled
generation tasks, we demonstrate that our Masked-Diffuse LM can achieve better
generation quality than the state-of-the-art diffusion models with better
efficiency. Comment: Code is available at
https://github.com/amazon-science/masked-diffusion-l
Molecule Generation by Principal Subgraph Mining and Assembling
Molecule generation is central to a variety of applications. Recent
attention has focused on approaching the generation task as subgraph
prediction and assembling. Nevertheless, these methods usually rely on
hand-crafted or external subgraph construction, and the subgraph assembling
depends solely on local arrangement. In this paper, we define a novel notion,
principal subgraph, that is closely related to the informative pattern within
molecules. Interestingly, our proposed merge-and-update subgraph extraction
method can automatically discover frequent principal subgraphs from the
dataset, which previous methods cannot do. Moreover, we develop a
two-step subgraph assembling strategy, which first predicts a set of subgraphs
in a sequence-wise manner and then assembles all generated subgraphs globally
as the final output molecule. Built upon a graph variational auto-encoder, our
model is demonstrated to be effective in terms of several evaluation metrics
and efficiency, compared with state-of-the-art methods on distribution learning
and (constrained) property optimization tasks. Comment: Accepted by NeurIPS 202
The entangled cyberspace: an integrated approach for predicting cyber-attacks
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Significant studies in cyber defence analysis have predominantly revolved around a single linear analysis of information from a single source of evidence (The Network). These studies were limited in their ability to understand the dynamics of entanglements related to cyber-incidents. This research integrates evidence beyond the network in an attempt to understand and predict phases of the kill-chain across the information space.
This research provides a multi-dimensional phased analysis of the traditional kill-chain model using structural vector autoregressive models. In the ‘Entangled Cyberspace Framework’, each phase of the kill-chain corresponds to a single dimension of the information space based on time observations of certain events. Events are represented as time signals, where each phase is characterised by multiple time signals representing multiple events in that phase. Multiple time signals are analysed using structural models for multiple time series analysis (Vector Auto-Regressive models). At each phase of the kill-chain, we perform a lagged co-integration analysis of events across the information space. This kind of analysis detects hidden entanglements that characterise events in the kill-chain beyond the network. The prediction accuracy and error measured at each stage of the experiment represent the usefulness of the selected events in characterising the defined stage of the kill-chain.
The entangled cyberspace, in theory, is the fusion of three conceptual foundations: a) A multi-dimensional characterisation of cyberspace, b) A sequential phased model for perpetrating cyber-attacks and c) A structural model for integrating and simultaneously analysing multiple sources of evidence. It starts with the characterisation of the information space into different dimensions of interest. The framework goes further to identify evidence sources across these characterised dimensions and integrates them in the analytical context under consideration (e.g. Malware Injection).
The concrete findings show that our approach and analytical methodology are capable of detecting entanglements when applied to a set of entangled activities across the information space. The findings also prove that activities beyond the network have significant effects on the nature of the unfolding cyber-attack vector. The predictive features of events across the kill-chain were also presented in this research as opinion and emotion drivers on the social dimension, packet data details and social and cultural events on the economic layer. Finally, co-integration detected between events across and within dimensions of the information space proves the existence of both inter-dimensional and intra-dimensional entanglements that affect the nature of events unfolding during the kill-chain (from the adversary’s point of view).
The novelty of this research rests in the ability to hop across the information space for detecting evidential clues of activities that are related to cyber-incidents. This research also expands the standard multi-dimensional information space to include SPEC factors as indicators of cyber-incidents. This research improves the current information security management model, specifically in the monitoring, analysis and detection phases. This research provides a methodology that accommodates a robust evidence base for understanding the attack surface. Practically, this research provides a basis for creating applications and tools for protecting critical national infrastructure by integrating data from social platforms, real-world political, cultural and economic events and the cyber-physical