
    Morphological complexity of languages reflects the settlement history of the Americas

    Morphological complexity is widely believed to increase with sociolinguistic isolation and to decrease with language spreads and the absorption of adult L2 learner populations. However, this can be assessed only for communities with well-described histories. Morphological complexity has also been shown to be greater in higher-altitude languages, which are often sociolinguistically isolated, so we use altitude as an empirically determinable proxy for sociolinguistic isolation. Past research has surveyed only a few small regions, using measures of complexity that were family-specific and not easily generalizable. We apply several improved measures of complexity and show that the correlation holds, especially in the Andean regions of South America. We discuss the implications of the South American pattern for the settlement of the Americas and for post-settlement prehistoric population formation. Peer reviewed.
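
    The core statistical claim is a correlation between a per-language complexity score and the altitude of the speech community. As a rough illustration of the kind of rank-correlation test this implies (the data and complexity scores below are entirely hypothetical, not the authors' measures or pipeline), a minimal sketch in Python:

    ```python
    # Hypothetical per-language data: (complexity score, altitude in meters).
    # Spearman's rank correlation is a natural choice here because it makes
    # no linearity assumption about the complexity-altitude relationship.
    from scipy.stats import spearmanr

    languages = {
        "lang_a": (0.72, 3800),  # high-altitude Andean-like case
        "lang_b": (0.31, 150),   # lowland case
        "lang_c": (0.65, 2900),
        "lang_d": (0.40, 600),
        "lang_e": (0.58, 2200),
    }

    complexity = [c for c, _ in languages.values()]
    altitude = [a for _, a in languages.values()]

    rho, p_value = spearmanr(complexity, altitude)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
    ```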

    A Cheaper and Better Diffusion Language Model with Soft-Masked Noise

    Diffusion models based on iterative denoising have recently been proposed and applied to various generation tasks such as image generation. However, because they are inherently built for continuous data, existing diffusion models still have limitations in modeling discrete data such as language. For example, the commonly used Gaussian noise cannot handle discrete corruption well, and objectives defined in continuous spaces are unstable for textual data in the diffusion process, especially when the dimension is high. To alleviate these issues, we introduce a novel diffusion model for language modeling, Masked-Diffuse LM, with lower training cost and better performance, inspired by linguistic features of language. Specifically, we design a linguistically informed forward process that corrupts the text through strategic soft-masking to better noise the textual data. We also directly predict the categorical distribution with a cross-entropy loss at every diffusion step, connecting the continuous and discrete spaces in a more efficient and straightforward way. Through experiments on 5 controlled generation tasks, we demonstrate that our Masked-Diffuse LM achieves better generation quality than state-of-the-art diffusion models with better efficiency. Comment: Code is available at https://github.com/amazon-science/masked-diffusion-lm
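
    Two of the abstract's ideas lend themselves to a short sketch: a forward process that corrupts text by masking tokens instead of adding Gaussian noise, and a cross-entropy objective in token space at every diffusion step. The sketch below uses a uniform linear masking schedule and a toy denoiser as stand-ins; the paper's "strategic" soft-masking weights tokens by linguistic importance, which is abstracted away here.

    ```python
    # Sketch of soft-masked discrete diffusion. Assumptions (not from the
    # paper): a linear masking schedule and a toy denoiser network.
    import torch
    import torch.nn.functional as F

    VOCAB_SIZE, MASK_ID, T = 1000, 0, 100

    def soft_mask(tokens: torch.Tensor, t: int) -> torch.Tensor:
        """Forward process: mask each token with probability t / T."""
        masked = torch.rand(tokens.shape) < t / T
        return torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    class TinyDenoiser(torch.nn.Module):
        """Stand-in for the denoising network; predicts token logits."""
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(VOCAB_SIZE, 64)
            self.out = torch.nn.Linear(64, VOCAB_SIZE)

        def forward(self, tokens, t):
            return self.out(self.emb(tokens))  # (batch, seq, vocab)

    def step_loss(model, tokens, t):
        """Cross-entropy against the clean text at diffusion step t,
        tying the model's continuous output back to the discrete space."""
        logits = model(soft_mask(tokens, t), t)
        return F.cross_entropy(logits.view(-1, VOCAB_SIZE), tokens.view(-1))

    model = TinyDenoiser()
    clean = torch.randint(1, VOCAB_SIZE, (2, 16))  # batch of token ids
    loss = step_loss(model, clean, t=30)
    loss.backward()
    ```

    Predicting the categorical distribution directly, rather than regressing continuous embeddings, is what lets a single cross-entropy term serve as the training signal at every step.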

    Molecule Generation by Principal Subgraph Mining and Assembling

    Molecule generation is central to a variety of applications. Recent attention has focused on approaching the generation task as subgraph prediction and assembly. Nevertheless, these methods usually rely on hand-crafted or external subgraph construction, and subgraph assembly depends solely on local arrangement. In this paper, we define a novel notion, the principal subgraph, which is closely related to the informative patterns within molecules. Interestingly, our proposed merge-and-update subgraph extraction method can automatically discover frequent principal subgraphs from the dataset, which previous methods cannot. Moreover, we develop a two-step subgraph assembly strategy, which first predicts a set of subgraphs in a sequence-wise manner and then assembles all generated subgraphs globally into the final output molecule. Built upon a graph variational auto-encoder, our model is demonstrated to be effective in terms of several evaluation metrics and efficiency, compared with state-of-the-art methods on distribution learning and (constrained) property optimization tasks. Comment: Accepted by NeurIPS 2022
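
    The merge-and-update extraction reads like a BPE-style loop over molecular fragments: repeatedly count neighboring fragment pairs across the corpus and merge the most frequent pair into a larger fragment. Below is a minimal sketch under a strong simplifying assumption: molecules are linearized as atom sequences rather than graphs, whereas the paper's method merges adjacent subgraphs node-wise on the molecular graph itself.

    ```python
    # BPE-style analogue of merge-and-update fragment extraction.
    # Assumption (not the paper's setting): molecules are flat atom
    # sequences, so "adjacent" means sequence-adjacent.
    from collections import Counter

    def extract_fragments(molecules: list[list[str]], num_merges: int):
        merges = []
        for _ in range(num_merges):
            # Count every neighboring fragment pair across the corpus
            pairs = Counter()
            for mol in molecules:
                pairs.update(zip(mol, mol[1:]))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            # Update: rewrite each molecule using the merged fragment
            for i, mol in enumerate(molecules):
                out, j = [], 0
                while j < len(mol):
                    if j + 1 < len(mol) and (mol[j], mol[j + 1]) == best:
                        out.append(mol[j] + mol[j + 1])
                        j += 2
                    else:
                        out.append(mol[j])
                        j += 1
                molecules[i] = out
        return merges

    mols = [list("CCOCC"), list("CCNCC"), list("CCOC")]
    print(extract_fragments(mols, num_merges=3))
    ```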