
    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 202
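The one-to-many ambiguity described in this abstract can be illustrated with a minimal sketch. This is not the 2kenize model: the candidate table covers only a few well-known mappings, and `TOY_BIGRAM_SCORE` is a hypothetical stand-in for a real language model, invented here purely for illustration.

```python
from itertools import product

# A few well-known simplified -> traditional mappings.
# Some simplified characters have more than one traditional counterpart.
CANDIDATES = {
    "头": ["頭"],          # one-to-one
    "发": ["發", "髮"],    # 發 "emit, develop" vs. 髮 "hair"
    "后": ["後", "后"],    # 後 "after, behind" vs. 后 "empress"
}

# Hypothetical toy scores standing in for a language model (assumption).
TOY_BIGRAM_SCORE = {"頭髮": 5, "發展": 5}


def candidates(text):
    """Enumerate every traditional rendering of a simplified string."""
    options = [CANDIDATES.get(ch, [ch]) for ch in text]
    return ["".join(choice) for choice in product(*options)]


def score(text):
    """Sum the toy scores of all adjacent character pairs."""
    return sum(TOY_BIGRAM_SCORE.get(text[i:i + 2], 0)
               for i in range(len(text) - 1))


def convert(text):
    """Pick the highest-scoring candidate (the first one on ties)."""
    return max(candidates(text), key=score)


if __name__ == "__main__":
    print(convert("头发"))  # "hair"
    print(convert("发展"))  # "development"
```

A character-by-character lookup cannot choose between 發 and 髮 without context, which is why the sketch scores whole candidate strings. Exhaustive enumeration grows exponentially with the number of ambiguous characters, so it only serves as an illustration here.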

    Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

    Ranking functions in information retrieval are often used in search engines to recommend relevant answers to a query. This paper takes this notion from information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research focuses on the classification problem, that is, distinguishing whether a pair of words are cognates. This paper focuses on a harder problem: whether we can predict a possible cognate from the given input. Our study shows that language modelling smoothing methods, when applied as retrieval functions and used in conjunction with positional segmentation and error modelling, give better results than competing baselines in both classification and prediction of cognates. Source code is at: https://github.com/pranav-ust/cognates. Comment: Published at ACL-SRW 201

    Modeling highly pathogenic avian influenza transmission in wild birds and poultry in West Bengal, India.

    Wild birds are suspected to have played a role in highly pathogenic avian influenza (HPAI) H5N1 outbreaks in West Bengal. Cluster analysis showed that H5N1 was introduced in West Bengal at least 3 times between 2008 and 2010. We simulated the introduction of H5N1 by wild birds and their contact with poultry through a stochastic continuous-time mathematical model. Results showed that reducing contact between wild birds and domestic poultry and increasing the culling rate of infected domestic poultry communities will reduce the probability of outbreaks. Poultry communities that shared habitat with wild birds or those in districts with previous outbreaks were more likely to suffer an outbreak. These results indicate that wild birds can introduce HPAI to domestic poultry and that limiting their contact at shared habitats together with swift culling of infected domestic poultry can greatly reduce the likelihood of HPAI outbreaks.
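To make the idea of a stochastic continuous-time model concrete, here is a minimal Gillespie-style sketch. It is not the authors' model: the three event types, all parameter names and default values, and the definition of an outbreak as a threshold number of infected communities are assumptions chosen only for illustration.

```python
import random


def simulate_outbreak(contact_rate, cull_rate, spread_rate=0.5,
                      n_communities=20, threshold=5, horizon=10.0, rng=None):
    """Run one trajectory; return True if `threshold` communities are
    infected at the same time before `horizon` (hypothetical definition)."""
    rng = rng or random.Random()
    t, infected = 0.0, 0
    while True:
        susceptible = n_communities - infected
        up = (contact_rate * susceptible                              # wild-bird introduction
              + spread_rate * infected * susceptible / n_communities)  # poultry-to-poultry spread
        down = cull_rate * infected                                   # culling removes infection
        total = up + down
        if total == 0.0:
            return False                  # no event can ever occur
        t += rng.expovariate(total)       # exponential waiting time to next event
        if t >= horizon:
            return False
        # Choose the event type in proportion to its rate.
        if infected == 0 or rng.random() * total < up:
            infected += 1
        else:
            infected -= 1
        if infected >= threshold:
            return True


def outbreak_probability(contact_rate, cull_rate, runs=1000, seed=0, **kwargs):
    """Monte Carlo estimate of the outbreak probability."""
    rng = random.Random(seed)
    hits = sum(simulate_outbreak(contact_rate, cull_rate, rng=rng, **kwargs)
               for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    for contact in (0.01, 0.05):
        for cull in (0.5, 2.0):
            p = outbreak_probability(contact, cull)
            print(f"contact={contact} cull={cull} P(outbreak)~{p:.3f}")
```

In this toy setup, lowering `contact_rate` or raising `cull_rate` should tend to lower the estimated probability, which mirrors the qualitative conclusion of the abstract; the numbers themselves carry no epidemiological meaning.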

    Hierarchical Learning in Euclidean Neural Networks

    Equivariant machine learning methods have shown wide success at 3D learning applications in recent years. These models explicitly build in the reflection, translation and rotation symmetries of Euclidean space and have facilitated large advances in accuracy and data efficiency for a range of applications in the physical sciences. An outstanding question for equivariant models is why they achieve such larger-than-expected advances in these applications. To probe this question, we examine the role of higher order (non-scalar) features in Euclidean Neural Networks (e3nn). We focus on the previously studied application of e3nn to the problem of electron density prediction, which allows for a variety of non-scalar outputs, and examine whether the nature of the output (scalar $l=0$, vector $l=1$, or higher order $l>1$) is relevant to the effectiveness of non-scalar hidden features in the network. Further, we examine the behavior of non-scalar features throughout training, finding a natural hierarchy of features by $l$, reminiscent of a multipole expansion. We aim for our work to ultimately inform design principles and choices of domain applications for e3nn networks. Comment: 9 pages, 3 figure

    Sphingomyelin and GM1 Influence Huntingtin Binding to, Disruption of, and Aggregation on Lipid Membranes

    Huntington disease (HD) is an inherited neurodegenerative disease caused by the expansion beyond a critical threshold of a polyglutamine (polyQ) tract near the N-terminus of the huntingtin (htt) protein. Expanded polyQ promotes the formation of a variety of oligomeric and fibrillar aggregates of htt that accumulate into the hallmark proteinaceous inclusion bodies associated with HD. htt is also highly associated with numerous cellular and subcellular membranes that contain a variety of lipids. As lipid homeostasis and metabolism abnormalities are observed in HD patients, we investigated how varying both the sphingomyelin (SM) and ganglioside (GM1) contents modifies the interactions between htt and lipid membranes. SM composition is altered in HD, and GM1 has been shown to have protective effects in animal models of HD. A combination of Langmuir trough monolayer techniques, vesicle permeability and binding assays, and in situ atomic force microscopy (AFM) was used to directly monitor the interaction of a model, synthetic htt peptide and a full-length htt-exon1 recombinant protein with model membranes comprised of total brain lipid extract (TBLE) and varying amounts of exogenously added SM or GM1. The addition of either SM or GM1 decreased htt insertion into the lipid monolayers. However, TBLE vesicles with an increased SM content were more susceptible to htt-induced permeabilization, whereas GM1 had no effect on permeabilization. Pure TBLE bilayers and TBLE bilayers enriched with GM1 developed regions of roughened, granular morphologies upon exposure to htt-exon1, but plateau-like domains with a smoother appearance formed in bilayers enriched with SM. Oligomeric aggregates were observed on all bilayer systems regardless of induced morphology. Collectively, these observations suggest that the lipid composition and its subsequent effects on membrane material properties strongly influence htt binding and aggregation on lipid membranes.