28,610 research outputs found

    Statistical Augmentation of a Chinese Machine-Readable Dictionary

    Get PDF
    We describe a method of using statistically-collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both using human evaluators and against a previously available dictionary. We also evaluated performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were lacking from the original dictionary.Comment: 17 pages, uuencoded compressed PostScrip

    Unsupervised Time Series Extraction from Controller Area Network Payloads

    Full text link
    This paper introduces a method for unsupervised tokenization of Controller Area Network (CAN) data payloads using bit level transition analysis and a greedy grouping strategy. The primary goal of this proposal is to extract individual time series which have been concatenated together before transmission onto a vehicle's CAN bus. This process is necessary because the documentation for how to properly extract data from a network may not always be available; passenger vehicle CAN configurations are protected as trade secrets. At least one major manufacturer has also been found to deliberately misconfigure their documented extraction methods. Thus, this proposal serves as a critical enabler for robust third-party security auditing and intrusion detection systems which do not rely on manufacturers sharing confidential information.Comment: 2018 IEEE 88th Vehicular Technology Conference (VTC2018-Fall

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Full text link
    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches have poor performance because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, as well as a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.Comment: Accepted to ACL 202
    corecore