Statistical Augmentation of a Chinese Machine-Readable Dictionary
We describe a method of using statistically-collected Chinese character
groups from a corpus to augment a Chinese dictionary. The method is
particularly useful for extracting domain-specific and regional words not
readily available in machine-readable dictionaries. Output was evaluated both
using human evaluators and against a previously available dictionary. We also
evaluated performance improvement in automatic Chinese tokenization. Results
show that our method outputs legitimate words, acronymic constructions, idioms,
names and titles, as well as technical compounds, many of which were lacking
from the original dictionary.
Comment: 17 pages, uuencoded compressed PostScrip
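The idea of collecting character groups from a corpus and keeping those missing from the dictionary can be sketched roughly as follows. This is a minimal illustration only: the abstract does not state which statistic is used, so the raw frequency threshold (`min_count`), the maximum group length (`max_len`), and the function name `candidate_words` are all assumptions made for this example.

```python
from collections import Counter


def candidate_words(corpus, dictionary, max_len=4, min_count=2):
    """Return character groups that recur in the corpus but are not in the dictionary.

    Assumption: a plain frequency threshold stands in for whatever
    statistic the original method uses.
    """
    counts = Counter()
    for sentence in corpus:
        # Count every contiguous character group of length 2..max_len.
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {
        group: count
        for group, count in counts.items()
        if count >= min_count and group not in dictionary
    }


# Toy corpus and dictionary, invented for illustration.
corpus = ["香港特区政府", "特区政府宣布", "香港政府"]
dictionary = {"香港", "政府"}
candidates = candidate_words(corpus, dictionary)
```

A frequency threshold alone also keeps fragments such as partial words, which is why the abstract's evaluation by human judges and against an existing dictionary matters.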
Unsupervised Time Series Extraction from Controller Area Network Payloads
This paper introduces a method for unsupervised tokenization of Controller
Area Network (CAN) data payloads using bit level transition analysis and a
greedy grouping strategy. The primary goal of this proposal is to extract
individual time series which have been concatenated together before
transmission onto a vehicle's CAN bus. This process is necessary because the
documentation for how to properly extract data from a network may not always be
available; passenger vehicle CAN configurations are protected as trade secrets.
At least one major manufacturer has also been found to deliberately
misconfigure their documented extraction methods. Thus, this proposal serves as
a critical enabler for robust third-party security auditing and intrusion
detection systems which do not rely on manufacturers sharing confidential
information.
Comment: 2018 IEEE 88th Vehicular Technology Conference (VTC2018-Fall
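Bit-level transition analysis followed by greedy grouping can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's exact procedure: payloads are modelled as integers of a fixed bit width, signals are assumed to be stored most significant bit first, and the grouping rule (extend a group while the flip rate does not decrease toward the least significant bit) is a heuristic chosen for this example. The names `bit_flip_rates` and `greedy_group` are hypothetical.

```python
def bit_flip_rates(payloads, width):
    """Fraction of consecutive payload pairs in which each bit changes.

    Position 0 is the most significant bit of the payload.
    """
    flips = [0] * width
    for prev, cur in zip(payloads, payloads[1:]):
        diff = prev ^ cur
        for pos in range(width):
            if (diff >> (width - 1 - pos)) & 1:
                flips[pos] += 1
    transitions = max(len(payloads) - 1, 1)
    return [f / transitions for f in flips]


def greedy_group(rates):
    """Greedily group adjacent bits into candidate signals.

    Assumed heuristic: within one signal the flip rate grows from the
    most to the least significant bit, so a drop in rate (or a bit that
    never flips) marks a boundary. Returns inclusive (start, end) pairs.
    """
    groups = []
    start = None
    for pos, rate in enumerate(rates):
        if rate == 0:
            if start is not None:
                groups.append((start, pos - 1))
                start = None
            continue
        if start is None:
            start = pos
        elif rate < rates[pos - 1]:
            groups.append((start, pos - 1))
            start = pos
    if start is not None:
        groups.append((start, len(rates) - 1))
    return groups


# Synthetic 16-bit payloads: a 4-bit counter in the high byte and a
# slower 2-bit counter in the low byte, concatenated before "transmission".
payloads = [((i % 16) << 8) | ((i // 2) % 4) for i in range(64)]
signals = greedy_group(bit_flip_rates(payloads, 16))
```

On this synthetic trace the two counters are recovered as separate bit ranges; real CAN traffic is noisier and would need a more tolerant boundary rule.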
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert the pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
Comment: Accepted to ACL 202
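The one-to-many mapping problem described above can be illustrated with a toy disambiguator. This is only a sketch under loud assumptions: it enumerates every candidate conversion and scores it with an invented bigram count table, whereas the abstract describes subword segmentation with two language models. The tables `MAPPING` and `BIGRAM_COUNTS` and the function `convert` are hypothetical and hold just enough entries for the example.

```python
from itertools import product

# Hypothetical mapping table: one simplified character may map to
# several traditional characters.
MAPPING = {"发": ["發", "髮"], "头": ["頭"]}

# Invented counts standing in for a language model over traditional text.
BIGRAM_COUNTS = {"頭髮": 5, "發展": 8}


def convert(text):
    """Pick the candidate conversion with the highest toy bigram score."""
    # Characters without an entry map to themselves.
    options = [MAPPING.get(ch, [ch]) for ch in text]

    def score(candidate):
        return sum(
            BIGRAM_COUNTS.get(a + b, 0)
            for a, b in zip(candidate, candidate[1:])
        )

    return "".join(max(product(*options), key=score))


hair = convert("头发")
develop = convert("发展")
```

Here the same simplified character 发 resolves to 髮 after 頭 and to 發 before 展, which is the kind of context dependence a character-by-character lookup cannot capture. Exhaustive enumeration grows exponentially with text length, so it is usable only for short toy inputs.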