Unsupervised induction of Arabic root and pattern lexicons using machine learning
We describe an approach to building a morphological analyser of Arabic by inducing a lexicon of root and pattern templates from an unannotated corpus. Using maximum entropy modelling, we capture orthographic features from surface words, and cluster the words based on the similarity of their possible roots or patterns. From these clusters, we extract root and pattern lexicons, which allow us to morphologically analyse words. Further enhancements are applied, adjusting for morpheme length and structure. A final root extraction accuracy of 87.2% is achieved. In contrast to previous work on unsupervised learning of Arabic morphology, our approach is applicable to naturally written, unvowelled Arabic text.
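To make the role of the two lexicons concrete, the sketch below shows how a word could be analysed once root and pattern lexicons are available. It is not the maximum entropy induction described in the abstract. The pattern notation (`C` for a root consonant slot, every other character literal), the Latin transliteration of the example words, and the function names `extract_root` and `analyse` are all assumptions made for illustration.

```python
def extract_root(word, pattern, slot="C"):
    """Return the root consonants if `word` fits `pattern`, else None.

    Assumed notation: `slot` marks a root consonant position in the
    pattern; any other pattern character must match the word literally.
    """
    if len(word) != len(pattern):
        return None
    root = []
    for ch, p in zip(word, pattern):
        if p == slot:
            root.append(ch)      # slot position: collect a root consonant
        elif p != ch:
            return None          # literal position does not match
    return "".join(root)


def analyse(word, patterns, roots):
    """List the (root, pattern) pairs licensed by both lexicons."""
    results = []
    for pattern in patterns:
        root = extract_root(word, pattern)
        if root is not None and root in roots:
            results.append((root, pattern))
    return results


# Hypothetical toy lexicons in Latin transliteration (unvowelled forms).
patterns = ["mCCwC", "CCAC", "CCCCC"]
roots = {"ktb", "drs"}
print(analyse("mktwb", patterns, roots))
```

Requiring that a candidate root also appears in the root lexicon is what rules out the trivial all-slot pattern `CCCCC` in this toy example.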
MORSE: Semantic-ally Drive-n MORpheme SEgment-er
We present in this paper a novel framework for morpheme segmentation which
uses the morpho-syntactic regularities preserved by word representations, in
addition to orthographic features, to segment words into morphemes. This
framework is the first to consider vocabulary-wide syntactico-semantic
information for this task. We also analyze the deficiencies of available
benchmarking datasets and introduce our own dataset that was created on the
basis of compositionality. We validate our algorithm across datasets and
present state-of-the-art results.
What Your Username Says About You
Usernames are ubiquitous on the Internet, and they are often suggestive of
user demographics. This work looks at the degree to which gender and language
can be inferred from a username alone by making use of unsupervised morphology
induction to decompose usernames into sub-units. Experimental results on the
two tasks demonstrate the effectiveness of the proposed morphological features
compared to a character n-gram baseline.
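For readers unfamiliar with the baseline mentioned above, a character n-gram representation of a username can be sketched as follows. The abstract does not specify the n-gram orders, padding, or case handling, so the parameters `n_min`, `n_max`, and `pad`, the lowercasing step, and the function name `char_ngrams` are assumptions for illustration only.

```python
from collections import Counter


def char_ngrams(username, n_min=1, n_max=3, pad="#"):
    """Count character n-grams of a username.

    Assumptions: the name is lowercased and wrapped in a boundary
    symbol so that prefixes and suffixes yield distinct features.
    """
    padded = f"{pad}{username.lower()}{pad}"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded string.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Hypothetical username; the counts could feed any standard classifier.
features = char_ngrams("sunny_girl42", n_min=2, n_max=3)
print(features.most_common(5))
```

Morphological features of the kind the abstract proposes would replace these fixed-width windows with induced sub-units of variable length.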