1,333 research outputs found
Development of a Hindi Lemmatizer
We live in a translingual society; to communicate with people from different parts of the world, we need expertise in their respective languages. Learning all of these languages is not feasible, so we need a mechanism that can do this task for us. Machine translators have emerged as a tool that can perform this task. Developing a machine translator requires developing several different rules. The very first module in a machine translation pipeline is morphological analysis, which includes stemming and lemmatization. In this paper we have created a lemmatizer which generates rules for removing affixes, along with additional rules for creating a proper root word.
Stemmer for Serbian language
In linguistic morphology and information retrieval, stemming is the process
of reducing inflected (or sometimes derived) words to their stem, base or root
form; generally a written word form. This work presents a suffix-stripping
stemmer for the Serbian language, one of the highly inflectional languages.
Comment: 16 pages, 8 figures, code include
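As a rough illustration of what suffix stripping means in general, the sketch below removes the longest matching suffix from a word while keeping a minimum stem length. The suffix list, the `min_stem_len` parameter and the function name are hypothetical choices made for this example, loosely modeled on Serbian noun endings; they are not the rules from the paper.

```python
# Toy suffix list for illustration only (an assumption, not the paper's rule set).
SUFFIXES = ("ovima", "ama", "ima", "om", "a", "e", "i", "u")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=3):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    # Try longer suffixes first so that "ovima" wins over "ima" or "a".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix applies (or the stem would become too short): leave the word unchanged.
    return word


for w in ("gradovima", "knjigama", "dom"):
    print(w, "->", strip_suffix(w))
```

The minimum stem length is a common guard against over-stemming short words; a real stemmer for a highly inflectional language would need a much larger, linguistically motivated rule set.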
Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification
This paper deals with the identification of Multiword Expressions (MWEs) in
Manipuri, a highly agglutinative Indian language. Manipuri is listed in the
Eighth Schedule of the Indian Constitution. MWEs play an important role in
applications of Natural Language Processing (NLP) such as Machine Translation,
Part of Speech tagging, Information Retrieval and Question Answering. Feature
selection is an important factor in the recognition of Manipuri MWEs using
Conditional Random Fields (CRF). The drawbacks of manually selecting
appropriate features for running the CRF motivate the use of a
Genetic Algorithm (GA). Using the GA we are able to find the optimal features to
run the CRF. We ran fifty generations of feature selection, with
three-fold cross validation as the fitness function. This model achieved
a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%,
showing an improvement over CRF-based Manipuri MWE identification without
the GA.
Comment: 14 pages, 6 figures, see
http://airccse.org/journal/jcsit/1011csit05.pd
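The reported scores can be cross-checked, assuming the F-measure in the abstract is the balanced F1 score (the harmonic mean of precision and recall). The abstract does not state the weighting, so this is an assumption, and the function name below is chosen for this example.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f = f_measure(0.8684, 0.6408)
print(f"{f:.2%}")  # prints 73.74%
```

Under that assumption the computed value agrees with the reported F-measure of 73.74%.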
ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing
In this paper, we present a novel unsupervised algorithm for word sense
disambiguation (WSD) at the document level. Our algorithm is inspired by a
widely-used approach in the field of genetics for whole genome sequencing,
known as the Shotgun sequencing technique. The proposed WSD algorithm is based
on three main steps. First, a brute-force WSD algorithm is applied to short
context windows (up to 10 words) selected from the document in order to
generate a short list of likely sense configurations for each window. In the
second step, these local sense configurations are assembled into longer
composite configurations based on suffix and prefix matching. The resulting
configurations are ranked by their length, and the sense of each word is chosen
based on a voting scheme that considers only the top k configurations in which
the word appears. We compare our algorithm with other state-of-the-art
unsupervised WSD algorithms and demonstrate better performance, sometimes by a
very large margin. We also show that our algorithm can yield better performance
than the Most Common Sense (MCS) baseline on one data set. Moreover, our
algorithm has a very small number of parameters, is robust to parameter tuning,
and, unlike other bio-inspired methods, it gives a deterministic solution (it
does not involve random choices).
Comment: In Proceedings of EACL 201
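The assembly step can be pictured with a minimal sketch, assuming each local configuration is stored as a start position plus one sense label per word, and that two configurations merge when the suffix of the first agrees with the prefix of the second on every overlapping word. The data layout, the `min_overlap` parameter, the function name and the sense labels are all hypothetical; the abstract does not specify the actual matching criteria.

```python
def merge(a, b, min_overlap=1):
    """Merge configurations a and b if a's suffix agrees with b's prefix.

    Each configuration is (start, senses): `start` is the document position of
    its first word and `senses` is a tuple with one sense label per word.
    Returns the merged configuration, or None if they cannot be merged.
    """
    start_a, senses_a = a
    start_b, senses_b = b
    overlap = start_a + len(senses_a) - start_b
    # b must start inside a and extend past the end of a.
    if not (start_a <= start_b and min_overlap <= overlap < len(senses_b)):
        return None
    # Every overlapping word must carry the same sense in both configurations.
    if senses_a[-overlap:] != senses_b[:overlap]:
        return None
    return (start_a, senses_a + senses_b[overlap:])


left = (0, ("bank#1", "river#2", "flow#1"))
right = (1, ("river#2", "flow#1", "water#1"))
print(merge(left, right))
```

Repeatedly merging compatible pairs yields longer composite configurations, which could then be ranked by length as the abstract describes.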
Time complexity in Rejang language stemming
Stemming is the process of extracting the root word from an affixed word in a sentence by separating the base word from its affixes, which can consist of prefixes, infixes, and suffixes. Stemming algorithms differ from one language to another because of differences in morphology. The time complexity of the Rejang algorithm is determined based on the affix group. To determine the time complexity of the stemming algorithm for the Rejang language, the method consists of building a digital word dictionary of the Rejang language, studying and analyzing Rejang morphology, designing the Rejang stemming algorithm based on the results of that morphological analysis, analyzing the algorithm's performance, and calculating the time complexity of the stemming results. The result of this research is an efficient and effective Rejang stemming algorithm: efficiency is indicated by the algorithm's time complexity of O(log n), and effectiveness is shown by an accuracy of 99% on a test set of 9000 affixed words. This accuracy value indicates that over-stemming and under-stemming together account for 1% of cases. Tests on 15 text documents show an average stemming failure rate of 1%.
Mapping Subsets of Scholarly Information
We illustrate the use of machine learning techniques to analyze, structure,
maintain, and evolve a large online corpus of academic literature. An emerging
field of research can be identified as part of an existing corpus, permitting
the implementation of a more coherent community structure for its
practitioners.
Comment: 10 pages, 4 figures, presented at Arthur M. Sackler Colloquium on
"Mapping Knowledge Domains", 9--11 May 2003, Beckman Center, Irvine, CA,
proceedings to appear in PNA
Text segmentation on multilabel documents: A distant-supervised approach
Segmenting text into semantically coherent segments is an important task with
applications in information retrieval and text summarization. Developing
accurate topical segmentation requires the availability of training data with
ground truth information at the segment level. However, generating such labeled
datasets, especially for applications in which the meaning of the labels is
user-defined, is expensive and time-consuming. In this paper, we develop an
approach that, instead of using segment-level ground truth information,
uses the set of labels associated with a document; these labels are
easier to obtain, as the training data essentially corresponds to a multilabel
dataset. Our method, which can be thought of as an instance of distant
supervision, improves upon the previous approaches by exploiting the fact that
consecutive sentences in a document tend to talk about the same topic, and
hence, probably belong to the same class. Experiments on the text segmentation
task on a variety of datasets show that the segmentation produced by our method
beats the competing approaches on four out of five datasets and performs at par
on the fifth dataset. On the multilabel text classification task, our method
performs at par with the competing approaches, while requiring significantly
less time to estimate than the competing approaches.
Comment: Accepted in 2018 IEEE International Conference on Data Mining (ICDM