Chunking with Max-Margin Markov Networks
PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200
Authorship Identification in Bengali Literature: a Comparative Analysis
Stylometry is the study of the unique linguistic styles and writing behaviors
of individuals. It underlies core text categorization tasks such as
authorship identification and plagiarism detection. Although a reasonable number
of studies have been conducted for English, no major work has been done
so far for Bengali. In this work, we present a demonstration of authorship
identification for documents written in Bengali. We adopt a set of
fine-grained stylistic features for the analysis of the text and use them to
develop two different models: a statistical similarity model consisting of three
measures and their combination, and a machine learning model with Decision Tree,
Neural Network and SVM. Experimental results show that SVM outperforms other
state-of-the-art methods in 10-fold cross-validation. We also validate the
relative importance of each stylistic feature and show that some of them remain
consistently significant in every model used in this experiment.
Comment: 9 pages, 5 tables, 4 picture
Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron (Clause and Chunk Identification for Basque, Using Filtering and Ranking with the Perceptron)
This paper presents systems for syntactic chunking and clause identification for
Basque, combining rule-based grammars with machine-learning techniques. Specifically, we use
Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that
recognizes partial syntactic structures in sentences and obtains state-of-the-art performance for
these tasks in English. This model allows a rich set of features to be incorporated to represent
syntactic phrases, making it possible to use information from different sources. We used this
property to include more linguistic features in the learning model, and the chunking results
improved considerably. In this way, we have made up for the relatively
small training corpus available for learning a Basque chunking model. In the case of clause
identification, our preliminary results are low, probably because of the free word order of
Basque and the small corpus currently available.
Research partly funded by the Basque Government (Department of Education,
University and Research, IT-397-07), the Spanish Ministry of Education and Science
(TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government
(Department of Culture and Industry, IE06-185).
Memory-Based Shallow Parsing
We present memory-based learning approaches to shallow parsing and apply
these to five tasks: base noun phrase identification, arbitrary base phrase
recognition, clause detection, noun phrase parsing and full parsing. We use
feature selection techniques and system combination methods for improving the
performance of the memory-based learner. Our approach is evaluated on standard
data sets and the results are compared with those of other systems. This reveals
that our approach works well for base phrase identification while its
application towards recognizing embedded structures leaves some room for
improvement.