17 research outputs found
Ensemble Morphosyntactic Analyser for Classical Arabic
Classical Arabic (CA) is an influential language in the lives of Muslims around the world. It is the language of two sources of Islamic law: the Quran and the Sunnah, the collection of traditions and sayings attributed to the prophet Mohammed. However, Classical Arabic in general, and the Sunnah in particular, are underexplored and under-resourced in the field of computational linguistics. This study examines possible directions for adapting existing tools, specifically morphological analysers, designed for Modern Standard Arabic (MSA) to Classical Arabic.
Morphological analysers for CA are limited, as is the data for evaluating them. In this study, we adapt existing analysers and create a validation dataset from
the Sunnah books. Inspired by the advances in deep learning and the promising
results of ensemble methods, we developed a systematic method for transferring
morphological analysis that is capable of handling different labelling systems and
various sequence lengths.
In this study, we handpicked the four best open-access MSA morphological analysers. Data generated from these analysers is evaluated before and after adaptation against the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows. First, it is feasible to analyse an under-resourced language using existing resources for a comparable language, given a small but sufficient set of annotated text. Second, the analysers typically make different errors, and this can be exploited. Third, explicit alignment of sequences and mapping of labels are not necessary to achieve comparable accuracies, given a sufficiently large training dataset.
Adapting existing tools is easier than creating tools from scratch. The resulting quality depends on the size of the training data and on the number and quality of the input taggers. A pipeline architecture performs less well than an end-to-end neural network architecture, due to error propagation and limitations on the output format. A valuable tool and dataset for annotating Classical Arabic are made freely available.
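As an illustration of the ensemble idea, the following is a minimal majority-vote sketch in Python. It assumes the analysers' labels have already been mapped to a shared tag set (the abstract's third finding is precisely that an end-to-end network can do without this explicit mapping); the function name and example labels are hypothetical, not the thesis's actual method.

```python
from collections import Counter

def ensemble_tag(token_analyses):
    """Majority vote over the labels proposed for one token, one label
    per analyser; ties fall back to the first available proposal."""
    counts = Counter(a for a in token_analyses if a is not None)
    if not counts:
        return None
    label, freq = counts.most_common(1)[0]
    first = next(a for a in token_analyses if a is not None)
    return label if freq > 1 else first

# Four hypothetical analysers disagree on one token's part of speech.
print(ensemble_tag(["NOUN", "NOUN", "VERB", "NOUN"]))  # -> NOUN
```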
On Clustering and Evaluation of Narrow Domain Short-Text Corpora
This doctoral thesis investigates the problem of clustering special collections of documents called narrow-domain short texts.
To carry out this task, various corpora and clustering methods have been analysed. Moreover, corpus evaluation measures, term selection techniques, and cluster validity measures have been introduced with the aim of studying the following problems:
- Determining the relative difficulty of a corpus for clustering, and studying some of its characteristics, such as text length, domain breadth, stylometry, class imbalance, and structure.
- Contributing to the state of the art on the clustering of corpora composed of narrow-domain short texts.
The research carried out is partially focused on "short-text clustering". This topic is considered relevant given the current and future ways in which people tend to use a "reduced language" made up of short texts (for example, blogs, snippets, news, and text messaging such as e-mail and chat).
Additionally, the domain breadth of corpora is studied. In this sense, a corpus may be considered narrow or broad in domain if its degree of vocabulary overlap is high or low, respectively. In categorisation tasks, it is quite hard to deal with narrow-domain corpora such as scientific articles, technical reports, patents, etc.
The main objective of this work is to study possible strategies for dealing with the following two problems (see the sketch after this list):
a) the low frequencies of vocabulary terms in short texts, and
b) the high vocabulary overlap associated with narrow domains.
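As a rough illustration of problem (b), the Python sketch below measures vocabulary overlap as the Jaccard similarity between document vocabularies; the Jaccard measure and the toy documents are illustrative assumptions, not necessarily the exact measures used in the thesis.

```python
def vocabulary_overlap(doc_a, doc_b):
    """Jaccard overlap between two documents' vocabularies: a high
    value across a corpus suggests a narrow domain, a low one a
    broad domain."""
    va, vb = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(va & vb) / len(va | vb)

corpus = [
    "clustering short text snippets from news feeds",
    "clustering short abstracts of scientific text",
]
print(f"overlap: {vocabulary_overlap(*corpus):.2f}")  # -> overlap: 0.30
```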
While each of the above problems is a considerable challenge on its own, when dealing with short texts from narrow domains the complexity of the problem increases. Pinto Avendaño, DE. (2008). On Clustering and Evaluation of Narrow Domain Short-Text Corpora [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2641
Proceedings of the Conference on Natural Language Processing 2010
This book contains state-of-the-art contributions to the 10th conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing.
The KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention towards addressing linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology.
The articles demonstrate the importance of semantic processing, and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, or on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as authoring aids, text summarisation, and information retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field.
Automatic medical term generation for a low-resource language: translation of SNOMED CT into Basque
211 p. (Basque)
148 p. (English)
In this thesis we have developed and evaluated systems for the automatic translation of terms into Basque. As a starting point we took SNOMED CT, an ontology comprising a broad clinical terminology, and we developed a system called EuSnomed to manage its translation into Basque. EuSnomed implements a four-step algorithm to obtain Basque equivalents for terms. The first step uses lexical resources to assign Basque equivalents directly to SNOMED CT terms; among others, we used the Euskalterm terminology bank, the Encyclopaedic Dictionary of Science and Technology (Zientzia eta Teknologiaren Hiztegi Entziklopedikoa), and the Atlas of Human Anatomy (Giza Anatomiako Atlasa). For the second step, we developed the NeoTerm system for translating English neoclassical terms into Basque; this system uses equivalences between neoclassical affixes and transliteration rules to generate Basque equivalents. For the third step, we developed the KabiTerm system, which translates complex English terms; KabiTerm uses the structures of the nested terms that appear within complex terms to generate the corresponding Basque structures, and thus to compose the complex terms. In the final step, we adapted the rule-based machine translation system Matxin to the health-science domain, creating MatxinMed; to this end we prepared Matxin for domain adaptation and, among other things, extended its dictionary so that it could translate health-science texts. The four steps developed have been evaluated using different methods: on the one hand, the first two steps were evaluated with a small group of experts; on the other hand, the systems of the last two steps were evaluated within the Medbaluatoia campaign, carried out thanks to the Basque health-sciences community.
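The four-step algorithm is essentially a fallback cascade: each stage is consulted only when the previous ones yield nothing. Below is a minimal Python sketch of that control flow; the four callables are hypothetical stand-ins for the systems named above, not their real interfaces.

```python
def translate_term(term, lexical_lookup, neoterm, kabiterm, matxinmed):
    """Fallback cascade over EuSnomed's four steps: the first step
    that returns a candidate wins."""
    for step in (lexical_lookup,   # 1. direct match in terminology resources
                 neoterm,          # 2. neoclassical affixes + transliteration
                 kabiterm,         # 3. compose complex terms from nested parts
                 matxinmed):       # 4. domain-adapted rule-based MT
        candidate = step(term)
        if candidate is not None:
            return candidate
    return None  # no Basque equivalent could be generated

# Toy usage: only the first step knows this term.
basque = translate_term(
    "heart", lambda t: {"heart": "bihotz"}.get(t),
    lambda t: None, lambda t: None, lambda t: None)
print(basque)  # -> bihotz
```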
A Hybrid Framework for Text Analysis
2015 - 2016
In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The first, with a strong knowledge of language structures, lack engineering skills. The second, conversely, expert in computing and mathematics, assign no value to the basic mechanisms and structures of language. Moreover, this discrepancy has increased in recent decades due to the growth of computational resources and the gradual computerization of the world; the use of Machine Learning technologies for solving Artificial Intelligence problems, which allows machines, for example, to learn from manually generated examples, has been used more and more often in Computational Linguistics in order to overcome the obstacle represented by language structures and their formal representation.
The dichotomy has resulted in the birth of two main approaches to Computational Linguistics, which respectively prefer:
- rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies, or ontologies;
- statistics-based methods, which, conversely, treat language as a set of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to give the machine the ability to learn these structures.
One of the main problems is the lack of communication between these two different approaches, due to the substantial differences characterizing them: on the one hand there is a strong focus on how language works and on its characteristics, with a tendency towards analytical and manual work; on the other hand, the engineering perspective finds in language an obstacle, and sees in algorithms the fastest way to overcome it.
However, the lack of communication is not pure incompatibility: following Harris, the best way to approach natural language could result from taking the best of both.
At the moment there is a large number of open-source tools that perform text analysis and Natural Language Processing. A great part of these tools are based on statistical models and consist of separate modules which can be combined to create a pipeline for processing text. Many of these resources are code packages without a GUI (Graphical User Interface), which makes them impossible to use for users without programming skills. Furthermore, the vast majority of these open-source tools support only the English language and, when Italian is included, the performance of the tools decreases significantly. On the other hand, open-source tools for the Italian language are very few.
In this work we aim to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; its purpose is rather to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build a modular software package that includes, from the start, the basic algorithms to perform different kinds of analysis. The modules perform the following tasks:
- Preprocessing Module: a module with which it is possible to load a text, normalize it, or delete stop-words. As output, the module presents the list of tokens and letters which compose the text, with their respective occurrence counts, and the processed text.
- Mr. Ling Module: a module which performs POS tagging and lemmatization. It also returns the table of lemmas with occurrence counts and the table quantifying the grammatical tags.
- Statistic Module: with which it is possible to calculate the Term Frequency and TF-IDF of tokens or lemmas, extract bigram and trigram units, and export the results as tables (a sketch of this computation follows the list).
- Semantic Module: which uses the Hyperspace Analogue to Language algorithm to calculate semantic similarity between words. The module returns word-by-word similarity matrices which can be exported and analyzed.
- Syntactic Module: which analyzes the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels.
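As a sketch of what the Statistic Module computes, here is a plain TF-IDF over tokenized documents in Python (one of the framework's two implementation languages); the function name and toy documents are illustrative, not the module's actual API.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Term Frequency times Inverse Document Frequency over a list of
    tokenized documents: the kind of table the Statistic Module exports."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

docs = [["il", "testo", "breve"], ["il", "corpus", "breve", "breve"]]
print(round(tf_idf(docs)[1]["corpus"], 3))  # -> 0.173
```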
The objective of the Framework is to build an all-in-one platform for NLP which allows any kind of user to perform basic and advanced text analysis. In order to make the Framework accessible to users who have no specific computer-science or programming skills, the modules have been provided with an intuitive GUI. The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory. In addition, it has been written in both the Java and Python programming languages. The LG-Starship Framework has a simple Graphical User Interface but will also be released as separate modules which may be included independently in any NLP pipeline.
There are many resources of this kind, but the large majority work for English. There are very few free resources for the Italian language, and this work tries to meet this need by proposing a tool which can be used both by linguists and other scientists interested in language and text analysis who know nothing about programming languages, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms.
The Framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship Framework workflow is described in the flowchart shown in fig. 1. The pipeline shows that the Pre-Processing Module is applied to the original imported or generated text in order to produce a clean and normalized preprocessed text. This module includes a function for text splitting, a stop-word list, and a tokenization method. To the preprocessed text, either the Statistic Module or the Mr. Ling Module can be applied. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF, and n-gram extraction, produces as output databases of lexical and numerical data which can be used to produce charts or to perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-Of-Speech tagging and produces an annotated text; and a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about syntactic and semantic properties.
This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis carried out by the Syntactic Module and the Semantic Module.
The first relies on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science. Its objective is to produce a dependency graph of the sentences that compose the text.
The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text.
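To make the Semantic Module's algorithm concrete, here is a minimal Python sketch of the HAL co-occurrence counts [Lund and Burgess, 1996]: each word accumulates its neighbours within a sliding window, weighted so that closer neighbours count more; the resulting row/column vectors are what gets compared for similarity. The window size and toy sentence are illustrative assumptions.

```python
from collections import defaultdict

def hal_matrix(tokens, window=5):
    """Hyperspace Analogue to Language co-occurrence matrix:
    a neighbour at distance d within the window gets weight
    window - d + 1 (adjacent words count the most)."""
    m = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                m[w][tokens[i + d]] += window - d + 1
    return m

tokens = "il gatto dorme e il cane dorme sul divano".split()
print(dict(hal_matrix(tokens)["gatto"]))
# -> {'dorme': 6, 'e': 4, 'il': 3, 'cane': 2}
```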
This workflow has been used in two different experiments, each involving a user-generated corpus.
The first experiment is a statistical study of the language of rap music in Italy, through the analysis of a large corpus of rap song lyrics downloaded from online databases of user-generated lyrics.
The second experiment is a feature-based Sentiment Analysis project performed on user product reviews. For this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed over the past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of Verbs, Adjectives, Adverbs and Nouns.
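A minimal sketch of how such polarized dictionaries can be used for lexicon-based scoring is shown below; the entries and their scores are invented for illustration and are not taken from the Salerno resources.

```python
# Invented polarized entries in the spirit of the resources described
# above; the real dictionaries and their scores are not reproduced here.
POLARITY = {"ottimo": 1.0, "robusto": 0.5, "lento": -0.5, "pessimo": -1.0}

def review_polarity(tokens):
    """Average the prior polarity of the tokens found in the lexicon."""
    hits = [POLARITY[t] for t in tokens if t in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

print(review_polarity("schermo ottimo ma software lento".split()))  # -> 0.25
```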
These two experiments show how the linguistic framework can be applied at different levels of analysis, producing both qualitative and quantitative data.
As for the results obtained, the Framework, which is only at a beta version, achieves fair results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete. More algorithms will be added to the Statistic Module, and the Syntactic Module will be completed. The GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published. [edited by author]