Proceedings of the Morpho Challenge 2010 Workshop
In natural language processing, many practical tasks, such as speech recognition, information retrieval and machine translation, depend on a large vocabulary and statistical language models. For morphologically rich languages, such as Finnish and Turkish, the construction of a vocabulary and language models that have sufficient coverage is particularly difficult because of the huge number of different word forms. In Morpho Challenge 2010, unsupervised and semi-supervised algorithms are suggested to provide morpheme analyses for words in different languages and are evaluated in various practical applications. As a research theme, unsupervised morphological analysis has received wide attention in conferences and scientific journals focused on computational linguistics and its applications. This is the proceedings of the Morpho Challenge 2010 Workshop, which contains one introductory article with a description of the tasks, evaluation and results, and six articles describing the participating unsupervised and supervised learning algorithms. The Morpho Challenge 2010 Workshop was held in Espoo, Finland, on 2-3 September 2010.
Integrating Machine Learning Into Language Documentation and Description
At least 40% of the world’s 7000+ languages are believed to be in danger of disappearing from human use by the end of this century. Many languages will disappear with almost no record of their existence because efforts to document and describe these languages are encountering an “annotation bottleneck” at early stages of analysis and annotation. Current annotation methods are too slow and expensive to counteract the pace of language endangerment and loss. Annotation could be sped up and improved by machine learning. However, state-of-the-art supervised machine learning depends heavily on large amounts of annotated data.
This dissertation explores how to train supervised machine learning systems for morphological analysis during language documentation and description. The systems are applied to nine languages. The research investigates ways that linguists and NLP scientists may want to adjust their expectations and workflows so that both can achieve optimal results with endangered data.
New methods for tasks in morphological analysis are explored. First, various approaches to automating morpheme segmentation and glossing are compared. Second, a new task is presented for learning morphological paradigms and automatically generating new morphological resources: IGT-to-paradigms (IGT2P). Third, the impact of POS tags on segmentation, glossing, and paradigm induction is examined, showing that the presence or absence of POS tags does not have a significant bearing on the performance of machine learning systems. The results indicate that Natural Language Processing (NLP) systems could be successfully integrated into the documentary and descriptive workflow. At the same time, the relatively high accuracy achieved from noisy field data with little or no additional human annotation hints that NLP may benefit from limited documentary linguistic data which may be the only or largest linguistically annotated resource available for some languages.
Wide-coverage parsing for Turkish
Wide-coverage parsing is an area that attracts much attention in natural language processing
research. This is due to the fact that it is the first step to many other applications
in natural language understanding, such as question answering.
Supervised learning using human-labelled data is currently the best performing
method. Therefore, there is great demand for annotated data. However, human annotation
is very expensive, and the amount of annotated data is always much less than
is needed to train well-performing parsers. This is the motivation behind making the
best use of data available. Turkish presents a challenge both because syntactically
annotated Turkish data is relatively small and Turkish is highly agglutinative, hence
unusually sparse at the whole word level.
The METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface
dependency relations and morphological analyses for words. We show that including
even the crudest forms of morphological information extracted from the data boosts
the performance of both generative and discriminative parsers, contrary to received
opinion concerning English.
We induce word-based and morpheme-based CCG grammars from the Turkish dependency
treebank. We use these grammars to train a state-of-the-art CCG parser that
predicts long-distance dependencies in addition to the ones that other parsers are capable
of predicting. We also use the correct CCG categories as simple features in a
graph-based dependency parser and show that this improves the parsing results.
We show that a morpheme-based CCG lexicon for Turkish is able to solve many
problems such as conflicts of semantic scope, recovering long-range dependencies,
and obtaining smoother statistics from the models. CCG handles linguistic phenomena
such as local and long-range dependencies more naturally and effectively than other linguistic
theories while potentially supporting semantic interpretation in parallel. Using
morphological information and a morpheme-cluster based lexicon improves the performance
both quantitatively and qualitatively for Turkish.
We also provide an improved version of the treebank which will be released by
kind permission of METU and Sabancı.
Unsupervised machine learning in morphological segmenters for a low-resource language, case study: SHIWILU
Shiwilu is considered ‘severely endangered’ because it is spoken mainly by
older adults, partially, infrequently, and in restricted contexts; moreover, it is no
longer being passed on to new generations. Languages of this kind need to undergo a
process of revitalization (strengthening) to ensure that they do not become extinct and to foster the
interest of their speakers. In addition, its documentation is very scarce because few linguistic
studies have been carried out. To raise its status, the creation of linguistic resources and
technology is suggested, such as monolingual and bilingual corpora, dictionaries, part-of-speech
recognition, morphological analyzers, etc. However, most
existing languages do not benefit from any of these resources and/or technologies, and are therefore
considered low-resource languages. Given the lack of investment, an approach is
required that seeks robust, low-cost solutions through language-independent
tools, open-source development models, or unsupervised machine
learning algorithms. In this context, the central problem identified is
the lack of knowledge about a suitable approach to morphological segmentation for a
low-resource language; to address it, this project proposes performing automatic
unsupervised morphological segmentation on a language with these characteristics, based on
identifying which type of approach, monolingual or multilingual, yields better results in
this task.
Fuzzy Natural Logic in IFSA-EUSFLAT 2021
The present book contains five papers accepted and published in the Special Issue, “Fuzzy Natural Logic in IFSA-EUSFLAT 2021”, of the journal Mathematics (MDPI). These papers are extended versions of the contributions presented in the conference “The 19th World Congress of the International Fuzzy Systems Association and the 12th Conference of the European Society for Fuzzy Logic and Technology jointly with the AGOP, IJCRS, and FQAS conferences”, which took place in Bratislava (Slovakia) from September 19 to September 24, 2021. Fuzzy Natural Logic (FNL) is a system of mathematical fuzzy logic theories that enables us to model natural language terms and rules while accounting for their inherent vagueness and allows us to reason and argue using the tools developed in them. FNL includes, among others, the theory of evaluative linguistic expressions (e.g., small, very large, etc.), the theory of fuzzy and intermediate quantifiers (e.g., most, few, many, etc.), and the theory of fuzzy/linguistic IF–THEN rules and logical inference. The papers in this Special Issue use the various aspects and concepts of FNL mentioned above and apply them to a wide range of problems, both theoretically and practically oriented. This book will be of interest to researchers working in the areas of fuzzy logic, applied linguistics, generalized quantifiers, and their applications.