Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially for the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to reach a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements. Peer reviewed.
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Multilingual language models are widely used to extend NLP systems to
low-resource languages. However, concrete evidence for the effects of
multilinguality on language modeling performance in individual languages
remains scarce. Here, we pre-train over 10,000 monolingual and multilingual
language models for over 250 languages, including multiple language families
that are under-studied in NLP. We assess how language modeling performance in
each language varies as a function of (1) monolingual dataset size, (2) added
multilingual dataset size, (3) linguistic similarity of the added languages,
and (4) model size (up to 45M parameters). We find that in moderation, adding
multilingual data improves low-resource language modeling performance, similar
to increasing low-resource dataset sizes by up to 33%. Improvements depend on
the syntactic similarity of the added multilingual data, with marginal
additional effects of vocabulary overlap. However, high-resource languages
consistently perform worse in multilingual pre-training scenarios. As dataset
sizes increase, adding multilingual data begins to hurt performance for both
low-resource and high-resource languages, likely due to limited model capacity
(the "curse of multilinguality"). These results suggest that massively
multilingual pre-training may not be optimal for any languages involved, but
that more targeted models can significantly improve performance.
Reducing out-of-vocabulary in morphology to improve the accuracy in Arabic dialects speech recognition
This thesis has two aims: developing resources for Arabic dialects and improving the speech recognition of Arabic dialects. Two important components are considered: the Pronunciation Dictionary (PD) and the Language Model (LM). Six parts are involved, which relate to building and evaluating dialect resources and improving the performance of systems for the speech recognition of dialects.
Three resources are built and evaluated: one tool and two corpora. The methodology used for building the multi-dialect morphology analyser involves the proposal and evaluation of linguistic and statistical bases. We obtained an overall accuracy of 94%. The dialect text corpora cover four sub-dialects, with more than 50 million tokens. The multi-dialect speech corpora contain 32 hours of speech, collected from 52 participants. The resultant speech corpora have more than 67,000 speech files.
The main objective is improvement in the PDs and LMs of Arabic dialects. The use of an incremental methodology made it possible to check orthography and phonology rules incrementally. We were able to distinguish the rules that positively affected the PDs. The Word Error Rate (WER) improved by 5.3% in MSA and 5% in Levantine.
Three levels of morphemes were used to improve the LMs of dialects: stem, prefix+stem and stem+suffix. We checked the three forms using two different types of LMs. Eighteen experiments were carried out on MSA, Gulf dialect and Egyptian dialect, all of which yielded positive results, showing that WERs were reduced by 0.5% to 6.8%.
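For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recogniser hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example strings are illustrative assumptions; the abstract does not describe the thesis's scoring tool, nor whether the reported reductions are absolute or relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.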
Extraction of Arabic word roots: An Approach Based on Computational Model and Multi-Backpropagation Neural Networks
Stemming is a process of extracting the root of a given word, by stripping
off the affixes attached to this word. Many attempts have been made
to address the problem of stemming Arabic words. The majority of the
existing Arabic stemming algorithms require a complete set of morphological
rules and large vocabulary lookup tables. Furthermore, many of them give
more than one potential stem or root for a given Arabic word. According to
Ahmad [11], Arabic stemming based on the language's morphological
rules is still a very difficult task due to the nature of the language itself.
The limitations of the current Arabic stemming methods have motivated this
research, in which we investigate a novel approach to extracting the word roots
of the Arabic language, named here MUAIDI-STEMMER (Muaidi is the author's
father's name). This approach attempts
to exploit numerical relations between Arabic letters, avoiding the need for a list
of the root and pattern of each word in the language, and giving a single root solution.
This approach is composed of two phases. Phase I depends on basic
calculations extracted from linguistic analysis of Arabic patterns and affixes.
Phase II is based on an artificial neural network trained by the backpropagation
learning rule. In this proposed phase, we formulate the root extraction problem
as a classification problem and the neural network as a classifier tool.
This study demonstrates that a neural network can be effectively used to extract
the word roots of the Arabic language.
The stemmer developed is tested using 46,895 Arabic word types (distinct or
unique words). Error counting accuracy evaluation was employed to evaluate the performance of
the stemmer. It was successful in producing the stems of 44,107 Arabic words
from the given test datasets, with an accuracy of 94.81%.
Development of Corn Kernel-based Biocomposite Films for Food Packaging Applications
Most current food packaging materials and methods are non-biodegradable and non-renewable, and therefore harmful to the environment. As a result, alternative sources of food packaging materials are in high demand. In this study, a biocomposite film has been developed with corn kernel powder as fiber reinforcement, mixed with two biopolymers, gelatin and lignin, as the matrix. The effect of corn kernel (CK) reinforcement of the gelatin/lignin (G/L) matrix on mechanical and barrier properties has been studied. CK has shown great potential as reinforcement for the natural polymers gelatin and lignin (G/L) in food packaging applications, in addition to being biodegradable. Gelatin has significant limitations in its barrier properties, so lignin was chosen as a crosslinking polymer to minimize them. The larger particle size of CK affected the composite, so it was ground further to a smaller size (image analysis via digital microscope). Four different mixtures, specified by weight percentage (w%), were used to prepare the composite film: CK (10%) – G/L (5%, 10%, 15%, 20%). Two G/L (5%, 10%) films without fiber were also produced for performance comparison. The prepared composite films were subjected to morphological analysis, mechanical strength analysis, film thickness analysis, water vapor permeability (WVP) analysis, and water uptake (WU) analysis. CK was observed to be well dispersed in the G/L matrix (image analysis via SEM). Evaluation of the mechanical properties of the CK composite film showed that the strength of the composite increases with increasing w% of CK. A film with more matrix showed less water absorption as well as lower water vapor permeability. The WVP and WU tests revealed that the CK (10%) – G/L (20%) film possesses the best barrier properties.
Unsupervised learning for text-to-speech synthesis
This thesis introduces a general method for incorporating the distributional analysis
of textual and linguistic objects into text-to-speech (TTS) conversion systems.
Conventional TTS conversion uses intermediate layers of representation to bridge
the gap between text and speech. Collecting the annotated data needed to produce
these intermediate layers is a far from trivial task, possibly prohibitively so
for languages in which no such resources are in existence. Distributional analysis,
in contrast, proceeds in an unsupervised manner, and so enables the creation of
systems using textual data that are not annotated. The method therefore aids
the building of systems for languages in which conventional linguistic resources
are scarce, but is not restricted to these languages.
The distributional analysis proposed here places the textual objects analysed
in a continuous-valued space, rather than specifying a hard categorisation of those
objects. This space is then partitioned during the training of acoustic models for
synthesis, so that the models generalise over objects' surface forms in a way that
is acoustically relevant.
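As a rough illustration of what placing textual objects in a vector space can look like, the sketch below builds count-based context vectors for tokens and compares them with cosine similarity. This is a generic distributional technique assumed here for illustration only; the abstract does not specify the thesis's actual features or models, and the names `context_vectors` and `cosine`, the window size, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=1):
    """Map each token type to counts of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vectors[token][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy, unannotated text: "cat" and "dog" occur in the same contexts.
    tokens = "the cat sat the dog sat".split()
    vecs = context_vectors(tokens, window=1)
    print(cosine(vecs["cat"], vecs["dog"]))
    print(cosine(vecs["cat"], vecs["the"]))
```

No labels are needed to build these vectors, which is the property the abstract emphasises: objects with similar distributions end up close together without any hard categorisation being imposed.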
The method is applied to three levels of textual analysis: to the characterisation
of sub-syllabic units, word units and utterances. Entire systems for three
languages (English, Finnish and Romanian) are built with no reliance on manually
labelled data or language-specific expertise. Results of a subjective evaluation
are presented.
Livestock and sustainable nutrient cycling in mixed farming systems of sub-Saharan Africa. Volume II: Technical papers. Proceedings of an international conference
Achieving sustainable increases in agricultural production in sub-Saharan Africa is both a regional and a worldwide concern. High human and animal population densities in some areas have surpassed land-carrying capacities causing environmental degradation and undermining the long-term stability of these production systems. In attempts to meet the increasing food demands of larger populations, farmers are cultivating more land permanently, grazing lands have diminished and many traditional farming practices that formerly allowed land to rejuvenate are disappearing.
An efficient cycling of nutrients among crops, animals and soil is crucial to the sustained productivity of low-input mixed farming systems in sub-Saharan Africa. Access to agricultural inputs such as fertiliser and improved seed is limited. Nutrient balances, or the difference between nutrient inputs and harvests, are negative for many production systems. Although animal manures are perhaps the most important fertility amendment that many farmers apply to cropland, livestock can also contribute to these nutrient imbalances. Excessive removal of vegetation by grazing animals or harvesting feeds can deplete soil-nutrient reserves and result in decreases in soil productivity. A major portion of nutrients consumed by livestock may also be unavailable for recycling due to volatilisation, erosion and leaching losses, and uneven deposition of nutrients by animals in the landscape.
The climatic and socio-economic changes currently taking place in many parts of sub-Saharan Africa suggest that sustainable increases in agricultural production from an increasingly fragile ecosystem require new and innovative crop, livestock, and soil-management strategies. To further this objective, the International Livestock Centre for Africa (ILCA) and its cosponsors convened this conference to bring together national and international experts in livestock (cattle, sheep and goats) nutrition and management, ecology, agronomy, soil science and socio-economics to address fundamental issues of nutrient balances, agricultural productivity and the well-being of the people, livestock and environment of sub-Saharan Africa.