12 research outputs found

    Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

    In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the outcome of the OCR process is limited, especially with regard to the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to reach a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieve quality improvements. Peer reviewed.

    When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

    Multilingual language models are widely used to extend NLP systems to low-resource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are under-studied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance

    Reducing out-of-vocabulary in morphology to improve the accuracy in Arabic dialects speech recognition

    This thesis has two aims: developing resources for Arabic dialects and improving the speech recognition of Arabic dialects. Two important components are considered: the Pronunciation Dictionary (PD) and the Language Model (LM). Six parts are involved, which relate to building and evaluating dialect resources and improving the performance of systems for the speech recognition of dialects. Three resources are built and evaluated: one tool and two corpora. The methodology used for building the multi-dialect morphology analyser involves the proposal and evaluation of linguistic and statistical bases. We obtained an overall accuracy of 94%. The dialect text corpora have four sub-dialects, with more than 50 million tokens. The multi-dialect speech corpora have 32 speech hours, which were collected from 52 participants. The resultant speech corpora have more than 67,000 speech files. The main objective is improvement in the PDs and LMs of Arabic dialects. The use of an incremental methodology made it possible to check orthography and phonology rules incrementally. We were able to distinguish the rules that positively affected the PDs. The Word Error Rate (WER) improved by 5.3% in MSA and 5% in Levantine. Three levels of morphemes were used to improve the LMs of dialects: stem, prefix+stem and stem+suffix. We checked the three forms using two different types of LMs. Eighteen experiments were carried out on MSA, Gulf dialect and Egyptian dialect, all of which yielded positive results, showing that WERs were reduced by 0.5% to 6.8%.
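
The abstract above reports its results as Word Error Rate (WER) without defining it. As a reading aid, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The sketch below is only an illustration of that standard definition; the function name `word_error_rate` is hypothetical and this is not the scoring tool used in the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a lower value is better, and a reduction such as the 0.5% to 6.8% reported above is a difference between two WER values measured on the same test set.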

    Extraction of Arabic word roots: An Approach Based on Computational Model and Multi-Backpropagation Neural Networks

    Stemming is the process of extracting the root of a given word by stripping off the affixes attached to it. Many attempts have been made to address the problem of stemming Arabic words. The majority of the existing Arabic stemming algorithms require a complete set of morphological rules and large vocabulary lookup tables. Furthermore, many of them give more than one potential stem or root for a given Arabic word. According to Ahmad [11], Arabic stemming based on the language's morphological rules is still a very difficult task due to the nature of the language itself. The limitations of the current Arabic stemming methods have motivated this research, in which we investigate a novel approach to extracting the word roots of the Arabic language, named here MUAIDI-STEMMER (Muaidi is the author's father's name). This approach attempts to exploit numerical relations between Arabic letters, avoiding the need for a list of the root and pattern of each word in the language, and giving a single root solution. The approach is composed of two phases. Phase I depends on basic calculations extracted from linguistic analysis of Arabic patterns and affixes. Phase II is based on an artificial neural network trained by the backpropagation learning rule. In this proposed phase, we formulate the root extraction problem as a classification problem and use the neural network as the classifier. This study demonstrates that a neural network can be effectively used to extract the word roots of the Arabic language. The stemmer developed is tested using 46,895 Arabic word types (types are distinct or unique words). Error counting accuracy evaluation was employed to evaluate the performance of the stemmer. It was successful in producing the stems of 44,107 Arabic words from the given test datasets, with an accuracy of 94.81%.

    Development of Corn Kernel-based Biocomposite Films for Food Packaging Applications

    Most of the current and active food packaging resources and methods are nonbiodegradable and nonrenewable, and therefore harmful to the environment. Due to this, alternative sources of food packaging materials are in high demand. In this study, a bio-composite film has been developed with corn kernel powder as fiber reinforcement, mixed with two biopolymers, gelatin and lignin, as the matrix. The effect of Corn Kernel (CK) reinforcement of the Gelatin/Lignin (G/L) matrix on mechanical and barrier properties has been studied. CK has shown great potential as reinforcement for the natural polymers gelatin and lignin (G/L) in food packaging applications, with its unique attributes also supporting biodegradability. Gelatin has significant limitations in barrier properties, so lignin was chosen as a crosslinking polymer to minimize them. The larger particle size of CK affected the composite, so it was ground further to a smaller size (image analysis via digital microscope). Four different mixtures were used to prepare the composite film: CK (10%) – G/L (5%, 10%, 15%, 20%), by w%. Two G/L (5%, 10%) films without fiber were also produced for performance comparison. The prepared composite films were subjected to morphological analysis, mechanical strength analysis, film thickness analysis, water vapor permeability (WVP) analysis, and water uptake (WU) analysis. It was observed that CK is well dispersed in the G/L matrix (image analysis via SEM). Evaluation of the mechanical properties of the CK composite film showed that the strength of the composite increases with the w% of CK. A film with more matrix showed less water absorption as well as lower water vapor permeability. The WVP and WU tests revealed that the film CK (10%) – G/L (20%) possesses the best barrier properties.

    Unsupervised learning for text-to-speech synthesis

    This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented

    Livestock and sustainable nutrient cycling in mixed farming systems of sub-Saharan Africa. Volume II: Technical papers. Proceedings of an international conference

    Achieving sustainable increases in agricultural production in sub-Saharan Africa is both a regional and a worldwide concern. High human and animal population densities in some areas have surpassed land-carrying capacities causing environmental degradation and undermining the long-term stability of these production systems. In attempts to meet the increasing food demands of larger populations, farmers are cultivating more land permanently, grazing lands have diminished and many traditional farming practices that formerly allowed land to rejuvenate are disappearing. An efficient cycling of nutrients among crops, animals and soil is crucial to the sustained productivity of low-input mixed farming systems in sub-Saharan Africa. Access to agricultural inputs such as fertiliser and improved seed is limited. Nutrient balances, or the difference between nutrient inputs and harvests, are negative for many production systems. Although animal manures are perhaps the most important fertility amendment that many farmers apply to cropland, livestock can also contribute to these nutrient imbalances. Excessive removal of vegetation by grazing animals or harvesting feeds can deplete soil-nutrient reserves and result in decreases in soil productivity. A major portion of nutrients consumed by livestock may also be unavailable for recycling due to volatilisation, erosion and leaching losses, and uneven deposition of nutrients by animals in the landscape. The climatic and socio-economic changes currently taking place in many parts of sub-Saharan Africa suggest that sustainable increases in agricultural production from an increasingly fragile ecosystem require new and innovative crop, livestock, and soil-management strategies. 
    To further this objective, the International Livestock Centre for Africa (ILCA) and its cosponsors convened this conference to bring together national and international experts in livestock (cattle, sheep and goats) nutrition and management, ecology, agronomy, soil science and socio-economics to address fundamental issues of nutrient balances, agricultural productivity and the well-being of the people, livestock and environment of sub-Saharan Africa.