106 research outputs found

    Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

    The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis, working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching between the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO, with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO, with 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA, as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, being built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.

    EVALUATING EXTENSIVE SHEEP FARMING SYSTEMS

    Data from each of 5 commercial, extensive sheep farms in Cumbria, UK were used as parameters in a linear program (LP) representing labour and grazing management in such farming systems. The LP maximised ewe enterprise gross margin subject to constraints dictated by the labour availability and land types on each farm. Under the assumptions used, labour availability and price restricted ewe numbers well below those observed in practice on two farms, i.e. land resources were adequate for the farming system practiced. On two other farms, stocking levels and hence returns were limited by the availability of forage and hence feed input prices relative to output. On one farm, greater grassland productivity was the key determinant of system performance. It was concluded that a holistic systems approach was needed to properly evaluate these farming systems in terms of their potential contribution to animal welfare, land use, profit and hence their sustainability.
    Keywords: Livestock Production/Industries, Extensive, Sheep, Economics, LP

    Impacts of labour on interactions between economics and animal welfare in extensive sheep farms

    This study quantified interactions between animal welfare and farm profitability in British extensive sheep farming systems. Qualitative welfare assessment methodology was used to assess welfare from the animal's perspective in 20 commercial extensive sheep farms and to estimate labour demand for welfare, based on the assessed welfare scores using data collected from farm inventories. The estimated labour demand was then used as a coefficient in a linear program based model to establish the gross margin maximising farm management strategy for given farm situations, subject to constraints that reflected current resource limitations including labour supply. Regression analysis showed a significant relationship between the qualitative welfare assessment scores and labour supply on the inventoried farms but there was no significant relationship between current gross margin and assessed welfare scores. However, to meet the labour demand of the best welfare score, a reduction in flock size and in the average maximum farm gross margin was often required. These findings supported the hypothesis that trade-offs between animal welfare and farm profitability are necessary in providing maximum animal welfare via on-farm labour and sustainable British extensive sheep farming systems.
    Keywords: Sheep, Labour, Animal Welfare, Linear Programme, Livestock Production/Industries, C6, Q10, Q19, Q57

    Handwritten digit recognition by bio-inspired hierarchical networks

    The human brain processes information showing learning and prediction abilities, but the underlying neuronal mechanisms still remain unknown. Recently, many studies have shown that neuronal networks are capable of both generalization and association of sensory inputs. In this paper, following a body of neurophysiological evidence, we propose a learning framework with a strong biological plausibility that mimics prominent functions of cortical circuitries. We developed the Inductive Conceptual Network (ICN), a hierarchical bio-inspired network able to learn invariant patterns by Variable-order Markov Models implemented in its nodes. The outputs of the top-most node of the ICN hierarchy, representing the highest input generalization, allow for automatic classification of inputs. We found that the ICN clustered MNIST images with an error of 5.73% and USPS images with an error of 12.56%.

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistically motivated problems and we present results for automatic language recognition, authorship attribution and self-consistent classification.
    Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures
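
To make the idea of a compression-based distance between two texts concrete, here is a minimal sketch. It does not reproduce the measure defined in the paper; it uses the normalized compression distance (NCD), a related and widely known technique, with Python's standard zlib compressor standing in for whatever compressor the authors used. The function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of data after zlib compression."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Assumption: NCD is used here as an illustrative stand-in; it is
    not the distance defined in the abstract above.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english = "the quick brown fox jumps over the lazy dog " * 20
    italian = "nel mezzo del cammin di nostra vita mi ritrovai " * 20
    print("same text:      ", ncd(english, english))
    print("different texts:", ncd(english, italian))
```

A text compared with itself compresses almost as well as the text alone, so its distance is small; two unrelated texts share little redundancy, so their distance is larger.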

    A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

    The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of grammar and Arabic, in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in analysis of morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared; and we review generic design criteria for corpus tag sets. For a morphologically-rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The first character shows the main Parts of Speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. The characters 2, 3, and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. 
    The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.
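
The positional layout described above (22 characters, one feature per position, a dash for a feature that does not apply) can be sketched as a small decoder. The position-to-feature mapping follows the abstract; the function name is hypothetical, and the value letters in the usage example are placeholders, since the abstract does not list the actual SALMA value codes.

```python
# Feature names by position (1-22), as listed in the abstract.
FEATURES = [
    "main part of speech", "noun subclass", "verb subclass",
    "particle subclass", "other (residual)", "punctuation",
    "gender", "number", "person", "inflectional morphology",
    "case or mood", "case and mood marks", "definiteness", "voice",
    "emphasized and non-emphasized", "transitivity", "rational",
    "declension and conjugation", "unaugmented and augmented",
    "number of root letters", "verb root",
    "noun type by final letters",
]


def decode_tag(tag: str) -> dict[str, str]:
    """Map a 22-character positional tag to {feature name: value letter}.

    Positions holding '-' (feature not relevant to the word) are skipped.
    The value letters are returned as-is; interpreting them would need
    the full tag set definition, which is not given in the abstract.
    """
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(FEATURES, tag) if ch != "-"}


if __name__ == "__main__":
    # Placeholder letters only, not real SALMA codes.
    example = "ab" + "-" * 4 + "c" + "-" * 15
    for feature, value in decode_tag(example).items():
        print(f"{feature}: {value}")
```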