
    Adaptive Communication: Languages with More Non-Native Speakers Tend to Have Fewer Word Forms.

    Explaining the diversity of languages across the world is one of the central aims of typological, historical, and evolutionary linguistics. We consider the effect of language contact (the number of non-native speakers a language has) on the way languages change and evolve. By analysing hundreds of languages within and across language families, regions, and text types, we show that languages with greater levels of contact typically employ fewer word forms to encode the same information content (a property we refer to as lexical diversity). Based on three types of statistical analyses, we demonstrate that this variance can in part be explained by the impact of non-native speakers on information encoding strategies. Finally, we argue that languages are information encoding systems shaped by the varying needs of their speakers. Language evolution and change should be modeled as the co-evolution of multiple intertwined adaptive systems: on one hand, the structure of human societies and human learning capabilities, and on the other, the structure of language.

    CB is funded by an Arts and Humanities Research Council (UK) doctoral grant (reference number: 04325), a grant from the Cambridge Home and European Scholarship Scheme, and by Cambridge English, University of Cambridge. AV is supported by ERC grant 'The evolution of human languages' (reference number: 268744). DK is supported by EPSRC grant EP/I037512/1. FH is funded by a Benefactor's Scholarship of St. John's College, Cambridge. PB is supported by Cambridge English, University of Cambridge. This is the final version; it first appeared at http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128254
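
    As an illustration only, the study's key quantity can be made concrete with a toy sketch: lexical diversity here means the number of distinct word forms needed to express the same content. The sentences and the `distinct_word_forms` helper below are invented for this sketch and are not from the paper.

    ```python
    # Illustrative only: one crude way to operationalize "lexical diversity" as
    # the number of distinct word forms used to encode the same content. The toy
    # sentences are invented; the study uses large parallel corpora and
    # controlled statistical analyses.
    def distinct_word_forms(text: str) -> int:
        tokens = [t.strip(".,;!?").lower() for t in text.split()]
        return len(set(tokens))

    # Two invented renderings of the same proposition: a morphologically rich
    # variant needs more distinct forms than a morphologically reduced one.
    samples = {
        "inflecting variant": "the dog sees the dogs and the dogs see the dog",
        "reduced variant": "the dog see the dog and the dog see the dog",
    }
    for name, sentence in samples.items():
        print(name, distinct_word_forms(sentence))
    ```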

    Entropy of printed Bengali language texts.

    One of the most important sources of information is written and spoken human language. The language that is spoken, written, or signed by humans for general-purpose communication is referred to as natural language. Determining the entropy of natural language text is a fundamentally important problem in natural language processing, and the study and analysis of a language's entropy can be a meaningful resource for researchers in linguistics and communication theory. For this research we took printed Bengali text as our source of natural language. We collected a sufficient number of printed Bengali text samples, divided them into two classes, newspaper and literature, and studied each class to derive a specific entropy for each category and analyse its characteristics. As a separate study, we collected printed religious Bengali texts, divided them into two classes, Islamic and Hindu, found their entropy, and analysed their characteristics. From our research, we found the zero-order and first-order entropy of Bengali to be 5.52 and 4.55 bits respectively. The relative uncertainty of the language, the ratio of first-order to zero-order entropy, is 0.8242, giving a redundancy of 1 - 0.8242 = 17.58%. These entropy and redundancy results will help researchers develop better text compression methods for Bengali.

    The original print copy of this thesis may be available here: http://wizard.unbc.ca/record=b146606
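
    To make the reported figures concrete, the following minimal sketch (ours, not the thesis code) shows how zero-order and first-order entropy are conventionally computed, and how the quoted uncertainty and redundancy follow from them:

    ```python
    from collections import Counter
    from math import log2

    def zero_order_entropy(alphabet_size: int) -> float:
        # Zero-order entropy assumes all symbols are equiprobable: H0 = log2(N).
        return log2(alphabet_size)

    def first_order_entropy(text: str) -> float:
        # First-order entropy uses observed symbol frequencies:
        # H1 = -sum(p * log2(p)) over the symbols of the text.
        counts = Counter(text)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # Checking the reported figures rather than re-deriving them from the corpus.
    # The thesis presumably used fuller-precision entropies, hence the tiny
    # rounding gap versus its 0.8242 and 17.58%.
    h0, h1 = 5.52, 4.55
    uncertainty = h1 / h0         # ≈ 0.8243 from the rounded values
    redundancy = 1 - uncertainty  # ≈ 0.1757, i.e. about 17.6%
    print(f"uncertainty ≈ {uncertainty:.4f}, redundancy ≈ {redundancy:.2%}")
    ```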

    Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents

    Author identification is an important aspect of literary analysis studied in natural language processing (NLP). It helps identify the most probable author of, for example, articles, news texts, or social media comments and tweets. It can also be applied to other domains such as criminal and civil cases, cybersecurity, forensics, plagiarism detection, and many more. An automated system in this context can thus be very beneficial for society. In this paper, we propose a convolutional neural network (CNN)-based author identification system for literary articles. This system uses visual features with a five-layer convolutional neural network to identify authors. The prime motivation behind this approach was the feasibility of identifying distinct writing styles through a visualization of writing patterns. Experiments were performed on 1200 articles from 50 authors, achieving a maximum accuracy of 93.58%. Furthermore, to see how the system performs on different volumes of data, the experiments were repeated on partitions of the dataset. The system outperformed standard handcrafted-feature-based techniques as well as established works on publicly available datasets.
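
    For readers unfamiliar with the setup, here is a minimal PyTorch sketch of a five-layer CNN classifier of the kind described above. The input representation (grayscale 128x128 renderings of text), the channel widths, and the class count are illustrative assumptions, not the paper's exact configuration:

    ```python
    import torch
    import torch.nn as nn

    class AuthorCNN(nn.Module):
        def __init__(self, num_authors: int = 50):
            super().__init__()
            chans = [1, 32, 64, 128, 256, 256]  # five conv blocks
            blocks = []
            for c_in, c_out in zip(chans, chans[1:]):
                blocks += [
                    nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(2),  # halve spatial size after each block
                ]
            self.features = nn.Sequential(*blocks)
            self.classifier = nn.Linear(256 * 4 * 4, num_authors)  # 128 / 2**5 = 4

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.features(x)
            return self.classifier(x.flatten(1))

    # Usage: a batch of 8 grayscale page images -> author logits.
    logits = AuthorCNN()(torch.randn(8, 1, 128, 128))
    print(logits.shape)  # torch.Size([8, 50])
    ```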

    Investigating features and techniques for Arabic authorship attribution

    Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem concerned with revealing the genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort devoted to solving this problem. Initially these efforts were based on statistical patterns; more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of ‘function words’ – typically pronouns, conjunctions and prepositions – as the features on which to base the discovery of patterns of usage relevant to specific authors. The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can also be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data, and to do so with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach. The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfactory levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) to investigate authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
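
    The hybrid idea can be sketched compactly. Below, an evolutionary search over binary feature masks is scored by LDA cross-validated accuracy with a small penalty on subset size; the toy data, operators, and parameters are invented stand-ins for the thesis's actual dataset and algorithm:

    ```python
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Toy stand-ins for function-word frequency vectors:
    # 120 text segments x 100 candidate function words, 6 authors.
    X = rng.random((120, 100))
    y = rng.integers(0, 6, size=120)

    def fitness(mask: np.ndarray) -> float:
        # Score a candidate feature subset by LDA cross-validated accuracy,
        # lightly penalizing subset size (the thesis favours small subsets).
        if mask.sum() == 0:
            return 0.0
        acc = cross_val_score(
            LinearDiscriminantAnalysis(), X[:, mask.astype(bool)], y, cv=3
        ).mean()
        return acc - 0.001 * mask.sum()

    # Bare-bones evolutionary loop: keep the fittest masks, mutate copies of them.
    pop = rng.integers(0, 2, size=(20, X.shape[1]))
    for generation in range(30):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-10:]]
        children = parents.copy()
        flips = rng.random(children.shape) < 0.02  # flip ~2% of bits
        children[flips] ^= 1
        pop = np.vstack([parents, children])

    best = pop[np.argmax([fitness(ind) for ind in pop])]
    print("selected features:", int(best.sum()))
    ```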

    Attentional Control Processing in Working Memory: Effects of Aging and Bilingualism

    Selective attention is required for working memory and is theorized to underlie the process of selecting between two active languages in bilinguals. Studies of working memory performance and bilingualism have produced divergent results, and neural investigations are still in their early stages. The purpose of the current series of studies, using older and younger bilingual and monolingual adults, was to examine working memory processing by manipulating attentional control demands and task domain. It was hypothesized that bilinguals in both age groups would outperform monolinguals when verbal demands are low and when attentional control demands are high. Study 1 included behavioural tasks that varied by domain and attentional control. Study 2 addressed these factors by examining the neural correlates of maintenance and updating using ERPs. As a third analytic approach, a partial least squares (PLS) analysis was performed on the recognition data from Study 2 to assess contrasting group patterns of amplitude and signal variability using multiscale entropy (MSE). Bilingual performance was poorer than that of monolinguals when the task involved verbal production, but bilinguals outperformed monolinguals when the task involved nonverbal interference resolution. P3 amplitude was largely impacted by attentional demands and aging, whereas language group differences were limited. Extensive language and age group differences emerged once whole-brain neural patterns were examined. Bilingual older adults displayed a neural signature similar to that of younger adults for both amplitude and MSE measures. Older adult monolinguals did not show these patterns and required additional frontal resources for the difficult spatial update condition. Younger bilinguals showed long-range, frontal-parietal MSE patterns for updating in working memory. These results are consistent with the interpretation of brain functional reorganization for bilingual working memory processing and may represent adaptations to a top-down attentional control mechanism.
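
    For reference, multiscale entropy itself is a concrete computation: coarse-grain the signal at successive time scales and take the sample entropy at each scale. The sketch below uses the conventional parameters m = 2 and r = 0.2 x std; the study's exact settings and toolchain are not given here:

    ```python
    import numpy as np

    def sample_entropy(x: np.ndarray, m: int = 2, r: float | None = None) -> float:
        # Sample entropy: -log of the conditional probability that sequences
        # matching for m points (within tolerance r) also match for m + 1.
        r = 0.2 * x.std() if r is None else r
        def match_pairs(mm: int) -> float:
            templates = np.array([x[i:i + mm] for i in range(len(x) - mm)])
            dists = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
            return (np.sum(dists <= r) - len(templates)) / 2  # exclude self-matches
        b, a = match_pairs(m), match_pairs(m + 1)
        return -np.log(a / b) if a > 0 and b > 0 else np.inf

    def multiscale_entropy(x: np.ndarray, scales=range(1, 6)) -> list[float]:
        values = []
        for tau in scales:
            n = len(x) // tau
            coarse = x[: n * tau].reshape(n, tau).mean(axis=1)  # coarse-grain at scale tau
            # Note: r is recomputed per scale here; some implementations fix
            # it from the scale-1 signal instead.
            values.append(sample_entropy(coarse))
        return values

    # Toy usage on white noise; real inputs would be single-trial EEG segments.
    print(multiscale_entropy(np.random.default_rng(0).standard_normal(500)))
    ```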

    Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

    Automatic speech recognition (ASR) has recently become an important challenge for deep learning (DL): it requires large-scale training datasets and high computational and storage resources. Moreover, DL techniques, and machine learning (ML) approaches in general, assume that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, does not hold in some real-world artificial intelligence (AI) applications. There are also situations where gathering real data is challenging, expensive, or the events of interest occur rarely, so the data requirements of DL models cannot be met. Deep transfer learning (DTL) has been introduced to overcome these issues; it helps develop high-performing models using real datasets that are small, or slightly different from but related to the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and to help academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to organize the state of the art. A critical analysis is then conducted to identify the limitations and advantages of each framework. A comparative study then highlights the current challenges before deriving opportunities for future research.
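
    The core recipe underlying most DTL-based ASR frameworks can be sketched in a few lines: reuse an encoder trained on a large source domain, freeze it, and retrain only a new task head on the small target dataset. The tiny model, shapes, and names below are invented for illustration:

    ```python
    import torch
    import torch.nn as nn

    class TinyAcousticModel(nn.Module):
        def __init__(self, n_mels: int = 80, vocab: int = 32):
            super().__init__()
            self.encoder = nn.Sequential(      # pretrained on the source domain
                nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            self.head = nn.Linear(256, vocab)  # re-initialized for the target task

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, n_mels, frames) -> per-frame token logits
            return self.head(self.encoder(feats).transpose(1, 2))

    model = TinyAcousticModel()
    for p in model.encoder.parameters():
        p.requires_grad = False                # freeze the transferred layers
    optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
    logits = model(torch.randn(4, 80, 200))    # dummy target-domain batch
    print(logits.shape)                        # torch.Size([4, 200, 32])
    ```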

    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only a very few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. To this end we investigate the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.