1,045 research outputs found

    Understanding Word Embedding Stability Across Languages and Applications

    Full text link
    Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this thesis, we consider several aspects of embedding spaces, including their stability. First, we propose a definition of stability, and show that common English word embeddings are surprisingly unstable. We explore how properties of data, words, and algorithms relate to instability. We extend this work to approximately 100 world languages, considering how linguistic typology relates to stability. Additionally, we consider contextualized output embedding spaces. Using paraphrases, we explore properties and assumptions of BERT, a popular embedding algorithm. Second, we consider how stability and other word embedding properties affect tasks where embeddings are commonly used. We consider both word embeddings used as features in downstream applications and corpus-centered applications, where embeddings are used to study characteristics of language and individual writers. In addition to stability, we also consider other word embedding properties, specifically batching and curriculum learning, and how methodological choices made for these properties affect downstream tasks. Finally, we consider how knowledge of stability affects how we use word embeddings. Throughout this thesis, we discuss strategies to mitigate instability and provide analyses highlighting the strengths and weaknesses of word embeddings in different scenarios and languages. We show areas where more work is needed to improve embeddings, and we show where embeddings are already a strong tool.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162917/1/lburdick_1.pd

    Supporting the Chinese Language in Oracle Text

    Get PDF
    Gegenstand dieser Arbeit sind die Problematik von chinesischem Information Retrieval (IR) sowie die Faktoren, die die Leistung eines chinesischen IR-System beeinflussen können. Experimente wurden im Rahmen des Bewertungsmodells von „TREC-5 Chinese Track“ und der Nutzung eines großen Korpusses von über 160.000 chinesischen Nachrichtenartikeln auf einer Oracle10g (Beta Version) Datenbank durchgeführt. Schließlich wurde die Leistung von Oracle® Text in einem so genannten „Benchmarking“ Prozess gegenüber den Ergebnissen der Teilnehmer von TREC-5 verglichen. Die Hauptergebnisse dieser Arbeit sind: (a) Die Wirksamkeit eines chinesischen IR Systems ist durch die Art und Weise der Formulierung einer Abfrage stark beeinflusst. Besonders sollte man während der Formulierung einer Anfrage die Vielzahl von Abkürzungen und die regionalen Unterschiede in der chinesischen Sprache, sowie die verschiedenen Transkriptionen der nicht-chinesischen Eigennamen beachten; (b) Stopwords haben keinen Einfluss auf die Leistungsfähigkeit eines chinesischen IR Systems; (c) die Benutzer neigen dazu, kürzere Abfragen zu formulieren, und die Suchergebnisse sind besonders schlecht, wenn Feedback und Expansion von Anfragen („query expansion“) nicht genutzt werden; (d) im Vergleich zu dem Chinese_Vgram_Lexer, hat der Chinese_Lexer den Vorteil, reale Wörter und einen kleineren Index zu erzeugen, sowie höhere Präzision in den Suchergebnissen zu erzielen; und (e) die Leistung von Oracle® Text für chinesisches IR ist vergleichbar mit den Ergebnissen von TREC-5

    A Pointillism Approach for Natural Language Processing of Social Media

    Get PDF
    Natural language processing tasks typically start with the basic unit of words, and then from words and their meanings a big picture is constructed about what the meanings of documents or other larger constructs are in terms of the topics discussed. Social media is very challenging for natural language processing because it challenges the notion of a word. Social media users regularly use words that are not in even the most comprehensive lexicons. These new words can be unknown named entities that have suddenly risen in prominence because of a current event, or they might be neologisms newly created to emphasize meaning or evade keyword filtering. Chinese social media is particularly challenging. The Chinese language poses challenges for natural language processing based on the unit of a word even for formal uses of the Chinese language, social media only makes word segmentation in Chinese even more difficult. Thus, even knowing what the boundaries of words are in a social media corpus is a difficult proposition. For these reasons, in this document I propose the Pointillism approach to natural language processing. In the pointillism approach, language is viewed as a time series, or sequence of points that represent the grams\u27 usage over time. Time is an important aspect of the Pointillism approach. Detailed timing information, such as timestamps of when posts were posted, contain correlations based on human patterns and current events. This timing information provides the necessary context to build words and phrases out of trigrams and then group those words and phrases into topical clusters. Rather than words that have individual meanings, the basic unit of the pointillism approach is trigrams of characters. These grams take on meaning in aggregate when they appear together in a way that is correlated over time. I anticipate that the pointillism approach can perform well in a variety of natural language processing tasks for many different languages, but in this document my focus is on trend analysis for Chinese microblogging. Microblog posts have a timestamp of when posts were posted, that is accurate to the minute or second (though, in this dissertation, I bin posts by the hour). To show that trigrams supplemented with frequency information do collect scattered information into meaningful pieces, I first use the pointillism approach to extract phrases. I conducted experiments on 4-character idioms, a set of 500 phrases that are longer than 3 characters taken from the Chinese-language version of Wiktionary, and also on Weibo\u27s hot keywords. My results show that when words and topics do have a meme-like trend, they can be reconstructed from only trigrams. For example, for 4-character idioms that appear at least 99 times in one day in my data, the unconstrained precision (that is, precision that allows for deviation from a lexicon when the result is just as correct as the lexicon version of the word or phrase) is 0.93. For longer words and phrases collected from Wiktionary, including neologisms, the unconstrained precision is 0.87. I consider these results to be very promising, because they suggest that it is feasible for a machine to reconstruct complex idioms, phrases, and neologisms with good precision without any notion of words. Next, I examine the potential of the pointillism approach for extracting topical trends from microblog posts that are related to environmental issues. Independent Component Analysis (ICA) is utilized to find the trigrams which have the same independent signal source, i.e., topics. Contrast this with probabilistic topic models, which leverage co-occurrence to classify the documents into the topics they have learned, so it is hard for it to extract topics in real-time. However, pointillism approach can extract trends in real-time, whether those trends have been discussed before or not. This is more challenging because in phrase extraction, order information is used to narrow down the candidates, whereas for trend extraction only the frequency of the trigrams are considered. The proposed approach is compared against a state of the art topic extraction technique, Latent Dirichlet Allocation (LDA), on 9,147 labelled posts with timestamps. The experimental results show that the highest F1 score of the pointillism approach with ICA is 4% better than that of LDA. Thus, using the pointillism approach, the colorful and baroque uses of language that typify social media in challenging languages such as Chinese may in fact be accessible to machines. The thesis that my dissertation tests is this: For topic extraction for scenarios where no adequate lexicon is available, such as social media, the Pointillism approach uses timing information to out-perform traditional techniques that are based on co-occurrence

    A framework for ancient and machine-printed manuscripts categorization

    Get PDF
    Document image understanding (DIU) has attracted a lot of attention and became an of active fields of research. Although, the ultimate goal of DIU is extracting textual information of a document image, many steps are involved in a such a process such as categorization, segmentation and layout analysis. All of these steps are needed in order to obtain an accurate result from character recognition or word recognition of a document image. One of the important steps in DIU is document image categorization (DIC) that is needed in many situations such as document image written or printed in more than one script, font or language. This step provides useful information for recognition system and helps in reducing its error by allowing to incorporate a category-specific Optical Character Recognition (OCR) system or word recognition (WR) system. This research focuses on the problem of DIC in different categories of scripts, styles and languages and establishes a framework for flexible representation and feature extraction that can be adapted to many DIC problem. The current methods for DIC have many limitations and drawbacks that restrict the practical usage of these methods. We proposed an efficient framework for categorization of document image based on patch representation and Non-negative Matrix Factorization (NMF). This framework is flexible and can be adapted to different categorization problem. Many methods exist for script identification of document image but few of them addressed the problem in handwritten manuscripts and they have many limitations and drawbacks. Therefore, our first goal is to introduce a novel method for script identification of ancient manuscripts. The proposed method is based on patch representation in which the patches are extracted using skeleton map of a document images. This representation overcomes the limitation of the current methods about the fixed level of layout. The proposed feature extraction scheme based on Projective Non-negative Matrix Factorization (PNMF) is robust against noise and handwriting variation and can be used for different scripts. The proposed method has higher performance compared to state of the art methods and can be applied to different levels of layout. The current methods for font (style) identification are mostly proposed to be applied on machine-printed document image and many of them can only be used for a specific level of layout. Therefore, we proposed new method for font and style identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The images are represented by overlapping patches obtained from the foreground pixels. The position of these patches are set based on skeleton map to reduce the number of patches. Non-Negative Matrix Tri-Factorization is used to learn bases from each fonts (style) and then these bases are used to classify a new image based on minimum representation error. The proposed method can easily be extended to new fonts as the bases for each font are learned separately from the other fonts. This method is tested on two datasets of machine-printed and ancient manuscript and the results confirmed its performance compared to the state of the art methods. Finally, we proposed a novel method for language identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The current methods for language identification are based on textual data obtained by OCR engine or images data through coding and comparing with textual data. The OCR based method needs lots of processing and the current image based method are not applicable to cursive scripts such as Arabic. In this work we introduced a new method for language identification of machine-printed and handwritten manuscripts based on patch representation and NMTF. The patch representation provides the component of the Arabic script (letters) that can not be extracted simply by segmentation methods. Then NMTF is used for dictionary learning and generating codebooks that will be used to represent document image with a histogram. The proposed method is tested on two datasets of machine-printed and handwritten manuscripts and compared to n-gram features (text-based), texture features and codebook features (imagebased) to validate the performance. The above proposed methods are robust against variation in handwritings, changes in the font (handwriting style) and presence of degradation and are flexible that can be used to various levels of layout (from a textline to paragraph). The methods in this research have been tested on datasets of handwritten and machine-printed manuscripts and compared to state-of-the-art methods. All of the evaluations show the efficiency, robustness and flexibility of the proposed methods for categorization of document image. As mentioned before the proposed strategies provide a framework for efficient and flexible representation and feature extraction for document image categorization. This frame work can be applied to different levels of layout, the information from different levels of layout can be merged and mixed and this framework can be extended to more complex situations and different tasks

    Acoustic Modelling for Under-Resourced Languages

    Get PDF
    Automatic speech recognition systems have so far been developed only for very few languages out of the 4,000-7,000 existing ones. In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time and cost effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages

    Power System Simulation, Control and Optimization

    Get PDF
    This Special Issue “Power System Simulation, Control and Optimization” offers valuable insights into the most recent research developments in these topics. The analysis, operation, and control of power systems are increasingly complex tasks that require advanced simulation models to analyze and control the effects of transformations concerning electricity grids today: Massive integration of renewable energies, progressive implementation of electric vehicles, development of intelligent networks, and progressive evolution of the applications of artificial intelligence

    Lexicography of coronavirus-related neologisms

    Get PDF
    This volume brings together contributions by international experts reflecting on Covid19-related neologisms and their lexicographic processing and representation. The papers analyze new words, new meanings of existing words, and new multiword units, where they come from, how they are transmitted (or differ) across languages, and how their use and meaning are reflected in dictionaries of all sorts. Recent trends in as many as ten languages are considered, including general and specialized language, monolingual as well as bilingual and printed as well as online dictionaries

    Measuring Linguistic and Cultural Evolution Using Books and Tweets

    Get PDF
    Written language provides a snapshot of linguistic, cultural, and current events information for a given time period. Aggregating these snapshots by studying many texts over time reveals trends in the evolution of language, culture, and society. The ever-increasing amount of electronic text, both from the digitization of books and other paper documents to the increasing frequency with which electronic text is used as a means of communication, has given us an unprecedented opportunity to study these trends. In this dissertation, we use hundreds of thousands of books spanning two centuries scanned by Google, and over 100 billion messages, or ‘tweets’, posted to the social media platform, Twitter, over the course of a decade to study the English language, as well as study the evolution of culture and society as inferred from the changes in language. We begin by studying the current state of verb regularization and how this compares between the more formal writing of books and the more colloquial writing of tweets on Twitter. We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books, and also for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables. Next, we study stretchable words, a fundamental aspect of spoken language that, until the advent of social media, was rarely observed within written language. We examine the frequency distributions of stretchable words and introduce two central parameters that capture their main characteristics of balance and stretch. We explore their dynamics by creating visual tools we call ‘balance plots’ and ‘spelling trees’. We also discuss how the tools and methods we develop could be used to study mistypings and misspellings, and may have further applications both within and beyond language. Finally, we take a closer look at the English Fiction n-gram dataset created by Google. We begin by explaining why using token counts as a proxy of word, or more generally, ‘n-gram’, importance is fundamentally flawed. We then devise a method to rebuild the Google Books corpus so that meaningful linguistic and cultural trends may be reliably discerned. We use book counts as the primary ranking for an n-gram and use subsampling to normalize across time to mitigate the extraneous results created by the underlying exponential increase in data volume over time. We also combine the subsampled data over a number of years as a method of smoothing. We then use these improved methods to study linguistic and cultural evolution across the last two centuries. We examine the dynamics of Zipf distributions for n-grams by measuring the churn of language reflected in the flux of n-grams across rank boundaries. Finally, we examine linguistic change using wordshift plots and a rank divergence measure with a tunable parameter to compare the language of two different time periods. Our results address several methodological shortcomings associated with the raw Google Books data, strengthening the potential for cultural inference by word changes

    Lexicography of Coronavirus-related Neologisms

    Get PDF
    This volume brings together contributions by international experts reflecting on Covid19-related neologisms and their lexicographic processing and representation. The papers analyze new words, new meanings of existing words, and new multiword units in as many as ten languages, considering both specialized and general language, monolingual as well as bilingual and printed as well as online dictionaries
    corecore