2,214 research outputs found

    Weak supervision and label noise handling for Natural language processing in low-resource scenarios

    The lack of large amounts of labeled data is a significant factor blocking many low-resource languages and domains from catching up with recent advancements in natural language processing. To reduce this dependency on labeled instances, weak supervision (semi-)automatically annotates unlabeled data. These labels can be obtained more quickly and cheaply than manual, gold-standard annotations. They also, however, contain more errors. Handling these noisy labels is often required to leverage the weakly supervised data successfully. In this dissertation, we study the whole weak supervision pipeline with a focus on the task of named entity recognition. We develop a tool for automatic annotation, and we propose an approach to model label noise when a small amount of clean data is available. We study the factors that influence the noise model's quality from a theoretical perspective, and we validate this approach empirically on several different tasks and languages. An important aspect is the aim for a realistic evaluation. We perform our analysis on, among others, several African low-resource languages. We show the performance benefits that can be achieved using weak supervision and label noise modeling, but we also highlight open issues that the field still has to overcome. For low-resource settings, we expand the analysis to few-shot learning. For classification errors, we present a novel approach to obtain interpretable insights into where classifiers fail.
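
    As a minimal illustration of the label-noise modeling idea (a sketch only; the confusion-matrix formulation, smoothing, and tag set below are assumptions for the example, not the dissertation's exact method), a noise model can be estimated by comparing weak and gold labels on the small clean subset and then used to map a classifier's clean-label predictions to the distribution of observed weak labels:

    # Sketch: estimate a label-noise (confusion) matrix from a small clean set
    # and use it to relate a classifier's predictions to the weak labels.
    import numpy as np

    TAGS = ["O", "PER", "LOC", "ORG"]  # illustrative NER tag set

    def estimate_noise_matrix(clean_gold, clean_weak, n_tags):
        """Row-normalized counts: P(weak label j | true label i)."""
        counts = np.full((n_tags, n_tags), 1e-6)  # smoothing avoids empty rows
        for gold, weak in zip(clean_gold, clean_weak):
            counts[gold, weak] += 1.0
        return counts / counts.sum(axis=1, keepdims=True)

    def expected_weak_probs(model_probs, noise_matrix):
        """Map (batch, n_tags) clean-label probabilities to the expected
        distribution over observed weak labels; training against the weak
        labels with cross-entropy then accounts for the noise."""
        return model_probs @ noise_matrix

    # toy usage on six tokens from the clean subset
    gold = [0, 1, 1, 2, 3, 0]
    weak = [0, 1, 2, 2, 3, 0]
    T = estimate_noise_matrix(gold, weak, len(TAGS))
    print(expected_weak_probs(np.full((2, len(TAGS)), 0.25), T))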

    Enhancing scene text recognition with visual context information

    This thesis addresses the problem of improving text spotting systems, which aim to detect and recognize text in unrestricted images (e.g. a street sign, an advertisement, a bus destination, etc.). The goal is to improve the performance of off-the-shelf vision systems by exploiting the semantic information derived from the image itself. The rationale is that knowing the content of the image, or the visual context, can help to decide which candidate words are correct. For example, the fact that an image shows a coffee shop makes it more likely that a word on a signboard reads as Dunkin and not unkind. We address this problem by drawing on successful developments in natural language processing and machine learning, in particular learning to re-rank and neural networks, to present post-processing frameworks that improve state-of-the-art text spotting systems without the need for costly data-driven re-training or tuning procedures. Discovering the degree of semantic relatedness of candidate words and their image context is a task related to assessing the semantic similarity between words or text fragments. However, semantic relatedness is more general than similarity (e.g. car, road, and traffic light are related but not similar) and requires certain adaptations. To meet the requirements of this broader perspective of semantic similarity, we develop two approaches to learn the semantic relatedness of the spotted word and its environmental context: word-to-word (object) or word-to-sentence (caption). In the word-to-word approach, word-embedding-based re-rankers are developed. The re-ranker takes the words from the text spotting baseline and re-ranks them based on the visual context from the object classifier. For the second, an end-to-end neural approach is designed to exploit the image description (caption) at the sentence level as well as the word level (objects) and re-rank candidate words based not only on the visual context but also on their co-occurrence with the caption. As an additional contribution, to meet the requirements of data-driven approaches such as neural networks, we propose a visual context dataset for this task, in which the publicly available COCO-text dataset [Veit et al. 2016] has been extended with information about the scene (including the objects and places appearing in the image) to enable researchers to include the semantic relations between texts and scene in their text spotting systems, and to offer a common evaluation baseline for such approaches.
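
    As a toy illustration of the word-to-word re-ranking idea (a sketch under assumptions: the embeddings, the max-relatedness aggregation, and the linear score mixing are placeholders, not the thesis's trained re-ranker), candidate words from the spotter can be re-scored by their embedding similarity to the object labels predicted for the image:

    # Sketch: re-rank text-spotting candidates by semantic relatedness to the
    # visual context (object labels from an image classifier).
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

    def rerank(candidates, context_words, embed, alpha=0.5):
        """candidates: list of (word, spotter_score); context_words: object labels.
        Returns candidates sorted by a mix of spotter score and relatedness."""
        rescored = []
        for word, score in candidates:
            relatedness = max(cosine(embed[word], embed[c]) for c in context_words)
            rescored.append((word, alpha * score + (1 - alpha) * relatedness))
        return sorted(rescored, key=lambda x: x[1], reverse=True)

    # toy 2-d vectors standing in for pretrained word embeddings
    embed = {
        "dunkin": np.array([0.9, 0.1]), "unkind": np.array([0.1, 0.9]),
        "coffee": np.array([0.8, 0.2]), "shop": np.array([0.7, 0.3]),
    }
    # the coffee-shop context promotes "dunkin" over the higher-scored "unkind"
    print(rerank([("unkind", 0.55), ("dunkin", 0.50)], ["coffee", "shop"], embed))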

    Evolution of A Common Vector Space Approach to Multi-Modal Problems

    A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video frame background modeling involved methods that ranged from simple and efficient to accurate but computationally complex. It is shown in this research that the proposed approach to object extraction is efficient and effective, outperforming existing state-of-the-art methods. However, the drawback of this method is its inability to deal with non-rigid motion. With the rapid development of artificial neural networks, deep learning approaches are explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a common vector space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNN); greater RNN depth leads to smaller error, but makes the gradients in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In BNRHN, the highway layers incorporate batch normalization, which diminishes the vanishing and exploding gradient problems. In addition, a sentence-to-vector encoding framework that is suitable for advanced natural language processing is developed. This semantic text embedding makes use of the encoder-decoder model, which is trained on sentence paraphrase pairs (text-to-text). With this scheme, the latent representation is shown to map sentences with common semantic information to similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), encodes different modalities into a common vector space, with the goal of preserving semantics and making conversion between text and image bidirectional. The concept of CVS is introduced in this research to deal with multi-modal conversion problems. In theory, this method works not only on text and image, but can also be generalized to other modalities, such as video and audio. The characteristics and performance are supported by both theoretical analysis and experimental results. Interestingly, the MMVR model is one of many possible ways to build a CVS. In the final stages of this research, a simple and straightforward framework to build a CVS, considered an alternative to the MMVR model, is presented.
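
    As a minimal sketch of the common vector space idea (the projection heads, feature dimensions, and normalization below are illustrative assumptions, not the MMVR architecture), two modalities can be projected into a shared space where dot products act as cross-modal similarities:

    # Sketch: project image and text features into a common vector space (CVS)
    # and compare them with cosine similarity.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CommonVectorSpace(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, shared_dim)  # image-side head
            self.txt_proj = nn.Linear(txt_dim, shared_dim)  # text-side head

        def forward(self, img_feat, txt_feat):
            # L2-normalize so dot products become cosine similarities
            z_img = F.normalize(self.img_proj(img_feat), dim=-1)
            z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
            return z_img, z_txt

    model = CommonVectorSpace()
    img = torch.randn(4, 2048)  # e.g. CNN features for 4 video frames
    txt = torch.randn(4, 768)   # e.g. sentence-encoder features for 4 captions
    z_img, z_txt = model(img, txt)
    print((z_img @ z_txt.t()).shape)  # 4x4 image-caption similarity matrix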

    Volume 6 Number 3

    The Half-Fairness of Google's Plan to Make the World's Collection of Books Searchable

    Google's major new initiative is to undertake the task of digitizing the world's collection of books so as to make them searchable. The very idea is audacious, but what is more so is that Google plans to copy without first seeking the permission of the owners of these works. Google Print would make available what is, by conventional measures at least, the highest grade of information: books produced by millions of the world's leading scholars. This is in stark contrast to the inconsistent quality spectrum one encounters through other online sources such as peer-to-peer networks and blogs, where there currently exists little mechanism for peer review or other means of quality control. What Google proposes to do is either the largest example of copyright infringement in history or the largest example of fair use in history. [...] Two major lawsuits have been filed against Google. The American Association of University Presses, which represents 125 university presses, has sued Google, seeking a declaration that Google is committing copyright infringement by scanning books and an injunction against Google Print. A second lawsuit, a class action representing published authors and The Authors Guild, seeks declaratory and injunctive relief and money damages as well. The outcome of these lawsuits is far from clear and the stakes are huge. [...] Part I of this Article will set out the complex set of facts leading up to the filing of the Google Print lawsuit. Part II will examine the legal and doctrinal issues presented by these facts. I will argue that plaintiffs have a solid prima facie case for massive copyright infringement on a scale never before seen. Google, however, will be able to counter with a compelling and innovative use of the fair use defense. Part III will begin to develop an economic and policy framework for examining and debating the various policy issues raised by the Google Print project and by internet search engines more generally. This analysis will seek to answer important questions regarding the shape and structure that regulation of internet search engines should take. I will argue that courts seeking to maximize social welfare should adopt a bifurcated approach under which fair use rights are accorded to Google with respect to the copyright holders of orphan works, but not with respect to the holders of non-orphan works. This approach is necessary to deal with the legacy problem presented by orphan works created with non-digital technologies, and thus associated with a more onerous set of transaction costs attached to their accessibility. On a going-forward basis, however, creators of works will be properly incentivized under the approach developed here to protect their works. Thus, over time, a non-bifurcated regime of regulation will emerge. This will perhaps delay, but not impede, the development of Google Print, or a functional equivalent, and will foster the development of a richer market in books and creative works more generally.
