13 research outputs found
AI-tocracy
Can frontier innovation be sustained under autocracy? We argue that innovation and autocracy can be mutually reinforcing when: (i) the new technology bolsters the autocrat's power; and (ii) the autocrat's demand for the technology stimulates further innovation in applications beyond those benefiting it directly. We test for such a mutually reinforcing relationship in the context of facial recognition AI in China. To do so, we gather comprehensive data on AI firms and government procurement contracts, as well as on social unrest across China during the last decade. We first show that autocrats benefit from AI: local unrest leads to greater government procurement of facial recognition AI, and increased AI procurement suppresses subsequent unrest. We then show that AI innovation benefits from autocrats' suppression of unrest: the contracted AI firms innovate more both for the government and commercial markets. Taken together, these results suggest the possibility of sustained AI innovation under the Chinese regime: AI innovation entrenches the regime, and the regime's investment in AI for political control stimulates further frontier innovation.
Cross-Lingual Transfer of Natural Language Processing Systems
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
Our contributions are summarized as follows:
* We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
* We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations for going beyond the traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings in explicit lexical features from the target language in our direct transfer method.
* We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus avoiding learning an unsuitable model for an unrelated language. Our experimental results show substantial improvements for non-European languages.
* We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties that we introduce in this thesis indicate the usefulness of transfer methods. This is appealing in practice, especially since we suggest eliminating the requirement for annotating new datasets for low-resource languages, as such annotations are expensive, if not impossible, to obtain.
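The first contribution above (alignment density as a reliability signal for annotation projection) can be illustrated with a minimal sketch. It projects dependency heads from a source sentence onto a target sentence through word alignments, and keeps the projection only when enough target words are aligned. The function names, the one-to-one alignment format, and the 0.8 threshold are assumptions made for illustration, not details taken from the thesis.

```python
from typing import Dict, List, Optional


def alignment_density(alignment: Dict[int, int], target_len: int) -> float:
    """Fraction of target tokens that are aligned to some source token."""
    if target_len == 0:
        return 0.0
    return len(set(alignment.values())) / target_len


def project_heads(
    source_heads: List[int], alignment: Dict[int, int], target_len: int
) -> List[Optional[int]]:
    """Project dependency heads through a source->target word alignment.

    source_heads[i] is the 0-based head index of source token i (-1 for root).
    Returns one entry per target token: a head index, -1 for root,
    or None when no head could be projected.
    """
    projected: List[Optional[int]] = [None] * target_len
    for src_dep, tgt_dep in alignment.items():
        src_head = source_heads[src_dep]
        if src_head == -1:
            projected[tgt_dep] = -1
        elif src_head in alignment:
            projected[tgt_dep] = alignment[src_head]
    return projected


def project_if_dense(
    source_heads: List[int],
    alignment: Dict[int, int],
    target_len: int,
    min_density: float = 0.8,  # hypothetical threshold
) -> Optional[List[Optional[int]]]:
    """Return projected heads only for densely aligned sentence pairs."""
    if alignment_density(alignment, target_len) < min_density:
        return None
    return project_heads(source_heads, alignment, target_len)


# Source "She reads books": heads [1, -1, 1]; the target has a different word order.
dense = project_if_dense([1, -1, 1], {0: 0, 1: 2, 2: 1}, target_len=3)
sparse = project_if_dense([1, -1, 1], {1: 1}, target_len=3)
```

Here `dense` holds a full projected tree, while `sparse` is `None` because only one of three target words is aligned, so the sentence pair is discarded.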
Cross-Lingual and Low-Resource Sentiment Analysis
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
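A minimal sketch of what lexicalizing training data with a bilingual dictionary could look like is given below: source-language tokens are swapped for target-language translations wherever the dictionary has an entry, and left unchanged otherwise, so the sentiment label of the training example carries over. The function name and the toy English-Spanish dictionary are hypothetical and are not taken from the work itself.

```python
from typing import Dict, List


def lexicalize(tokens: List[str], bilingual_dict: Dict[str, str]) -> List[str]:
    """Replace each source token with its dictionary translation when one exists.

    Lookup is case-insensitive; tokens without an entry are kept as they are.
    """
    return [bilingual_dict.get(tok.lower(), tok) for tok in tokens]


# Toy dictionary, for illustration only.
toy_dict = {"good": "bueno", "movie": "película"}
lexicalized = lexicalize(["A", "good", "movie"], toy_dict)
```

With this toy dictionary, `lexicalized` is `["A", "bueno", "película"]`; a real dictionary would cover far more of the vocabulary.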
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
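A comparative study of text detection and recognition approaches for restricted computing scenarios (Um estudo comparativo das abordagens de detecção e reconhecimento de texto para cenários de computação restrita)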
Advisors: Ricardo da Silva Torres, Allan da Silva Pinto
Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação
Abstract: Texts are fundamental elements for effective communication in our daily lives. The mobility of people and vehicles in urban environments and the search for a product of interest on a supermarket shelf are examples of activities in which understanding the textual elements present in the environment is essential to succeed in the task. Recently, several advances in computer vision have been reported in the literature, with the development of algorithms and methods that aim to recognize objects and texts in scenes. However, text detection and recognition are still open problems due to several factors that act as sources of variability during scene text generation and capture, which can significantly impact the detection and recognition rates of current algorithms. Examples of these factors include different shapes of textual elements (e.g., circular or curved), font styles and sizes, texture, color, and brightness and contrast variation, among others. Moreover, recent state-of-the-art methods based on deep learning demand high computational processing costs, which hinders their use in restricted computing scenarios. This dissertation presents a comparative study of text detection and recognition techniques, considering methods based on deep learning and methods that use classical machine learning algorithms.
This dissertation also presents an algorithm for fusing bounding boxes, based on genetic programming (GP), developed to act as a post-processing step for a single text detector and to explore the complementarity of the text detection algorithms investigated in this dissertation. According to the comparative study presented in this work, the methods based on deep learning are more effective but less efficient than the classic text detection methods investigated in this work, considering the adopted metrics. Furthermore, the proposed GP-based fusion algorithm was able to learn complementary information from the methods investigated in this dissertation, which resulted in an improvement in precision and recall rates. The experiments were conducted considering text detection problems involving horizontal, vertical, and arbitrary orientations.
Degree: Master's, Computer Science (Master in Computer Science). CAPE
Viability of Sequence Labeling Encodings for Dependency Parsing
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
This thesis presents new methods for recasting dependency parsing as
a sequence labeling task, yielding a viable alternative to the traditional
transition- and graph-based approaches. It is shown that sequence labeling
parsers provide several advantages for dependency parsing, such
as: (i) a good trade-off between accuracy and parsing speed, (ii) genericity
which enables running a parser in generic sequence labeling software
and (iii) pluggability which allows using full parse trees as features for
downstream tasks.
The backbone of dependency parsing as sequence labeling are the encodings
which serve as linearization methods for mapping dependency
trees into discrete labels, such that each token in a sentence is associated
with a label. We introduce three encoding families comprising: (i)
head selection, (ii) bracketing-based and (iii) transition-based encodings
which are differentiated by the way they represent a dependency
tree as a sequence of labels. We empirically examine the viability of
the encodings and provide an analysis of their facets.
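To make the idea of linearizing a tree into one label per token concrete, the sketch below shows a simple relative-offset variant of a head-selection encoding: each token's label records how far away its head is, together with the dependency relation. The label format (`offset_relation`), the function names, and the choice of plain positional offsets are assumptions for illustration; the encodings studied in the thesis may differ in detail.

```python
from typing import List, Tuple


def encode(heads: List[int], rels: List[str]) -> List[str]:
    """Turn a dependency tree into one label per token.

    heads[i] is the 1-based position of the head of token i+1 (0 for the root).
    Each label is "<signed offset to head>_<relation>", or "root_<relation>".
    """
    labels = []
    for pos, (head, rel) in enumerate(zip(heads, rels), start=1):
        offset = "root" if head == 0 else f"{head - pos:+d}"
        labels.append(f"{offset}_{rel}")
    return labels


def decode(labels: List[str]) -> Tuple[List[int], List[str]]:
    """Invert encode(): recover head positions and relations from labels."""
    heads, rels = [], []
    for pos, label in enumerate(labels, start=1):
        offset, rel = label.split("_", 1)
        heads.append(0 if offset == "root" else pos + int(offset))
        rels.append(rel)
    return heads, rels


# "She reads books": "reads" is the root, the other two tokens attach to it.
labels = encode([2, 0, 2], ["nsubj", "root", "obj"])
decoded = decode(labels)
```

Because every token receives exactly one label, any generic sequence labeling model can be trained to predict these labels, and `decode` then rebuilds the tree. A real parser would also need to repair predicted labels that do not form a well-formed tree, which this sketch omits.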
Furthermore, we explore the feasibility of leveraging external complementary
data in order to enhance parsing performance. Our sequence
labeling parser is endowed with two kinds of representations. First,
we exploit the complementary nature of dependency and constituency
parsing paradigms and enrich the parser with representations from both
syntactic abstractions. Secondly, we use human language processing
data to guide our parser with representations from eye movements.
Overall, the results show that recasting dependency parsing as sequence
labeling is a viable approach that is fast and accurate and provides
a practical alternative for integrating syntax in NLP tasks.
This work has been carried out thanks to the funding from
the European Research Council (ERC), under the European Union’s
Horizon 2020 research and innovation programme (FASTPARSE, grant
agreement No 714150).
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
Transfer learning, particularly approaches that combine multi-task learning
with pre-trained contextualized embeddings and fine-tuning, has advanced the
field of Natural Language Processing tremendously in recent years. In this
paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized
embeddings in multi-task settings. The benefits of MaChAmp are its flexible
configuration options, and the support of a variety of natural language
processing tasks in a uniform toolkit, from text classification and sequence
labeling to dependency parsing, masked language modeling, and text generation.
Comment: https://machamp-nlp.github.io
Cold-start universal information extraction
Who? What? When? Where? Why? are fundamental questions asked when gathering knowledge about and understanding a concept, topic, or event. The answers to these questions underpin the key information conveyed in the overwhelming majority, if not all, of language-based communication. At the core of my research in Information Extraction (IE) is the desire to endow machines with the ability to automatically extract, assess, and understand text in order to answer these fundamental questions. IE serves as one of the most important components for many downstream natural language processing (NLP) tasks, such as knowledge base completion, machine reading comprehension, and machine translation. The proliferation of the Web also intensifies the need to deal with enormous amounts of unstructured data from various sources, spanning different languages, genres, and domains.
When building an IE system, the conventional pipeline is to (1) ask expert linguists to rigorously define a target set of knowledge types we wish to extract by examining a large data set, (2) collect resources and human annotations for each type, and (3) design features and train machine learning models to extract knowledge elements. In practice, this process is very expensive, as each step involves extensive human effort that is not always available. For example, to specify the knowledge types for a particular scenario, both consumers and expert linguists need to examine a lot of data from that domain and write detailed annotation guidelines for each type. Hand-crafted schemas, which define the types and complex templates of the expected knowledge elements, often provide low coverage and fail to generalize to new domains. For example, none of the traditional event extraction programs, such as ACE (Automatic Content Extraction) and TAC-KBP, include "donation" and "evacuation" in their schemas in spite of their potential relevance to natural disaster management users. Additionally, these approaches are highly dependent on linguistic resources and human labeled data tuned to pre-defined types, so they suffer from poor scalability and portability when moving to a new language, domain, or genre.
The focus of this thesis is to develop effective theories and algorithms for IE which not only yield satisfactory quality by incorporating prior linguistic and semantic knowledge, but also achieve greater portability and scalability by moving away from the high cost and narrow focus of large-scale manual annotation. This thesis opens up a new research direction called Cold-Start Universal Information Extraction, where the full extraction and analysis starts from scratch and requires little or no prior manual annotation or pre-defined type schema. In addition to this new research paradigm, we also contribute effective algorithms and models towards resolving the following three challenges:
How can machines extract knowledge without any pre-defined types or any human annotated data? We develop an effective bottom-up and unsupervised Liberal Information Extraction framework based on the hypothesis that the meaning and underlying knowledge conveyed by linguistic expressions is usually embodied by their usages in language. This makes it possible to automatically induce a type schema based on rich contextual representations of all knowledge elements by combining their symbolic and distributional semantics using unsupervised hierarchical clustering.
How can machines benefit from available resources, e.g., large-scale ontologies or existing human annotations? My research has shown that pre-defined types can also be encoded by rich contextual or structured representations, through which knowledge elements can be mapped to their appropriate types. Therefore, we design a weakly supervised Zero-shot Learning and a Semi-Supervised Vector Quantized Variational Auto-Encoder approach that frames IE as a grounding problem instead of classification, where knowledge elements are grounded into any types from an extensible and large-scale target ontology or induced from the corpora, with available annotations for a few types.
How can IE approaches be extended to low-resource languages without any extra human effort? There are more than 6000 living languages in the real world, while public gold-standard annotations are only available for a few dominant languages. To facilitate the adaptation of these IE frameworks to other languages, especially low-resource languages, a Multilingual Common Semantic Space is further proposed to serve as a bridge for transferring existing resources and annotated data from dominant languages to more than 300 low-resource languages. Moreover, a Multi-Level Adversarial Transfer framework is also designed to learn language-agnostic features across various languages.