8 research outputs found

    Rich and Scalable Models for Text

    Topic models have become essential tools for uncovering hidden structures in big data. However, the most popular topic model algorithm, Latent Dirichlet Allocation (LDA), and its extensions suffer from sluggish performance on big datasets. Recently, the machine learning community has attacked this problem using spectral learning approaches such as the moment method with tensor decomposition or matrix factorization. The anchor word algorithm by Arora et al. [2013] has emerged as a more efficient approach to solving a large class of topic modeling problems. The anchor word algorithm is fast, and it has a provable theoretical guarantee: it will converge to a global solution given enough documents. In this thesis, we present a series of spectral models based on the anchor word algorithm to serve a broader class of datasets and to provide richer and more flexible modeling capacity. First, we improve the anchor word algorithm by incorporating various rich priors in the form of appropriate regularization terms. Our new regularized anchor word algorithms produce higher topic quality and provide the flexibility to incorporate informed priors, making it possible to discover topics better suited to external knowledge. Second, we enrich the anchor word algorithm with metadata-based word representation for labeled datasets. Our new supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA on three sentiment datasets. Also, sentiment anchor words, which play a vital role in generating sentiment topics, provide cues for understanding sentiment datasets better than unsupervised topic models do. Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new, dynamic, active learning framework that combines the concepts of informativeness and representativeness of documents, using dynamically updating topics from our fast supervised anchor word algorithm. Experiments using three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.

    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
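
The idea of sparse inference with an inverted index can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a multinomial naive Bayes classifier with add-alpha smoothing, and the names `train` and `score_sparse` are hypothetical. Each class score starts from a per-class "unseen word" baseline, and only the (word, class) pairs that actually occur in the training data, stored as postings in the index, add a correction. The result equals the dense log-probability, but the inner loop touches only non-zero entries.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """Build a sparse naive Bayes model from (label, tokens) pairs.

    Returns (log_prior, log_unseen, index), where index maps each word to
    postings of (label, delta) and delta = log P(w|c) - log P(unseen|c).
    """
    class_docs = Counter()          # number of documents per class
    class_tokens = Counter()        # number of tokens per class
    counts = defaultdict(Counter)   # word -> class -> count (non-zero only)
    for label, tokens in docs:
        class_docs[label] += 1
        for w in tokens:
            counts[w][label] += 1
            class_tokens[label] += 1
    n_docs = sum(class_docs.values())
    vocab_size = len(counts)
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    # Smoothed log-probability of a word never seen with class c.
    log_unseen = {
        c: math.log(alpha / (class_tokens[c] + alpha * vocab_size))
        for c in class_docs
    }
    # Inverted index: only (word, class) pairs with non-zero counts.
    index = {
        w: [(c, math.log((n + alpha) / alpha)) for c, n in by_class.items()]
        for w, by_class in counts.items()
    }
    return log_prior, log_unseen, index


def score_sparse(tokens, model):
    """Return log P(c) + sum_w tf(w) * log P(w|c) for every class c."""
    log_prior, log_unseen, index = model
    tf = Counter(w for w in tokens if w in index)  # drop out-of-vocabulary words
    length = sum(tf.values())
    # Baseline: pretend every token is unseen in every class.
    scores = {c: log_prior[c] + length * log_unseen[c] for c in log_prior}
    # Correct only the entries reachable through the postings lists.
    for w, n in tf.items():
        for c, delta in index[w]:
            scores[c] += n * delta
    return scores
```

In this sketch the baseline is still computed for every class; with very many classes one would presumably avoid even that pass, but how the thesis handles it is not stated in the abstract.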

    Représentations robustes de documents bruités dans des espaces homogènes

    In the Information Retrieval field, documents are usually considered as a "bag-of-words". This model does not take into account the temporal structure of the document and is sensitive to noise that can alter its lexical form. This noise can come from different sources: the uncontrolled form of documents on microblogging platforms, error-prone automatic transcription of speech documents, lexical and grammatical variability in Web forums, and so on. The work presented in this thesis addresses issues related to document representations from noisy sources. The thesis consists of three parts in which different representations of content are proposed. The first part compares a classical representation based on term frequency to a higher-level representation based on a topic space. The abstraction of the document content allows us to limit the alteration of the noisy document by representing its content with a set of high-level features. Our experiments confirm that mapping a noisy document into a topic space improves the results obtained on different information retrieval tasks compared to a classical approach based on term frequency. The major problem with such a high-level representation is that it is based on a topic space whose parameters are chosen empirically. The second part presents a novel representation based on multiple topic spaces that allows us to address three main problems: the closeness of the subjects discussed in the document, the tricky choice of the "right" values of the topic space parameters, and the robustness of the topic-based representation. Based on the idea that a single representation of the contents cannot capture all the relevant information, we propose to increase the number of views on a single document. This multiplication of views generates "artificial" observations that contain fragments of useful information. A first experiment validated the multi-view approach for representing noisy texts. However, this representation has the disadvantage of being very large and redundant and of containing additional variability associated with the diversity of views. In a second step, we propose a method based on factor analysis to compact the different views and to obtain a new, robust, low-dimensional representation that contains only the informative part of the document while the noisy variabilities are compensated. During a dialogue classification task, the compression process confirmed that this compact representation improves the robustness of noisy document representation. Nonetheless, during the learning process of topic spaces, the document is still considered as a "bag-of-words", while many studies have shown that the position of a word in a document is useful. A representation that takes into account the temporal structure of the document, based on hyper-complex numbers, is proposed in the third part. This representation relies on the hyper-complex numbers of dimension four, named quaternions. Our experiments on a classification task have shown the effectiveness of the proposed approach compared to a conventional "bag-of-words" representation.

    Contested Governance in Japan

    Contested Governance in Japan extends the analysis of governance in contemporary Japan by exploring both the sites and issues of governance above and below the state as well as within it. This volume discusses the contested nature of governance in Japan and the ways in which a range of actors are involved in different sites and issues of governance at home, in the region and the globe. It includes chapters on global governance, local policy-making, democracy, environmental governance, the Japanese financial system, corruption, the family and corporate governance

    Towards an encyclopaedia as a web of knowledge

    Peter Greenaway can look back on a long and distinguished career that established him as one of the leading artists and filmmakers of our time. Within the last few years, however, his ever-expanding oeuvre, which includes films, paintings, writings, exhibitions, installations, and operas, has largely failed to attract audience interest and scholarly attention. This study not only attempts to fill a considerable gap in the criticism of Greenaway, but also to offer a holistic view that sees his complete work as one homogeneous body, as a system made up of interrelated parts, for which structural theory provides the analytical framework. Greenaway, as an artist with an encyclopaedic range of interests and a strong penchant for collecting, is noted for filling his works with a great variety of images and ideas, borrowed from fields as diverse as biology, medicine, history, mathematics, philosophy, theology, literature, or the fine arts. For an analysis of the wealth of material collected by the artist, his works were disassembled into their constituent parts to identify recurring elements (abstract concepts, material objects, or visual and literary images) that function as unifying/binding forces between the individual emanations of his oeuvre. Grouped together in paradigmatic classes, these recurring elements are presented within the framework of an encyclopaedic collection, which forms the central part of this study. Within this framework, Greenaway’s work is further analysed by contextualising and historicising it, by relating it to a wider context of culture, and by establishing connections to the works of other artists. By exploring some of the explicit and implicit paths laid out by Greenaway, the study outlines his work as an intricate web of knowledge, which invites us to delve into the depths and richness of cultural traditions, while at the same time allowing us to discover our own unique course in an intellectual exploration of the relations between Greenaway’s art and the culture of the past and the present.

    Proceedings of the 21st International Congress of Aesthetics, Possible Worlds of Contemporary Aesthetics: Aesthetics Between History, Geography and Media

    The Faculty of Architecture, University of Belgrade and the Society for Aesthetics of Architecture and Visual Arts of Serbia (DEAVUS) are proud to be able to organize the 21st ICA Congress on “Possible Worlds of Contemporary Aesthetics: Aesthetics Between History, Geography and Media”. We are proud to announce that we received over 500 submissions from 56 countries, which makes this Congress the greatest gathering of aestheticians in this region in the last 40 years. The ICA 2019 Belgrade aims to map out contemporary aesthetics practices in a vivid dialogue of aestheticians, philosophers, art theorists, architecture theorists, culture theorists, media theorists, artists, media entrepreneurs, architects, cultural activists and researchers in the fields of humanities and social sciences. More precisely, the goal is to map the possible worlds of contemporary aesthetics in Europe, Asia, North and South America, Africa and Australia. The idea is to show, interpret and map the unity and diversity in aesthetic thought, expression, research, and philosophies on our shared planet. Our goal is to promote a dialogue concerning aesthetics in those parts of the world that have not been involved with the work of the International Association for Aesthetics (IAA) to this day. Global dialogue, understanding and cooperation are what we aim to achieve. That said, the 21st ICA is the first Congress to highlight the aesthetic issues of marginalised regions that have not been fully involved in the work of the IAA. This will be accomplished, among other things, via thematic round tables discussing contemporary aesthetics in East Africa and South America. Today, aesthetics is recognized as an important philosophical, theoretical and even scientific discipline that aims at interpreting the complexity of phenomena in our contemporary world. People tend to talk about possible worlds or possible aesthetic regimes rather than a unique and consistent philosophical, scientific or theoretical discipline.