
    Automatic methods to extract latent meanings in large text corpora

    Get PDF
    This thesis concentrates on data mining in corpus linguistics. We show the use of modern data mining by developing efficient and effective methods for research and teaching in corpus linguistics in the fields of lexicography and semantics. Modern language resources, as provided by the Common Language Resources and Technology Infrastructure (http://clarin.eu), offer a large number of heterogeneous information resources on written language. Besides large text corpora, additional information about the sources or publication dates of the documents in the corpora is available. Further, information about words from dictionaries or WordNets offers prior knowledge about word distributions. Starting with pre-studies in lexicography and semantics on large text corpora, we investigate the use of latent variable methods to extract hidden concepts from large text collections. We show that these hidden concepts correspond to meanings of words and to subjects of text collections. This motivates an investigation of latent variable methods for large corpora to support linguistic research. In an extensive survey, latent variable models are described, and their mathematical and geometrical foundations are explained. We distinguish two starting points for latent variable models, depending on how documents are represented internally. In the first representation, documents are geometric objects in a vector space and the latent variables are represented by vectors; latent factor models extract the latent variables by factorizing matrices that summarize the document objects. In the second representation, documents are random sequences and the latent variables are random variables on which the sequences conditionally depend; latent topic models extract the latent variables by finding these conditionally dependent variables. We explain state-of-the-art methods for both factor and topic models.
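As an illustration of the factor-model view described above, the following sketch factorizes a small term-document matrix with a truncated SVD, the factorization underlying latent semantic analysis. All terms and counts are invented purely for illustration; the point is that terms sharing a hidden concept end up close together in the latent space:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Documents 1-2 are about finance, documents 3-4 about rivers.
terms = ["bank", "money", "loan", "river", "water", "shore"]
X = np.array([
    [3, 2, 0, 1],   # bank  (occurs in both senses)
    [2, 3, 0, 0],   # money
    [1, 2, 0, 0],   # loan
    [0, 0, 3, 2],   # river
    [0, 0, 2, 3],   # water
    [1, 0, 2, 1],   # shore
], dtype=float)

# Factorize: X ≈ U_k @ diag(S_k) @ Vt_k with k latent factors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], S[:k]

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each term is now a point in the k-dimensional latent space;
# cosine similarity there groups terms by hidden concept.
term_vecs = U_k * S_k  # scale each factor axis by its singular value
i = dict(zip(terms, range(len(terms))))
print(cos(term_vecs[i["money"]], term_vecs[i["loan"]]))   # high: same concept
print(cos(term_vecs[i["money"]], term_vecs[i["river"]]))  # low: different concepts
```

The two retained singular vectors play the role of the latent variables: each axis of the reduced space corresponds to one hidden concept.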
To show the quality, and hence the usefulness, of latent variable methods for corpus linguistics, different evaluation methods are discussed. Qualitative evaluation methods are described to effectively present the results of latent variable methods to users. State-of-the-art quantitative evaluation methods are summarized to illustrate how the quality of latent variable methods can be measured automatically. Additionally, we propose new methods to efficiently estimate the quality of latent variable methods on corpora with time information about the documents. Besides standard evaluation methods based on likelihoods and on the coherence of the extracted hidden concepts, we develop methods that estimate the coherence of concepts in terms of temporal aspects and likelihoods that include time. Based on the survey of latent variable methods, we interpret latent variable extraction as an optimization problem that finds latent variables which optimally describe the document corpus. To efficiently integrate additional information about a corpus from modern language resources, we propose to extend the optimization for the latent variables with a regularization term that includes this additional information. For the different latent variable models, regularizations are proposed that either align latent factors or jointly model latent topics with information about the documents in the corpus. From pre-studies and collaborations with researchers from corpus linguistics, we compiled use cases to investigate the regularized latent variable methods for linguistic research and teaching. Two major applications are investigated. In diachronic linguistics, we show efficient regularized latent topic models that jointly model latent variables with the time stamps of documents. In variety linguistics, we integrate information about the sources of the documents to model similarities and dissimilarities between corpora.
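A widely used quantitative evaluation of the kind mentioned above is topic coherence. The following sketch (toy corpus and word lists invented for illustration, not data from the thesis) computes the UMass coherence score, which rewards topics whose top words co-occur in the same documents:

```python
import math
from itertools import combinations

# Toy corpus: each document is a set of tokens (invented for illustration).
docs = [
    {"bank", "money", "loan", "credit"},
    {"money", "loan", "interest"},
    {"river", "water", "shore"},
    {"river", "shore", "bank"},
    {"money", "credit", "interest"},
]

def umass_coherence(top_words, docs):
    """UMass coherence: sum over word pairs (wi before wj in the ranking)
    of log[(D(wi, wj) + 1) / D(wj)], where D counts documents containing
    the given word(s). Higher scores indicate more coherent topics."""
    def d(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        score += math.log((d(wi, wj) + 1) / d(wj))
    return score

coherent = umass_coherence(["money", "loan", "credit"], docs)   # words co-occur
mixed = umass_coherence(["money", "river", "credit"], docs)     # words do not
print(coherent, mixed)
```

Because "money", "loan" and "credit" appear together in the same documents while "river" never co-occurs with the finance words, the first topic scores higher than the second.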
Finally, we describe a software package, developed as a plugin for the data mining toolkit RapidMiner, that implements the methods of this thesis. The interfaces to the language resources and text corpora, the text processing methods, the latent variable methods and the evaluation methods are specified. We give detailed information about how the software is used in the use cases. The integration of the developed methods into modern language resources such as WebLicht or the Dictionary of the German Language is explained to show the acceptance of our methods in corpus linguistic research and teaching.

    Mining corpora of computer-mediated communication: analysis of linguistic features in Wikipedia talk pages using machine learning methods

    Get PDF
    Machine learning methods offer great potential for automatically investigating large amounts of data in the humanities. Our contribution to the workshop reports on ongoing work in the BMBF project KobRA (http://www.kobra.tu-dortmund.de), where we apply machine learning methods to the analysis of big corpora in language-focused research on computer-mediated communication (CMC). At the workshop, we will discuss first results from training a Support Vector Machine (SVM) for the classification of selected linguistic features in talk pages of the German Wikipedia corpus in DeReKo, provided by the IDS Mannheim. We will investigate different representations of the data in order to integrate complex syntactic and semantic information into the SVM. The results shall foster both corpus-based research on CMC and the annotation of linguistic features in CMC corpora.
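As a rough sketch of such a classification setup (the feature vectors and labels below are invented stand-ins, not the project's actual Wikipedia data), a linear SVM can be trained with the Pegasos subgradient method on the hinge loss:

```python
import numpy as np

# Invented toy features: two well-separated classes standing in for
# "linguistic feature present" (+1) vs "absent" (-1) in a talk-page post.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1.0] * 20 + [-1.0] * 20)

def pegasos(X, y, lam=0.01, epochs=50):
    """Linear SVM via Pegasos: stochastic subgradient descent on the
    L2-regularized hinge loss, with step size 1/(lam * t)."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[idx] * (w @ X[idx]) < 1:       # margin violated: move towards x
                w = (1 - eta * lam) * w + eta * y[idx] * X[idx]
            else:                               # only shrink (regularization)
                w = (1 - eta * lam) * w
    return w

w = pegasos(X, y)
acc = float(np.mean(np.sign(X @ w) == y))
print(acc)
```

In practice one would use an established SVM implementation and real linguistic feature extraction; this only illustrates the learning principle.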

    Machine Learning meets Data-Driven Journalism: Boosting International Understanding and Transparency in News Coverage

    Full text link
    Migration crisis, climate change or tax havens: global challenges need global solutions. But agreeing on a joint approach is difficult without a common ground for discussion. Public spheres are highly segmented because news is mainly produced and received at the national level. Gaining a global view on international debates about important issues is hindered by the enormous quantity of news and by language barriers. Media analysis usually focuses only on qualitative research. In this position statement, we argue that it is imperative to pool methods from machine learning, journalism studies and statistics to help bridge the segmented data of the international public sphere, using the Transatlantic Trade and Investment Partnership (TTIP) as a case study. Comment: presented at the 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, New York, N

    Studies on the cultivation of GMOs in Saxony - Studies on the consequences of GMO cultivation in Saxony

    Get PDF
    The area under cultivation with GM maize is currently increasing steadily worldwide. In 2006, GM maize was grown on more than 25 million hectares. In 2007, cultivation increased by almost a third to 32.5 million ha. Within the European Union (cultivated area: 110,000 ha), Spain has the largest share of GM maize with 75,000 ha [1]. In Germany, a total of 2,685 ha were cultivated in 2007. With 556 ha (2006: 230 ha), Saxony ranked third among the German federal states, after Brandenburg and Mecklenburg-Vorpommern [2]. GM maize varieties are expected to deliver considerable yield increases or yield stabilization [3]. These breeding goals are to be achieved through resistance to pests and diseases and through the reduction of abiotic stress factors. For GMO constructs that have already been approved, it is necessary, as for other varieties approved by the Bundessortenamt or the EU, to carry out regional cultivation suitability trials in order to create a basis for advising practitioners. The project establishes a demonstration and advisory basis specifically tailored to Saxony, whose results also feed into a nationally networked trial programme. Besides the assessment of long-term effects, the economic evaluation and questions of coexistence are focal points of the project. The project objectives are listed below. They are important for further scientifically sound recommendations on good professional practice in handling GMOs.
• Testing and demonstration of the cultivation suitability of Bt maize compared with conventional maize under Saxon conditions, following the principles of good professional practice
• Studies on the economic viability of the approach
• Development of effective long-term monitoring of the European corn borer (Maiszünsler)
• Studies to safeguard coexistence between farms with and without cultivation of GM varieties (outcrossing)
• Studies on feed quality and feed value
• Observation of the effects of GMOs on soil biological activity and on non-target organisms
• Transfer of the results into practice (on-site demonstrations, expert conferences and publications)

    A framework for using self-organising maps to analyse spatiotemporal patterns, exemplified by analysis of mobile phone usage

    Get PDF
    We suggest a visual analytics framework for the exploration and analysis of spatially and temporally referenced values of numeric attributes. The framework supports two complementary perspectives on spatio-temporal data: as a temporal sequence of spatial distributions of attribute values (called spatial situations) and as a set of spatially referenced time series of attribute values representing local temporal variations. To handle a large amount of data, we use the self-organising map (SOM) method, which groups objects and arranges them according to the similarity of relevant data features. We apply the SOM approach to spatial situations and to local temporal variations and obtain two types of SOM outcomes, called space-in-time SOM and time-in-space SOM, respectively. The examination and interpretation of both types of SOM outcomes are supported by appropriate visualisation and interaction techniques. This article describes the use of the framework through an example scenario of data analysis. We also discuss how the framework can be extended from supporting exploratory analysis to building predictive models of the spatio-temporal variation of attribute values. We apply our approach to phone call data, showing its usefulness in real-world analytic scenarios.
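The core of the SOM training used by such a framework can be sketched as follows. Grid size, learning rates and data below are illustrative assumptions, not the parameters used in the article; the loop shows the standard best-matching-unit update with a shrinking neighbourhood:

```python
import numpy as np

# Toy 2-D dataset with two well-separated clusters (invented for illustration).
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])

rows, cols = 3, 3
weights = rng.random((rows, cols, 2))                # codebook vector per node
grid = np.array([[r, c] for r in range(rows) for c in range(cols)],
                dtype=float).reshape(rows, cols, 2)  # node positions on the grid

def quantisation_error(data, weights):
    """Mean distance from each sample to its best-matching unit (BMU)."""
    flat = weights.reshape(-1, 2)
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

err_before = quantisation_error(data, weights)

epochs = 30
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)            # learning rate decays over time
    sigma = 1.5 * (1 - epoch / epochs) + 0.3   # neighbourhood radius decays too
    for x in data[rng.permutation(len(data))]:
        # BMU: the node whose codebook vector is closest to the sample
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # pull the BMU, and more weakly its grid neighbours, towards the sample
        g = np.linalg.norm(grid - grid[bmu], axis=2)
        h = np.exp(-(g ** 2) / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)

err_after = quantisation_error(data, weights)
print(err_before, err_after)
```

After training, similar inputs map to nearby grid nodes and the quantisation error drops, which is what makes the map usable as a visual summary of many spatial situations or time series.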

    Automatic classification of support-verb constructions using data mining (Automatische Klassifikation von Stützverbgefügen mithilfe von Data-Mining)

    No full text
    BMBF joint project: Corpus-based linguistic research and analysis using data mining (KobRA). - Automatic classification of support-verb constructions using data mining. 1. Problem statement and project context. 2. Data basis and preliminary linguistic work. 3. Description of the data mining experiments. 4. Evaluation. 5. Conclusions and follow-up work. 6. Cited literature.
