10 research outputs found
Inferring the Origin Locations of Tweets with Quantitative Confidence
Social Internet content plays an increasingly critical role in many domains,
including public health, disaster management, and politics. However, its
utility is limited by missing geographic information; for example, fewer than
1.6% of Twitter messages (tweets) contain a geotag. We propose a scalable,
content-based approach to estimate the location of tweets using a novel yet
simple variant of Gaussian mixture models. Further, because real-world
applications depend on quantified uncertainty for such estimates, we propose
novel metrics of accuracy, precision, and calibration, and we evaluate our
approach accordingly. Experiments on 13 million global, comprehensively
multi-lingual tweets show that our approach yields reliable, well-calibrated
results competitive with previous computationally intensive methods. We also
show that relatively little training data is required for good
estimates (roughly 30,000 tweets) and that models are largely time-invariant
(effective on tweets many weeks newer than the training set). Finally, we show
that toponyms and languages with a small geographic footprint provide the most
useful location signals.
Comment: 14 pages, 6 figures. Version 2: Move mathematics to appendix, 2 new
references, various other presentation improvements. Version 3: Various
presentation improvements, accepted at ACM CSCW 201
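The abstract does not spell out the model, so the following is only a loose, hypothetical sketch of the core intuition it reports: words with a small spatial footprint (toponyms, regional languages) carry the strongest location signal. The sketch fits a per-word location centroid and variance from geotagged training tweets and combines words by inverse-variance weighting; it is a simplification, not the paper's actual Gaussian-mixture variant, and every name and data point here is invented.

```python
# Hypothetical simplification of content-based tweet geolocation:
# per-word spatial statistics, combined by inverse-variance weighting.
from collections import defaultdict

def train(tweets):
    """tweets: list of (text, (lat, lon)) geotagged training examples.
    Returns per-word (mean_lat, mean_lon, spatial_variance)."""
    coords = defaultdict(list)
    for text, (lat, lon) in tweets:
        for word in set(text.lower().split()):
            coords[word].append((lat, lon))
    model = {}
    for word, pts in coords.items():
        n = len(pts)
        mlat = sum(p[0] for p in pts) / n
        mlon = sum(p[1] for p in pts) / n
        # small spread = word is a strong location signal
        var = sum((p[0] - mlat) ** 2 + (p[1] - mlon) ** 2 for p in pts) / n + 1e-6
        model[word] = (mlat, mlon, var)
    return model

def estimate(model, text):
    """Precision-weighted average of the word centroids."""
    w_sum = lat_sum = lon_sum = 0.0
    for word in set(text.lower().split()):
        if word in model:
            mlat, mlon, var = model[word]
            w = 1.0 / var
            w_sum += w
            lat_sum += w * mlat
            lon_sum += w * mlon
    if w_sum == 0:
        return None
    return (lat_sum / w_sum, lon_sum / w_sum)
```

A city name seen only near one place has tiny variance and therefore dominates the estimate, which is consistent with the paper's finding that toponyms provide the most useful signal.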
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points of interest, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the worldwide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts have been devoted to dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim to offer an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. We also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.
Comment: Accepted to TKDE. 30 pages, 1 figure
Fast Topic Discovery From Web Search Streams
Web search involves voluminous data streams that record millions of users' interactions with the search engine. Recently, latent topics in web search data have been found to be critical for a wide range of search-engine applications such as search personalization and search history warehousing. However, existing methods usually discover latent topics from web search data in an offline, retrospective fashion. Hence, they are increasingly ineffective in the face of the ever-growing web search data that accumulate as online streams. In this paper, we propose a novel probabilistic topic model, the Web Search Stream Model (WSSM), which is carefully tailored to two salient features of web search data: it arrives in the form of streams and in massive volume. We further propose an efficient parameter-inference method, Stream Parameter Inference (SPI), to train WSSM with massive web search streams. Based on a large-scale search-engine query log, we conduct extensive experiments to verify the effectiveness and efficiency of WSSM and SPI. We observe that WSSM together with SPI discovers latent topics from web search streams faster than state-of-the-art methods while retaining comparable topic-modeling accuracy.
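WSSM and SPI are not reproduced in this abstract, so the following is only a hedged stand-in for the general idea of discovering topics from a query stream in a single online pass. It uses naive single-pass clustering of bag-of-words query vectors with running word-count centroids; the threshold and all data are illustrative, and a real streaming topic model would use probabilistic inference instead.

```python
# Illustrative single-pass streaming topic discovery (NOT the WSSM/SPI
# method): each topic is a running word-count centroid; a query joins
# the closest centroid by cosine similarity or starts a new topic.
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def stream_topics(queries, threshold=0.3):
    """One pass over the query stream; returns the topic centroids
    and the topic id assigned to each query."""
    topics = []        # list of Counter centroids
    assignments = []
    for q in queries:
        vec = Counter(q.lower().split())
        best, best_sim = None, threshold
        for i, c in enumerate(topics):
            sim = cosine(vec, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            topics.append(Counter(vec))
            best = len(topics) - 1
        else:
            topics[best].update(vec)   # incremental centroid update
        assignments.append(best)
    return topics, assignments
```

Because each query is seen exactly once and only the centroids are kept, memory and time stay bounded as the stream grows, which is the property the offline, retrospective methods criticized above lack.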
On effective location-aware music recommendation
Ministry of Education, Singapore under its Academic Research Funding Tier
Mining Text and Time Series Data with Applications in Finance
Finance is a field extremely rich in data, and has great need of methods for summarizing and understanding these data. Existing methods of multivariate analysis allow the discovery of structure in time series data but can be difficult to interpret. Often there exists a wealth of text data directly related to the time series. In this thesis it is shown that this text can be exploited to aid interpretation of, and even to improve, the structure uncovered. To this end, two approaches are described and tested. Both serve to uncover structure in the relationship between text and time series data, but do so in very different ways. The first model comes from the field of topic modelling. A novel topic model is developed, closely related to an existing topic model for mixed data. Improved held-out likelihood is demonstrated for this model on a corpus of UK equity market data, and the discovered structure is qualitatively examined. To the authors' knowledge this is the first attempt to combine text and time series data in a single generative topic model. The second method is a simpler, discriminative method based on a low-rank decomposition of time series data with constraints determined by word frequencies in the text data. This is compared to topic modelling using both the equity data and a second corpus comprising foreign-exchange-rate time series and text describing global macroeconomic sentiment, showing further improvements in held-out likelihood. One application of the inferred structure is also demonstrated: the construction of carry trade portfolios. The superior results of this second method serve as a reminder that methodological complexity does not guarantee performance gains.
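The thesis's second method rests on a low-rank decomposition of the time series matrix; without access to its text-derived constraints, a bare-bones sketch of the low-rank step itself can be given via truncated SVD. The data below is synthetic and the function name is invented; this illustrates only the decomposition, not the thesis's model.

```python
# Illustrative rank-k approximation of a (series x time) matrix via SVD.
# In the thesis, the factors would additionally be constrained by word
# frequencies from related text; that step is omitted here.
import numpy as np

def low_rank(ts, k):
    """ts: (n_series, n_days) matrix. Returns the rank-k approximation
    and the k left singular vectors (latent factor loadings)."""
    U, s, Vt = np.linalg.svd(ts, full_matrices=False)
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return approx, U[:, :k]

# Synthetic example: 8 series driven by 2 hidden factors, so the
# matrix is exactly rank 2 and recovered perfectly at k=2.
rng = np.random.default_rng(0)
factors = rng.normal(size=(2, 100))
loadings = rng.normal(size=(8, 2))
ts = loadings @ factors
approx, left = low_rank(ts, 2)
```

The left singular vectors summarize how each series loads on the hidden drivers; relating those loadings to word frequencies is the thesis's contribution.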
Studying, developing, and experimenting contextual advertising systems
The World Wide Web has grown rapidly in the last decade and is today a vital part of people's daily lives. The Internet is used for many purposes by an ever-growing number of users, mostly for daily activities, tasks, and services.
To meet the needs of users, efficient and effective access to information is required. To deal with this task, the adoption of Information Retrieval and Information Filtering techniques is continuously growing. Information Retrieval (IR) is the field concerned with searching for documents, information within documents, and metadata about documents, as well as searching structured storage, relational databases, and the World Wide Web. Information Filtering deals with the problem of selecting relevant information for a given user, according to her/his preferences and interests. Nowadays, Web advertising is one of the major sources of income for a large number of websites. Its main goal is to suggest products and services to the still ever-growing population of Internet users. A significant part of Web advertising consists of textual ads, the ubiquitous short text messages usually marked as sponsored links. There are two primary channels for distributing ads: Sponsored Search (or Paid Search Advertising) and Contextual Advertising (or Content Match). Sponsored Search advertising is the task of displaying ads on the page returned by a Web search engine following a query. Contextual Advertising (CA) displays ads within the content of a generic, third-party webpage. In this thesis I study, develop, and evaluate novel solutions in the field of Contextual Advertising. In particular, I studied and developed novel text summarization techniques, adopted a novel semantic approach, studied and adopted collaborative approaches, started a joint study of Contextual Advertising and geo-localization, and studied the task of advertising in the field of Multi-Modal Aggregation. The thesis is organized as follows. Chapter 1 briefly describes the main aspects of Information Retrieval. Chapter 2 presents the problem of Contextual Advertising and describes the main contributions in the literature. Chapter 3 sketches a typically adopted approach and the evaluation metrics of a Contextual Advertising system. Chapter 4 is devoted to syntactic aspects, with a focus on text summarization. Chapter 5 takes semantic aspects into account and proposes a novel approach based on ConceptNet. Chapter 6 proposes a novel view of CA through the adoption of a collaborative filtering approach. Chapter 7 presents a preliminary study of geo-location, performed in collaboration with the Yahoo! Research center in Barcelona; the goal is to study several techniques for suggesting localized advertising in the field of mobile applications and search engines. Chapter 8 presents joint work with the RAI Centre for Research and Technological Innovation, whose main goal is to study and propose a system of advertising for multimodal aggregation data. Chapter 9 ends this work with conclusions and future directions
Knowledge-based and data-driven approaches for geographical information access
Geographical Information Access (GeoIA) can be defined as a way of retrieving information from textual collections that includes the automatic analysis and interpretation of the geographical constraints and terms present in queries and documents. This PhD thesis presents, describes, and evaluates several heterogeneous approaches to the following three GeoIA tasks: Geographical Information Retrieval (GIR), Geographical Question Answering (GeoQA), and Textual Georeferencing (TG). The GIR task deals with user queries that search over documents (e.g. "vineyards in California") and the GeoQA task treats questions that retrieve answers (e.g. "What is the capital of France?"). TG, on the other hand, is the task of associating one or more georeferences (such as polygons or coordinates in a geodetic reference system) with electronic documents.
Current state-of-the-art AI algorithms do not yet fully understand the semantic meaning and the geographical constraints and terms present in queries and document collections. This thesis attempts to improve the effectiveness of GeoIA tasks by: 1) improving the detection, understanding, and use of part of the geographical and thematic content of queries and documents with Toponym Recognition, Toponym Disambiguation, and Natural Language Processing (NLP) techniques, and 2) combining Geographical Knowledge-Based Heuristics based on common sense with Data-Driven IR algorithms.
The main contributions of this thesis to the state-of-the-art of GeoIA tasks are:
1) The presentation of 10 novel approaches for GeoIA tasks: 3 approaches for GIR, 3 for GeoQA, and 4 for Textual Georeferencing (TG).
2) The evaluation of these novel approaches in several contexts: within official evaluation benchmarks, afterwards with the benchmark test collections, and with other specific datasets. Most of these algorithms have been evaluated in international evaluations and some of them achieved top-ranked state-of-the-art results, including top-performing results in GIR (GeoCLEF 2007) and TG (MediaEval 2014) benchmarks.
3) The experiments reported in this PhD thesis show that the approaches can effectively combine Geographical Knowledge and NLP with Data-Driven techniques to improve the effectiveness measures of the three Geographical Information Access tasks investigated.
4) TALPGeoIR: a novel GIR approach that combines Geographical Knowledge ReRanking (GeoKR), NLP, and Relevance Feedback (RF) and achieved state-of-the-art results in official GeoCLEF benchmarks (Ferrés and Rodríguez, 2008; Mandl et al., 2008) and posterior experiments (Ferrés and Rodríguez, 2015a). This approach has been evaluated on the full GeoCLEF corpus (100 topics) and showed that GeoKR, NLP, and RF techniques, evaluated separately or in combination, improve the MAP and R-Precision effectiveness measures of the state-of-the-art IR algorithms TF-IDF, BM25, and InL2, with statistical significance in most of the experiments.
5) GeoTALP-QA: a scope-based GeoQA approach for Spanish and English and its evaluation with a set of questions about Spanish geography (Ferrés and Rodríguez, 2006).
6) Four state-of-the-art Textual Georeferencing approaches for informal and formal documents that achieved state-of-the-art results in evaluation benchmarks (Ferrés and Rodríguez, 2014) and posterior experiments (Ferrés and Rodríguez, 2011; Ferrés and Rodríguez, 2015b).
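The TALPGeoIR system itself is not detailed in this record, so the following is only a hedged illustration of the general pattern it names: a standard lexical ranker (here BM25) followed by a geographical boost for documents matching a query toponym. The toponym set, boost factor, and data are all invented for the example; the real system's GeoKR component is considerably more involved.

```python
# Illustrative BM25 scoring plus a naive geographical re-ranking step.
# NOT the TALPGeoIR algorithm; the boost and toponym list are made up.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    toks = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def geo_rerank(query, docs, toponyms, boost=1.5):
    """Boost the BM25 score of documents containing a query toponym."""
    qt = {w for w in query.lower().split() if w in toponyms}
    out = []
    for d, s in zip(docs, bm25_scores(query, docs)):
        if qt & set(d.lower().split()):
            s *= boost
        out.append(s)
    return out
```

The point of the pattern is that the geographic evidence is applied after the lexical ranking, so the baseline ranker stays untouched and the boost can be evaluated separately or in combination, as the thesis does.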
Automated Methods for the Topic Analysis of News-Oriented Text Sources (Automatisierte Verfahren für die Themenanalyse nachrichtenorientierter Textquellen)
Topic analysis is an important component of content analysis in media studies. For the analysis of large digital text collections with respect to thematic structures, it is therefore important to investigate the potential of automated, computer-assisted methods. In doing so, the methodological and analytical requirements of content analysis, which also apply to topic analysis, must be respected and represented. This thesis investigates the possibilities of automating topic analysis and its application perspectives. It draws on theoretical and methodological foundations of content analysis and on linguistic theories of topic structures in order to derive requirements for an automated analysis. The main contribution is the investigation of the potential of tools from the fields of data mining and text mining that can be employed usefully and profitably for content-analytic work on text databases. Furthermore, an exemplary analysis is carried out to demonstrate the applicability of automated methods for topic analyses. The thesis also demonstrates possibilities for using interactive interfaces, formulates the idea and implementation of suitable software, and presents a possible workflow for topic analysis. By presenting the potential of automated topic investigations in large digital text collections, this work contributes to research on automated content analysis.
Starting from the requirements placed on a topic analysis, this thesis shows which text-mining methods and automated procedures can come close to meeting them. In summary, two requirements stand out, each of whose fulfillment influences the other. On the one hand, a fast thematic overview of the topics in a complex document collection is required, in order to represent its content structure and to contrast topics against each other. On the other hand, the topics must be representable at a sufficient level of detail so that an analysis of the sense and meaning of the topic content becomes possible. Both demands are methodologically anchored in the quantitative and qualitative approaches of content analysis. The thesis discusses these parallels and relates automated methods and algorithms to the requirements. Methods can be identified, such as topic models and clustering procedures, that allow a semantic and thus thematic separation of the data and provide an abstracted overview of large document collections. With the help of these algorithms it is possible to generate thematically coherent subsets of a document collection and to make their thematic content available for summaries. It is shown that the topics remain distinguishable despite this distant reading, and that their frequencies and distributions in a text collection can be presented diachronically. This preparation of the data allows the analysis of thematic trends or the selection of specific thematic aspects from a wealth of documents. Diachronic examinations of thematically coherent document sets thereby become possible, and the temporal frequencies of topics can be analyzed. For the detailed interpretation and summarization of topics, further representations and information must be derived from the content of the topics. It can be shown that meanings, statements, and contexts can be made visible through a co-occurrence analysis of the documents belonging to a topic. In a variant that takes reading direction and part of speech into account, frequently occurring word sequences or statements within a topic can be captured statistically. The phrases generated in this way can be used to define categories, or can be contrasted with other topics, publications, or theoretical assumptions. In addition, diachronic analyses of individual words, word groups, or proper names within a topic are suitable for identifying topic phases, key terms, or news factors. The information obtained in this way can be complemented by a close reading of thematically relevant documents, which the thematic separation of the document sets makes possible. Beyond these methodological perspectives, the automated analyses can also be used as empirical measurement instruments in the context of further communication-science theories not discussed here. Furthermore, the thesis shows that graphical interfaces and software frameworks for conducting automated topic analyses are feasible and practical to deploy. In this respect, the discussion shows how the solutions and approaches presented can be transferred into practice.
The thesis makes substantial contributions to research on automated content analysis. Above all, it documents the scholarly engagement with automated topic analyses. In the course of working on this topic, the author developed suitable procedures for applying text-mining methods in practical content analyses. Among other things, contributions were made to the visualization and easy use of different methods. Procedures from the areas of topic modeling, clustering, and co-occurrence analysis had to be adapted so that they could be applied in content-analytic settings. Further contributions arose from the methodological positioning of computer-assisted topic analysis and from the definition of innovative applications in this field. The experiments and investigations carried out for this thesis were conducted entirely in software developed for this purpose, which is also used successfully in other projects. Around this system, processing pipelines, data storage, visualization, graphical interfaces, data-interaction capabilities, machine-learning methods, and components for document retrieval were implemented. As a result, the complex methods and procedures for automated topic analysis become easy to apply and are available in a user-friendly form for future projects and analyses. Social scientists, political scientists, and communication scholars can work with the software environment and conduct content analyses without having to master the details of the automation and computational support
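The co-occurrence analysis described above can be sketched minimally as follows. This counts, per sentence, how often unordered word pairs occur together; it is only an illustration with invented data, and a real content-analytic setup would add the part-of-speech filtering and statistical significance measures the thesis mentions.

```python
# Minimal sentence-level co-occurrence counting; "strength" here is a
# raw pair count, not a log-likelihood or other significance measure.
from collections import Counter
from itertools import combinations

def cooccurrences(sentences):
    """Count, over all sentences, how often each unordered word pair
    occurs together (each pair at most once per sentence)."""
    pairs = Counter()
    for s in sentences:
        words = sorted(set(s.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

# Invented toy "news" sentences from one thematic subset:
sents = ["climate summit opens in paris",
         "leaders arrive for climate summit",
         "markets rally on earnings"]
top = cooccurrences(sents)
```

Frequent pairs (here "climate"/"summit") surface the statements and contexts characteristic of a topic, which can then be inspected by close reading of the underlying documents.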