13 research outputs found
You can't always sketch what you want: Understanding Sensemaking in Visual Query Systems
Visual query systems (VQSs) empower users to interactively search for line
charts with desired visual patterns, typically specified using intuitive
sketch-based interfaces. Despite decades of past work on VQSs, these efforts
have not translated to adoption in practice, possibly because VQSs are largely
evaluated in unrealistic lab-based settings. To remedy this gap in adoption, we
collaborated with experts from three diverse domains---astronomy, genetics, and
material science---via a year-long user-centered design process to develop a
VQS that supports their workflow and analytical needs, and evaluate how VQSs
can be used in practice. Our study results reveal that ad-hoc sketch-only
querying is not as commonly used as prior work suggests, since analysts are
often unable to precisely express their patterns of interest. In addition, we
characterize three essential sensemaking processes supported by our enhanced
VQS. We discover that participants employ all three processes, but in different
proportions, depending on the analytical needs in each domain. Our findings
suggest that all three sensemaking processes must be integrated in order to
make future VQSs useful for a wide range of analytical inquiries.Comment: Accepted for presentation at IEEE VAST 2019, to be held October 20-25
in Vancouver, Canada. Paper will also be published in a special issue of IEEE
Transactions on Visualization and Computer Graphics (TVCG) IEEE VIS
(InfoVis/VAST/SciVis) 2019 ACM 2012 CCS - Human-centered computing,
Visualization, Visualization design and evaluation method
Decision support visualization approach in textile manufacturing: a case study from operational control in the textile industry
Decision support visualization tools provide insights for solving problems by displaying data in an interactive, graphical format. Such tools can be effective for supporting decision-makers in finding new opportunities and in measuring decision outcomes. In this study, a visualization tool capable of handling multivariate time series was used to study a problem of operational control in a textile manufacturing plant; the main goal was to identify sources of inefficiency in the daily production data of three machines. A concise rule-based model of the inefficiency measures was developed (i.e. quantitative measures were transformed into categorical variables), and an in-depth visual analysis was then performed using a particular technique: categorical time series plots stacked vertically. With this approach, a wide array of production inefficiency patterns was identified that was difficult to detect with standard quantitative reporting, such as the temporal pattern of the best and worst performing machines; critically, the most important sources of inefficiency, and some interactions between them, were also revealed. The case study underlying this work was further contextualized within the state of the art, and demonstrates the effectiveness of adequate visual analysis as a decision support tool for operational control in manufacturing.
This study was partially conducted at the Psychology Research Centre (UID/PSI/01662/2013), University of Minho, and supported by the Portuguese Foundation for Science and Technology and the Portuguese Ministry of Science, Technology and Higher Education through national funds, co-financed by FEDER through COMPETE2020 under the PT2020 Partnership Agreement (POCI-010145-FEDER-007653). This work was also supported by the following grants: FCT project PTDC/MHC/PCN/1530; FEDER funds through the "Programa Operacional Factores de Competitividade - COMPETE" program and national funds through FCT "Fundação para a Ciência e a Tecnologia" under the projects FCOMP-010124-FEDER-PEst-OE/EEI/UI0760/2011, PEst-OE/EEI/UI0760/2014, PEst2015-2020 and UID/CEC/00319/2019
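The rule-based step described above (quantitative measures transformed into categorical variables) can be sketched roughly as follows; the thresholds and category names here are invented for illustration, since the abstract does not give them:

```python
# Hypothetical sketch of the rule-based categorization step: a daily
# quantitative inefficiency ratio is mapped to a categorical level.
# Thresholds and labels are illustrative assumptions, not the study's rules.

def categorize(value, thresholds=(0.05, 0.15)):
    """Map a daily inefficiency ratio to a categorical level."""
    low, high = thresholds
    if value < low:
        return "normal"
    if value < high:
        return "moderate"
    return "severe"

def categorize_series(series, thresholds=(0.05, 0.15)):
    """Categorize the whole daily series for one machine."""
    return [categorize(v, thresholds) for v in series]

machine_a = [0.02, 0.08, 0.21, 0.04]
print(categorize_series(machine_a))
# ['normal', 'moderate', 'severe', 'normal']
```

Once each machine's series is categorical, the stacked categorical plots the study describes amount to drawing one colour-coded strip per machine over a shared time axis.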
Evaluating stance-annotated sentences from political blogs regarding the Brexit: a quantitative analysis
This paper offers a formally driven quantitative analysis of stance-annotated sentences in the Brexit Blog Corpus (BBC). Our goal is to identify features that determine the formal profiles of six stance categories (contrariety, hypotheticality, necessity, prediction, source of knowledge and uncertainty) in a subset of the BBC. The study has two parts: firstly, it examines a large number of formal linguistic features, such as punctuation, words and grammatical categories that occur in the sentences, in order to describe the specific characteristics of each category; secondly, it compares characteristics across the entire data set in order to determine stance similarities. We show that among the six stance categories in the corpus, contrariety and necessity are the most discriminative, with the former using longer sentences, more conjunctions, more repetitions and shorter forms than the sentences expressing other stances. Necessity has longer lexical forms but shorter sentences, which are syntactically more complex. We show that stance in our data set is expressed in sentences of around 21 words. The sentences consist mainly of alphabetical characters forming a varied vocabulary without special forms, such as digits or special characters.
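The kind of formal features the study examines (sentence length in words, conjunction counts, presence of digits or special characters) can be sketched as a small profiling function; the conjunction list and field names below are illustrative stand-ins, not the paper's feature set:

```python
import re
from statistics import mean

# Illustrative sketch (not the paper's code): compute a few formal features
# per sentence. The conjunction list is a hypothetical stand-in for the
# study's grammatical-category tagging.

CONJUNCTIONS = {"and", "but", "or", "because", "although", "while"}

def formal_profile(sentence):
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return {
        "n_words": len(words),                              # sentence length
        "n_conjunctions": sum(w in CONJUNCTIONS for w in words),
        "mean_word_len": mean(len(w) for w in words) if words else 0.0,
        "has_digits": any(ch.isdigit() for ch in sentence),  # special forms
    }

profile = formal_profile("Brexit will happen because the vote passed.")
print(profile["n_words"], profile["n_conjunctions"])  # 7 1
```

Aggregating such profiles per stance category is then what allows the categories to be compared on sentence length, conjunction use, and word-form length.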
Análise colaborativa de grandes conjuntos de séries temporais [Collaborative analysis of large time series data sets]
The recent expansion of metrification on a daily basis has led to the production
of massive quantities of data, and in many cases, these collected metrics
are only useful for knowledge building when seen as a full sequence of
data ordered by time, which constitutes a time series. To find and interpret
meaningful behavioral patterns in time series, a multitude of analysis software
tools have been developed. Many of the existing solutions use annotations
to enable the curation of a knowledge base that is shared between a group
of researchers over a network. However, these tools also lack appropriate
mechanisms to handle a high number of concurrent requests and to properly
store massive data sets and ontologies, as well as suitable representations
for annotated data that are visually interpretable by humans and explorable by
automated systems. The goal of the work presented in this dissertation is to
iterate on existing time series analysis software and build a platform for the
collaborative analysis of massive time series data sets, leveraging state-of-the-art technologies for querying, storing and displaying time series and annotations.
A theoretical and domain-agnostic model was proposed to enable
the implementation of a distributed, extensible, secure and high-performance
architecture that handles multiple annotation proposals simultaneously and
avoids data loss from overlapping contributions or unsanctioned changes.
Analysts can share annotation projects with peers, restricting a set of collaborators
to a smaller scope of analysis and to a limited catalog of annotation
semantics. Annotations can express meaning not only over a segment of time,
but also over a subset of the series that coexist in the same segment. A novel
visual encoding for annotations is proposed, where annotations are rendered
as arcs traced only over the affected series’ curves in order to reduce visual
clutter. Moreover, the implementation of a full-stack prototype with a reactive
web interface was described, directly following the proposed architectural and
visualization model while applied to the HVAC domain. The performance of
the prototype under different architectural approaches was benchmarked, and
the interface was tested for usability. Overall, the work described in this dissertation
contributes a more versatile, intuitive and scalable time series
annotation platform that streamlines the knowledge-discovery workflow.
Mestrado em Engenharia Informática (Master's in Informatics Engineering)
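The annotation model described above, where an annotation covers a time segment and a subset of the series coexisting in that segment, can be sketched minimally; the class and field names, and the conflict rule used to guard against overlapping contributions, are assumptions for illustration, not the dissertation's actual model:

```python
from dataclasses import dataclass

# Minimal sketch of the annotation model: an annotation spans a time segment
# and applies only to a subset of the series within it. Names and the conflict
# rule are illustrative assumptions.

@dataclass
class Annotation:
    start: float       # segment start (e.g., epoch seconds)
    end: float         # segment end
    series: frozenset  # ids of the series the annotation applies to
    label: str

def conflicts(a: Annotation, b: Annotation) -> bool:
    """Two annotations conflict only if their time segments overlap AND
    they touch at least one series in common."""
    overlap_in_time = a.start < b.end and b.start < a.end
    return overlap_in_time and bool(a.series & b.series)

a1 = Annotation(0, 10, frozenset({"hvac.temp"}), "defrost cycle")
a2 = Annotation(5, 15, frozenset({"hvac.temp", "hvac.power"}), "spike")
a3 = Annotation(5, 15, frozenset({"hvac.power"}), "spike")
print(conflicts(a1, a2), conflicts(a1, a3))  # True False
```

A check like this is the kind of test a server would run before merging concurrent annotation proposals, so that contributions overlapping in time but on disjoint series do not block each other.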
Acoustic data optimisation for seabed mapping with visual and computational data mining
Oceans cover 70% of Earth’s surface but little is known about their waters.
While echosounders, often used for the exploration of our oceans, have developed at
a tremendous rate since WWII, the methods used to analyse and interpret the data
still remain the same. These methods are inefficient, time consuming, and often
costly in dealing with the large data volumes that modern echosounders produce. This PhD
project will examine the complexity of the de facto seabed mapping technique by
exploring and analysing acoustic data with a combination of data mining and visual
analytic methods.
First we test the redundancy issues in multibeam echosounder (MBES) data
by using the component plane visualisation of a Self Organising Map (SOM). A total
of 16 visual groups were identified among the 132 statistical data descriptors. The
optimised MBES dataset had 35 attributes from 16 visual groups and represented a
73% reduction in data dimensionality. A combined Principal Component Analysis
(PCA) + k-means was used to cluster both the datasets. The cluster results were
visually compared as well as internally validated using four different internal
validation methods.
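The combined PCA + k-means step named above can be sketched with plain NumPy; this is an illustrative stand-in (the thesis's actual tooling, attribute counts and parameters are not given here), using a deterministic centroid initialisation for reproducibility:

```python
import numpy as np

# Illustrative sketch of a PCA + k-means pipeline: reduce the attributes to a
# few principal components, then cluster with plain Lloyd's algorithm.

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with a deterministic init (evenly spaced rows)."""
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
blob1 = rng.normal(0.0, 0.1, size=(20, 5))  # toy stand-ins for MBES descriptors
blob2 = rng.normal(5.0, 0.1, size=(20, 5))
X = np.vstack([blob1, blob2])
Z = pca(X, 2)          # reduce 5 attributes to 2 components
labels = kmeans(Z, 2)
print(len(set(labels[:20])), len(set(labels[20:])))  # 1 1
```

In the thesis the input would be the 35-attribute optimised MBES dataset rather than these toy blobs, and the cluster count would be chosen via the internal validation indices mentioned above.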
Next we tested two novel approaches in singlebeam echosounder (SBES)
data processing and clustering – using visual exploration for outlier detection and
direct clustering of time series echo returns. Visual exploration identified further
outliers that the automatic procedure was not able to find. The SBES data were then
clustered directly. The internal validation indices suggested the optimal number of
clusters to be three. This is consistent with the assumption that the SBES time series
represented the subsurface classes of the seabed.
Next the SBES data were joined with the corresponding MBES data based on
identification of the closest locations between MBES and SBES. Two algorithms,
PCA + k-means and fuzzy c-means, were tested and the results visualised. From visual
comparison, the cluster boundaries appeared to be better defined than in the
clustered MBES data alone. The results seem to indicate that adding SBES did
in fact improve the boundary definitions.
Next the cluster results from the analysis chapters were validated against
ground truth data using a confusion matrix and kappa coefficients. For MBES, the
classes derived from optimised data yielded better accuracy compared to that of the
original data. For SBES, direct clustering was able to provide a relatively reliable
overview of the underlying classes in the survey area. The combined MBES + SBES
data provided by far the best accuracy for mapping with almost a 10% increase in
overall accuracy compared to that of the original MBES data.
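The validation step above rests on a confusion matrix, overall accuracy and the kappa coefficient; a minimal sketch of that computation (the numbers below are made up for illustration, not the thesis's results):

```python
import numpy as np

# Sketch of the accuracy-assessment step: overall accuracy and Cohen's kappa
# from a confusion matrix of predicted vs. ground-truth classes.

def accuracy_and_kappa(cm):
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                      # observed agreement (accuracy)
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2  # agreement expected by chance
    return po, (po - pe) / (1 - pe)            # kappa corrects for chance

cm = [[40, 5], [10, 45]]                       # illustrative 2-class matrix
acc, kappa = accuracy_and_kappa(cm)
print(round(acc, 2), round(kappa, 2))          # 0.85 0.7
```

Kappa is the natural companion to overall accuracy here because it discounts the agreement a random classifier would achieve given the class proportions.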
The results proved to be promising in optimising the acoustic data and
improving the quality of seabed mapping. Furthermore, these approaches have the
potential for significant time and cost savings in the seabed mapping process. Finally,
some future directions are recommended for the findings of this research project, with
the consideration that they could contribute to the further development of seabed
mapping at mapping agencies worldwide.
Analyse et détection des trajectoires d'approches atypiques des aéronefs à l'aide de l'analyse de données fonctionnelles et de l'apprentissage automatique [Analysis and detection of atypical aircraft approach trajectories using functional data analysis and machine learning]
Improving aviation safety generally involves identifying, detecting and managing undesirable events that can lead to final events with fatalities. Previous studies conducted by the DSAC, the French National Supervisory Authority, have led to the identification of non-compliant approaches presenting deviations from standard procedures as undesirable events. This thesis aims to explore functional data analysis and machine learning techniques in order to provide algorithms for the detection and analysis of atypical trajectories in approach from the ground side. Four research directions are investigated. The first axis aims to develop a post-operational analysis algorithm based on functional data analysis techniques and unsupervised learning for the detection of atypical behaviours in approach. The model is confronted with the analysis of airline flight safety offices, and is applied in the particular context of the COVID-19 crisis to illustrate its potential use while the global ATM system faces a standstill. The second axis of research addresses the generation and extraction of information from radar data using new techniques such as machine learning. These methodologies improve the understanding and analysis of trajectories, for example in the estimation of on-board parameters from radar parameters. The third axis proposes novel data manipulation and generation techniques using the functional data analysis framework. Finally, the fourth axis focuses on extending the post-operational algorithm into real time with the use of optimal control techniques, giving directions towards new situation-awareness alerting systems.
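The unsupervised detection idea in the first axis can be caricatured very simply: score each approach trajectory by its distance to the mean curve and flag the largest scores as atypical. This is a simplified stand-in for the thesis's functional-data-analysis machinery, applied to trajectories already resampled at common points; all names and data below are invented:

```python
import numpy as np

# Hedged sketch of unsupervised atypicality scoring: distance of each
# trajectory to the mean curve. A stand-in for the functional-data approach.

def atypicality_scores(trajectories):
    """trajectories: (n_flights, n_samples), e.g. altitude vs. distance-to-go."""
    T = np.asarray(trajectories, dtype=float)
    mean_curve = T.mean(axis=0)                  # pointwise mean trajectory
    return np.linalg.norm(T - mean_curve, axis=1)

# Ten near-identical descents plus one oscillating (atypical) approach:
normal = [np.linspace(3000, 0, 50) + i for i in range(10)]
atypical = np.linspace(3000, 0, 50) + 500 * np.sin(np.linspace(0, 3, 50))
scores = atypicality_scores(normal + [atypical])
print(int(scores.argmax()))  # 10 -- the oscillating approach scores highest
```

Real functional-data methods would work with smoothed functional representations and more robust depth or outlyingness measures, but the ranking principle is the same.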
Analiza i predviđanje toka vremenskih serija pomoću "Case-Based Reasoning" tehnologije [Analysis and prediction of time series using Case-Based Reasoning technology]
This thesis describes a promising approach in which the problem of time series analysis and prediction is solved using Case Based Reasoning (CBR) technology. The foundations and main concepts of this technology are described in detail, and a detailed study of different approaches to time series analysis is given. The system CuBaGe (Curve Base Generator), a robust and general architecture for curve representation and for indexing time series databases based on Case Based Reasoning technology, was developed, and a corresponding similarity measure was modelled for the given kind of curve representation. The robustness and generality of the system are illustrated by a real application in the domain of financial prediction, where the architecture performed equally well not only on conventional time series (where all values are known) but also on some non-standard time series (sparse, vague, non-equidistant). Dealing with non-standard time series is the greatest advantage of the presented architecture.
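One way to compare sparse, non-equidistant series of the kind mentioned above is to resample both onto a shared time grid before measuring distance. This is an illustrative sketch, not the CuBaGe curve representation or its similarity measure:

```python
import numpy as np

# Illustrative sketch: compare two non-equidistant time series by linearly
# interpolating both onto a common grid and taking an RMS distance.

def resampled_distance(t1, v1, t2, v2, n_grid=100):
    lo = max(min(t1), min(t2))           # overlapping time range only
    hi = min(max(t1), max(t2))
    grid = np.linspace(lo, hi, n_grid)
    a = np.interp(grid, t1, v1)          # values onto the shared grid
    b = np.interp(grid, t2, v2)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Two sparse series sampled at different, irregular times:
d_same = resampled_distance([0, 1, 4], [1, 2, 5], [0, 2, 4], [1, 3, 5])
d_diff = resampled_distance([0, 1, 4], [1, 2, 5], [0, 2, 4], [9, 9, 9])
print(d_same < d_diff)  # True
```

The appeal of such a scheme is exactly what the abstract highlights: the two series need not share timestamps, sampling rates, or even lengths.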
Visuelle Suchanfragen auf graphbasierten Datenstrukturen [Visual queries on graph-based data structures]
The total amount of available data is steadily increasing.
Standardized data formats allow for connecting different data sources, which can include merging of different data items depending on the use case.
This creates even more comprehensive datasets that render finding a particular piece of information difficult.
If the data consist of unstructured or homogeneous information, searching is limited to matching patterns against data items or parts thereof, for instance character strings or regular expressions that match parts of textual data items.
However, the availability of structured data is increasing.
This kind of data is either stored as distinct facets of each data item from the outset, or originally unstructured data has been enriched to form a structure.
As each facet can indicate a link to another data item, the entire dataset forms a graph that is suitable for faceted search concepts.
At this point, some interoperability across data sources can be achieved by employing Semantic Web approaches and techniques.
Numerous works have attempted to visualize an overview of the entire dataset, or details of a particular excerpt of the dataset.
Finding specific data remains a problem, however, as the precise specification of search criteria is difficult.
As these criteria can be connected in complex ways, just like the overview of datasets, this issue lends itself to using visual representations.
A special trait of this application of visualization is the possible absence of any data.
Instead, the visualization must be capable of conveying the conceptual idea of a search query without displaying any data.
Former works related to this problem focused on the visual representation of search queries and filter expressions for relational and object-oriented databases.
More recent works increasingly address a Semantic Web context.
Several of these concepts, however, lack a clear abstract definition.
Also, scalability issues appear in the case of complex queries.
Furthermore, little attention was paid to how to connect several concepts in order to combine advantages of different query visualizations.
This dissertation considers the described problems and presents six concepts for query visualization.
Both generic visualizations - that is, for filtering any kind of structured data - and domain-specific or type-specific visualizations are addressed.
In part, existing approaches have been adapted to the particularities of graph-based data structures.
Likewise, several new approaches specifically designed for this kind of data are presented.
For each of these concepts, the necessity of a dataset is discussed.
Moreover, options for providing a preview on query results from such a dataset, if available, are considered.
Finally, ways for connecting the query visualization concepts are presented.
This connection approach is suitable for systematically linking together arbitrary query visualizations.
By means of the connection approach, users can express different parts of a query using different visualization concepts, in order to benefit from the advantages of several query visualizations at a time.
In this way, queries that include complex criteria as well as complex relations between criteria can now be defined and displayed visually without losing the visual overview of either of these aspects.
Uloga mera sličnosti u analizi vremenskih serija [The role of similarity measures in time series analysis]
The subject of this dissertation encompasses a comprehensive overview and analysis of the impact of the Sakoe-Chiba global constraint on the most commonly used elastic similarity measures in the field of time-series data mining, with a focus on classification accuracy. The choice of similarity measure is one of the most significant aspects of time-series analysis: it should correctly reflect the resemblance between the data presented in the form of time series. Similarity measures represent a critical component of many time-series mining tasks, including classification, clustering, prediction, anomaly detection, and others. The research covered by this dissertation is oriented towards several issues: 1. a review of the effects of global constraints on the performance of computing similarity measures, 2. a detailed analysis of the influence of constraining the elastic similarity measures on the accuracy of classical classification techniques, 3. an extensive study of the impact of different weighting schemes on the classification of time series, and 4. the development of an open-source library (Framework for Analysis and Prediction - FAP) that integrates the main techniques and methods required for analysing and mining time series, and which was used for the realization of these experiments.
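The Sakoe-Chiba constraint studied above restricts the warping path of elastic measures such as Dynamic Time Warping to a band around the diagonal. A plain-Python sketch of band-constrained DTW (a simplified stand-in, not the FAP library's implementation):

```python
import math

# Sketch of DTW with a Sakoe-Chiba band: the warping path may only visit
# cells with |i - j| <= window. window=0 degenerates to a lock-step
# (Euclidean-style) comparison; larger windows allow more elastic alignment.

def dtw_sakoe_chiba(x, y, window):
    n, m = len(x), len(y)
    INF = math.inf
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0]  # same shape, shifted by one step
print(dtw_sakoe_chiba(x, y, window=1), dtw_sakoe_chiba(x, y, window=0))
# 1.0 4.0 -- widening the band lets DTW absorb the one-step shift
```

This is also why the constraint matters for classification accuracy: the band width trades alignment flexibility against both computation cost and the risk of pathological warpings.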
Digital Intelligence – Möglichkeiten und Umsetzung einer informatikgestützten Frühaufklärung [Digital Intelligence – opportunities and implementation of data-driven foresight]
Das Ziel der Digital Intelligence bzw. datengetriebenen Strategischen Frühaufklärung ist, die Zukunftsgestaltung auf Basis valider und fundierter digitaler Information mit vergleichsweise geringem Aufwand und enormer Zeit- und Kostenersparnis zu unterstützen. Hilfe bieten innovative Technologien der (halb)automatischen Sprach- und Datenverarbeitung wie z. B. das Information Retrieval, das (Temporal) Data, Text und Web Mining, die Informationsvisualisierung, konzeptuelle Strukturen sowie die Informetrie. Sie ermöglichen, Schlüsselthemen und latente Zusammenhänge aus einer nicht überschaubaren, verteilten und inhomogenen Datenmenge wie z. B. Patenten, wissenschaftlichen Publikationen, Pressedokumenten oder Webinhalten rechzeitig zu erkennen und schnell und zielgerichtet bereitzustellen. Die Digital Intelligence macht somit intuitiv erahnte Muster und Entwicklungen explizit und messbar.
Die vorliegende Forschungsarbeit soll zum einen die Möglichkeiten der Informatik zur datengetriebenen Frühaufklärung aufzeigen und zum zweiten diese im pragmatischen Kontext umsetzen.
Ihren Ausgangspunkt findet sie in der Einführung in die Disziplin der Strategischen Frühaufklärung und ihren datengetriebenen Zweig – die Digital Intelligence.
The theoretical and, in particular, computer-science foundations of foresight are discussed and classified, above all the possibilities of time-oriented data exploration.
Various methods and software tools are designed and developed that support the time-oriented exploration of unstructured text data in particular (temporal text mining). Only approaches that can be used pragmatically in the context of a large institution and under the specific requirements of strategic foresight are considered. Notable results include a platform for collective search and an innovative method for identifying weak signals.
A Digital Intelligence service is presented and discussed that was successfully implemented on this basis in a global technology-oriented corporation and that enables systematic competitor, market and technology analysis based on people's digital traces.
Abstract
Acknowledgements
Table of contents
List of tables
List of figures
A – INTRODUCTION
1 Background and motivation
2 Contribution and structure of the thesis
B – THEORY
B0 – Digital Intelligence
3 Derivation and definition of Digital Intelligence
4 Distinction from Business Intelligence
5 Overview of different text types
6 Informetrics: bibliometrics, scientometrics, webometrics
7 Information systems in the context of Digital Intelligence
B1 – Business foundations of Digital Intelligence
8 Strategic foresight
8.1 Facets and historical development
8.2 Methods
8.3 Process
8.4 Definition of recurring terms
8.5 Foundations of innovation and diffusion research
B2 – Computer-science foundations of Digital Intelligence
9 From time, data, text and metadata to multidimensional time-oriented (text) data
9.1 Time – a definition
9.1.1 Basic temporal elements and operators
9.1.2 Linear, cyclic and branching developments
9.1.3 Temporal (in)determinacy
9.1.4 Temporal granularity
9.2 Text
9.2.1 Text and its linguistic-textual levels
9.2.2 From signals and data to information and knowledge
9.3 Data
9.3.1 Origin
9.3.2 Data size
9.3.3 Data type and value range
9.3.4 Data structure
9.3.5 Dimensionality
9.4 Metadata
9.5 Summary and multidimensional time-oriented data
10 Time-oriented data exploration methods
10.1 Time-oriented database queries and OLAP
10.2 Time-oriented information retrieval
10.3 Data mining and temporal data mining
10.3.1 Representations of time-oriented data
10.3.2 Tasks of temporal data mining
10.4 Text mining and temporal text mining
10.4.1 Foundations of text mining
10.4.2 Developed, used and licensed text-mining applications
10.4.3 Forms of temporal text mining
10.4.3.1 Discovery of causal and time-oriented rules
10.4.3.2 Identification of deviations and volatility
10.4.3.3 Identification and time-oriented organization of topics
10.4.3.4 Time-oriented analysis based on conceptual structures
10.4.3.5 Time-oriented analysis of frequency, connectedness and hierarchies
10.4.3.6 Semi-automatic identification of trends
10.4.3.7 Handling dynamically updated data
10.5 Web mining and temporal web mining
10.5.1 Web content mining
10.5.2 Web structure mining
10.5.3 Web usage mining
10.5.4 Temporal web mining
10.6 Information visualization
10.6.1 Visualization techniques
10.6.1.1 Visualization techniques by data type
10.6.1.2 Visualization techniques by form of representation
10.6.1.3 Visualization techniques by type of interaction
10.6.1.4 Visualization techniques by type of visual task
10.6.1.5 Visualization techniques by visualization process
10.6.2 Time-oriented visualization techniques
10.6.2.1 Static representations
10.6.2.2 Dynamic representations
10.6.2.3 Event-based representations
10.7 Summary
11 Conceptual structures
12 Synopsis for time-oriented data exploration
C – IMPLEMENTATION OF A DIGITAL INTELLIGENCE SYSTEM
13 Definition of text-based indicators
14 Requirements for a Digital Intelligence system
15 Description of the implementation of a Digital Intelligence system
15.1 Concept of a Digital Intelligence service
15.1.1 Portal usage
15.1.2 Profiles
15.1.3 In-depth analyses
15.1.4 Technology scanning
15.2 Relevant data for Digital Intelligence (example)
15.3 Foresight platform
15.4 WCTAnalyze and automatic extraction of topic-specific events
15.5 SemanticTalk
15.6 Semi-automatic identification of trends
15.6.1 Time-series correlation
15.6.2 HD-SOM scanning
D – SUMMARY
Appendix A: Process images of developed (temporal) text mining applications
Appendix B: Synopsis of time-oriented data exploration
Bibliography
Declaration of authorship
Academic career of the author
Publications