71 research outputs found

    Multi-Source Spatial Entity Extraction and Linkage


    Interactive data analysis and its applications on multi-structured datasets

    Ph.D. (Doctor of Philosophy)

    Query processing in complex modern traffic networks

    The transport sector generates about one quarter of all greenhouse gas emissions worldwide. In the European Union (EU), passenger cars and light-duty trucks account for over half of these traffic-related emissions. Everyday traffic is thus a serious environmental threat. At the same time, transport is a key factor for the ambitious EU climate goals, among them, for instance, the reduction of greenhouse gas emissions by 85 to 90 percent within the next 35 years. This thesis investigates complex traffic networks and their requirements from a computer science perspective. The modeling of, and query processing in, modern traffic networks are its pivotal topics. Challenging theoretical problems are examined from different perspectives, and novel algorithmic solutions are provided. Practical problems are investigated and solved, for instance by employing qualitative crowdsourced information and sensor data from various sources.

    Modern traffic networks are often modeled as graphs, i.e., defined by sets of nodes and edges. In conventional graphs, the edges are assigned numerical weights, for instance reflecting cost criteria like distance or travel time. In multicriteria networks, the edges reflect multiple, possibly dynamically changing cost criteria. While these networks allow for diverse queries and meaningful insights, query processing usually becomes significantly more complex, and novel means of computation are required to keep it efficient. The crucial task of computing optimal paths is particularly expensive under multiple criteria. The most established set of optimal paths in multicriteria networks is referred to as the path skyline (or the set of Pareto-optimal paths). Until now, computing the path skyline required either extensive precomputation or networks of limited size and complexity; neither demand can be met by modern traffic networks. This thesis presents a novel method which makes on-the-fly computation of path skylines possible, even in dynamic networks with three or more cost criteria.

    Another problem examined is the exponential growth of path skylines: the number of elements in a path skyline is potentially exponential in the number of cost criteria and in the number of edges between start and target. This often produces less meaningful results and can hinder usability. These drawbacks emphasize the importance of the linear path skyline, which is also investigated in this thesis. The linear path skyline is based on a different notion of optimality. Under this notion, the linear path skyline is a subset of the conventional path skyline but in general contains fewer and more diverse elements. Thus, the linear path skyline facilitates interpretation while in general reducing computational effort. This topic is first studied in networks with two cost criteria and subsequently extended to more cost criteria.

    These cost criteria are not limited to purely quantitative measures like distance and travel time. This thesis examines the integration of qualitative information into abstractly modeled road networks. It is proposed to mine crowdsourced data for qualitative information and to use this information to enrich road network graphs. These enriched networks may in turn be used to produce routing suggestions which reflect an opinion of the crowd. From data processing to knowledge extraction, network enrichment, and route computation, the possibilities and challenges of crowdsourced data as a source of information are surveyed.
    Additionally, this thesis substantiates the practicability of network enrichment in real-world experiments. The description of a demonstration framework which applies some of the presented methods to the use case of tourist route recommendation serves as an example. The methods may also be applied to a novel graph-based routing problem proposed in this thesis. The problem extends the family of Orienteering Problems, which find frequent application in tourist routing and other tasks. An approximate solution to this NP-hard problem is presented and evaluated on a large-scale, real-world, time-dependent road network.

    Another central aspect of modern traffic networks is the integration of sensor data, often referred to as telematics. Nowadays, manifold sensors provide a plethora of data, and using this data to optimize traffic is and will continue to be a challenging task for research and industry. Some of the applications which qualify for the integration of modern telematics are surveyed in this thesis. For instance, the abstract problem of consumable and reoccurring resources in road networks is studied; an application of this problem is the search for a vacant parking space. Taking statistical and real-time sensor information into account, a stochastic routing algorithm which maximizes the probability of finding a vacant space is proposed. Furthermore, the thesis presents means for the extraction of driving preferences, helping to better understand user behavior in traffic. The theoretical concepts partially find application in a demonstration framework described in this thesis. This framework provides features which were developed for a real-world pilot project on the topics of electric and shared mobility. Actual sensor car data collected in the project gives insight into the challenges of managing a fleet of electric vehicles.
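    To make the notions above concrete, the following is a minimal, illustrative Python sketch (not the thesis's algorithms) of Pareto dominance, the path skyline, and the linear path skyline over candidate routes described by multicriteria cost vectors. All function names and the toy cost values are assumptions made for this example.

def dominates(a, b):
    """True if cost vector a dominates b: a is no worse in every
    criterion and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def path_skyline(candidates):
    """Filter candidate paths (given as multicriteria cost vectors)
    down to the Pareto-optimal ones."""
    skyline = []
    for cand in candidates:
        if any(dominates(other, cand) for other in candidates if other is not cand):
            continue
        skyline.append(cand)
    return skyline

def linear_skyline(candidates):
    """Keep only paths that are optimal under *some* linear weighting
    of two criteria -- brute-forced over a coarse weight grid,
    purely for illustration."""
    result = set()
    for w in [i / 10 for i in range(11)]:  # weight on criterion 0
        best = min(candidates, key=lambda c: w * c[0] + (1 - w) * c[1])
        result.add(best)
    return sorted(result)

# Example: candidate routes described by (travel time, distance) costs.
routes = [(10, 7), (12, 5), (9, 9), (11, 6), (15, 4), (10, 8)]
print(path_skyline(routes))    # Pareto-optimal routes
print(linear_skyline(routes))  # routes optimal under some linear weighting

    In this toy example every element of the linear skyline is also in the path skyline, reflecting the subset relationship described in the abstract; the linear path skyline is the smaller and easier to interpret of the two.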

    Skyline queries in dynamic environments

    Ph.D. (Doctor of Philosophy)

    Coping with new Challenges in Clustering and Biomedical Imaging

    The last years have seen a tremendous increase of data acquisition in different scientific fields such as molecular biology, bioinformatics, or biomedicine. Therefore, novel methods are needed for automatic processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups in order to maximize the intra-cluster similarity and to minimize the inter-cluster similarity. In contrast to unsupervised learning tasks like clustering, classification is a supervised learning task that aims at predicting the group membership of data objects on the basis of rules learned from a training set where the group membership is known. Specialized methods have been proposed for hierarchical and partitioning clustering. However, these methods suffer from several drawbacks.

    In the first part of this work, new clustering methods are proposed that address problems of conventional clustering algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle, which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory. In this way, the search space is explored more effectively. Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data. Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects which minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing skylines of different data sets, which can directly be integrated into different data mining tasks such as clustering or classification. The experiments show that SkyDist, in combination with different clustering algorithms, can give useful insights in many applications.

    In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for an early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images combining the data mining steps of feature selection, clustering, and classification. As a result, a set of highly selective features discriminating between patients with Alzheimer's disease and healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming. Therefore, we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis.
In another study, we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.
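    The abstract above leans on the MDL principle: a clustering is good if it lets the data be compressed well. The sketch below is a deliberately crude, flat (non-hierarchical) illustration of that idea, using a toy k-means and placeholder coding costs to pick a number of clusters. The functions kmeans and description_length, the 32-bit parameter cost, and the synthetic blobs are all assumptions of this sketch; it is a stand-in, not ITCH's hierarchical MDL formulation.

import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """A tiny k-means, only used to obtain candidate partitions."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def description_length(X, labels, centers):
    """Two-part code: bits for the model (centers + cluster ids) plus
    bits for the data under per-cluster spherical Gaussians."""
    n, d = X.shape
    k = len(centers)
    model_bits = k * d * 32 + n * np.log2(k)   # crude parameter/label cost
    data_bits = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) < 2:                       # skip degenerate clusters
            continue
        var = pts.var() + 1e-9
        # negative log2-likelihood of a spherical Gaussian, summed over points
        data_bits += 0.5 * len(pts) * d * np.log2(2 * np.pi * np.e * var)
    return model_bits + data_bits

# Synthetic data: three Gaussian blobs.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in [(0, 0), (3, 3), (0, 4)]])
for k in range(1, 7):
    labels, centers = kmeans(X, k)
    print(k, round(description_length(X, labels, centers), 1))
# The description length should be (near-)minimal around k = 3.

    The model term grows with k while the data term shrinks, so the minimum of their sum acts as a parameter-free choice of the number of clusters; this balance is the intuition behind the MDL-based methods described above.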

    Applying reranking strategies to route recommendation using sequence-aware evaluation

    Venue recommendation approaches have become particularly useful nowadays due to the increasing number of users registered in location-based social networks (LBSNs), applications where it is possible to share the venues someone has visited and to establish connections with other users in the system. Moreover, the venue recommendation problem has certain characteristics that differ from traditional recommendation, and it can also benefit from other contextual aspects to recommend not only independent venues but complete routes, i.e., sequences of related venues. Hence, in this paper we investigate the problem of route recommendation from the perspective of generating a sequence of meaningful locations for the users, by analyzing both their personal interests and the intrinsic relationships between the venues. We divide this problem into three stages and propose general solutions for each case: first, we state a general methodology to derive user routes from LBSN datasets that can be applied in as many scenarios as possible; second, we define a reranking framework that generates sequences of items from recommendation lists using different techniques; and third, we propose an evaluation metric that captures both accuracy and sequentiality at the same time. We report experiments on several LBSN datasets, using different recommendation quality metrics and algorithms. As a result, we have found that classical recommender systems are comparable to algorithms specifically tailored for this task, although exploiting the temporal dimension in general helps to improve the performance of these techniques; additionally, the proposed reranking strategies show promising results in terms of finding a trade-off between relevance, sequentiality, and distance, essential dimensions in both venue and route recommendation tasks.

    This work has been funded by the Ministerio de Ciencia, Innovación y Universidades (reference: TIN2016-80630-P) and by the European Social Fund (ESF), within the 2017 call for predoctoral contracts.
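    The paper's reranking framework is not spelled out in this abstract, so the following Python sketch only illustrates the general idea of turning a relevance-ranked venue list into a route: a greedy reranker that trades off relevance against distance to the previously selected venue. The function names, the lam trade-off parameter, and the toy coordinates are assumptions made for illustration.

import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def greedy_route_rerank(ranked_venues, start, length, lam=0.5):
    """Turn a relevance-ranked venue list into a route of `length` venues.

    ranked_venues: list of (venue_id, relevance, (lat, lon)), best first.
    At every step the next venue maximises
        lam * relevance - (1 - lam) * distance_to_previous_venue.
    """
    remaining = list(ranked_venues)
    route, current = [], start
    while remaining and len(route) < length:
        best = max(remaining,
                   key=lambda v: lam * v[1] - (1 - lam) * haversine_km(current, v[2]))
        remaining.remove(best)
        route.append(best[0])
        current = best[2]
    return route

# Toy example: four candidate venues around a city centre.
candidates = [
    ("museum",  0.9, (40.4146, -3.6921)),
    ("park",    0.8, (40.4153, -3.6844)),
    ("market",  0.7, (40.4087, -3.7090)),
    ("gallery", 0.6, (40.4140, -3.6930)),
]
print(greedy_route_rerank(candidates, start=(40.4168, -3.7038), length=3))

    Varying lam moves the result between a purely relevance-ordered list (lam = 1) and a purely distance-driven tour (lam = 0), mirroring the relevance/sequentiality/distance trade-off discussed above.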

    Efficient Knowledge Extraction from Structured Data

    Knowledge extraction from structured data aims at identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. The core step of this process is the application of a data mining algorithm in order to produce an enumeration of particular patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within a cluster is maximized and the similarity of objects from different clusters is minimized. In this thesis, we advance the state-of-the-art data mining algorithms for analyzing structured data types and describe the development of innovative solutions for hierarchical data mining.

    The EM-based hierarchical clustering method ITCH (Information-Theoretic Cluster Hierarchies) is designed to address four challenges: (1) to guide the hierarchical clustering algorithm to identify only meaningful and valid clusters; (2) to represent each cluster in the hierarchy by an intuitive description, e.g., a probability density function; (3) to handle outliers consistently; and (4) to avoid difficult parameter settings. ITCH is built on a hierarchical variant of the information-theoretic principle of Minimum Description Length (MDL). Interpreting the hierarchical cluster structure as a statistical model of the dataset, it can be used for effective data compression by Huffman coding. Thus, the achievable compression rate induces a natural objective function for clustering, which automatically satisfies all four goals mentioned above. The genetic-based hierarchical clustering algorithm GACH (Genetic Algorithm for finding Cluster Hierarchies) overcomes the problem of getting stuck in a local optimum by a beneficial combination of genetic algorithms, information theory, and model-based clustering.

    Besides hierarchical data mining, we also made contributions to more complex data structures, namely objects that consist of mixed-type attributes and skyline objects. The algorithm INTEGRATE performs integrative mining of heterogeneous data, one of the major challenges in the next decade, through a unified view of numerical and categorical information in clustering. Once more supported by the MDL principle, INTEGRATE guarantees usability on real-world data. For skyline objects we developed SkyDist, a similarity measure for comparing different skyline objects, which is therefore a first step towards performing data mining on this kind of data structure. Applied in a recommender system, for example, SkyDist can point the user to alternative car types exhibiting a price/mileage behavior similar to that of the original query. For mining graph-structured data, we developed different approaches that are able to detect patterns in static as well as dynamic networks. We confirmed the practical feasibility of our novel approaches on large real-world case studies ranging from medical brain data to biological yeast networks.

    In the second part of this thesis, we focused on boosting the knowledge extraction process. We achieved this objective by an intelligent adoption of Graphics Processing Units (GPUs). GPUs have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks but can also be used for general numeric and symbolic computations.
As a major advantage, GPUs provide extreme parallelism combined with high memory bandwidth at low cost. In this thesis, we propose CUDA-DClust and CUDA-k-means, algorithms for computationally expensive data mining tasks such as similarity search and different clustering paradigms, which are designed for the highly parallel environment of a GPU. We define a multi-dimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU, and we demonstrate the superiority of our algorithms running on the GPU over their conventional counterparts on the CPU in terms of efficiency.
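    The CUDA kernels and the GPU index structure themselves cannot be reconstructed from this abstract; the sketch below merely illustrates, with vectorized NumPy as a stand-in for a GPU kernel, why similarity search parallelizes so well: every (query, point) distance is an independent computation. The array names and the eps value are illustrative assumptions.

import numpy as np

def range_query(points, queries, eps):
    """Epsilon-range similarity search, written so that every
    (query, point) distance is an independent computation -- the
    data-parallel structure a GPU kernel would exploit.

    Returns a boolean matrix M with M[i, j] = True iff
    points[j] lies within `eps` of queries[i] (Euclidean distance).
    """
    # Pairwise squared distances via broadcasting: shape (n_queries, n_points).
    diff = queries[:, None, :] - points[None, :, :]
    sq_dist = np.einsum('qpd,qpd->qp', diff, diff)
    return sq_dist <= eps * eps

rng = np.random.default_rng(42)
pts = rng.random((10_000, 8))        # 10k points in 8 dimensions
qrs = rng.random((5, 8))             # 5 query points
hits = range_query(pts, qrs, eps=0.5)
print(hits.shape, hits.sum(axis=1))  # number of neighbours per query

    On a GPU, each entry of the distance matrix could be computed by its own thread; this kind of massive, independent parallelism is what the abstract refers to.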