16 research outputs found

    A compact and scalable encoding for updating XML based on node labeling schemes

    The eXtensible Markup Language (XML) has been adopted as the new standard for data exchange on the World Wide Web. As the rate of adoption increases, there is an ever-pressing need to store, query and update XML in its native format, thereby eliminating the overhead of parsing and transforming XML in and out of various data formats. However, the hierarchical, ordered and semi-structured properties of the tree structure underlying the XML data model present many challenges to updating XML. In particular, many tree labeling schemes were designed to solve a particular problem or provide a particular feature, often at the expense of other important features. In this dissertation, we identify the core properties that characterize a good dynamic labeling scheme for XML. We focus on four features central to the outstanding problems in existing dynamic labeling schemes: a compact label encoding, scalability, deleted node label reuse, and a label storage scheme for binary-encoded bit-string node labels. At present there is no dynamic labeling scheme that integrates support for all four features. We present a novel compact and scalable adaptive encoding method that tightly constrains the growth rate of label size under arbitrary node insertion and deletion scenarios while scaling efficiently. We deploy our encoding method in two novel dynamic labeling schemes for XML that can completely avoid node relabeling, process frequently skewed insertions gracefully and reuse deleted node labels.

    Applications of Internet of Things

    This book introduces the Special Issue entitled “Applications of Internet of Things” of the ISPRS International Journal of Geo-Information. The issue covers three main topics: (I) intelligent transportation systems (ITSs), (II) location-based services (LBSs), and (III) sensing techniques and applications. Three papers on ITSs are as follows: (1) “Vehicle positioning and speed estimation based on cellular network signals for urban roads,” by Lai and Kuo; (2) “A method for traffic congestion clustering judgment based on grey relational analysis,” by Zhang et al.; and (3) “Smartphone-based pedestrian’s avoidance behavior recognition towards opportunistic road anomaly detection,” by Ishikawa and Fujinami. Three papers on LBSs are as follows: (1) “A high-efficiency method of mobile positioning based on commercial vehicle operation data,” by Chen et al.; (2) “Efficient location privacy-preserving k-anonymity method based on the credible chain,” by Wang et al.; and (3) “Proximity-based asynchronous messaging platform for location-based Internet of things service,” by Gon Jo et al. Two papers on sensing techniques and applications are as follows: (1) “Detection of electronic anklet wearers’ groupings throughout telematics monitoring,” by Machado et al.; and (2) “Camera coverage estimation based on multistage grid subdivision,” by Wang et al.

    Child Prime Label Approaches to Evaluate XML Structured Queries

    The adoption of the eXtensible Markup Language (XML) as the standard format to store and exchange semi-structured data has been gaining momentum. The growing number of XML documents creates a need for XML querying algorithms that can retrieve XML data efficiently. Because twig pattern matching is so important in XML retrieval systems, finding all matching occurrences of a tree pattern query in an XML document is often considered a specific task for XML databases as well as a core operation in XML query processing. This thesis presents the design and implementation of a new indexing technique, called the Child Prime Label (CPL), which exploits the property of prime numbers to identify Parent-Child (P-C) edges in twig pattern queries (TPQs) during query evaluation. The CPL approach can be incorporated efficiently within existing labelling schemes. The major contributions of this thesis are a set of novel twig matching algorithms which apply the CPL approach and focus on reducing the overhead of storing useless elements and performing unnecessary computations during output enumeration. The research presented here is the first to provide an efficient and general solution for TPQs containing ordering constraints and positional predicates specified by the XML query languages. To evaluate the CPL approaches, the holistic model was implemented as an experimental prototype in which the proposed approaches are compared against state-of-the-art holistic twig algorithms. Extensive performance studies on various real-world and artificial datasets were conducted to demonstrate the significant improvement of the CPL approaches over previous indexing and querying methods. The experimental results demonstrate the validity and improvements of the new algorithms over other related methods on various common subclasses of TPQs. Moreover, the scalability tests reveal that the new algorithms are more suitable for processing large XML datasets.
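
The abstract does not spell out how CPL labels are built, so the following is only a minimal sketch of the general prime-labeling idea it alludes to, not the thesis's scheme. It assumes each node receives its own distinct prime and a label equal to that prime multiplied by its parent's label; the function names (`label_tree`, `is_parent`, `is_ancestor`) and the sample tree are hypothetical.

```python
from itertools import count


def primes():
    """Yield 2, 3, 5, ... using trial division (adequate for a small demo)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def label_tree(tree, root):
    """Map each node to (self_prime, label), with label = self_prime * parent label.

    `tree` maps a node name to the ordered list of its children.
    The root's parent label is taken to be 1 (an assumption of this sketch).
    """
    gen = primes()
    labels = {}
    stack = [(root, 1)]
    while stack:
        node, parent_label = stack.pop()
        p = next(gen)
        labels[node] = (p, p * parent_label)
        # Push children in reverse so they are labeled in document order.
        for child in reversed(tree.get(node, [])):
            stack.append((child, p * parent_label))
    return labels


def is_ancestor(labels, a, d):
    """a is a proper ancestor of d iff a's label divides d's label."""
    return a != d and labels[d][1] % labels[a][1] == 0


def is_parent(labels, p, c):
    """p is the parent of c iff dividing out c's own prime gives p's label."""
    self_prime, label = labels[c]
    return label // self_prime == labels[p][1]


# Hypothetical document: book -> (title, chapter), chapter -> section
doc = {"book": ["title", "chapter"], "chapter": ["section"]}
doc_labels = label_tree(doc, "book")
```

With labels like these, a Parent-Child edge in a twig query can be checked with one integer division instead of a tree traversal, which is the kind of property the abstract describes exploiting.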

    Clustering Approaches for Multi-source Entity Resolution

    Entity Resolution (ER) or deduplication aims at identifying entities, such as specific customer or product descriptions, in one or several data sources that refer to the same real-world entity. ER is of key importance for improving data quality and has a crucial role in data integration and querying. The previous generation of ER approaches focused on integrating records from two relational databases or performing deduplication within a single database. Nevertheless, in the era of Big Data the number of available data sources is increasing rapidly. Therefore, large-scale data mining or querying systems need to integrate data obtained from numerous sources. For example, in online digital libraries or e-shops, publications or products are incorporated from a large number of archives or suppliers across the world or within a specified region or country to provide a unified view for the user. This process requires data consolidation from numerous heterogeneous data sources, most of which are evolving. As the number of sources grows, data heterogeneity and velocity as well as the variance in data quality increase. Therefore, multi-source ER, i.e. finding matching entities in an arbitrary number of sources, is a challenging task. Previous efforts to match and cluster entities across multiple sources (> 2) mostly treated all sources as a single source. This approach cannot use metadata or provenance information to enhance the integration quality and leads to poor results because it ignores differences in quality between sources. The conventional ER pipeline consists of blocking, pair-wise matching of entities, and classification. To meet the new requirements, holistic clustering approaches that can scale to many data sources are needed. 
Holistic clustering-based ER should further overcome the restriction of pairwise linking of entities by grouping entities from multiple sources into clusters. The clustering step aims at removing false links while adding missing true links across sources. Additionally, incremental clustering and repairing approaches need to be developed to cope with the ever-increasing number of sources and new incoming entities. To this end, we developed novel clustering and repairing schemes for multi-source entity resolution. The approaches are capable of grouping entities from multiple clean (duplicate-free) sources, as well as handling data from an arbitrary combination of clean and dirty sources. The clustering schemes developed specifically for multi-source ER can obtain superior results compared to general-purpose clustering algorithms. Additionally, we developed incremental clustering and repairing methods in order to handle evolving sources. The proposed incremental approaches are capable of incorporating new sources as well as new entities from existing sources. The more sophisticated approach is able to repair previously determined clusters, and consequently yields improved quality and a reduced dependency on the insert order of the new entities. To ensure scalability, parallel variants of all approaches are implemented on top of Apache Flink, a distributed processing engine. The proposed methods have been integrated into a new end-to-end ER tool named FAMER (FAst Multi-source Entity Resolution system). The FAMER framework comprises Linking and Clustering components encompassing both batch and incremental ER functionalities. The output of the Linking component is recorded as a similarity graph where each vertex represents an entity and each edge maintains the similarity relationship between two entities. This similarity graph is the input of the Clustering component. 
Overall, the comprehensive comparative evaluations show that the proposed clustering and repairing approaches for both batch and incremental ER achieve high quality while maintaining scalability.
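
To make the Linking-to-Clustering hand-off concrete, here is a minimal sketch that clusters a similarity graph by connected components. This is a generic baseline and not one of FAMER's multi-source clustering schemes; the function name, the threshold, the entity identifiers and the similarity values are all assumptions chosen for illustration.

```python
def connected_components(vertices, edges, threshold):
    """Cluster entities by connected components of a similarity graph.

    vertices:  iterable of entity identifiers
    edges:     iterable of (entity_a, entity_b, similarity) triples
    threshold: links with similarity below this value are ignored
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b, sim in edges:
        if sim >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for v in parent:
        clusters.setdefault(find(v), set()).add(v)
    # Sort clusters so the output order is deterministic.
    return sorted(clusters.values(), key=sorted)


# Hypothetical entities from three sources (a, b, c) and made-up similarities.
entities = ["a1", "b1", "c1", "a2", "b2"]
links = [
    ("a1", "b1", 0.90),
    ("b1", "c1", 0.80),
    ("a2", "b2", 0.95),
    ("c1", "a2", 0.30),  # weak link, dropped by the threshold
]
result = connected_components(entities, links, threshold=0.7)
```

Connected components accept every link above the threshold, so a single false link can merge two clusters. That weakness is one reason the abstract argues for clustering schemes that remove false links and exploit the fact that a clean source contributes at most one entity per cluster.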

    Schema-aware keyword search on linked data

    Keyword search is a popular technique for querying the ever-growing repositories of RDF graph data on the Web. This is because users do not need to master complex query languages (e.g., SQL, SPARQL) or know the underlying structure of the data on the Web to compose their queries. Keyword search is simple and flexible. However, it is at the same time ambiguous, since a keyword query can be interpreted in different ways. This feature of keyword search poses at least two challenges: (a) identifying relevant results among a multitude of candidate results, and (b) dealing with the performance scalability issue of the query evaluation algorithms. In the literature, multiple schema-unaware approaches are proposed to cope with the above challenges. Some of them identify as relevant only those candidate results which maintain the keyword instances in close proximity. Other approaches filter out irrelevant results using their structural characteristics, or rank the retrieved results and apply top-k processing based on statistical information about the data. In any case, these approaches cannot disambiguate the query to identify the intent of the user and they cannot scale satisfactorily when the size of the data and the number of query keywords grow. In recent years, different approaches have tried to exploit the schema (structural summary) of the RDF (Resource Description Framework) data graph to address the problems above. In this context, an original hierarchical clustering technique is introduced in this dissertation. This approach clusters the results based on a semantic interpretation of the keyword instances and takes advantage of relevance feedback from the user. The clustering hierarchy uses pattern graphs, which are structured queries, and clusters together result graphs with the same structure. Pattern graphs represent possible interpretations for the keyword query. 
By navigating through the hierarchy the user can select the pattern graph which is relevant to their intent. Nevertheless, structural summaries are approximate representations of the data and, therefore, might return empty answers or miss results which are relevant to the user intent. To address this issue, a novel approach is presented which combines the use of the structural summary and the user feedback with a relaxation technique for pattern graphs to extract additional results potentially of interest to the user. Query caching and multi-query optimization techniques are leveraged for the efficient evaluation of relaxed pattern graphs. Although the approaches which consider the structural summary of the data graph are promising, they require interaction with the user. It is claimed in this dissertation that without additional information from the user, it is not possible to produce results of high quality from keyword search on RDF data with the existing techniques. In this regard, an original keyword query language on RDF data is introduced which allows the user to convey their intention flexibly and effortlessly by specifying cohesive keyword groups. A cohesive group of keywords in a query indicates that its keywords should form a cohesive unit in the query results. It is experimentally demonstrated that cohesive keyword queries improve the result quality effectively and prune the search space of the pattern graphs efficiently compared to traditional keyword queries. Most importantly, these benefits are achieved while retaining the simplicity and the convenience of traditional keyword search. The last issue addressed in this dissertation is the diversification problem for keyword search on RDF data. The goal of diversification is to trade off relevance and diversity in the result set of a keyword query in order to minimize the dissatisfaction of the average user. 
Novel metrics are developed for assessing relevance and diversity, along with techniques for generating a relevant and diversified set of query interpretations for a keyword query on an RDF data graph. Experimental results show the effectiveness of the metrics and the efficiency of the approach.

    Approximate information filtering in structured peer-to-peer networks

    Today's content providers are naturally distributed and produce large amounts of information every day, making peer-to-peer data management a promising approach offering scalability, adaptivity to dynamics, and failure resilience. In such systems, subscribing with a continuous query is as important as one-time querying, since it allows the user to cope with the high rate of information production and avoid the cognitive overload of repeated searches. In the information filtering setting, users specify continuous queries, thus subscribing to newly appearing documents satisfying the query conditions. Contrary to existing approaches providing exact information filtering functionality, this doctoral thesis introduces the concept of approximate information filtering, where users subscribe to only a few selected sources most likely to satisfy their information demand. This way, efficiency and scalability are enhanced by trading a small reduction in recall for lower message traffic. This thesis contains the following contributions: (i) the first architecture to support approximate information filtering in structured peer-to-peer networks, (ii) novel strategies to select the most appropriate publishers by taking into account correlations among keywords, (iii) a prototype implementation for approximate information retrieval and filtering, and (iv) a digital library use case to demonstrate the integration of retrieval and filtering in a unified system.

    Geographic information extraction from texts

    A large volume of unstructured texts containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.

    Investigations into the Risk Minimization Technique Stealth Computing for Distributed Data-Processing Software Applications with User-Controllable Guaranteeable Properties

    The security and reliability of applications that process sensitive data can be significantly increased, and controlled by the user, by relocating them to the cloud in a protected manner using a combination of techniques: target-dependent data coding, continuous multiple service selection, service-dependent optimized data distribution, and coding-dependent algorithms. Combining these techniques into an application-integrated stealth protection layer is a necessary foundation for constructing secure applications with guaranteeable security properties within a correspondingly adapted software development process.
    Contents: 1 Problem Statement (1.1 Introduction; 1.2 Fundamental Considerations; 1.3 Problem Definition; 1.4 Classification and Delimitation). 2 Approach and Problem-Solving Methodology (2.1 Assumptions and Contributions; 2.2 Scientific Methods; 2.3 Structure of the Thesis). 3 Stealth Coding for Secured Data Use (3.1 Data Coding; 3.2 Data Distribution; 3.3 Semantic Linking of Distributed Coded Data; 3.4 Processing of Distributed Coded Data; 3.5 Summary of Contributions). 4 Stealth Concepts for Reliable Services and Applications (4.1 Overview of Platform Concepts and Services; 4.2 Network Multiplexer Interface; 4.3 File Storage Interface; 4.4 Database Interface; 4.5 Stream Storage Service Interface; 4.6 Event Processing Interface; 4.7 Service Integration; 4.8 Development of Applications; 4.9 Platform-Equivalent Cloud Integration of Secure Services and Applications; 4.10 Summary of Contributions). 5 Scenarios and Application Fields (5.1 Online File Storage with Search Function; 5.2 Personal Data Analysis; 5.3 Value-Added Services for the Internet of Things). 6 Validation (6.1 Infrastructure for Experiments; 6.2 Experimental Validation of Data Coding; 6.3 Experimental Validation of Data Distribution; 6.4 Experimental Validation of Data Processing; 6.5 Functionality and Properties of the Storage Service Connection; 6.6 Functionality and Properties of the Storage Service Integration; 6.7 Functionality and Properties of Data Management; 6.8 Functionality and Properties of Data Stream Processing; 6.9 Integrated Scenario: Online File Storage; 6.10 Integrated Scenario: Personal Data Analysis; 6.11 Integrated Scenario: Mobile Applications for the Internet of Things). 7 Summary (7.1 Summary of Contributions; 7.2 Critical Discussion and Evaluation; 7.3 Outlook). Indexes: List of Tables; List of Figures; Listings; Bibliography; Symbols and Notations; Software Contributions for Native Cloud Applications; Repositories with Experiment Data.

    Efficient resource management in a multi-tenant cloud environment
