
    Big Data Visualization Tools

    Data visualization is the presentation of data in a pictorial or graphical format, and a data visualization tool is the software that generates this presentation. Data visualization provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and support sense-making activities. Comment: This article appears in Encyclopedia of Big Data Technologies, Springer, 201

    State Management for Efficient Event Pattern Detection

    Event stream processing systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. The challenge is that query processing is stateful: the system maintains partial matches whose number grows exponentially in the number of processed events. State management is further complicated by the dynamicity of streams and by the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is infeasible, and systems must resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: fetching the remote data interrupts stream processing, yet without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches; the shedding elements are carefully selected to satisfy a latency bound while striving for minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by means of a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use, and by lazy evaluation that postpones event selection based on remote data to avoid interruptions. A cost model determines when to fetch which remote data items and how long to keep them in the cache. I evaluated these techniques with queries over synthetic and real-world data, showing that the load-shedding technique significantly improves the recall of pattern detection over baseline approaches, while the remote-data integration technique significantly reduces pattern detection latency.
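    The interplay of a latency bound with utility-driven discarding can be made concrete in a few lines. Below is a minimal, illustrative sketch of load shedding, not the dissertation's actual algorithm: all names (LoadShedder, capacity, the frequency-based utility function) are invented for the example, and the real system's utility model and latency estimation are considerably more elaborate.

```python
import heapq

class LoadShedder:
    """Illustrative load-shedding sketch: when the buffered work exceeds
    what can be processed within the latency bound, drop the elements
    with the lowest estimated utility, i.e. those least likely to
    contribute to a full pattern match."""

    def __init__(self, capacity, utility):
        self.capacity = capacity   # max elements processable within the bound
        self.utility = utility     # callable: element -> estimated usefulness
        self.buffer = []           # min-heap of (utility, seq, element)
        self.seq = 0

    def offer(self, element):
        self.seq += 1
        heapq.heappush(self.buffer, (self.utility(element), self.seq, element))
        if len(self.buffer) > self.capacity:
            heapq.heappop(self.buffer)   # shed the least useful element

    def drain(self):
        # Replay survivors in arrival order for pattern evaluation.
        return [e for _, _, e in sorted(self.buffer, key=lambda x: x[1])]

# Usage: keep at most 1000 elements; the toy utility favours rare event types.
shedder = LoadShedder(1000, lambda ev: 1.0 / (1 + ev.get("freq", 0)))
```

    Both input events and partial matches can be pushed through the same shedder, mirroring the dissertation's point that discarding either kind of state trades result quality against the latency bound.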

    Computer-language based data prefetching techniques

    Data prefetching has long been used as a technique to improve access times to persistent data. It is based on retrieving data records from persistent storage into main memory before the records are needed. Data prefetching has been applied to a wide variety of persistent storage systems, from file systems to relational database management systems and NoSQL databases, with the aim of reducing access times to the data maintained by the system and thus improving the execution times of the applications using this data. However, most existing solutions to data prefetching are based on information that can be retrieved from the storage system itself, whether in the form of heuristics based on the data schema or of data access patterns detected by monitoring access to the system. These approaches have multiple disadvantages in terms of the rigidity of the heuristics they use, the accuracy of the predictions they make, and/or the time they need to make these predictions, a process often performed while the applications are accessing the data, causing considerable overhead. In light of the above, this thesis proposes two novel approaches to data prefetching based on predictions made by analyzing the instructions and statements of the computer languages used to access persistent data. The proposed approaches take into consideration how the data is accessed by the higher-level applications, make accurate predictions, and run without causing any additional overhead. The first approach analyzes instructions of applications written in object-oriented languages in order to prefetch data from Persistent Object Stores. It is based on static code analysis performed prior to application execution, and hence adds no overhead; it also includes various strategies to deal with cases that require runtime information unavailable before the application runs. We integrate this analysis approach into an existing Persistent Object Store and run a series of extensive experiments to measure the improvement obtained by prefetching the objects predicted by the approach. The second approach analyzes statements and historic logs of the declarative query language SPARQL in order to prefetch data from RDF triplestores. It measures two types of similarity between SPARQL queries to detect recurring query patterns in the historic logs, then uses the detected patterns to predict subsequent queries and launches them before they are requested, prefetching the data they need. Our evaluation shows that the approach makes high-accuracy predictions and can achieve a high cache hit rate when caching the results of the predicted queries.
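    To make the second approach concrete, here is a hedged sketch of log-based query prediction. It is a simplification, not the thesis's method: queries are reduced to sets of triple-pattern strings, similarity is plain Jaccard overlap (standing in for the two similarity measures the thesis defines), the successor model is a first-order lookup, and all function names are invented for the example.

```python
from collections import defaultdict

def jaccard(a, b):
    # Structural similarity between two queries, each represented as a
    # set of triple-pattern strings.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def build_successor_model(log):
    # From a historic query log (a list of triple-pattern sets), record
    # which query tends to follow which.
    successors = defaultdict(list)
    for prev, nxt in zip(log, log[1:]):
        successors[frozenset(prev)].append(nxt)
    return successors

def predict_next(current, successors, threshold=0.8):
    # Find a logged query similar enough to the current one and return
    # the query that historically followed it, so its results can be
    # prefetched before the client actually submits it.
    best, best_sim = None, threshold
    for prev, nxts in successors.items():
        sim = jaccard(current, prev)
        if sim >= best_sim:
            best, best_sim = nxts[-1], sim
    return best   # None if no recurring pattern matches

# Usage: a toy log with a recurring two-step pattern.
log = [{"?s :type :Sensor"}, {"?s :type :Sensor", "?s :reading ?v"}] * 3
model = build_successor_model(log)
print(predict_next({"?s :type :Sensor"}, model))
```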

    SEER-MCache: A Prefetchable Memory Object Caching System for IoT Real-Time Data Processing

    Memory object caching systems, such as Memcached and Redis, have proven to be simple and highly efficient middleware for improving the performance of Internet of Things (IoT) devices querying a database in the cloud. However, their performance guarantee rests on the assumption that the target data queried by an IoT device will be accessed many times and hit in the caching system. When the database system handles unrepeated IoT queries, it therefore usually delivers suboptimal performance, which greatly impairs the efficiency of real-time data processing on IoT devices. To address this issue, we propose Seer-MCache, a memory object caching system with a smart prefetching (read-ahead) function that fills the caching system with the desired data before intensive IoT queries arrive. Seer-MCache includes a set of rules to launch specific read-ahead behaviors; these rules can be customized according to workload characteristics and system load. We implement a prototype system in Redis (caching layer) and MySQL Server (database system). Extensive experiments verify the effectiveness of Seer-MCache: the results show that it can improve the performance of read-intensive workloads by up to 61% (39.5% on average), while the cost of the read-ahead behavior remains moderate and controllable.
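    The core read-ahead loop can be sketched as follows. This is an illustrative sequential-access rule in the spirit of Seer-MCache, not its implementation: it assumes a local Redis reachable through the redis-py client, stubs out the MySQL lookup, and invents the names query_mysql and READ_AHEAD.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def query_mysql(record_id):
    # Stand-in for the backing MySQL query (SELECT ... WHERE id = %s).
    return f"row-{record_id}"

READ_AHEAD = 8  # rule parameter: how many successor records to prefetch

def get(record_id):
    value = r.get(record_id)
    if value is not None:
        return value                   # cache hit
    value = query_mysql(record_id)     # cache miss: go to the database
    r.set(record_id, value)
    # Sequential-access rule: IoT telemetry is often read in record order,
    # so warm the cache with the next few records before they are requested.
    for nxt in range(record_id + 1, record_id + 1 + READ_AHEAD):
        if r.get(nxt) is None:
            r.set(nxt, query_mysql(nxt))
    return value

print(get(42))  # misses, fetches row 42, and prefetches rows 43-50
```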

    Run-Time Optimization of Relational Query Streams

    In this thesis, we tackle the latency experienced by end users, which, in contrast to throughput, remains a significant issue that no foreseeable improvement in networking hardware seems likely to alleviate. Multi-user applications submit streams of relational queries against a back-end database server. For many years the research community has focused on developing sophisticated storage engines, more efficient query processors, scalable clustering systems, and in-memory key-value caching servers that alleviate throughput bottlenecks and allow for more concurrency. Query streams initiated by individual users have received some attention in the form of result-set caching; however, unless the same query is resubmitted, this remedy has not proven very effective at minimizing the latency perceived by the end user, and improvements in network latency have lagged developments in bandwidth. We studied a specific pattern of query streams that appears in most applications and is characterized by query correlations and deep nesting. We verified that these two factors result in excessive numbers of round trips, which could be avoided either by manually rewriting the queries or by some form of runtime rewriting on the fly. Although manual rewriting could yield much more efficient queries, it is not recommended: it is not always clear for which system configuration or application instance one should optimize, and good software engineering practices promote code modularity and encapsulation built from simpler queries.
    We have developed a prototype Java library that acts as a wrapper around the vendor's JDBC driver and allows run-time optimization of such query streams. The library operates in two modes. In a "training" mode, covering roughly the first 10% of the application's lifetime, it observes the initial queries and records them, together with additional metadata, in an n-ary tree we call the "context tree"; at the end of this phase it decides which of the original queries can and should (judged by latency reduction) be combined. In the subsequent "normal" mode, typically the remaining 90% of the time, the library submits the combined queries to the back-end RDBMS in place of the original simple ones, performing the rewriting on the fly while the system is live. Although a combined query is more complex, it reveals to the RDBMS query processor more scope for optimization that would otherwise have remained hidden within the client application code. In addition, we developed an analytic cost model that allows us to compare the alternative rewritings while taking into account critical system properties such as network communication latency; costing the alternatives and selecting the best rewriting scheme take place between the training and normal phases. Finally, we performed comprehensive benchmarking to measure the improvement in total latency seen by the end user: all the alternative rewriting strategies outperform the original queries by 2 to 4 times. The rewritings are not universally beneficial, however; on networks where the average latency is very low, the combined queries can be slower than the originals.
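    The rewrite and its cost trade-off can be illustrated concretely. The sketch below is not the library's output: the schema (orders, order_items), the combined SQL, and the toy cost function are all invented for the example, but they show why folding a 1 + N round-trip pattern into a single join wins on high-latency links and can lose on very fast ones.

```python
# A correlated query pair that costs 1 + N round trips ...
ORIGINAL_OUTER = "SELECT id FROM orders WHERE customer_id = ?"
ORIGINAL_INNER = "SELECT * FROM order_items WHERE order_id = ?"  # run once per order

# ... replaced by one combined query; the RDBMS optimizer now sees the join.
COMBINED = """
SELECT o.id, i.*
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.customer_id = ?
ORDER BY o.id
"""

def estimated_latency(round_trips, network_rtt_ms, server_cost_ms):
    # Toy cost model: total latency is round trips times (RTT + server work).
    return round_trips * (network_rtt_ms + server_cost_ms)

n_orders = 50
# High-latency network: the combined query avoids 50 round trips and wins.
print(estimated_latency(1 + n_orders, network_rtt_ms=20, server_cost_ms=2))   # 1122.0
print(estimated_latency(1, network_rtt_ms=20, server_cost_ms=10))             # 30.0
# Near-zero RTT: the heavier combined query can lose, matching the experiments.
print(estimated_latency(1 + n_orders, network_rtt_ms=0.05, server_cost_ms=0.1))  # ~7.65
print(estimated_latency(1, network_rtt_ms=0.05, server_cost_ms=10))             # ~10.05
```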

    Active caching for recommender systems

    Web users are often overwhelmed by the amount of information available while carrying out browsing and searching tasks. Recommender systems substantially reduce this information overload by suggesting a list of similar documents that users might find interesting. However, generating these ranked lists requires an enormous amount of resources, which often results in access latency. Caching frequently accessed data has long been a useful technique for reducing stress on limited resources and improving response time. Traditional passive caching techniques, where the focus is on answering queries based on temporal locality or popularity, achieve a very limited performance gain. In this dissertation, we propose an 'active caching' technique for recommender systems as an extension of the caching model. In this approach, estimation is used to generate an answer for queries whose results are not explicitly cached, where the estimation makes use of the partial-order lists cached for related queries. By answering non-cached queries along with cached queries, the active caching system acts as a form of query processor and offers a substantial improvement over traditional caching methodologies. Test results for several data sets and recommendation techniques show substantial improvements in cache hit rate, byte hit rate, and CPU costs, while achieving reasonable recall rates. To improve the performance of the proposed active caching solution, a shared-neighbor similarity measure is introduced, which improves recall rates by eliminating the dependence on monotonicity in the partial-order lists. Finally, a greedy balancing cache-selection policy is proposed to select the most appropriate data objects for the cache, further improving the cache hit rate and recall.
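    The estimation step can be sketched as similarity-weighted rank aggregation over cached lists. This is an illustrative stand-in, not the dissertation's estimator: the scoring rule, the token-overlap similarity, and all names are invented for the example, and the shared-neighbor measure mentioned above is not modeled here.

```python
from collections import defaultdict

def estimate_ranking(query, cache, similarity, k=10):
    """Active-caching sketch: for a query whose result list is not cached,
    synthesize an answer by rank-aggregating the cached partial-order
    lists of similar queries, weighted by query similarity."""
    scores = defaultdict(float)
    for cached_query, ranked_docs in cache.items():
        w = similarity(query, cached_query)
        if w <= 0:
            continue
        for rank, doc in enumerate(ranked_docs):
            scores[doc] += w / (rank + 1)   # higher-ranked docs contribute more
    return [doc for doc, _ in sorted(scores.items(),
                                     key=lambda kv: -kv[1])][:k]

# Usage with a simple token-overlap similarity over query strings.
def sim(q1, q2):
    t1, t2 = set(q1.split()), set(q2.split())
    return len(t1 & t2) / len(t1 | t2)

cache = {"jazz piano albums": ["d1", "d2", "d3"],
         "jazz guitar albums": ["d2", "d4", "d1"]}
print(estimate_ranking("jazz albums", cache, sim, k=3))  # ['d2', 'd1', 'd4']
```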