
    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that, thanks to its linear-time complexity, does not affect indexing time, and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison against several other state-of-the-art encoders.
    Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
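    As background for the encoding itself, the sketch below shows plain (unpartitioned) Variable-Byte coding in Python: each byte carries 7 payload bits plus a flag bit. The stop-bit convention and the gap-encoding of the document list are common choices, not necessarily the exact variant evaluated in the paper.

        def vbyte_encode(values):
            # Encode non-negative integers with Variable-Byte: 7 payload bits per
            # byte; the high bit marks the last byte of a value (stop-bit convention).
            out = bytearray()
            for v in values:
                while v >= 128:
                    out.append(v & 0x7F)   # lower 7 bits, more bytes follow
                    v >>= 7
                out.append(0x80 | v)       # final byte carries the stop bit
            return bytes(out)

        def vbyte_decode(data):
            # Decode a stream produced by vbyte_encode.
            values, cur, shift = [], 0, 0
            for b in data:
                cur |= (b & 0x7F) << shift
                if b & 0x80:               # stop bit: the value is complete
                    values.append(cur)
                    cur, shift = 0, 0
                else:
                    shift += 7
            return values

        # Inverted lists are usually gap-encoded first, so most gaps are small.
        docs = [3, 7, 18, 140, 141, 4000]
        gaps = [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]
        assert vbyte_decode(vbyte_encode(gaps)) == gaps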

    Select-based random access to variable-byte encodings

    Enormous datasets are a common occurrence today, and compressing them is often beneficial. Fast direct access to any element of the compressed data is a requirement in the field of compressed data structures, and it is not easily supported by traditional compression methods. Variable-byte encoding is a method for compressing integers of different byte lengths. It removes unused leading bytes and adds a continuation bit to each byte to denote whether the compressed integer continues into the next byte or not. An existing solution using a rank data structure performs well at this task. This thesis introduces an alternative solution using a select data structure and compares the two implementations. Experiments are also performed on retrieving a subarray from the compressed data structure. The rank implementation performs better on data containing mostly small integers, while the select implementation performs better on larger integers. The select implementation has significant advantages for subarray fetching due to how the data is compressed.
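    The following sketch illustrates the access pattern only: the final byte of each encoded value (stop bit set) is what a select(i) query would locate in constant time from a succinct bit vector. Here a plain Python list stands in for the select structure, and the stop-bit convention matches the sketch above rather than necessarily the thesis implementation.

        def build_select_index(data):
            # Byte offsets whose stop bit (0x80) is set; select(i) over this bit
            # vector is what a succinct select structure would answer in O(1).
            return [pos for pos, b in enumerate(data) if b & 0x80]

        def access(data, stops, i):
            # Decode only the i-th integer: jump past the (i-1)-th stop byte and
            # read up to (and including) the i-th stop byte.
            start = stops[i - 1] + 1 if i > 0 else 0
            cur, shift = 0, 0
            for b in data[start:stops[i] + 1]:
                cur |= (b & 0x7F) << shift
                shift += 7
            return cur

        # Hand-encoded stream for the values 1, 300, 5 under the stop-bit convention.
        data = bytes([0x81, 0x2C, 0x82, 0x85])
        stops = build_select_index(data)
        assert [access(data, stops, i) for i in range(3)] == [1, 300, 5]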

    Reordering Rows for Better Compression: Beyond the Lexicographic Order

    Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographic order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row ordering are derived from traveling-salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant of Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding by up to a factor of 3, whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. In a few cases, we prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns.
    Comment: to appear in ACM TOD
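    As a concrete illustration of the quantity being minimized, the sketch below counts runs of identical values per column under a given row order. The Multiple Lists and Vortex heuristics themselves are not reproduced; lexicographic sorting stands in as the baseline they improve upon.

        def count_runs(table):
            # Total number of runs of identical values across all columns.
            total = 0
            for col in zip(*table):
                total += 1 + sum(1 for a, b in zip(col, col[1:]) if a != b)
            return total

        rows = [("ca", 1), ("ny", 2), ("ca", 1), ("ny", 1)]
        print(count_runs(rows))            # original row order: 7 runs
        print(count_runs(sorted(rows)))    # lexicographic order: 4 runs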

    Compression aware physical database design


    Electricity use profiling and forecasting at microgrid level

    The aim of this thesis is to create a flexible and easily customizable tool, applicable to microgrids, for electricity use profiling and load forecasting. This modular tool is called Divinus, and its architecture consists of several interconnected, well-defined components, each interacting directly with the others. The first three structural pillars of the platform are the database where all information is stored, the Django framework that hosts the source code, and the website where all results are displayed. The next set of components is functional rather than structural: the collection of the data to be saved in the database, the use profiling performed on the collected data, and the load forecasting that draws on the profiling results. Using Self-Organizing Maps, which are competitive networks that provide a topological mapping of the input data, we perform electricity use profiling on data collected from 2010 to 2017 at the Psachna (Euboea) campus of the Technological Educational Institute of Sterea Ellada. Once this mapping is complete and the data are placed into clusters based on their characteristics, the forecasting process can begin. Forecasting is performed with machine learning, specifically the k-nearest-neighbours algorithm. In the tests carried out so far, Divinus shows high accuracy and small errors: for forecasts of the next five days, the next month and the next year, the average error does not exceed 5%, 12% and 16% respectively. At its current stage, Divinus is therefore a promising tool that is likely to be useful for both short-term and medium-term forecasts.
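    As a rough illustration of the forecasting step only, the sketch below implements a plain k-nearest-neighbours regression over daily load profiles. The SOM-based clustering, the Django components and the actual Divinus data pipeline are not reproduced; the function names, the feature choice (previous day's profile) and the toy data are illustrative assumptions.

        import math

        def knn_forecast(history, query_profile, k=3):
            # history: list of (past_profile, next_day_profile) pairs.
            # Predict the next day as the hour-wise average over the k records
            # whose past profile is closest (Euclidean distance) to query_profile.
            def dist(p, q):
                return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            nearest = sorted(history, key=lambda rec: dist(rec[0], query_profile))[:k]
            horizon = len(nearest[0][1])
            return [sum(rec[1][h] for rec in nearest) / k for h in range(horizon)]

        # Example: three historical days with 4-hour profiles, forecast for a new day.
        history = [([1, 2, 3, 2], [1, 2, 4, 2]),
                   ([5, 6, 7, 6], [5, 7, 8, 6]),
                   ([1, 2, 4, 2], [1, 3, 4, 2])]
        print(knn_forecast(history, [1, 2, 3, 3], k=2))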

    Reordering Columns for Smaller Indexes

    Column-oriented indexes, such as projection or bitmap indexes, are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality. Experimentally, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs.
    Comment: to appear in Information Science
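    The sketch below illustrates the cardinality heuristic on a toy table: permute the columns so that low-cardinality columns come first, sort the rows lexicographically, and compare run counts. It is an illustration of the idea, not the paper's experimental setup.

        def runs_per_column(table):
            # Number of runs of identical values in each column.
            return [1 + sum(1 for a, b in zip(col, col[1:]) if a != b)
                    for col in zip(*table)]

        def reorder_by_cardinality(table):
            # Permute columns so that low-cardinality columns come first.
            order = sorted(range(len(table[0])),
                           key=lambda j: len({row[j] for row in table}))
            return [tuple(row[j] for j in order) for row in table]

        rows = [(i % 7, i % 2) for i in range(20)]
        print(sum(runs_per_column(sorted(rows))))                          # original column order: 21 runs
        print(sum(runs_per_column(sorted(reorder_by_cardinality(rows)))))  # low-cardinality first: 16 runs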

    Letter from the Special Issue Editor

    Editorial work for DEBULL on a special issue covering data management on Storage Class Memory (SCM) technologies.