6 research outputs found

    Enabling data-driven decision-making for a Finnish SME: a data lake solution

    In the era of big data, data-driven decision-making has become a key success factor for companies of all sizes. Technological development has made it possible to store, process and analyse vast amounts of data effectively. The availability of cloud computing services has lowered the costs of data analysis. Even small businesses have access to advanced technical solutions, such as data lakes and machine learning applications. Data-driven decision-making requires integrating relevant data from various sources. Data has to be extracted from distributed internal and external systems and stored in a centralised system that enables processing and analysing it for meaningful insights. Data can be structured, semi-structured or unstructured. Data lakes have emerged as a solution for storing vast amounts of data, including a growing amount of unstructured data, in a cost-effective manner. The rise of the SaaS model has led companies to abandon on-premises software. This blurs the line between internal and external data, as the company's own data is actually maintained by a third party. Most enterprise software targeted at small businesses is provided through the SaaS model. Small businesses thus face the challenge of adopting data-driven decision-making while having limited visibility into their own data. In this thesis, we study how small businesses can take advantage of data-driven decision-making by leveraging cloud computing services. We found that the reporting features of the SaaS-based business applications used by our case company, a sales-oriented SME, were insufficient for detailed analysis. Data-driven decision-making required aggregating data from multiple systems, causing excessive manual labour. A cloud-based data lake solution proved a cost-effective way to create a centralised repository with automated data integration. It enabled management to visualise customer and sales data and to assess the effectiveness of marketing efforts. Better data-analysis skills among the managers of the case company would have been instrumental in obtaining the full benefits of the solution.
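
    As a rough illustration of the kind of automated data integration the abstract describes, the sketch below pulls records from a hypothetical SaaS REST endpoint and lands them as raw JSON in an S3-style object store acting as the data lake's landing zone. The endpoint, bucket name and token are invented placeholders, not artifacts from the thesis.

        # Minimal extract-and-load sketch: pull records from a SaaS API and
        # land them as raw JSON in a cloud object store, the data lake's
        # landing zone. Endpoint, bucket and token are hypothetical.
        import json
        from datetime import date

        import boto3
        import requests

        API_URL = "https://api.example-saas.invalid/v1/invoices"  # placeholder
        BUCKET = "company-data-lake"                               # placeholder

        def extract_and_load() -> None:
            resp = requests.get(API_URL, headers={"Authorization": "Bearer <token>"})
            resp.raise_for_status()
            # Partition raw data by ingestion date so downstream jobs
            # can pick up new files incrementally.
            key = f"raw/invoices/dt={date.today().isoformat()}/invoices.json"
            boto3.client("s3").put_object(
                Bucket=BUCKET,
                Key=key,
                Body=json.dumps(resp.json()).encode("utf-8"),
            )

        if __name__ == "__main__":
            extract_and_load()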

    High-dimensional indexing methods utilizing clustering and dimensionality reduction

    The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points, so searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between their feature vectors, and similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases with the number of objects and the number of features. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing; they work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This inefficiency is addressed in this study by Clustering and Singular Value Decomposition (CSVD) with indexing, the Persistent Main Memory (PMM) index, and the Stepwise Dimensionality Increasing (SDI-tree) index. CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied, and the quality of its approximation to distances in the original space is investigated. For a given Normalized Mean Square Error (NMSE), the higher the degree of clustering, the higher the recall; however, more clusters require more disk page accesses. A number of clusters can be chosen that achieves high recall while maintaining a relatively low query processing cost. The Clustering and Indexing using Persistent Main Memory (CIPMM) framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) the disk transfer rate is improving by 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main-memory-resident than for disk-resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random, disk accesses. A specific instance of the CIPMM framework, CIPOP, which indexes using the Persistent Ordered Partition (OP) tree, is elaborated and compared with clustering and indexing using the SR-tree (CISR). The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains. The SDI-tree index is motivated by two observations: fanout decreases as dimensionality increases, and shorter vectors reduce cache misses. The index is built over feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and an increasing number of dimensions from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees and, in addition, less CPU time than VA-Files.
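
    The recall/cost trade-off attributed to CSVD above can be sketched in a few lines: partition the data with k-means, fit a low-dimensional subspace per cluster, and answer queries approximately by probing only the nearest clusters. This is a simplified reconstruction (using PCA as the SVD step), not the code from the dissertation; probing more clusters raises recall at the price of more candidate distance computations, mirroring the trade-off reported above.

        # Simplified sketch of the CSVD idea: partition the data with k-means,
        # reduce dimensionality within each cluster via PCA (SVD on centred
        # data), then answer approximate k-NN queries by probing only the
        # clusters whose centroids are closest to the query.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5_000, 64))     # 5,000 points in 64 dimensions

        n_clusters, n_components = 16, 8
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

        # One low-dimensional subspace per cluster.
        models = [PCA(n_components=n_components).fit(X[km.labels_ == c])
                  for c in range(n_clusters)]
        reduced = [models[c].transform(X[km.labels_ == c])
                   for c in range(n_clusters)]

        def knn(query, k=10, n_probe=2):
            """Approximate k-NN: search only the n_probe nearest clusters."""
            order = np.argsort(np.linalg.norm(km.cluster_centers_ - query, axis=1))
            candidates = []
            for c in order[:n_probe]:
                q = models[c].transform(query.reshape(1, -1))
                dists = np.linalg.norm(reduced[c] - q, axis=1)  # reduced-space distance
                for d, i in zip(dists, np.flatnonzero(km.labels_ == c)):
                    candidates.append((d, i))
            candidates.sort()
            return [i for _, i in candidates[:k]]

        print(knn(X[0]))   # approximate nearest neighbours of the first point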

    Implementation of a data warehouse - an example

    The topic of this thesis is data warehousing, exemplified by the implementation of a database in Oracle and a data warehouse on top of it. The goal was to compare the size and response times of the database versus the data warehouse: which takes up more space, and which is faster to query? The source files come from Nasjonalbiblioteket and contain information on a total of 51,517 articles published between 2000 and 2002. A data warehouse integrates data from several sources in a multidimensional model. Its main purpose is to serve as a decision-support tool, helping businesses make faster and better decisions. Theory holds that data warehouses take up more storage space, offer more functionality and deliver faster response times than transaction-oriented databases. In the implementation example in this thesis, building a data warehouse in addition to the database was not particularly worthwhile: the queries take only a few milliseconds in either case. The benefit of a data warehouse only becomes noticeable once the difference in response time between the database and the data warehouse reaches a few seconds or more. Moreover, the data warehouse is not significantly faster to query than the database, primarily because the database contains very few null values. In a case where the database is more complex and contains more null values, a data warehouse would pay off more, and the difference in response times would be considerably larger. The database takes up the most storage space: 72.5 MB, against 42.6 MB for the data warehouse. Indexes are the main reason both use so much space.
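
    For illustration, the sketch below builds a minimal star schema in SQLite, with synthetic data at the thesis's cardinality of 51,517 articles, and times the kind of aggregate query whose response time the thesis measures. The table and column names are invented for the example; as the thesis observes, such a query completes in milliseconds at this scale.

        # Illustrative star schema: one fact table referencing a date
        # dimension, filled with synthetic rows. Names are hypothetical.
        import sqlite3
        import time

        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE dim_year (year_id INTEGER PRIMARY KEY, year INTEGER);
            CREATE TABLE fact_article (
                article_id INTEGER PRIMARY KEY,
                year_id    INTEGER REFERENCES dim_year(year_id),
                page_count INTEGER
            );
            CREATE INDEX idx_fact_year ON fact_article(year_id);
        """)
        con.executemany("INSERT INTO dim_year VALUES (?, ?)",
                        [(1, 2000), (2, 2001), (3, 2002)])
        con.executemany("INSERT INTO fact_article VALUES (?, ?, ?)",
                        [(i, i % 3 + 1, i % 40) for i in range(51_517)])

        t0 = time.perf_counter()
        rows = con.execute("""
            SELECT d.year, COUNT(*) AS articles, AVG(f.page_count) AS avg_pages
            FROM fact_article f JOIN dim_year d ON f.year_id = d.year_id
            GROUP BY d.year
        """).fetchall()
        print(rows, f"({(time.perf_counter() - t0) * 1e3:.1f} ms)")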

    Design of an interactive, drillable legend for SOLAP

    To address the shortcomings of GIS as decision-support tools (multiple granularities, speed, user-friendliness, temporality), various flavours of SOLAP (Spatial OLAP) tools have emerged from research centres and software vendors (CRG/Kheops/Syntell, SFU/DBMiner, Proclarity, Cognos, Microsoft, Beyond 20/20, ESRI, MapInfo, etc.). Combining GIS functions with business intelligence (data warehouses, OLAP, data mining), SOLAP has been described as "software for fast and easy navigation within spatial databases, offering several levels of information granularity, several time periods, several themes and several modes of visualisation, synchronised or not: maps, tables and statistical charts" (Bédard 2004). SOLAP facilitates the deliberate exploration of spatial data, helping the user detect correlations, potential groupings and trends hidden in a mass of spatially referenced data. Everything is done by simple mouse selection and clicks (no SQL) and by simple operations such as drill-down, roll-up and drill-across, letting the user focus on the results of the operations rather than on the navigation process. As SOLAP tools gain functionality, it becomes important to propose improvements to their user interface that preserve their ease of use. The development of an interactive, drillable legend was the first solution of this kind, proposed by Bédard (Bédard 1997). We therefore pursued this avenue in the present research: we studied graphic semiology and its applicability to multidimensional analysis, reviewed work in related fields, explored different alternatives for solving the problem caused by the enrichment of navigation functions, built a prototype, collected feedback from SOLAP users and proposed a solution. Throughout this research we faced an absence of literature explicitly addressing the subject (SOLAP being too new), theoretical bodies of work that had to be adapted (semiology, human-computer interaction, scientific visualisation, dynamic cartography), and the need for mock-ups and prototypes to illustrate the envisioned solutions. In the end, this research proposes one solution among several; its main contribution, however, lies less in the proposed solution itself than in the reflections and considerations put forward throughout the thesis to reach it. These theoretical and practical reflections will help improve the user interface of any SOLAP tool through the new concept of an interactive, drillable legend.
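
    The roll-up and drill-down operations named above can be illustrated with a simple aggregation over a hypothetical spatial hierarchy (country > region > city). The column names and figures below are invented for the example; in a SOLAP tool, the interactive legend would re-symbolise the map each time the hierarchy level changes.

        # Sketch of roll-up / drill-down over a spatial dimension hierarchy.
        import pandas as pd

        sales = pd.DataFrame({
            "country": ["CA", "CA", "CA", "CA"],
            "region":  ["QC", "QC", "ON", "ON"],
            "city":    ["Quebec City", "Montreal", "Toronto", "Ottawa"],
            "amount":  [120, 340, 410, 150],
        })
        hierarchy = ["country", "region", "city"]

        def aggregate(level: int) -> pd.Series:
            """Aggregate the measure at one level of the spatial hierarchy."""
            return sales.groupby(hierarchy[: level + 1])["amount"].sum()

        print(aggregate(0))  # rolled all the way up: totals per country
        print(aggregate(2))  # drilled down: totals per city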

    Incremental and regularized linear discriminant analysis

    Ph.D. (Doctor of Philosophy)

    Index Structures for Data Warehouses

    No full text