
    A unified view of data-intensive flows in business intelligence systems : a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data from a multitude of data sources in user-preferred, analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time, operational data flows that integrate source data at runtime. Both academia and industry therefore need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of current research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, identifying the challenges that remain to be addressed and showing how current solutions can be applied to address them.
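The batched ETL step the abstract contrasts with runtime integration can be illustrated with a minimal sketch; all names, fields, and data below are illustrative, not taken from the paper.

```python
# Minimal sketch of a batched ETL flow: extract rows from a source,
# transform them into an analysis-ready shape, and load them into an
# (in-memory) warehouse table standing in for a DW.

def extract(source):
    """Pull raw records from a data source (a list stands in for a DB)."""
    return list(source)

def transform(rows):
    """Normalize field names and derive analysis-ready values."""
    out = []
    for r in rows:
        out.append({
            "customer": r["name"].strip().title(),
            "revenue": float(r["amount"]),
        })
    return out

def load(warehouse, rows):
    """Append transformed rows to the warehouse fact table."""
    warehouse.extend(rows)
    return warehouse

source = [{"name": " alice ", "amount": "10.5"}, {"name": "BOB", "amount": "3"}]
dw = load([], transform(extract(source)))  # → cleaned, typed fact rows
```

A real-time operational flow would run the same transform per incoming record instead of per batch, which is exactly the design tension the survey examines.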

    Optimization of Columnar NoSQL Data Warehouse Model with Clarans Clustering Algorithm

    To meet the needs of business leaders, decision-makers have resorted to integrating external sources (such as Linked Open Data) into the decision-making system, enriching their existing data warehouses with new concepts that bring added value to their organizations, enhance productivity, and retain customers. However, the traditional data warehouse environment is not suitable for supporting external Big Data. To address this challenge, several research efforts have turned to the direct conversion of a classical relational data warehouse into a columnar NoSQL data warehouse, whereas the existing advanced works based on clustering algorithms remain very limited and have several shortcomings. In this context, our paper proposes a new solution that builds an optimized columnar data warehouse based on the CLARANS clustering algorithm, which has proven effective at generating optimal column families. Experimental results confirm the validity of our system through a detailed comparative study between the existing advanced approaches and our proposed optimized method.
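CLARANS searches for good medoids by repeatedly trying random medoid swaps and keeping improvements. A toy one-dimensional version conveys the mechanics; the paper applies the idea to grouping warehouse attributes into column families, which this sketch does not attempt.

```python
import random

def cost(medoids, points):
    """Total distance of each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clarans(points, k, numlocal=5, maxneighbor=20, seed=0):
    """CLARANS-style randomized medoid search (1-D toy version).

    numlocal restarts a fresh local search; each search tries random
    single-medoid swaps, accepting any that lower the cost, and stops
    after maxneighbor consecutive failed attempts.
    """
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(points, k)
        current_cost = cost(current, points)
        fails = 0
        while fails < maxneighbor:
            i = rng.randrange(k)
            candidate = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = cost(neighbor, points)
            if neighbor_cost < current_cost:
                current, current_cost, fails = neighbor, neighbor_cost, 0
            else:
                fails += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return sorted(best)

# Two obvious clusters around 0 and 100; one medoid should land in each.
data = [0, 1, 2, 3, 100, 101, 102, 103]
medoids = clarans(data, k=2)
```

The randomized neighbor search is what lets CLARANS scale past exhaustive medoid methods such as PAM, at the price of only probabilistic optimality.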

    A Biased Topic Modeling Approach for Case Control Study from Health Related Social Media Postings

    Online social networks are the hubs of social activity in cyberspace, and using them to exchange knowledge, experiences, and opinions is common. In this work, an advanced topic modeling framework is designed to analyse complex longitudinal health information from social media with minimal human annotation, and Adverse Drug Events and Reactions (ADR) information is extracted and automatically processed using a biased topic modeling method. This framework improves and extends existing topic modelling algorithms that incorporate background knowledge. Using this approach, background knowledge such as ADR terms and other biomedical knowledge can be incorporated during the text mining process, generating scores that indicate the presence of ADR. A case control study has been performed on a data set of Twitter timelines of women who announced their pregnancy; the goal of the study is to compare the ADR risk of medication usage from each medication category during pregnancy. In addition, to evaluate the predictive power of this approach, another important aspect of personalized medicine was addressed: the prediction of medication usage through the identification of risk groups. During the prediction process, health information from the Twitter timeline, such as diseases, symptoms, treatments, and effects, is summarized by the topic modelling processes, and the summarization results are used for prediction. Dimension reduction and topic similarity measurement are integrated into this framework for timeline classification and prediction. This work could be applied to provide guidelines for FDA drug risk categories, a process currently based on laboratory results and reported cases. Finally, a multi-dimensional text data warehouse (MTD) is proposed to manage the output of the topic modelling, and some attempts have been made to incorporate topic structure (ontology) into the MTD hierarchy. Results demonstrate that the proposed methods show promise and that this system represents a low-cost approach to drug safety early warning.
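The core idea of biasing text analysis with background knowledge can be shown with a much simpler stand-in than the paper's topic model: tokens that appear in an ADR lexicon contribute extra weight, so posts using adverse-event vocabulary score higher. The lexicon, weights, and posts below are illustrative assumptions, not the dissertation's actual model or data.

```python
# Toy "biased" scorer: an ADR term lexicon (background knowledge) skews
# the score of each post toward adverse-drug-event content.

ADR_LEXICON = {"nausea": 2.0, "headache": 1.5, "rash": 2.0, "dizziness": 1.5}

def adr_score(post):
    """Length-normalized sum of lexicon weights over the post's tokens."""
    tokens = post.lower().replace(".", "").split()
    return sum(ADR_LEXICON.get(t, 0.0) for t in tokens) / max(len(tokens), 1)

posts = [
    "felt severe nausea and headache after the new dose",
    "great weather for a walk today",
]
scores = [adr_score(p) for p in posts]  # first post scores higher
```

In the full framework the same biasing happens inside topic inference rather than as a surface lexicon match, which is what lets it generalize beyond exact term hits.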

    Implementing data-driven decision support system based on independent educational data mart

    Decision makers in the educational field always seek new technologies and tools that provide solid, fast answers to support the decision-making process. They need a platform that turns students’ academic data into knowledge for making the right strategic decisions. In this paper, a roadmap for implementing a data-driven decision support system (DSS) based on an educational data mart is presented. The independent data mart is built on students’ grades in 8 subjects at a private school (Al-Iskandaria Primary School in Basrah province, Iraq). The DSS implementation roadmap starts from pre-processing the paper-based data source and ends with providing three categories of online analytical processing (OLAP) queries: multidimensional OLAP, desktop OLAP, and web OLAP. A key performance indicator (KPI) is implemented as an essential part of the educational DSS to measure school performance. The static evaluation method shows that the proposed DSS satisfies the privacy, security, and performance requirements, with no errors found after inspecting the DSS knowledge base. The evaluation shows that a data-driven DSS based on an independent data mart with KPIs and OLAP is one of the best platforms to support short- to long-term academic decisions.
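The kind of OLAP query and KPI such a data mart serves can be sketched over a tiny grades fact table; the schema, school data, and pass threshold here are invented for illustration only.

```python
# Toy OLAP-style rollup and KPI over a grades fact table.

facts = [
    {"student": "S1", "subject": "Math", "grade": 80},
    {"student": "S2", "subject": "Math", "grade": 40},
    {"student": "S1", "subject": "Science", "grade": 90},
    {"student": "S2", "subject": "Science", "grade": 70},
]

def rollup_avg(facts, dim):
    """Aggregate the average grade along one dimension (e.g. subject)."""
    groups = {}
    for row in facts:
        groups.setdefault(row[dim], []).append(row["grade"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

def pass_rate_kpi(facts, threshold=50):
    """KPI: fraction of recorded grades at or above the pass threshold."""
    passed = sum(1 for row in facts if row["grade"] >= threshold)
    return passed / len(facts)

by_subject = rollup_avg(facts, "subject")  # {'Math': 60.0, 'Science': 80.0}
kpi = pass_rate_kpi(facts)                 # 0.75
```

A real data mart would push these aggregations into cube queries (MOLAP/DOLAP/WOLAP), but the rollup-then-KPI shape is the same.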

    Word Sense Disambiguation for Ontology Learning

    Ontology learning aims to automatically extract ontological concepts and relationships from related text repositories and is expected to be more efficient and scalable than manual ontology development. One of the challenging issues in ontology learning is word sense disambiguation (WSD). Most WSD research employs resources such as WordNet, text corpora, or a hybrid approach. Motivated by the large volume and richness of user-generated content in social media, this research explores the role of social media in ontology learning. Specifically, our approach exploits social media as a dynamic, context-rich data source for WSD. This paper presents the method and preliminary evidence of its efficacy. The research is progressing toward a formal evaluation of the social-media-based method for WSD, with plans to incorporate the WSD routine into an ontology learning system in the future.
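Using surrounding text as disambiguation context can be illustrated with a classic overlap-based (Lesk-style) heuristic: pick the sense whose gloss shares the most words with the social-media context. The senses, glosses, and tweet below are illustrative, and this is not the paper's specific method.

```python
# Lesk-style WSD: the sense whose gloss overlaps most with the context wins.

SENSES = {
    "bank_finance": "institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water such as a river",
}

def disambiguate(context, senses):
    """Return the sense id with the largest word overlap with the context."""
    ctx = set(context.lower().split())

    def overlap(gloss):
        return len(ctx & set(gloss.split()))

    return max(senses, key=lambda s: overlap(senses[s]))

tweet = "fishing by the river all day water was calm near the bank"
sense = disambiguate(tweet, SENSES)  # 'bank_river'
```

The appeal of social media here is that it supplies fresh, domain-specific context of this kind at scale, where static resources like WordNet glosses lag behind usage.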

    Ontology based data warehousing for mining of heterogeneous and multidimensional data sources

    Heterogeneous and multidimensional big-data sources are prevalent in virtually all business environments, yet system and data analysts struggle to access them quickly. A robust and versatile data warehousing system is developed that integrates domain ontologies from multidimensional data sources. For example, petroleum digital ecosystems and digital oilfield solutions, derived from big-data petroleum (information) systems, are in increasing demand in multibillion-dollar resource businesses worldwide. This work is recognized by the Industrial Electronics Society of IEEE and has appeared in more than 50 international conference proceedings and journals.

    Multi-Faceted Search and Navigation of Biological Databases


    Towards development of fuzzy spatial datacubes : fundamental concepts with example for multidimensional coastal erosion risk assessment and representation

    Current Geospatial Business Intelligence (GeoBI) systems typically do not take into account the uncertainty related to the vagueness and fuzziness of objects; they assume that objects have well-defined and exact semantics, geometry, and temporality. Representation of fuzzy zones by polygons with well-defined boundaries is an example of such approximation. This thesis uses an application in Coastal Erosion Risk Analysis (CERA) to illustrate the problems. CERA polygons are created by aggregating a set of spatial units defined by either the stakeholders’ interests or national census divisions. Despite the spatiotemporal variation of the multiple criteria involved in estimating the extent of coastal erosion risk, each polygon typically has a unique risk value attributed homogeneously across its spatial extent. In reality, the risk value changes gradually within polygons and when going from one polygon to another, so the transition between zones is not properly represented by crisp object models. The main objective of this thesis is to develop a new approach combining the GeoBI paradigm and fuzzy concepts to account for spatial uncertainty in the representation of risk zones; ultimately, we assume this should improve coastal erosion risk assessment. To do so, a comprehensive GeoBI-based conceptual framework is developed with an application to Coastal Erosion Risk Assessment (CERA). Then, a fuzzy-based risk representation approach is developed to handle the inherent spatial uncertainty related to the vagueness and fuzziness of objects. Fuzzy membership functions are defined from an expert-based vulnerability index, which is an important component of risk. Instead of determining well-defined boundaries between risk zones, the proposed approach permits a smooth transition from one zone to another. The membership values of multiple indicators (e.g. slope and elevation of the region under study, infrastructure, houses, the hydrology network, and so on) are then aggregated based on the risk formula and fuzzy IF-THEN rules to represent risk zones. The key elements of a fuzzy spatial datacube are also formally defined by combining fuzzy set theory and the GeoBI paradigm, and some fuzzy spatial aggregation operators are formally defined as well. The main contribution of this study is the combination of fuzzy set theory and GeoBI, which makes spatial knowledge discovery more compatible with human reasoning and perception. Hence, an analytical conceptual framework was proposed, based on the GeoBI paradigm, to develop a fuzzy spatial datacube within a Spatial Online Analytical Processing (SOLAP) system to assess coastal erosion risk. This necessitates developing a framework to design a conceptual model based on risk parameters, implementing fuzzy spatial objects in a spatial multidimensional database, and aggregating fuzzy spatial objects to handle the multi-scale representation of risk zones. To validate the proposed approach, it is applied to the Perce region (Eastern Quebec, Canada) as a case study.
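The smooth-transition idea can be sketched with one piecewise-linear membership function and one fuzzy IF-THEN rule; the breakpoints, indicators, and rule below are illustrative assumptions, not the thesis's actual vulnerability index or rule base.

```python
# Sketch of fuzzy risk zoning: a membership function grades how "high"
# an indicator is on [0, 1], and a Mamdani-style rule combines indicators
# with min (fuzzy AND) instead of a crisp boundary test.

def mu_high(x, lo=0.3, hi=0.7):
    """Piecewise-linear membership in the 'high' fuzzy set on [0, 1]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def risk_degree(vulnerability, hazard):
    """Rule: IF vulnerability is high AND hazard is high THEN risk is high.

    min implements the fuzzy AND; with several rules, their outputs
    would be combined with max (fuzzy OR).
    """
    return min(mu_high(vulnerability), mu_high(hazard))

# A cell with both indicators near 1 is firmly high-risk; membership
# decreases smoothly rather than dropping at a crisp polygon boundary.
r = risk_degree(0.9, 0.8)  # 1.0
```

Mapping such degrees per spatial cell is what replaces the single homogeneous risk value per polygon that the thesis criticizes.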