114 research outputs found

    Visualization of Heterogeneous Data


    MLED_BI: A Novel Business Intelligence Design Approach to Support Multilingualism

    With emerging markets and expanding international cooperation, there is a requirement to support Business Intelligence (BI) applications in multiple languages, a capability referred to in this research as Multilingualism (ML). ML in BI is understood here as the ability to store descriptive content (such as descriptions of attributes in BI reports) in more than one language at the Data Warehousing (DWH) level and to use this information at the presentation level to provide reports, queries or dashboards in more than one language. Design strategies for data warehouses are typically based on the assumption of a single-language environment. The motivations for this research are the design and performance challenges encountered when implementing ML in a BI data warehouse environment. These include design issues, slow response times, delays in updating reports and changing languages between reports, the complexity of amending existing reports, and the performance overhead. The literature review identified that the underlying cause of these problems is that existing approaches used to enable ML in BI are primarily ad-hoc workarounds which introduce dependency between elements and lead to excessive redundancy. From the literature review, it was concluded that a satisfactory solution to the challenge of ML in BI requires a design approach based on data independence, the concept of immunity from changes, and that such a solution does not currently exist. This thesis presents MLED_BI (Multilingual Enabled Design for Business Intelligence). MLED_BI is a novel design approach which supports data independence and immunity from changes in the design of ML data warehouses and BI systems. MLED_BI extends existing data warehouse design approaches by revising the role of the star schema and introducing a ML design layer to support the separation of language elements. This also facilitates ML at the presentation level by enabling the use of a ML content management system. Compared to existing workarounds for ML, the MLED_BI design approach has a theoretical underpinning which allows languages to be added, amended and deleted without requiring a redesign of the star schema; provides support for the manipulation of ML content; improves performance; and streamlines data warehouse operations such as ETL (Extract, Transform, Load). Minor contributions include the development of a novel BI framework to address the limitations of existing BI frameworks and the development of a tool to evaluate changes to BI reporting solutions. The MLED_BI design approach was developed based on the literature review, and a mixed-methods approach was used for validation. Technical elements were validated experimentally using performance metrics, while end-user acceptance was validated qualitatively with end users and technical users from a number of countries, reflecting the ML basis of the research. MLED_BI requires more resources at the design and initial implementation stages than existing ML workarounds, but this is outweighed by improved performance and by the much greater flexibility in ML made possible by the data independence approach of MLED_BI. The MLED_BI design approach enhances existing BI design approaches for use in ML environments.
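
    The separation of language-dependent content from the star schema can be illustrated with a minimal sketch. The Python below is only an illustration of the general idea, not the thesis's actual design; all table, column and label names (product_dim, label_translations, etc.) are assumptions. Descriptive labels live in a separate translation layer keyed by language, so adding a language means adding rows rather than altering the dimension table.

```python
# Minimal sketch: language-independent dimension plus a separate translation layer.
# All names and values are hypothetical, for illustration only.

# Language-independent dimension rows: surrogate keys and codes only.
product_dim = [
    {"product_key": 1, "product_code": "P-100"},
    {"product_key": 2, "product_code": "P-200"},
]

# Multilingual design layer: one entry per (entity, key, language).
label_translations = {
    ("product", 1, "en"): "Mountain bike",
    ("product", 1, "de"): "Mountainbike",
    ("product", 2, "en"): "Road bike",
    ("product", 2, "fr"): "Vélo de route",
}

def label_for(entity: str, key: int, lang: str, fallback: str = "en") -> str:
    """Resolve a descriptive label at presentation level for the requested language."""
    return (label_translations.get((entity, key, lang))
            or label_translations.get((entity, key, fallback), f"{entity}:{key}"))

# Adding a new language only inserts rows into the translation layer;
# the star schema (product_dim and its fact tables) is untouched.
label_translations[("product", 2, "de")] = "Rennrad"

print(label_for("product", 1, "de"))  # Mountainbike
print(label_for("product", 2, "fr"))  # Vélo de route
```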

    Formal Description of Web Services for Expressive Matchmaking


    Open source software GitHub ecosystem: a SEM approach

    Open source software (OSS) is a collaborative effort, and affordable, high-quality software with a lower probability of errors or failures is within reach. Thousands of open-source projects (termed repos) offer alternatives to proprietary software development, and more than two-thirds of companies contribute to open source. Open source technologies such as OpenStack, Docker and KVM are being used to build the next generation of digital infrastructure. An iconic example of the OSS ecosystem is 'GitHub', a successful social coding site. GitHub is a hosting platform that hosts repositories (repos) based on the Git version control system. GitHub is a knowledge-based workspace with several features that facilitate user communication and work integration. In this thesis I employ data extracted from GitHub to better understand the OSS ecosystem and the extent to which each of its deployed elements affects the successful development of that ecosystem. In addition, I investigate a repo's growth over different time periods to test the repo's changing behaviour. My observations show that developers do not follow a single methodology when developing and growing their projects; instead, they tend to cherry-pick from the available software methodologies. The GitHub API is the main source used to extract the metadata for this thesis's research. The extraction process is time-consuming due to restrictive access limits, even with authentication. I apply Structural Equation Modelling (SEM) to investigate the relative path relationships between the GitHub-deployed OSS elements and to determine the strength with which each element contributes to a repo's activity level. SEM is a multivariate statistical analysis technique that combines factor analysis and multiple regression to analyse structural relationships between measured variables and/or latent constructs. This thesis bridges the research gap around longitudinal OSS studies. It engages large sample-size OSS repo metadata sets, data-quality control, and comparisons across multiple programming languages. Querying GitHub is neither direct nor simple, yet querying for all valid repos remains important: illegal or unrepresentative outlier repos (which may even be quite popular) do arise, and these need to be removed from each initial language-specific metadata set. Eight top GitHub programming languages (selected from the most-forked repos) are engaged separately in this research. The thesis observes these eight metadata sets of GitHub repos and, over time, measures the contribution that each deployed element makes within each set. The number of stars given to a repo delivers a weaker contribution to its software development processes. Forks sometimes work against a repo's progress by generating very minor negative total effects on its commit (activity) level and by diluting the focus of the repo's software development strategies: a fork may generate new ideas, create a new repo, and then draw some of the original repo's developers off into this new software development direction, thus retarding the original repo's commit (activity) level progression.
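
    As a rough illustration of the kind of metadata extraction described above, the sketch below queries the public GitHub REST API for a repository's element counts and checks the remaining rate limit. The repository name, the environment variable, and the fields kept are placeholders; the thesis's actual extraction pipeline may differ.

```python
# Minimal sketch of pulling repo metadata from the GitHub REST API.
# Repository name, token variable and selected fields are illustrative only.
import os
import requests

API = "https://api.github.com"
# Authenticated requests receive a higher rate limit (env var name is an assumption).
token = os.environ.get("GITHUB_TOKEN")
headers = {"Authorization": f"token {token}"} if token else {}

def repo_metadata(full_name: str) -> dict:
    """Fetch a few element counts (stars, forks, watchers, open issues) for one repo."""
    resp = requests.get(f"{API}/repos/{full_name}", headers=headers, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {
        "repo": full_name,
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "watchers": data["subscribers_count"],
        "open_issues": data["open_issues_count"],
    }

def remaining_calls() -> int:
    """Check how many API calls remain before the rate limit is hit."""
    resp = requests.get(f"{API}/rate_limit", headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()["resources"]["core"]["remaining"]

if __name__ == "__main__":
    print(remaining_calls())
    print(repo_metadata("torvalds/linux"))
```
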
    Multiple intermittent, minor version releases produce smaller changes in a GitHub JavaScript repo's commit (activity) level because they often involve only slight OSS improvements and require only minimal commit contributions. More commits also bring more changes to documentation, and the GitHub OSS repo's commit (activity) level rises further. There are both direct and indirect drivers of a repo's OSS activity; pulls and commits are the strongest. This suggests that generating higher levels of pull requests is likely a prime target for the repo creator's core team of developers. This study offers a big-data direction for future work. It allows for the deployment of more sophisticated statistical comparison techniques and offers further indications of the internal and broader relationships that likely exist within GitHub's OSS big data. Its data extraction ideas suggest a link through to business/consumer consumption, and possibly how these may be connected using improved repo search algorithms that release individual business value components.
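
    The path-analysis step could be expressed along the following lines. This is a sketch only, using the third-party semopy package; the variable names (stars, forks, pulls, releases, commits), the input file, and the path structure are assumptions for illustration and are not the model specification used in the thesis.

```python
# Sketch of a simple SEM / path model relating GitHub elements to commit activity.
# Model structure, variable names and input file are illustrative, not the thesis's model.
import pandas as pd
from semopy import Model

# Metadata assembled from the extraction step, one row per repo (placeholder file).
data = pd.read_csv("repo_metadata.csv")  # columns: stars, forks, pulls, releases, commits

# lavaan-style description: direct and indirect paths into commit (activity) level.
desc = """
pulls ~ stars + forks
commits ~ pulls + releases + forks
"""

model = Model(desc)
model.fit(data)
print(model.inspect())  # path coefficients, standard errors, p-values
```
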

    Evolving a secure grid-enabled, distributed data warehouse : a standards-based perspective

    As digital data collection has increased in scale and number, data has become an important type of resource serving a wide community of researchers. Cross-institutional data sharing and collaboration offer a suitable approach for supporting research institutions that lack data and the related IT infrastructure. Grid computing has become a widely adopted approach to enabling cross-institutional resource sharing and collaboration; it integrates a distributed and heterogeneous collection of locally managed users and resources. This project proposes a distributed data warehouse system which uses Grid technology to enable data access and integration, and collaborative operations across multiple distributed institutions, in the context of HIV/AIDS research. This study is based on wider research into an OGSA-based Grid services architecture comprising a data-analysis system which utilizes a data warehouse, data marts, and a near-line operational database hosted by distributed institutions. Within this framework, specific patterns for collaboration, interoperability, resource virtualization and security are included. The heterogeneous and dynamic nature of the Grid environment introduces a number of security challenges. This study therefore also addresses a set of particular security aspects, including PKI-based authentication, single sign-on, dynamic delegation, and attribute-based authorization. These mechanisms, as supported by the Globus Toolkit's Grid Security Infrastructure, are used to enable interoperability and establish trust relationships between the various security mechanisms and policies of different institutions; to manage credentials; and to ensure secure interactions.
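
    To make the attribute-based authorization idea concrete, here is a small hypothetical sketch in Python. It is not Globus Toolkit / GSI code; the resource names, attributes, and policy rules are invented purely to illustrate the concept of granting access based on a subject's attributes rather than identity alone.

```python
# Hypothetical attribute-based authorization check for a shared data-warehouse resource.
# Illustration of the concept only; not Globus Toolkit / GSI code.
from dataclasses import dataclass, field

@dataclass
class Subject:
    """An authenticated user, e.g. identified elsewhere via a PKI certificate."""
    distinguished_name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"role": "researcher"}

# Policy: attributes a subject must present to perform an action on a resource.
POLICY = {
    ("hiv_datamart", "query"): {"role": {"researcher", "clinician"}},
    ("hiv_datamart", "load"):  {"role": {"data_manager"}},
}

def is_authorized(subject: Subject, resource: str, action: str) -> bool:
    """Grant access only if every required attribute matches an allowed value."""
    required = POLICY.get((resource, action))
    if required is None:
        return False  # deny by default
    return all(subject.attributes.get(attr) in allowed
               for attr, allowed in required.items())

alice = Subject("CN=Alice,O=Institution A", {"role": "researcher"})
print(is_authorized(alice, "hiv_datamart", "query"))  # True
print(is_authorized(alice, "hiv_datamart", "load"))   # False
```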

    Automated Structural and Spatial Comprehension of Data Tables

    Data tables on the Web hold large quantities of information, but are difficult to search, browse, and merge using existing systems. This dissertation presents a collection of techniques for extracting, processing, and querying tables that contain geographic data, by harnessing the coherence of table structures for retrieval tasks. Data tables, including spreadsheets, HTML tables, and those found in rich document formats, are the standard way of communicating structured data for typical computer users. Notably, geographic tables (i.e., those containing names of locations) constitute a large fraction of publicly-available data tables and are ripe for exposure to Internet users who are increasingly comfortable interacting with geographic data using web-based maps. Of particular interest is the creation of a large repository of geographic data tables that would enable novel queries such as "find vacation itineraries geographically similar to mine" for use in trip planning or "find demographic datasets that cover regions X, Y, and Z" for sociological research. In support of these goals, this dissertation identifies several methods for using the structure and context of data tables to improve the interpretation of the contents, even in the presence of ambiguity. First, a method for identifying functional components of data tables is presented, capitalizing on techniques for sequence labeling that are used in natural language processing. Next, a novel automated method for converting place references to physical latitude/longitude values, a process known as geotagging, is applied to tables with high accuracy. A classification procedure for identifying a specific class of geographic table, the travel itinerary, is also described, which borrows inspiration from optimization techniques for the traveling salesman problem (TSP). Finally, methods for querying spatially similar tables are introduced and several mechanisms for visualizing and interacting with the extracted geographic data are explored.
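
    As a simplified illustration of the geotagging step (converting place references in a table column to latitude/longitude values), the sketch below resolves place names against a tiny in-memory gazetteer. The gazetteer entries and the naive "take the first candidate" shortcut are assumptions; the dissertation's actual method uses table structure and context to disambiguate, which this toy example does not attempt.

```python
# Toy sketch of geotagging a table column: place name -> (resolved name, lat, lon).
# Gazetteer contents and the disambiguation shortcut are illustrative only.
GAZETTEER = {
    "paris":  [("Paris, France", 48.8566, 2.3522), ("Paris, Texas", 33.6609, -95.5555)],
    "london": [("London, UK", 51.5074, -0.1278), ("London, Ontario", 42.9849, -81.2453)],
    "berlin": [("Berlin, Germany", 52.5200, 13.4050)],
}

def geotag_column(values):
    """Resolve each cell to the first gazetteer candidate (no context-based disambiguation)."""
    tagged = []
    for value in values:
        candidates = GAZETTEER.get(value.strip().lower(), [])
        tagged.append(candidates[0] if candidates else None)
    return tagged

itinerary_column = ["Paris", "Berlin", "London"]
print(geotag_column(itinerary_column))
```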

    Access beyond geographic accessibility: understanding opportunities to human needs in a physical-virtual world

    Access to basic human needs, such as food and healthcare, is conceptually understood to comprise multiple spatial and aspatial dimensions. However, research in this area has traditionally been explored with spatial accessibility measures that almost exclusively focus on just two dimensions: the availability of resources, services, and facilities, and the accessibility or ease with which the locations of these opportunities can be reached with existing land-use and transport systems, under temporal constraints and considering the individual characteristics of people. These calculated measures are insufficient for holistically capturing available opportunities because they ignore other components, such as the emergence of virtual space for carrying out activities and interactions enabled by modern information and communication technologies (ICT). Human dynamics today exist in a hybrid physical-virtual space, and recent research has highlighted the importance of understanding ICT, individual behavior, local context, social relations, and human perceptions in identifying the opportunities available to people. However, a holistic approach that relates these different aspects to access research is lacking. This dissertation addresses this gap by proposing a new conceptual framework for the geography of access for various kinds of human needs, using food access as a case study to illustrate how the proposed framework can be applied to address critical societal issues. An interactive multi-space geographic information system (GIS) web application is developed to better understand and visualize individual potential food access based on the conceptual framework. This dissertation contributes to the body of research with a proposed conceptual framework of access in a hybrid physical-virtual world, the integration of various big and small data sources to reveal information relating to people's access, and the novel development of a multi-space GIS to analyze and visualize access to opportunities.
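
    The kind of conventional place-based spatial accessibility measure that the abstract contrasts with its broader framework can be sketched as a simple cumulative-opportunities count: the number of opportunities (e.g. food outlets) reachable from an origin within a travel-time threshold. The threshold, travel times, and outlet counts below are invented for illustration and are not data from the dissertation.

```python
# Sketch of a cumulative-opportunities accessibility measure:
# A_i = sum of opportunities O_j at destinations j reachable within a time threshold T.
# All numbers are illustrative.

def cumulative_access(travel_times_min, opportunities, threshold_min=15):
    """Count opportunities reachable from one origin within the travel-time threshold."""
    return sum(opportunities[dest]
               for dest, minutes in travel_times_min.items()
               if minutes <= threshold_min)

# Travel times (minutes) from one origin to candidate food outlets, and outlet counts.
travel_times = {"store_a": 5, "store_b": 12, "store_c": 25}
food_outlets = {"store_a": 1, "store_b": 1, "store_c": 1}

print(cumulative_access(travel_times, food_outlets, threshold_min=15))  # 2
```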