33 research outputs found

    Identifying Relationships between Scientific Datasets

    Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B. We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets. The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related, and the relationship between them. Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in a scientist's or a small group of researchers' collections of datasets, and the sensitivity of the overall system to the performance of its various components.

    Data Citation: A New Provenance Challenge

    Curriculum analysis for data systems education.

    The field of data systems has seen rapid advances due to the popularization of data science, machine learning, and real-time analytics. In industry contexts, system features such as recommendation systems, chatbots, and reverse image search require efficient infrastructure and data management solutions. Due to recent advances, it remains unclear (i) which topics are recommended to be included in data systems studies in higher education, (ii) which topics are a part of data systems courses and how they are taught, and (iii) which data-related skills are valued for roles such as software developers, data engineers, and data scientists. This working group aims to answer these points to explain the state of data systems education today and to uncover knowledge gaps and possible discrepancies between recommendations, course implementations, and industry needs. We expect the results to be applicable in tailoring various data systems courses to better cater to the needs of industry, and for teachers to share best practices.

    Data systems education: curriculum recommendations, course syllabi, and industry needs

    Data systems have been an important part of computing curricula for decades, and an integral part of data-focused industry roles such as software developers, data engineers, and data scientists. However, the field of data systems encompasses a large number of topics ranging from data manipulation and database distribution to creating data pipelines and data analytics solutions. Due to the slow nature of curriculum development, it remains unclear (i) which data systems topics are recommended across diverse higher education curriculum guidelines, (ii) which topics are taught in higher education data systems courses, and (iii) which data systems topics are actually valued in data-focused industry roles. In this study, we analyzed computing curriculum guidelines, course contents, and industry needs regarding data systems to uncover discrepancies between them. Our results show, for example, that topics such as data visualization, data warehousing, and semi-structured data models are valued in industry, yet seldom taught in courses. This work allows professionals to further align curriculum guidelines, higher education, and data systems industry to better prepare students for their working life by focusing on relevant skills in data systems education.

    Green BIM Adoption, an Agile Approach

    The energy consumption issues of the United States cannot be discussed without including the energy needs of the building sector. Currently there are approximately 76 million residential structures and 5 million commercial structures in the United States [1]. As the population grows upward of 311 million people, the need for additional buildings will correspondingly increase [2]. Currently, buildings account for approximately 40% of total energy usage and 70% of electricity usage [4]. The cost of energy in the United States has also been increasing. As the rest of the world develops and industrializes, the demand for energy is going to increase due to the economic elasticity in the energy sector.

    DBLP-NSF dataset SQL dump

    This dataset, DBLP-NSF, is a PostgreSQL database dump file that connects computer science publications (extracted from DBLP) to their NSF funding grants (extracted from the National Science Foundation grant dataset). This dataset was used in an NSF-funded research project on data citation as an example of extending bibliographic citations to include funding information (NSF IIS 1302212, URLs: https://alliance.seas.upenn.edu/~citation/wiki/, https://www.researchgate.net/project/CiteDB). It is not a complete dataset (not all publications or all grants are included) and is not intended as an authoritatively complete dataset to be used for data mining. Special thanks to Shivendra Pandey for his work on developing this dataset.

    Insights from Student Solutions to MongoDB Homework Problems
