2,321 research outputs found

    Developing a service endpoint to integrate semantic collection data from botanical databases and other information systems

    The digitization of botanical collections has increasingly brought biodiversity research activities online. To make these data usable as efficiently as possible, various obstacles have to be overcome. One such obstacle is the inability to integrate information from other sources. Although agreed-upon, machine-understandable data standards such as ABCD have produced concepts that can already be described semantically, these are often still transmitted as free-text information. The use of identifiers for collectors has created opportunities for integrating data from external information systems. However, since the identifiers used are not standardized and vary from institution to institution, this work aims to develop a web service demonstrating that this problem can be overcome by applying appropriate Linked Data methods to centralized knowledge bases such as Wikidata. After eliciting requirements from participating CETAF institutions, an API was designed and implemented on this basis that can integrate biographic, bibliographic, and collection data into a single semantic file format by leveraging multiple endpoints. The work thus shows that the diverse identifiers used in collection databases do not have to be a problem. Moreover, missing IDs for important information sources such as Wikidata can be found and used. Heterogeneous data from different sources can be merged using previously defined mappings, even when such data are not available in semantic formats. Further sources of information could thus be added in the future. A future focus on annotated geographic identifiers is also conceivable, in order to additionally integrate semantic data on the locations where collection objects were found.
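
As a rough illustration of how a missing Wikidata ID might be found from an institution-specific identifier, the sketch below builds a SPARQL lookup for the Wikidata Query Service. It is not the API described in the abstract: the function name `build_lookup_query`, the scheme names, and the choice of VIAF (P214) and ORCID (P496) as example identifier schemes are assumptions made for illustration.

```python
# Assumed mapping from identifier scheme to Wikidata property.
# P214 (VIAF ID) and P496 (ORCID iD) are existing Wikidata properties;
# which schemes a given institution actually uses is an assumption here.
WIKIDATA_PROPERTY = {
    "viaf": "P214",
    "orcid": "P496",
}


def build_lookup_query(scheme: str, value: str) -> str:
    """Return a SPARQL query that finds Wikidata items carrying the given external ID."""
    prop = WIKIDATA_PROPERTY[scheme.lower()]
    # Reject characters that would break out of the SPARQL string literal.
    if '"' in value or "\\" in value:
        raise ValueError("identifier contains characters that are not allowed")
    # The wdt: prefix is predefined by the Wikidata Query Service.
    return f'SELECT ?item WHERE {{ ?item wdt:{prop} "{value}" . }} LIMIT 5'


if __name__ == "__main__":
    # Placeholder identifier value, used only to show the query shape.
    print(build_lookup_query("ORCID", "0000-0002-1825-0097"))
```

The query string would then be sent to a SPARQL endpoint; the network call is left out so that the sketch stays self-contained.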

    Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article

    With the rapid development of the digital humanities (DH) field, demands for historical and cultural heritage data have generated deep interest in the data provided by libraries, archives, and museums (LAMs). In order to enhance LAM data’s quality and discoverability while enabling a self-sustaining ecosystem, “semantic enrichment” has become a strategy increasingly used by LAMs in recent years. This article introduces a number of semantic enrichment methods and efforts that can be applied to LAM data at various levels, aiming to support deeper and wider exploration and use of LAM data in DH research. The real cases, research projects, experiments, and pilot studies shared in this article demonstrate endless potential for LAM data, whether they are structured, semi-structured, or unstructured, regardless of what types of original artifacts carry the data. Following their roadmaps would encourage more effective initiatives and strengthen this effort to maximize LAM data’s discoverability, use- and reuse-ability, and their value in the mainstream of DH and Semantic Web.

    Music feature extraction and analysis through Python

    In the digital era, platforms like Spotify have become the primary channels of music consumption, broadening the possibilities for analyzing and understanding music through data. This project focuses on a comprehensive examination of a dataset sourced from Spotify, with Python as the tool for data extraction and analysis. The primary objective centers on the creation of this dataset, emphasizing a diverse range of songs from various subgenres. The intention is to represent both mainstream and niche musical landscapes, aligning with the Long Tail distribution concept, which highlights the market potential of less popular niche products. Through analysis, patterns in the evolution of musical features over past decades become evident. Shifts in features such as energy, loudness, danceability, and valence, and their correlation with popularity, emerge from the dataset. Parallel to this analysis is the conceptualization of a music recommendation system based on the content of the dataset. The aim is to connect tracks, especially lesser-known ones, with potential listeners. This project provides insights beneficial for music enthusiasts, data scientists, and industry professionals. The methodologies and analyses present a convergence of data science and the music industry in today's digital context.
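
One common way to build a content-based recommender over audio features is nearest-neighbour search with cosine similarity. The sketch below shows that idea only; the abstract does not state which similarity measure the project uses, and the track names and feature values are invented. Loudness is left out because it is measured on a different scale from the other features and would need rescaling first.

```python
import math

# Hypothetical tracks with invented values, for illustration only.
# Each vector is (energy, danceability, valence), assumed scaled to [0, 1].
TRACKS = {
    "mainstream_hit": (0.90, 0.80, 0.70),
    "niche_dance": (0.88, 0.82, 0.65),
    "quiet_ballad": (0.20, 0.30, 0.25),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def recommend(seed, tracks, k=1):
    """Return the k track names most similar to the seed, excluding the seed itself."""
    seed_vec = tracks[seed]
    scored = [
        (cosine_similarity(seed_vec, vec), name)
        for name, vec in tracks.items()
        if name != seed
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    print(recommend("mainstream_hit", TRACKS))
```

Because the ranking depends only on feature similarity and not on play counts, a lesser-known track with a similar profile can surface next to a popular one, which matches the stated aim of connecting niche tracks with listeners.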

    A Systematic Review of Automated Query Reformulations in Source Code Search

    Fixing software bugs and adding new features are two of the major maintenance tasks. Software bugs and features are reported as change requests. Developers consult these requests and often choose a few keywords from them as an ad hoc query. They then execute the query with a search engine to find the exact locations within the software code that need to be changed. Unfortunately, even experienced developers often fail to choose appropriate queries, which leads to costly trial and error during a code search. Over the years, many studies have attempted to reformulate developers' ad hoc queries to support them. In this systematic literature review, we carefully select 70 primary studies on query reformulations from 2,970 candidate studies, perform an in-depth qualitative analysis (e.g., Grounded Theory), and then answer seven research questions with major findings. First, to date, eight major methodologies (e.g., term weighting, term co-occurrence analysis, thesaurus lookup) have been adopted to reformulate queries. Second, the existing studies suffer from several major limitations (e.g., lack of generalizability, vocabulary mismatch problem, subjective bias) that might prevent their wide adoption. Finally, we discuss the best practices and future opportunities to advance the state of research in search query reformulations. Comment: 81 pages, accepted at TOSE
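
Term weighting, one of the methodologies named above, can be sketched as follows: score each term of a change request by a TF-IDF style weight against a small corpus and keep the top-scoring terms as the reformulated query. This is a generic illustration, not a method taken from any of the reviewed studies; the stopword list, the smoothing formula, and the example texts are assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "when"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens without stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def reformulate(change_request, corpus, k=3):
    """Return the k terms of the change request with the highest TF-IDF weight."""
    docs = [set(tokenize(doc)) for doc in corpus]
    tf = Counter(tokenize(change_request))
    n = len(docs)

    def idf(term):
        df = sum(1 for doc in docs if term in doc)
        # Smoothed inverse document frequency; rarer terms weigh more.
        return math.log((1 + n) / (1 + df)) + 1

    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda term: (-tf[term] * idf(term), term))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "the parser reads the file",
        "the cache stores the result",
        "the logger writes the file",
    ]
    request = "the parser crashes when the parser reads an empty file"
    print(reformulate(request, corpus))
```

Terms that are frequent in the request but rare across the corpus rise to the top, while terms that appear in most documents are pushed down.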

    World-Historical Gazetteer

    This project will advance work toward the creation of a world-historical gazetteer that will provide comprehensive databases of places throughout the world since 1500 CE, including attention to the range of attributes known for each place. To satisfy the needs of all the large-scale historical data resources now being created, there is a need for such a comprehensive and general gazetteer system. A two-day workshop, bringing together leading figures who have developed gazetteers and the datasets in which they are incorporated, will produce a research design for this world-historical gazetteer system, which can then be implemented in subsequent work. Four small research tasks concerning services, standards, and content will bring immediate progress toward implementation. The project is organized by the Collaborative for Historical Information and Analysis (CHIA), which has a record of sustaining collaborations for large-scale humanities work.

    Investigating bias in Music Recommender Systems

    Music Recommender Systems (MRS) are software applications that provide personalized music recommendations based on user preferences and listening history. They analyze data to suggest music that aligns with individual tastes, enhancing the music discovery experience. This thesis aims to investigate the influence of record labels across different music recommendation datasets and evaluate their impact on recommender systems. Additionally, it seeks to expand the scope and experimentation of prior research on bias within feedback loops of MRS. To study their effect, the datasets are preprocessed and fed into a multi-stage web crawler that retrieves record label information for individual albums as well as an assignment to a major record company (Universal, Sony, Warner) or to an independent label. This crawler is used to enrich our dataset collection. Based on the additional information, we can show different characteristics and identify particular biases in their user-generated music collections of playlists and listening profiles. Moreover, recommender system experiments are conducted, presenting results of feedback loop simulations in which the stability of the record label distribution in longitudinal recommendations is studied. All findings and gathered record label information are made publicly available to the research community.

    Data-Driven Decisions and Actions in Today’s Software Development

    Today’s software development is all about data: data about the software product itself, about the process and its different stages, about the customers and markets, about the development, the testing, the integration, the deployment, or the runtime aspects in the cloud. We use static and dynamic data of various kinds and quantities to analyze market feedback, feature impact, code quality, architectural design alternatives, or effects of performance optimizations. Development environments are no longer limited to IDEs in a desktop application or the like but span the Internet using live programming environments such as Cloud9 or large-volume repositories such as BitBucket, GitHub, GitLab, or StackOverflow. Software development has become “live” in the cloud, be it the coding, the testing, or the experimentation with different product options on the Internet. The inherent complexity puts a further burden on developers, since they need to stay alert when constantly switching between tasks in different phases. Research has been analyzing the development process, its data and stakeholders, for decades and is working on various tools that can help developers in their daily tasks to improve the quality of their work and their productivity. In this chapter, we critically reflect on the challenges faced by developers in a typical release cycle, identify inherent problems of the individual phases, and present the current state of the research that can help overcome these issues.