306 research outputs found

    Early American Cookbooks: Creating and Analyzing a Digital Collection Using the HathiTrust Research Center Portal

    Full text link
    The Early American Cookbooks project is a carefully curated online collection of 1450 cookbooks published in the United States between 1800 and 1920. The purposes of the project are to create a freely available, searchable online collection of early American cookbooks, to offer an overview of the scope and contents of the collection, and to use digital humanities tools to explore trends and patterns in the metadata and the full text of the collection. The project has two basic components: a collection of 1450 full-text titles on HathiTrust and a website to present a guide to the collection and the results of the digital humanities analysis. Early American Cookbooks collection URL: https://babel.hathitrust.org/cgi/mb?a=listis&c=1934413200 Early American Cookbooks website URL: https://wp.nyu.edu/early_american_cookbooks

    Methodologically Grounded Semantic Analysis of Large Volume of Chilean Medical Literature Data Applied to the Analysis of Medical Research Funding Efficiency in Chile

    Get PDF
    Background: Medical knowledge accumulates in scientific research papers over time. To exploit this knowledge with automated systems, there is growing interest in developing text mining methodologies that extract, structure, and analyze, in the shortest time possible, the knowledge encoded in the large volume of medical literature. In this paper, we use the Latent Dirichlet Allocation approach to analyze the correlation between funding efforts and actually published research results, in order to provide policy makers with a systematic and rigorous tool to assess the efficiency of funding programs in the medical area. Results: We tested our methodology on the Revista Medica de Chile, years 2012-2015. Fifty relevant semantic topics were identified within 643 medical scientific research papers. Relationships between the identified semantic topics were uncovered using visualization methods. We were also able to analyze the funding patterns of the scientific research underlying these publications. We found that only 29% of the publications declare funding sources, and we identified five topic clusters that concentrate 86% of the declared funds. Conclusions: Our methodology allows analyzing and interpreting the current state of medical research at a national level. The funding source analysis may be useful at the policy-making level to assess the impact of actual funding policies and to design new ones. This research was partially funded by CONICYT, Programa de Formacion de Capital Humano avanzado (CONICYT-PCHA/Doctorado Nacional/2015-21150115). MG's work in this paper has been partially supported by FEDER funds for the MINECO project TIN2017-85827-P, and by projects KK-2018/00071 and KK2018/00082 of the Elkartek 2018 funding program. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 777720. The funding bodies played no role in the design of the study, the collection, analysis, or interpretation of data, or the writing of the manuscript.
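
    A minimal sketch of the kind of topic modeling described above, not the authors' pipeline: fitting an LDA model to a handful of toy abstracts with scikit-learn and printing the top words per topic. The documents, the number of topics, and all parameter values here are illustrative placeholders.

        # Illustrative only: toy documents stand in for the 643 papers analyzed in the study.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = [
            "funding of clinical research and public health policy in chile",
            "latent dirichlet allocation applied to medical text mining",
            "assessing the efficiency of research funding programs",
            "topic models uncover semantic structure in scientific papers",
        ]

        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(docs)

        # The paper identifies 50 topics; 2 is enough for this toy corpus.
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topic = lda.fit_transform(X)  # one topic-proportion row per document

        terms = vectorizer.get_feature_names_out()
        for k, weights in enumerate(lda.components_):
            top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
            print(f"topic {k}:", ", ".join(top_words))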

    Empowering open science with reflexive and spatialised indicators

    Get PDF
    Bibliometrics have become commonplace and are widely used by authors and journals to monitor, evaluate, and identify their readership in a scientific world that publishes ever more. This contribution introduces a multi-method corpus analysis tool, specifically conceived for scientific corpora with spatialised content. We propose a dedicated interactive application that integrates three strategies for building semantic networks, using keywords (self-declared themes), citations (areas of research using the papers), and full texts (themes derived from the words used in writing). The networks can be studied with respect to their temporal evolution as well as their spatial expressions, by considering the countries studied in the papers under inquiry. The tool is applied as a proof of concept to the papers published in the online open access geography journal Cybergeo since its creation in 1996. Finally, we compare the three methods and conclude that their complementarity can help go beyond simple statistics to better understand the epistemological evolution of a scientific community and the readership target of the journal. Our tool can be applied by any journal to its own corpus, thus fostering open science and reflexivity.
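
    A minimal sketch, not the Cybergeo application itself, of the first of the three strategies above: a semantic network built from self-declared keywords, where two keywords are linked each time they appear on the same paper. The paper records below are invented placeholders.

        # Illustrative keyword co-occurrence network; requires networkx.
        from itertools import combinations
        import networkx as nx

        papers = [
            {"keywords": ["urban growth", "simulation", "GIS"]},
            {"keywords": ["GIS", "spatial analysis"]},
            {"keywords": ["simulation", "spatial analysis", "urban growth"]},
        ]

        G = nx.Graph()
        for paper in papers:
            # Link every pair of keywords declared on the same paper.
            for a, b in combinations(sorted(set(paper["keywords"])), 2):
                if G.has_edge(a, b):
                    G[a][b]["weight"] += 1
                else:
                    G.add_edge(a, b, weight=1)

        # Edge weights count how often two keywords were declared together.
        for a, b, data in G.edges(data=True):
            print(f"{a} -- {b}: {data['weight']}")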

    Using a Kano-like Model to Achieve Open Innovation in the Requirements Engineering Process

    Get PDF
    When Requirements Engineering (RE) is applied, requirements analysis is often used to determine which candidate requirements of a feature should be included in a software release. This plays a crucial role in the decisions made to increase the economic value of software. Nowadays, products evolve fast, and the process of requirements prioritization is becoming shorter as well. Companies benefit from receiving quick feedback from end users about what should be included in subsequent releases. One effective approach supporting requirements prioritization is the Kano model. The Kano model defines the relationship between user satisfaction and product features. It is a method used to classify user preferences according to their importance, and in doing so, supports requirements prioritization. However, implementing the Kano model is costly and time-consuming, and its application cannot be repeated quickly. Moreover, this is even more difficult for small companies because they might not have sufficient funds and resources to contact end users and conduct interviews. This puts small businesses, especially start-ups, at an unfair disadvantage in competing with big companies. To address this problem and make the application of the Kano model simpler, faster, and cheaper, we propose evolving the Kano model in two respects. First, free online text data should be used to replace responses collected from interviewees. Second, in order to handle the larger amount of data that can be collected from free online text and to facilitate frequent analyses, the data analysis process should be automated. The goal of this research is to propose methods for (semi-)automatically classifying user opinions collected from online open sources (e.g., online reviews) to help decision-makers decide which software requirements to include in subsequent product versions. To achieve this goal, we propose the Open Innovation in Requirements Engineering (OIRE) method to help software organizations gain a better understanding of user needs and satisfaction with existing products. A key element of the OIRE method is its Kano-like model, which mimics the traditional Kano model except that it uses data from online reviews instead of interviews conducted with select focus groups. https://www.ester.ee/record=b527385
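
    A minimal sketch, not the OIRE method itself, of (semi-)automatically classifying review sentences into Kano-like categories with an off-the-shelf text classifier. The training sentences, labels, and categories below are invented placeholders; a real study would start from a manually annotated seed set.

        # Illustrative only: TF-IDF features + logistic regression over toy review sentences.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        train_sentences = [
            "the app crashes every time i open it",
            "login should just work without errors",
            "battery life is longer than advertised",
            "search results load noticeably faster now",
            "i love the surprise dark mode, did not expect it",
            "the built-in translator is a delightful extra",
        ]
        train_labels = ["must-be", "must-be", "performance",
                        "performance", "attractive", "attractive"]

        classifier = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        )
        classifier.fit(train_sentences, train_labels)

        # Classify a new opinion harvested from an online review.
        print(classifier.predict(["the new offline mode is a nice surprise"]))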

    Modeling technological topic changes in patent claims

    Full text link
    © 2014 Portland International Conference on Management of Engineering and Technology. Patent claims usually embody the most essential terms and the core technological scope that define the protection of an invention, which makes them the ideal resource for patent content and topic change analysis. However, manually conducting content analysis on massive numbers of technical terms is very time-consuming and laborious. Even with the help of traditional text mining techniques, it is still difficult to model topic changes over time, because single keywords alone are usually too general or ambiguous to represent a concept. Moreover, term frequency, which is commonly used to define a topic, cannot separate polysemous words that actually describe different themes. To address this issue, this research proposes a topic change identification approach based on Latent Dirichlet Allocation to model and analyze topic changes with minimal human intervention. After textual data cleaning, underlying semantic topics hidden in large archives of patent claims are revealed automatically. Concepts are defined by probability distributions over words instead of term frequency, so that polysemy is accommodated. A case study using patents published by the United States Patent and Trademark Office (USPTO) from 2009 to 2013 with Australia as the assignee country is presented to demonstrate the validity of the proposed topic change identification approach. The experimental results show that the proposed approach can be used as an automatic tool to provide machine-identified topic changes for more efficient and effective R&D management assistance.
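
    A minimal sketch of the general idea, not the paper's pipeline: fit a single LDA model on patent-claim texts and then track how the average topic mixture shifts from year to year. The claims, years, and parameters below are invented placeholders.

        # Illustrative only: toy patent claims tagged with publication years.
        from collections import defaultdict
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        claims = [
            ("2009", "a wireless communication device comprising an antenna array"),
            ("2011", "a method for encoding video frames using motion vectors"),
            ("2013", "a solar cell module comprising stacked photovoltaic layers"),
            ("2013", "an antenna arrangement for wireless signal transmission"),
        ]

        years, texts = zip(*claims)
        X = CountVectorizer(stop_words="english").fit_transform(texts)
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topics = lda.fit_transform(X)  # per-claim topic proportions (rows sum to 1)

        # Average the topic mixture of all claims published in the same year.
        by_year = defaultdict(list)
        for year, dist in zip(years, doc_topics):
            by_year[year].append(dist)

        for year in sorted(by_year):
            print(year, np.mean(by_year[year], axis=0).round(2))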

    Identifying Synonymous Terms in Preparation for Technology Mining

    Get PDF
    In this research, the development of a 'concept-clumping algorithm' designed to improve the clustering of technical concepts is demonstrated. The algorithm first identifies a list of technically relevant noun phrases from a cleaned extracted list and then applies a rule-based procedure for identifying synonymous terms based on shared words in each term. An assessment found that the algorithm achieves an 89-91% precision rate, moves technically important terms higher in the term frequency list, and improves the technical specificity of term clusters.
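
    An illustrative sketch of the shared-word idea described above, not the authors' algorithm: noun phrases that share a head word are clumped together as candidate synonyms. The phrase list is an invented placeholder, and a real implementation would use richer matching rules.

        # Illustrative only: group noun phrases by their final (head) word.
        from collections import defaultdict

        noun_phrases = [
            "lithium ion battery",
            "ion battery",
            "rechargeable battery",
            "solar panel",
            "photovoltaic solar panel",
        ]

        clumps = defaultdict(list)
        for phrase in noun_phrases:
            head = phrase.split()[-1]  # crude shared-word rule: same head word
            clumps[head].append(phrase)

        for head, members in clumps.items():
            print(head, "->", members)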

    HathiTrust Research Center: Computational Research on the HathiTrust Repository

    Get PDF
    PIs (executive management team): Beth A. Plale, Indiana University (IU); Marshall Scott Poole, University of Illinois Urbana-Champaign (UIUC); Robert McDonald, IU; John Unsworth, UIUC. Senior investigators: Loretta Auvil (UIUC); Johan Bollen (IU); Randy Butler (UIUC); Dennis Cromwell (IU); Geoffrey Fox (IU); Eileen Julien (IU); Stacy Kowalczyk (IU); Danny Powell (UIUC); Beth Sandore (UIUC); Craig Stewart (IU); John Towns (UIUC); Carolyn Walters (IU); Michael Welge (UIUC); Eric Wernert (IU).

    A high-reproducibility and high-accuracy method for automated topic classification

    Full text link
    Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents enables intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy, and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable. Comment: 23 pages, 24 figures.
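
    A minimal illustration of the underlying idea, not the authors' algorithm: represent documents and words as a bipartite network and group them with an off-the-shelf community detection routine. The documents below are invented placeholders, and the authors' actual method is not this simple routine.

        # Illustrative only: bipartite document-word network plus generic community detection.
        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        docs = {
            "d1": "cooking recipes kitchen bread",
            "d2": "bread flour oven kitchen",
            "d3": "network community detection graph",
            "d4": "graph clustering network nodes",
        }

        G = nx.Graph()
        for doc_id, text in docs.items():
            for word in set(text.split()):
                G.add_edge(doc_id, word)  # edge = word occurs in document

        # Each community mixes documents with the words that characterize them,
        # which is, loosely, a joint topic assignment.
        for community in greedy_modularity_communities(G):
            print(sorted(community))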

    DARIAH and the Benelux

    Get PDF

    A New Approach to Information Extraction in User-Centric E-Recruitment Systems

    Get PDF
    In modern society, people rely heavily on information available online through various channels, such as websites, social media, and web portals. Examples include searching for product prices, news, weather, and jobs. This paper focuses on an area of information extraction in e-recruitment, or job searching, which is increasingly used by a large population of users across the world. Given the enormous volume of information related to job descriptions and users' profiles, it is difficult to appropriately match a user's profile with a job description, and vice versa. Existing information extraction techniques are unable to extract contextual entities. Thus, they fall short of extracting domain-specific information entities and consequently weaken the matching of user profiles with job descriptions. The work presented in this paper aims to extract entities from job descriptions using a domain-specific dictionary. The extracted information entities are enriched with knowledge using Linked Open Data. Furthermore, job context information is expanded using a job description domain ontology based on the contextual and knowledge information. The proposed approach appropriately matches users' profiles/queries and job descriptions. It is tested through various experiments on data from real-life job portals. The results show that the proposed approach enriches the data extracted from job descriptions and can help users find more relevant jobs.
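
    A minimal sketch, not the paper's system, of the dictionary-based extraction step described above: matching terms from a small domain dictionary against a job description. The dictionary entries and the job text are invented placeholders; the paper additionally enriches the extracted entities via Linked Open Data and a domain ontology.

        # Illustrative only: case-insensitive dictionary lookup over a job description.
        import re

        skill_dictionary = {
            "python": "ProgrammingLanguage",
            "machine learning": "Skill",
            "sql": "ProgrammingLanguage",
            "project management": "Skill",
        }

        job_description = (
            "We are looking for a data analyst with strong SQL skills, "
            "some Python scripting, and experience in machine learning."
        )

        text = job_description.lower()
        entities = [
            (term, label)
            for term, label in skill_dictionary.items()
            if re.search(r"\b" + re.escape(term) + r"\b", text)
        ]
        print(entities)  # e.g. [('python', 'ProgrammingLanguage'), ('machine learning', 'Skill'), ('sql', 'ProgrammingLanguage')]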