16 research outputs found

    Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database

    We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu (NIH P01AG039347; NSF 1348742).
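
As a rough illustration of the idea in the abstract above (not the Ethnea implementation), the sketch below aggregates the countries attached to the instances of a name, maps each country to ethnicities through a probability table, and returns the dominant ethnicity or a pair. The table `COUNTRY_TO_ETHNICITY`, its values, the function `classify`, and the `pair_ratio` threshold are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical country -> ethnicity probabilities (illustrative values only).
COUNTRY_TO_ETHNICITY = {
    "Italy": {"ITALIAN": 1.0},
    "UK": {"ENGLISH": 1.0},
    "USA": {"ENGLISH": 0.6, "HISPANIC": 0.2, "GERMAN": 0.2},
}


def classify(countries, pair_ratio=0.5):
    """Return a 1- or 2-tuple of ethnicities for the countries of a name's instances.

    pair_ratio is an assumed threshold: the runner-up is reported alongside the
    top ethnicity when its score is at least pair_ratio times the top score.
    """
    scores = Counter()
    for country in countries:
        # Unknown countries contribute nothing to the vote.
        for ethnicity, prob in COUNTRY_TO_ETHNICITY.get(country, {}).items():
            scores[ethnicity] += prob
    if not scores:
        return None
    ranked = scores.most_common(2)
    top, top_score = ranked[0]
    if len(ranked) == 2 and ranked[1][1] >= pair_ratio * top_score:
        return (top, ranked[1][0])
    return (top,)
```

For instance, a name seen three times with an Italian affiliation and once with a UK affiliation would be labeled with a single class here, while an even split would yield a pair.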

    Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. In order to uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained by Ethnea, and examined the scientific writing styles in English from a two-fold perspective of linguistic complexity: (1) syntactic complexity, including measurements of sentence length and sentence complexity; and (2) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between groups in syntactical and lexical complexity. Comment: 6 figures
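
To make the measures named in the abstract above concrete, here is a minimal sketch of three of them: mean sentence length, lexical diversity (as a type-token ratio), and lexical density. It is not the authors' pipeline. The tokenizer is a simple regular expression, and lexical density is approximated with a small hypothetical function-word list (`FUNCTION_WORDS`) instead of part-of-speech tagging.

```python
import re

# Hypothetical, deliberately small function-word list; a real study would
# typically identify content words with a part-of-speech tagger instead.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "and", "to", "is", "are", "was", "were",
    "that", "this", "for", "on", "with", "by", "we", "it",
}


def complexity(text):
    """Return simple syntactic and lexical complexity measures for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not tokens:
        return None
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # Syntactic complexity proxy: average number of tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences),
        # Lexical diversity: distinct tokens divided by all tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Lexical density: share of tokens that are content words.
        "lexical_density": len(content) / len(tokens),
    }
```

The type-token ratio is sensitive to text length, so comparisons across articles of different lengths would need a length-normalized variant.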

    Effect of minimally invasive autopsy and ethnic background on acceptance of clinical postmortem investigation in adults

    Objectives: Autopsy rates worldwide have dropped significantly over the last five decades. Imaging-based autopsies are increasingly used as alternatives to conventional autopsy (CA). The aim of this study was to investigate the effect of the introduction of minimally invasive autopsy, consisting of CT, MRI and tissue biopsies, on the overall autopsy rate (of CA and minimally invasive autopsy) and the autopsy rate among different ethnicities. Methods: We performed a prospective single-center before-after study. The intervention was the introduction of m

    LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation

    In this paper, we present a method to automatically build large labeled datasets for the author ambiguity problem in the academic world by leveraging the authoritative academic resources ORCID and DOI. Using the method, we built LAGOS-AND, two large, gold-standard datasets for author name disambiguation (AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our LAGOS-AND datasets are substantially different from the existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5M citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we reveal the degree of variation of last names in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author names they host with the authors' official last names shown on the ORCID pages. Furthermore, we evaluate several baseline disambiguation methods as well as MAG's author ID system on our datasets, and the evaluation yields several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available. Comment: 33 pages, 7 tables, 7 figures

    Chinese Superstition and Real Estate Prices: Transaction-level Evidence from the US Housing Market

    We investigate the impact of Chinese superstition on prices paid by Chinese home buyers in Seattle, Washington. Chinese consider 8 lucky and 4 unlucky. Empirical results indicate Chinese buyers pay a 1-2% premium for addresses including an 8 and a 1% discount for addresses including a 4. These results are unrelated to unobserved property quality: no premium exists when Chinese sell to non-Chinese. Absent explicit identifiers for Chinese individuals, we develop a binomial name classifier using methods from the biomedical and document classification literature, allowing for falsification tests using other ethnic groups and mitigating ambiguity attributable to transliteration of Chinese characters into the Latin alphabet.

    Peculiarities of gender disambiguation and ordering of non-English authors’ names for Economic papers beyond core databases

    Purpose: To supplement the quantitative portrait of Ukrainian Economics discipline with the results of gender and author ordering analysis at the level of individual authors, special methods of working with bibliographic data with a predominant share of non-English authors are used. The properties of gender mixing, the likelihood of male and female authors occupying the first position in the authorship list, as well as the arrangements of names are studied. Design/methodology/approach: A data set containing bibliographic records related to Ukrainian journal publications in the field of Economics is constructed using Crossref meta-data. Partial semi-automatic disambiguation of authors' names is performed. First names, along with gender-specific ethnic surnames, are used for gender disambiguation required for further comparative gender analysis. Random reshuffling of data is used to determine the impact of gender correlations. To assess the level of alphabetization for our data set, both Latin and Cyrillic versions of names are taken into account. Findings: The lack of well-structured metadata and the poor use of digital identifiers lead to numerous problems with automatization of bibliographic data pre-processing, especially in the case of publications by non-Western authors. The described stages for working with such specific data help to work at the level of authors and analyse, in particular, gender issues. Despite the larger number of female authors, gender equality is more likely to be reported at the individual level for the discipline of Ukrainian Economics. The tendencies towards collaborative or solo-publications and gender mixing patterns are found to be dependent on the journal: differences are found for publications indexed in Scopus and/or Web of Science databases. It has also been found that Ukrainian Economics research is characterized by a rather non-alphabetical order of authors.
Research limitations: Only partial authors' name disambiguation is performed in a semi-automatic way. Gender labels can be derived only for authors declared by full First names or gender-specific Last names. Practical implications: The typical features of Ukrainian Economic discipline can be used to perform a comparison with other countries and disciplines, to develop an informed-based assessment procedure at the national level. The proposed way of processing publication data can be borrowed to enrich metadata about other research disciplines, especially for non-English speaking countries. Originality/value: To our knowledge, this is the first large-scale quantitative study of Ukrainian Economic discipline. The results obtained are valuable not only at the national level, but also contribute to general knowledge about Economic research, gender issues, and authors' names ordering. An example of the use of Crossref data is provided, although this data source is still little used due to a number of drawbacks. Here, for the first time, attention is drawn to the explicit use of the features of the Slavic authors' names.

    Google Summer of Code Gender Diversity: An analysis of the last 4 editions

    This work presents a comprehensive study of the participation of men and women in the area of Information and Communications Technology (ICT) through data extracted from the last four editions of Google Summer of Code (GSoC). The goal of this work is to find Association Rules between gender characteristics and coding using the Apriori Algorithm. A total of 61 association rules were generated through the aforementioned algorithm: 22 of them were found only in the data set with the women, 24 only with the men, and 15 were applicable to both sets. We can cite as one of the main findings of this work the fact that the representativeness of women in GSoC has been decreasing in the last few years. Despite this, the representativeness of women in GSoC is above average, according to what has been reported in other studies in the literature in which women are underrepresented. When it comes to the most utilized technologies, we have "Python", "Java", "C++", "C" and "JavaScript" at the top. Analyzing technologies, it is possible to see that the main technologies utilized by men and women are similar, but, in general, men are more likely to be linked to programming languages. The most common project topics are "Event Management", "Web", "Web Development", "Data Science" and "Cloud". This can represent how diverse the project topics of the database are, but is not necessarily related to gender.
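
For readers unfamiliar with the technique named in the abstract above, the following is a minimal, generic sketch of Apriori-style frequent itemset mining and of rule confidence. It is not the study's code: the function names `apriori` and `confidence`, the `min_support` parameter, and the sample transactions in the usage note are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        level = {}
        for cand in current:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a, b in combinations(list(level), 2) if len(a | b) == k}
        # Prune candidates that have any infrequent (k-1)-subset.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from mined supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

With hypothetical transactions such as `[{"python", "web"}, {"python", "web"}, {"python", "data"}, {"java", "web"}]` and `min_support=0.5`, the itemset `{"python", "web"}` is frequent, and `confidence(frequent, {"python"}, {"web"})` gives the share of Python projects that are also web projects.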

    Homeowner Preferences after September 11th, a Microdata Approach

    The existence of homeowner preferences, specifically homeowner preferences for neighbors, is fundamental to economic models of sorting. This paper investigates whether or not the terrorist attacks of September 11, 2001 (9/11) impacted local preferences for Arab neighbors. We test for changes in preferences using a differences-in-differences approach in a hedonic pricing model. Relative to sales before 9/11, we find properties within 0.1 miles of an Arab homeowner sold at a 1.4% discount in the 180 days after 9/11. The results are robust to a number of specifications including time horizon, event date, distance, time, alternative ethnic groups, and the presence of nearby mosques. Previous research has shown price effects at neighborhood levels but has not identified effects at the micro or individual property level, and for good reason: most transaction level data sets do not include ethnic identifiers. Applying methods from the machine learning and biostatistics literature, we develop a binomial classifier using a supervised learning algorithm and identify Arab homeowners based on the name of the buyer. We train the binomial classifier using names from Summer Olympic Rosters for 221 countries during the years 1948-2012. We demonstrate the flexibility of our methodology and perform an interesting counterfactual by identifying Hispanic and Asian homeowners in the data; unlike the statistically significant results for Arab homeowners, we find no meaningful results for Hispanic and Asian homeowners following 9/11.