74 research outputs found
Automatically assembling a full census of an academic field
The composition of the scientific workforce shapes the direction of
scientific research, directly through the selection of questions to
investigate, and indirectly through its influence on the training of future
scientists. In most fields, however, complete census information is difficult
to obtain, complicating efforts to study workforce dynamics and the effects of
policy. This is particularly true in computer science, which lacks a single,
all-encompassing directory or professional organization. A full census of
computer science would serve many purposes, not the least of which is a better
understanding of the trends and causes of unequal representation in computing.
Previous academic census efforts have relied on narrow or biased samples, or on
professional society membership rolls. A full census can be constructed
directly from online departmental faculty directories, but doing so by hand is
prohibitively expensive and time-consuming. Here, we introduce a topical web
crawler for automating the collection of faculty information from web-based
department rosters, and demonstrate the resulting system on the 205
PhD-granting computer science departments in the U.S. and Canada. This method
constructs a complete census of the field within a few minutes, and achieves
over 99% precision and recall. We conclude by comparing the resulting 2017
census to a hand-curated 2011 census to quantify turnover and retention in
computer science, in general and for female faculty in particular,
demonstrating the types of analysis made possible by automated census
construction. Comment: 11 pages, 6 figures, 2 tables
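As a small illustration of the evaluation mentioned in the abstract above, precision and recall for an automatically extracted roster can be computed against a hand-curated one. This is a minimal sketch using the standard definitions; the function name and the example names are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) of an extracted roster against a reference roster."""
    extracted_set, reference_set = set(extracted), set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision: fraction of extracted entries that are real faculty.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: fraction of real faculty that the crawler found.
    recall = true_positives / len(reference_set) if reference_set else 0.0
    return precision, recall


# Hypothetical rosters: the crawler picked up one non-name string and missed two people.
crawled = ["A. Lovelace", "G. Hopper", "Contact Us"]
hand_curated = ["A. Lovelace", "G. Hopper", "A. Turing", "D. Knuth"]
p, r = precision_recall(crawled, hand_curated)
```

Here `p` is 2/3 and `r` is 0.5, since two of the three crawled entries are correct and two of the four true entries were found.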
Environmental Changes and the Dynamics of Musical Identity
Musical tastes reflect our unique values and experiences, our relationships
with others, and the places where we live. But as each of these things changes,
do our tastes also change to reflect the present, or remain fixed, reflecting
our past? Here, we investigate how where a person lives shapes their musical
preferences, using geographic relocation to construct quasi-natural experiments
that measure short- and long-term effects. Analyzing comprehensive data on over
16 million users on Spotify, we show that relocation within the United States
has only a small impact on individuals' tastes, which remain more similar to
those of their past environments. We then show that the age gap between a
person and the music they consume indicates that adolescence, and likely their
environment during these years, shapes their lifelong musical tastes. Our
results demonstrate the robustness of individuals' musical identity, and shed
new light on the development of preferences. Comment: Accepted to be published at ICWSM'1
Prestige drives epistemic inequality in the diffusion of scientific ideas
The spread of ideas in the scientific community is often viewed as a
competition, in which good ideas spread further because of greater intrinsic
fitness, and publication venue and citation counts correlate with importance
and impact. However, relatively little is known about how structural factors
influence the spread of ideas, and specifically how where an idea originates
might influence how it spreads. Here, we investigate the role of faculty hiring
networks, which embody the set of researcher transitions from doctoral to
faculty institutions, in shaping the spread of ideas in computer science, and
the importance of where in the network an idea originates. We consider
comprehensive data on the hiring events of 5032 faculty at all 205
Ph.D.-granting departments of computer science in the U.S. and Canada, and on
the timing and titles of 200,476 associated publications. Analyzing five
popular research topics, we show empirically that faculty hiring can and does
facilitate the spread of ideas in science. Having established such a mechanism,
we then analyze its potential consequences using epidemic models to simulate
the generic spread of research ideas and quantify the impact of where an idea
originates on its long-term diffusion across the network. We find that research
from prestigious institutions spreads more quickly and completely than work of
similar quality originating from less prestigious institutions. Our analyses
establish the theoretical trade-offs between university prestige and the
quality of ideas necessary for efficient circulation. Our results establish
faculty hiring as an underlying mechanism that drives the persistent epistemic
advantage observed for elite institutions, and provide a theoretical lower
bound for the impact of structural inequality in shaping the spread of ideas in
science. Comment: 10 pages, 8 figures, 1 table
Scientific productivity as a random walk
The expectation that scientific productivity follows regular patterns over a
career underpins many scholarly evaluations, including hiring, promotion and
tenure, awards, and grant funding. However, recent studies of individual
productivity patterns reveal a puzzle: on the one hand, the average number of
papers published per year robustly follows the "canonical trajectory" of a
rapid rise to an early peak followed by a gradual decline, but on the other
hand, only about 20% of individual researchers' productivity follows this
pattern. We resolve this puzzle by modeling scientific productivity as a
parameterized random walk, showing that the canonical pattern can be explained
as a decrease in the variance in changes to productivity in the early-to-mid
career. By empirically characterizing the variable structure of 2,085
productivity trajectories of computer science faculty at 205 PhD-granting
institutions, spanning 29,119 publications over 1980--2016, we (i) discover
remarkably simple patterns in both early-career and year-to-year changes to
productivity, and (ii) show that a random walk model of productivity both
reproduces the canonical trajectory in the average productivity and captures
much of the diversity of individual-level trajectories. These results highlight
the fundamental role of a panoply of contingent factors in shaping individual
scientific productivity, opening up new avenues for characterizing how systemic
incentives and opportunities can be directed for aggregate effect.
The misleading narrative of the canonical faculty productivity trajectory.
A scientist may publish tens or hundreds of papers over a career, but these contributions are not evenly spaced in time. Sixty years of studies on career productivity patterns in a variety of fields suggest an intuitive and universal pattern: Productivity tends to rise rapidly to an early peak and then gradually declines. Here, we test the universality of this conventional narrative by analyzing the structures of individual faculty productivity time series, constructed from over 200,000 publications and matched with hiring data for 2,453 tenure-track faculty in all 205 PhD-granting computer science departments in the United States and Canada. Unlike prior studies, which considered only some faculty or some institutions, or lacked common career reference points, here we combine a large bibliographic dataset with comprehensive information on career transitions that covers an entire field of study. We show that the conventional narrative confidently describes only one-fifth of faculty, regardless of department prestige or researcher gender, and the remaining four-fifths of faculty exhibit a rich diversity of productivity patterns. To explain this diversity, we introduce a simple model of productivity trajectories and explore correlations between its parameters and researcher covariates, showing that departmental prestige predicts overall individual productivity and the timing of the transition from first- to last-author publications. These results demonstrate the unpredictability of productivity over time and open the door for new efforts to understand how environmental and individual factors shape scientific productivity.
The unequal impact of parenthood in academia
Across academia, men and women tend to publish at unequal rates. Existing explanations include the potentially unequal impact of parenthood on scholarship, but a lack of appropriate data has prevented its clear assessment. Here, we quantify the impact of parenthood on scholarship using an extensive survey of the timing of parenthood events, longitudinal publication data, and perceptions of research expectations among 3064 tenure-track faculty at 450 Ph.D.-granting computer science, history, and business departments across the United States and Canada, along with data on institution-specific parental leave policies. Parenthood explains most of the gender productivity gap by lowering the average short-term productivity of mothers, even as parents tend to be slightly more productive on average than nonparents. However, the size of the productivity penalty for mothers appears to have shrunk over time. Women report that paid parental leave and adequate childcare are important factors in their recruitment and retention. These results have broad implications for efforts to improve the inclusiveness of scholarship.
A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences
Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm has a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.
Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
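The representative-based procedure described in the Background of the abstract above (compare each new sequence with the cluster representatives, and open a new cluster when none is close enough) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the grammar-based distance, so a simple k-mer Jaccard distance is swapped in as a stand-in, and the function names, the threshold value, and the choice of the first member as representative are all hypothetical.

```python
from typing import Callable, List


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Stand-in distance (NOT the paper's grammar-based metric): 1 - Jaccard of k-mer sets."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a and not kmers_b:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def greedy_cluster(
    sequences: List[str],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this example
    dist: Callable[[str, str], float] = kmer_distance,
) -> List[List[str]]:
    """Assign each sequence to the nearest representative within threshold, else start a new cluster."""
    representatives: List[str] = []
    clusters: List[List[str]] = []
    for seq in sequences:
        best_index, best_distance = None, threshold
        for index, rep in enumerate(representatives):
            d = dist(seq, rep)
            if d <= best_distance:
                best_index, best_distance = index, d
        if best_index is None:
            # No suitable cluster found: the sequence becomes a new representative.
            representatives.append(seq)
            clusters.append([seq])
        else:
            clusters[best_index].append(seq)
    return clusters


example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
result = greedy_cluster(example)
```

With these toy inputs the first two sequences share most of their 3-mers and fall into one cluster, while the third starts its own. Any other distance function with the same signature can be passed as `dist`.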
RAIphy: Phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles
Background: Computational analysis of metagenomes requires the taxonomical assignment of the genome contigs assembled from DNA reads of environmental samples. Because of the diverse nature of microbiomes, the length of the assemblies obtained can vary from a few hundred bp to a few hundred Kbp. Current taxonomic classification algorithms provide accurate classification for long contigs or for short fragments from organisms that have close relatives with annotated genomes. These are significant limitations for metagenome analysis because of the complexity of microbiomes and the paucity of existing annotated genomes.
Results: We propose a robust taxonomic classification method, RAIphy, that uses a novel sequence similarity metric with iterative refinement of taxonomic models and functions effectively without these limitations. We have tested RAIphy with synthetic metagenomics data ranging from 100 bp to 50 Kbp. Within a sequence read range of 100 bp to 1000 bp, the sensitivity of RAIphy ranges from 38% to 81%, outperforming the currently popular composition-based methods for reads in this range. Comparison with computationally more intensive sequence similarity methods shows that RAIphy performs competitively while being significantly faster. The sensitivity-specificity characteristics for relatively longer contigs were compared with the PhyloPythia and TACOA algorithms. RAIphy performs better than these algorithms at varying clade-levels. For an acid mine drainage (AMD) metagenome, RAIphy was able to taxonomically bin the sequence read set more accurately than the currently available methods, Phymm and MEGAN, and more accurately in two out of three tests than the much more computationally intensive method, PhymmBL.
Conclusions: With the introduction of the relative abundance index metric and an iterative classification method, we propose a taxonomic classification algorithm that performs competitively for a large range of DNA contig lengths assembled from metagenome data. Because of its speed, simplicity, and accuracy, RAIphy can be successfully used in the binning process for a broad range of metagenomic data obtained from environmental samples.