When is it Biased? Assessing the Representativeness of Twitter's Streaming API
Twitter has captured the interest of the scientific community not only for
its massive user base and content, but also for its openness in sharing its
data. Twitter shares a free 1% sample of its tweets through the "Streaming
API", a service that returns a sample of tweets according to a set of
parameters set by the researcher. Recently, research has pointed to evidence of
bias in the data returned through the Streaming API, raising concerns about the
integrity of this data service for use in research scenarios. While these
results are important, the methodologies proposed in previous work rely on the
restrictive and expensive Firehose to find the bias in the Streaming API data.
In this work we tackle the problem of finding sample bias without the need for
"gold standard" Firehose data. Namely, we focus on finding time periods in the
Streaming API data where the trend of a hashtag is significantly different from
its trend in the true activity on Twitter. We propose a solution that focuses
on using an open data source to find bias in the Streaming API. Finally, we
assess the utility of the data source in sparse data situations and for users
issuing the same query from different regions.
Wind and Wave Extremes over the World Oceans from Very Large Ensembles
Global return values of marine wind speed and significant wave height are
estimated from very large aggregates of archived ensemble forecasts at +240-h
lead time. Long lead time ensures that the forecasts represent independent
draws from the model climate. Compared with ERA-Interim, a reanalysis, the
ensemble yields higher return estimates for both wind speed and significant
wave height. Confidence intervals are much tighter due to the large size of the
dataset. The period (9 yrs) is short enough to be considered stationary even
with climate change. Furthermore, the ensemble is large enough for
non-parametric 100-yr return estimates to be made from order statistics. These
direct return estimates compare well with extreme value estimates outside areas
with tropical cyclones. Like any method employing modeled fields, it is
sensitive to tail biases in the numerical model, but we find that the biases
are moderate outside areas with tropical cyclones.
Comment: 28 pages, 16 figures
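The direct, non-parametric return estimate mentioned in this abstract can be illustrated with a minimal sketch. It assumes the pooled ensemble values are independent and together represent a known number of equivalent years of data. The function name `direct_return_value` and the parameter `equivalent_years` are hypothetical, and taking the k-th largest value is one simple convention rather than necessarily the authors' exact estimator.

```python
import numpy as np

def direct_return_value(samples, equivalent_years, return_period=100.0):
    """Estimate a return value directly from order statistics.

    Assumption: `samples` are independent draws that together represent
    `equivalent_years` years of data. The T-year level is then expected
    to be exceeded about equivalent_years / T times in the pooled sample.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    if equivalent_years < return_period:
        raise ValueError("sample too short for a direct estimate")
    # Expected number of exceedances of the T-year level in the sample.
    k = max(int(round(equivalent_years / return_period)), 1)
    if k > x.size:
        raise ValueError("not enough samples for the requested return period")
    # Simple convention: take the k-th largest value as the estimate.
    return x[-k]

# Toy data: 1000 "annual maxima" with values 1..1000.
print(direct_return_value(np.arange(1, 1001), equivalent_years=1000.0))  # 991.0
```

No distribution is fitted here, which is why such an estimate is only possible when the pooled sample covers many times the return period.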
Estimation of species relative abundances and habitat preferences using opportunistic data
We develop a new statistical procedure to monitor, with opportunistic data,
relative species abundances and their respective preferences for different
habitat types. Following Giraud et al. (2015), we combine the opportunistic
data with some standardized data in order to correct the bias inherent to the
opportunistic data collection. Our main contributions are (i) to tackle the
bias induced by habitat selection behaviors, (ii) to handle data where the
habitat type associated to each observation is unknown, (iii) to estimate
probabilities of selection of habitat for the species. As an illustration, we
estimate common bird species habitat preferences and abundances in the region
of Aquitaine (France).
A global profile of replicative polymerase usage
Three eukaryotic DNA polymerases are essential for genome replication. Polymerase (Pol) α–primase initiates each synthesis event and is rapidly replaced by processive DNA polymerases: Polɛ replicates the leading strand, whereas Polδ performs lagging-strand synthesis. However, it is not known whether this division of labor is maintained across the whole genome or how uniform it is within single replicons. Using Schizosaccharomyces pombe, we have developed a polymerase usage sequencing (Pu-seq) strategy to map polymerase usage genome wide. Pu-seq provides direct replication-origin location and efficiency data and indirect estimates of replication timing. We confirm that the division of labor is broadly maintained across an entire genome. However, our data suggest a subtle variability in the usage of the two polymerases within individual replicons. We propose that this results from occasional leading-strand initiation by Polδ followed by exchange for Polɛ.
Discordance Between Mitochondrial and Nuclear Contact Zones Within Antelope Ground Squirrels (Ammospermophilus)
A common biogeographic pattern found in many co-distributed species along the Baja California peninsula is the genetic divergence in the Vizcaíno Desert. This separation is hypothesized to have been caused by a mid-peninsular seaway that formed during the late Miocene-middle Pleistocene and later dried, allowing contact again between formerly isolated populations. Previous phylogeographic studies on the antelope ground squirrel (Ammospermophilus leucurus) show a mitochondrial DNA break through the middle of the peninsula. We investigated (1) whether the mitochondrial pattern of divergence and secondary contact between the northern and southern Ammospermophilus clades is consistent with results from genome-wide nuclear data and (2) whether genetic admixture is occurring. One hundred thirty-three samples, spanning from the northwest US south into the Baja California peninsula, were collected and pooled using a ddRADseq protocol. Our nuclear DNA analyses show a 335 km divergence between the two contact zones and low levels of admixture. Several individuals belonging to the southern clade have a northern mitochondrial haplotype, suggesting introgression. This introgression and lack of admixture suggest that there may have been ancestral hybridization between the now reproductively isolated populations.
Academic Major: Zoology
Generating social network data using partially described networks: an example informing avian influenza control in the British poultry industry
<p>Background: Targeted sampling can capture the characteristics of more vulnerable sectors of a population, but may bias the picture of population-level disease risk. When sampling network data, an incomplete description of the population may arise, leading to biased estimates of between-host connectivity. Avian influenza (AI) control planning in Great Britain (GB) provides one example, where the network data for the poultry industry (the Poultry Network Database, or PND) targeted large premises and are consequently demographically biased. Exposing the effect of such biases on the geographical distribution of network properties could help target future poultry network data collection exercises. These data will be important for informing the control of potential future disease outbreaks.</p>
<p>Results: The PND was used to compute between-farm association frequencies, assuming that farms that share the same slaughterhouse or catching company, or that are linked through integration, are potentially epidemiologically linked. The fitted statistical models were extrapolated to the Great Britain Poultry Register (GBPR); this dataset is more representative of the poultry industry but lacks network information. This comparison showed how systematic biases in the demographic characterisation of a network, resulting from targeted sampling procedures, can bias the derived picture of between-host connectivity within the network.</p>
<p>Conclusions: With particular reference to the predictive modeling of AI in GB, we find significantly different connectivity patterns across GB when network estimates incorporate the more demographically representative information provided by the GBPR; this has not been accounted for by previous epidemiological analyses. We recommend ranking geographical regions, based on relative confidence in extrapolated estimates, for prioritising further data collection. Evaluating whether and how the between-farm association frequencies impact on the risk of between-farm transmission will be the focus of future work.</p>
Nonparametric Transient Classification using Adaptive Wavelets
Classifying transients based on multi-band light curves is a challenging but
crucial problem in the era of GAIA and LSST since the sheer volume of
transients will make spectroscopic classification unfeasible. Here we present a
nonparametric classifier that uses the transient's light curve measurements to
predict its class given training data. It implements two novel components: the
first is the use of the BAGIDIS wavelet methodology - a characterization of
functional data using hierarchical wavelet coefficients. The second novelty is
the introduction of a ranked probability classifier on the wavelet coefficients
that handles both the heteroscedasticity of the data and the
potential non-representativity of the training set. The ranked classifier is
simple and quick to implement while a major advantage of the BAGIDIS wavelets
is that they are translation invariant, hence they do not need the light curves
to be aligned to extract features. Further, BAGIDIS is nonparametric so it can
be used for blind searches for new objects. We demonstrate the effectiveness of
our ranked wavelet classifier against the well-tested Supernova Photometric
Classification Challenge dataset in which the challenge is to correctly
classify light curves as Type Ia or non-Ia supernovae. We train our ranked
probability classifier on the spectroscopically-confirmed subsample (which is
not representative) and show that it gives good results for all supernovae with
observed light curve timespans greater than 100 days (roughly 55% of the
dataset). For such data, we obtain a Ia efficiency of 80.5% and a purity of
82.4% yielding a highly competitive score of 0.49 whilst implementing a truly
"model-blind" approach to supernova classification. Consequently this approach
may be particularly suitable for the classification of astronomical transients
in the era of large synoptic sky surveys.
Comment: 14 pages, 8 figures. Published in MNRAS
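The abstract does not say how the score of 0.49 is computed. As a hedged check, the sketch below assumes it is the Supernova Photometric Classification Challenge figure of merit: efficiency multiplied by a pseudo-purity in which false positives are penalised by a weight of 3. The function name and the choice of weight are assumptions made for illustration; under them, the quoted efficiency and purity do reproduce the quoted score.

```python
def snpcc_figure_of_merit(efficiency, purity, false_positive_weight=3.0):
    """Figure of merit = efficiency * pseudo-purity (assumed definition).

    purity        = TP / (TP + FP)
    pseudo-purity = TP / (TP + W * FP), with W = false_positive_weight
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    # Ratio FP / TP recovered from the ordinary purity.
    fp_per_tp = (1.0 - purity) / purity
    pseudo_purity = 1.0 / (1.0 + false_positive_weight * fp_per_tp)
    return efficiency * pseudo_purity

# Values quoted in the abstract: efficiency 80.5%, purity 82.4%.
score = snpcc_figure_of_merit(0.805, 0.824)
print(round(score, 2))  # 0.49
```

With a weight of 1 the pseudo-purity reduces to the ordinary purity, so the weight is what pulls the product of 0.805 and 0.824 down to roughly 0.49.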