
    When is it Biased? Assessing the Representativeness of Twitter's Streaming API

    Twitter has captured the interest of the scientific community not only for its massive user base and content, but also for its openness in sharing its data. Twitter shares a free 1% sample of its tweets through the "Streaming API", a service that returns a sample of tweets according to a set of parameters set by the researcher. Recently, research has pointed to evidence of bias in the data returned through the Streaming API, raising concerns about the integrity of this data service for use in research scenarios. While these results are important, the methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. In this work we tackle the problem of finding sample bias without the need for "gold standard" Firehose data. Namely, we focus on finding time periods in the Streaming API data where the trend of a hashtag is significantly different from its trend in the true activity on Twitter. We propose a solution that uses an open data source to find bias in the Streaming API. Finally, we assess the utility of the data source in sparse data situations and for users issuing the same query from different regions.

    Wind and Wave Extremes over the World Oceans from Very Large Ensembles

    Global return values of marine wind speed and significant wave height are estimated from very large aggregates of archived ensemble forecasts at +240-h lead time. Long lead time ensures that the forecasts represent independent draws from the model climate. Compared with ERA-Interim, a reanalysis, the ensemble yields higher return estimates for both wind speed and significant wave height. Confidence intervals are much tighter due to the large size of the dataset. The period (9 yrs) is short enough to be considered stationary even with climate change. Furthermore, the ensemble is large enough for non-parametric 100-yr return estimates to be made from order statistics. These direct return estimates compare well with extreme value estimates outside areas with tropical cyclones. Like any method employing modeled fields, the approach is sensitive to tail biases in the numerical model, but we find that the biases are moderate outside areas with tropical cyclones. Comment: 28 pages, 16 figures
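The idea of a non-parametric return estimate from order statistics can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the function name, the `draws_per_year` parameter, and the choice of the k-th largest value as the estimate are all hypothetical. If a sample is equivalent to T years of independent draws, a level with return period R is expected to be exceeded about T / R times, so a simple estimate is the order statistic at that rank from the top.

```python
def empirical_return_value(samples, draws_per_year, return_period_years):
    """Estimate a return value directly from order statistics.

    Assumption (hypothetical, for illustration): the samples are
    independent draws, with `draws_per_year` of them making up one
    equivalent year of data.
    """
    equivalent_years = len(samples) / draws_per_year
    if equivalent_years < return_period_years:
        raise ValueError("sample too short for a direct estimate")
    # Expected number of exceedances of the return level in the sample.
    k = int(equivalent_years // return_period_years)
    # Use the k-th largest value as the estimate of the return level.
    ordered = sorted(samples, reverse=True)
    return ordered[k - 1]
```

With 1000 equivalent years of data, for example, the 100-yr estimate under this sketch is the 10th largest value in the sample.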

    Estimation of species relative abundances and habitat preferences using opportunistic data

    We develop a new statistical procedure to monitor, with opportunistic data, relative species abundances and their respective preferences for different habitat types. Following Giraud et al. (2015), we combine the opportunistic data with some standardized data in order to correct the bias inherent to the opportunistic data collection. Our main contributions are (i) to tackle the bias induced by habitat selection behaviors, (ii) to handle data where the habitat type associated with each observation is unknown, (iii) to estimate probabilities of selection of habitat for the species. As an illustration, we estimate common bird species habitat preferences and abundances in the region of Aquitaine (France).

    A global profile of replicative polymerase usage

    Three eukaryotic DNA polymerases are essential for genome replication. Polymerase (Pol) α–primase initiates each synthesis event and is rapidly replaced by processive DNA polymerases: Polɛ replicates the leading strand, whereas Polδ performs lagging-strand synthesis. However, it is not known whether this division of labor is maintained across the whole genome or how uniform it is within single replicons. Using Schizosaccharomyces pombe, we have developed a polymerase usage sequencing (Pu-seq) strategy to map polymerase usage genome wide. Pu-seq provides direct replication-origin location and efficiency data and indirect estimates of replication timing. We confirm that the division of labor is broadly maintained across an entire genome. However, our data suggest a subtle variability in the usage of the two polymerases within individual replicons. We propose that this results from occasional leading-strand initiation by Polδ followed by exchange for Polɛ.

    Discordance Between Mitochondrial and Nuclear Contact Zones Within Antelope Ground Squirrels (Ammospermophilus)

    A common biogeographic pattern found in many co-distributed species along the Baja California peninsula is the genetic divergence in the Vizcaíno Desert. This separation is hypothesized to have been caused by a mid-peninsular seaway that formed during the late Miocene-middle Pleistocene and later dried, allowing contact again between formerly isolated populations. Previous phylogeographic studies on the antelope ground squirrel (Ammospermophilus leucurus) show a mitochondrial DNA break through the middle of the peninsula. We investigated (1) whether the mitochondrial pattern of divergence and secondary contact between the northern and southern Ammospermophilus clades is consistent with results from genome-wide nuclear data and (2) whether genetic admixture is occurring. One hundred thirty-three samples were collected spanning from the northwest US south into the Baja California peninsula and pooled using a ddRADseq protocol. Our nuclear DNA analyses show a 335 km divergence between the two contact zones and low levels of admixture. Several individuals belonging to the southern clade have a northern mitochondrial haplotype, suggesting introgression. This introgression and lack of admixture suggest that there may have been ancestral hybridization between the now reproductively isolated populations. No embargo. Academic Major: Zoology

    Generating social network data using partially described networks: an example informing avian influenza control in the British poultry industry

    Background: Targeted sampling can capture the characteristics of more vulnerable sectors of a population, but may bias the picture of population level disease risk. When sampling network data, an incomplete description of the population may arise leading to biased estimates of between-host connectivity. Avian influenza (AI) control planning in Great Britain (GB) provides one example where network data for the poultry industry (the Poultry Network Database or PND) targeted large premises and is consequently demographically biased. Exposing the effect of such biases on the geographical distribution of network properties could help target future poultry network data collection exercises. These data will be important for informing the control of potential future disease outbreaks. Results: The PND was used to compute between-farm association frequencies, assuming that farms sharing the same slaughterhouse or catching company, or through integration, are potentially epidemiologically linked. The fitted statistical models were extrapolated to the Great Britain Poultry Register (GBPR); this dataset is more representative of the poultry industry but lacks network information. This comparison showed how systematic biases in the demographic characterisation of a network, resulting from targeted sampling procedures, can bias the derived picture of between-host connectivity within the network. Conclusions: With particular reference to the predictive modeling of AI in GB, we find significantly different connectivity patterns across GB when network estimates incorporate the more demographically representative information provided by the GBPR; this has not been accounted for by previous epidemiological analyses. We recommend ranking geographical regions, based on relative confidence in extrapolated estimates, for prioritising further data collection. Evaluating whether and how the between-farm association frequencies impact on the risk of between-farm transmission will be the focus of future work.
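The linking assumption in the Results (farms that share a slaughterhouse or catching company are potentially epidemiologically linked) can be illustrated with a minimal sketch. The data layout, the function names, and the example identifiers are hypothetical and are not taken from the PND; the statistical models fitted in the study are not reproduced here.

```python
from itertools import combinations


def linked_pairs(farm_contacts):
    """Return the set of farm pairs that share at least one contact.

    `farm_contacts` is a hypothetical mapping from a farm identifier to
    the set of slaughterhouses or catching companies that farm uses.
    """
    pairs = set()
    for a, b in combinations(sorted(farm_contacts), 2):
        # A non-empty intersection means the two farms share a contact.
        if farm_contacts[a] & farm_contacts[b]:
            pairs.add((a, b))
    return pairs


def association_counts(farm_contacts):
    """Count, for each farm, how many other farms it is linked to."""
    counts = {farm: 0 for farm in farm_contacts}
    for a, b in linked_pairs(farm_contacts):
        counts[a] += 1
        counts[b] += 1
    return counts
```

A farm with no recorded contacts ends up with a count of zero, which is one way an incompletely sampled network understates connectivity.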

    Nonparametric Transient Classification using Adaptive Wavelets

    Classifying transients based on multi band light curves is a challenging but crucial problem in the era of GAIA and LSST since the sheer volume of transients will make spectroscopic classification unfeasible. Here we present a nonparametric classifier that uses the transient's light curve measurements to predict its class given training data. It implements two novel components: the first is the use of the BAGIDIS wavelet methodology - a characterization of functional data using hierarchical wavelet coefficients. The second novelty is the introduction of a ranked probability classifier on the wavelet coefficients that handles both the heteroscedasticity of the data and the potential non-representativity of the training set. The ranked classifier is simple and quick to implement, while a major advantage of the BAGIDIS wavelets is that they are translation invariant, hence they do not need the light curves to be aligned to extract features. Further, BAGIDIS is nonparametric so it can be used for blind searches for new objects. We demonstrate the effectiveness of our ranked wavelet classifier against the well-tested Supernova Photometric Classification Challenge dataset in which the challenge is to correctly classify light curves as Type Ia or non-Ia supernovae. We train our ranked probability classifier on the spectroscopically-confirmed subsample (which is not representative) and show that it gives good results for all supernovae with observed light curve timespans greater than 100 days (roughly 55% of the dataset). For such data, we obtain a Ia efficiency of 80.5% and a purity of 82.4%, yielding a highly competitive score of 0.49 whilst implementing a truly "model-blind" approach to supernova classification. Consequently this approach may be particularly suitable for the classification of astronomical transients in the era of large synoptic sky surveys. Comment: 14 pages, 8 figures. Published in MNRAS
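The efficiency, purity, and score quoted above can be related with a short sketch. Efficiency and purity follow their usual definitions in terms of true positives, false negatives, and false positives. The score function below is an assumption: it multiplies efficiency by a pseudo-purity in which false positives are penalised by a weight, and the default weight of 3 is not stated in the abstract. All function names are hypothetical.

```python
def efficiency(true_pos, false_neg):
    """Fraction of true Type Ia supernovae that are classified as Ia."""
    return true_pos / (true_pos + false_neg)


def purity(true_pos, false_pos):
    """Fraction of objects classified as Ia that really are Ia."""
    return true_pos / (true_pos + false_pos)


def score_from_counts(true_pos, false_pos, false_neg, false_pos_weight=3.0):
    """Efficiency times pseudo-purity (assumed form of the score)."""
    pseudo_purity = true_pos / (true_pos + false_pos_weight * false_pos)
    return efficiency(true_pos, false_neg) * pseudo_purity


def score_from_rates(eff, pur, false_pos_weight=3.0):
    """Same score, written in terms of efficiency and purity.

    Uses false_pos / true_pos = 1 / purity - 1.
    """
    return eff / (1.0 + false_pos_weight * (1.0 / pur - 1.0))
```

Under this assumed weighting, `score_from_rates(0.805, 0.824)` comes out close to the 0.49 reported in the abstract.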