29 research outputs found

    Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services

    Objective: Advances in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies. Methods: We provide a straightforward and innovative methodology for optimizing cloud configuration to conduct genome-wide association studies. We used Spark clusters on both Google Cloud Platform and Amazon Web Services, together with Hail (http://doi.org/10.5281/zenodo.2646680), to analyze and explore genomic variant datasets. Results: A comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics. Conclusions: We present a timely piece addressing one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?

    On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models

    A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate handling of inverted repeats, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. This measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which explore the extensive data repetitions in DNA sequences and therefore have a non-local character.
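The fit measure described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `information_profile`, the chosen orders, the additive smoothing parameter `alpha`, and the per-position "pick the best model" rule (which ignores the cost of signalling the choice and omits inverted-repeat handling) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def information_profile(seq, orders=(1, 2, 4), alpha=1.0):
    """Return per-base information (bits) under competing order-k models.

    Each model is adaptive: its counts are updated as the sequence is read.
    At every position the smallest -log2(p) among the models is kept
    (a simplified, hypothetical form of model competition).
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    # One count table per order: context string -> counts of the next symbol.
    counts = {k: defaultdict(lambda: [0] * len(ALPHABET)) for k in orders}
    profile = []
    for pos, base in enumerate(seq):
        sym = index[base]
        best_bits = float("inf")
        for k in orders:
            context = seq[max(0, pos - k):pos]
            table = counts[k][context]
            # Additive-smoothing probability estimate for the observed symbol.
            p = (table[sym] + alpha) / (sum(table) + alpha * len(ALPHABET))
            best_bits = min(best_bits, -math.log2(p))
            table[sym] += 1  # update the model after coding the symbol
        profile.append(best_bits)
    return profile

if __name__ == "__main__":
    dna = "ACGT" * 50
    profile = information_profile(dna)
    print("average bits per base:", sum(profile) / len(profile))
```

The list returned is the information profile; its mean is the average number of bits per base used as the global performance measure.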

    Comparative analysis of long DNA sequences by per element information content using different contexts

    BACKGROUND: Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X can reveal what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features exposed by such models. RESULTS: We apply such a model to find features across chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2. CONCLUSION: The methodology finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast, and the saved results are self-documenting.
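The idea of compressing Y alone versus in the context of X can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: a simple adaptive order-k context model stands in for the compression model, and the function name `bits_per_base` and its parameters are hypothetical.

```python
import math
from collections import defaultdict

def bits_per_base(target, context="", k=3, alpha=1.0, alphabet="ACGT"):
    """Average information (bits per base) of `target`.

    If `context` is given, the model first reads it, so the result
    estimates the cost of `target` given what `context` already revealed.
    """
    seq = context + target
    counts = defaultdict(lambda: defaultdict(int))  # k-mer -> next-base counts
    total_bits = 0.0
    for pos, base in enumerate(seq):
        kmer = seq[max(0, pos - k):pos]
        seen = counts[kmer]
        p = (seen[base] + alpha) / (sum(seen.values()) + alpha * len(alphabet))
        if pos >= len(context):  # only bases of the target are charged
            total_bits += -math.log2(p)
        seen[base] += 1
    return total_bits / len(target)

if __name__ == "__main__":
    y = "ACGTTGCA" * 20
    print("Y alone:  ", bits_per_base(y))
    print("Y given X:", bits_per_base(y, context=y))
```

A lower value for Y given X than for Y alone indicates that X carries information about Y; similar values suggest the two sequences share little.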

    Serum antibodies against genitourinary infectious agents in prostate cancer and benign prostate hyperplasia patients: a case-control study

    Background: Infection plays a role in the pathogenesis of many human malignancies. Whether prostate cancer (PCa) - an important health issue in the aging male population in the Western world - belongs to these conditions has been a matter of research since the 1970s. Persistent serum antibodies are a proof of present or past infection. The aim of this study was to compare serum antibodies against genitourinary infectious agents between PCa patients and controls with benign prostate hyperplasia (BPH). We hypothesized that elevated serum antibody levels or higher seroprevalence in PCa patients would suggest an association of genitourinary infection in patient history and elevated PCa risk. Methods: A total of 434 males who had undergone open prostate surgery in a single institution were included in the study: 329 PCa patients and 105 controls with BPH. The subjects' serum samples were analysed by means of enzyme-linked immunosorbent assay, complement fixation test and indirect immunofluorescence for the presence of antibodies against common genitourinary infectious agents: human papillomavirus (HPV) 6, 11, 16, 18, 31 and 33, herpes simplex virus (HSV) 1 and 2, human cytomegalovirus (CMV), Chlamydia trachomatis, Mycoplasma hominis, Ureaplasma urealyticum, Neisseria gonorrhoeae and Treponema pallidum. Antibody seroprevalence and mean serum antibody levels were compared between cases and controls. Tumour grade and stage were correlated with serological findings. Results: PCa patients were more likely to harbour antibodies against Ureaplasma urealyticum (odds ratio (OR) 2.06; 95% confidence interval (CI) 1.08-4.28). Men with BPH were more often seropositive for HPV 18 and Chlamydia trachomatis (OR 0.23; 95% CI 0.09-0.61 and OR 0.45; 95% CI 0.21-0.99, respectively) and had higher mean serum CMV antibody levels than PCa patients (p = 0.0004). Among PCa patients, antibodies against HPV 6 were associated with a higher Gleason score (p = 0.0305). Conclusions: Antibody seropositivity against the analyzed pathogens with the exception of Ureaplasma does not seem to be a risk factor for PCa pathogenesis. The presence or higher levels of serum antibodies against the genitourinary pathogens studied were not consistently associated with PCa. Serostatus was not a predictor of disease stage in the studied population.

    Contributions to lossless data compression


    Effects of clinical mastitis on ovarian function in postpartum dairy cows

    Mastitis-induced ovarian abnormalities were studied in a field trial. At 1-3 days after calving, cows of parity ≥2 that were not affected by chronic recurrent mastitis and whose individual milk somatic cell count (SCC) had been <400 000/ml in the previous lactation were enrolled in the study. Thereafter, milk samples were collected three times weekly for 95-100 days for progesterone (P4) assay. Individual P4 profiles were used to monitor ovarian cyclicity. When mastitis was diagnosed in the first 80 days post-partum (pp), clinical signs were recorded and scored, and aseptic milk samples were taken to identify the mastitis pathogens. Depending on the isolated pathogens, the cows were blocked into one of three sub-groups: those affected by Gram-positive (GP) bacteria, those affected by Gram-negative (GN) bacteria, and those with no detected pathogens (NDP). Cows suffering from any type of mastitis between days 15 and 28 (n = 27) showed a delay in the onset of ovarian cyclicity, and estrus was postponed compared to cows affected during the first 14 days pp (n = 59) and controls (n = 175) (38.6 ± 2.3 vs 33.4 ± 2.1 and 32.0 ± 1.0 days, respectively, for onset of ovarian cyclicity and 90.7 ± 2.5 vs 80.2 ± 2.8 and 83.9 ± 2.1 days, respectively, for estrus; both p < 0.05). The percentage of cows ovulating by day 28 was lower in those affected by mastitis between days 14 and 28 compared to cows affected between days 1 and 14 and controls (22.2% vs 47.5 and 50.3%, respectively; p < 0.05). A significantly higher rate of premature luteolysis was observed in GN + NDP mastitis compared to GP mastitis and healthy cows (46.7% vs 8.3 and 2.0%, respectively; p < 0.001). If the mastitis outbreak occurred during the follicular phase, the duration of this cycle segment was lengthened in GN + NDP mastitis compared to GP mastitis and healthy cows (10.8 ± 0.9 vs 7.9 ± 0.1 and 7.2 ± 0.1, respectively; p < 0.001). The results indicate that mastitis can affect the resumption of ovarian activity in pp dairy cows. Mastitis may also impair reproduction in cyclic cows: this effect can be the consequence of premature luteolysis or a prolonged follicular phase.

    Relationship between thyroid function and seasonal reproductive activity in mares
