    mGene.web: a web service for accurate computational gene finding

    We describe mGene.web, a web service for the genome-wide prediction of protein-coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures, including untranslated regions, in an increasing number of organisms. With mGene.web, users can also train the system on their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, so that individual components of the framework, such as the promoter prediction tool or the splice site predictor, can be used on their own. The underlying gene-finding system, mGene, is based on discriminative machine learning techniques, and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web; it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).

    Transcript quantification with RNA-Seq data

    Motivation: Novel high-throughput sequencing technologies open exciting new approaches to transcriptome profiling. Sequencing transcript populations of interest, e.g. from different tissues or variable stress conditions, with RNA sequencing (RNA-Seq) [1] generates millions of short reads. Accurately aligned to a reference genome, they provide digital counts and thus facilitate transcript quantification. Because the observed read counts only provide the sum over all sequences expressed at one locus, inferring the underlying transcript abundances is crucial for further quantitative analyses. Methods: To approach this problem, we have developed a new technique, called rQuant, based on quadratic programming. Given a gene annotation and position-wise exon/intron read coverage from read alignments, we determine the abundance of each annotated transcript by minimising a suitable loss function, which penalises the deviation of the observed read coverage from the coverage expected given the transcript weights. The observed read coverage is typically non-uniformly distributed over the transcript due to several biases in the generation of the sequencing libraries and in the sequencing itself. If not corrected properly, this distorts the estimated transcript abundances. We therefore extended our approach to jointly optimise transcript profiles that model the coverage deviations as a function of the position in the transcript. Our method can be applied without knowledge of the underlying transcript abundances and benefits equally from loci with and without alternative transcripts. Results: To quantitatively evaluate the quality of our abundance predictions, we used a set of simulated reads from transcripts with known expression as a benchmark set. It was generated using the Flux Simulator [2], which models biases in RNA-Seq as well as preparation experiments. Table 1 shows preliminary results with segment- and position-based loss as well as with and without the transcript profiles. Our results indicate that position-based modeling together with transcript profiles allows us to accurately infer the underlying expression of single transcripts as well as of multiple isoforms of one gene locus.
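The core idea of the Methods paragraph, fitting non-negative transcript weights so that the expected coverage matches the observed coverage under a quadratic loss, can be illustrated with a small sketch. This is not the rQuant implementation: it omits the transcript profiles and bias correction, it uses plain projected gradient descent rather than a quadratic programming solver, and the function name `quantify`, the toy exon layout, and the coverage values are all assumptions made for illustration.

```python
def quantify(coverage, transcripts, iters=2000):
    """Estimate non-negative transcript weights w that minimise
    sum_p (coverage[p] - sum_t w[t] * transcripts[t][p]) ** 2
    using projected gradient descent (illustrative only)."""
    n_t = len(transcripts)
    n_p = len(coverage)
    # Safe step size: the gradient's Lipschitz constant is at most
    # twice the squared Frobenius norm of the transcript matrix.
    lipschitz = 2.0 * sum(v * v for t in transcripts for v in t)
    step = 1.0 / lipschitz
    w = [0.0] * n_t
    for _ in range(iters):
        # Coverage expected at each position under the current weights.
        expected = [sum(w[t] * transcripts[t][p] for t in range(n_t))
                    for p in range(n_p)]
        resid = [expected[p] - coverage[p] for p in range(n_p)]
        grad = [2.0 * sum(resid[p] * transcripts[t][p] for p in range(n_p))
                for t in range(n_t)]
        # Gradient step followed by projection onto w >= 0.
        w = [max(0.0, w[t] - step * grad[t]) for t in range(n_t)]
    return w


# Toy locus with six positions (three exons of two positions each).
# Transcript A uses all exons; transcript B skips the middle exon.
transcript_a = [1, 1, 1, 1, 1, 1]
transcript_b = [1, 1, 0, 0, 1, 1]
# Hypothetical coverage generated from abundances 3 (A) and 5 (B).
observed = [8, 8, 3, 3, 8, 8]

weights = quantify(observed, [transcript_a, transcript_b])
```

In this toy case the middle exon is covered only by transcript A, which is what makes the two abundances separable from the summed coverage.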

    Reproducing Kernels of Generalized Sobolev Spaces via a Green Function Approach with Distributional Operators

    In this paper we introduce a generalized Sobolev space by defining a semi-inner product formulated in terms of a vector distributional operator $\mathbf{P}$ consisting of finitely or countably many distributional operators $P_n$, which are defined on the dual space of the Schwartz space. The types of operators we consider include not only differential operators, but also more general distributional operators such as pseudo-differential operators. We deduce that a certain appropriate full-space Green function $G$ with respect to $L := \mathbf{P}^{\ast T}\mathbf{P}$ now becomes a conditionally positive definite function. In order to support this claim we ensure that the distributional adjoint operator $\mathbf{P}^{\ast}$ of $\mathbf{P}$ is well-defined in the distributional sense. Under sufficient conditions, the native space (reproducing-kernel Hilbert space) associated with the Green function $G$ can be isometrically embedded into or even be isometrically equivalent to a generalized Sobolev space. As an application, we take linear combinations of translates of the Green function with possibly added polynomial terms and construct a multivariate minimum-norm interpolant $s_{f,X}$ to data values sampled from an unknown generalized Sobolev function $f$ at data sites located in some set $X \subset \mathbb{R}^d$. We provide several examples, such as Matérn kernels or Gaussian kernels, that illustrate how many reproducing-kernel Hilbert spaces of well-known reproducing kernels are isometrically equivalent to a generalized Sobolev space. These examples further illustrate how we can rescale the Sobolev spaces by the vector distributional operator $\mathbf{P}$. Introducing the notion of scale as part of the definition of a generalized Sobolev space may help us to choose the "best" kernel function for kernel-based approximation methods. Comment: Updated version of the publication in Num. Math., close to Qi Ye's Ph.D. thesis (http://mypages.iit.edu/~qye3/PhdThesis-2012-AMS-QiYe-IIT.pdf)
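For readers unfamiliar with the construction, the interpolant described in the abstract ("linear combinations of translates of the Green function with possibly added polynomial terms") usually takes the following standard form for conditionally positive definite kernels. The abstract does not write this out, so the notation is an assumption: $N$ is the number of data sites $x_j \in X$, $\{p_k\}_{k=1}^{Q}$ is a basis of the added polynomial space, and $c_j$, $\beta_k$ are the unknown coefficients.

```latex
\begin{align*}
  s_{f,X}(x) &= \sum_{j=1}^{N} c_j \, G(x - x_j) + \sum_{k=1}^{Q} \beta_k \, p_k(x),
      & x &\in \mathbb{R}^d, \\
  s_{f,X}(x_j) &= f(x_j), & j &= 1, \dots, N, \\
  \sum_{j=1}^{N} c_j \, p_k(x_j) &= 0, & k &= 1, \dots, Q.
\end{align*}
```

The second line is the interpolation condition and the third is the usual side condition that makes the linear system uniquely solvable; when $G$ is positive definite (for example a Gaussian kernel), the polynomial part can be dropped.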

    Androgen receptor abnormalities

    The human androgen receptor is a member of the superfamily of steroid hormone receptors. Proper functioning of this protein is a prerequisite for normal male sexual differentiation and development. The cloning of the human androgen receptor cDNA and the elucidation of the genomic organization of the corresponding gene have enabled us to study androgen receptors in subjects with the clinical manifestation of androgen insensitivity and in a human prostate carcinoma cell line (LNCaP). Using PCR amplification, subcloning and sequencing of exons 2–8, we identified a G → T mutation in the androgen receptor gene of a subject with the complete form of androgen insensitivity, which inactivates the splice donor site at the exon 4/intron 4 boundary. This mutation causes the activation of a cryptic splice donor site in exon 4, which results in the deletion of 41 amino acids from the steroid binding domain. In two other independently arising cases we identified two different nucleotide alterations in codon 686 (GAC; aspartic acid) located in exon 4. One mutation (G → C) results in an aspartic acid → histidine substitution (with negligible androgen binding), whereas the other mutation (G → A) leads to an aspartic acid → asparagine substitution (normal androgen binding, but a rapidly dissociating androgen receptor complex). Sequence analysis of the androgen receptor in human LNCaP cells (lymph node carcinoma of the prostate) revealed a point mutation (A → G) in codon 868 in exon 8, resulting in the substitution of threonine by alanine. This mutation is the cause of the altered steroid binding specificity of the LNCaP-cell androgen receptor. The functional consequences of the observed mutations with respect to protein expression, specific ligand binding and transcriptional activation were established after transient expression of the mutant receptors in COS and HeLa cells. These findings illustrate that functional error

    Variation of health-related quality of life assessed by caregivers and patients affected by severe childhood infections.

    BACKGROUND: The agreement between self-reported and proxy measures of health status in ill children is not well established. This study aimed to quantify the variation in health-related quality of life (HRQOL) derived from young patients and their carers using different instruments. METHODS: A hospital-based cross-sectional survey was conducted between August 2010 and March 2011. Children aged between five and 14 years with meningitis, bacteremia, pneumonia, acute otitis media, hearing loss, chronic lung disease, epilepsy, mild mental retardation, severe mental retardation, or mental retardation combined with epilepsy were selected from seven tertiary hospitals for participation in this study. The Health Utilities Index Mark 2 (HUI2) and Mark 3 (HUI3), the EuroQoL Descriptive System (EQ-5D) and the Visual Analogue Scale (EQ-VAS) were applied to both paediatric patients (self-assessment) and caregivers (proxy-assessment). RESULTS: The EQ-5D scores were lowest for acute conditions such as meningitis, bacteremia, and pneumonia, whereas the HUI3 scores were lowest for most chronic conditions such as hearing loss and severe mental retardation. Comparing patient and proxy scores (n = 74), the EQ-5D exhibited high correlation (r = 0.77), while for the HUI2 and HUI3 patient and caregiver scores were moderately correlated (r = 0.58 and 0.67, respectively). The mean differences between self- and proxy-assessment using the HUI2, HUI3, EQ-5D and EQ-VAS scores were 0.03, 0.05, -0.03 and -0.02, respectively. In hearing-impaired and chronic lung patients the self-rated HRQOL differed significantly from that rated by their caregivers. CONCLUSIONS: The use of caregivers as proxies for measuring HRQOL in young patients affected by pneumococcal infection and its sequelae should be employed with caution. Given the high correlation between instruments, each of the HRQOL instruments appears acceptable apart from the EQ-VAS, which exhibited low correlation with the others.

    DGW: an exploratory data analysis tool for clustering and visualisation of epigenomic marks

    Background: Functional genomic and epigenomic research relies fundamentally on sequencing-based methods like ChIP-seq for the detection of DNA-protein interactions. These techniques return large, high-dimensional data sets with visually complex structures, such as multi-modal peaks extended over large genomic regions. Current tools for visualisation and data exploration represent and leverage these complex features only to a limited extent. Results: We present DGW, an open source software package for simultaneous alignment and clustering of multiple epigenomic marks. DGW uses Dynamic Time Warping to adaptively rescale and align genomic distances, which allows regions of interest with similar shapes to be grouped, thereby capturing the structure of epigenomic marks. We demonstrate the effectiveness of the approach in a simulation study and on a real epigenomic data set from the ENCODE project. Conclusions: Our results show that DGW automatically recognises and aligns important genomic features such as transcription start sites and splicing sites from histone marks. DGW is available as an open source Python package.
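To make the role of Dynamic Time Warping concrete, here is a minimal textbook sketch of the DTW distance between two one-dimensional signal profiles. It is not taken from the DGW package: the function name `dtw_distance`, the absolute-difference local cost, and the toy profiles are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two sequences,
    using the absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats the match with b[j-1]
                cost[i][j - 1],      # b[j-1] repeats the match with a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# A peak and a stretched copy of the same peak (hypothetical profiles).
peak = [0, 1, 3, 1, 0]
stretched = [0, 0, 1, 3, 3, 1, 0]

same_shape = dtw_distance(peak, stretched)
different = dtw_distance([1, 2, 3], [1, 2, 4])
```

Because warping lets one position align with several positions in the other profile, the stretched peak matches the original at zero cost, which is the property that lets regions of different lengths but similar shape be grouped together.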

    Adiposity has differing associations with incident coronary heart disease and mortality in the Scottish population: cross-sectional surveys with follow-up

    Objective: Investigation of the association of excess adiposity with three different outcomes: all-cause mortality, coronary heart disease (CHD) mortality and incident CHD. Design: Cross-sectional surveys linked to hospital admissions and death records. Subjects: 19 329 adults (aged 18–86 years) from a representative sample of the Scottish population. Measurements: Gender-stratified Cox proportional hazards models were used to estimate hazard ratios (HRs) for all-cause mortality, CHD mortality and incident CHD. Separate models incorporating the anthropometric measurements body mass index (BMI), waist circumference (WC) or waist–hip ratio (WHR) were created, adjusted for age, year of survey, smoking status and alcohol consumption. Results: For both genders, BMI-defined obesity (≥30 kg m−2) was not associated with an increased risk of either all-cause mortality or CHD mortality. However, there was an increased risk of incident CHD among the obese men (HR=1.78; 95% confidence interval=1.37–2.31) and obese women (HR=1.93; 95% confidence interval=1.44–2.59). There was a similar pattern for WC with regard to the three outcomes; for incident CHD, the HR was 1.70 (1.35–2.14) for men and 1.71 (1.28–2.29) for women in the highest WC category (men ≥102 cm, women ≥88 cm), synonymous with abdominal obesity. For men, the highest category of WHR (≥1.0) was associated with an increased risk of all-cause mortality (1.29; 1.04–1.60) and incident CHD (1.55; 1.19–2.01). Among women with a high WHR (≥0.85) there was an increased risk of all outcomes: all-cause mortality (1.56; 1.26–1.94), CHD mortality (2.49; 1.36–4.56) and incident CHD (1.76; 1.31–2.38). Conclusions: In this study excess adiposity was associated with an increased risk of incident CHD but not necessarily death. One possibility is that modern medical intervention has contributed to improved survival of first CHD events. The future health burden of increased obesity levels may manifest as an increase in the prevalence of individuals living with CHD and its consequences.

    Exploiting physico-chemical properties in string kernels

    Background: String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major shortcoming: they ignore an important piece of information when comparing amino acids, namely physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas. Results: We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with those of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position-specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that the incorporation of amino acid properties in string kernels yields improved performance compared to standard string kernels and to previously proposed non-substring kernels. Conclusions: In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference. Availability: Data sets, code and additional information are available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask. Implementations of the developed kernels are available as part of the Shogun toolbox.
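As background for the "substring spectrum" mentioned above, the following sketch shows the standard spectrum kernel that such work builds on: two sequences are compared by counting the k-mers they share, with exact matching only. It does not implement the property-aware kernels proposed in the paper and does not use the Shogun toolbox; the function names `spectrum` and `spectrum_kernel` and the toy sequences are assumptions for illustration.

```python
from collections import Counter


def spectrum(seq, k):
    """Count every contiguous substring of length k (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=2):
    """Standard spectrum kernel: inner product of the k-mer count vectors."""
    counts_s, counts_t = spectrum(s, k), spectrum(t, k)
    # Counter returns 0 for k-mers absent from t, so only shared k-mers add.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


# Toy sequences over a two-letter alphabet (hypothetical data).
k_cross = spectrum_kernel("ABAB", "BABA", k=2)
k_self = spectrum_kernel("ABAB", "ABAB", k=2)
```

Exact matching treats any two different amino acids as equally dissimilar, which is the shortcoming the abstract points out.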

    Inferring latent task structure for Multitask Learning by Multiple Kernel Learning

    Background: The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics, many problems can be cast into the Multitask Learning scenario by incorporating data from several organisms. However, combining information from several tasks requires careful consideration of the degree of similarity between tasks. Our proposed method simultaneously learns or refines the similarity between tasks along with the Multitask Learning classifier. This is done by formulating the Multitask Learning problem as Multiple Kernel Learning, using the recently published q-Norm MKL algorithm. Results: We demonstrate the performance of our method on two problems from Computational Biology. First, we show that our method is able to improve performance on a splice site dataset with a given hierarchical task structure by refining the task relationships. Second, we consider an MHC-I dataset, for which we assume no knowledge about the degree of task relatedness. Here, we are able to learn the task similarities ab initio along with the Multitask classifiers. In both cases, we outperform the baseline methods that we compare against. Conclusions: We present a novel approach to Multitask Learning that is capable of learning task similarity along with the classifiers. The framework is very general, as it allows prior knowledge about task relationships to be incorporated if available, but is also able to identify task similarities in the absence of such prior information. Both variants show promising results in applications from Computational Biology.