1,140 research outputs found

    Kernels for Protein Homology Detection

    Determining protein sequence similarity is an important task for protein classification and homology detection, which is typically performed using sequence alignment algorithms. Fast and accurate alignment-free kernel-based classifiers exist that treat protein sequences as a “bag of words”. Kernels implicitly map the sequences to a high-dimensional feature space, and can be thought of as an inner product between two vectors in that space. This allows an algorithm that can be expressed purely in terms of inner products to be ‘kernelised’, where the algorithm implicitly operates in the kernel’s feature space. A weighted string kernel, where the weighting is derived using probabilistic methods, is implemented using a binary data representation, and the results reported. Alternative forms of data representation, such as Ising and frequency forms, are implemented and the results discussed. These results are then used to inform the development of a variety of novel kernels for protein sequence comparison. Alternative forms of classifier are investigated, such as nearest neighbour, support vector machines, and multiple kernel learning. A kernelised Gaussian classifier is derived and tested, which is informative as it returns a score related to the probability of a sequence belonging to a particular classification. Support vector machines are tested with the introduced kernels, and the results compared to alternative classifiers. As similarity can be thought of as having different components, such as composition and position, multiple kernel learning is investigated with the novel kernels developed here. The results show that a support vector machine, using either single or multiple kernels, is the best classifier for remote protein homology detection out of all the classifiers tested in this thesis.
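    The "bag of words" view of a kernel as an inner product can be made concrete with a minimal sketch. The code below is a plain k-mer spectrum kernel, used here only as an illustration; it is not the weighted string kernel developed in the thesis, and the function names and the choice k = 3 are assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer ("word") in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of two sequences.

    The feature space (one axis per possible k-mer) is never built
    explicitly; only k-mers that occur in both sequences contribute.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[w] for w, n in ca.items() if w in cb)


def normalised_kernel(a: str, b: str, k: int = 3) -> float:
    """Cosine-normalised kernel, so that K(x, x) = 1 for non-empty spectra."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0
```

    Any classifier that needs only inner products (for example a support vector machine with a precomputed Gram matrix) could consume the values returned by `normalised_kernel` directly.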

    Requirements analysis document

    This document details the purpose, features, and expected interfaces for the complete EXCELERATE WP9 (Use Case D: ELIXIR framework for secure archiving, dissemination and analysis of human access-controlled data). It outlines the tasks the system will perform, the constraints under which it operates, and how it reacts in certain circumstances. This document is intended for stakeholders, designers and developers as well as users of the system, and derives from a joint analysis carried out with these groups.

    Measurements of the pp → ZZ production cross section and the Z → 4ℓ branching fraction, and constraints on anomalous triple gauge couplings at √s = 13 TeV

    Four-lepton production in proton-proton collisions, pp → (Z/γ*)(Z/γ*) → 4ℓ, where ℓ = e or μ, is studied at a center-of-mass energy of 13 TeV with the CMS detector at the LHC. The data sample corresponds to an integrated luminosity of 35.9 fb⁻¹. The ZZ production cross section, σ(pp → ZZ) = 17.2 ± 0.5 (stat) ± 0.7 (syst) ± 0.4 (theo) ± 0.4 (lumi) pb, measured using events with two opposite-sign, same-flavor lepton pairs produced in the mass region 60 […] B(Z → 4ℓ) = 4.83 +0.23/−0.22 (stat) +0.32/−0.29 (syst) ± 0.08 (theo) ± 0.12 (lumi) × 10⁻⁶ for events with a four-lepton invariant mass in the range 80 […] 4 GeV for all opposite-sign, same-flavor lepton pairs. The results agree with standard model predictions. The invariant mass distribution of the four-lepton system is used to set limits on anomalous ZZZ and ZZγ couplings at 95% confidence level: −0.0012 < f₄^Z < 0.0010, −0.0010 < f₅^Z < 0.0013, −0.0012 < f₄^γ < 0.0013, −0.0012 < f₅^γ < 0.0013.
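    To get a feel for the overall precision of the quoted ZZ cross section, the four uncertainty components can be combined into one number. The sketch below assumes the statistical, systematic, theoretical and luminosity components are independent and therefore add in quadrature; the abstract does not state this, and the helper name is hypothetical.

```python
from math import sqrt


def combine_in_quadrature(*components: float) -> float:
    """Root-sum-square of uncertainty components assumed to be independent."""
    return sqrt(sum(c * c for c in components))


# Values quoted in the abstract, in pb: stat, syst, theo, lumi.
central = 17.2
total = combine_in_quadrature(0.5, 0.7, 0.4, 0.4)

print(f"sigma(pp -> ZZ) = {central} +/- {total:.2f} pb")
print(f"relative uncertainty = {100 * total / central:.1f}%")
```

    Under that assumption the combined uncertainty is about 1.03 pb, or roughly 6% of the central value.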

    The Balkans and the Ottoman Empire III: Social and Economic Transformations from the End of the 18th Century to the Middle of the 20th Century: Serbia, Macedonia, Bosnia

    High-throughput molecular profiling techniques routinely generate vast amounts of data for translational medicine studies. Secure, access-controlled systems are needed to manage, store, transfer and distribute these data because of their personally identifiable nature. The European Genome-phenome Archive (EGA) was created to facilitate access to, and management of, the long-term archival of bio-molecular data. Each data provider is responsible for ensuring a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, the transfer of data during upload and download is encrypted. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcome of this project: a framework to enable the secure download of EGA data to a Galaxy server. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to run and design data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that can download data securely from EGA into a Galaxy server, where the data can subsequently be processed further. This tool allows a user to run, within the browser, an entire analysis containing sensitive data from EGA, and to make this analysis available to other researchers in a reproducible manner, as shown with a proof-of-concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer

    Search for heavy resonances decaying to two Higgs bosons in final states containing four b quarks

    A search is presented for narrow heavy resonances X decaying into pairs of Higgs bosons (H) in proton-proton collisions collected by the CMS experiment at the LHC at √s = 8 TeV. The data correspond to an integrated luminosity of 19.7 fb⁻¹. The search considers HH resonances with masses between 1 and 3 TeV, having final states of two b quark pairs. Each Higgs boson is produced with large momentum, and the hadronization products of the pair of b quarks can usually be reconstructed as single large jets. The background from multijet and tt̄ events is significantly reduced by applying requirements related to the flavor of the jet, its mass, and its substructure. The signal would be identified as a peak on top of the dijet invariant mass spectrum of the remaining background events. No evidence is observed for such a signal. Upper limits obtained at 95% confidence level for the product of the production cross section and branching fraction σ(gg → X) B(X → HH → bb̄bb̄) range from 10 to 1.5 fb for the mass of X from 1.15 to 2.0 TeV, significantly extending previous searches. For a warped extra dimension theory with a mass scale Λ_R = 1 TeV, the data exclude radion scalar masses between 1.15 and 1.55 TeV.

    Solve-RD: systematic pan-European data sharing and collaborative analysis to solve rare diseases.

    For the first time in Europe, hundreds of rare disease (RD) experts are teaming up to actively share and jointly analyse existing patients' data. Solve-RD is a Horizon 2020-supported EU flagship project bringing together >300 clinicians, scientists, and patient representatives of 51 sites from 15 countries. Solve-RD is built upon a core group of four European Reference Networks (ERNs; ERN-ITHACA, ERN-RND, ERN-Euro NMD, ERN-GENTURIS) which annually see more than 270,000 RD patients with respective pathologies. The main ambition is to solve unsolved rare diseases for which a molecular cause is not yet known. This is achieved through an innovative clinical research environment that introduces novel ways to organise expertise and data. Two major approaches are being pursued: (i) massive data re-analysis of >19,000 unsolved rare disease patients and (ii) novel combined -omics approaches. The minimum requirement to be eligible for the analysis activities is an inconclusive exome that can be shared with controlled access. The first preliminary data re-analysis has already diagnosed 255 cases from 8393 exome/genome datasets. This unprecedented degree of collaboration focused on sharing of data and expertise shall identify many new disease genes and enable diagnosis of many so far undiagnosed patients from all over Europe.

    Machine learning-enabled phenotyping for GWAS and TWAS of WUE traits in 869 field-grown sorghum accessions

    Sorghum (Sorghum bicolor) is a model C4 crop made experimentally tractable by extensive genomic and genetic resources. Biomass sorghum is studied as a feedstock for biofuel and forage. Mechanistic modeling suggests that reducing stomatal conductance (gs) could improve sorghum intrinsic water use efficiency (iWUE) and biomass production. Phenotyping to discover genotype-to-phenotype associations remains a bottleneck in understanding the mechanistic basis for natural variation in gs and iWUE. This study addressed multiple methodological limitations. Optical tomography and a machine learning tool were combined to measure stomatal density (SD). This was combined with rapid measurements of leaf photosynthetic gas exchange and specific leaf area (SLA). These traits were the subject of a genome-wide association study and a transcriptome-wide association study across 869 field-grown biomass sorghum accessions. The ratio of intracellular to ambient CO2 was genetically correlated with SD, SLA, gs, and biomass production. Plasticity in SD and SLA was interrelated with each other and with productivity across wet and dry growing seasons. Moderate-to-high heritability of traits studied across the large mapping population validated associations between DNA sequence variation or RNA transcript abundance and trait variation. A total of 394 unique genes underpinning variation in WUE-related traits are described with higher confidence because they were identified in multiple independent tests. This list was enriched in genes whose Arabidopsis (Arabidopsis thaliana) putative orthologs have functions related to stomatal or leaf development and leaf gas exchange, as well as genes with nonsynonymous/missense variants. These advances in methodology and knowledge will facilitate improving C4 crop WUE.

    Discovering and linking public omics data sets using the Omics Discovery Index.

    Biomedical data are being produced at an unprecedented rate owing to the falling cost of experiments and wider access to genomics, transcriptomics, proteomics and metabolomics platforms [1, 2]. As a result, public deposition of omics data is on the increase. This presents new challenges, including finding ways to store, organize and access different types of biomedical data stored on different platforms. Here, we present the Omics Discovery Index (OmicsDI; http://www.omicsdi.org), an open-source platform that enables access, discovery and dissemination of omics data sets.