84 research outputs found

    Functional Sequence Annotation in an Error-prone Environment

    As more and more sequences are submitted to public databases, sequence retrieval becomes more computationally challenging. When, for example, UniProtKB/TrEMBL doubles in size annually, the tools used today might not be sufficient tomorrow. Faster and computationally lighter methods are needed for sequence retrieval. This study presents a computationally more efficient tool: the Suffix Array Neighbourhood Search (SANS) tool is a hundredfold faster than BLAST, the most commonly used tool. The sequence databases grow not only in size but also in the number of different functional annotations they contain. Recent studies have shown that a large number of these annotations are assigned incorrectly. When the error level of functional annotations in the databases grows to a statistically significant figure, better methods and the use of error detection statistics are highly recommended. In the present study we introduce novel methods for weighted statistical testing of functional annotations. Novel methods for calculating the information content value are also presented. The information content value enables informative annotations to be discriminated from uninformative ones. A growing number of functional annotation tools are introduced annually. Since no gold standard evaluation sets exist, it is impossible to determine the reliability of the different methods. The Critical Assessment of Functional Annotations (CAFA) challenge is the first attempt to evaluate functional annotation tools by using blind testing on a large scale. The first CAFA challenge included the evaluation of 54 state-of-the-art methods in two different Gene Ontology categories. The results show that there is plenty of room for improvement in the prediction accuracy of the existing tools.
    As new sequences are added to public biological sequence databases at an accelerating pace, database users face challenges in handling massive amounts of data. For example, the UniProtKB sequence database doubles in size annually, which inevitably leads to a situation where the search algorithms in use today become obsolete because their efficiency does not meet future challenges. New, computationally more efficient methods are continually needed. This thesis presents a method that is computationally more efficient than the methods currently in use. The SANS algorithm presented in the thesis achieves hundredfold improvements in run time compared with BLAST, the most commonly used program. Biological sequence databases are not growing only in their number of sequences. The amount of information associated with the sequences is growing as well. Recently, however, concerns have arisen about the correctness of this information. It has been estimated that almost half of the information in sequence databases is erroneous. Using erroneous information, for example in research, easily leads to wrong conclusions and wrong results. This thesis presents the PANNZER method, which statistically estimates the reliability of the retrieved information and thereby maximizes its correctness. Obtaining correct information from public biological sequence databases is increasingly challenging. International research groups have also become aware of this. One way to measure the performance of existing methods in finding correct information is to organize an international competition for retrieval methods. Fifty-four competing methods from around the world took part in the first competition, called Critical Assessment of Functional Annotations (CAFA). This thesis also discusses that competition and its results.

    BARCOSEL: a tool for selecting an optimal barcode set for high-throughput sequencing

    Background: Current high-throughput sequencing platforms provide capacity to sequence multiple samples in parallel. Different samples are labeled by attaching a short sample-specific nucleotide sequence, a barcode, to each DNA molecule prior to pooling them into a mix containing a number of libraries to be sequenced simultaneously. After sequencing, the samples are binned by identifying the barcode sequence within each sequence read. In order to tolerate sequencing errors, barcodes should be sufficiently far apart from each other in sequence space. An additional constraint, due to both nucleotide usage and basecalling accuracy, is that the proportions of the different nucleotides should be in balance at each barcode position. The number of samples to be mixed in each sequencing run may vary, and this introduces the problem of how to select the best subset of the barcodes available at a sequencing core facility for each sequencing run. There are plenty of tools available for de novo barcode design, but they are not suitable for subset selection. Results: We have developed a tool which can be used for three different tasks: 1) selecting an optimal barcode set from a larger set of candidates, 2) checking the compatibility of a user-defined set of barcodes, e.g. whether two or more libraries with existing barcodes can be combined in a single sequencing pool, and 3) augmenting an existing set of barcodes. In our approach the selection process is formulated as a minimization problem. We define the cost function and a set of constraints and use integer programming to solve the resulting combinatorial problem. Based on the desired number of barcodes to be selected and the set of candidate sequences given by the user, the necessary constraints are automatically generated and the optimal solution can be found. The method is implemented in the C programming language, and a web interface is available at http://ekhidna2.biocenter.helsinki.fi/barcosel.
Conclusions: The increasing capacity of sequencing platforms raises the challenge of mixing barcodes. Our method allows the user to select a given number of barcodes from a larger existing barcode set so that both sequencing errors are tolerated and the nucleotide balance is optimized. The tool is easy to access via a web browser.
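
The two constraints described in this abstract (barcodes far apart in sequence space, nucleotides balanced at each position) can be illustrated with a minimal Python sketch. This is not BARCOSEL's implementation: the abstract describes an integer programming formulation in C, whereas the sketch below uses exhaustive search, Hamming distance as the distance measure, and a simple max-minus-min count as the imbalance cost. All function names and the cost definition are assumptions made for illustration, and the approach is only practical for small candidate sets.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: list[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def imbalance(barcodes) -> int:
    """Hypothetical cost: sum over positions of (max count - min count)
    across A, C, G, T. Zero means perfect balance at every position."""
    total = 0
    for column in zip(*barcodes):
        counts = Counter(column)
        per_base = [counts.get(base, 0) for base in "ACGT"]
        total += max(per_base) - min(per_base)
    return total


def best_subset(candidates: list[str], k: int, min_dist: int):
    """Brute-force selection: among all k-subsets whose pairwise distance
    is at least min_dist, return the one with the lowest imbalance.
    Returns None when no subset satisfies the distance constraint."""
    feasible = (
        subset
        for subset in combinations(candidates, k)
        if min_pairwise_distance(list(subset)) >= min_dist
    )
    return min(feasible, key=imbalance, default=None)


candidates = ["ACGT", "CGTA", "GTAC", "TACG", "ACGA"]
chosen = best_subset(candidates, k=4, min_dist=3)
```

Calling `min_pairwise_distance` and `imbalance` directly on an existing set corresponds to the compatibility check (task 2); exhaustive search grows combinatorially with the number of candidates, which is one reason a solver-based formulation is preferable for realistic inputs.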

    Multi-type quantum well semiconductor membrane external-cavity surface-emitting lasers (MECSELs) for widely tunable continuous wave operation

    Membrane external-cavity surface-emitting lasers (MECSELs) are at the forefront of pushing the performance limits of vertically emitting semiconductor lasers. Their simple idea of using just a very thin (hundreds of nanometers to a few microns) gain membrane opens up new possibilities through uniform double-side optical pumping and superior heat extraction from the active area. Moreover, these advantages of MECSELs enable more complex band gap engineering for the active region through the introduction of multiple types of quantum wells (QWs) into a single laser gain structure. In this paper, we present a new design strategy for laser gain structures with several types of QWs. The aim is to achieve broadband gain with relatively high power operation and potentially a flat spectral tuning range. The emphasis in our design is on ensuring sufficient gain over a wide wavelength range, uniform pump absorption, and restricted carrier mobility between the different quantum wells during laser operation. A full-width half-maximum tuning range of > 70 nm (> 21.7 THz) with more than 125 mW of power through the entire tuning range at room temperature is demonstrated.
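
The abstract quotes the tuning range both in wavelength (> 70 nm) and in frequency (> 21.7 THz). As a rough consistency check, the two are related as sketched below. The symbols $\lambda_{\min}$, $\lambda_{\max}$ and $\lambda_0$ (the geometric-mean wavelength of the tuning range) are introduced here for illustration; the centre wavelength is not stated in the abstract, so the value of about 0.98 µm is only a back-calculation from the two quoted figures, not a reported result.

```latex
\Delta\nu
  = c\left(\frac{1}{\lambda_{\min}} - \frac{1}{\lambda_{\max}}\right)
  = \frac{c\,\Delta\lambda}{\lambda_{\min}\lambda_{\max}}
  = \frac{c\,\Delta\lambda}{\lambda_0^{2}},
\qquad
\lambda_0
  = \sqrt{\frac{c\,\Delta\lambda}{\Delta\nu}}
  \approx \sqrt{\frac{(2.998\times10^{8}\,\mathrm{m/s})\,(70\times10^{-9}\,\mathrm{m})}{21.7\times10^{12}\,\mathrm{Hz}}}
  \approx 0.98\,\mu\mathrm{m}
```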

    Genomic features separating ten strains of Neorhizobium galegae with different symbiotic phenotypes

    Background: The symbiotic phenotype of Neorhizobium galegae, with strains specifically fixing nitrogen with either Galega orientalis or G. officinalis, has made it a target in research on determinants of host specificity in nitrogen fixation. The genomic differences between representative strains of the two symbiovars are, however, relatively small. This introduced a need for a dataset representing a larger bacterial population in order to draw better conclusions about characteristics typical of a subset of the species. In this study, we produced draft genomes of eight strains of N. galegae having different symbiotic phenotypes, both with regard to host specificity and nitrogen fixation efficiency. These genomes were analysed together with the previously published complete genomes of N. galegae strains HAMBI 540T and HAMBI 1141. Results: The results showed that the presence of an additional rpoN sigma factor gene in the symbiosis gene region is a characteristic specific to symbiovar orientalis, required for nitrogen fixation. The nifQ gene was also shown to be crucial for functional symbiosis in both symbiovars. Genome-wide analyses identified additional genes characteristic of strains of the same symbiovar and of strains having similar plant growth promoting properties on Galega orientalis. Many of these genes are involved in transcriptional regulation or in metabolic functions. Conclusions: The results of this study confirm that the only symbiosis-related gene that is present in one symbiovar of N. galegae but not in the other is an rpoN gene. The specific function of this gene remains to be determined, however. New genes that were identified as specific for strains of one symbiovar may be involved in determining host specificity, while others are defined as potential determinant genes for differences in efficiency of nitrogen fixation.

    gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output

    Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial and while there are many computational tools partially solving the problem, several have shortcomings as to the reliability and correctness of the output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that utilizes long reads from multiple third-generation sequencing platforms in finding links between contigs and combining them. The long reads potentially contain sequence information to fill the gaps created in the scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher to process SSPACE-LongRead output to fill gaps after the scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool FGAP and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint. Peer reviewed.

    Genome sequence of the model plant pathogen Pectobacterium carotovorum SCC1

    Bacteria of the genus Pectobacterium are economically important plant pathogens that cause soft rot disease on a wide variety of plant species. Here, we report the genome sequence of Pectobacterium carotovorum strain SCC1, a Finnish soft rot model strain isolated from a diseased potato tuber in the early 1980s. The genome of strain SCC1 consists of one circular chromosome of 4,974,798 bp and one circular plasmid of 5,524 bp. In total, 4,451 genes were predicted, of which 4,349 are protein coding and 102 are RNA genes. Peer reviewed.

    Design and characterization of MECSELs for widely tunable (>25 THz) continuous wave operation

    Membrane external-cavity surface-emitting lasers (MECSELs) are vertically emitting semiconductor lasers that combine all the benefits of VECSELs (vertical-external-cavity surface-emitting lasers) with a new degree of freedom in creating gain structures without monolithically integrated distributed Bragg reflectors (DBRs). The absence of the DBR and the substrate, and the use of a very thin gain membrane (typically some hundreds of nanometers), which can be sandwiched between two transparent heat spreaders, represent the best solution for heat removal. The membrane configuration also allows the option of double-side pumping, which in turn makes it possible to utilize a large number of quantum well (QW) groups as well as multiple kinds of QWs in a periodic laser gain structure. Here we report on the design strategy and results of different approaches to broadband, relatively high power MECSEL gain structures. In particular, efficient pump absorption, sufficient gain at several different wavelengths, and carrier mobility during laser operation are discussed. We also present the characteristics of the laser systems created. Results show a ∼83 nm (∼25 THz) tuning range with more than 100 mW of power at all wavelengths at room temperature operation. Strategies for further development are discussed as well. Peer reviewed.