7 research outputs found

    Privacy-preserving document similarity detection

    Document similarity detection is an important technique used in many applications. A tool that guarantees the privacy of the documents during comparison would expand the range of settings in which this technique can be applied. The goal of this project is to develop a method for privacy-preserving document similarity detection capable of identifying documents that are either semantically or syntactically similar. As a result, two methods were designed, implemented, and evaluated. The first method applies a privacy-preserving data comparison protocol for secure comparison; this original protocol was created as part of this thesis. The second method uses a modified private-matching scheme. Both methods use natural language processing techniques to capture the semantic relations between documents. During testing, the first method was found to be too slow for practical application. The second method, by contrast, was fairly fast and effective. It can be used to build a tool for detecting syntactic and semantic similarity in a privacy-preserving way.
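The abstract does not describe either protocol in detail, so the following is only a plain, non-private sketch of the kind of syntactic comparison that a private-matching scheme could evaluate on protected inputs. It assumes documents are reduced to sets of word n-grams and compared by Jaccard similarity; the function names `shingles` and `jaccard`, and the choice of 3-word n-grams, are illustrative assumptions and not taken from the thesis.

```python
from __future__ import annotations


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercase word n-grams (shingles) in a text.

    Assumption: whitespace tokenisation; a real system would normalise further.
    """
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(doc_a), shingles(doc_b)))
```

In a privacy-preserving setting, the set intersection would be computed without either party revealing its shingles; this sketch shows only the similarity measure, not that protection.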

    Improved reference genome uncovers novel sex-linked regions in the Guppy (Poecilia reticulata)

    This is the author accepted manuscript. The final version is available on open access from Oxford University Press via the DOI in this record.
    Data availability: Population genomics data are available on ENA: Study PRJEB10680. PCR-free data are available on ENA: Study PRJEB36450. Genome assembly is available on ENA ID: PRJEB36704; ERP119926. All scripts and pipelines are available on GitHub: https://github.com/bfrasercommits/guppy_genome
    Theory predicts that the sexes can achieve greater fitness if loci with sexually antagonistic polymorphisms become linked to the sex determining loci, and this can favour the spread of reduced recombination around sex determining regions. Given that sex-linked regions are frequently repetitive and highly heterozygous, few complete Y chromosome assemblies are available to test these ideas. The guppy system (Poecilia reticulata) has long been invoked as an example of sex chromosome formation resulting from sexual conflict. Early genetics studies revealed that male colour patterning genes are mostly but not entirely Y-linked, and that X-linkage may be most common in low predation populations. More recent population genomic studies of guppies have reached varying conclusions about the size and placement of the Y-linked region. However, this previous work used a reference genome assembled from short-read sequences from a female guppy. Here, we present a new guppy reference genome assembly from a male, using long-read PacBio single-molecule real-time sequencing (SMRT) and chromosome contact information. Our new assembly sequences across repeat- and GC-rich regions and thus closes gaps and corrects mis-assemblies found in the short-read female-derived guppy genome. Using this improved reference genome, we then employed broad population sampling to detect sex differences across the genome. We identified two small regions that showed consistent male-specific signals. Moreover, our results help reconcile the contradictory conclusions put forth by past population genomic studies of the guppy sex chromosome. Our results are consistent with a small Y-specific region and rare recombination in male guppies.
    Max Planck Society; European Research Council (ERC); Natural Environment Research Council (NERC)

    NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences

    Background: Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration.
    Results: We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with two existing alternatives, QUAST and dnadiff.
    Conclusions: We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available at https://github.com/uio-cels/NucDiff under the MPL license.
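Because the differences are reported as GFF3 files, they can be consumed by a downstream pipeline with a few lines of code. The sketch below parses GFF3 feature lines (nine tab-separated columns, with `key=value` attributes separated by semicolons in the ninth) and tallies features by one attribute. The attribute key `Name`, the source label `example`, and the sample records are assumptions made for illustration; they are not taken from NucDiff's documented output, so check the actual files before relying on them.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Optional
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Optional[dict]:
    """Parse one GFF3 feature line; return None for blanks, comments, directives."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, _strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    return {"seqid": seqid, "type": ftype, "start": int(start),
            "end": int(end), "attributes": attributes}


def count_by_attribute(lines: Iterable[str], key: str = "Name") -> Counter:
    """Count features by the value of one attribute (hypothetical default: Name)."""
    counts: Counter = Counter()
    for line in lines:
        feature = parse_gff3_line(line)
        if feature is not None:
            counts[feature["attributes"].get(key, "unknown")] += 1
    return counts


# Hypothetical records, invented for this example only.
SAMPLE = [
    "##gff-version 3",
    "contig1\texample\tregion\t100\t105\t.\t.\t.\tID=d1;Name=insertion",
    "contig1\texample\tregion\t250\t250\t.\t.\t.\tID=d2;Name=deletion",
    "contig2\texample\tregion\t40\t47\t.\t.\t.\tID=d3;Name=insertion",
]

if __name__ == "__main__":
    for name, n in sorted(count_by_attribute(SAMPLE).items()):
        print(name, n)
```

Keeping the parser separate from the tally makes it straightforward to swap the grouping key or to filter by sequence ID or coordinate range.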

    Additional file 1: of NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences

    Figure S1. Reference fragment placement order depending on query fragment orientations during detection of local differences. Figure S2. Circular genome alignment alternatives. Figure S3. Number of differences in each category obtained by NucDiff with the default parameter settings for all assemblers. Figure S4. Comparison of multiple assemblies against one reference using NucDiff. Figure S5. Examples of detection of long deletions located in all assemblies at the same place in the reference sequence. Table S1. Alignment fragmentation cases caused by simple differences. Table S2. Genome modifications implemented during the simulation process. Table S3. List of E. coli genomes used in the "Comparison of genomes from different strains of the same species" section. Table S4. Parameter values used for each parameter setting. Table S5. Correspondence between the QUAST difference types and the simulated difference types. Table S6. Correspondence between the QUAST, dnadiff and NucDiff difference types and the expected difference types. (PDF 989 kb)