7 research outputs found
GaKCo: a Fast Gapped k-mer string Kernel using Counting - supplementary information, code and data
This data record consists of supplementary information files and all datasets and code used in the ECML PKDD 2017 paper <b>GaKCo: a Fast Gapped k-mer string Kernel using Counting</b> (README included).<div><br></div><div><div>GaKCo is a fast and naturally parallelizable algorithm for gapped k-mer based string kernel calculation. GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. The algorithm scales to larger dictionary sizes and larger numbers of mismatches.</div><div><br></div><div><b>GaKCo_ECML17_Supplementary.pdf</b> - presents schematics of the GaKCo algorithm, a formal proof of the Hamming Distance Property, a justification of GaKCo's Sort and Count Method, connections to previous studies, details of the datasets, an overview of the empirical performance of GaKCo versus neural networks, and other related experiments.</div><div><br></div><div><b>data.zip</b> - contains 38 training and testing nucleotide and peptide datasets in <b>.fasta</b> format: a text-based format in which base pairs or amino acids are represented using single-letter codes. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. Protein, DNA and text dictionaries are provided in <b>.txt</b> format. Both formats can be opened with any text editor.</div><div><br></div><div><div><b>Datasets for GaKCo:</b> </div><div>We perform 19 different classification tasks to evaluate the performance of GaKCo. 
These tasks fall into three categories: (1) TF binding site prediction (DNA dataset), (2) remote protein homology prediction (protein dataset), and (3) character-based English text classification (text dataset).</div><div><br></div><div><b>code.zip - </b>contains C++ source files (<b>.cpp</b>, <b>.h</b>) and a bash script (<b>.sh</b>) to compile GaKCo using g++ with OpenMP, and to obtain kernel output using the data files and dictionaries above. See below for further detail.<br></div><div><br></div><div>Compiling GaKCo (with OpenMP): </div><div>```</div><div>g++ -c GaKCo.cpp -o GaKCo -fopenmp</div><div>```</div><div>To get kernel output: </div><div>```</div><div>GaKCo </div><div>#User options: > 5, =(0,..,g-1), = 0 for single-thread/ 1 for multithread</div><div>```</div><div>Bash script to run end-to-end kernel calculation:</div><div>```</div><div>processing.sh</div><div>```</div><div><br></div><div><br></div><div><b>Background</b></div><div>String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size (<i>Σ</i>) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate the co-occurrence of mismatched substrings, resulting in a time cost of O(<i>Σ</i><sup>M</sup>). We propose a fast algorithm for calculating the Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger <i>Σ</i> and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the <i>Σ</i><sup>M</sup> term that slows down the trie-based approach. 
Experimentally, we observe that GaKCo achieves the same accuracy as the state of the art while running faster by factors of 2, 100, and 4, respectively, when classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets).<br></div></div></div>
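To make the idea of counting co-occurring gapped k-mers with associative arrays concrete, here is a minimal C++ sketch. It is not the code shipped in code.zip and it does not reproduce GaKCo's sort-and-count or cumulative mismatch-profile steps; it simply evaluates the gapped k-mer kernel directly from its feature-space definition. The function names and the parameter convention (g is the window length, k is the number of positions kept out of g) are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Enumerate every way to keep k positions out of a window of length g.
// (Hypothetical helper, not part of the GaKCo code base.)
static void chooseKeptPositions(int g, int k, int start,
                                std::vector<int>& current,
                                std::vector<std::vector<int>>& out) {
    if (static_cast<int>(current.size()) == k) {
        out.push_back(current);
        return;
    }
    for (int i = start; i < g; ++i) {
        current.push_back(i);
        chooseKeptPositions(g, k, i + 1, current, out);
        current.pop_back();
    }
}

// Gapped k-mer kernel between x and y: for each choice of k kept positions,
// count the pairs of length-g windows (one from x, one from y) that agree
// on all kept positions, and sum those counts over all choices.
long long gappedKmerKernel(const std::string& x, const std::string& y,
                           int g, int k) {
    if (k <= 0 || k > g) return 0;
    const std::size_t gLen = static_cast<std::size_t>(g);
    if (x.size() < gLen || y.size() < gLen) return 0;

    std::vector<std::vector<int>> subsets;
    std::vector<int> current;
    chooseKeptPositions(g, k, 0, current, subsets);

    long long total = 0;
    for (const auto& kept : subsets) {
        // Associative array: projected window of x -> number of occurrences.
        std::unordered_map<std::string, long long> countsX;
        for (std::size_t i = 0; i + gLen <= x.size(); ++i) {
            std::string key;
            for (int p : kept) key.push_back(x[i + p]);
            ++countsX[key];
        }
        // Each window of y contributes the count of its projection in x.
        for (std::size_t j = 0; j + gLen <= y.size(); ++j) {
            std::string key;
            for (int p : kept) key.push_back(y[j + p]);
            auto it = countsX.find(key);
            if (it != countsX.end()) total += it->second;
        }
    }
    return total;
}
```

For example, `gappedKmerKernel("ACGT", "ACGT", 3, 2)` looks at the windows ACG and CGT under the three possible choices of two kept positions. The work per position choice depends on the number of windows rather than on the dictionary size, which is the property the description above attributes to counting-based methods.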
Discovery of CTCF-Sensitive Cis-Spliced Fusion RNAs between Adjacent Genes in Human Prostate Cells
<div><p>Genes or their encoded products are not expected to mingle with each other except in some disease situations. In cancer, a frequent mechanism that can produce gene fusions is chromosomal rearrangement. However, recent discoveries of RNA trans-splicing and cis-splicing between adjacent genes (cis-SAGe) support other mechanisms for generating fusion RNAs. In our transcriptome analyses of 28 prostate normal and cancer samples, on average 30% of fusion RNAs are transcripts that contain exons belonging to same-strand neighboring genes. These fusion RNAs may be the products of cis-SAGe, which was previously thought to be rare. To validate this finding and to better understand the phenomenon, we used LNCaP, a prostate cell line, as a model, and identified 16 additional cis-SAGe events by silencing the transcription factor CTCF and performing paired-end RNA sequencing. About half of the fusions are expressed at a significant level compared to their parental genes. Silencing one of the in-frame fusions resulted in reduced cell motility. Most out-of-frame fusions are likely to function as non-coding RNAs. The majority of the 16 fusions are also detected in other prostate cell lines, as well as in the 14 clinical prostate normal and cancer pairs. By studying the features associated with these fusions, we developed a set of rules: 1) the parental genes are same-strand neighboring genes; 2) the distance between the genes is within 30 kb; 3) the 5′ genes are actively transcribed; and 4) the chimeras tend to have the second-to-last exon in the 5′ genes joined to the second exon in the 3′ genes. We then randomly selected 20 neighboring genes in the genome, and detected four fusion events using these rules in prostate cancer and non-cancerous cells. These results suggest that splicing between neighboring gene transcripts is a rather frequent phenomenon, and that it is not a feature unique to cancer cells.</p></div>
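The first three rules above can be read as a simple filter over pairs of genes. The C++ sketch below shows one way such a filter might look. The `Gene` struct, its field names, the way the intergenic distance is measured, and the boolean flags for "neighboring" and "actively transcribed" are all assumptions made for illustration; the abstract does not say how these were determined. Rule 4 describes a tendency in exon usage rather than a hard criterion, so it is left out of the filter.

```cpp
#include <algorithm>
#include <string>

// Hypothetical gene record; the field names are illustrative only.
struct Gene {
    std::string chrom;  // chromosome name, e.g. "chr15"
    char strand;        // '+' or '-'
    long start;         // lower genomic coordinate
    long end;           // higher genomic coordinate
    bool active;        // assumed flag: gene is actively transcribed
};

// Gap between two genes on the same chromosome (0 if they overlap).
long intergenicDistance(const Gene& a, const Gene& b) {
    long gap = std::max(a.start, b.start) - std::min(a.end, b.end);
    return std::max(0L, gap);
}

// Applies rules 1-3: same-strand neighbors, within maxDistance (30 kb by
// default), and an actively transcribed 5' gene. Whether the two genes are
// neighbors, and which one is the 5' gene, is supplied by the caller.
bool isCisSageCandidate(const Gene& fivePrime, const Gene& threePrime,
                        bool areNeighbors, long maxDistance = 30000) {
    if (!areNeighbors) return false;
    if (fivePrime.chrom != threePrime.chrom) return false;
    if (fivePrime.strand != threePrime.strand) return false;
    if (intergenicDistance(fivePrime, threePrime) > maxDistance) return false;
    return fivePrime.active;
}
```

A pair that passes this filter is only a candidate; in the study, candidates were still tested experimentally.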
Procedures to further identify cis-SAGe fusions.
<p><b>A</b>, cis-SAGe fusions should not have interstitial deletions between fused intergenic exons. Shown is the sequencing data of GSM947411 for the <i>MFGE8-HAPLN3</i> fusion as one example. <b>B</b>, we required the cis-SAGe fusions to contain CTCF binding in the intergenic region between the two parental genes. Shown is CTCF ChIP-seq data on kidney, LNCaP, and lung in the UCSC Genome Browser for the <i>MFGE8-HAPLN3</i> fusion as one example. <b>C</b>, the relative expression of 16 fusions by quantitative RT-PCR in si- and siCTCF-treated LNCaP cells. <b>D</b>, reverse transcription using antisense primers annealing to the first exons of 3′ parental genes, and PCR of the intergenic transcripts for the 16 fusions. “+RT”, with reverse transcriptase; “-RT”, no reverse transcriptase. <b>E,</b> ratios of fusion FPKM versus parental gene FPKM were plotted in order from the highest to the lowest.</p>
Landscape of software-identified chimeric RNAs.
<p><b>A</b>, Venn diagram showing the number of fusions in si- and siCTCF groups. <b>B,</b> Circos plot depicting chimeric RNAs discovered across the genome. Ring: chromosomes. Within the ring, lines denote the chimeric RNAs connecting two parental genes. <b>C</b>, putative chimeric RNAs were categorized into INTERCHR, INTRACHR-SS-0GAP, and INTRACHR-OTHER.</p>
Detection of cis-SAGe chimeric mRNA in prostate cell lines and clinical tissues.
<p><b>A</b>, distribution of 16 fusion RNAs in nuclear vs. cytoplasmic fractions. The traditional protein-coding gene <i>GAPDH</i> and a known long non-coding RNA, MALAT1, were used as controls. The protein-coding potential of each fusion is marked below. N: not affecting protein coding, applies to fusions where junction sites fall in the UTR. O: out-of-frame, applies to fusions where the reading frame of the 5′ gene is different from that of the 3′ gene. I: in-frame, the reading frame of the 5′ gene is the same as that of the 3′ gene. <b>B,</b> qRT-PCR for <i>ADCK4-NUMBL</i> and the parental genes. LNCaP cells were transfected with siRNAs targeting the fusion RNA (si-AN1 and si-AN2). Levels of the various transcripts were normalized to those in the si-negative control (si-). <b>C,</b> cell motility was measured by wound healing assay. Cells were transfected with siRNAs targeting <i>ADCK4-NUMBL</i> and si-negative control. The changes in wound size were normalized to those in the si-negative control group (n>3, p<0.0001). <b>D,</b> detection of the 16 cis-SAGe chimeras in RWPE-1 (benign prostate cell), LNCaP, and PC-3 cells by RT-PCR followed by agarose gel electrophoresis. GAPDH served as the internal control. <b>E</b>, summary of the 16 cis-SAGe chimeric RNAs in 14 clinical prostate cancer and normal tissues. STAR software was used to align the chimeras onto the clinical RNA-seq data; samtools and IGV were used to map spanning reads across the fusion junction. Black indicates the absence of a fusion, while red indicates the detection of a fusion.</p>
Identification of novel cis-SAGe candidate events.
<p><b>A,</b> configuration of the most common cis-SAGe events based on the 16 fusions. Bars represent exons and lines represent introns and intergenic regions. Arrows represent primers used to detect novel cis-SAGe events. <b>B,</b> Sanger sequencing of four novel fusions detected in LNCaP cells. <b>C,</b> RT-PCR of the same four fusions in LHS, RWPE-1 and PC-3 cells.</p>
High percentage of fusion RNAs involving neighboring genes, and strategy for identifying novel cis-SAGe chimeric RNAs.
<p><b>A,</b> fusion RNAs categorized into INTRACHR-SS-0GAP, INTRACHR-OTHER and INTERCHR. Percentages of each category in individual tumor (upper) and matched normal (lower) prostate samples were plotted. <b>B,</b> correlation of the percentage of INTRACHR-SS-0GAP fusions in matched tumor and normal cases. Pearson R = 0.6. <b>C,</b> CTCF knockdown induced chimeric <i>SLC45A3-ELK4</i> RNA expression. LNCaP cells were transfected with either si- or siCTCF. <i>CTCF</i> and <i>SLC45A3-ELK4</i> expression were monitored by qRT-PCR. Transcript amounts were normalized to the internal control GAPDH. The level of these transcripts was set to 1 in si- transfected cells. <b>D</b>, experimental flow for identification and validation of cis-SAGe events. The quality of the raw sequencing data was checked using FastQC. Paired reads were mapped to both the human genome and transcriptome to identify chimeric RNAs using SOAPfuse software. Three groups of chimeric RNAs, classified by genomic features between the two parental parts, were validated by RT-PCR and Sanger sequencing. Five additional steps were then applied to remove potential non-neighboring fusions, or fusions resulting from interstitial deletion, and to identify cis-SAGe events.</p>
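Panel B reports a Pearson correlation between the percentage of INTRACHR-SS-0GAP fusions in tumor samples and in their matched normal samples. As a reminder of what that statistic measures, the following C++ sketch computes the standard Pearson coefficient for two equally long vectors of paired values. The function name `pearsonR` is an assumption for illustration, and the sketch makes no claim about the software the authors used.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Pearson correlation coefficient between paired observations x[i], y[i].
// Throws if the inputs are mismatched, too short, or have zero variance.
double pearsonR(const std::vector<double>& x, const std::vector<double>& y) {
    if (x.size() != y.size() || x.size() < 2) {
        throw std::invalid_argument("need two equally sized vectors, n >= 2");
    }
    const double n = static_cast<double>(x.size());

    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    // Sums of centered cross-products and squares.
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) {
        throw std::invalid_argument("correlation undefined for constant input");
    }
    return sxy / std::sqrt(sxx * syy);
}
```

Here each element of `x` would hold the percentage for one tumor sample and the matching element of `y` the percentage for its paired normal sample.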