198 research outputs found
Hierarchical maximum likelihood clustering approach
Objective: 
In this work, we focused on developing a clustering approach for biological data. In many biological
analyses, such as multi-omics data analysis and genome-wide
association studies (GWAS) analysis, it is crucial to find groups of data belonging to subtypes of diseases or tumors. Methods:
Conventionally, the k-means clustering algorithm is
overwhelmingly applied in many areas including biological
sciences. There are, however, several alternative clustering algorithms that can be applied, including support vector clustering. In this paper, taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework. 
Results: This method can perform clustering even when the data belonging to different groups overlap. It can also perform clustering when the number of samples is lower than the data dimensionality. 
Conclusion: The proposed scheme is free from selecting initial settings to begin the search process. In addition, it does not require the computation of the first and second derivative of likelihood functions, as is required by many other maximum likelihood based methods.
Significance: This algorithm uses distribution and centroid
information to cluster a sample and was applied to biological data. A Matlab implementation of this method can be downloaded from the web-link
http://www.riken.jp/en/research/labs/ims/med_sci_math/
Recommended from our members
Combined burden and functional impact tests for cancer driver discovery using DriverPower
The discovery of driver mutations is one of the key motivations for cancer genome sequencing. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumour types, we describe DriverPower, a software package that uses mutational burden and functional impact evidence to identify driver mutations in coding and non-coding sites within cancer whole genomes. Using a total of 1373 genomic features derived from public sources, DriverPower's background mutation model explains up to 93% of the regional variance in the mutation rate across multiple tumour types. By incorporating functional impact scores, we are able to further increase the accuracy of driver discovery. Testing across a collection of 2583 cancer genomes from the PCAWG project, DriverPower identifies 217 coding and 95 non-coding driver candidates. Comparing to six published methods used by the PCAWG Drivers and Functional Interpretation Working Group, DriverPower has the highest F1 score for both coding and non-coding driver discovery. This demonstrates that DriverPower is an effective framework for computational driver discovery
Comparative Genomics Identifies Candidate Genes for Infectious Salmon Anemia (ISA) Resistance in Atlantic Salmon (Salmo salar)
Infectious salmon anemia (ISA) has been described as the hoof and mouth disease of salmon farming. ISA is caused by a lethal and highly communicable virus, which can have a major impact on salmon aquaculture, as demonstrated by an outbreak in Chile in 2007. A quantitative trait locus (QTL) for ISA resistance has been mapped to three microsatellite markers on linkage group (LG) 8 (Chr 15) on the Atlantic salmon genetic map. We identified bacterial artificial chromosome (BAC) clones and three fingerprint contigs from the Atlantic salmon physical map that contains these markers. We made use of the extensive BAC end sequence database to extend these contigs by chromosome walking and identified additional two markers in this region. The BAC end sequences were used to search for conserved synteny between this segment of LG8 and the fish genomes that have been sequenced. An examination of the genes in the syntenic segments of the tetraodon and medaka genomes identified candidates for association with ISA resistance in Atlantic salmon based on differential expression profiles from ISA challenges or on the putative biological functions of the proteins they encode. One gene in particular, HIV-EP2/MBP-2, caught our attention as it may influence the expression of several genes that have been implicated in the response to infection by infectious salmon anemia virus (ISAV). Therefore, we suggest that HIV-EP2/MBP-2 is a very strong candidate for the gene associated with the ISAV resistance QTL in Atlantic salmon and is worthy of further study
Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen
The effectiveness of most cancer targeted therapies is short-lived. Tumors often develop resistance that might be overcome with drug combinations. However, the number of possible combinations is vast, necessitating data-driven approaches to find optimal patient-specific treatments. Here we report AstraZeneca's large drug combination dataset, consisting of 11,576 experiments from 910 combinations across 85 molecularly characterized cancer cell lines, and results of a DREAM Challenge to evaluate computational strategies for predicting synergistic drug pairs and biomarkers. 160 teams participated to provide a comprehensive methodological development and benchmarking. Winning methods incorporate prior knowledge of drug-target interactions. Synergy is predicted with an accuracy matching biological replicates for >60% of combinations. However, 20% of drug combinations are poorly predicted by all methods. Genomic rationale for synergy predictions are identified, including ADAM17 inhibitor antagonism when combined with PIK3CB/D inhibition contrasting to synergy when combined with other PI3K-pathway inhibitors in PIK3CA mutant cells
An integrative machine learning approach for prediction of toxicity - related drug safety
Recent trends in drug development have been marked by diminishing returns caused by the escalating costs and falling rates of new drug approval. Unacceptable drug toxicity is a substantial cause of drug failure during clinical trials and the leading cause of drug withdraws after release to the market. Computational methods capable of predicting these failures can reduce the waste of resources and time devoted to the investigation of compounds that ultimately fail. We propose an original machine learning method that leverages identity of drug targets and off-targets, functional impact score computed from Gene Ontology annotations, and biological network data to predict drug toxicity. We demonstrate that our method (TargeTox) can distinguish potentially idiosyncratically toxic drugs from safe drugs and is also suitable for speculative evaluation of different target sets to support the design of optimal low-toxicity combinations
Prediction Models of Breast Cancer Outcome
The goal of this study is to establish a method for predicting overall survival (OS ) and disease‐free survival (DFS ) in breast cancer patients after surgical operation. The gene expression profiles of cancer tissues from the patients, who underwent complete surgical resection of breast cancer and were subsequently monitored for postoperative survival, were analyzed using cDNA microarrays. We detected seven and three probes/genes associated with the postoperative OS and DFS , respectively, from our discovery cohort data. By incorporating these genes associated with the postoperative survival into MammaPrint genes, often used to predict prognosis of patients with early‐stage breast cancer, we constructed postoperative OS and DFS prediction models from the discovery cohort data using a Cox proportional hazard model. The predictive ability of the models was evaluated in another independent cohort using Kaplan–Meier (KM ) curves and the area under the receiver operating characteristic curve (AUC ). The KM curves showed a statistically significant difference between the predicted high‐ and low‐risk groups in both OS (log‐rank trend test P  = 0.0033) and DFS (log‐rank trend test P  = 0.00030). The models also achieved high AUC scores of 0.71 in OS and of 0.60 in DFS . Furthermore, our models had improved KM curves when compared to the models using MammaPrint genes (OS : P  = 0.0058, DFS : P  = 0.00054). Similar results were observed when our model was tested in publicly available datasets. These observations indicate that there is still room for improvement in the current methods of predicting postoperative OS and DFS in breast cancer
DeepInsight: a methodology to transform a non - image data to an image for convolution neural network architecture
It is critical, but difficult, to catch the small variation in genomic or other kinds of data that differentiates phenotypes or categories. A plethora of data is available, but the information from its genes or elements is spread over arbitrarily, making it challenging to extract relevant details for identification. However, an arrangement of similar genes into clusters makes these differences more accessible and allows for robust identification of hidden mechanisms (e.g. pathways) than dealing with elements individually. Here we propose, DeepInsight, which converts non-image samples into a well-organized image-form. Thereby, the power of convolution neural network (CNN), including GPU utilization, can be realized for non-image samples. Furthermore, DeepInsight enables feature extraction through the application of CNN for non-image samples to seize imperative information and shown promising results. To our knowledge, this is the first work to apply CNN simultaneously on different kinds of non-image datasets: RNA-seq, vowels, text, and artificial
- …
