93 research outputs found

    Utilizing Selected Di- and Trinucleotides of siRNA to Predict RNAi Activity

    Small interfering RNAs (siRNAs) induce posttranscriptional gene silencing in various organisms. siRNAs targeted to different positions of the same gene differ in effectiveness; hence, predicting siRNA activity is a crucial step. In this paper, we developed and evaluated a tool named “siRNApred” that uses a new mixed feature set to predict siRNA activity. To improve prediction accuracy, we proposed selected di- and trinucleotides (2-3NTs) as new features. A Random Forest siRNA activity prediction model was constructed using the feature set selected by our proposed Binary Search Feature Selection (BSFS) algorithm. Experimental data demonstrated that the binding site of the Argonaute protein correlates with siRNA activity. “siRNApred” is effective for selecting active siRNAs, and the prediction results demonstrate that our method can outperform other current siRNA activity prediction methods in terms of prediction accuracy.
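
The di- and trinucleotide features mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the 2-3NTs are encoded, so the whole-sequence counts below are an assumption, and the names `kmer_feature_names` and `kmer_counts` and the example sequence are hypothetical. The BSFS selection step and the Random Forest model are not reproduced here.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA-style input is converted below


def kmer_feature_names(ks=(2, 3)):
    """List every di- and trinucleotide over the RNA alphabet (16 + 64 names)."""
    return ["".join(p) for k in ks for p in product(ALPHABET, repeat=k)]


def kmer_counts(sirna, ks=(2, 3)):
    """Count overlapping occurrences of each 2- and 3-nucleotide in one siRNA.

    Assumption: plain occurrence counts over the whole sequence. The paper may
    use a different encoding (for example, position-specific indicators).
    """
    seq = sirna.upper().replace("T", "U")
    counts = dict.fromkeys(kmer_feature_names(ks), 0)
    for k in ks:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows with non-ACGU symbols
                counts[kmer] += 1
    return counts


# Hypothetical 19-nt sequence, used only to show the call.
features = kmer_counts("GUGAUGUAGCCUGAAUACG")
```

A feature vector like `features` could then be passed to any classifier; which subset of the 80 counts is informative is exactly what a feature selection step would have to decide.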

    Interactive Learning for the Analysis of Biomedical and Industrial Imagery

    This dissertation investigates supervised learning methods and applies them to the analysis and segmentation of digital image data from a variety of research fields. Segmentation and classification play an important role in biomedical and industrial image processing, and further recognition and quantification are often based on them. Many problem-specific approaches exist for a wide range of questions, and most of them exploit specific prior knowledge about the image data at hand. This work presents a supervised learning method that can segment and distinguish several objects and their classes simultaneously. The method is general enough to cover an important range of applications whose solution depends on local features. Segmentation results of this approach are shown on several datasets with different problem settings. The results underline the applicability of the learning method to many biomedical and industrial applications, without requiring explicit knowledge of image processing or programming. The approach is based on general feature classes that describe local structures such as color, texture, and edges. For this purpose, an interactive software tool was implemented that runs in real time for typical image sizes and thus allows a domain expert to work on segmentation and classification tasks interactively. No knowledge of image processing is needed, because user interaction is limited to intuitive labeling with a brush tool. The interactively trained system can then be applied to many new images without further user interaction. The approach is limited to segmentation problems for which local discriminative features suffice. Within this limitation, however, the algorithm shows surprisingly good results, which can be improved further in an application-specific procedure. The method supports up to four-dimensional, multispectral image data in a unified way. To further illustrate the applicability and transferability of the method, several real use cases from different imaging domains were investigated. These include, among others, the segmentation of tumor tissue imaged by wide-field microscopy, the quantification of cell migration in confocal microscopy images for the study of adult neurogenesis, the segmentation of blood vessels in the retina of the eye, the tracking of copper wires in a product authentication application, and the quality control of microscopy images in the context of high-throughput experiments. In addition, a new classification method based on global frequency estimates was developed for process control of the paper feeder of printing presses.

    ER-targeted Intrabodies Mediating Specific In Vivo Knockdown of Transitory Proteins in Comparison to RNAi

    In animals and mammalian cells, protein function can be analyzed by nucleotide sequence-based methods such as gene knockout, targeted gene disruption, CRISPR/Cas, TALEN, zinc finger nucleases, or the RNAi technique. Alternatively, protein knockdown approaches are available that rely on direct interference of an inhibitor with the target protein.

    Tensor Decomposition in Multiple Kernel Learning

    Modern data processing and analytic tasks often deal with high-dimensional matrices or tensors; for example, environmental sensors monitor (time, location, temperature, light) data. For large-scale tensors, efficient data representation plays a major role in reducing computational time and finding patterns. The thesis first studies fundamental matrix and tensor decomposition algorithms and their applications, in connection with the Tensor Train decomposition algorithm. The second objective is to apply the tensor perspective to Multiple Kernel Learning problems, where the stack of kernels can be seen as a tensor. Decomposing this kind of tensor leads to an efficient factorization approach for finding the best linear combination of kernels through similarity alignment. Interestingly, thanks to the symmetry of the kernel matrix, a novel decomposition algorithm for multiple kernels is derived that reduces the computational complexity. In terms of applications, this new approach allows the manipulation of large-scale multiple kernel problems. For example, with P kernels and n samples, it reduces the memory complexity from O(P^2n^2) to O(P^2r^2 + 2rn), where r < n is the number of low-rank components. This compression is also valuable in pairwise multiple kernel learning, which models the relation among pairs of objects and whose complexity is on the doubled scale. This study proposes AlignF_TT, a kernel alignment algorithm based on the novel decomposition algorithm for the tensor of kernels. Regarding predictive performance, the proposed algorithm improves on 18 artificially constructed datasets and achieves comparable performance on 13 real-world datasets in comparison with other multiple kernel learning algorithms. The study also reveals that a small number of low-rank components is sufficient for approximating the tensor of kernels.
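
Two ingredients named in this abstract, kernel alignment and low-rank approximation of a symmetric kernel, can be sketched with NumPy. This is not AlignF_TT or the thesis's decomposition: the alignment below is the standard uncentered Frobenius form, and the low-rank step is a plain truncated eigendecomposition swapped in for illustration. The function names and the toy data are assumptions.

```python
import numpy as np


def kernel_alignment(k1, k2):
    """Uncentered alignment <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return float(np.sum(k1 * k2) / (np.linalg.norm(k1) * np.linalg.norm(k2)))


def low_rank_symmetric(kernel, rank):
    """Rank-`rank` factors of a symmetric PSD kernel from its top eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(kernel)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:rank]     # indices of the largest ones
    return eigvecs[:, top], eigvals[top]


# Toy data: 50 samples with 3 features, so the linear kernel has rank 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = np.sign(x[:, 0])

k_linear = x @ x.T        # n x n linear kernel
k_target = np.outer(y, y)  # ideal kernel built from the labels

alignment_full = kernel_alignment(k_linear, k_target)

# Store an n x r factor and r eigenvalues instead of the full n x n matrix.
u, s = low_rank_symmetric(k_linear, rank=3)
k_approx = (u * s) @ u.T
```

Here the factors `u` and `s` hold n·r + r numbers rather than n², which is the kind of saving the abstract's complexity comparison refers to, although the exact bound quoted there belongs to the thesis's own algorithm.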

    Literature-based discovery of known and potential new mechanisms for relating the status of cholesterol to the progression of breast cancer

    Breast cancer has been studied for a long time and from a variety of perspectives in order to understand its pathogenesis. The pathogenesis of breast cancer can be classified into two groups: hereditary and spontaneous. Although cancer in general is considered a genetic disease, spontaneous factors are responsible for most of the pathogenesis of breast cancer. In other words, breast cancer is more likely to be caused and worsened by the dysfunction of a physical molecule than to be caused directly by a germline mutation. Interestingly, cholesterol, as one of those molecules, has been found to correlate with breast cancer risk. However, the mechanisms by which cholesterol promotes breast cancer progression are not thoroughly understood. This study therefore aims to review known mechanisms and discover potential new mechanisms underlying the correlation between cholesterol and breast cancer progression, using literature review and literature-based discovery. The known mechanisms are further classified into four groups: cholesterol membrane content, transport of cholesterol, cholesterol metabolites, and other. The potential mechanisms, which are intended to suggest new treatments, have been identified and checked for feasibility by an expert.

    Multi-task and Multi-view Learning for Predicting Adverse Drug Reactions

    Adverse drug reactions (ADRs) present a major concern for drug safety and are a major obstacle in modern drug development. They account for about one-third of all late-stage drug failures, and approximately 4% of all new chemical entities are withdrawn from the market due to severe ADRs. Although off-target drug interactions are considered to be the major causes of ADRs, the adverse reaction profile of a drug depends on a wide range of factors such as specific features of the drug's chemical structure, its ADME/PK properties, interactions with proteins, the metabolic machinery of the cellular environment, and the presence of other diseases and drugs. Hence computational modeling for ADR prediction is highly complex and challenging. We propose a set of statistical learning models that approach ADR prediction systematically from multiple perspectives. We first discuss available data sources for protein-chemical interactions and adverse drug reactions, and how the data can be represented for effective modeling. We also employ biological network analysis approaches for a deeper understanding of the chemical and biological mechanisms underlying various ADRs. In addition, since protein-chemical interactions are an important component of ADR prediction, identifying these interactions is a crucial step in both modern drug discovery and ADR prediction. The performance of common supervised learning methods for predicting protein-chemical interactions has been largely limited by the insufficient availability of binding data for many proteins. We propose two multi-task learning (MTL) algorithms for jointly predicting active compounds of multiple proteins, and our methods significantly outperform the existing state of the art. All these related data, methods, and preliminary results are helpful for understanding the underlying mechanisms of ADRs and for further studies. ADR data are complex and noisy, and in many cases we do not fully understand the molecular mechanisms of ADRs. 
Because the data set available for some ADRs is noisy and heterogeneous, we propose a sparse multi-view learning (MVL) algorithm for predicting one specific ADR, drug-induced QT prolongation, a major life-threatening adverse drug effect. It is crucial to predict the QT prolongation effect as early as possible in drug development. MVL algorithms work very well when complex data from diverse domains are involved and only limited labeled examples are available. Unlike existing MVL methods that use L2-norm co-regularization to obtain a smooth objective function, we propose an L1-norm co-regularized MVL algorithm for predicting QT prolongation, reformulate the objective function, and obtain its gradient in analytic form. We optimize the decision functions on all views simultaneously and achieve a 3- to 4-fold computational speedup compared with previous L2-norm co-regularized MVL methods, which alternately optimize one view with the other views fixed until convergence. L1-norm co-regularization enforces sparsity in the learned mapping functions, and hence the results are expected to be more interpretable. The proposed MVL method can only predict one ADR at a time. It would be advantageous to predict multiple ADRs jointly, especially when these ADRs are highly related. Advanced modeling techniques should be investigated to make better use of ADR data for more effective ADR prediction. We study the quantitative relationship among drug structures, drug-protein interaction profiles, and drug ADRs. We formalize the modeling problem as a multi-view (drug structure data and drug-protein interaction profile data) multi-task (one drug may cause multiple ADRs and each ADR is a task) classification problem. We apply the co-regularized MVL to each ADR and use regularized MTL to increase the total sample size and improve model performance. Experimental studies on the ADR data set demonstrate the effectiveness of our multi-view multi-task (MVMT) algorithm. 
Cluster analysis and significant feature identification using the results of our models reveal interesting hidden insights. In summary, we use computational methods such as biological network analysis, multi-task learning, multi-view learning, and inductive multi-view multi-task learning to systematically investigate the modeling of various ADRs, and we construct highly accurate models for ADR prediction. We also contribute novel supervised and semi-supervised learning algorithms, which can be applied to many other real-world applications.
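
The idea of L1-norm co-regularization across two views, with all views updated simultaneously, can be made concrete with a small sketch. It is not the dissertation's algorithm: it assumes linear decision functions, a squared loss, a plain subgradient in place of the reformulated analytic gradient, and synthetic data. The function name `coreg_objective`, the step size, and the weight `lam` are all illustrative choices.

```python
import numpy as np


def coreg_objective(w1, w2, x1, x2, y, u1, u2, lam):
    """Two-view objective with an L1 disagreement penalty on unlabeled data.

    Returns the objective value and one subgradient per view.
    """
    r1 = x1 @ w1 - y          # residuals of view 1 on labeled data
    r2 = x2 @ w2 - y          # residuals of view 2 on labeled data
    d = u1 @ w1 - u2 @ w2     # disagreement of the views on unlabeled data
    value = 0.5 * (r1 @ r1 + r2 @ r2) / len(y) + lam * np.abs(d).mean()
    s = np.sign(d)            # subgradient of |d| (0 where d == 0)
    g1 = x1.T @ r1 / len(y) + lam * (u1.T @ s) / len(d)
    g2 = x2.T @ r2 / len(y) - lam * (u2.T @ s) / len(d)
    return value, g1, g2


# Synthetic two-view data: both views are noisy copies of one latent signal.
rng = np.random.default_rng(0)
n_labeled, n_unlabeled = 40, 60
latent = rng.normal(size=(n_labeled + n_unlabeled, 1))
view1 = latent + rng.normal(size=(n_labeled + n_unlabeled, 5))
view2 = latent + rng.normal(size=(n_labeled + n_unlabeled, 4))
y = np.sign(latent[:n_labeled, 0])
x1, u1 = view1[:n_labeled], view1[n_labeled:]
x2, u2 = view2[:n_labeled], view2[n_labeled:]

# Update both views in the same step rather than alternating between them.
w1, w2 = np.zeros(5), np.zeros(4)
history = []
for _ in range(200):
    value, g1, g2 = coreg_objective(w1, w2, x1, x2, y, u1, u2, lam=0.1)
    history.append(value)
    w1 -= 0.05 * g1
    w2 -= 0.05 * g2
```

An L2 co-regularizer would replace `np.abs(d).mean()` with a squared term, which is smooth; the L1 form is not differentiable at zero disagreement, which is why the abstract describes reformulating the objective before deriving a gradient.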

    DEFINITION OF BIOLOGICAL RESPONSES THROUGH THE ANALYSIS OF GENE EXPRESSION PROFILES

    The aim of this PhD project was to develop a pipeline for the analysis of expression data and a set of different strategies for extracting biological information from microarray experiments. The computational pipeline, which processes raw microarray data (images) to define gene expression levels and provides experiment quality assessment and statistical significance tests, was implemented in R, mostly using Bioconductor packages. The first phase aimed to determine gene function by combining silencing experiments with gene expression analysis. Caspase-2 is a member of a cysteine protease family that carries out important roles in apoptosis and in inflammation. Although it is highly conserved from an evolutionary point of view, several contradictory results are found in the literature. Because it is expressed at high levels during neurological development and is strongly involved in apoptotic processes in the adult central nervous system, we decided to silence the gene encoding this enzyme in cells from glioblastoma, a very aggressive brain tumor. The comparative analysis of expression profiles of silenced cells with respect to control cells highlighted the relation between CASP2 and genes involved in cholesterol metabolism. Previous studies have suggested a role for this enzyme in controlling the intracellular level of this metabolite. Therefore, we decided to use data stored in public databases to extend the investigation, including all the other caspases and all the genes connected in some way to cholesterol. After obtaining the data from several different experiments, we computed the correlation between expression levels and then, based on these values, performed a clustering analysis to see which caspases share the same correlation profile. 
After that, the analysis was extended to normal brain and liver tissues, in order to determine whether the situation observed in the pathological condition is unique or matches that found in normal tissues. In the second phase, I analyzed expression data with a completely different purpose. The aim of this project was to define the signaling pathways and the resistance mechanisms induced by the treatment of cancer cells obtained from patients affected by chronic lymphocytic leukemia and treated with a new category of ubiquitin proteasome system (UPS) inhibitors. By comparing transcriptional profiles before and after the treatment, we identified many genes connected with the drug's action at the cellular level whose expression was altered by the UPS inhibitor. Furthermore, considering the differences in responsiveness among the analyzed patients, we could determine some genes responsible for the differing efficacy of the pharmacological treatment.

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models, each of which contributes to solving a biological problem. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as the corresponding score, and it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that the combination of proper clustering, a distance function, and index validation for clusters is suitable for identifying outlier transcripts, which trend differently from the majority of the transcripts; the trend of a transcript is its abundance throughout the different stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared with most other transcripts during disease progression. Unlike the hierarchical clustering method based on Euclidean distance, which finds patterns among the trending transcripts, this method is able to identify those outliers. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular with prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a large number of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are used to find discriminative genes. Our proposed model achieves 100% accuracy in predicting the breast cancer subtypes using the same number of genes or even fewer. Studying breast cancer survivability among patients who received various treatments may help in understanding the relationship between survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy, radiotherapy, or surgery) will survive beyond five years after the treatment. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and managing breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.
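
The Zseq scoring step described in this abstract (count unique k-mers per sequence, convert the scores to z-scores, and drop sequences below a threshold) can be sketched as follows. This is an illustration under assumptions, not the published Zseq tool: k-mers containing the ambiguous base N are simply ignored, the GC-content adjustment is not modeled, and the function names, the default `k`, and the default threshold are hypothetical.

```python
import statistics


def unique_kmer_score(seq, k=4):
    """Number of distinct k-mers in the sequence, ignoring k-mers with 'N'."""
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len({kmer for kmer in kmers if "N" not in kmer})


def filter_by_zscore(seqs, k=4, threshold=-1.0):
    """Keep sequences whose complexity z-score is at least `threshold`."""
    scores = [unique_kmer_score(s, k) for s in seqs]
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:  # all sequences equally complex: nothing to filter
        return list(seqs)
    return [s for s, sc in zip(seqs, scores) if (sc - mean) / sd >= threshold]


# Toy reads: the poly-A read has a single unique 4-mer and is filtered out.
reads = [
    "ACGTACGTTGCAAGCT",
    "AAAAAAAAAAAAAAAA",
    "ACGGTCAGTTCAGGAC",
    "TTGACCGATAGCTAGG",
]
kept = filter_by_zscore(reads)
```

Scoring and filtering each take one pass over the reads, which matches the two sweeps the abstract describes.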