A Rule Based Classification Model to Predict Colon Cancer Survival
Introduction: Colon cancer is the second most common cancer in the world and the fourth most common
cancer in both sexes in Iran, where it accounts for 8.12% of all cancers. Predicting the outcome of cancer
from basic clinical data is very important, and data mining techniques can be used for this purpose.
In our country, data mining studies have covered colon cancer less than lung or breast cancer. Identifying
the factors that influence survival, and modifying them, may increase the survival of colon cancer patients.
Given the high rate of colon cancer and the benefits of data mining for predicting survival, this study
examined the factors influencing the survival of these patients.
Materials and Methods: We used a dataset with four attributes containing the records of 570 patients, of
whom 327 (57.4%) were male and 243 (42.6%) were female. The Trees Random
Forest (TRF), AdaBoost (AD), RBF Network (RBFN), and Multilayer Perceptron (MLP) machine learning
techniques were used with the proposed model, under 10-fold cross-validation, to predict colon
cancer survival. The performance of the machine learning techniques was evaluated by accuracy, precision,
sensitivity, specificity, and area under the ROC curve.
Results: Of the 570 patients, 338 were alive and 232 were dead. Among these
techniques, Trees Random Forest (TRF) appeared at first sight to give better
results than the other techniques (AD, RBFN, and MLP). The accuracy, sensitivity, specificity, and
area under the ROC curve of TRF were 0.76, 0.808, 0.70, and 0.83, respectively.
Conclusions: In this study, the Trees Random Forest (TRF) model, a rule-based
classification model, appeared to be the best model, with the highest accuracy. This model is therefore
recommended as a useful tool for colon cancer survival prediction and for medical decision making
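As a rough illustration of how the reported evaluation measures relate to one another, the sketch below computes accuracy, sensitivity, specificity, and precision from a binary confusion matrix. The abstract does not report a confusion matrix: the counts used here are hypothetical, back-calculated on the assumption that "alive" is the positive class, and the function name `binary_metrics` is introduced only for this example.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all patients classified correctly
        "sensitivity": tp / (tp + fn),     # true positive rate (recall)
        "specificity": tn / (tn + fp),     # true negative rate
        "precision": tp / (tp + fp),       # share of predicted positives that are correct
    }


# Hypothetical counts (NOT reported in the abstract), assuming "alive" is the
# positive class: 338 alive = 273 + 65, 232 dead = 162 + 70, 570 patients in total.
metrics = binary_metrics(tp=273, fn=65, tn=162, fp=70)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts the rounded accuracy, sensitivity, and specificity agree with the values quoted for TRF; other confusion matrices could also do so, which means the example shows the arithmetic only and is not a reconstruction of the study's results.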
Advancing the cell culture landscape: the instructive potential of artificial and natural geometries
This research focuses on how surface structures can influence the behaviour of cells. There is a great diversity of surface structures, which makes it difficult to identify an optimal physical environment for a specific phenotype. Platforms that allow many different designs to be screened at the same time therefore facilitate the identification of an optimal culture environment. Using the TopoChip, which contains 2176 unique microtopographies, structures have been identified that support the phenotype of the tenocyte, the primary cell type of the tendon. The same structures also affect mesenchymal stem cells (MSCs), in which tendon-related genes are activated. Furthermore, the library has been expanded with natural surface topographies that cause unique cell behaviour, such as promoting osteogenesis
Identification of a minimum number of genes to predict triple-negative breast cancer subgroups from gene expression profiles
Background: Triple-negative breast cancer (TNBC) is a very heterogeneous disease. Several gene expression and mutation profiling approaches were used to classify it, and all converged to the identification of distinct molecular subtypes, with some overlapping across different approaches. However, a standardised tool to routinely classify TNBC in the clinics and guide personalised treatment is lacking. We aimed at defining a specific gene signature for each of the six TNBC subtypes proposed by Lehman et al. in 2011 (basal-like 1 (BL1); basal-like 2 (BL2); mesenchymal (M); immunomodulatory (IM); mesenchymal stem-like (MSL); and luminal androgen receptor (LAR)), to be able to accurately predict them. Methods: Lehman's TNBCtype subtyping tool was applied to RNA-sequencing data from 482 TNBC (GSE164458), and a minimal subtype-specific gene signature was defined by combining two class comparison techniques with seven attribute selection methods. Several machine learning algorithms for subtype prediction were used, and the best classifier was applied on microarray data from 72 Italian TNBC and on the TNBC subset of the BRCA-TCGA data set. Results: We identified two signatures with the 120 and 81 top up- and downregulated genes that define the six TNBC subtypes, with prediction accuracy ranging from 88.6 to 89.4%, and even improving after removal of the least important genes. Network analysis was used to identify highly interconnected genes within each subgroup. Two druggable matrix metalloproteinases were found in the BL1 and BL2 subsets, and several druggable targets were complementary to androgen receptor or aromatase in the LAR subset. Several secondary drug-target interactions were found among the upregulated genes in the M, IM and MSL subsets. Conclusions: Our study took full advantage of available TNBC data sets to stratify samples and genes into distinct subtypes, according to gene expression profiles. 
The development of a data mining approach to acquire a large amount of information from several data sets has allowed us to identify a well-determined minimal number of genes that may help in the recognition of TNBC subtypes. These genes, most of which have been previously found to be associated with breast cancer, have the potential to become novel diagnostic markers and/or therapeutic targets for specific TNBC subsets
Analysis of Microarray Data to Confirm Novel Subtype of Breast Cancer
Microarray data of 1,056 breast tumors from NCBI's GEO database were collected and analyzed. The goal was to discover microarrays most similar to a potentially novel subtype of breast cancer characterized by low claudin expression. Claudins form tight junctions, which regulate cell-cell movement of solutes and ions; faulty tight junctions have been correlated with cancer metastasis. Results showed little support for a Claudin-Low cluster. Nine clusters were found, when 5-6 were expected. These additional clusters may reveal novel subtypes
Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ~50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction. Database URL: http://www.cellfinder.org
Genealogy Reconstruction: Methods and applications in cancer and wild populations
Genealogy reconstruction is widely used in biology when relationships among entities are studied. Phylogenies, or evolutionary trees, show the differences between species. They are of profound importance because they help to obtain better understandings of evolutionary processes. Pedigrees, or family trees, on the other hand visualize the relatedness between individuals in a population. The reconstruction of pedigrees and the inference of parentage in general is now a cornerstone in molecular ecology. Applications include the direct inference of gene flow, estimation of the effective population size and parameters describing the population's mating behaviour such as rates of inbreeding.
In the first part of this thesis, we construct genealogies of various types of cancer. Histopathological classification of human tumors relies in part on the degree of differentiation of the tumor sample. To date, there is no objective systematic method to categorize tumor subtypes by maturation. We introduce a novel algorithm to rank tumor subtypes according to the dissimilarity of their gene expression from that of stem cells and fully differentiated tissue, and thereby construct a phylogenetic tree of cancer. We validate our methodology with expression data of leukemia and liposarcoma subtypes and then apply it to a broader group of sarcomas and of breast cancer subtypes. This ranking of tumor subtypes resulting from the application of our methodology allows the identification of genes correlated with differentiation and may help to identify novel therapeutic targets. Our algorithm represents the first phylogeny-based tool to analyze the differentiation status of human tumors.
In contrast to asexually reproducing cancer cell populations, pedigrees of sexually reproducing populations cannot be represented by phylogenetic trees. Pedigrees are directed acyclic graphs (DAGs) and therefore more closely resemble phylogenetic networks, where reticulate events are indicated by vertices with two incoming arcs. In the second part of the thesis, we present a software package for pedigree reconstruction in natural populations using co-dominant genomic markers such as microsatellites and single nucleotide polymorphisms (SNPs). If available, the algorithm makes use of prior information such as known relationships (sub-pedigrees) or the age and sex of individuals. Statistical confidence is estimated by Markov chain Monte Carlo (MCMC) sampling. The accuracy of the algorithm is demonstrated for simulated data as well as an empirical data set with known pedigree. The parentage inference is robust even in the presence of genotyping errors. We further demonstrate the accuracy of the algorithm on simulated clonal populations. We show that the joint estimation of parameters of interest such as the rate of self-fertilization or clonality is possible with high accuracy even with marker panels of moderate power. Classical methods can only assign a very limited number of statistically significant parentages in this case and would therefore fail. The method is implemented in fast, easy-to-use open source software that scales to large datasets with many thousands of individuals.
Abstract
Acknowledgments
1 Introduction
2 Cancer Phylogenies
2.1 Introduction
2.2 Background
2.2.1 Phylogenetic Trees
2.2.2 Microarrays
2.3 Methods
2.3.1 Dataset compilation
2.3.2 Statistical Methods and Analysis
2.3.3 Comparison of our methodology to other methods
2.4 Results
2.4.1 Phylogenetic tree reconstruction method
2.4.2 Comparison of tree reconstruction methods to other algorithms
2.4.3 Systematic analysis of methods and parameters
2.5 Discussion
3 Wild Pedigrees
3.1 Introduction
3.2 The molecular ecologist's tools of the trade
3.2.1 Sibship inference and parental reconstruction
3.2.2 Parentage and paternity inference
3.2.3 Multigenerational pedigree reconstruction
3.3 Background
3.3.1 Pedigrees
3.3.2 Genotypes
3.3.3 Mendelian segregation probability
3.3.4 LOD Scores
3.3.5 Genotyping Errors
3.3.6 IBD coefficients
3.3.7 Bayesian MCMC
3.4 Methods
3.4.1 Likelihood Model
3.4.2 Efficient Likelihood Calculation
3.4.3 Maximum Likelihood Pedigree
3.4.4 Full siblings
3.4.5 Algorithm
3.4.6 Missing Values
3.4.7 Allele frequencies
3.4.8 Rates of Self-fertilization
3.4.9 Rates of Clonality
3.5 Results
3.5.1 Real Microsatellite Data
3.5.2 Simulated Human Population
3.5.3 Simulated Clonal Plant Population
3.6 Discussion
4 Conclusions
A FRANz
A.1 Availability
A.2 Input files
A.2.1 Main input file
A.2.2 Known relationships
A.2.3 Allele frequencies
A.2.4 Sampling locations
A.3 Output files
A.4 Web 2.0 Interface
List of Figures
List of Tables
List of Abbreviations
Bibliography
Curriculum Vitae