Automated extraction of chemical structure information from digital raster images
Background: To search for chemical structures in research articles, the diagrams or text that represent molecules must be translated into a standard chemical file format compatible with cheminformatic search engines. However, the chemical information in research articles is often available only as analog diagrams of chemical structures embedded in digital raster images. Several software systems have been developed to automate the analog-to-digital conversion of chemical structure diagrams in scientific articles, but their algorithmic performance and utility in cheminformatic research have not been investigated. Results: This paper provides a critical review of these systems and reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams from research articles and converting them into standard, searchable chemical file formats. The basic algorithms for recognizing the lines and letters that represent bonds and atoms can be run independently, in sequence, from a graphical user interface, and the algorithm parameters can be readily changed, facilitating further development tailored to a particular chemical database annotation scheme. Compared with existing software programs such as OSRA, Kekule, and CLiDE, our results indicate that ChemReader outperforms the other systems on several sets of sample images from diverse sources, both in the rate of correct outputs and in the accuracy of extracted molecular substructure patterns. Conclusion: The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating entries with published research articles.
Based on its stable performance and high accuracy, ChemReader may be sufficiently accurate for annotating a chemical database with links to scientific research articles.
Peer Reviewed: http://deepblue.lib.umich.edu/bitstream/2027.42/90875/1/Saitou8.pd
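The end product of a recognizer like the one described is a standard chemical file. As a minimal sketch of that last step, the function below serializes a hypothetical recognizer output (atom symbols with 2-D coordinates, plus a bond list) into an MDL molfile-style V2000 block; the atom/bond lists and the exact column padding are illustrative, not ChemReader's actual output.

```python
# Sketch: serializing recognized atoms and bonds into a V2000-style molfile.
# The atom and bond lists below are hypothetical recognizer output.

def to_molfile(atoms, bonds, title="recognized"):
    """atoms: list of (symbol, x, y); bonds: list of (i, j, order), 1-based."""
    lines = [title, "  sketched converter", ""]
    # Counts line: number of atoms, number of bonds, V2000 tag
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for sym, x, y in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {sym:<3}0  0  0  0  0  0  0  0  0  0  0  0")
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)

# An ethanol-like fragment as it might come out of a diagram recognizer
atoms = [("C", 0.0, 0.0), ("C", 1.0, 0.0), ("O", 1.5, 0.87)]
bonds = [(1, 2, 1), (2, 3, 1)]
print(to_molfile(atoms, bonds))
```

A file in this form can then be indexed by standard cheminformatic search engines, which is what makes the extracted structures searchable.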
Effects of Electricity and Altered Conductivity on Rainbow Trout Embryos: A Study to Determine Efficacy of Electricity for Eradication of Invasive Salmonids
Electricity has long been used to facilitate the capture and removal of invasive fishes. Current methods use electrodes to establish an electric field; fish passing through it receive a brief, stunning shock. This method, however, affects only free-swimming individuals and does not reach early life-history stages such as embryos within spawning substrate. This study evaluates the susceptibility of embryonic and larval rainbow trout (Oncorhynchus mykiss) to direct current (DC) fields of 2-20 V/cm in waters of varying conductivity, to determine lethality for invasive salmonid eradication efforts. Rainbow trout embryos (n = 10 embryos/exposure) were initially exposed to homogeneous electric fields for 5 s at a water conductivity of 220 µS/cm, from 1 day post fertilization (DPF)/27 temperature units (TU) to 15 DPF/405 TU. Mortality was assessed 24 hours post exposure, and the LV50 (voltage lethal to 50% of embryos) at 220 µS/cm was determined for each TU. Embryos from six periods of development were then exposed to their respective LV50 voltages in waters of 20-600 µS/cm. Mortality increased with voltage, but overall susceptibility decreased with development. Susceptibility at a constant voltage increased with increasing conductivity and was consistent throughout early development (81-292 TU), but the effects of increased conductivity were not enhanced in eyed embryos after 364 TU. The results indicate that DC fields applied before the eyed embryonic stages, the period of greatest trout embryo susceptibility, are an effective means of eradicating invasive and nuisance salmonids.
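An LV50 of the kind determined here is typically interpolated from mortality observed across a series of test voltages. The sketch below shows the simplest version of that calculation, a linear interpolation of the 50% crossing; the voltages and mortality fractions are made-up illustration values, not data from the study.

```python
# Sketch: interpolating an LV50 (voltage lethal to 50% of embryos) from
# mortality fractions observed at a series of test field strengths.
# All numbers below are hypothetical.

def lv50(voltages, mortality):
    """Linear interpolation of the 50% mortality crossing.
    voltages: ascending field strengths (V/cm); mortality: fractions in [0, 1]."""
    pairs = list(zip(voltages, mortality))
    for (v0, m0), (v1, m1) in zip(pairs, pairs[1:]):
        if m0 <= 0.5 <= m1:
            # Interpolate between the two bracketing exposures
            return v0 + (0.5 - m0) * (v1 - v0) / (m1 - m0)
    raise ValueError("50% mortality not bracketed by the tested voltages")

# Hypothetical exposure series at a single conductivity
volts = [2, 5, 8, 11, 14]          # V/cm
dead  = [0.0, 0.2, 0.4, 0.8, 1.0]  # fraction of embryos dead at 24 h
print(lv50(volts, dead))  # -> 8.75
```

More formal dose-response fits (probit or logistic regression) would replace the linear interpolation in practice, but the bracketing logic is the same.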
Classifications of ovarian cancer tissues by proteomic patterns
Ovarian cancer is a morphologically and biologically heterogeneous disease. The identification of type-specific protein markers for ovarian cancer would provide the basis for more tailored treatments, as well as clues for understanding the molecular mechanisms governing cancer progression. In the present study, we used a novel approach to classify 24 ovarian cancer tissue samples based on the proteomic pattern of each sample. The method involved fractionation according to pI using chromatofocusing with analytical columns in the first dimension, followed by separation of the proteins in each pI fraction using nonporous RP HPLC coupled to an ESI-TOF mass analyzer for molecular weight (MW) analysis. A 2-D mass map of the protein content of each type of ovarian cancer tissue sample, based upon pI versus intact protein MW, was generated. Using this method, the clear cell and serous ovarian carcinoma samples were distinguished histologically by principal component analysis and clustering analysis based on their protein expression profiles, and subtype-specific biomarker candidates of ovarian cancers were identified, which could be further investigated in future clinical studies.
Peer Reviewed: http://deepblue.lib.umich.edu/bitstream/2027.42/55853/1/5846_ftp.pd
A Rational Approach to Personalized Anticancer Therapy: Chemoinformatic Analysis Reveals Mechanistic Gene-Drug Associations
Purpose. To predict the response of cells to chemotherapeutic agents based on gene expression profiles, we performed a chemoinformatic study of a set of standard anticancer agents assayed for activity against a panel of 60 human tumor-derived cell lines from the Developmental Therapeutics Program (DTP) at the National Cancer Institute (NCI).
Peer Reviewed: http://deepblue.lib.umich.edu/bitstream/2027.42/41497/1/11095_2004_Article_465512.pd
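The core computation in this kind of gene-drug association analysis is correlating one gene's expression across the cell-line panel with one drug's activity across the same panel. A minimal sketch with synthetic numbers (the NCI-60 data are not reproduced here):

```python
# Sketch: Pearson correlation of a gene's expression vector with a drug's
# activity vector across a panel of cell lines. Values are synthetic.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression of one gene and activity of one agent,
# one value per cell line in the panel
expr = [1.2, 0.8, 2.5, 3.1, 0.5, 1.9]
act  = [0.9, 0.7, 2.2, 2.9, 0.6, 1.7]
print(round(pearson(expr, act), 3))  # large |r| flags a candidate gene-drug pair
```

Repeating this over all gene-drug pairs, and ranking by |r|, yields the candidate mechanistic associations that a study like this one then interprets biologically.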
Comparison of seven methods for producing Affymetrix expression scores based on False Discovery Rates in disease profiling data
BACKGROUND: A critical step in processing oligonucleotide microarray data is combining the information in multiple probes to produce a single number that best captures the expression level of an RNA transcript. Several systematic studies comparing multiple methods for array processing have used tightly controlled calibration data sets as the basis for comparison. Here we compare performances for seven processing methods using two data sets originally collected for disease profiling studies. An emphasis is placed on understanding sensitivity for detecting differentially expressed genes in terms of two key statistical determinants: test statistic variability for non-differentially expressed genes, and test statistic size for truly differentially expressed genes. RESULTS: In the two data sets considered here, up to seven-fold variation across the processing methods was found in the number of genes detected at a given false discovery rate (FDR). The best performing methods called up to 90% of the same genes differentially expressed, had less variable test statistics under randomization, and had a greater number of large test statistics in the experimental data. Poor performance of one method was directly tied to a tendency to produce highly variable test statistic values under randomization. Based on an overall measure of performance, two of the seven methods (dChip and a trimmed mean approach) are superior in the two data sets considered here. Two other methods (MAS5 and GCRMA-EB) are inferior, while results for the other three methods are mixed. CONCLUSIONS: Choice of processing method has a major impact on differential expression analysis of microarray data. Previously reported performance analyses using tightly controlled calibration data sets are not highly consistent with results reported here using data from human tissue samples.
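"Genes detected at a given FDR" refers to a step-up multiple-testing cutoff such as the standard Benjamini-Hochberg procedure. A minimal sketch of that procedure (the p-values below are illustrative, not taken from either data set):

```python
# Sketch: Benjamini-Hochberg step-up procedure for controlling the FDR.
# p-values are illustrative only.

def bh_reject(pvals, fdr=0.05):
    """Return (sorted) indices of hypotheses rejected at the given FDR level."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr * rank / m:
            k = rank
    # Reject the k smallest p-values (step-up: everything below the last pass)
    return sorted(order[:k])

pv = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.5]
print(bh_reject(pv, fdr=0.05))  # -> [0, 1]
```

Because each processing method produces a different test-statistic (and hence p-value) distribution, the number of indices surviving this cutoff varies by method, which is exactly the seven-fold variation the study reports.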
Performance of array processing methods in disease profiling and other realistic biological studies should be given greater consideration when comparing Affymetrix processing methods.
Optimally splitting cases for training and testing high dimensional classifiers
Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate. Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more samples assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
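The resampling idea can be sketched in a few lines: for each candidate training proportion, repeatedly split the data, fit a classifier, and measure how variable the resulting test-set accuracy estimate is. The sketch below uses a toy nearest-centroid classifier on synthetic one-dimensional data; it illustrates the repeated-splitting loop only, not the paper's actual MSE decomposition.

```python
# Sketch: variability of the accuracy estimate under repeated train/test
# splits, for candidate training proportions. Data and classifier are toys.
import random

random.seed(0)

# Two-class synthetic data: one feature, class means 0.0 and 1.5
data = [(random.gauss(0.0, 1.0), 0) for _ in range(60)] + \
       [(random.gauss(1.5, 1.0), 1) for _ in range(60)]

def accuracy(train, test):
    # Nearest-centroid rule on the single feature
    m0 = sum(x for x, y in train if y == 0) / max(1, sum(1 for _, y in train if y == 0))
    m1 = sum(x for x, y in train if y == 1) / max(1, sum(1 for _, y in train if y == 1))
    correct = sum(1 for x, y in test if (abs(x - m1) < abs(x - m0)) == (y == 1))
    return correct / len(test)

def split_variability(prop, reps=200):
    """Variance of the accuracy estimate over repeated random splits."""
    accs = []
    for _ in range(reps):
        d = data[:]
        random.shuffle(d)
        cut = int(len(d) * prop)
        accs.append(accuracy(d[:cut], d[cut:]))
    mean = sum(accs) / reps
    return sum((a - mean) ** 2 for a in accs) / reps

for prop in (0.5, 2 / 3, 0.8):
    print(f"train {prop:.2f}: accuracy-estimate variance {split_variability(prop):.5f}")
```

The full method additionally accounts for the bias of the estimate and the variability of the classifier itself, which is where the three-part MSE decomposition comes in.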
Metagenes Associated with Survival in Non-Small Cell Lung Cancer
NSCLC (non-small cell lung cancer) comprises about 80% of all lung cancer cases worldwide. Surgery is the most effective treatment for patients with early-stage disease. However, 30%–55% of these patients develop recurrence within 5 years. Therefore, markers that can be used to accurately classify early-stage NSCLC patients into different prognostic groups may be helpful in selecting patients who should receive specific therapies.
Analysis of gene expression data from non-small cell lung carcinoma cell lines reveals distinct sub-classes from those identified at the phenotype level
Microarray data from cell lines of Non-Small Cell Lung Carcinoma (NSCLC) can be used to look for differences in gene expression between cell lines derived from different tumour samples, and to investigate whether these differences can be used to cluster the cell lines into distinct groups. Dividing the cell lines into classes can help to improve diagnosis and the development of screens for new drug candidates. The microarray data were first subjected to quality-control analysis and then normalised using three alternative methods, to reduce the chance that observed differences were artefacts of the normalisation process. The final clustering into sub-classes was carried out conservatively, such that sub-classes were consistent across all three normalisation methods. If there were structure in the cell-line population, it was expected to agree with histological classifications, but this was not found to be the case. To check the biological consistency of the sub-classes, the set of most strongly differentially expressed genes was identified for each pair of clusters, to check whether the genes that most strongly define the sub-classes have biological functions consistent with NSCLC.
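One normalisation method commonly applied in such pipelines (the abstract does not name its three) is quantile normalisation, in which each array's values are replaced by the mean of the values sharing the same rank across all arrays. A minimal sketch on a toy expression matrix:

```python
# Sketch: quantile normalisation of several arrays (one list per chip).
# The toy matrix below is illustrative, not real expression data.

def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one per array/cell line."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # Mean value at each rank across all arrays
    rank_means = [sum(col[r] for col in sorted_cols) / len(arrays) for r in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices in ascending value order
        normed = [0.0] * n
        for r, i in enumerate(order):
            normed[i] = rank_means[r]  # replace value by the mean at its rank
        out.append(normed)
    return out

raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
print(quantile_normalize(raw))
```

After normalisation every array shares an identical value distribution, which is why requiring cluster consistency across several such methods guards against normalisation artefacts.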