16 research outputs found

    A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer

    Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single gene classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as a benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single gene classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single gene classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single gene sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single gene classifiers for predicting outcome in breast cancer.

    A regression model for estimating DNA copy number applied to capture sequencing data

    Motivation: Target enrichment, also referred to as DNA capture, provides an effective way to focus sequencing efforts on a genomic region of interest. Capture data are typically used to detect single-nucleotide variants. They can also be used to detect copy number alterations, which is particularly useful in the context of cancer, where such changes occur frequently. In copy number analysis, it is a common practice to determine log-ratios between test and control samples, but this approach results in a loss of information as it disregards the total coverage or intensity at a locus. Results: We modeled the coverage or intensity of the test sample as a linear function of the control sample. This regression approach is able to deal with regions that are completely deleted, which are problematic for methods that use log-ratios. To demonstrate the utility of our approach, we used capture data to determine copy number for a set of 600 genes in a panel of nine breast cancer cell lines. We found high concordance between our results and those generated using a single-nucleotide polymorphism genotyping platform. When we compared our results with other log-ratio-based methods, including ExomeCNV, we found that our approach produced better overall correlation with SNP data.
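The regression idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the function name `fit_slope`, the choice of a line through the origin, and the read depths are all assumptions made for illustration. The point it shows is that a fitted slope stays well defined when the test coverage is zero everywhere, whereas a log-ratio would not.

```python
def fit_slope(control, test):
    """Least-squares slope of test coverage on control coverage.

    Assumption: the line is forced through the origin (no intercept).
    """
    numerator = sum(c * t for c, t in zip(control, test))
    denominator = sum(c * c for c in control)
    if denominator == 0:
        raise ValueError("control coverage is zero at every locus")
    return numerator / denominator


# Hypothetical per-locus read depths for one region (not from the paper).
control = [100.0, 200.0, 150.0, 50.0]
gained = [150.0, 300.0, 225.0, 75.0]   # test sample with 1.5x the control depth
deleted = [0.0, 0.0, 0.0, 0.0]         # region completely deleted in the test sample

gain_slope = fit_slope(control, gained)
deleted_slope = fit_slope(control, deleted)
print(gain_slope, deleted_slope)
```

In this toy case the deleted region simply yields a slope of zero, while `log2(0 / control)` is undefined at every locus.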

    ENSEMBLE

    Characterization and correction of stray light in TROPOMI-SWIR

    The shortwave infrared (SWIR) spectrometer module of the Tropospheric Monitoring Instrument (TROPOMI), on board the ESA Copernicus Sentinel-5 Precursor satellite, is used to measure atmospheric CO and methane columns. For this purpose, calibrated radiance measurements are needed that are minimally contaminated by instrumental stray light. Therefore, a method has been developed and applied in an on-ground calibration campaign to characterize stray light in detail using a monochromatic quasi-point light source. The dynamic range of the signal was extended to more than 7 orders of magnitude by performing measurements with different exposure times, saturating detector pixels at the longer exposure times. Analysis of the stray light indicates about 4.4 % of the detected light is correctable stray light. An algorithm was then devised and implemented in the operational data processor to correct in-flight SWIR observations in near-real time, based on Van Cittert deconvolution. The stray light is approximated by a far-field kernel independent of position and wavelength and an additional kernel representing the main reflection. Applying this correction significantly reduces the stray-light signal, for example in a simulated dark forest scene close to bright clouds by a factor of about 10. Simulations indicate that this reduces the stray-light error sufficiently for accurate gas-column retrievals. In addition, the instrument contains five SWIR diode lasers that enable long-term, in-flight monitoring of the stray-light distribution.
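Van Cittert deconvolution, which this abstract names as the basis of the correction, can be sketched in one dimension. This is a toy example and not the operational TROPOMI algorithm. It assumes the measured signal equals the ideal signal plus the ideal signal convolved with a small stray-light kernel. The kernel values, the signal and the helper names (`convolve_same`, `add_stray`, `van_cittert`) are hypothetical.

```python
def convolve_same(signal, kernel):
    """Convolve with a centred odd-length kernel, zero-padded at the edges."""
    half = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            src = i - (j - half)
            if 0 <= src < n:
                acc += weight * signal[src]
        out[i] = acc
    return out


def add_stray(ideal, kernel):
    """Assumed forward model: measured = ideal + kernel * ideal."""
    stray = convolve_same(ideal, kernel)
    return [a + s for a, s in zip(ideal, stray)]


def van_cittert(measured, kernel, iterations=3):
    """Iterate estimate <- measured - kernel * estimate, starting from measured."""
    estimate = list(measured)
    for _ in range(iterations):
        stray = convolve_same(estimate, kernel)
        estimate = [m - s for m, s in zip(measured, stray)]
    return estimate


# Hypothetical scene: a bright feature next to a dark one.
ideal = [0.0, 0.0, 1000.0, 1000.0, 0.0, 0.0, 10.0, 10.0]
# Hypothetical kernel whose weights sum to 0.044 (a toy stand-in for a few percent stray light).
kernel = [0.011, 0.011, 0.0, 0.011, 0.011]

measured = add_stray(ideal, kernel)
corrected = van_cittert(measured, kernel)

error_before = max(abs(m - a) for m, a in zip(measured, ideal))
error_after = max(abs(c - a) for c, a in zip(corrected, ideal))
print(error_before, error_after)
```

Under this forward model each iteration multiplies the remaining error by the kernel once more, so a kernel with a small total weight converges in very few iterations, which is consistent with using the scheme for near-real-time processing.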

    Determination of the TROPOMI-SWIR instrument spectral response function

    The Tropospheric Monitoring Instrument (TROPOMI) is the single instrument on board the ESA Copernicus Sentinel-5 Precursor satellite. TROPOMI is a nadir-viewing imaging spectrometer with bands in the ultraviolet and visible, the near infrared and the shortwave infrared (SWIR). An accurate instrument spectral response function (ISRF) is required in the SWIR band where absorption lines of CO, methane and water vapor overlap. In this paper, we report on the determination of the TROPOMI-SWIR ISRF during an extensive on-ground calibration campaign. Measurements are taken with a monochromatic light source scanning the whole detector, using the spectrometer itself to determine the light intensity and wavelength. The accuracy of the resulting ISRF calibration key data is well within the requirement for trace-gas retrievals. Long-term in-flight monitoring of SWIR ISRF is achieved using five on-board diode lasers.

    Feature stability when corrected for gene set size.

    Box plots of the p-values of the Fisher exact test computed for all pairs of gene sets derived from two different data sets. The green box plots represent the values for genes constituting composite features, while the blue box plots (denoted as ‘Control for size SG’) represent the gene-size-corrected values for single gene classifiers. The white stars represent the means of the distributions.
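One way to score the overlap between two gene sets with a Fisher exact test is sketched below. The caption does not say how the test was set up, so the one-sided alternative (more overlap than expected by chance), the function name `overlap_pvalue` and the toy gene identifiers are assumptions.

```python
from math import comb


def overlap_pvalue(set_a, set_b, universe_size):
    """One-sided Fisher exact p-value for the overlap of two gene sets.

    Probability of seeing at least the observed overlap if set_b were drawn
    at random from a universe of universe_size genes (hypergeometric tail).
    """
    a, b = len(set_a), len(set_b)
    observed = len(set_a & set_b)
    total = comb(universe_size, b)
    tail = sum(
        comb(a, x) * comb(universe_size - a, b - x)
        for x in range(observed, min(a, b) + 1)
    )
    return tail / total


# Hypothetical gene sets selected on two different data sets, universe of 10 genes.
genes_from_data_set_1 = {"g1", "g2", "g3"}
genes_from_data_set_2 = {"g1", "g2", "g3"}
genes_from_data_set_3 = {"g7", "g8", "g9"}

p_identical = overlap_pvalue(genes_from_data_set_1, genes_from_data_set_2, 10)
p_disjoint = overlap_pvalue(genes_from_data_set_1, genes_from_data_set_3, 10)
print(p_identical, p_disjoint)
```

Because the p-value depends on the set sizes as well as on the overlap, comparing stability between feature types only makes sense when the sets are of comparable size, which is the motivation for the size correction the caption mentions.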

    Classification results of the ER positive data only.

    The ER positive cases from a single data set were set aside as test set while ER positive cases from the remaining five data sets were merged into a single training set. This was repeated until each data set was employed as left-out test set, resulting in six AUC values. The red lines indicate the median. A: CV-optimized number of features; B: 50 best features.
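The leave-one-data-set-out procedure described in this caption can be sketched as a generator of train/test splits. The data-set names, sample identifiers and labels are hypothetical, and the sketch stops at producing the splits; training a classifier and computing an AUC per split is left out.

```python
def leave_one_dataset_out(datasets):
    """Yield (test_name, training_samples, test_samples) for each data set."""
    for test_name, test_samples in datasets.items():
        # Merge every other data set into a single training set.
        training = [
            sample
            for name, samples in datasets.items()
            if name != test_name
            for sample in samples
        ]
        yield test_name, training, test_samples


# Hypothetical data: six data sets, each a list of (sample_id, label) pairs.
datasets = {
    f"set{i}": [(f"set{i}_s{j}", j % 2) for j in range(4)]
    for i in range(1, 7)
}

splits = list(leave_one_dataset_out(datasets))
for test_name, training, test_samples in splits:
    print(test_name, len(training), len(test_samples))
```

With six data sets this yields six splits, and therefore six AUC values once a classifier is evaluated on each held-out set.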

    Classification results for merged and paired setting.

    In the merged setting one Affymetrix data set is set aside as test and the remaining four Affymetrix data sets are merged into a single data set. This is repeated until every one of the five data sets acted as a test set. Top row: Results for the merged setting. The red lines indicate the median. Bottom row: Only the five Affymetrix data sets were used in the paired setting.

    Performance of the NMC employing single genes and composite features constructed from different secondary data sources.

    For each combination of feature extraction method and secondary data source and each pair of data sets we obtained one AUC value resulting in 30 AUC values per combination. The number of features for each classifier was determined in the cross-validation procedure (CV-optimized). A: Each box plot shows the median, the 25% and 75% percentiles and the standard deviation of the 30 AUC values. Outliers are depicted by crosses. The boxes are sorted in descending order according to the median. B: This panel shows the result of pairwise comparisons between all combinations of feature extraction methods and secondary data sources. If, for a given combination of training and test data set, the AUC value of classifier i is higher (lower) than the AUC value of classifier j on the same test data set, it is counted as a win (loss) for classifier i. Element (i, j) in the matrix represents the ratio of wins to losses of method i compared to method j. Green indicates an overall win, red an overall loss and white represents draws. The rows and columns are sorted as in Panel A. Abbreviations: SG: Single genes; C: Chuang; L: Lee and T: Taylor.
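The win/loss counting described for Panel B can be sketched as follows. The AUC values are invented for illustration and the function name `win_loss_counts` is hypothetical. The sketch returns the raw counts rather than the ratio, so that a pair with zero losses does not trigger a division by zero.

```python
def win_loss_counts(auc_by_method):
    """Count wins and losses for every ordered pair of methods.

    auc_by_method maps a method label to its AUC values, listed in the same
    order of (training, test) data-set combinations for every method.
    """
    counts = {}
    for i, auc_i in auc_by_method.items():
        for j, auc_j in auc_by_method.items():
            if i == j:
                continue
            wins = sum(a > b for a, b in zip(auc_i, auc_j))
            losses = sum(a < b for a, b in zip(auc_i, auc_j))
            counts[(i, j)] = (wins, losses)  # ties count as neither
    return counts


# Hypothetical AUC values on three (training, test) combinations.
auc_by_method = {
    "SG": [0.70, 0.65, 0.75],
    "C": [0.68, 0.65, 0.74],
}

counts = win_loss_counts(auc_by_method)
print(counts[("SG", "C")], counts[("C", "SG")])
```

The matrix is antisymmetric in the sense that the wins of method i over method j are the losses of j against i, which is why the figure can color each cell by the overall outcome.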