Definition and validation of a radiomics signature for loco-regional tumour control in patients with locally advanced head and neck squamous cell carcinoma
Purpose: To develop and validate a CT-based radiomics signature for the prognosis of loco-regional tumour control (LRC) in patients with locally advanced head and neck squamous cell carcinoma (HNSCC) treated by primary radiochemotherapy (RCTx) based on retrospective data from 6 partner sites of the German Cancer Consortium - Radiation Oncology Group (DKTK-ROG).
Material and methods: Pre-treatment CT images of 318 patients with locally advanced HNSCC were collected. Four hundred forty-six features were extracted from each primary tumour volume and then filtered through stability analysis and clustering. First, a baseline signature was developed from demographic and tumour-associated clinical parameters. This signature was then supplemented by CT imaging features. A final signature was derived using repeated 3-fold cross-validation on the discovery cohort. Performance in external validation was assessed by the concordance index (C-Index). Furthermore, calibration and patient stratification in groups with low and high risk for loco-regional recurrence were analysed.
Results: For the clinical baseline signature, only the primary tumour volume was selected. The final signature combined the tumour volume with two independent radiomics features. It achieved moderatel
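The concordance index used for validation above can be computed as the fraction of comparable patient pairs whose predicted risks agree with the observed event ordering. A minimal sketch of Harrell's C-index, assuming higher predicted risk should correspond to earlier recurrence and handling censoring only through the event flag:

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable patient pairs
    whose predicted risk ordering matches the observed event ordering.
    times: observed times; events: 1 if the event occurred, 0 if censored;
    risks: model-predicted risk scores (higher = worse prognosis)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if patient i had an event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1       # correctly ordered pair
                elif risks[i] == risks[j]:
                    concordant += 0.5     # ties count half, by convention
    return concordant / comparable
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering of recurrence risk.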
A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
Motivations like domain adaptation, transfer learning, and feature learning
have fueled interest in inducing embeddings for rare or unseen words, n-grams,
synsets, and other textual features. This paper introduces a la carte
embedding, a simple and general alternative to the usual word2vec-based
approaches for building such representations that is based upon recent
theoretical results for GloVe-like embeddings. Our method relies mainly on a
linear transformation that is efficiently learnable using pretrained word
vectors and linear regression. This transform is applicable on the fly in the
future when a new text feature or rare word is encountered, even if only a
single usage example is available. We introduce a new dataset showing how the a
la carte method requires fewer examples of words in context to learn
high-quality embeddings and we obtain state-of-the-art results on a nonce task
and some unsupervised document classification tasks. Comment: 11 pages, 2 figures, to appear in ACL 201
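The core of the method described above is a linear transform, learned by ordinary least squares, that maps a word's average context vector to its pretrained embedding; the same transform then induces a vector for an unseen word from as little as one usage example. A toy sketch on synthetic data (the exact construction of the context averages here is an illustrative assumption, not the paper's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200
U = rng.normal(size=(n, d))     # average context vectors, one row per word (toy data)
A_true = rng.normal(size=(d, d))
V = U @ A_true                  # "pretrained" word vectors, linearly related to U

# Learn the a la carte transform A by linear regression: minimise ||U A - V||.
A, *_ = np.linalg.lstsq(U, V, rcond=None)

# Induce an embedding for a new word on the fly from a single usage example:
u_new = rng.normal(size=d)      # average of the context vectors in that example
v_new = u_new @ A
```

Because the transform is just a d-by-d matrix, applying it to a rare word or n-gram costs one matrix-vector product.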
Defining hierarchical protein interaction networks from spectral analysis of bacterial proteomes
Cellular behaviors emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be defined from the statistical pattern of proteome variation measured across thousands of diverse bacteria and that these networks reflect the emergence of complex bacterial phenotypes. Our results are validated through gene-set enrichment analysis and comparison to existing experimentally derived databases. We demonstrate the biological utility of our approach by creating a model of motility i
A scalable bootstrap for massive data.
Summary. The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification of tuning parameters (such as the number of subsampled data points), and they often require knowledge of the estimator's convergence rate, in contrast with the bootstrap. As an alternative, we introduce the 'bag of little bootstraps' (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
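The BLB procedure can be sketched concretely for a simple estimator. The idea: draw s small subsets of size b = n^gamma, and within each subset simulate full-size (n-point) bootstrap resamples cheaply via multinomial counts rather than materialising n points; average the per-subset quality estimates. A minimal sketch for the standard error of the mean (parameter defaults here are illustrative, not the paper's recommendations):

```python
import numpy as np

def blb_stderr(data, s=20, r=50, gamma=0.6, seed=0):
    """Bag of little bootstraps estimate of the standard error of the mean.
    For each of s subsets of size b = n**gamma, draw r resamples of nominal
    size n (represented as multinomial counts over the subset), measure the
    spread of the resampled means, and average across subsets."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    b = int(n ** gamma)
    spreads = []
    for _ in range(s):
        subset = rng.choice(data, size=b, replace=False)
        means = []
        for _ in range(r):
            # A size-n resample of b points is just a vector of counts.
            counts = rng.multinomial(n, np.ones(b) / b)
            means.append(np.dot(counts, subset) / n)
        spreads.append(np.std(means, ddof=1))
    return float(np.mean(spreads))
```

Each subset's inner loop touches only b points, so the s subset computations parallelise trivially, which is the property that suits BLB to distributed architectures.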
Mapping Species Composition of Forests and Tree Plantations in Northeastern Costa Rica with an Integration of Hyperspectral and Multitemporal Landsat Imagery
An efficient means to map tree plantations is needed to detect tropical land use change and evaluate reforestation projects. To analyze recent tree plantation expansion in northeastern Costa Rica, we examined the potential of combining moderate-resolution hyperspectral imagery (2005 HyMap mosaic) with multitemporal, multispectral data (Landsat) to accurately classify (1) general forest types and (2) tree plantations by species composition. Following a linear discriminant analysis to reduce data dimensionality, we compared four Random Forest classification models: hyperspectral data (HD) alone; HD plus interannual spectral metrics; HD plus a multitemporal forest regrowth classification; and all three models combined. The fourth, combined model achieved overall accuracy of 88.5%. Adding multitemporal data significantly improved classification accuracy (p < 0.0001) of all forest types, although the effect on tree plantation accuracy was modest. The hyperspectral data alone classified six species of tree plantations with 75% to 93% producer's accuracy; adding multitemporal spectral data increased accuracy only for two species with dense canopies. Non-native tree species had higher classification accuracy overall and made up the majority of tree plantations in this landscape. Our results indicate that combining occasionally acquired hyperspectral data with widely available multitemporal satellite imagery enhances mapping and monitoring of reforestation in tropical landscapes.
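The modelling pipeline above, linear discriminant analysis for dimensionality reduction followed by a Random Forest classifier, can be sketched with scikit-learn on synthetic stand-in data (the feature counts and classes here are illustrative, not the study's bands or forest types):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in for hyperspectral bands: 60 correlated features, 4 "forest types".
X, y = make_classification(n_samples=500, n_features=60, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LDA projects onto at most (n_classes - 1) discriminant axes,
# shrinking the feature space before the forest is grown.
model = make_pipeline(LinearDiscriminantAnalysis(),
                      RandomForestClassifier(random_state=0))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Using LDA as a supervised reduction step keeps class-separating directions while discarding redundant, highly correlated bands, which is why it pairs naturally with high-dimensional imagery.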
Adversarial Random Forests for Density Estimation and Generative Modeling
We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike classic tree-based alternatives, our approach provides smooth (un)conditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art probabilistic circuits and deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. An accompanying R package, arf, is available on CRAN.
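The first generation/discrimination round of the alternating scheme can be illustrated in miniature: generate naive synthetic data by resampling each column independently (which destroys dependencies), then let a forest act as discriminator between real and synthetic rows. This toy sketch is a simplified illustration of one round, not the arf package's algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Toy "real" data with strongly correlated columns.
x1 = rng.normal(size=n)
real = np.column_stack([x1, x1 + 0.1 * rng.normal(size=n)])

# Naive synthetic data: resample each column independently,
# preserving marginals but breaking the correlation structure.
synth = np.column_stack([rng.choice(real[:, j], size=n) for j in range(2)])

# Discrimination round: a forest tries to tell real rows from synthetic ones.
X = np.vstack([real, synth])
y = np.r_[np.ones(n), np.zeros(n)]
clf = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)

# High out-of-bag accuracy means the synthetic data is still easily
# distinguishable, so further generation rounds (resampling within the
# discriminator's leaves) would be needed to refine it.
```

Once the discriminator's accuracy approaches 0.5, the synthetic distribution is indistinguishable from the real one by that forest, which is the adversarial stopping criterion in spirit.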