Small sample feature selection
High-throughput technologies for rapid measurement of vast numbers of biological
variables offer the potential for highly discriminatory diagnosis and prognosis;
however, high dimensionality together with small samples creates the need for
feature selection, while at the same time making feature-selection algorithms less
reliable. Feature selection is required to avoid overfitting, and the combinatorial
nature of the problem demands a suboptimal feature-selection algorithm.
In this dissertation, we show via three different approaches that feature selection
is problematic in small-sample settings. First, we examined the feature-ranking
performance of several kinds of error estimators for different classification rules, by
considering all feature subsets and using two measures of performance. The results
show that their ranking is strongly affected by inaccurate error estimation. Secondly,
since enumerating all feature subsets is computationally impossible in practice, a
suboptimal feature-selection algorithm is often employed to find from a large set of
potential features a small subset with which to classify the samples. If error estimation
is required for a feature-selection algorithm, then the impact of error estimation can
be greater than the choice of algorithm. Lastly, we took a regression approach by
comparing the classification errors for the optimal feature sets and the errors for
the feature sets found by feature-selection algorithms. Our study shows that it is
unlikely that feature selection will yield a feature set whose error is close to that of
the optimal feature set, and the inability to find a good feature set should not lead to
the conclusion that good feature sets do not exist.
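The gap between an optimal feature set and one found by a suboptimal algorithm can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the error table `TOY_ERROR` is hypothetical (it is not data from the dissertation), and sequential forward selection is used only as one familiar example of a suboptimal search, not necessarily one of the algorithms the author studied.

```python
from itertools import combinations

# Hypothetical classification errors for every subset of three features.
# Features 1 and 2 are weak alone but strong together, which is the kind
# of interaction that misleads a greedy search.
TOY_ERROR = {
    frozenset({0}): 0.30,
    frozenset({1}): 0.45,
    frozenset({2}): 0.45,
    frozenset({0, 1}): 0.28,
    frozenset({0, 2}): 0.28,
    frozenset({1, 2}): 0.05,
}


def subset_error(subset):
    """Look up the (hypothetical) error of a feature subset."""
    return TOY_ERROR[frozenset(subset)]


def exhaustive_search(n_features, k, error_fn):
    """Return the size-k subset with the lowest error by full enumeration."""
    return min(combinations(range(n_features), k), key=error_fn)


def forward_selection(n_features, k, error_fn):
    """Greedily add the feature that most reduces the error at each step."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        best = min(remaining, key=lambda f: error_fn(selected + [f]))
        selected.append(best)
    return tuple(sorted(selected))


optimal = exhaustive_search(3, 2, subset_error)
greedy = forward_selection(3, 2, subset_error)
print("optimal:", optimal, subset_error(optimal))
print("greedy: ", greedy, subset_error(greedy))
```

In this toy table the greedy search commits to feature 0 first and never reaches the pair with the lowest error, even though a good feature set exists. With estimated rather than true errors, as in small-sample settings, the comparison would be noisier still.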
Performance of Feature Selection Methods
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used, and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.
Candidate gene biodosimetry markers of exposure to external ionizing radiation in human blood: A systematic review
Purpose
To compile a list of genes that have been reported to be affected by external ionizing radiation (IR) and to assess their performance as candidate biomarkers for individual human radiation dosimetry.
Methods
Eligible studies were identified through extensive searches of the online databases from 1978 to 2017. Original English-language publications of microarray studies assessing radiation-induced changes in gene expression levels in human blood after external IR were included. Genes identified in at least half of the selected studies were retained for bio-statistical analysis in order to evaluate their diagnostic ability.
Results
Twenty-four studies met the criteria and were included in this study. Radiation-induced expression of 10,170 unique genes was identified, and the 31 genes that were identified in at least 50% of studies (12/24 studies) were selected for diagnostic power analysis. Twenty-seven genes showed a significant Spearman’s correlation with radiation dose. Individually, TNFSF4, FDXR, MYC, ZMAT3 and GADD45A provided the best discrimination of radiation dose < 2 Gy and dose ≥ 2 Gy according to their maximized Youden’s index (0.67, 0.55, 0.55, 0.55 and 0.53 respectively). Moreover, 12 combinations of three genes display an area under the Receiver Operating Characteristic (ROC) curve (AUC) = 1, reinforcing the concept of biomarker combinations instead of looking for an ideal and unique biomarker.
Conclusion
Gene expression is a promising approach for radiation dosimetry assessment. A list of robust candidate biomarkers has been identified from analysis of the studies published to date, confirming for example the potential of well-known genes such as FDXR and TNFSF4 or highlighting other promising genes such as ZMAT3. However, heterogeneity in protocols and analysis methods will require additional studies to confirm these results.
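The maximized Youden's index reported in the Results can be sketched as follows. Youden's index at a threshold is J = sensitivity + specificity - 1, and the maximized value is the largest J over all candidate thresholds. The expression values and labels below are hypothetical toy numbers, not data from the review, and the rule "expression at or above the threshold means dose ≥ 2 Gy" is an assumption made for illustration.

```python
def youden_index(scores, labels, threshold):
    """J = sensitivity + specificity - 1 for the rule 'score >= threshold is positive'."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def max_youden(scores, labels):
    """Return (best J, threshold) over all observed score values."""
    return max((youden_index(scores, labels, t), t) for t in sorted(set(scores)))


# Hypothetical expression levels; label 1 stands for dose >= 2 Gy, 0 for dose < 2 Gy.
toy_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.2]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1]

best_j, best_threshold = max_youden(toy_scores, toy_labels)
print(best_j, best_threshold)
```

A value of J = 1 corresponds to perfect separation of the two dose groups, and J = 0 to a marker no better than chance.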
Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls
Hierarchical and tree-like data sets arise in many applications, including
language processing, graph data mining, phylogeny and genomics. It is known
that tree-like data cannot be embedded into Euclidean spaces of finite
dimension with small distortion. This problem can be mitigated through the use
of hyperbolic spaces. When such data also has to be processed in a distributed
and privatized setting, it becomes necessary to work with new federated
learning methods tailored to hyperbolic spaces. As an initial step towards the
development of the field of federated learning in hyperbolic spaces, we propose
the first known approach to federated classification in hyperbolic spaces. Our
contributions are as follows. First, we develop distributed versions of convex
SVM classifiers for Poincaré discs. In this setting, the information conveyed
from clients to the global classifier consists of convex hulls of clusters present in
individual client data. Second, to avoid label switching issues, we introduce a
number-theoretic approach for label recovery based on the so-called integer
sequences. Third, we compute the complexity of the convex hulls in
hyperbolic spaces to assess the extent of data leakage; at the same time, in
order to limit the communication cost for the hulls, we propose a new
quantization method for the Poincaré disc coupled with Reed-Solomon-like
encoding. Fourth, at server level, we introduce a new approach for aggregating
convex hulls of the clients based on balanced graph partitioning. We test our
method on a collection of diverse data sets, including hierarchical single-cell
RNA-seq data from different patients distributed across different repositories
that have stringent privacy constraints. The classification accuracy of our
method is up to better than its Euclidean counterpart,
demonstrating the importance of privacy-preserving learning in hyperbolic
spaces.
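For readers unfamiliar with the Poincaré disc mentioned above, the standard geodesic distance in that model (unit disc, curvature -1) can be sketched in a few lines. This is a general textbook formula, not code from the paper, and the function name `poincare_distance` is a hypothetical choice for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit disc.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Distances grow without bound as points approach the boundary of the disc,
# which is what leaves room for tree-like data.
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
print(poincare_distance((0.0, 0.0), (0.99, 0.0)))
```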
Inference of Gene Regulatory Networks Using Time-Series Data: A Survey
The advent of high-throughput technologies such as microarrays has provided a platform for studying how different cellular components work together, and has thus created enormous interest in mathematically modeling biological networks, particularly gene regulatory networks (GRNs). Of particular interest is modeling and inference on time-series data, which capture a more thorough picture of the system than non-temporal data do. We give an extensive review of the methodologies that have been used on time-series data. Recognizing that validation is an integral part of the inference paradigm, we also discuss the principles and challenges in evaluating the performance of different methods. This survey gives a panoramic view of these topics, in the hope that readers will be inspired to improve and expand the repository of GRN inference and validation tools.
miR-1254 and miR-574-5p: Serum-Based microRNA Biomarkers for Early-Stage Non-small Cell Lung Cancer
RNAi phenotype profiling of kinases identifies potential therapeutic targets in Ewing's sarcoma
Background
Ewing's sarcomas are aggressive musculoskeletal tumors occurring most frequently in the long and flat bones as a solitary lesion, mostly during the teenage years. With current treatments, a significant number of patients relapse, and survival is poor for those with metastatic disease. As part of novel target discovery in Ewing's sarcoma, we applied RNAi-mediated phenotypic profiling to identify kinase targets involved in growth and survival of Ewing's sarcoma cells.
Results
Four Ewing's sarcoma cell lines, TC-32, TC-71, SK-ES-1 and RD-ES, were tested in high-throughput RNAi screens using a siRNA library targeting 572 kinases. Knockdown by 25 siRNAs reduced the growth of all four Ewing's sarcoma cell lines in replicate screens. Of these, 16 siRNAs were specific and reduced proliferation of Ewing's sarcoma cells as compared to normal fibroblasts. Secondary validation and preliminary mechanistic studies highlighted the kinases STK10 and TNK2 as having important roles in growth and survival of Ewing's sarcoma cells. Furthermore, knockdown of STK10 and TNK2 by siRNA showed increased apoptosis.
Conclusion
In summary, RNAi-based phenotypic profiling proved to be a powerful gene target discovery strategy, leading to successful identification and validation of STK10 and TNK2 as two novel potential therapeutic targets for Ewing's sarcoma.
Global, regional, and national burden of colorectal cancer and its risk factors, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019
Funding: F Carvalho and E Fernandes acknowledge support from Fundação para a Ciência e a Tecnologia, I.P. (FCT), in the scope of the project UIDP/04378/2020 and UIDB/04378/2020 of the Research Unit on Applied Molecular Biosciences UCIBIO and the project LA/P/0140/2020 of the Associate Laboratory Institute for Health and Bioeconomy i4HB; FCT/MCTES through the project UIDB/50006/2020. J Conde acknowledges the European Research Council Starting Grant (ERC-StG-2019-848325). V M Costa acknowledges the grant SFRH/BHD/110001/2015, received by Portuguese national funds through Fundação para a Ciência e Tecnologia (FCT), IP, under the Norma Transitória DL57/2016/CP1334/CT0006.
The global burden of adolescent and young adult cancer in 2019 : a systematic analysis for the Global Burden of Disease Study 2019
Background In estimating the global burden of cancer, adolescents and young adults with cancer are often overlooked, despite being a distinct subgroup with unique epidemiology, clinical care needs, and societal impact. Comprehensive estimates of the global cancer burden in adolescents and young adults (aged 15-39 years) are lacking. To address this gap, we analysed results from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2019, with a focus on the outcome of disability-adjusted life-years (DALYs), to inform global cancer control measures in adolescents and young adults. Methods Using the GBD 2019 methodology, international mortality data were collected from vital registration systems, verbal autopsies, and population-based cancer registry inputs modelled with mortality-to-incidence ratios (MIRs). Incidence was computed with mortality estimates and corresponding MIRs. Prevalence estimates were calculated using modelled survival and multiplied by disability weights to obtain years lived with disability (YLDs). Years of life lost (YLLs) were calculated as age-specific cancer deaths multiplied by the standard life expectancy at the age of death. The main outcome was DALYs (the sum of YLLs and YLDs). Estimates were presented globally and by Socio-demographic Index (SDI) quintiles (countries ranked and divided into five equal SDI groups), and all estimates were presented with corresponding 95% uncertainty intervals (UIs). For this analysis, we used the age range of 15-39 years to define adolescents and young adults. Findings There were 1.19 million (95% UI 1.11-1.28) incident cancer cases and 396 000 (370 000-425 000) deaths due to cancer among people aged 15-39 years worldwide in 2019. 
The highest age-standardised incidence rates occurred in high SDI (59.6 [54.5-65.7] per 100 000 person-years) and high-middle SDI countries (53.2 [48.8-57.9] per 100 000 person-years), while the highest age-standardised mortality rates were in low-middle SDI (14.2 [12.9-15.6] per 100 000 person-years) and middle SDI (13.6 [12.6-14.8] per 100 000 person-years) countries. In 2019, adolescent and young adult cancers contributed 23.5 million (21.9-25.2) DALYs to the global burden of disease, of which 2.7% (1.9-3.6) came from YLDs and 97.3% (96.4-98.1) from YLLs. Cancer was the fourth leading cause of death and tenth leading cause of DALYs in adolescents and young adults globally. Interpretation Adolescent and young adult cancers contributed substantially to the overall adolescent and young adult disease burden globally in 2019. These results provide new insights into the distribution and magnitude of the adolescent and young adult cancer burden around the world. With notable differences observed across SDI settings, these estimates can inform global and country-level cancer control efforts. Copyright (C) 2021 The Author(s). Published by Elsevier Ltd.
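The DALY arithmetic described in the Methods can be sketched as below: YLLs are deaths multiplied by the standard life expectancy at the age of death, YLDs are prevalent cases multiplied by disability weights, and DALYs are their sum. All input numbers, the health-state names and the function names are hypothetical placeholders chosen for illustration; they are not GBD 2019 values, and the sketch omits the modelling and uncertainty steps of the actual study.

```python
def years_of_life_lost(deaths_by_age, life_expectancy_by_age):
    """YLL: deaths at each age times the standard life expectancy at that age."""
    return sum(deaths * life_expectancy_by_age[age]
               for age, deaths in deaths_by_age.items())


def years_lived_with_disability(prevalence_by_state, disability_weights):
    """YLD: prevalent cases in each health state times its disability weight."""
    return sum(cases * disability_weights[state]
               for state, cases in prevalence_by_state.items())


def dalys(yll, yld):
    """DALYs are the sum of YLLs and YLDs."""
    return yll + yld


# Hypothetical toy inputs (not GBD data).
toy_deaths = {20: 10, 35: 5}                      # deaths by age at death
toy_life_expectancy = {20: 60.0, 35: 45.0}        # remaining years at that age
toy_prevalence = {"treatment": 100, "remission": 200}
toy_weights = {"treatment": 0.5, "remission": 0.125}

yll = years_of_life_lost(toy_deaths, toy_life_expectancy)
yld = years_lived_with_disability(toy_prevalence, toy_weights)
total = dalys(yll, yld)
print(yll, yld, total, yld / total)
```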