    Representing complex data using localized principal components with application to astronomical data

    Often the relation between the variables constituting a multivariate data space might be characterized by one or more of the terms "nonlinear", "branched", "disconnected", "bent", "curved", "heterogeneous", or, more generally, "complex". In these cases, simple principal component analysis (PCA) as a tool for dimension reduction can fail badly. Of the many alternative approaches proposed so far, local approximations of PCA are among the most promising. This paper gives a short review of localized versions of PCA, focusing on local principal curves and local partitioning algorithms. Furthermore, we discuss projections other than the local principal components. When performing local dimension reduction for regression or classification problems, it is important to focus not only on the manifold structure of the covariates, but also on the response variable(s). Local principal components only achieve the former, whereas localized regression approaches concentrate on the latter. Local projection directions derived from the partial least squares (PLS) algorithm offer an interesting trade-off between these two objectives. We apply these methods to several real data sets. In particular, we consider simulated astrophysical data from the future Galactic survey mission Gaia.
    Comment: 25 pages. In "Principal Manifolds for Data Visualization and Dimension Reduction", A. Gorban, B. Kegl, D. Wunsch, and A. Zinovyev (eds), Lecture Notes in Computational Science and Engineering, Springer, 2007, pp. 180--204, http://www.springer.com/dal/home/generic/search/results?SGWID=1-40109-22-173750210-

    Survival associated pathway identification with group Lp penalized global AUC maximization

    It has been demonstrated that genes in a cell do not act independently. They interact with one another to complete certain biological processes or to implement certain molecular functions. How to incorporate biological pathways or functional groups into the model and identify survival-associated gene pathways is still a challenging problem. In this paper, we propose a novel iterative gradient-based method for survival analysis with group Lp penalized global AUC summary maximization. Unlike LASSO, Lp (p < 1) (with its special implementation entitled adaptive LASSO) is asymptotically unbiased and has oracle properties [1]. We first extend Lp for individual gene identification to a group Lp penalty for pathway selection, and then develop a novel iterative gradient algorithm for penalized global AUC summary maximization (IGGAUCS). This method incorporates the genetic pathways into global AUC summary maximization and identifies survival-associated pathways instead of individual genes. The tuning parameters are determined using 10-fold cross validation with training data only. The prediction performance is evaluated using test data. We apply the proposed method to survival outcome analysis with gene expression profiles and identify multiple pathways simultaneously. Experimental results with simulation and gene expression data demonstrate that the proposed procedures can be used for identifying important biological pathways that are related to survival phenotype and for building a parsimonious model for predicting survival times.

    Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering

    Clustering analysis is one of the most commonly used techniques for uncovering patterns in data mining. Most clustering methods require establishing the number of clusters beforehand. However, given the size of the data sets currently in use, predicting that value is computationally expensive in most cases. In this article, we present a clustering technique that avoids this requirement by using hierarchical clustering. There are many examples of this procedure in the literature, most of them focusing on the divisive (descending) subtype, while in this article we cover the agglomerative (ascending) subtype. Although it costs more in computation and time, it yields very valuable information about the membership of elements in clusters and about how the clusters group together, that is to say, their dendrogram. Finally, we use several data sets of varying dimensionality. For each of them, we compute internal validation indexes to test the algorithm developed, studying which index gives the best possible clustering.
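
    The idea in this abstract, building the full agglomerative hierarchy and then letting an internal validation index pick the cut, can be sketched as follows. This is a minimal illustration and not the article's algorithm: the average-linkage rule, the silhouette index, the helper names, and the toy points are all assumptions chosen for the example.

```python
import math


def average_linkage(points, c1, c2):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Return every level of the hierarchy, from n singletons down to one cluster."""
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under average linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels


def silhouette(points, clusters):
    """Mean silhouette width (assumed index); singleton clusters score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(math.dist(points[i], points[j]) for j in cluster if j != i)
            a /= len(cluster) - 1
            # b: mean distance to the nearest other cluster.
            b = min(
                sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def best_partition(points):
    """Pick the hierarchy level with the highest mean silhouette (2 <= k < n)."""
    levels = agglomerate(points)
    candidates = [p for p in levels if 2 <= len(p) < len(points)]
    return max(candidates, key=lambda p: silhouette(points, p))


# Hypothetical toy data: three visually separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
          (5.0, 5.0), (5.1, 5.2),
          (9.0, 0.0), (9.2, 0.1)]
best = best_partition(points)
```

    The number of clusters is never supplied; it falls out of comparing the index across the levels of the dendrogram. Other internal indexes could be swapped in for `silhouette` without changing the rest.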

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research

    V3 Loop Sequence Space Analysis Suggests Different Evolutionary Patterns of CCR5- and CXCR4-Tropic HIV

    The V3 loop of human immunodeficiency virus type 1 (HIV-1) is critical for coreceptor binding and is the main determinant of which of the cellular coreceptors, CCR5 or CXCR4, the virus uses for cell entry. The aim of this study is to provide a large-scale data driven analysis of HIV-1 coreceptor usage with respect to the V3 loop evolution and to characterize CCR5- and CXCR4-tropic viral phenotypes previously studied in small- and medium-scale settings. We use different sequence similarity measures, phylogenetic and clustering methods in order to analyze the distribution in sequence space of roughly 1000 V3 loop sequences and their tropism phenotypes. This analysis affords a means of characterizing those sequences that are misclassified by several sequence-based coreceptor prediction methods, as well as predicting the coreceptor using the location of the sequence in sequence space and of relating this location to the CD4+ T-cell count of the patient. We support previous findings that the usage of CCR5 is correlated with relatively high sequence conservation whereas CXCR4-tropic viruses spread over larger regions in sequence space. The incorrectly predicted sequences are mostly located in regions in which their phenotype represents the minority or in close vicinity of regions dominated by the opposite phenotype. Nevertheless, the location of the sequence in sequence space can be used to improve the accuracy of the prediction of the coreceptor usage. Sequences from patients with high CD4+ T-cell counts are relatively highly conserved as compared to those of immunosuppressed patients. Our study thus supports hypotheses of an association of immune system depletion with an increase in V3 loop sequence variability and with the escape of the viral sequence to distant parts of the sequence space

    Weighted Fisher Discriminant Analysis in the Input and Feature Spaces

    Fisher Discriminant Analysis (FDA) is a subspace learning method which minimizes and maximizes the intra- and inter-class scatters of data, respectively. Although, in FDA, all the pairs of classes are treated the same way, some classes are closer than the others. Weighted FDA assigns weights to the pairs of classes to address this shortcoming of FDA. In this paper, we propose a cosine-weighted FDA as well as an automatically weighted FDA in which weights are found automatically. We also propose a weighted FDA in the feature space to establish a weighted kernel FDA for both existing and newly proposed weights. Our experiments on the ORL face recognition dataset show the effectiveness of the proposed weighting schemes.
    Comment: Accepted (to appear) in International Conference on Image Analysis and Recognition (ICIAR) 2020, Springe
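
    As background for the first sentence of this abstract, the sketch below computes the classical two-class Fisher direction w = S_w^{-1}(m_1 - m_0) for 2-D points. It shows plain, unweighted FDA only: with two classes there is a single class pair, so the paper's pairwise weighting schemes do not come into play. The function names and sample points are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]


def fisher_direction(class0, class1):
    """Two-class Fisher direction w = S_w^{-1} (m1 - m0) in two dimensions."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # Within-class (intra-class) scatter is the sum over both classes.
    sw = [[s0[r][c] + s1[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    # Closed-form inverse of the 2x2 matrix applied to the mean difference.
    w0 = (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det
    w1 = (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det
    return (w0, w1)


def project(points, w):
    """Project each point onto the direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Hypothetical sample data for two classes.
class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]
w = fisher_direction(class0, class1)
proj0, proj1 = project(class0, w), project(class1, w)
```

    On this toy data the one-dimensional projections of the two classes do not overlap, which is the behaviour the scatter criterion is designed to encourage.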

    Dating Phylogenies with Hybrid Local Molecular Clocks

    BACKGROUND: Because rates of evolution and species divergence times cannot be estimated directly from molecular data, all current dating methods require that specific assumptions be made before inferring any divergence time. These assumptions typically bear either on rates of molecular evolution (molecular clock hypothesis, local clocks models) or on both rates and times (penalized likelihood, Bayesian methods). However, most of these assumptions can affect estimated dates, oftentimes because they underestimate large amounts of rate change. PRINCIPAL FINDINGS: A significant modification to a recently proposed ad hoc rate-smoothing algorithm is described, in which local molecular clocks are automatically placed on a phylogeny. This modification makes use of hybrid approaches that borrow from recent theoretical developments in microarray data analysis. An ad hoc integration of phylogenetic uncertainty under these local clock models is also described. The performance and accuracy of the new methods are evaluated by reanalyzing three published data sets. CONCLUSIONS: It is shown that the new maximum likelihood hybrid methods can perform better than penalized likelihood and almost as well as uncorrelated Bayesian models. However, the new methods still tend to underestimate the actual amount of rate change. This work demonstrates the difficulty of estimating divergence times using local molecular clocks

    Vertical-external-cavity surface-emitting lasers and quantum dot lasers

    The use of cavities to manipulate the photon emission of quantum dots (QDs) has opened unprecedented opportunities for realizing quantum functional nanophotonic devices and quantum information devices. In particular, in the field of semiconductor lasers, QDs were introduced as a superior alternative to quantum wells to suppress the temperature dependence of the threshold current in vertical-external-cavity surface-emitting lasers (VECSELs). In this work, we review the properties and development of semiconductor VECSEL devices and QD laser devices. Based on the features of VECSEL devices, the main emphasis is put on recent developments in technological approaches to semiconductor QD VECSELs. Then, from the viewpoint of both the single-QD nanolaser and cavity quantum electrodynamics (QED), a single-QD-cavity system resulting from strong QD-cavity coupling is presented. Unlike existing works on semiconductor VECSEL devices, this review covers both the fundamental aspects and the technological approaches of QD VECSEL devices. Lastly, the review offers insight and guidelines for the development of QD VECSEL technology, future quantum functional nanophotonic devices, and monolithic photonic integrated circuits (MPhICs).
    Comment: 21 pages, 4 figures. arXiv admin note: text overlap with arXiv:0904.369

    Peak intensity prediction in MALDI-TOF mass spectrometry: A machine learning study to support quantitative proteomics

    Timm W, Scherbart A, Boecker S, Kohlbacher O, Nattkemper TW. Peak intensity prediction in MALDI-TOF mass spectrometry: A machine learning study to support quantitative proteomics. BMC Bioinformatics. 2008;9(1):443.
    Background: Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification. Results: In this work we present machine learning techniques for peak intensity prediction for MALDI mass spectra. Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. A feature subset was obtained from multiple forward feature selections on the extracted features. Based on these features, two advanced machine learning methods (support vector regression and local linear maps) are shown to yield good results for this problem (Pearson correlation of 0.68 in a ten-fold cross validation). Conclusion: The techniques presented here are a useful first step going beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions in turn will turn out to be beneficial for mass spectrometry-based quantitative proteomics.

    Multiple Frequencies Sequential Coding for SSVEP-Based Brain-Computer Interface

    BACKGROUND: The steady-state visual evoked potential (SSVEP)-based brain-computer interface (BCI) has become one of the most promising modalities for a practical noninvasive BCI system. Owing both to the limited refresh rate of liquid crystal display (LCD) or cathode ray tube (CRT) monitors and to the specific physiological response property that only a very small number of stimuli at certain frequencies can evoke strong SSVEPs, the available frequencies for SSVEP stimuli are limited. Therefore, the traditional frequency coding protocols may not be able to code enough targets, which poses a big challenge for the design of a practical SSVEP-based BCI. This study aimed to provide an innovative coding method to tackle this problem. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we present a novel protocol termed multiple frequencies sequential coding (MFSC) for SSVEP-based BCI. In MFSC, multiple frequencies are used sequentially in each cycle to code the targets. To achieve the sequential coding, each cycle is divided into several coding epochs, and a certain frequency is used during each epoch. Different frequencies or the same frequency can be presented in the coding epochs, and each distinct epoch sequence corresponds to a different target. To show the feasibility of MFSC, we used two frequencies to realize four targets and carried out an offline experiment. The current study shows that: 1) MFSC is feasible and efficient; 2) the performance of an SSVEP-based BCI based on MFSC can be comparable to that of some existing systems. CONCLUSIONS/SIGNIFICANCE: The proposed protocol could potentially implement many more targets with the limited available frequencies than the traditional frequency coding protocol. The efficiency of the new protocol was confirmed by an experiment on real data. We propose that the SSVEP-based BCI under MFSC might be a promising choice in the future.
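
    The coding scheme described above, in which each target corresponds to one sequence of frequencies over the coding epochs of a cycle, can be illustrated with a small codebook. This is a sketch under assumptions: the 8 Hz and 10 Hz values, the choice of two epochs per cycle (inferred from "two frequencies to realize four targets"), and the function names are hypothetical and are not taken from the paper.

```python
from itertools import product


def build_codebook(frequencies_hz, epochs_per_cycle):
    """Assign each target index one sequence of per-epoch stimulus frequencies."""
    sequences = product(frequencies_hz, repeat=epochs_per_cycle)
    return {target: seq for target, seq in enumerate(sequences)}


def decode(codebook, detected_sequence):
    """Return the target whose code equals the detected per-epoch frequencies."""
    detected = tuple(detected_sequence)
    for target, seq in codebook.items():
        if seq == detected:
            return target
    return None  # No target uses this sequence.


# Assumed example: two frequencies and two epochs per cycle give 2**2 = 4 codes.
codebook = build_codebook([8.0, 10.0], epochs_per_cycle=2)
target = decode(codebook, [10.0, 8.0])
```

    With f frequencies and e epochs per cycle, the codebook holds f**e sequences, which is how sequential coding can address more targets than there are usable frequencies.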