64 research outputs found

    Analyzing imputed financial data: a new approach to cluster analysis

    The authors introduce a novel statistical modeling technique to cluster analysis and apply it to financial data. Their two main goals are to handle missing data and to find homogeneous groups within the data. Their approach is flexible and handles large and complex data structures with missing observations and with quantitative and qualitative measurements. The authors achieve this result by mapping the data to a new structure that is free of distributional assumptions in choosing homogeneous groups of observations. Their new method also provides insight into the number of different categories needed for classifying the data. The authors use this approach to partition a matched sample of stocks. One group offers dividend reinvestment plans, and the other does not. Their method partitions this sample with almost 97 percent accuracy even when using only easily available financial variables. One interpretation of their result is that the misclassified companies are the best candidates either to adopt a dividend reinvestment plan (if they have none) or to abandon one (if they currently offer one). The authors offer other suggestions for applications in the field of finance.

    Postgenomics: Proteomics and Bioinformatics in Cancer Research

    Now that the human genome is completed, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the “proteome,” referred to as proteomics, will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus nondiseased states. These technologies provide a wealth of information and rapidly generate large quantities of data. Processing these large amounts of data will lead to useful predictive mathematical descriptions of biological systems, which will permit rapid identification of novel therapeutic targets and of metabolic disorders. Here, we present an overview of the current status and future research approaches in defining the cancer cell's proteome in combination with different bioinformatics and computational biology tools toward a better understanding of health and disease.

    Supervised cross-modal factor analysis for multiple modal data classification

    In this paper we study the problem of learning from multi-modal data for the purpose of document classification. In this problem, each document is composed of two different modalities of data, i.e., an image and a text. Cross-modal factor analysis (CFA) has been proposed to project the two modalities to a shared data space, so that the classification of an image or a text can be performed directly in this space. A disadvantage of CFA is that it ignores the supervision information. In this paper, we improve CFA by incorporating the supervision information to represent and classify both the image and text modalities of documents. We project both image and text data to a shared data space by factor analysis, and then train a class label predictor in the shared space to use the class label information. The factor analysis parameter and the predictor parameter are learned jointly by solving a single objective function. With this objective function, we minimize both the distance between the projections of the image and text of the same document and the classification error of the projections, measured by the hinge loss function. The objective function is optimized by an alternating optimization strategy in an iterative algorithm. Experiments on two different multi-modal document data sets show the advantage of the proposed algorithm over other CFA methods.
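
    The abstract above does not give the objective explicitly. As a rough sketch only, an objective of the kind it describes could look like the following, assuming binary labels y_i in {-1, +1}, image features x_i, text features t_i, linear projections U and V, a linear predictor (w, b), and a trade-off weight C. All of these symbols are assumptions introduced here for illustration, and the authors' actual formulation may differ (for example, CFA commonly adds orthogonality constraints on U and V).

```latex
\min_{U,\,V,\,w,\,b}\;
\sum_{i=1}^{n} \bigl\| U^{\top} x_i - V^{\top} t_i \bigr\|_2^2
\;+\; C \sum_{i=1}^{n} \Bigl[
  \max\bigl(0,\, 1 - y_i\,(w^{\top} U^{\top} x_i + b)\bigr)
  + \max\bigl(0,\, 1 - y_i\,(w^{\top} V^{\top} t_i + b)\bigr)
\Bigr]
```

    The first term pulls the two projections of the same document together. The second is the hinge loss of the shared-space predictor on each projection. An alternating scheme would update (U, V) with (w, b) fixed, then (w, b) with (U, V) fixed, and repeat.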

    Multiple graph regularized protein domain ranking

    Background: Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the protein domain ranking methods. Results: To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned with ranking scores jointly and automatically, by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank achieves a better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods. Conclusion: The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This aspect of generalization introduces a new frontier in applying multiple graphs to solving protein domain ranking applications.
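
    The abstract does not state the objective function. One generic way to write a multiple-graph regularized ranking problem of this kind is sketched below, assuming a ranking score vector f, a query indicator vector y, graph Laplacians L_1, ..., L_M built from the candidate graphs, graph weights μ_m, and trade-off parameters α and β. These symbols and the exact regularizers are assumptions made here for illustration, not the published MultiG-Rank formulation.

```latex
\min_{f,\,\mu}\;
\sum_{m=1}^{M} \mu_m\, f^{\top} L_m f
\;+\; \alpha\, \| f - y \|_2^2
\;+\; \beta\, \| \mu \|_2^2
\quad \text{s.t.} \quad \sum_{m=1}^{M} \mu_m = 1,\;\; \mu_m \ge 0

% With \mu fixed and L = \sum_m \mu_m L_m, setting the gradient in f to zero gives
f = \alpha\,(L + \alpha I)^{-1}\, y
```

    Under this sketch, alternating minimization would solve for f in closed form with μ fixed, then update μ on the simplex with f fixed, and iterate until the objective stops decreasing.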

    Functional Clustering Algorithm for High-Dimensional Proteomics Data

    Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. A clustering algorithm that does not depend on the number of features or variables (here, the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. We propose here a new dissimilarity measure for hierarchical clustering combined with functional data analysis. We present a specific application of functional data analysis (FDA) to a high-throughput proteomics study. The high performance of the proposed algorithm is shown in a comparison with two popular dissimilarity measures in the clustering of samples from normal and human T-cell leukemia virus type 1 (HTLV-1)-infected patients.

    Detection of statistically significant network changes in complex biological networks

    Table S1. Description of data: GHD and MRA results for all the 457 considered transcription factors on the TCGA and Rembrandt datasets. (XLSX 62.7 kb)

    An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity

    Disease processes are usually driven by several genes interacting in molecular modules or pathways leading to the disease. The identification of such modules in gene or protein networks is the core of computational methods in biomedical research. In this context, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method based on ideas of merging and splitting the hierarchical tree obtained from any community detection technique for constrained DMI in biological networks. The only constraint was that the size of each community is in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics like modularity, conductance and connectivity to determine the quality of a graph partition at a given level of hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false detection rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 DM (rank 10), respectively. In additional analysis, our proposed approach detected a total of 44 DM in the networks, compared with 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with that of the winner.
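
    Two of the quality metrics named above, modularity and conductance, have standard textbook definitions that can be computed directly from an edge list. The sketch below is a minimal illustration on a hypothetical toy graph (two triangles joined by one edge), assuming an undirected, unweighted graph without self-loops. It does not reproduce the authors' F-score or Inverse Confidence, whose formulas are not given in the abstract. The function names and the toy data are assumptions made here.

```python
from collections import Counter, defaultdict

def modularity(edges, community):
    """Newman modularity of a partition (undirected, unweighted, no self-loops)."""
    m = len(edges)
    intra = defaultdict(int)   # edges with both ends in the same community
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def conductance(edges, members):
    """Cut edges divided by the smaller side's volume (lower is better)."""
    members = set(members)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        for node in (u, v):
            if node in members:
                vol_in += 1
            else:
                vol_out += 1
        if (u in members) != (v in members):
            cut += 1
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0

def sizes_in_range(community, lo=3, hi=100):
    """Check the challenge's size constraint: every module has lo..hi nodes."""
    return all(lo <= n <= hi for n in Counter(community.values()).values())

# Hypothetical toy network: two triangles connected by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

print(modularity(edges, community))      # partition-level quality
print(conductance(edges, {0, 1, 2}))     # module-level quality for module A
print(sizes_in_range(community))         # size constraint [3, 100]
```

    Modularity scores the whole partition, while conductance scores one module at a time, which is why a combined metric evaluated per level of the hierarchy is a natural design.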

    The triglyceride glucose-waist-to-height ratio outperforms obesity and other triglyceride-related parameters in detecting prediabetes in normal-weight Qatari adults: A cross-sectional study

    Introduction: The triglyceride-glucose (TyG)-driven indices, incorporating obesity indices, have been proposed as reliable markers of insulin resistance and related comorbidities such as diabetes. This study evaluated the effectiveness of these indices in detecting prediabetes in normal-weight individuals from a Middle Eastern population. Methods: Using the data of 5,996 adult Qatari participants from the Qatar Biobank cohort, we employed adjusted logistic regression to assess the ability of various obesity and triglyceride-related indices to detect prediabetes in normal-weight (18.5 ≤ BMI < 25 kg/m²) adults (≥18 years). Results: Of the normal-weight adults, 13.62% had prediabetes. TyG-waist-to-height ratio (TyG-WHTR) was significantly associated with prediabetes among normal-weight men [OR per 1-SD 2.68; 95% CI (1.67–4.32)] and women [OR per 1-SD 2.82; 95% CI (1.61–4.94)]. Compared with other indices, TyG-WHTR had the highest area under the curve (AUC) value for prediabetes in men [AUC: 0.76, 95% CI (0.70–0.81)] and women [AUC: 0.73, 95% CI (0.66–0.80)], and performed significantly better than other indices (p < 0.05) in detecting prediabetes in men. TyG-WHTR had diagnostic value similar to that of fasting plasma glucose (FPG). Discussion: Our findings suggest that the TyG-WHTR index could be a better indicator of prediabetes for general clinical usage in normal-weight Qatari adult men than other obesity and TyG-related indices. TyG-WHTR can help identify a person’s risk for developing prediabetes in both men and women when combined with FPG results.
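
    The abstract does not spell out how the index is computed. A minimal sketch, assuming the commonly used definitions TyG = ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2) and TyG-WHTR = TyG × (waist circumference / height), is shown below. The function names and the example values are hypothetical, and the study's exact units and preprocessing may differ.

```python
import math

def tyg_index(triglycerides_mg_dl, glucose_mg_dl):
    """TyG index under the assumed definition: ln(TG * FPG / 2), both in mg/dL."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2)

def tyg_whtr(triglycerides_mg_dl, glucose_mg_dl, waist_cm, height_cm):
    """TyG multiplied by the waist-to-height ratio (same length unit for both)."""
    return tyg_index(triglycerides_mg_dl, glucose_mg_dl) * (waist_cm / height_cm)

# Hypothetical participant, not taken from the study data.
value = tyg_whtr(triglycerides_mg_dl=150, glucose_mg_dl=100,
                 waist_cm=80, height_cm=160)
print(round(value, 3))
```

    Because the waist-to-height ratio is dimensionless, the index only requires that waist and height share a unit. The triglyceride and glucose inputs must be in mg/dL for the assumed formula to apply.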