Analyzing imputed financial data: a new approach to cluster analysis
The authors introduce a novel statistical modeling technique for cluster analysis and apply it to financial data. Their two main goals are to handle missing data and to find homogeneous groups within the data. Their approach is flexible and handles large and complex data structures with missing observations and with quantitative and qualitative measurements. The authors achieve this result by mapping the data to a new structure that is free of distributional assumptions in choosing homogeneous groups of observations. Their new method also provides insight into the number of different categories needed for classifying the data. The authors use this approach to partition a matched sample of stocks. One group offers dividend reinvestment plans, and the other does not. Their method partitions this sample with almost 97 percent accuracy even when using only easily available financial variables. One interpretation of their result is that the misclassified companies are the best candidates either to adopt a dividend reinvestment plan (if they have none) or to abandon one (if they currently offer one). The authors offer other suggestions for applications in the field of finance.
Postgenomics: Proteomics and Bioinformatics in Cancer Research
Now that the human genome is completed, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the “proteome,” referred to as proteomics, will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus nondiseased states. These technologies provide a wealth of information and rapidly generate large quantities of data. Processing the large amounts of data will lead to useful predictive mathematical descriptions of biological systems which will permit rapid identification of novel therapeutic targets and identification of metabolic disorders. Here, we present an overview of the current status and future research approaches in defining the cancer cell's proteome in combination with different bioinformatics and computational biology tools toward a better understanding of health and disease.
Supervised cross-modal factor analysis for multiple modal data classification
In this paper, we study the problem of learning from multimodal data for the
purpose of document classification. In this problem, each document is composed
of two different modalities of data, i.e., an image and a text. Cross-modal
factor analysis (CFA) has been proposed to project the two modalities to
a shared data space, so that the classification of an image or a text can be
performed directly in this space. A disadvantage of CFA is that it ignores
the supervision information. In this paper, we improve CFA by incorporating the
supervision information to represent and classify both the image and text
modalities of documents. We project both image and text data to a shared data
space by factor analysis, and then train a class label predictor in the shared
space to use the class label information. The factor analysis parameter and the
predictor parameter are learned jointly by solving a single objective function.
With this objective function, we minimize the distance between the projections
of the image and text of the same document, and the classification error of the
projection as measured by a hinge loss function. The objective function is
optimized by an alternating optimization strategy in an iterative algorithm.
Experiments on two different multimodal document data sets show the advantage
of the proposed algorithm over other CFA methods.
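The objective described in this abstract can be sketched as a function that adds the mismatch between the two projections to a hinge loss on the projected points. This is only an illustration under stated assumptions: the names `U_img`, `U_txt`, `w`, `b` and the trade-off weight `C` are hypothetical, the predictor is assumed to be linear with labels in {-1, +1}, and the abstract does not give the exact form or any regularization terms.

```python
import numpy as np

def supervised_cfa_objective(X_img, X_txt, U_img, U_txt, w, b, y, C=1.0):
    """Evaluate an illustrative supervised-CFA-style objective.

    X_img: (n, d_img) image features, X_txt: (n, d_txt) text features.
    U_img: (d_img, k), U_txt: (d_txt, k) projections to the shared space
    (hypothetical names). w: (k,), b: scalar linear predictor.
    y: (n,) labels in {-1, +1}. C: assumed trade-off weight.
    """
    Z_img = X_img @ U_img  # image projections in the shared space
    Z_txt = X_txt @ U_txt  # text projections in the shared space

    # Squared distance between the two projections of each document.
    mismatch = np.sum((Z_img - Z_txt) ** 2)

    # Hinge loss of the shared-space predictor on both modalities.
    hinge = 0.0
    for Z in (Z_img, Z_txt):
        margins = y * (Z @ w + b)
        hinge += np.sum(np.maximum(0.0, 1.0 - margins))

    return float(mismatch + C * hinge)
```

An alternating scheme would minimize this value over the projections with the predictor fixed, then over the predictor with the projections fixed, and repeat.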
Multiple graph regularized protein domain ranking
Background Protein domain ranking is a fundamental task in structural
biology. Most protein domain ranking methods rely on the pairwise comparison of
protein domains while neglecting the global manifold structure of the protein
domain database. Recently, graph regularized ranking that exploits the global
structure of the graph defined by the pairwise similarities has been proposed.
However, the existing graph regularized ranking methods are very sensitive to
the choice of the graph model and parameters, and this remains a difficult
problem for most of the protein domain ranking methods.
Results To tackle this problem, we have developed the Multiple Graph
regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to
regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold
of protein domain distribution by combining multiple initial graphs for the
regularization. Graph weights are learned jointly with the ranking scores, and
automatically, by alternately minimizing an objective function in an
iterative algorithm. Experimental results on a subset of the ASTRAL SCOP
protein domain database demonstrate that MultiG-Rank achieves a better ranking
performance than single graph regularized ranking methods and pairwise
similarity based ranking methods.
Conclusion The problem of graph model and parameter selection in graph
regularized protein domain ranking can be solved effectively by combining
multiple graphs. This aspect of generalization introduces a new frontier in
applying multiple graphs to solving protein domain ranking applications.
Comment: 21 pages
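The alternating scheme described in this abstract can be sketched as follows. The objective used here is an assumption for illustration, not one stated in the abstract: a weighted sum of graph smoothness terms, a fitting term to the query vector `y`, and a squared penalty on the weights `mu`, with the weights constrained to the probability simplex. The parameter names `alpha` and `beta` are likewise hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def multi_graph_rank(laplacians, y, alpha=1.0, beta=1.0, n_iter=20):
    """Alternately update ranking scores f and graph weights mu.

    Assumed objective (illustrative only):
        sum_k mu_k * f^T L_k f + alpha * ||f - y||^2 + beta * ||mu||^2
    subject to mu >= 0 and sum(mu) = 1.
    """
    m, n = len(laplacians), len(y)
    mu = np.full(m, 1.0 / m)  # start from uniform graph weights
    for _ in range(n_iter):
        # Combine the graphs with the current weights.
        L = sum(w * Lk for w, Lk in zip(mu, laplacians))
        # f-step: zero gradient gives (L + alpha * I) f = alpha * y.
        f = np.linalg.solve(L + alpha * np.eye(n), alpha * y)
        # mu-step: minimizing c^T mu + beta * ||mu||^2 on the simplex
        # is a projection of -c / (2 * beta) onto the simplex.
        c = np.array([f @ Lk @ f for Lk in laplacians])
        mu = project_to_simplex(-c / (2.0 * beta))
    return f, mu
```

Here `laplacians` would be one graph Laplacian per candidate graph model, and `y` marks the query domain. Graphs on which the current scores are smoother receive larger weights.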
Functional Clustering Algorithm for High-Dimensional Proteomics Data
Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. A clustering algorithm that does not depend on the number of features or variables (here, the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. We propose here a new dissimilarity measure for hierarchical clustering, combined with functional data analysis. We present a specific application of functional data analysis (FDA) to a high-throughput proteomics study. The performance of the proposed algorithm is compared with that of two popular dissimilarity measures in the clustering of samples from normal and human T-cell leukemia virus type 1 (HTLV-1)-infected patients.
Detection of statistically significant network changes in complex biological networks
Table S1. Description of data: GHD and MRA Results for all the 457 considered transcription factors on the TCGA and Rembrandt datasets. (XLSX 62.7 kb)
An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity
Disease processes are usually driven by several genes interacting in molecular modules or pathways leading to the disease. The identification of such modules in gene or protein networks is the core of computational methods in biomedical research. In this context, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method based on merging and splitting the hierarchical tree obtained from any community detection technique for constrained DMI in biological networks. The only constraint was that the size of each community lie in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics such as modularity, conductance and connectivity, to determine the quality of a graph partition at a given level of the hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks and prunes insignificant modules to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false discovery rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, at the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 (rank 10) DM, respectively. In additional analysis, our proposed approach detected a total of 44 DM in the networks, compared with 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with that of the winner.
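Two of the unsupervised quality metrics named in this abstract, modularity and conductance, together with the size constraint, can be computed as in the sketch below. It assumes an undirected graph given as a symmetric NumPy adjacency matrix and uses the standard textbook definitions. How the authors combine these metrics into their F-score is not stated in the abstract and is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity of a partition (labels: one community id per node)."""
    two_m = A.sum()            # twice the total edge weight
    deg = A.sum(axis=1)        # weighted node degrees
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(deg, deg) / two_m) * same).sum() / two_m)

def conductance(A, members):
    """Cut weight divided by the smaller of the two side volumes."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(members)] = True
    cut = A[mask][:, ~mask].sum()
    vol_in, vol_out = A[mask].sum(), A[~mask].sum()
    return float(cut / min(vol_in, vol_out))

def size_ok(members, lo=3, hi=100):
    """Check the challenge's size constraint on a single module."""
    return lo <= len(members) <= hi
```

For example, on two triangles joined by a single edge, splitting the graph into the two triangles gives a modularity of 5/14 and a conductance of 1/7 for either module.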
Coevolution Analysis of HIV-1 Envelope Glycoprotein Complex
The HIV-1 Env spike is the main protein complex that facilitates HIV-1 entry into CD4+ host cells. HIV-1 entry is a multistep process that is not yet completely understood. This process involves several protein-protein interactions between HIV-1 Env and a variety of host cell receptors, along with many conformational changes within the spike. Owing to its high mutation rate and plasticity, HIV-1 Env has developed strategies to escape immense immune pressure and entry inhibitors. We applied a coevolution and residue-residue contact detection method to identify coevolution patterns within HIV-1 Env protein sequences representing all group M subtypes. We identified 424 coevolving residue pairs within HIV-1 Env. The majority of predicted pairs are residue-residue contacts and are proximal in 3D structure. Furthermore, many of the detected pairs have functional implications due to contributions in either CD4 or coreceptor binding, or variable loop, gp120-gp41, and interdomain interactions. This study provides a new dimension of information in HIV research. The identified residue couplings may be important not only in assisting gp120 and gp41 coordinate structure prediction, but also in designing new and effective entry inhibitors that incorporate mutation patterns of HIV-1 Env.
The triglyceride glucose-waist-to-height ratio outperforms obesity and other triglyceride-related parameters in detecting prediabetes in normal-weight Qatari adults: A cross-sectional study
Introduction: The triglyceride-glucose (TyG)-driven indices, incorporating obesity indices, have been proposed as reliable markers of insulin resistance and related comorbidities such as diabetes. This study evaluated the effectiveness of these indices in detecting prediabetes in normal-weight individuals from a Middle Eastern population.
Methods: Using the data of 5,996 adult Qatari participants from the Qatar Biobank cohort, we employed adjusted logistic regression to assess the ability of various obesity and triglyceride-related indices to detect prediabetes in normal-weight (18.5 ≤ BMI < 25 kg/m²) adults (≥18 years).
Results: Of the normal-weight adults, 13.62% had prediabetes. TyG-waist-to-height ratio (TyG-WHTR) was significantly associated with prediabetes among normal-weight men [OR per 1-SD 2.68; 95% CI (1.67–4.32)] and women [OR per 1-SD 2.82; 95% CI (1.61–4.94)]. Compared with other indices, TyG-WHTR had the highest area under the curve (AUC) value for prediabetes in men [AUC: 0.76, 95% CI (0.70–0.81)] and women [AUC: 0.73, 95% CI (0.66–0.80)], and performed significantly better than other indices (p < 0.05) in detecting prediabetes in men. TyG-WHTR had diagnostic values similar to those of fasting plasma glucose (FPG).
Discussion: Our findings suggest that the TyG-WHTR index could be a better indicator of prediabetes for general clinical usage in normal-weight Qatari adult men than other obesity and TyG-related indices. TyG-WHTR can help identify a person’s risk of developing prediabetes in both men and women when combined with FPG results.
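The index discussed in this abstract can be computed as in the sketch below. The abstract does not state the formula, so the sketch assumes the commonly used definitions: TyG is the natural logarithm of fasting triglycerides (mg/dL) times fasting glucose (mg/dL) divided by 2, and TyG-WHTR is TyG multiplied by the waist-to-height ratio. The function names and the units of the inputs are assumptions for illustration.

```python
import math

def tyg_index(triglycerides_mg_dl, glucose_mg_dl):
    """TyG index, assuming the common definition ln(TG * glucose / 2)."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)

def tyg_whtr(triglycerides_mg_dl, glucose_mg_dl, waist_cm, height_cm):
    """TyG multiplied by the waist-to-height ratio (same length units)."""
    return tyg_index(triglycerides_mg_dl, glucose_mg_dl) * (waist_cm / height_cm)

def is_normal_weight(weight_kg, height_m):
    """Normal-weight range used in the study: 18.5 <= BMI < 25 kg/m^2."""
    bmi = weight_kg / height_m ** 2
    return 18.5 <= bmi < 25.0
```

These values are inputs to a statistical analysis such as the adjusted logistic regression described in the abstract. No diagnostic cut-off is implied by this sketch.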