38 research outputs found

    Predicting human microRNA precursors based on an optimized feature subset generated by GA–SVM

    MicroRNAs (miRNAs) are non-coding RNAs that play important roles in post-transcriptional regulation. Identifying miRNAs is crucial to understanding their biological mechanism. Recently, machine-learning approaches have been employed to predict miRNA precursors (pre-miRNAs). However, the features used differ between methods and consequently lead to different performance, so feature selection is critical for pre-miRNA prediction. We generated an optimized subset of 13 features using a hybrid of a genetic algorithm and a support vector machine (GA–SVM). With an SVM classifier and five-fold cross-validation, the classification performance of the optimized feature subset is much higher than that of the two feature sets used in microPred and miPred. Finally, we constructed the classifier miR-SF to predict the most recently identified human pre-miRNAs in miRBase (version 16). Compared with microPred and miPred, miR-SF achieved much higher classification performance: accuracies were 93.97%, 86.21% and 64.66% for miR-SF, microPred and miPred, respectively. Thus, miR-SF is effective for identifying pre-miRNAs.

    A novel framework of credit risk feature selection for SMEs during industry 4.0

    With the development of industry 4.0, the credit data of SMEs are characterized by large volume, high speed, diversity and low value density. Selecting the key features that affect credit risk from such high-dimensional data has become critical to accurately measuring the credit risk of SMEs and alleviating their financing constraints. To this end, this paper proposes a credit risk feature selection approach that integrates the binary opposite whale optimization algorithm (BOWOA) and the Kolmogorov–Smirnov (KS) statistic. Furthermore, we use seven machine learning classifiers and three discriminant methods to verify the robustness of the proposed model on three real bank datasets of SMEs. The empirical results show that, although no single artificial intelligence credit evaluation method is universal across different SMEs’ credit data, the BOWOA-KS model proposed in this paper performs better than other methods when the number of indicators in the optimal subset and the prediction performance of the classifier are considered simultaneously. By providing a feature selection method for high-dimensional data and improving the predictive performance of credit risk models, it could help SMEs focus on the factors that will allow them to improve their creditworthiness and access loans from financial institutions more easily. Moreover, it will also help government agencies and policymakers develop policies to help SMEs reduce their credit risks.
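
As a minimal sketch of the KS side of such an approach, the snippet below computes the standard two-sample Kolmogorov–Smirnov statistic for a single feature, comparing its values among defaulting and non-defaulting firms. The function name and the toy values are illustrative assumptions; the abstract does not describe how BOWOA and the KS statistic are actually combined.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(a_sorted) | set(b_sorted))
    largest_gap = 0.0
    for x in points:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)  # share of A <= x
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)  # share of B <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical values of one credit feature for the two groups of firms.
defaulted = [1, 2, 3, 4]
repaid = [3, 4, 5, 6]
print(ks_statistic(defaulted, repaid))
```

A feature whose distributions barely differ between the two groups scores near 0, while a feature that separates them completely scores 1, which is why the statistic is a natural per-feature relevance score.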

    A Quantitative Analysis of the Impact on Chromatin Accessibility by Histone Modifications and Binding of Transcription Factors in DNase I Hypersensitive Sites

    It is known that chromatin features such as histone modifications and the binding of transcription factors exert a significant impact on the "openness" of chromatin. In this study, we present a quantitative analysis of the genome-wide relationship between chromatin features and chromatin accessibility in DNase I hypersensitive sites. We found that these features show a distinct preference for localizing in open chromatin. To elucidate their exact impact, we derived quantitative models that directly predict the "openness" of chromatin using histone modification features and transcription factor binding features, respectively. We show that these two types of features are highly predictive of chromatin accessibility from a statistical viewpoint. Moreover, our results indicate that these features are highly redundant and that only a small number of features are needed to achieve very high predictive power. Our study provides new insights into the true biological phenomena and the combinatorial effects of chromatin features on differential DNase I hypersensitivity.

    Integrated Analysis of Mutation Data from Various Sources Identifies Key Genes and Signaling Pathways in Hepatocellular Carcinoma

    Background: Recently, a number of studies have performed genome or exome sequencing of hepatocellular carcinoma (HCC) and identified hundreds or even thousands of mutations in protein-coding genes. However, these studies have focused on only a limited number of candidate genes, and many important mutation resources remain to be explored. Principal Findings: In this study, we integrated mutation data obtained from various sources and performed pathway and network analysis. We identified 113 pathways that were significantly mutated in HCC samples and found that the mutated genes in these pathways contained high percentages of known cancer genes and damaging genes and also showed high conservation scores, indicating their important roles in liver tumorigenesis. The five classes of pathways mutated most frequently were (a) proliferation and apoptosis related pathways, (b) tumor microenvironment related pathways, (c) neural signaling related pathways, (d) metabolic related pathways, and (e) circadian related pathways. Network analysis further revealed that the mutated genes with the highest betweenness coefficients, such as the well-known cancer genes TP53 and CTNNB1 and the recently identified novel mutated genes GNAL and the ADCY family, may play key roles in these significantly mutated pathways. Finally, we highlight several key genes (e.g., RPS6KA3 and PCLO) and pathways (e.g., axon guidance) in which the mutations were associated with clinical features. Conclusions: Our workflow illustrates the increased statistical power of integrating multiple studies of the same subject, which can provide biological insights that would otherwise be masked in individual sample sets. This type of bioinformatics approach is consistent with the need to make the best use of the ever-increasing data provided in valuable databases, such as TCGA, to speed up the deciphering of human cancers.

    Overlap of four sets of significant pathways obtained using the pathway coverage method.

    Note: The diagonal is the number of significant pathways in each set. The percentages above (or below) the diagonal represent the number of overlapping pathways divided by the size of the larger (or smaller) set of pathways. The values in bold font are the comparison result between the larger and smaller sample sizes.
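
The layout described in this note can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy pathway identifiers are hypothetical, and values are returned as fractions rather than formatted percentages.

```python
def overlap_matrix(pathway_sets):
    """Build the matrix described in the note from a list of pathway sets.

    Diagonal: size of each set.
    Above the diagonal: shared pathways / size of the larger set.
    Below the diagonal: shared pathways / size of the smaller set.
    """
    n = len(pathway_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = len(pathway_sets[i])
                continue
            shared = len(pathway_sets[i] & pathway_sets[j])
            sizes = (len(pathway_sets[i]), len(pathway_sets[j]))
            denominator = max(sizes) if i < j else min(sizes)
            # An empty set has no overlap to report; avoid dividing by zero.
            matrix[i][j] = shared / denominator if denominator else 0.0
    return matrix


# Hypothetical pathway identifiers, for illustration only.
set_one = {"p1", "p2", "p3", "p4"}
set_two = {"p3", "p4"}
for row in overlap_matrix([set_one, set_two]):
    print(row)
```

Dividing by the smaller set shows how much of it is recovered by the larger one, while dividing by the larger set shows how much of the larger result the two share.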

    The difference in A, the percentage of known cancer genes or damaging genes and B, the conservative score between five groups of mutated genes and control genes.

    Five groups of mutated genes were ranked in the top 50, 100, 150, 200 and 250 by betweenness coefficient of the network. Control genes are mutated genes with a betweenness coefficient of zero. The horizontal line parallel to the x axis represents the longitudinal coordinates of the control genes. * represents a significant difference (p<0.05).
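
The grouping described in this caption can be sketched as follows, assuming an unweighted, undirected gene network and using Brandes' algorithm to compute betweenness. The toy graph, node names and helper names are hypothetical, and the caption does not say whether the original coefficients were normalized.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                  # nodes by increasing distance
        predecessors = {node: [] for node in adjacency}
        path_counts = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_counts[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:                                # BFS from the source
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_counts[w] += path_counts[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):                   # accumulate from the farthest node
            for v in predecessors[w]:
                dependency[v] += path_counts[v] / path_counts[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2 for node, value in scores.items()}


def top_ranked(scores, k):
    """The k genes with the highest betweenness (ties broken by name)."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))[:k]


def control_genes(scores):
    """Genes with a betweenness of zero, used as the control group."""
    return sorted(gene for gene, value in scores.items() if value == 0)


# Hypothetical toy network: a hub with three neighbours, one of which has a leaf.
network = {
    "hub": ["g1", "g2", "g3"],
    "g1": ["hub"],
    "g2": ["hub"],
    "g3": ["hub", "g4"],
    "g4": ["g3"],
}
scores = betweenness(network)
print(top_ranked(scores, 2), control_genes(scores))
```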

    Integrative Analysis of Transcriptional Regulatory Network and Copy Number Variation in Intrahepatic Cholangiocarcinoma

    Background: Transcriptional regulatory networks (TRNs) are used to study conditional regulatory relationships between transcription factors and genes. However, few studies have tried to integrate genomic variation information such as copy number variation (CNV) with a TRN to find causal disturbances in the network. Intrahepatic cholangiocarcinoma (ICC) is the second most common hepatic carcinoma, with high malignancy and poor prognosis. Research on ICC is relatively limited compared with hepatocellular carcinoma, and there are no approved gene therapeutic targets yet. Method: We first constructed the TRN of ICC (ICC-TRN) using a combined forward-and-reverse engineering method, and then integrated copy number variation information with ICC-TRN to select CNV-related modules and construct CNV-ICC-TRN. We also integrated CNV-ICC-TRN with KEGG signaling pathways to investigate how CNV genes disturb signaling pathways. Finally, an unsupervised clustering method was applied to classify samples into distinct classes. Result: We obtained a CNV-ICC-TRN containing 33 modules, which were enriched in ICC-related signaling pathways. Integrated analysis of the regulatory network and signaling pathways illustrated that CNV might interrupt signaling by being located at the genomic sites of either nodes or regulators of nodes in a signaling pathway. In the end, expression profiles of nodes in CNV-ICC-TRN were used to cluster the ICC patients into two robust groups with distinct biological function features. Conclusion: Our work represents a primary effort to construct a TRN in ICC, and also a primary effort to identify key transcriptional modules based on their involvement in genetic variation, as shown by gene copy number variations (CNV). This kind of approach may take traditional TRN studies, which are based only on expression data, one step further towards genetic disturbance. It can easily be extended to other disease samples with appropriate data.

    Significantly mutated pathways.

    A, Top 30 of 113 significantly mutated pathways and the difference in B, the percentage of known cancer genes or damaging genes and C, the conservative score between mutated genes in significantly mutated pathways (In SMP) and those not in significantly mutated pathways (Not In SMP). Coverage represents the fraction of tumors with at least one mutated gene in the specified pathway. Known cancer genes were obtained from the F-census database, and damaging genes were predicted using PolyPhen.
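
The coverage definition in this caption can be written as a short function. This is a minimal sketch: the function name, the tumor identifiers and the toy gene assignments are hypothetical and are not taken from the study's data.

```python
def pathway_coverage(pathway_genes, mutations_by_tumor):
    """Fraction of tumors with at least one mutated gene in the pathway.

    pathway_genes: set of gene symbols in the pathway.
    mutations_by_tumor: dict mapping a tumor ID to its set of mutated genes.
    """
    if not mutations_by_tumor:
        return 0.0
    covered = sum(
        1 for mutated in mutations_by_tumor.values() if mutated & pathway_genes
    )
    return covered / len(mutations_by_tumor)


# Hypothetical toy data: four tumors, two of which hit the pathway.
pathway = {"geneA", "geneB"}
tumors = {
    "T1": {"geneA"},
    "T2": {"geneB", "geneX"},
    "T3": {"geneY"},
    "T4": set(),
}
print(pathway_coverage(pathway, tumors))
```

A tumor counts once regardless of how many pathway genes it carries mutations in, so coverage measures how widely a pathway is hit across the cohort rather than how heavily.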

    Overview of genes with mutations in at least 10 of 207 patient samples.

    The heatmap shows genes (rows) and tumors (columns) with mutations (blue). The number of events per gene is indicated to the left.

    Overview of module subtype and size in CNV-ICC-TRN.

    In both panels A and B, blue represents CNV-gene-only enriched modules, green represents CNV-TF-only regulated modules, and red represents modules that are both CNV-TF regulated and CNV-gene enriched.