45 research outputs found

    Integrated mining of feature spaces for bioinformatics domain discovery

    Get PDF
    One of the major challenges in the field of bioinformatics is the elucidation of protein folding for the functional annotation of proteins. The factors that govern protein folding include the chemical, physical, and environmental conditions of the protein's surroundings, which can be measured and exploited for computational discovery purposes. These conditions enable the protein to transform from a sequence of amino acids into a globular three-dimensional structure. Information concerning the folded state of a protein has significant potential to explain biochemical pathways and their involvement in disorders and diseases. This information impacts the ways in which genetic diseases are characterized and cured and in which designer drugs are created. With the exponential growth of protein databases and the limitations of experimental protein structure determination, sophisticated computational methods have been developed and applied to search for, detect, and compare protein homology. Most computational tools developed for protein structure prediction are based primarily on sequence similarity searches. These approaches have improved prediction accuracy for proteins with high sequence similarity but perform poorly on proteins with low sequence similarity. Data mining offers unique algorithmic approaches that have been widely used in the development of automatic protein structure classification and prediction. In this dissertation, we present a novel approach that integrates physico-chemical properties with effective feature extraction techniques for the classification of proteins. Our approach overcomes one of the major obstacles of data mining in protein databases by encapsulating different hydrophobicity residue properties into a much-reduced feature space that possesses high degrees of specificity and sensitivity in protein structure classification.
We have developed three unique computational algorithms for coherent feature extraction on selected scale properties of the protein sequence. Our proposed integration scheme effectively handles the unequal lengths of proteins and scales well with the increasing dimensionality of these sequences. We also detail a two-fold methodology for protein functional annotation. First, we present an algorithm that integrates multiple physico-chemical properties into a multi-layered abstract feature space, with each layer corresponding to a physico-chemical property. Second, we discuss a wavelet-based segmentation approach that efficiently detects regions of property conservation across all layers of the created feature space. Finally, we present a unique graph-theory-based algorithmic framework for the identification of conserved hydrophobic residue interaction patterns using identified scales of hydrophobicity. We report that these discriminatory features are specific to a family of proteins and consist of conserved hydrophobic residues that are then used for structural classification. We also present our rigorously tested validation schemes, which report significant degrees of accuracy, showing that homologous proteins exhibit conservation of physico-chemical properties along the protein backbone. We conclude by summarizing our results and contributions and by listing our goals for future research
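As a concrete illustration of encoding one physico-chemical property layer, the sketch below maps a protein sequence onto the published Kyte-Doolittle hydrophobicity scale; the function name and the short example sequence are illustrative, not taken from the dissertation.

```python
# Sketch: one physico-chemical "layer" as a per-residue numeric profile.
# The Kyte-Doolittle hydrophobicity values below are the published scale;
# the function name and example sequence are illustrative.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def hydrophobicity_profile(sequence):
    """Map each amino-acid residue to its hydrophobicity value."""
    return [KYTE_DOOLITTLE[res] for res in sequence.upper()]

profile = hydrophobicity_profile("MKTAYIA")  # 7 residues -> 7 values
```

Stacking several such profiles (one per scale) yields the multi-layered feature space the abstract describes.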

    Financial Fraud Detection With Anomaly Feature Detection

    Get PDF
    It is well known that financial frauds, such as money laundering, also facilitate terrorism and other illegal activity. Much of this illicit dealing entails complicated trading and financial exchanges, which makes the frauds difficult to uncover. Both dynamic financial networks and entity features can be leveraged for detection. The trading network shows the relationships between organizations, thereby allowing investigators to identify fraudulent activity, while entity features help filter out fraudulent behavior. Thus, network characteristics and entity features both carry knowledge that can enhance fraud identification. However, most current approaches operate on either networks or features alone. In this study, we propose a novel approach, dubbed CoDetect, that capitalizes on both network and feature details. A further strength of CoDetect is that it can simultaneously track financial transactions and patterns of fraud. Extensive experiments on both synthetic data and actual cases demonstrate the framework's capacity to tackle financial fraud
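The abstract argues that network structure and entity features carry complementary fraud signals. The toy sketch below is not the CoDetect algorithm itself; it only illustrates the principle of combining a network signal (transaction-partner count) with a feature signal (transaction amount) into one anomaly score, with all names and numbers invented for the example.

```python
import math

# Toy sketch (not CoDetect): combine a network signal with a feature
# signal into a single anomaly score per entity. All data are invented.

def zscores(values):
    """Standard scores; a zero spread falls back to 1 to avoid division by zero."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]

def anomaly_scores(degrees, amounts):
    """Sum of absolute z-scores from the network view and the feature view."""
    return [abs(d) + abs(a) for d, a in zip(zscores(degrees), zscores(amounts))]

degrees = [3, 2, 4, 3, 25]                    # entity 4 trades with unusually many partners
amounts = [100.0, 90.0, 110.0, 95.0, 9000.0]  # and moves unusually large sums
scores = anomaly_scores(degrees, amounts)     # entity 4 scores highest
```

An entity that is an outlier in only one view scores moderately; a joint outlier, as above, dominates both terms.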

    Credit Card Fraud Detection Using AdaBoost and Majority Voting

    Get PDF
    In financial services, credit card fraud is a major concern: thousands of dollars are lost each year to it. Research reports analyzing real-world credit card data are scarce due to confidentiality concerns. This paper applies machine learning algorithms to detect credit card fraud. First, standard models are evaluated. Hybrid methods combining AdaBoost with majority voting are then applied. A public credit card dataset is used to test the efficiency of the models, and an analysis of a financial institution's own credit card records is then conducted. To better evaluate the robustness of the algorithms, noise is added to the samples. The experimental findings show that the majority voting method achieves high accuracy in detecting cases of credit card fraud
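The voting step described above can be sketched as follows; the base classifiers' predictions are hypothetical stand-ins, and this is not the paper's implementation.

```python
from collections import Counter

# Minimal sketch of combining base classifiers by majority vote;
# the prediction lists below are invented for illustration.

def majority_vote(predictions):
    """predictions: list of label lists, one list per classifier.
    Returns the most common label per sample across classifiers."""
    n_samples = len(predictions[0])
    return [Counter(p[i] for p in predictions).most_common(1)[0][0]
            for i in range(n_samples)]

# Three hypothetical classifiers voting on five transactions
# (1 = fraud, 0 = legitimate).
preds = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
]
combined = majority_vote(preds)  # [0, 1, 1, 0, 1]
```

With an odd number of binary classifiers there are no ties, which is one reason ensembles of three or five base models are common in this setting.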

    Machine learning on normalized protein sequences

    Get PDF
    Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus is a promising alternative to existing methods, especially for protein sequences of variable length
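A minimal sketch of the linear-interpolation normalization the findings describe, assuming each sequence has already been encoded as a per-residue numeric signal; the function and variable names are illustrative.

```python
import numpy as np

# Sketch of length normalization: resample a per-residue numeric
# encoding onto a fixed number of points by linear interpolation, so
# sequences of any length yield feature vectors of equal size.

def normalize_length(values, target_len):
    """Linearly interpolate a 1-D signal onto target_len evenly spaced points."""
    values = np.asarray(values, dtype=float)
    old_x = np.linspace(0.0, 1.0, num=len(values))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, values)

# A 3-residue signal stretched to 5 points; endpoints are preserved.
short = normalize_length([1.0, 3.0, 5.0], target_len=5)
```

After normalization, every sequence maps to a fixed-width vector, which is what makes classifiers such as random forests directly applicable.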

    A framework for computer aided diagnosis and analysis of the neuro-vasculature

    No full text
    Various types of vascular diseases such as carotid stenosis, aneurysms, Arterio-venous Malformations (AVM) and Subarachnoid Hemorrhage (SAH) caused by the rupture of an aneurysm are some of the common causes of stroke. The diagnosis and management of such vascular conditions present a challenge. In this dissertation we present a vascular analysis framework for Computer Aided Diagnosis (CAD) of the neuro-vasculature. We develop methods for 3D vascular decomposition, vascular skeleton extraction, and identification of vascular structures such as aneurysms. Owing to the complex and highly tortuous nature of the vasculature, analysis is often attempted only on a subset of the vessel network. In our framework we first compute the decomposition of the vascular tree into meaningful sub-components. A novel spectral segmentation approach is presented that focuses on the eigenfunctions of the Laplace-Beltrami operator (LBO), obtained through an FEM discretization. In this approach, we attain a set of vessel segmentations induced by the nodal sets of the LBO. This discretization produces a family of real-valued functions, which provide interesting insights into the structure and morphology of the vasculature. Next, a novel Weighted Approximate Convex Decomposition (WACD) strategy is proposed to understand the nature of complex vessel structures. We address the problem of vascular decomposition as a cluster optimization problem and introduce a methodology for compact geometric decomposition. These decomposed vessel structures are then grouped into anatomically relevant sections using a novel vessel skeleton extraction methodology that utilizes a Laplace-based operator. Vascular analysis is performed by obtaining a surface mapping between decomposed vessel sections. A non-rigid correspondence between vessel surfaces is achieved using Thin Plate Splines (TPS), and changes between corresponding surface morphologies are detected using Gaussian curvature maps and mean curvature maps.
Finally, characteristic vascular structures such as vessel bifurcations and aneurysms are identified using a Support Vector Machine (SVM) on the most relevant eigenvalues, obtained through feature selection. The proposed CAD framework was validated using pre-segmented sections of vasculature archived for 98 aneurysms in 112 patients. We first test our methodologies for vascular segmentation and then for detection. Our vascular segmentation approaches produced promising results, with 81% of the vessel sections correctly segmented. For vascular classification, Recursive Feature Elimination (RFE) was performed to find the most compact and informative set of features. We showed that the selected subset of eigenvalues produces minimum error and improved classifier precision. This analysis framework was also tested on longitudinal cases of patients with internal cerebral aneurysms. Volumetric and surface area comparisons were made by establishing a correspondence between segmented vascular sections. Our results suggest that the CAD framework was able to decompose, classify, and detect changes in aneurysm volumes and surface areas close to those segmented by an expert
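The spectral segmentation idea, partitioning a structure by an eigenfunction of a Laplacian operator, can be illustrated on a toy graph. The dissertation's FEM discretization of the Laplace-Beltrami operator on vessel surfaces is more involved; this sketch only shows the principle using a plain graph Laplacian, with an invented "two lobes joined by a bridge" graph standing in for a branching vessel.

```python
import numpy as np

# Toy sketch: split a graph by the sign of the second Laplacian
# eigenvector (the Fiedler vector). Not the dissertation's LBO/FEM
# pipeline; the graph below is invented for illustration.

def fiedler_partition(adjacency):
    """Return a boolean membership array from the sign of the Fiedler vector."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvector
    return fiedler >= 0

# Two triangles (nodes 0-2 and 3-5) joined by a single bridge edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = fiedler_partition(A)  # the two triangles fall on opposite sides
```

The nodal set of the eigenfunction cuts the graph at its weakest connection, which is the same mechanism the abstract describes for vessel segmentations induced by nodal sets of the LBO.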

    Data mining for bioinformatics

    No full text
    xix, 328 p. ; 24 cm

    Overcoming Confounding Bias in Causal Discovery Using Minimum Redundancy and Maximum Relevancy Constraint

    No full text
    Causal discovery is the process of modeling cause-and-effect relationships among features. Unlike traditional model-based approaches, which rely on fitting data to models, methods of causal discovery determine the causal structure from data. In clinical and EHR data analysis, causal discovery is used to identify dependencies among features that are difficult to estimate using model-based approaches. The resultant structures are represented as Directed Acyclic Graphs (DAGs) consisting of nodes and arcs, where the direction of an arc indicates the influence of one feature over another. These dependencies are fundamental to the discovery of novel insights from data. However, causal discovery relies solely on establishing feature dependencies from their conditional dependencies, which can lead to inaccurate inferences brought about by confounding bias. Our contribution in this work is 'Non-Confounding Causal Discovery' (NCCD), a framework aimed at overcoming confounding bias by leveraging maximum relevancy and minimum redundancy between features using concepts from information theory. The framework uses threshold-conditioned values to determine which features in the graphical structure are connected to one another. Validation was carried out on three clinical trial benchmark datasets, and the results were compared against the previously known Naïve Bayes (NB) and Tree Augmented Naïve Bayes (TAN) algorithms. We observe a reduction in the complexity of the graph, evidenced by a decrease in the number of arcs. Notably, the graphs generated through NCCD exhibited a capacity to eliminate confounding dependencies while concurrently preserving the overall score of the network
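A hedged sketch of the minimum-redundancy-maximum-relevance scoring that NCCD builds on, for discrete features; NCCD's thresholding and graph construction are not reproduced here, and all data in the example are invented.

```python
import math
from collections import Counter

# Sketch of minimum-redundancy-maximum-relevance (mRMR) selection on
# discrete features, using mutual information as the relevance and
# redundancy measure. All data below are invented for illustration.

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mrmr_select(features, target, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    chosen, remaining = [], list(range(len(features)))
    while len(chosen) < k and remaining:
        def score(i):
            relevance = mutual_information(features[i], target)
            if not chosen:
                return relevance
            redundancy = sum(mutual_information(features[i], features[j])
                             for j in chosen) / len(chosen)
            return relevance - redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

target = [0, 0, 1, 1, 0, 1, 0, 1]
f0 = list(target)              # perfectly relevant
f1 = [0, 0, 1, 1, 0, 1, 0, 0]  # near-copy of f0: relevant but redundant
f2 = [0, 1, 0, 1, 0, 1, 0, 1]  # weakly relevant, different pattern
picked = mrmr_select([f0, f1, f2], target, k=2)  # f0 is selected first
```

The redundancy penalty is what discourages connecting two features that merely echo each other, which is the intuition behind using these scores to prune confounded dependencies.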

    Wavelet-based energy features for glaucomatous image classification

    No full text
    Texture features within images are actively pursued for accurate and efficient glaucoma classification. Energy distribution over wavelet subbands is applied to find these important texture features. In this paper, we investigate the discriminatory potential of wavelet features obtained from the Daubechies (db3), Symlets (sym3), and biorthogonal (bio3.3, bio3.5, and bio3.7) wavelet filters. We propose a novel technique to extract energy signatures obtained using the 2-D discrete wavelet transform, and subject these signatures to different feature ranking and feature selection strategies. We gauged the effectiveness of the resultant ranked and selected subsets of features using support vector machine, sequential minimal optimization, random forest, and naïve Bayes classification strategies. We observed an accuracy of around 93% using tenfold cross-validation, demonstrating the effectiveness of these methods
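A minimal sketch of wavelet-energy features, assuming a one-level 2-D Haar transform implemented directly in numpy; the paper uses db3/sym3/biorthogonal filters, so Haar is only the simplest stand-in, and the tiny test image is invented.

```python
import numpy as np

# Sketch: one-level 2-D Haar-style transform and per-subband energy.
# The paper's db3/sym3/bior filters would replace the Haar averaging
# and differencing below; the example image is invented.

def haar_dwt2(image):
    """One-level 2-D Haar-style transform; returns (LL, LH, HL, HH) subbands."""
    a = image[0::2, :] + image[1::2, :]   # row-pair low-pass (sums)
    d = image[0::2, :] - image[1::2, :]   # row-pair high-pass (differences)
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low/low: smooth approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def energy_features(image):
    """Mean squared magnitude of each detail subband: the 'energy signature'."""
    _, lh, hl, hh = haar_dwt2(image)
    return [float(np.mean(np.square(s))) for s in (lh, hl, hh)]

# A checkerboard texture concentrates its energy in the diagonal subband.
img = np.array([[1., 0., 1., 0.],
                [0., 1., 0., 1.],
                [1., 0., 1., 0.],
                [0., 1., 0., 1.]])
feats = energy_features(img)
```

Vectors like `feats` (one energy per subband, per filter) are the kind of signatures that are then ranked, selected, and fed to the classifiers the abstract lists.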