35 research outputs found

    Protein fold recognition using genetic algorithm optimized voting scheme and profile bigram

    In biology, identifying the tertiary structure of a protein helps determine its functions. A step towards tertiary structure identification is predicting a protein’s fold. Computational methods have been applied to determine a protein’s fold by assembling information from its structural, physicochemical and/or evolutionary properties. It has been shown that evolutionary information helps improve prediction accuracy. In this study, a scheme is proposed that uses the genetic algorithm (GA) to optimize a weighted voting scheme to improve protein fold recognition. This scheme incorporates k-separated bigram transition probabilities for feature extraction, which are based on the Position Specific Scoring Matrix (PSSM). A set of SVM classifiers is used for initial classification, whereupon their predictions are consolidated using the optimized weighted voting scheme. This scheme has been demonstrated on the Ding and Dubchak (DD), Extended Ding and Dubchak (EDD) and Taguchi and Gromiha (TG) benchmark datasets.
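The two ingredients named in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of the k-separated bigram, B_k(m, n) = sum over i of P[i][m] * P[i+k][n], where P is a PSSM already converted to per-position probabilities. The abstract itself does not spell out the formula, so treat this as an assumption. The function names `k_separated_bigram` and `weighted_vote` are hypothetical, and the voting weights are taken as given rather than optimized by a GA.

```python
def k_separated_bigram(pssm, k=1):
    """Flattened k-separated bigram features from a PSSM.

    pssm: list of L rows, one per residue position, each holding one
    probability per amino-acid column (assumed already normalised).
    Returns a list of n_cols * n_cols features in row-major order.
    """
    n_cols = len(pssm[0])
    features = [[0.0] * n_cols for _ in range(n_cols)]
    # Pair every position i with the position k residues further along.
    for i in range(len(pssm) - k):
        row_a, row_b = pssm[i], pssm[i + k]
        for m in range(n_cols):
            for n in range(n_cols):
                features[m][n] += row_a[m] * row_b[n]
    return [value for row in features for value in row]


def weighted_vote(predictions, weights):
    """Combine one predicted label per classifier using per-classifier weights.

    In the abstract the weights are tuned by a genetic algorithm; here they
    are simply passed in (assumption made for illustration).
    """
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)
```

With a 20-column PSSM each value of k yields 400 features. A GA would then search over the `weights` vector to maximize validation accuracy of `weighted_vote`.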

    Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

    Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps: feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches to analyze sequences. We used enhancer, RNA N6-methyladenosine site, and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. Applying machine learning algorithms to electronic medical record (EMR) data analysis is also included in this dissertation. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics.
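The last step of this abstract, mapping an estimated eGFR to a CKD stage, might look like the sketch below. The abstract does not say which staging scheme was used. The thresholds here are the standard KDIGO GFR categories (in mL/min/1.73 m²), used as an assumption, and the name `ckd_stage` is hypothetical.

```python
# KDIGO GFR categories as (lower bound, label), checked from highest to lowest.
# Assumed for illustration; the abstract does not name its staging scheme.
KDIGO_THRESHOLDS = [
    (90.0, "G1"),
    (60.0, "G2"),
    (45.0, "G3a"),
    (30.0, "G3b"),
    (15.0, "G4"),
]


def ckd_stage(egfr):
    """Return the GFR category for an eGFR value in mL/min/1.73 m^2."""
    if egfr < 0:
        raise ValueError("eGFR must be non-negative")
    for lower_bound, stage in KDIGO_THRESHOLDS:
        if egfr >= lower_bound:
            return stage
    return "G5"  # below 15: kidney failure category
```

Applying `ckd_stage` to both the predicted and the measured eGFR gives the label pairs from which macro-averaged and micro-averaged metrics can be computed.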

    Brain wave classification using long short-term memory based OPTICAL predictor

    Brain-computer interface (BCI) systems that can classify brain waves with greater accuracy are highly desirable. To this end, a number of techniques have been proposed that aim to classify brain waves with high accuracy. However, the ability to classify brain waves in real time is still limited. In this study, we introduce a novel scheme for classifying motor imagery (MI) tasks using the electroencephalography (EEG) signal that can be implemented in real time and achieves high classification accuracy between different MI tasks. We propose a new predictor, OPTICAL, that uses a combination of common spatial pattern (CSP) and a long short-term memory (LSTM) network to obtain improved MI EEG signal classification. A sliding window approach is proposed to obtain the time-series input from the spatially filtered data, which becomes the input to the LSTM network. Moreover, instead of using the LSTM directly for classification, we use the regression-based output of the LSTM network as one of the features for classification. In addition, linear discriminant analysis (LDA) is used to reduce the dimensionality of the CSP variance-based features. The features in the reduced-dimensional plane after performing LDA are used as input to the support vector machine (SVM) classifier together with the regression-based feature obtained from the LSTM network. The regression-based feature further boosts the performance of the proposed OPTICAL predictor. OPTICAL showed significant improvement in the ability to accurately classify left- and right-hand MI tasks on two publicly available datasets. The improvements in the average misclassification rates are 3.09% and 2.07% for BCI Competition IV Dataset I and the GigaDB dataset, respectively. The Matlab code is available at https://github.com/ShiuKumar/OPTICAL
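The sliding window step described in this abstract can be sketched roughly as follows. This is not the authors' Matlab implementation. The function name `sliding_windows` and the `width` and `step` parameters are hypothetical, and the actual window length and overlap are not given in the abstract.

```python
def sliding_windows(signal, width, step=1):
    """Cut a 1-D sequence of samples into overlapping windows.

    signal: list of samples from one spatially filtered channel.
    width:  number of samples per window (hypothetical parameter).
    step:   shift between consecutive windows (hypothetical parameter).
    Trailing samples that do not fill a whole window are dropped.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [
        signal[start:start + width]
        for start in range(0, len(signal) - width + 1, step)
    ]
```

Each window would then form one time step of the sequence fed to the LSTM. A smaller `step` gives more time steps at the cost of more redundancy between them.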

    Advanced Machine Learning Techniques and Meta-Heuristic Optimization for the Detection of Masquerading Attacks in Social Networks

    According to the report published by the online protection firm Iovation in 2012, cyber fraud ranged from 1 percent of Internet transactions in North America to 7 percent in Africa, most of it involving credit card fraud, identity theft, and account takeover or hacking attempts. This kind of crime is still growing due to the advantages offered by a non-face-to-face channel where an increasing number of unsuspecting victims divulge sensitive information. Interpol classifies these illegal activities into 3 types: • Attacks against computer hardware and software. • Financial crimes and corruption. • Abuse, in the form of grooming or “sexploitation”. Most research efforts have focused on the target of the crime, developing different strategies depending on the case. Thus, for the well-known phishing, stored blacklists or crime signals in the text are employed, eventually yielding ad hoc detectors that hardly transfer to other scenarios even if the background is widely shared. Identity theft or masquerading can be described as a criminal activity oriented towards the misuse of stolen credentials to obtain goods or services by deception. On March 4, 2005, a million pieces of personal and sensitive information such as credit card and social security numbers were collected by White Hat hackers at Seattle University who simply surfed the Web for less than 60 minutes by means of the Google search engine. They thereby proved the vulnerability and lack of protection with a mere group of sophisticated search terms typed into the engine, whose large data warehouse still allowed temporarily cached data from company or government websites to be shown. As mentioned above, platforms that connect distant people and in which the interaction is undirected offer an entry point for unauthorized third parties who impersonate the legitimate user in an attempt to go unnoticed while pursuing malicious, not necessarily economic, interests. 
In fact, the last point in the list above, regarding abuse, has become a major and terrible risk along with bullying, both operating by means of threats, harassment or even self-incrimination likely to drive someone to suicide, depression or helplessness. California Penal Code Section 528.5 states: “Notwithstanding any other provision of law, any person who knowingly and without consent credibly impersonates another actual person through or on an Internet Web site or by other electronic means for purposes of harming, intimidating, threatening, or defrauding another person is guilty of a public offense punishable pursuant to subdivision [...]”. Therefore, impersonation consists of any criminal activity in which someone assumes a false identity and acts as his or her assumed character with intent to obtain a pecuniary benefit or cause some harm. User profiling, in turn, is the process of harvesting user information in order to construct a rich template with all the advantageous attributes in the field at hand and with specific purposes. User profiling is often employed as a mechanism for recommending items or useful information that the client has not yet considered. Nevertheless, derived user tendencies or preferences can also be exploited to define the inherent behavior and address the problem of impersonation by detecting outliers or strange deviations likely to indicate a potential attack. This dissertation is meant to elaborate on impersonation attacks from a profiling perspective, eventually developing a 2-stage environment which consequently embraces 2 levels of privacy intrusion, thus providing the following contributions: • The inference of behavioral patterns from the connection time traces, aiming at avoiding the usurpation of more confidential information. 
When compared to previous approaches, this procedure refrains from intruding on user privacy through access to message content, since it relies only on time statistics of the user sessions rather than on their content. • The application and subsequent discussion of two selected algorithms for resolving the previous point: – A commonly employed supervised algorithm executed as a binary classifier, which has in turn forced us to devise a method to deal with the absence of labeled instances representing an identity theft. – And a meta-heuristic algorithm that searches for the most convenient parameters to arrange the instances within a high-dimensional space into properly delimited clusters, so as to finally apply an unsupervised clustering algorithm. • The analysis of message content, encroaching on more private information but easing user identification by mining discriminative features with Natural Language Processing (NLP) techniques. As a consequence, a new feature extraction algorithm based on linguistic theories is developed, motivated by the massive quantity of features often gathered when it comes to texts. In summary, this dissertation means to go beyond the typical, ad hoc approaches adopted by previous identity theft and authorship attribution research. Specifically, it proposes tailored solutions to this particular and extensively studied paradigm with the aim of introducing a generic approach from a profiling view, not tightly bound to a unique application field. In addition, technical contributions have been made in the course of the solution formulation, intending to optimize familiar methods for better versatility towards the problem at hand. Overall, this thesis establishes an encouraging research basis towards unveiling subtle impersonation attacks in Social Networks by means of intelligent learning techniques.

    Stock Market Random Forest-Text Mining (SMRF-TM) Approach to Analyse Critical Indicators of Stock Market Movements

    The Stock Market is a significant sector of a country’s economy and has a crucial role in the growth of commerce and industry. Hence, discovering efficient ways to analyse and visualise stock market data is considered a significant issue in modern finance. The use of data mining techniques to predict stock market movements has been extensively studied using historical market prices but such approaches are constrained to make assessments within the scope of existing information, and thus they are not able to model any random behaviour of the stock market or identify the causes behind events. One area of limited success in stock market prediction comes from textual data, which is a rich source of information. Analysing textual data related to the Stock Market may provide better understanding of random behaviours of the market. Text Mining combined with the Random Forest algorithm offers a novel approach to the study of critical indicators, which contribute to the prediction of stock market abnormal movements. In this thesis, a Stock Market Random Forest-Text Mining system (SMRF-TM) is developed and is used to mine the critical indicators related to the 2009 Dubai stock market debt standstill. Random forest and expectation maximisation are applied to classify the extracted features into a set of meaningful and semantic classes, thus extending current approaches from three to eight classes: critical down, down, neutral, up, critical up, economic, social and political. The study demonstrates that Random Forest has outperformed other classifiers and has achieved the best accuracy in classifying the bigram features extracted from the corpus

    Alzheimer’s Dementia Recognition Through Spontaneous Speech

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field to discuss relative open issues, with a particular focus on both emerging approaches for language learning, understanding, production, and grounding interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains

    Connected Attribute Filtering Based on Contour Smoothness

    A new attribute measuring the contour smoothness of 2-D objects is presented in the context of morphological attribute filtering. The attribute is based on the ratio of the circularity and non-compactness, and has a maximum of 1 for a perfect circle. It decreases as the object boundary becomes irregular. Computation on hierarchical image representation structures relies on five auxiliary data members and is rapid. Contour smoothness is a suitable descriptor for detecting and discriminating man-made structures from other image features. An example is demonstrated on a very-high-resolution satellite image using connected pattern spectra and the switchboard platform