Search CORE

35 research outputs found

Protein fold recognition using genetic algorithm optimized voting scheme and profile bigram

Authors: Dehzangi Abdollah; Imoto S.; Lal Sunil P.; Raicar Gaurav; Saini Harsh; Sharma Alokanand
Publication venue: JSW
Publication date: 01/01/2016

In biology, identifying the tertiary structure of a protein helps determine its functions. A step towards tertiary structure identification is predicting a protein’s fold. Computational methods have been applied to determine a protein’s fold by assembling information from its structural, physicochemical and/or evolutionary properties. It has been shown that evolutionary information helps improve prediction accuracy. In this study, a scheme is proposed that uses a genetic algorithm (GA) to optimize a weighted voting scheme to improve protein fold recognition. This scheme incorporates k-separated bigram transition probabilities for feature extraction, which are based on the Position Specific Scoring Matrix (PSSM). A set of SVM classifiers is used for initial classification, whereupon their predictions are consolidated using the optimized weighted voting scheme. This scheme has been demonstrated on the Ding and Dubchak (DD), Extended Ding and Dubchak (EDD) and Taguchi and Gromiha (TG) benchmark datasets.
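
The abstract names k-separated bigram features computed from a PSSM but does not give the formula. A minimal sketch follows, assuming the commonly used definition B_k(m, n) = sum over i of P[i][m] * P[i+k][n], where P is a PSSM whose rows have already been normalized to probabilities. The function name `k_separated_bigram` and the two-column toy matrix are hypothetical and chosen only for illustration; a real PSSM has 20 columns.

```python
def k_separated_bigram(pssm, k=1):
    """Accumulate B[m][n] = sum_i pssm[i][m] * pssm[i + k][n].

    pssm: list of rows, one per residue position, each row a list of
          per-amino-acid probabilities (assumed already normalized).
    k:    separation between the two positions being paired.
    """
    length = len(pssm)
    width = len(pssm[0]) if pssm else 0
    bigram = [[0.0] * width for _ in range(width)]
    # Pair every position i with the position k steps further along.
    for i in range(length - k):
        first, second = pssm[i], pssm[i + k]
        for m in range(width):
            for n in range(width):
                bigram[m][n] += first[m] * second[n]
    return bigram


# Toy 3-residue "PSSM" with only two columns (illustrative, not real data).
toy_pssm = [
    [0.5, 0.5],
    [1.0, 0.0],
    [0.0, 1.0],
]
features_k1 = k_separated_bigram(toy_pssm, k=1)
features_k2 = k_separated_bigram(toy_pssm, k=2)
```

Flattening the resulting matrix gives a fixed-length feature vector regardless of sequence length, which is what makes this kind of feature convenient as classifier input.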

University of the South Pacific Electronic Research Repository

Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records

Author: Gu Shaopeng
Publication venue: Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange
Publication date: 01/01/2019

Modern sequencing technology has revolutionized genomic research and triggered explosive growth in DNA, RNA, and protein sequence data. Inferring structure and function from biological sequences is a fundamentally important task in genomics and proteomics. With the development of statistical and machine learning methods, an integrated and user-friendly tool containing state-of-the-art data mining methods is needed. Here, we propose SeqFea-Learn, a comprehensive Python pipeline that integrates multiple steps to analyze sequences: feature extraction, dimensionality reduction, feature selection, and construction of predictive models based on machine learning and deep learning approaches. We used enhancer, RNA N6-methyladenosine site and protein-protein interaction datasets to validate the tool. The results show that the tool can effectively perform biological sequence analysis and classification tasks. The application of machine learning algorithms to electronic medical record (EMR) data analysis is also included in this dissertation. Chronic kidney disease (CKD) is prevalent across the world and is well defined by the estimated glomerular filtration rate (eGFR). The progression of kidney disease can be predicted if future eGFR can be accurately estimated using predictive analytics. Thus, I present a prediction model of eGFR that was built using Random Forest regression. The dataset includes demographic, clinical and laboratory information from a regional primary health care clinic. The final model included eGFR, age, gender, body mass index (BMI), obesity, hypertension, and diabetes, and achieved a mean coefficient of determination of 0.95. The estimated eGFRs were used to classify patients into CKD stages with high macro-averaged and micro-averaged metrics.

Public Research Access Institutional Repository and Information Exchange

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Authors: Grover Claire; Haddow Barry; Klein Ewan; Matthews Michael; Nielsen Leif Arda; Tobin Richard; Wang Xinglong
Publication date: 01/01/2007

Edinburgh Research Explorer

Brain wave classification using long short-term memory based OPTICAL predictor

Authors: Kumar Shiu; Sharma Alokanand; Tsunoda Tatsuhiko
Publication venue: Springer Science and Business Media LLC
Publication date: 24/06/2019

Brain-computer interface (BCI) systems that can classify brain waves with greater accuracy are highly desirable. To this end, a number of techniques have been proposed that aim to classify brain waves with high accuracy. However, the ability to classify brain waves in real time is still limited. In this study, we introduce a novel scheme for classifying motor imagery (MI) tasks using electroencephalography (EEG) signals that can be implemented in real time with high classification accuracy between different MI tasks. We propose a new predictor, OPTICAL, that uses a combination of common spatial pattern (CSP) and a long short-term memory (LSTM) network to obtain improved MI EEG signal classification. A sliding window approach is proposed to obtain the time-series input from the spatially filtered data, which becomes the input to the LSTM network. Moreover, instead of using the LSTM directly for classification, we use the regression-based output of the LSTM network as one of the features for classification. In addition, linear discriminant analysis (LDA) is used to reduce the dimensionality of the CSP variance-based features. The features in the reduced-dimensional plane after performing LDA are used as input to the support vector machine (SVM) classifier together with the regression-based feature obtained from the LSTM network. The regression-based feature further boosts the performance of the proposed OPTICAL predictor. OPTICAL showed significant improvement in the ability to accurately classify left- and right-hand MI tasks on two publicly available datasets. The improvements in the average misclassification rates are 3.09% and 2.07% for BCI Competition IV Dataset I and the GigaDB dataset, respectively. The Matlab code is available at https://github.com/ShiuKumar/OPTICAL
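
The sliding window step mentioned in the abstract can be illustrated with a short sketch. This is not the authors' Matlab implementation: the function name `sliding_windows`, the window length and the step size are all hypothetical, since the abstract does not state them. The sketch only shows how one filtered signal can be cut into overlapping, equal-length segments suitable as time-series input to a sequence model.

```python
def sliding_windows(signal, window, step):
    """Split a sequence of samples into overlapping fixed-length windows.

    signal: list of samples (for example, one spatially filtered channel).
    window: number of samples per window (hypothetical value).
    step:   number of samples to advance between windows (hypothetical value).
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Only full windows are kept; a trailing partial window is dropped.
    last_start = len(signal) - window
    return [signal[start:start + window]
            for start in range(0, last_start + 1, step)]


samples = list(range(6))  # stand-in for filtered EEG samples
windows_step1 = sliding_windows(samples, window=4, step=1)
windows_step2 = sliding_windows(samples, window=4, step=2)
```

A smaller step produces more, heavily overlapping windows, which increases the number of training sequences at the cost of more correlated inputs.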

University of the South Pacific Electronic Research Repository

Advanced Machine Learning Techniques and Meta-Heuristic Optimization for the Detection of Masquerading Attacks in Social Networks

Author: Villar-Rodriguez Esther
Publication venue: Universidad de Alcala
Publication date: 01/01/2015

According to a report published by the online protection firm Iovation in 2012, cyber fraud ranged from 1 percent of Internet transactions in North America to 7 percent in Africa, most of it involving credit card fraud, identity theft, and account takeover or hacking attempts. This kind of crime is still growing due to the advantages offered by a non-face-to-face channel where an increasing number of unsuspecting victims divulge sensitive information. Interpol classifies these illegal activities into 3 types:
• Attacks against computer hardware and software.
• Financial crimes and corruption.
• Abuse, in the form of grooming or “sexploitation”.
Most research efforts have focused on the target of the crime, developing different strategies depending on the case. Thus, for the well-known phishing, stored blacklists or crime signals in the text are employed, eventually yielding ad hoc detectors that are hard to transfer to other scenarios even if the background is widely shared. Identity theft or masquerading can be described as a criminal activity oriented towards the misuse of stolen credentials to obtain goods or services by deception. On March 4, 2005, a million pieces of personal and sensitive information, such as credit card and social security numbers, were collected by White Hat hackers at Seattle University who just surfed the Web for less than 60 minutes by means of the Google search engine. They thereby proved the vulnerability and lack of protection with a mere group of sophisticated search terms typed into the engine, whose large data warehouse still allowed temporarily cached data from company or government websites to be shown. As mentioned above, platforms that connect distant people and in which the interaction is undirected offer an entry point for unauthorized third parties who impersonate the legitimate user in an attempt to go unnoticed with some malicious, not necessarily economic, interests.
In fact, the last point in the list above, regarding abuse, has become a major and terrible risk, along with bullying, both carried out by means of threats, harassment or even self-incrimination likely to drive someone to suicide, depression or helplessness. California Penal Code Section 528.5 states: “Notwithstanding any other provision of law, any person who knowingly and without consent credibly impersonates another actual person through or on an Internet Web site or by other electronic means for purposes of harming, intimidating, threatening, or defrauding another person is guilty of a public offense punishable pursuant to subdivision [...]”. Therefore, impersonation consists of any criminal activity in which someone assumes a false identity and acts as his or her assumed character with intent to obtain a pecuniary benefit or cause some harm. User profiling, in turn, is the process of harvesting user information in order to construct a rich template with all the advantageous attributes in the field at hand and with specific purposes. User profiling is often employed as a mechanism for recommending items or useful information that the client has not yet considered. Nevertheless, deriving user tendencies or preferences can also be exploited to define the inherent behavior and address the problem of impersonation by detecting outliers or strange deviations that may signal a potential attack. This dissertation is meant to elaborate on impersonation attacks from a profiling perspective, eventually developing a 2-stage environment which consequently embraces 2 levels of privacy intrusion, thus providing the following contributions:
• The inference of behavioral patterns from connection time traces, aiming to avoid the usurpation of more confidential information. When compared to previous approaches, this procedure abstains from impinging on user privacy by taking over message content, since it relies only on time statistics of the user sessions rather than on their content.
• The application and subsequent discussion of two selected algorithms to resolve the previous point:
– A commonly employed supervised algorithm executed as a binary classifier, which forced us to devise a method to deal with the absence of labeled instances representing an identity theft.
– A meta-heuristic algorithm that searches for the most convenient parameters to arrange the instances within a high-dimensional space into properly delimited clusters, so as to finally apply an unsupervised clustering algorithm.
• The analysis of message content, which encroaches on more private information but eases user identification by mining discriminative features with Natural Language Processing (NLP) techniques. As a consequence, a new feature extraction algorithm based on linguistic theories was developed, motivated by the massive quantity of features often gathered when it comes to texts.
In summary, this dissertation means to go beyond the typical, ad hoc approaches adopted by previous identity theft and authorship attribution research. Specifically, it proposes tailored solutions to this particular and extensively studied paradigm with the aim of introducing a generic approach from a profiling view, not tightly bound to a single application field. In addition, technical contributions have been made in the course of the solution formulation, intending to optimize familiar methods for better versatility towards the problem at hand. Overall, this thesis establishes an encouraging research basis towards unveiling subtle impersonation attacks in social networks by means of intelligent learning techniques.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

TECNALIA Publications

Stock Market Random Forest-Text Mining (SMRF-TM) Approach to Analyse Critical Indicators of Stock Market Movements

Author: ELAGAMY MAZEN NABIL
Publication date: 01/01/2017

The Stock Market is a significant sector of a country’s economy and has a crucial role in the growth of commerce and industry. Hence, discovering efficient ways to analyse and visualise stock market data is considered a significant issue in modern finance. The use of data mining techniques to predict stock market movements has been extensively studied using historical market prices but such approaches are constrained to make assessments within the scope of existing information, and thus they are not able to model any random behaviour of the stock market or identify the causes behind events. One area of limited success in stock market prediction comes from textual data, which is a rich source of information. Analysing textual data related to the Stock Market may provide better understanding of random behaviours of the market. Text Mining combined with the Random Forest algorithm offers a novel approach to the study of critical indicators, which contribute to the prediction of stock market abnormal movements. In this thesis, a Stock Market Random Forest-Text Mining system (SMRF-TM) is developed and is used to mine the critical indicators related to the 2009 Dubai stock market debt standstill. Random forest and expectation maximisation are applied to classify the extracted features into a set of meaningful and semantic classes, thus extending current approaches from three to eight classes: critical down, down, neutral, up, critical up, economic, social and political. The study demonstrates that Random Forest has outperformed other classifiers and has achieved the best accuracy in classifying the bigram features extracted from the corpus

STORE - Staffordshire Online Repository

Alzheimer’s Dementia Recognition Through Spontaneous Speech

Publication venue: Frontiers Media SA
Publication date: 21/10/2021

Edinburgh Research Explorer

Probabilistic Modeling for Whole Metagenome Profiling

Author: Burks David
Publication venue: University of North Texas Libraries
Publication date: 01/05/2021

To address the shortcomings of existing Markov model implementations in handling large amounts of metagenomic data with comparable or better classification accuracy, we developed a new algorithm based on a pseudo-count supplemented standard Markov model (SMM), which leverages the power of higher-order models to more robustly classify reads at different taxonomic levels. Assessment on simulated metagenomic datasets demonstrated that, overall, SMM was more accurate in classifying reads to their respective taxa at all ranks compared to the interpolated methods. Higher-order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at different taxonomic ranks (genus and higher) on tests that masked the read-originating species (genome models) in the database. Similar results were obtained by masking at other taxonomic ranks in order to simulate the plausible scenarios of non-representation of the source of a read at different taxonomic levels in the genome database. The performance gap became more pronounced at higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA, as well as generation of clusters of similar segments, which were then used to sample representative read fragments to constitute training datasets. The parameters of a logistic regression model were learnt from these training datasets using a Bayesian optimization procedure. This allowed us to establish thresholds for classifying metagenomic reads by SMM. This led to the development of a Python-based frontend that combines our SMM algorithm with the logistic regression optimization, named POSMM (Python Optimized Standard Markov Model). POSMM provides a much-needed alternative to metagenome profiling programs. Our algorithm, which builds the genome models on the fly and thus obviates the need to build a database, complements alignment-based classification and can thus be used in concert with alignment-based classifiers to raise the bar in metagenome profiling.
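
The general idea of a pseudo-count supplemented Markov model for read classification can be sketched as follows. This is an illustrative toy, not the POSMM implementation: the function names, the add-one pseudo-count, the order of 2 and the two short "genomes" are all assumptions made for the example (the abstract reports results for orders of 9 and above). Each genome model stores how often each base follows each k-mer context, and a read is assigned to the genome whose model gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(genome, order, pseudo=1.0):
    """Count, for every k-mer context, how often each base follows it."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        counts[context][genome[i + order]] += 1.0
    return {"order": order, "pseudo": pseudo, "counts": counts}


def log_likelihood(model, read):
    """Sum log transition probabilities, smoothed with pseudo-counts.

    The probability of the read's first k-mer is ignored in this sketch.
    """
    order, pseudo, counts = model["order"], model["pseudo"], model["counts"]
    total = 0.0
    for i in range(len(read) - order):
        context = read[i:i + order]
        next_base = read[i + order]
        context_counts = counts.get(context, {})
        # Pseudo-counts keep unseen transitions from having zero probability.
        numerator = context_counts.get(next_base, 0.0) + pseudo
        denominator = sum(context_counts.values()) + pseudo * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(read, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(models[name], read))


# Toy genome models (hypothetical sequences, order 2 for brevity).
models = {
    "at_rich": train_markov("ATATATATATAT", order=2),
    "gc_rich": train_markov("GCGCGCGCGCGC", order=2),
}
best_for_at_read = classify("ATATAT", models)
best_for_gc_read = classify("GCGCGC", models)
```

Because the models are built directly from the genome sequences at run time, no precomputed alignment database is needed, which matches the "on the fly" property described in the abstract.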

UNT Digital Library

Natural Language Processing: Emerging Neural Approaches and Applications

Publication venue: MDPI AG
Publication date: 06/05/2022

This Special Issue highlights the most recent research being carried out in the NLP field to discuss relative open issues, with a particular focus on both emerging approaches for language learning, understanding, production, and grounding interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains

Directory of Open Access Books (DOAB)

Connected Attribute Filtering Based on Contour Smoothness

Authors: Ouzounis Georgios; Urbach Erik R.; Wilkinson M.H.F.
Publication venue: The Russian Academie of Science
Publication date: 01/01/2013

A new attribute measuring the contour smoothness of 2-D objects is presented in the context of morphological attribute filtering. The attribute is based on the ratio of the circularity and non-compactness, and has a maximum of 1 for a perfect circle. It decreases as the object boundary becomes irregular. Computation on hierarchical image representation structures relies on five auxiliary data members and is rapid. Contour smoothness is a suitable descriptor for detecting and discriminating man-made structures from other image features. An example is demonstrated on a very-high-resolution satellite image using connected pattern spectra and the switchboard platform

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen