Machine Learning Methods for Finding Textual Features of Depression from Publications
Depression is a common but serious mood disorder. In 2015, the WHO reported that about 322 million people were living with some form of depression, the leading cause of ill health and disability worldwide. In the USA, approximately 14.8 million adults (about 6.7% of the US population) are affected by major depressive disorder. Most individuals with depression do not receive adequate care because the symptoms are easily neglected and many people are not even aware of their mental health problems. A depression prescreening system would therefore be greatly beneficial, helping people understand their mental health status at an early stage. Diagnosing depression, however, is extremely challenging because of its many and varied symptoms. Fortunately, publications contain rich information about depression symptoms, and text mining methods can discover them from the literature. To extract these symptoms from publications, machine learning approaches are proposed to overcome four main obstacles: (1) representing publications in a mathematical form; (2) extracting abstracts from publications; (3) removing noisy publications to improve data quality; and (4) extracting textual symptoms from publications. For the first obstacle, we integrate Word2Vec with LDA, either representing publications with document-topic distance distributions or augmenting the word-to-topic and word-to-word vectors. For the second obstacle, we calculate a document vector and its paragraph vectors by aggregating word vectors from Word2Vec; feature vectors are obtained by clustering word vectors, and paragraphs are selected by comparing their distances to the feature vectors with the document's distances to the feature vectors. For the third obstacle, a one-class SVM model is trained on the vectorized publications, and outlier publications are excluded by distance measurements.
For the fourth obstacle, we evaluate the likelihood that a word is a symptom according to its frequency across all publications and its local relationship with surrounding words within a publication.
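The noisy-publication filtering step can be illustrated with a much simpler stand-in: instead of a one-class SVM, the sketch below flags publications whose aggregated document vectors lie far from the corpus centroid and excludes them by a distance threshold. The vectors, the threshold, and the helper name are invented for illustration, not taken from the paper.

```python
import numpy as np

def filter_outliers(doc_vectors, z_thresh=2.0):
    """Return indices of documents kept after distance-based outlier filtering."""
    X = np.asarray(doc_vectors, dtype=float)
    centroid = X.mean(axis=0)                      # center of the corpus
    dists = np.linalg.norm(X - centroid, axis=1)   # distance of each document
    cutoff = dists.mean() + z_thresh * dists.std() # documents beyond this are noise
    return [i for i, d in enumerate(dists) if d <= cutoff]

# Tight cluster of "clean" publication vectors plus one far-off noisy document.
docs = [[0.1, 0.2], [0.12, 0.19], [0.09, 0.21], [0.11, 0.2], [5.0, 5.0]]
kept = filter_outliers(docs, z_thresh=1.5)  # drops the last, distant document
```

A one-class SVM would learn a more flexible boundary than this spherical cutoff, but the principle of excluding publications by their distance from the bulk of the data is the same.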
A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy
Text classification, the task of assigning metadata to documents, takes a person significant time and effort. Since online-generated content is growing explosively, manually annotating large-scale, unstructured data has become a challenge. Recently, various state-of-the-art text mining methods have been applied to the classification process based on keyword extraction. However, when these keywords are used as features in a classification task, the number of feature dimensions is typically large, and selecting which keywords from documents to use as features is itself a major challenge. In particular, when traditional machine learning algorithms are applied to big data, computation times become very long. Moreover, about 80% of real-world data is unstructured and unlabeled, so conventional supervised feature selection methods cannot be used directly to select entities from massive data. Usually, statistical strategies are used to extract features from unlabeled data for classification tasks according to their importance scores. We propose a novel method to extract key features effectively before feeding them into the classification task. Another challenge in text classification is the multi-label problem: the assignment of multiple non-exclusive labels to documents, which makes classification more complicated than single-label classification. To address these issues, we develop a framework for extracting data and reducing data dimensionality to solve the multi-label problem on labeled and unlabeled datasets. To reduce dimensionality, we develop a hybrid feature selection method that extracts meaningful features according to the importance of each feature. Word2Vec is applied to represent each document by a feature vector for document categorization on the big dataset.
An unsupervised approach is used to extract features from real online-generated data for text classification. Our unsupervised feature selection method is applied to extract depression symptoms from social media such as Twitter. In the future, these depression symptoms will be used for depression self-screening and diagnosis.
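The idea of scoring features by importance before classification can be sketched with plain TF-IDF as a stand-in for the hybrid supervised/unsupervised method; the toy corpus, the function name, and the choice of score are all illustrative assumptions, not the framework's actual algorithm.

```python
import math
from collections import Counter

def select_top_features(docs, k):
    """Rank terms by summed TF-IDF across documents and keep the top k."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = Counter()                   # importance score per term
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            # term frequency in this doc, weighted by inverse document frequency
            scores[term] += (count / len(doc)) * math.log((1 + n) / (1 + df[term]))
    return [term for term, _ in scores.most_common(k)]

# Toy tokenized documents; "sad" is frequent yet not ubiquitous, so it scores high.
docs = [
    ["sad", "tired", "sleep", "sad"],
    ["happy", "run", "sleep"],
    ["sad", "cry", "tired"],
]
top = select_top_features(docs, k=2)
```

Selecting a small set of high-scoring terms like this is one way to shrink the feature dimension before the multi-label classifier sees the data.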
A Physiological Signal Processing System for Optimal Engagement and Attention Detection.
In today’s high-paced, high-tech, high-stress environment, with extended work hours, long to-do lists, and neglected personal health, sleep deprivation has become common in modern culture. Coupled with these factors is the inherently repetitious and tedious nature of certain occupations and daily routines, which together produce undesirable fluctuations in individuals’ cognitive attention and capacity. In certain critical professions, a momentary or prolonged lapse in attention can be catastrophic and sometimes deadly. This research proposes a real-time monitoring system that uses fundamental physiological signals, such as the electrocardiogram (ECG), to analyze and predict the presence or absence of cognitive attention in individuals during task execution. The primary focus of this study is to identify the correlation between fluctuating attention levels and their manifestation in the physiological parameters of the body. The system is designed around physiological signals that can be collected easily with small, wearable, portable, and non-invasive monitors, so that it can predict well in advance an individual’s potential loss of attention and onset of sleepiness. Several advanced signal processing techniques have been implemented and investigated to derive multiple hidden, informative features. These features are then fed to machine learning algorithms to produce classification models capable of distinguishing an attentive person from an inattentive one. Furthermore, electroencephalogram (EEG) signals are also analyzed and classified as a benchmark for comparison with the ECG analysis. For the study, ECG and EEG signals of volunteer subjects were acquired in a controlled experiment.
The experiment is designed to induce and sustain cognitive attention for a period of time, after which an attempt is made to reduce the subjects’ cognitive attention. The data acquired during the experiment are decomposed and analyzed for feature extraction and classification. The presented results show that it is possible to detect the presence or absence of attention in individuals with reasonable accuracy using only their ECG signal, especially in comparison with the analysis performed on EEG signals. Ongoing work in this research includes other physiological signals such as galvanic skin response, heat flux, and skin temperature, as well as video-based facial feature analysis.
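Features of the kind fed to such classifiers can be illustrated with standard heart-rate-variability statistics computed from R-peak times; the abstract does not name its exact features, so the quantities below (mean heart rate, SDNN, RMSSD) and the sample beat times are assumptions for illustration.

```python
import math

def hrv_features(r_peaks):
    """Return mean heart rate (bpm), SDNN and RMSSD (ms) from R-peak times in seconds."""
    rr = [b - a for a, b in zip(r_peaks, r_peaks[1:])]            # R-R intervals, s
    mean_rr = sum(rr) / len(rr)
    # SDNN: standard deviation of R-R intervals, converted to milliseconds
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr) / len(rr)) * 1000
    # RMSSD: root mean square of successive R-R differences, in milliseconds
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    rmssd = math.sqrt(sum(d ** 2 for d in diffs) / len(diffs)) * 1000
    return {"hr_bpm": 60 / mean_rr, "sdnn_ms": sdnn, "rmssd_ms": rmssd}

# Perfectly regular beats 0.8 s apart: 75 bpm and (near-)zero variability.
feats = hrv_features([0.0, 0.8, 1.6, 2.4, 3.2])
```

In practice such features, computed over sliding windows of the ECG, would form the input vectors for the attentive-versus-inattentive classification models.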
Review of feature selection techniques in Parkinson's disease using OCT-imaging data
Several spectral-domain optical coherence tomography (OCT) studies have reported a decrease in retinal thickness in the macular region in Parkinson’s disease. Yet the relationship between retinal thinning and visual disability is still unclear.
Macular scans acquired from patients with Parkinson’s disease (n = 100) and a control group (n = 248) were used to train several supervised classification models. The goal was to determine the most relevant retinal layers and regions for diagnosis, for which univariate and multivariate filter and wrapper feature selection methods were used. In addition, we evaluated how well the models classify the patient group, to assess the applicability of OCT measurements as a biomarker of the disease.
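A univariate filter of the kind reviewed here can be sketched as follows: each retinal-layer measurement is scored independently by a two-sample t-like statistic between patients and controls, and only the top-scoring features are kept. The data, the scoring formula, and the function name are synthetic illustrations, not the study's actual pipeline.

```python
import numpy as np

def univariate_filter(X_patients, X_controls, k):
    """Rank features by |mean difference| / pooled std and return top-k indices."""
    Xp, Xc = np.asarray(X_patients, float), np.asarray(X_controls, float)
    diff = np.abs(Xp.mean(axis=0) - Xc.mean(axis=0))           # group separation
    pooled = np.sqrt((Xp.var(axis=0) + Xc.var(axis=0)) / 2) + 1e-12
    scores = diff / pooled                                      # t-like statistic
    return list(np.argsort(scores)[::-1][:k])

rng = np.random.default_rng(0)
# Feature 0 separates the groups strongly; features 1-2 are pure noise.
patients = np.column_stack([rng.normal(5, 1, 50), rng.normal(0, 1, 50), rng.normal(0, 1, 50)])
controls = np.column_stack([rng.normal(0, 1, 60), rng.normal(0, 1, 60), rng.normal(0, 1, 60)])
best = univariate_filter(patients, controls, k=1)
```

Wrapper methods, by contrast, would search over feature subsets by repeatedly retraining the classifier, which is more expensive but can capture interactions between retinal layers that a univariate score misses.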
Machine learning techniques implementation in power optimization, data processing, and bio-medical applications
The rapid progress of machine-learning algorithms has become a key factor in shaping the future of humanity. These algorithms and techniques have been used to solve a wide spectrum of problems, ranging from data mining and knowledge discovery to unsupervised learning and optimization. This dissertation consists of two study areas. The first investigates the use of reinforcement learning and adaptive critic design algorithms in power grid control. The second, consisting of three papers, focuses on developing and applying clustering algorithms to biomedical data. The first paper presents a novel modelling approach for demand-side management of electric water heaters using Q-learning and action-dependent heuristic dynamic programming. The implemented approaches provide an efficient load management mechanism that reduces the overall power cost and smooths the grid load profile. The second paper implements an ensemble statistical and subspace-clustering model for analyzing the heterogeneous data of autism spectrum disorder, introducing a novel k-dimensional algorithm that handles heterogeneous datasets efficiently. The third paper provides a unified learning model for clustering neuroimaging data to identify potential risk factors for suboptimal brain aging. In the last paper, clustering and clustering validation indices are used to identify the groups of compounds responsible for plant uptake and contaminant transport from roots to plants' edible parts. --Abstract, page iv
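The water-heater idea can be sketched with plain tabular Q-learning on a toy environment (this is not the dissertation's ADHDP formulation): the agent should learn to heat during off-peak hours, since waiting with a cold tank incurs a comfort penalty and peak-hour heating is expensive. The states, prices, and penalty values are all invented for illustration.

```python
import numpy as np

WAIT, HEAT = 0, 1
COLD_OFFPEAK, COLD_PEAK, HOT = 0, 1, 2

def step(state, action):
    """Deterministic toy environment: returns (reward, next_state)."""
    if state == COLD_OFFPEAK:
        return (-1, HOT) if action == HEAT else (-5, COLD_PEAK)   # cheap power now
    if state == COLD_PEAK:
        return (-3, HOT) if action == HEAT else (-5, COLD_OFFPEAK)  # costly power
    return (1, COLD_OFFPEAK)  # hot water gets used; back to cold at off-peak

rng = np.random.default_rng(42)
Q = np.zeros((3, 2))
gamma, alpha, eps = 0.9, 0.5, 0.3
state = COLD_OFFPEAK
for _ in range(5000):
    # epsilon-greedy action selection
    action = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[state]))
    reward, nxt = step(state, action)
    # standard Q-learning update toward reward + discounted best next value
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    state = nxt

policy = [int(np.argmax(Q[s])) for s in range(3)]  # greedy policy per state
```

The learned Q-values reflect the price structure: heating is preferred in both cold states (comfort outweighs cost), but the off-peak heating action carries a higher value than the peak one, which is the behaviour a demand-side management scheme exploits.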
Imparting Systems Engineering Experience via Interactive Fiction Serious Games
Serious games for education are becoming increasingly popular. Interactive fiction games are among the most popular in app stores and are also beginning to be used heavily in education to teach analysis and decision-making. Noting that it is difficult for systems engineers to experience all the situations necessary to prepare them for the role of a chief engineer, in this paper we explore the use of interactive fiction serious games to impart systems engineering experience and to teach systems engineering principles. The results of a cognitive viability, qualitative viability, and replayability analysis of 14 systems engineering serious games developed in the interactive fiction genre are presented. The analysis demonstrates that students with a systems engineering background are able to learn the Twine gaming engine and, within a four-week period, create a serious game aligned to the Apply level of Bloom’s Taxonomy that conveys a systems engineering experience and teaches a systems engineering principle. The cognitive, quality, and replayability scores of these quickly generated games indicate that they provide some opportunity for high-level thinking, are of high quality, and, with above-average replayability, are likely to be played multiple times and/or recommended to others.
Framework for data quality in knowledge discovery tasks
The creation and consumption of data continue to grow by leaps and bounds. Due to advances in Information and Communication Technologies (ICT), the data explosion in the digital universe is a new trend, and Knowledge Discovery in Databases (KDD) is gaining importance because of this abundance of data. For a successful knowledge discovery process, the data must first be prepared: experts state that the preprocessing phase takes 50% to 70% of the total time of a knowledge discovery process.
Software tools based on popular knowledge discovery methodologies offer algorithms for data preprocessing. According to the Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx, and H2O.ai are the leading tools for knowledge discovery. These tools provide a variety of techniques that facilitate the evaluation of a dataset; however, they lack any guidance as to which techniques can or should be used in which contexts. Consequently, choosing suitable data cleaning techniques is a headache for inexpert users, who have no idea which methods can be used confidently and often resort to trial and error.
This thesis presents three contributions to address these problems: (i) a conceptual framework that guides the user in addressing data quality issues in knowledge discovery tasks, (ii) a case-based reasoning system that recommends suitable algorithms for data cleaning, and (iii) an ontology that represents knowledge about data quality issues and data cleaning methods. This ontology also supports the case-based reasoning system in case representation and in the reuse (adaptation) phase.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Presidente: Fernando Fernández Rebollo.- Secretario: Gustavo Adolfo Ramírez.- Vocal: Juan Pedro Caraça-Valente Hernánde
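The retrieval step of such a case-based reasoning system can be sketched as nearest-neighbour matching: past data-quality "cases" are profiled as numeric features, the new dataset's profile is compared against them, and the nearest case's cleaning algorithm is reused as the recommendation. The case base, features, and algorithm names below are invented for illustration.

```python
import math

# Hypothetical stored cases: (data-quality profile, cleaning algorithm that worked).
CASE_BASE = [
    ({"missing_ratio": 0.30, "outlier_ratio": 0.01}, "mean-imputation"),
    ({"missing_ratio": 0.02, "outlier_ratio": 0.20}, "IQR-outlier-removal"),
    ({"missing_ratio": 0.00, "outlier_ratio": 0.00}, "no-cleaning"),
]

def recommend(profile):
    """Return the cleaning algorithm of the most similar stored case."""
    def dist(case_features):
        # Euclidean distance over the shared, sorted feature keys
        return math.dist(
            [profile[k] for k in sorted(profile)],
            [case_features[k] for k in sorted(case_features)],
        )
    _, algorithm = min(CASE_BASE, key=lambda case: dist(case[0]))
    return algorithm

# A dataset with many missing values matches the imputation case most closely.
suggestion = recommend({"missing_ratio": 0.25, "outlier_ratio": 0.02})
```

The thesis's ontology would sit behind the case representation, and the adaptation phase would then tailor the retrieved algorithm to the new dataset rather than reusing it verbatim.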
A new approach to securing passwords using a probabilistic neural network based on biometric keystroke dynamics
Passwords are a common means of identifying an individual user on a computer system. However, they are only as secure as the user is vigilant in keeping them confidential. This thesis presents new methods for strengthening password security by employing the biometric feature of keystroke dynamics. Keystroke dynamics refers to the unique rhythm generated when keys are pressed as a person types on a computer keyboard. The aim is to make the positive identification of a computer user more robust by analysing the way in which a password is typed, not just the content of what is typed. Two new methods for implementing a keystroke dynamics system using neural networks are presented. The probabilistic neural network is shown to perform well and to be better suited to the application than the traditional backpropagation method: an improvement of 6% in the false acceptance and false rejection errors is observed, along with a significant decrease in training time. A novel time-sequenced method using a cascade-forward neural network is also demonstrated; this is a totally new approach to keystroke dynamics and is shown to be very promising. The problems encountered in the acquisition of keystroke dynamics, which are often ignored in other research in this area, are explored, including timing considerations and keyboard handling. The features inherent in keystroke data are examined, and a statistical technique for dealing with outlier data is implemented. EThOS - Electronic Theses Online Service, GB, United Kingdom
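The probabilistic neural network idea can be sketched in Parzen-window form: each class's density is estimated as a sum of Gaussian kernels centred on its training timing vectors, and a new sample is assigned to the class with the larger density. The dwell/flight timings (in milliseconds), the smoothing width, and the class labels below are invented, and this minimal form omits the thesis's acquisition and outlier handling.

```python
import math

def pnn_classify(train, sample, sigma=10.0):
    """Return the class whose Gaussian kernel-density estimate is largest at `sample`."""
    def score(vectors):
        # Average of Gaussian kernels centred on each training vector
        return sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(v, sample)) / (2 * sigma**2))
            for v in vectors
        ) / len(vectors)
    return max(train, key=lambda label: score(train[label]))

# Hypothetical timing vectors (e.g. key dwell and flight times, ms) per class.
train = {
    "genuine":  [[120, 80, 95], [118, 83, 97], [122, 79, 93]],
    "impostor": [[90, 140, 60], [95, 135, 65], [88, 142, 58]],
}
label = pnn_classify(train, [119, 81, 96])
```

Because the "training" step is just storing the timing vectors, this structure illustrates why a PNN trains far faster than an iteratively fitted backpropagation network, at the cost of keeping all exemplars at classification time.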
Natural Selection For Disease Resistance In Hybrid Poplars Targets Stomatal Patterning Traits And Regulatory Genes.
The evolution of disease resistance in plants occurs within a framework of interacting
phenotypes, balancing natural selection for life-history traits along a continuum of
fast-growing and poorly defended, or slow-growing and well-defended lifestyles. Plant
populations connected by gene flow are physiologically limited to evolving along a
single axis of the spectrum of the growth-defense trade-off, and strong local selection
can purge phenotypic variance from a population or species, making it difficult to
detect variation linked to the trade-off. Hybridization between two species that have
evolved different growth-defense trade-off optima can reveal trade-offs hidden in either
species by introducing phenotypic and genetic variance. Here, I investigated the
phenotypic and genetic basis for variation of disease resistance in a set of naturally
formed hybrid poplars.
The focal species of this dissertation were the balsam poplar (Populus balsamifera),
black balsam poplar (P. trichocarpa), narrowleaf cottonwood (P. angustifolia), and
eastern cottonwood (P. deltoides). Vegetative cuttings of samples were collected from
natural populations and clonally replicated in a common garden. Ecophysiological and stomatal traits, along with the severity of poplar leaf rust disease (Melampsora medusae), were measured. To overcome the methodological bottleneck of manually phenotyping
stomata density for thousands of cuticle micrographs, I developed a publicly available
tool to automatically identify and count stomata. To identify stomata, a deep convolutional neural network was trained on over 4,000 cuticle images from over 700 plant species. The network reached 94.2% accuracy on new cuticle images and phenotyped hundreds of micrographs in a matter of minutes.
To understand how disease severity, stomata, and ecophysiology traits changed
as a result of hybridization, statistical models were fit that included the expected proportion of the genome inherited from each parental species in a hybrid. These models indicated that the ratio of stomata on the upper leaf surface to the total number of stomata was strongly linked to disease, was highly heritable, and was sensitive to hybridization. I further investigated the genomic basis of stomata-linked disease variation by performing an association genetic analysis that explicitly incorporated admixture. Positive selection was identified in genes involved in guard cell regulation, negative regulation of the immune system, detoxification, lipid biosynthesis, and cell wall homeostasis.
Together, my dissertation combined advances in image-based phenotyping with evolutionary theory, directed at understanding how disease frequency changes when hybridization alters the genomes of a population.
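The admixture-aware modelling idea can be sketched as an ordinary least-squares regression of disease severity on the expected genome proportion from one parent species; the numbers below are synthetic, and the dissertation's actual models include additional terms.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic hybrids: admixture fraction from parent A and a noisy linear response.
admix = rng.uniform(0, 1, 200)                          # genome fraction, parent A
severity = 2.0 + 3.0 * admix + rng.normal(0, 0.5, 200)  # invented disease scores

# OLS fit of severity ~ 1 + admixture via the design matrix [1, admix].
X = np.column_stack([np.ones_like(admix), admix])
beta, *_ = np.linalg.lstsq(X, severity, rcond=None)
intercept, slope = beta
```

A significantly nonzero slope in such a model is what links disease variation to hybridization; the association genetic analysis then asks which loci drive that slope while controlling for genome-wide admixture.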