A Survey of Feature Selection Strategies for DNA Microarray Classification
Classification tasks in bioinformatics are difficult and challenging; they are used to predict or diagnose disease in patients at an early stage by utilizing DNA microarray technology. However, DNA microarray data are characterized by a large number of features and small sample sizes, so classification confronts the "curse of dimensionality": the computational cost is high and the discovery of biomarkers is difficult. Feature selection algorithms can reduce the dimensionality of the data to the significant features without degrading classification performance, and they decrease computational time by removing irrelevant and redundant features. This study briefly surveys popular feature selection methods for classifying DNA microarray data, namely filter, wrapper, embedded, and hybrid approaches. Furthermore, it describes the steps of the feature selection process used to accomplish classification tasks and their relationships to other components such as datasets, cross-validation, and classifier algorithms. In a case study, we apply four different feature selection methods to two DNA microarray datasets and evaluate and discuss their performance in terms of classification accuracy, stability, and the size of the selected feature subset.
Keywords: brief survey; DNA microarray data; feature selection; filter methods; wrapper methods; embedded methods; hybrid methods
DOI: 10.7176/CEIS/14-2-01
Publication date: March 31st 202
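As an illustration of the filter approach mentioned above, the sketch below ranks features by a simple class-separation score (a Fisher-style criterion) and keeps the top k. The scoring function and toy data are our own assumptions for illustration, not taken from the survey:

```python
def fisher_score(values, labels):
    """Class-separation score for one feature: squared mean gap over pooled variance."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var = lambda xs, m: sum((x - m) ** 2 for x in xs) / len(xs)
    pooled = var(a, mean_a) + var(b, mean_b) + 1e-12  # avoid division by zero
    return (mean_a - mean_b) ** 2 / pooled

def filter_select(X, y, k):
    """Return the indices of the k features with the highest Fisher score."""
    n_features = len(X[0])
    scores = [fisher_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy microarray-like data: feature 0 separates the classes, features 1-2 are noise.
X = [[0.1, 5.0, 3.2], [0.2, 4.8, 3.1], [0.9, 5.1, 3.0], [1.0, 4.9, 3.3]]
y = [0, 0, 1, 1]
selected = filter_select(X, y, k=1)
```

Because the score is computed per feature without consulting a classifier, this is a filter method: fast, classifier-independent, but blind to feature interactions, which is what wrapper and embedded methods trade computation to capture.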
March madness prediction using machine learning techniques
Project work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
March Madness is the final tournament of the college basketball championship, considered by many the biggest sporting event in the United States, moving enormous sums every year in both betting and television. Besides that, some 60 million Americans fill out a tournament bracket every year, and almost anything is more likely than correctly picking all 68 games.
After collecting and transforming data from Sports-Reference.com, the experimental part consists of preprocessing the data, evaluating the features to include in the models, and training them. In this study, based on tournament data from the last 20 years, machine learning algorithms such as the Decision Tree Classifier, K-Nearest Neighbors Classifier, Stochastic Gradient Descent Classifier, and others were applied to measure prediction accuracy and to compare against several benchmarks.
Although the most important variables appeared to be those related to seeds, shooting, and the number of tournament appearances, it was not possible to determine exactly which ones should be used in the modeling, so all of them ended up being used.
Regarding the results, when training on the entire dataset, accuracy ranges from 65% to 70%, with Support Vector Classification yielding the best results; these are slightly lower than simply picking the higher seed. On the other hand, when predicting the 2017 tournament, Support Vector Classification and the Multi-Layer Perceptron Classifier reach 85% and 79% accuracy, respectively. In this sense, they surpass the previous benchmark as well as the most respected websites and statistics in the field.
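The benchmark comparison described above can be sketched in miniature: accuracy of a trained model versus the "pick the better (numerically lower) seed" baseline. The threshold model, the games, and the numbers are our own illustrative assumptions, not the study's data:

```python
def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def baseline_pick_higher_seed(games):
    """Predict 0 (team A wins) whenever team A holds the better, i.e. lower, seed."""
    return [0 if seed_a < seed_b else 1 for seed_a, seed_b, _ in games]

def fit_threshold(train):
    """Toy model: learn a seed-difference cutoff below which an upset is predicted."""
    diffs = sorted({abs(a - b) for a, b, _ in train})
    truth = [w for _, _, w in train]
    return max(diffs, key=lambda t: accuracy(
        [1 if abs(a - b) <= t else 0 for a, b, _ in train], truth))

# Toy games: (seed of team A, seed of team B, actual winner: 0 = A, 1 = B).
games = [(1, 16, 0), (2, 15, 0), (8, 9, 1), (5, 12, 1), (3, 14, 0)]
truth = [w for _, _, w in games]
base_acc = accuracy(baseline_pick_higher_seed(games), truth)

t = fit_threshold(games)
model_preds = [1 if abs(a - b) <= t else 0 for a, b, _ in games]
model_acc = accuracy(model_preds, truth)  # in-sample accuracy, illustration only
```

The point of the comparison is the one the abstract makes: a learned model is only interesting if it beats the trivial seed-based rule, which itself is a strong baseline in this tournament.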
Given some existing constraints, it is quite possible that these results could be improved and deepened in other ways. Meanwhile, this project can be referenced and serve as a basis for future work.
An Exploration Study of using the Universities Performance and Enrolments Features for Predicting the International Quality
Quality ranking systems are crucial in the assessment of the academic performance of an institution because these assessment systems give details about how different learning institutions deliver their services. Education quality is also of paramount importance to students, because it is through quality education that students develop the skills needed in the job market. Besides, education enhances a student's academic and reasoning capacities. When universities are subjected to ranking systems, they are likely to improve their quality in order to be ranked highly. When university administrators are exposed to ranking, competition gears up; through competition, the quality of education improves, and through that the general education system improves. In addition, with rapid technological progress, increased human mobility and economic growth, the concept of quality assessment has shifted from the national to the international level, and the evaluation of higher education quality is now conducted on the basis of international standards and comparisons. In the present context, the global ranking of a university has a significant influence on attracting research funding and academic talent. Universities are expected to collaborate and compete on an international level, and it is no longer enough to achieve excellence within any national group. It is therefore not surprising that there is a rising tendency among universities to become centres of world-class excellence. The findings of this study indicated that teaching, citations, income, and the number of students are key predictors of the international outlook of universities. It also showed that geography is a significant contributor, recognized when it was added to the models for assessing the quality of universities worldwide.
Macro-micro approach for mining public sociopolitical opinion from social media
During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations seeking to understand public opinion. In this thesis, we present our research in understanding public opinion on Twitter along three dimensions: sentiment, topics and summary.
In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus.
Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal.
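A two-stage hierarchical clustering pipeline of the general kind described above might be sketched as follows: a coarse clustering first, then re-clustering within each coarse cluster. The tiny k-means implementation and the toy 2-D points (standing in for tweet embeddings) are our own illustration, not the thesis's system:

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Tiny k-means, deterministically seeded with the first k points."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign

def two_stage(points, coarse_k, fine_k):
    """Stage 1: coarse topics; stage 2: re-cluster inside each coarse cluster."""
    coarse = kmeans(points, coarse_k)
    labels = [None] * len(points)
    for c in range(coarse_k):
        idx = [i for i, a in enumerate(coarse) if a == c]
        sub = kmeans([points[i] for i in idx], min(fine_k, len(idx)))
        for i, f in zip(idx, sub):
            labels[i] = (c, f)  # (coarse topic, fine sub-topic)
    return labels

# Two well-separated coarse groups (x = 0 vs x = 100), each with two sub-groups.
points = [(0, 0), (100, 0), (0, 1), (0, 10), (0, 11), (100, 1), (100, 10), (100, 11)]
labels = two_stage(points, coarse_k=2, fine_k=2)
```

The appeal of the two-stage design is efficiency: the expensive fine-grained step runs only within each coarse cluster rather than over the whole collection.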
Lastly, in our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries of real-world events/stories or of the reasoning underlying the sentiment expressed towards an issue/entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation on tweets. In addition, we present a system for providing interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order.
Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models, comparing them to relevant baseline methods. Most of our evaluations are quantitative; however, we perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used to better understand public opinion in social media.
GA for feature selection of EEG heterogeneous data
The electroencephalographic (EEG) signals provide highly informative data on
brain activities and functions. However, their heterogeneity and high
dimensionality may represent an obstacle for their interpretation. The
introduction of a priori knowledge seems the best option to mitigate high
dimensionality problems, but could lose some information and patterns present
in the data, while data heterogeneity remains an open issue that often makes
generalization difficult. In this study, we propose a genetic algorithm (GA)
for feature selection that can be used with a supervised or unsupervised
approach. Our proposal considers three different fitness functions without
relying on expert knowledge. Starting from two publicly available datasets on
cognitive workload and motor movement/imagery, the EEG signals are processed,
normalized and their features computed in the time, frequency and
time-frequency domains. The feature vector selection is performed by applying
our GA proposal and compared with two benchmarking techniques. The results show
that different combinations of our proposal achieve better results than the
benchmark in terms of overall performance and feature reduction. Moreover, the
proposed GA, based on a novel fitness function presented here, outperforms the
benchmark when the two datasets considered are merged together, showing the
effectiveness of our proposal on heterogeneous data.
Comment: submitted to Expert Systems with Applications
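A GA for feature selection of this general shape can be sketched as follows: chromosomes are bit masks over the features, and fitness rewards class separation while penalising subset size. The encoding, the separation-based fitness, and the toy data are our own assumptions, not the authors' fitness functions:

```python
import random

def fitness(mask, X, y, penalty=0.1):
    """Mean class-separation score of the selected features, minus a size penalty."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    score = 0.0
    for j in selected:
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((x - ma) ** 2 for x in a) / len(a)
        vb = sum((x - mb) ** 2 for x in b) / len(b)
        score += (ma - mb) ** 2 / (va + vb + 1e-12)
    return score / len(selected) - penalty * len(selected)

def ga_select(X, y, pop_size=20, gens=30, seed=0):
    """Evolve bit masks: truncation selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        parents = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            pa, pb = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = pa[:cut] + pb[cut:]
            if rng.random() < 0.2:              # bit-flip mutation
                i = rng.randrange(n)
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

# Toy feature matrix: feature 0 separates the classes, features 1-3 are noise.
X = [[0.0, 5.1, 2.0, 7.3],
     [0.1, 4.9, 2.2, 7.1],
     [1.0, 5.0, 2.1, 7.2],
     [1.1, 5.2, 1.9, 7.4]]
y = [0, 0, 1, 1]
best = ga_select(X, y)
```

Swapping the fitness function, as the abstract's three variants do, changes the selection objective without touching the search machinery, which is the practical appeal of the GA formulation.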
Continuous learning of analytical and machine learning rate of penetration (ROP) models for real-time drilling optimization
Oil and gas operators strive to reach hydrocarbon reserves by drilling wells in the safest and fastest possible manner, providing indispensable energy to society at reduced costs while maintaining environmental sustainability. Real-time drilling optimization consists of selecting operational drilling parameters that maximize a desirable measure of drilling performance. Drilling optimization efforts often aspire to improve drilling speed, commonly referred to as rate of penetration (ROP). ROP is a function of the forces and moments applied to the bit, in addition to mud, formation, bit and hydraulic properties. Three operational drilling parameters may be constantly adjusted at surface to influence ROP towards a drilling objective: weight on bit (WOB), drillstring rotational speed (RPM), and drilling fluid (mud) flow rate. In the traditional, analytical approach to ROP modeling, inflexible equations relate WOB, RPM, flow rate and/or other measurable drilling parameters to ROP and empirical model coefficients are computed for each rock formation to best fit field data. Over the last decade, enhanced data acquisition technology and widespread cheap computational power have driven a surge in applications of machine learning (ML) techniques to ROP prediction. Machine learning algorithms leverage statistics to uncover relations between any prescribed inputs (features/predictors) and the quantity of interest (response). The biggest advantage of ML algorithms over analytical models is their flexibility in model form. With no set equation, ML models permit segmentation of the drilling operational parameter space. However, increased model complexity diminishes interpretability of how an adjustment to the inputs will affect the output. There is no single ROP model applicable in every situation. This study investigates all stages of the drilling optimization workflow, with emphasis on real-time continuous model learning. 
Sensors constantly record data as wells are drilled, and it is postulated that ROP models can be retrained in real time to adapt to changing drilling conditions. Cross-validation is assessed as a methodology to select the best-performing ROP model for each drilling optimization interval in real time. Constrained by rig equipment and operational limitations, drilling parameters are optimized in intervals using the most accurate ROP model determined by cross-validation. Dynamic-range and full-range training data segmentation techniques contest the classical lithology-dependent approach to ROP modeling. Spatial-proximity and parameter-similarity sample weighting expand data partitioning capabilities during model training. The prescribed ROP modeling and drilling parameter optimization scenarios are evaluated according to model performance, ROP improvements and computational expense.
Petroleum and Geosystems Engineering
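Selecting a ROP model per interval by cross-validation, as described above, might look like the following sketch. The two candidate models (a constant predictor and a simple least-squares fit of ROP on WOB) and the toy interval data are our own illustrative assumptions, not the study's models:

```python
def fit_constant(train):
    """Candidate 1: predict the mean ROP of the training interval."""
    mean = sum(rop for _, rop in train) / len(train)
    return lambda wob: mean

def fit_linear(train):
    """Candidate 2: least-squares line ROP = a * WOB + b."""
    n = len(train)
    sx = sum(w for w, _ in train); sy = sum(r for _, r in train)
    sxx = sum(w * w for w, _ in train); sxy = sum(w * r for w, r in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda wob: a * wob + b

def cv_mse(data, fit, k=3):
    """k-fold cross-validated mean squared error of a model-fitting function."""
    folds = [data[i::k] for i in range(k)]
    err, count = 0.0, 0
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)
        for wob, rop in folds[i]:
            err += (model(wob) - rop) ** 2
            count += 1
    return err / count

# Toy interval: ROP grows roughly linearly with WOB (units arbitrary).
data = [(10, 21), (12, 25), (14, 28), (16, 33), (18, 37), (20, 40)]
best_fit = min([fit_constant, fit_linear], key=lambda f: cv_mse(data, f))
```

In the real-time setting described in the abstract, `data` would be the most recent interval's sensor records, and the winning candidate would be refit on the full interval before optimizing WOB, RPM and flow rate against it.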
System-Characterized Artificial Intelligence Approaches for Cardiac cellular systems and Molecular Signature analysis
The dissertation presents a significant advancement in the fields of cardiac cellular systems and molecular signature analysis by employing machine learning and generative artificial intelligence techniques. These methodologies are systematically characterized and applied to address critical challenges in these domains. A novel computational model is developed that combines machine learning tools and multi-physics models; its main objective is to accurately predict complex cellular dynamics, taking into account the intricate interactions within the cardiac cellular system. Furthermore, a comprehensive framework based on generative adversarial networks (GANs) is proposed, designed to generate synthetic data that faithfully represents an in-vitro cardiac cellular system; the generated data can be used to enhance understanding and analysis of the system's behavior. Additionally, a novel AI approach is formulated that integrates deep learning and GAN techniques for Raman characterization, enabling efficient detection of multi-analyte mixtures by leveraging deep learning algorithms and synthetic data generated through GANs. Overall, the integration of machine learning, generative artificial intelligence, and multi-physics modeling provides valuable insights and tools for precise prediction and efficient detection in cardiac cellular systems and molecular signature systems.