270 research outputs found
Intelligent Recommendation System for Higher Education
The education domain is vast and its data grows every day. Extracting information from these data requires data mining techniques. Educational data mining combines methods from data mining, machine learning and statistics that suit the unique data coming from the educational sector. Most available education recommendation systems help students choose a particular stream for graduate education after schooling, or a particular career after graduation. Counselling students during their graduate education helps them comprehend subjects better and enhances their understanding. This is possible by knowing a student's ability in subjects from past semesters and by mining similar learning patterns from historical databases. Most educational systems let students plan their subjects (particularly electives) at the beginning of a semester or course, yet students are often not fully aware of which subjects suit their career, which field interests them, or how they would perform. Recommending electives based on a student's learning ability, area of interest, extra-curricular activities and performance in prerequisites would help students perform better, avoid the risk of failure and specialize in their domain of interest. Such early prediction lets students take the necessary steps in advance to avoid poor performance and improve their academic scores. Developing this system requires various algorithms and recommendation techniques. This paper reviews the data mining and machine learning approaches used in the educational field and how they can be implemented.
Semi-supervised learning and fairness-aware learning under class imbalance
With the advent of Web 2.0 and rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information: the quality of data (the veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms rely solely on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still dramatically affects machine learning algorithms. Class imbalance causes classification models to learn one particular class (the majority) effectively while ignoring the other classes (the minority). Beyond this issue, machine learning models applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented in the data. Direct and indirect discriminatory behaviour is prohibited by international laws; thus, there is an urgent need to mitigate discriminatory outcomes of machine learning algorithms.
In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
• Tackling class imbalance in semi-supervised learning – The class-imbalance problem is encountered very often in classification. A variety of methods tackle this problem; however, few deal with class imbalance in semi-supervised learning. We address this by employing data augmentation in the semi-supervised learning process in order to equalize the class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
• Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention in recent years. A growing body of pre-, in- and post-processing approaches has been proposed to mitigate algorithmic bias; however, these methods use the error rate as the performance measure of the machine learning algorithm, which leads to high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in the pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while mitigating unfair outcomes.
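The first contribution couples semi-supervised self-training with data augmentation to equalize class distributions. As a rough illustration of that idea (not the thesis implementation — the nearest-centroid classifier, the oversampling scheme and all names below are simplified assumptions), the sketch oversamples the minority pseudo-labelled class in each self-training round:

```python
# Hypothetical sketch of class-balanced self-training: each round, the
# unlabelled pool is pseudo-labelled and the smaller pseudo-labelled class
# is oversampled (a simple form of data augmentation) so the training set
# stays balanced. Illustrative only, not the thesis implementation.
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_predict(X, centroids):
    # Assign each point to the class of the nearest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def balanced_self_training(X_lab, y_lab, X_unl, rounds=3):
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        pseudo = nearest_centroid_predict(X_unl, centroids)
        # Oversample the smaller pseudo-labelled class to equalise counts.
        idx0, idx1 = np.where(pseudo == 0)[0], np.where(pseudo == 1)[0]
        n = max(len(idx0), len(idx1), 1)
        take = np.concatenate([
            rng.choice(idx0, n, replace=True) if len(idx0) else idx0,
            rng.choice(idx1, n, replace=True) if len(idx1) else idx1,
        ])
        X = np.vstack([X_lab, X_unl[take]])
        y = np.concatenate([y_lab, pseudo[take]])
    return X, y

# Toy imbalanced data: class 1 is the minority.
X_lab = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
y_lab = np.array([0, 0, 1])
X_unl = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (4, 2))])
X_aug, y_aug = balanced_self_training(X_lab, y_lab, X_unl)
print(np.bincount(y_aug))  # class counts of the augmented set: near-balanced
```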
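For the second contribution, a classic example of what a fairness-aware preprocessing layer can do is reweighing (in the spirit of Kamiran and Calders): instances are weighted so that the protected attribute and the label become statistically independent in the weighted training set. This is a generic textbook technique sketched for illustration, not necessarily the approach proposed in the thesis:

```python
# Minimal reweighing sketch: weight(s, y) = P(s) * P(y) / P(s, y), so the
# weighted joint distribution of protected attribute s and label y
# factorises. Variable names are illustrative.
import numpy as np

def reweigh(s, y):
    s, y = np.asarray(s), np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for sv in np.unique(s):
        for yv in np.unique(y):
            mask = (s == sv) & (y == yv)
            expected = (s == sv).mean() * (y == yv).mean()  # P(s) * P(y)
            observed = mask.mean()                          # P(s, y)
            w[mask] = expected / observed if observed > 0 else 0.0
    return w

# Biased toy sample: group s=1 rarely receives the positive label.
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
w = reweigh(s, y)
# After reweighing, weighted P(y=1 | s=0) equals weighted P(y=1 | s=1).
```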
A novel ensemble artificial intelligence approach for gully erosion mapping in a semi-arid watershed (Iran)
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. In this study, we introduce a novel hybrid artificial intelligence approach, called RF-ADTree, that uses rotation forest (RF) as a meta/ensemble classifier with an alternating decision tree (ADTree) as the base classifier, in order to spatially predict gully erosion in the Klocheh watershed of Kurdistan province, Iran. A total of 915 gully erosion locations along with 22 gully conditioning factors were used to construct a database. Several soft computing benchmark models (SCBM) were used for comparison with the designed model: the ADTree, support vector machines with two kernel functions, polynomial and radial basis function (SVM-Polynomial and SVM-RBF), logistic regression (LR), and the Naïve Bayes Multinomial Updatable (NBMU) model. Results indicated that 19 conditioning factors were effective, among which distance to river, geomorphology, land use, hydrological group, lithology and slope angle were the most remarkable for the gully modeling process. Additionally, the modeling results showed that the RF-ADTree ensemble model significantly improved (area under the curve (AUC) = 0.906) the prediction accuracy of the ADTree model (AUC = 0.882). The proposed model also had the highest performance (AUC = 0.913) in comparison to the SVM-Polynomial model (AUC = 0.879), the SVM-RBF model (AUC = 0.867), the LR model (AUC = 0.75), the ADTree model (AUC = 0.861) and the NBMU model (AUC = 0.811).
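The rotation step that gives rotation forest its name can be illustrated in a few lines: the feature set is split into random subsets, PCA is applied to each subset, and the per-subset loadings are assembled into one block-structured rotation matrix on which each base learner (ADTree in the paper) is then trained. The sketch below shows only this rotation step, under simplified assumptions (no bootstrap sampling or class subsampling, and an SVD-based PCA):

```python
# Hypothetical sketch of the Rotation Forest rotation step: random feature
# subsets, PCA per subset via SVD, loadings placed as blocks of a single
# orthogonal rotation matrix R. The base classifier would train on X @ R.
import numpy as np

rng = np.random.default_rng(1)

def rotation_matrix(X, n_subsets=2):
    n_features = X.shape[1]
    perm = rng.permutation(n_features)          # random feature split
    subsets = np.array_split(perm, n_subsets)
    R = np.zeros((n_features, n_features))
    for sub in subsets:
        Xc = X[:, sub] - X[:, sub].mean(axis=0)  # centre the subset
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        R[np.ix_(sub, sub)] = Vt.T               # PCA loadings as a block
    return R

X = rng.normal(size=(50, 6))
R = rotation_matrix(X)
X_rot = X @ R  # rotated features fed to the base classifier
```

In the full algorithm a different rotation matrix is drawn for every ensemble member, which is what keeps the base classifiers diverse while each still sees all features.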
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to the challenge of ILNSE by proposing two
novel algorithms: the 'Predictor for Streaming Data with Scarce Labels' (PSDSL) and the
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being available to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
between learning strategies and adapt to the prevailing conditions. PSDSL switches between self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of 'seed' learners of different types in an ensemble to
maintain its diversity. An ensemble is simply a combination of predictive models
grouped together to improve on the predictive performance of a single classifier.
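The weighted-majority idea underlying WMA, DWM and HDWM can be sketched compactly: each expert votes, the ensemble predicts the class with the largest total weight, and experts that err have their weight multiplied by a penalty factor beta. The toy experts below are plain callables for illustration; in HDWM they would be heterogeneous 'seed' learners, and experts would also be dynamically added and removed:

```python
# Minimal weighted-majority sketch (illustrative, not the HDWM algorithm):
# experts vote, the ensemble takes the weight-majority class, and wrong
# experts are penalised by beta.
def weighted_majority_predict(experts, weights, x):
    votes = {}
    for expert, w in zip(experts, weights):
        p = expert(x)
        votes[p] = votes.get(p, 0.0) + w
    return max(votes, key=votes.get)

def update_weights(experts, weights, x, y_true, beta=0.5):
    # Halve the weight of every expert that mispredicted this instance.
    return [w * beta if expert(x) != y_true else w
            for expert, w in zip(experts, weights)]

# Three toy experts on a small stream of (x, label) pairs.
experts = [lambda x: 0, lambda x: 1, lambda x: x % 2]
weights = [1.0, 1.0, 1.0]
stream = [(1, 1), (2, 0), (3, 1), (4, 0)]
for x, y in stream:
    weights = update_weights(experts, weights, x, y)
# The expert matching the true concept (x % 2) keeps its full weight,
# so it dominates subsequent ensemble predictions.
```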
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmark NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams, including randomised data instances.
PSDSL also performed significantly better than 'Static', i.e. a classifier that is not updated after it is
trained on the first examples in the data stream. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as Static. PSDSL achieved better average prediction
accuracies in a shorter time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous Weighted Majority Algorithm (WMA) and the homogeneous
Dynamic Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Moreover, when recurring concept drifts were present, the predictive performance of
HDWM improved over that of DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between the
algorithms: HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest Cover). The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithm
benefit from both forgetting and retaining models. The algorithm also
independently selects the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroid
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after concept drifts in the absence of true
class labels. PSDSL has been evaluated on the real-world problem of keystroke dynamics,
and the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC
(81.6%), while Static (49.0%) degraded performance significantly due to changes in
the users' typing patterns. Furthermore, the predictive accuracies of SCARGC were found to
fluctuate widely (between 41.1% and 81.6%) depending on the value of the parameter 'k'
(number of clusters), while PSDSL automatically determines the best value for this
parameter.
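The intuition behind guiding cluster labelling after drift can be illustrated with nearest-centroid label propagation: a conflicted micro-cluster centroid inherits the label of the nearest labelled centroid. This is only a hedged simplification; PSDSL's Envelope construction is more elaborate than the sketch below, and all names here are illustrative:

```python
# Nearest-centroid label propagation: each conflicted (unlabelled)
# micro-cluster centroid takes the label of the nearest labelled centroid.
import numpy as np

def propagate_labels(labelled_centroids, labels, conflicted_centroids):
    # Pairwise distances: (n_conflicted, n_labelled).
    d = np.linalg.norm(conflicted_centroids[:, None, :] -
                       labelled_centroids[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

labelled = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 1])
conflicted = np.array([[1.0, 0.5], [9.0, 9.5]])
propagate_labels(labelled, labels, conflicted)  # -> array([0, 1])
```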