3,070 research outputs found
An intelligent rule-oriented framework for extracting key factors for grants scholarships in higher education
Education is a fundamental sector in all countries, where in some countries students com-pete to get an educational grant due to its high cost. The incorporation of artificial intelli-gence in education holds great promise for the advancement of educational systems and pro-cesses. Educational data mining involves the analysis of data generated within educational environments to extract valuable insights into student performance and other factors that enhance teaching and learning. This paper aims to analyze the factors influencing students' performance and consequently, assist granting organizations in selecting suitable students in the Arab region (Jordan as a use case). The problem was addressed using a rule-based tech-nique to facilitate the utilization and implementation of a decision support system. To this end, three classical rule induction algorithms, namely PART, JRip, and RIDOR, were em-ployed. The data utilized in this study was collected from undergraduate students at the University of Jordan from 2010 to 2020. The constructed models were evaluated based on metrics such as accuracy, recall, precision, and f1-score. The findings indicate that the JRip algorithm outperformed PART and RIDOR in most of the datasets based on f1-score metric. The interpreted decision rules of the best models reveal that both features; the average study years and high school averages play vital roles in deciding which students should receive scholarships. The paper concludes with several suggested implications to support and en-hance the decision-making process of granting agencies in the realm of higher education
Meta-learning algorithms and applications
Meta-learning in the broader context concerns how an agent learns about their own learning, allowing them to improve their learning process. Learning how to learn is not only beneficial for humans, but it has also shown vast benefits for improving how machines learn. In the context of machine learning, meta-learning enables models to improve their learning process by selecting suitable meta-parameters that influence the learning. For deep learning specifically, the meta-parameters typically describe details of the training of the model but can also include description of the model itself - the architecture. Meta-learning is usually done with specific goals in mind, for example trying to improve ability to generalize or learn new concepts from only a few examples.
Meta-learning can be powerful, but it comes with a key downside: it is often computationally costly. If the costs would be alleviated, meta-learning could be more accessible to developers of new artificial intelligence models, allowing them to achieve greater goals or save resources. As a result, one key focus of our research is on significantly improving the efficiency of meta-learning. We develop two approaches: EvoGrad and PASHA, both of which significantly improve meta-learning efficiency in two common scenarios. EvoGrad allows us to efficiently optimize the value of a large number of differentiable meta-parameters, while PASHA enables us to efficiently optimize any type of meta-parameters but fewer in number.
Meta-learning is a tool that can be applied to solve various problems. Most commonly it is applied for learning new concepts from only a small number of examples (few-shot learning), but other applications exist too. To showcase the practical impact that meta-learning can make in the context of neural networks, we use meta-learning as a novel solution for two selected problems: more accurate uncertainty quantification (calibration) and general-purpose few-shot learning. Both are practically important problems and using meta-learning approaches we can obtain better solutions than the ones obtained using existing approaches. Calibration is important for safety-critical applications of neural networks, while general-purpose few-shot learning tests model's ability to generalize few-shot learning abilities across diverse tasks such as recognition, segmentation and keypoint estimation.
More efficient algorithms as well as novel applications enable the field of meta-learning to make more significant impact on the broader area of deep learning and potentially solve problems that were too challenging before. Ultimately both of them allow us to better utilize the opportunities that artificial intelligence presents
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to ILNSEâs challenge in proposing two
novel algorithms âPredictor for Streaming Data with Scarce Labelsâ (PSDSL) and
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being available to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. The PSDSL adapts to learning states between self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of âseedâ learners of different types in an ensemble to
maintain its diversity. The ensembles are simply the combination of predictive models
grouped to improve the predictive performance of a single classifier.
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than âStaticâ i.e. the classifier is not updated after it is
trained with the first examples in the data streams. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as the Static. PSDSL achieved better average prediction
accuracies in a short time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic
DWM algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between
algorithms, HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and 4 real-world datasets Electric, Spam, Sensor and Forest cover. The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithms
benefit from the use of both forgetting and retaining the models. The algorithm also
provides the independence of selecting the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering is introduced to resolve the cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroidsâ
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after the concept drifts in the absence of true
class labels. PSDSL has been evaluated on real-world problem âkeystroke dynamicsâ, and
the results show that PSDSL achieved higher prediction accuracy (85.3%) and SCARGC
(81.6%), while the Static (49.0%) significantly degrades the performance due to changes in
the users typing pattern. Furthermore, the predictive accuracies of SCARGC are found
highly fluctuated between (41.1% to 81.6%) based on different values of parameter âkâ
(number of clusters), while PSDSL automatically determine the best values for this
parameter
Water level identification with laser sensors, inertial units, and machine learning
Flood risk management usually hinges on accurate water level identification in urban streams such as rivers or creeks. Although research has emphasised the applicability of ultrasonic sensors as a contactless technology for sensor-based water level monitoring, Light Detection and Ranging (LiDAR) sensors are less sensitive to weather conditions that typically happen during flood events, such as dust, fog and rainfall. However, there has been little research on the applicability of LiDAR sensors in this field. No previous literature has analysed the impact of complicating variables on the quality of predictions or evaluated the possible benefits of using a combined approach with Inertial Measurement Units (IMU) and machine learning to produce superior predictions. In this work, we collected a dataset in a laboratory condition synchronising data from a LiDAR, an ultrasonic sensor and an IMU in an experimental device. We controlled the incidence angle, the distance, and the water turbidity to analyse their effect on the predictions. Traditional machine-learning techniques were evaluated as models to combine data from distance and inertial sensors, reducing the error rates compared to individual sensorsâ predictions. Results indicated a sharp drop in the mean absolute error, root mean squared error and coefficient of determination for all water turbidity and incidence angles considered, especially when tree-based ensembles were used. The ultrasonic sensor led to improved results for low water turbidity and increased incidence angle, but statistically significant differences were not found in the other cases
Predicting Paid Certification in Massive Open Online Courses
Massive open online courses (MOOCs) have been proliferating because of the free or low-cost offering of content for learners, attracting the attention of many stakeholders across the entire educational landscape. Since 2012, coined as âthe Year of the MOOCsâ, several platforms have gathered millions of learners in just a decade. Nevertheless, the certification rate of both free and paid courses has been low, and only about 4.5â13% and 1â3%, respectively, of the total number of enrolled learners obtain a certificate at the end of their courses. Still, most research concentrates on completion, ignoring the certification problem, and especially its financial aspects. Thus, the research described in the present thesis aimed to investigate paid certification in MOOCs, for the first time, in a comprehensive way, and as early as the first week of the course, by exploring its various levels. First, the latent correlation between learner activities and their paid certification decisions was examined by (1) statistically comparing the activities of non-paying learners with course purchasers and (2) predicting paid certification using different machine learning (ML) techniques. Our temporal (weekly) analysis showed statistical significance at various levels when comparing the activities of non-paying learners with those of the certificate purchasers across the five courses analysed. Furthermore, we used the learnerâs activities (number of step accesses, attempts, correct and wrong answers, and time spent on learning steps) to build our paid certification predictor, which achieved promising balanced accuracies (BAs), ranging from 0.77 to 0.95. Having employed simple predictions based on a few clickstream variables, we then analysed more in-depth what other information can be extracted from MOOC interaction (namely discussion forums) for paid certification prediction. However, to better explore the learnersâ discussion forums, we built, as an original contribution, MOOCSent, a cross- platform review-based sentiment classifier, using over 1.2 million MOOC sentiment-labelled reviews. MOOCSent addresses various limitations of the current sentiment classifiers including (1) using one single source of data (previous literature on sentiment classification in MOOCs was based on single platforms only, and hence less generalisable, with relatively low number of instances compared to our obtained dataset;) (2) lower model outputs, where most of the current models are based on 2-polar
iii
iv
classifier (positive or negative only); (3) disregarding important sentiment indicators, such as emojis and emoticons, during text embedding; and (4) reporting average performance metrics only, preventing the evaluation of model performance at the level of class (sentiment). Finally, and with the help of MOOCSent, we used the learnersâ discussion forums to predict paid certification after annotating learnersâ comments and replies with the sentiment using MOOCSent. This multi-input model contains raw data (learner textual inputs), sentiment classification generated by MOOCSent, computed features (number of likes received for each textual input), and several features extracted from the texts (character counts, word counts, and part of speech (POS) tags for each textual instance). This experiment adopted various deep predictive approaches â specifically that allow multi-input architecture - to early (i.e., weekly) investigate if data obtained from MOOC learnersâ interaction in discussion forums can predict learnersâ purchase decisions (certification). Considering the staggeringly low rate of paid certification in MOOCs, this present thesis contributes to the knowledge and field of MOOC learner analytics with predicting paid certification, for the first time, at such a comprehensive (with data from over 200 thousand learners from 5 different discipline courses), actionable (analysing learners decision from the first week of the course) and longitudinal (with 23 runs from 2013 to 2017) scale. The present thesis contributes with (1) investigating various conventional and deep ML approaches for predicting paid certification in MOOCs using learner clickstreams (Chapter 5) and course discussion forums (Chapter 7), (2) building the largest MOOC sentiment classifier (MOOCSent) based on learnersâ reviews of the courses from the leading MOOC platforms, namely Coursera, FutureLearn and Udemy, and handles emojis and emoticons using dedicated lexicons that contain over three thousand corresponding explanatory words/phrases, (3) proposing and developing, for the first time, multi-input model for predicting certification based on the data from discussion forums which synchronously processes the textual (comments and replies) and numerical (number of likes posted and received, sentiments) data from the forums, adapting the suitable classifier for each type of data as explained in detail in Chapter 7
The development of bioinformatics workflows to explore single-cell multi-omics data from T and B lymphocytes
The adaptive immune response is responsible for recognising, containing and eliminating viral infection, and protecting from further reinfection. This antigen-specific response is driven by T and B cells, which recognise antigenic epitopes via highly specific heterodimeric surface receptors, termed T-cell receptors (TCRs) and B cell receptors (BCRs). The theoretical diversity of the receptor repertoire that can be generated via homologous recombination of V, D and J genes is large enough (>1015 unique sequences) that virtually any antigen can be recognised. However, only a subset of these are generated within the human body, and how they succeed in specifically recognising any pathogen(s) and distinguishing these from self-proteins remains largely unresolved.
The recent advances in applying single-cell genomics technologies to simultaneously measure the clonality, surface phenotype and transcriptomic signature of pathogen- specific immune cells have significantly improved understanding of these questions. Single-cell multi-omics permits the accurate identification of clonally expanded populations, their differentiation trajectories, the level of immune receptor repertoire diversity involved in the response and the phenotypic and molecular heterogeneity.
This thesis aims to develop a bioinformatic workflow utilising single-cell multi-omics data to explore, quantify and predict the clonal and transcriptomic signatures of the human T-cell response during and following viral infection. In the first aim, a web application, VDJView, was developed to facilitate the simultaneous analysis and visualisation of clonal, transcriptomic and clinical metadata of T and B cell multi-omics data. The application permits non-bioinformaticians to perform quality control and common analyses of single-cell genomics data integrated with other metadata, thus permitting the identification of biologically and clinically relevant parameters. The second aim pertains to analysing the functional, molecular and immune receptor profiles of CD8+ T cells in the acute phase of primary hepatitis C virus (HCV) infection. This analysis identified a novel population of progenitors of exhausted T cells, and lineage tracing revealed distinct trajectories with multiple fates and evolutionary plasticity. Furthermore, it was observed that high-magnitude IFN-Îł CD8+ T-cell response is associated with the increased probability of viral escape and chronic infection. Finally, in the third aim, a novel analysis is presented based on the topological characteristics of a network generated on pathogen-specific, paired-chain, CD8+ TCRs. This analysis revealed how some cross-reactivity between TCRs can be explained via the sequence similarity between TCRs and that this property is not uniformly distributed across all pathogen-specific TCR repertoires. Strong correlations between the topological properties of the network and the biological properties of the TCR sequences were identified and highlighted.
The suite of workflows and methods presented in this thesis are designed to be adaptable to various T and B cell multi-omic datasets. The associated analyses contribute to understanding the role of T and B cells in the adaptive immune response to viral-infection and cancer
- âŠ