Statistical Data Modeling and Machine Learning with Applications
The modeling and processing of empirical data is one of the main subjects and goals of statistics. Nowadays, with the development of computer science, the extraction of useful and often hidden information and patterns from data sets of varying volume and complexity has been added to these goals. New and powerful statistical techniques with machine learning (ML) and data mining paradigms have been developed. To one degree or another, all of these techniques and algorithms rest on a rigorous mathematical basis, including probability theory and mathematical statistics, operations research, mathematical analysis, and numerical methods. Popular ML methods, such as artificial neural networks (ANN), support vector machines (SVM), decision trees, and random forests (RF), produce models that can be viewed as straightforward applications of optimization theory and statistical estimation. The wide arsenal of classical statistical approaches, combined with powerful ML techniques, allows many challenging practical problems to be solved. This Special Issue belongs to the section “Mathematics and Computer Science”. Its aim is to present a collection of carefully selected papers on new and original methods, data analyses, case studies, comparative studies, and other research on statistical data modeling and ML, as well as their applications. Particular attention is given, but not limited, to theories and applications in diverse areas such as computer science, medicine, engineering, banking, education, sociology, and economics. The resulting palette of methods, algorithms, and applications for statistical modeling and ML presented in this Special Issue is expected to contribute to the further development of research in this area.
We also believe that the new knowledge and applied results presented here will be attractive and useful to young scientists, doctoral students, and researchers from various scientific specialties.
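To make the editorial's point concrete, that popular ML models are direct applications of optimization theory and statistical estimation, here is a minimal sketch: logistic regression fitted by gradient ascent on the log-likelihood (maximum likelihood estimation). The synthetic data, learning rate, and iteration count are illustrative choices, not taken from any paper in the Issue.

```python
import numpy as np

# Illustrative sketch: logistic regression as MLE solved by a
# first-order optimization method. All parameters are assumptions.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w)
    grad = X.T @ (y - p) / n  # gradient of the average log-likelihood
    w += lr * grad            # gradient ascent step

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

The same template, a likelihood (or loss) plus an optimizer, underlies most of the ML methods the editorial lists.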
Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
By combining metal nodes with organic linkers we can potentially synthesize
millions of possible metal organic frameworks (MOFs). At present, we have
libraries of over ten thousand synthesized materials and millions of in-silico
predicted materials. The fact that we have so many materials opens many
exciting avenues to tailor make a material that is optimal for a given
application. However, from an experimental and computational point of view we
simply have too many materials to screen using brute-force techniques. In this
review, we show that having so many materials allows us to use big-data methods
as a powerful technique to study these materials and to discover complex
correlations. The first part of the review gives an introduction to the
principles of big-data science. We emphasize the importance of data collection,
methods to augment small data sets, and how to select appropriate training sets.
An important part of this review is the set of approaches used to
represent these materials in feature space. The review also includes a general
overview of the different ML techniques, but as most applications in porous
materials use supervised ML our review is focused on the different approaches
for supervised ML. In particular, we review the different methods to optimize
the ML process and how to quantify the performance of these methods. In
the second part, we review how the different approaches of ML have been applied
to porous materials. In particular, we discuss applications in the field of gas
storage and separation, the stability of these materials, their electronic
properties, and their synthesis. The range of topics illustrates the large
variety of topics that can be studied with big-data science. Given the
increasing interest of the scientific community in ML, we expect this list to
rapidly expand in the coming years.
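The supervised-ML workflow the review describes (represent each material as a feature vector, hold out a test set, fit a model, quantify performance on unseen materials) can be sketched as follows. The feature dimensions, the linear ground truth, and the ridge-regression model are illustrative assumptions, not the review's actual benchmarks.

```python
import numpy as np

# Hedged sketch: supervised learning on hypothetical MOF descriptors
# (e.g. pore size, surface area, density), with held-out evaluation.
rng = np.random.default_rng(42)
n_mofs, n_feats = 500, 6
X = rng.normal(size=(n_mofs, n_feats))        # feature-space representation
coef = rng.normal(size=n_feats)
y = X @ coef + 0.1 * rng.normal(size=n_mofs)  # stand-in target, e.g. gas uptake

# 80/20 train/test split so performance is measured on unseen materials
idx = rng.permutation(n_mofs)
train, test = idx[:400], idx[400:]

# ridge regression in closed form: w = (X^T X + a I)^{-1} X^T y
a = 1e-2
Xtr, ytr = X[train], y[train]
w = np.linalg.solve(Xtr.T @ Xtr + a * np.eye(n_feats), Xtr.T @ ytr)

# quantify performance with mean absolute error on the test set
mae = np.mean(np.abs(X[test] @ w - y[test]))
```

The held-out split is the key point: performance quantified on the training materials alone would overstate how well the model generalizes to the millions of unscreened candidates.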
Machine learning using radiomics and dosiomics for normal tissue complication probability modeling of radiation-induced xerostomia
In routine clinical practice, the risk of xerostomia is typically managed by limiting the mean radiation dose to the parotid glands. This approach used to give satisfactory results. In recent years, however, several studies have reported that mean-dose models fail to recognize xerostomia risk. This can be explained by a strong improvement in overall dose conformality in radiotherapy due to recent technological advances, and thereby a substantial reduction of the mean dose to the parotid glands. This thesis investigated novel approaches to building reliable normal tissue complication probability (NTCP) models of xerostomia in this context.
For the purpose of the study, a cohort of 153 head-and-neck cancer patients treated with radiotherapy at Heidelberg University Hospital was retrospectively collected. The predictive performance of the mean dose to the parotid glands was evaluated with the Lyman-Kutcher-Burman (LKB) model. In order to examine the individual predictive power of predictors describing parotid shape (radiomics), dose shape (dosiomics), and demographic characteristics, a total of 61 different features were defined and extracted from the DICOM files. These included the patient’s age and sex, parotid shape features, features related to the dose-volume histogram, the mean dose to subvolumes of the parotid glands, spatial dose gradients, and three-dimensional dose moments. In the multivariate analysis, a variety of machine learning algorithms was evaluated: 1) classification methods, which discriminated between patients at high and low risk of complication; 2) feature selection techniques, which aimed to select a number of highly informative covariates from a large set of predictors; 3) sampling methods, which reduced the class imbalance; and 4) data cleaning methods, which reduced noise in the data set. The predictive performance of the models was validated internally, using nested cross-validation, and externally, using an independent patient cohort from the PARSPORT clinical trial.
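The nested cross-validation scheme described above can be sketched with scikit-learn: feature selection and hyperparameter tuning run in an inner loop, while an outer loop estimates generalization performance, so the selection step never sees the outer test folds. The synthetic data (standing in for the 61 radiomic/dosiomic features), the logistic-regression classifier, and the grid values are illustrative assumptions, not the thesis's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a cohort with 61 extracted features.
X, y = make_classification(n_samples=150, n_features=61,
                           n_informative=8, random_state=0)

# Feature selection lives INSIDE the pipeline, so it is refit on each
# training fold only, avoiding selection bias.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}

inner = GridSearchCV(pipe, grid, cv=3)       # inner loop: model selection
scores = cross_val_score(inner, X, y, cv=5)  # outer loop: unbiased estimate
mean_acc = scores.mean()
```

Running selection on the full data set before cross-validating, by contrast, leaks information from the test folds and inflates the estimated performance.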
The LKB model showed fairly good performance on mild-to-severe (G1+) xerostomia predictions. The corresponding dose-response curve revealed that even small doses to the parotid glands increase the risk of xerostomia and should be kept as low as possible. For the patients who did develop moderate-to-severe (G2+) xerostomia, the mean dose was not an informative predictor, even though the efficient sparing of the parotid glands made it possible to achieve low G2+ xerostomia rates. The features describing the shape of a parotid gland and the shape of a dose proved to be highly predictive of xerostomia. In particular, the parotid volume and the spatial dose gradients in the transverse plane explained xerostomia well. The comparison of machine learning algorithms showed that the particular choice of a classifier and a feature selection method can significantly influence the predictive performance of the NTCP model. In general, support vector machines and extra-trees achieved top performance, especially for the endpoints with a large number of observations. For the endpoints with a smaller number of observations, simple logistic regression often performed on a par with the top-ranking machine learning algorithms. The external validation showed that the analyzed multivariate models did not generalize well to the PARSPORT cohort. The only features that were predictive of xerostomia in both the Heidelberg (HD) and the PARSPORT cohorts were the spatial dose gradients in the right-left and the anterior-posterior directions. Substantial differences in the distribution of covariates between the two cohorts were observed, which may be one of the reasons for the weak generalizability of the HD models.
The results presented in this thesis call into question the applicability of NTCP models of xerostomia based only on the mean dose to the parotid glands in highly conformal radiotherapy treatments. The spatial dose gradients in the left-right and the anterior-posterior directions proved to be predictive of xerostomia in both the HD and the PARSPORT cohorts. This finding is especially important as it is not limited to a single cohort but describes a general pattern present in two independent data sets. The performance of the sophisticated machine learning methods may indicate a need for larger patient cohorts in studies on NTCP models in order to fully benefit from their advantages. Last but not least, the observed covariate shift between the HD and the PARSPORT cohorts motivates, in the author’s opinion, the need to report information about covariate distributions when publishing novel NTCP models.
Machine Learning-Based Models for Prediction of Toxicity Outcomes in Radiotherapy
In order to limit radiotherapy (RT)-related side effects, effective toxicity prediction and assessment schemes are essential. In recent years, the growing interest in artificial intelligence and machine learning (ML) within the scientific community has led to the implementation of innovative tools in RT. Several researchers have demonstrated the high performance of ML-based models in predicting toxicity, but the application of these approaches in the clinic is still lagging, partly due to their low interpretability. Therefore, an overview of contemporary research is needed in order to familiarize practitioners with common methods and strategies. Here, we present a review of ML-based models for predicting and classifying RT-induced complications from both a methodological and a clinical standpoint, focusing on the types of features considered, the ML methods used, and the main results achieved. Our work overviews published research on multiple cancer sites, including brain, breast, esophagus, gynecological, head and neck, liver, lung, and prostate cancers. The aim is to define the current state of the art and the main achievements within the field for both researchers and clinicians.
Application of advanced machine learning techniques to early network traffic classification
The fast-paced evolution of the Internet is creating a complex context that
imposes demanding requirements for assuring end-to-end Quality of Service. The
development of advanced intelligent approaches in networking envisions
features such as autonomous resource allocation and fast reaction to
unexpected network events. Internet Network Traffic Classification
constitutes a crucial source of information for Network Management and is decisive
in assisting the emerging network control paradigms. Monitoring the traffic flowing
through network devices supports tasks such as network orchestration, traffic
prioritization, network arbitration, and cyberthreat detection, amongst others.
Traditional traffic classifiers have become obsolete owing to the rapid
evolution of the Internet. Port-based classifiers suffer significant accuracy losses
due to port masking, while Deep Packet Inspection approaches have severe
user-privacy limitations. The advent of Machine Learning has propelled the
application of advanced algorithms in diverse research areas, and some learning
approaches have proved to be an interesting alternative to the classic traffic
classification approaches.
Addressing Network Traffic Classification from a Machine Learning perspective
implies numerous challenges demanding research efforts to achieve feasible
classifiers. In this dissertation, we endeavor to formulate and solve important
research questions in Machine-Learning-based Network Traffic Classification. As a
result of numerous experiments, the knowledge provided in this research constitutes
an engaging case study in which network traffic data from two different
environments are successfully collected, processed, and modeled.
Firstly, we approached the Feature Extraction and Selection processes, providing our
own contributions. A Feature Extractor was designed to create Machine-Learning-ready
datasets from real traffic data, and a Feature Selection Filter based on fast
correlation was proposed and tested on several classification datasets. Then, the
original Network Traffic Classification datasets were reduced using our Selection
Filter to provide efficient classification models. Many classification models based on
CART Decision Trees were analyzed, exhibiting excellent outcomes in identifying
various Internet applications. The experiments presented in this research comprise
a comparison amongst ensemble learning schemes, an exploratory study on Class
Imbalance and its solutions, and an analysis of IP-header predictors for early traffic
classification. This thesis is presented as a compendium of JCR-indexed
scientific manuscripts; furthermore, one conference paper is included.
In the present work we study a wide range of learning approaches, employing the
most advanced methodology in Machine Learning. As a result, we identify the
strengths and weaknesses of these algorithms, providing our own solutions to
overcome the observed limitations. In short, this thesis shows that Machine
Learning offers advanced techniques that open promising prospects in
Internet Network Traffic Classification.
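The kind of model the thesis evaluates, a CART decision tree trained on per-flow features available early in a connection, can be sketched as follows. The feature set (sizes and inter-arrival times of the first packets) and the two synthetic "applications" are illustrative assumptions, not the thesis's actual datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

def make_flows(n, size_mu, iat_mu, label):
    # Early-classification features: an observer only needs the first
    # few packets of a flow, not its full payload.
    sizes = rng.normal(size_mu, 50, size=(n, 3))  # first-3-packet sizes
    iats = rng.exponential(iat_mu, size=(n, 2))   # inter-arrival times
    return np.hstack([sizes, iats]), np.full(n, label)

X0, y0 = make_flows(300, 400.0, 0.01, 0)   # e.g. web-like traffic
X1, y1 = make_flows(300, 1200.0, 0.10, 1)  # e.g. bulk-transfer-like traffic
X = np.vstack([X0, X1])
y = np.concatenate([y0, y1])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=5).fit(Xtr, ytr)
acc = tree.score(Xte, yte)
```

Unlike port-based or Deep Packet Inspection methods, a classifier of this form never inspects payloads, which is precisely the privacy advantage the abstract highlights.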
Fault Detection Framework for Imbalanced and Sparsely-Labeled Data Sets Using Self-Organizing Maps
While machine learning techniques developed for fault detection usually assume that the classes in the training data are balanced, in real-world applications this is seldom the case. These techniques also usually require labeled training data, which is costly and time-consuming to obtain. In this context, a data-driven framework is developed to detect faults in systems where the condition monitoring data is either imbalanced or consists of mostly unlabeled observations. To mitigate the problem of class imbalance, self-organizing maps (SOMs) are trained in a supervised manner, using the same map size for both classes of data, prior to performing classification. The optimal SOM size for balancing the classes in the data, the size of the neighborhood function, and the learning rate are determined by performing multiobjective optimization on SOM quality measures, such as quantization error and information entropy, and on performance measures, such as training time and classification error. For training data sets that contain a majority of unlabeled observations, a transductive semi-supervised approach is used to label the neurons of an unsupervised SOM before performing supervised SOM classification on the test data set. The developed framework is validated using artificial and real-world fault detection data sets.
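The core mechanism, training a SOM, labeling its neurons from the available observations, and classifying new points by their best-matching unit, can be sketched in plain NumPy. The map size, learning-rate and neighborhood schedules, and the two synthetic condition-monitoring classes are illustrative assumptions; the abstract's framework tunes such parameters by multiobjective optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced 2-class data: many "healthy", few "faulty" observations.
healthy = rng.normal(0.0, 1.0, size=(200, 4))
faulty = rng.normal(4.0, 1.0, size=(20, 4))
X = np.vstack([healthy, faulty])
y = np.array([0] * 200 + [1] * 20)

rows, cols, dim = 5, 5, 4
W = rng.normal(size=(rows * cols, dim))                # neuron weights
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])

def bmu(x):
    """Index of the best-matching unit for sample x."""
    return np.argmin(np.sum((W - x) ** 2, axis=1))

# Online SOM training with decaying learning rate and Gaussian
# neighborhood (both schedules are illustrative choices).
n_iter = 2000
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    b = bmu(x)
    lr = 0.5 * (1 - t / n_iter)
    sigma = 2.0 * (1 - t / n_iter) + 0.5
    d2 = np.sum((grid - grid[b]) ** 2, axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))                 # neighborhood weights
    W += lr * h[:, None] * (x - W)

# Label each neuron by majority vote of the samples mapped to it.
votes = np.zeros((rows * cols, 2))
for xi, yi in zip(X, y):
    votes[bmu(xi), yi] += 1
neuron_label = votes.argmax(axis=1)

# Classify new observations via their best-matching unit's label.
test = np.vstack([rng.normal(0.0, 1.0, size=(10, 4)),
                  rng.normal(4.0, 1.0, size=(10, 4))])
pred = np.array([neuron_label[bmu(x)] for x in test])
```

In the semi-supervised variant the abstract describes, the neuron-labeling step would use only the small labeled subset while the map itself is trained on all (mostly unlabeled) observations.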
Customer Churn Detection and Marketing Retention Strategies in the Online Food Delivery Business
The purpose of this thesis is to analyze the behavior of customers in the
Online Food Delivery industry and to develop a prediction model that detects,
among valuable active customers, those who will leave the services of Alpha
Corporation in the near future.
Firstly, valuable customers are defined as those consumers who have made at
least 8 orders in the last 12 months. Considering the historical behavior of
these users, and applying Feature Engineering techniques, a first approach is
proposed based on the implementation of a Random Forest algorithm and, later, a
boosting algorithm: XGBoost.
Once the performance of each of the developed models is analyzed and potential
churners are identified, different marketing suggestions are proposed in order to
retain these customers. Retention strategies will be based on how Alpha Corporation
works, as well as on the output of the predictive model. Other development
alternatives will also be discussed: a clustering model based on potential churners,
and an unstructured-data model to analyze the emotions of those users according to
the NPS surveys. The aim of these proposals is to complement the prediction in order
to design more specific retention marketing strategies.