15 research outputs found

    Effect of Hyperparameter Tuning Using Random Search on Tree-Based Classification Algorithm for Software Defect Prediction

    Software is central to the field of information technology, yet it frequently suffers from significant defects; improving quality and reliability therefore requires defect prediction. Tree-based algorithms such as Random Forest, Deep Forest, and Decision Tree show promise in this domain; however, proper hyperparameter configuration is crucial for optimal outcomes. This study demonstrates the use of Random Search hyperparameter tuning to predict software defects, improving prediction accuracy. Using the ReLink datasets, we identified effective algorithm parameters for defect prediction. With Random Search, Decision Tree, Random Forest, and Deep Forest achieved an average AUC of 0.73, with the tuned Random Forest outperforming the other tree-based algorithms. The main contribution is the innovative application of Random Search hyperparameter tuning, particularly to Random Forest, where it offers distinct advantages over the other tree-based algorithms.
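As a hedged illustration (not the paper's exact setup), the sketch below tunes a Random Forest with scikit-learn's RandomizedSearchCV scored by AUC; the ReLink datasets are not included here, so synthetic data and illustrative parameter ranges stand in for the study's actual search space:

```python
# Minimal sketch of Random Search hyperparameter tuning for a tree-based
# defect classifier. Synthetic data replaces the ReLink datasets; the
# parameter ranges are illustrative assumptions, not the paper's.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_dist = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", None],
}

# Score candidates by AUC, the metric reported in the study.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```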

    Performance Analysis of Naive Bayes and Random Forest on Sentiment Toward the Fuel Price Increase in Indonesia (Analisis Performansi Naive Bayes Dan Random Forest Terhadap Sentimen Kenaikan Harga BBM di Indonesia)

    Fuel (Bahan Bakar Minyak, BBM) is an essential commodity in the economic activity of Indonesian society. A fuel price increase policy can negatively affect economic growth. The government has, however, undertaken various mitigation efforts, such as the BBM direct cash assistance (Bantuan Langsung Tunai BBM). This phenomenon has generated diverse sentiment among the public, and that sentiment can serve as a benchmark for government decision-making. Therefore, the Naïve Bayes Classifier (NBC) and Random Forest (RF) algorithms are used to classify public sentiment toward the fuel price increase policy using a Twitter text dataset of 250,000 tweets. The sentiment class labels are positive, neutral, and negative. Performance analysis is carried out for each algorithm in terms of accuracy, recall, and mean AUC-ROC. Both algorithms undergo hyperparameter tuning: the Laplace smoothing value for NBC, and the minimum samples split and minimum samples leaf values for RF. RF is found to perform better, with an accuracy of 85.15% and a mean AUC-ROC of 94.62%, compared with NBC's accuracy of 79.74% and mean AUC-ROC of 89.83%.
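A minimal sketch of the comparison, assuming scikit-learn; the 250,000-tweet corpus is not available here, so a tiny toy corpus stands in, and the tuned values shown (Laplace smoothing alpha, min_samples_split, min_samples_leaf) are illustrative defaults rather than the study's results:

```python
# Compare NBC and RF on a toy three-class sentiment task using the same
# metrics as the study: accuracy, macro recall, and mean AUC-ROC (OvR).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB

texts = ["fuel aid helps a lot", "price increase is fair", "cannot afford fuel now",
         "grateful for cash transfer", "neutral about the policy", "fuel too expensive"] * 10
labels = [2, 1, 0, 2, 1, 0] * 10  # 0=negative, 1=neutral, 2=positive

X = TfidfVectorizer().fit_transform(texts)
X_train, y_train, X_test, y_test = X[:48], labels[:48], X[48:], labels[48:]

for name, model in [("NBC", MultinomialNB(alpha=1.0)),  # Laplace smoothing
                    ("RF", RandomForestClassifier(min_samples_split=2,
                                                  min_samples_leaf=1,
                                                  random_state=0))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    print(name,
          accuracy_score(y_test, pred),
          recall_score(y_test, pred, average="macro"),
          roc_auc_score(y_test, proba, multi_class="ovr"))
```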

    Surrogate modelling of solar radiation potential for the design of PV module layout on entire façade of tall buildings

    This research investigated the performance of a surrogate modeling approach for the simulation of solar radiation potential on the vertical surfaces of tall buildings. Surrogate modeling is used to approximate the input–output behavior of an existing simulation model. The Random Forest (RF) machine learning approach was used to investigate three different scenarios, namely (1) Random variation, (2) Grid variation, and (3) Uniform variation, and a Genetic Algorithm was used to optimize the hyperparameters. A case study was performed to investigate the performance of surrogate models using a building on the Sir George William (SGW) campus of Concordia University in downtown Montreal, Canada. The results suggest that even using only a small sample of random solutions, surrogate modeling can achieve up to 94% accuracy in the prediction of solar radiation potentials. Of the three scenarios, the best accuracy was obtained with the Random variation method. Solar radiation simulation is highly complex and very sensitive to location and shadow effects, so those factors cannot be simplified away when approximating solar radiation potential. In addition, using RF, computation was 16 times faster than with the existing simulation model.
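A minimal sketch of the surrogate-modelling idea, assuming scikit-learn; fake_simulator is a hypothetical stand-in for the expensive solar-radiation model, and the GA-based hyperparameter optimization is omitted for brevity:

```python
# Fit a Random Forest surrogate on a small random sample of simulator
# input/output pairs, then predict in place of the expensive simulation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def fake_simulator(x):
    """Hypothetical stand-in for the radiation model (uses 2 of 4 inputs)."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2 + 0.1 * rng.normal(size=len(x))

X_train = rng.uniform(-2, 2, size=(200, 4))   # small "Random variation" sample
X_test = rng.uniform(-2, 2, size=(500, 4))
y_train, y_test = fake_simulator(X_train), fake_simulator(X_test)

surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)
print("surrogate R2:", round(r2_score(y_test, surrogate.predict(X_test)), 3))
```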

    Optimization of mechanical properties of multiscale hybrid polymer nanocomposites: A combination of experimental and machine learning techniques

    Machine learning (ML) models provide fast and accurate predictions of material properties at a low computational cost. Herein, the mechanical properties of multiscale poly(3-hydroxybutyrate) (P3HB)-based nanocomposites reinforced with different concentrations of multiwalled carbon nanotubes (MWCNTs), WS2 nanosheets and sepiolite (SEP) nanoclay have been predicted. The nanocomposites were prepared via solution casting. SEM images revealed that the three nanofillers were homogeneously and randomly dispersed in the matrix. A synergistic reinforcement effect was attained, resulting in an unprecedented stiffness improvement of 132% upon addition of 1:2:2 wt% SEP:MWCNTs:WS2. Conversely, the increments in strength were only moderate (up to 13.4%). A beneficial effect on the matrix ductility was also found due to the presence of the nanofillers. Four ML approaches, Recurrent Neural Network (RNN), RNN with Levenberg's algorithm (RNN-LV), decision tree (DT) and Random Forest (RF), were applied. The correlation coefficient (R2), mean absolute error (MAE) and mean square error (MSE) were used as statistical indicators to compare their performance. The best-performing model for the Young's modulus was RNN-LV with 3 hidden layers and 50 neurons in each layer, while for the tensile strength it was the RF model using a combination of 100 estimators and a maximum depth of 100. An RNN model with 3 hidden layers was the most suitable to predict the elongation at break and impact strength, with 90 and 50 neurons in each layer, respectively. The highest correlation (R2 of 1 and 0.9203 for the training and test set, respectively) and the smallest errors (MSE of 0.13 and MAE of 0.31) were obtained for the prediction of the elongation at break. The developed models represent a powerful tool for the optimization of the mechanical properties of multiscale hybrid polymer nanocomposites, saving time and resources in the experimental characterization process.
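A minimal sketch of the best-performing tensile-strength model described above (an RF with 100 estimators and a maximum depth of 100), assuming scikit-learn and random placeholder data in place of the measured compositions and properties:

```python
# Train the RF configuration reported for tensile strength and score it
# with the same statistical indicators (R2, MAE, MSE). Placeholder data
# stand in for the filler concentrations and measured properties.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(120, 3))                   # wt% SEP, MWCNTs, WS2 (placeholder)
y = 20 + 5 * X.sum(axis=1) + rng.normal(0, 0.5, 120)   # tensile strength (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=100, max_depth=100, random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)
print(r2_score(y_te, pred), mean_absolute_error(y_te, pred),
      mean_squared_error(y_te, pred))
```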

    Un-factorize non-food NPS on a food-based retailer

    Master's dissertation in Statistics for Data Science. More and more companies realise that understanding their customers can be a way to improve customer satisfaction and, consequently, customer loyalty, which in turn can result in an increase in sales. The NPS has been widely adopted by managers as a measure of customer loyalty and a predictor of sales growth. In this regard, this dissertation aims to create a classification model focused not only on identifying the customer's NPS class, namely, classifying the customer as Detractor, Passive or Promoter, but also on understanding which factors have the most impact on the customer's classification. The goal is to collect relevant business insights as a way to identify areas that can help to improve customer satisfaction. We propose a Data Mining approach to the NPS multi-class classification problem. Our approach leverages survey data, as well as transactional data collected through a retailer's loyalty card, building a data set from which we can extract information such as NPS ratings, customer behaviour and store details. Initially, an exploratory analysis is done on the data. Several resampling techniques are applied to the data set to handle class imbalance. Two different machine learning algorithms are applied: Decision Trees and Random Forests. The results did not show good model performance. An error analysis was then performed on the latter model, where it was concluded that the classifier has difficulty distinguishing the Detractor and Passive classes, but performs well when predicting the Promoter class.
In a business sense, this methodology can be leveraged to distinguish the Promoters from the rest of the consumers, since the Promoters are the customer segment most likely to provide good value in the long term and can benefit the company by spreading the word and attracting new customers.
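A minimal sketch of this pipeline, assuming scikit-learn plus the imbalanced-learn package for resampling; the survey and loyalty-card features are synthetic placeholders:

```python
# Rebalance an imbalanced three-class NPS dataset, train a Random Forest,
# and inspect per-class performance (Promoters vs. the rest).
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.2, 0.3, 0.5], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

rf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, rf.predict(X_te),
                            target_names=["Detractor", "Passive", "Promoter"]))
```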

    Machine Learning for Performance Aware Virtual Network Function Placement

    With the growing demand for data connectivity, network service providers are faced with the task of reducing their capital and operational expenses while simultaneously improving network performance and addressing the increased connectivity demand. Although Network Function Virtualization has been identified as a potential solution, several challenges must be addressed to ensure its feasibility. The work presented in this thesis addresses the Virtual Network Function (VNF) placement problem through the development of a machine learning-based Delay-Aware Tree (DAT), which learns from the previous placement of VNF instances forming a Service Function Chain. The DAT is able to predict VNF instance placements with an average of 34 μs of additional delay compared to the near-optimal BACON heuristic VNF placement algorithm. The DAT's max depth hyperparameter is then optimized using Particle Swarm Optimization (PSO), improving performance by an average of 44 μs and yielding the Depth-Optimized Delay-Aware Tree (DO-DAT).
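A minimal sketch of the depth-optimization step, assuming a stock scikit-learn decision tree on synthetic data as a stand-in for the Delay-Aware Tree, with a bare-bones one-dimensional PSO tuning max_depth:

```python
# Tune a tree's max_depth with a simple PSO, in the spirit of the DO-DAT.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

def fitness(depth):
    model = DecisionTreeRegressor(max_depth=int(round(depth)), random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()  # maximize CV score

rng = np.random.default_rng(0)
pos = rng.uniform(1, 30, size=10)            # particle positions (candidate depths)
vel = np.zeros(10)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()]

for _ in range(15):
    r1, r2 = rng.random(10), rng.random(10)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 1, 30)           # keep depths in a valid range
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()]

print("PSO-selected max_depth:", int(round(gbest)))
```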

    Developing a Prediction Model for Assessing the Bankruptcy Risk of Finnish SMEs (Ennustemallin kehittäminen suomalaisten PK-yritysten konkurssiriskin määritykseen)

    Bankruptcy prediction is a subject of significant interest to both academics and practitioners because of its vast economic and societal impact. Academic research in the field is extensive and diverse; no consensus has formed regarding the superiority of different prediction methods or predictor variables. Most studies focus on large companies; small and medium-sized enterprises (SMEs) have received less attention, mainly due to data unavailability. Despite recent academic advances, simple statistical models are still favored in practical use, largely due to their understandability and interpretability. This study aims to construct a high-performing but user-friendly and interpretable bankruptcy prediction model for Finnish SMEs using financial statement data from 2008–2010. A literature review is conducted to explore the key aspects of bankruptcy prediction; the findings are used to design an empirical study. Five prediction models are trained on different predictor subsets and training samples, and two models are chosen for detailed examination based on the findings. A prediction model using the random forest method, utilizing all available predictors and the unadjusted training data containing an imbalance of bankrupt and non-bankrupt firms, is found to perform best. Superior performance compared to a benchmark model is observed in terms of both key metrics, and the random forest model is deemed easy to use and interpretable; it is therefore recommended for practical application. Equity ratio and financial expenses to total assets consistently rank as the two best predictors across models; otherwise the findings on predictor importance are mixed, but mainly in line with the prevalent views in the related literature. This study shows that constructing an accurate but practical bankruptcy prediction model is feasible, and it serves as a guideline for future scholars and practitioners seeking to achieve the same. Further research avenues are recognized based on the empirical findings and the extant literature. In particular, this study raises an important question regarding the appropriateness of the most commonly used performance metrics in bankruptcy prediction, given the rarity of bankruptcy cases. The area under the precision-recall curve (PR AUC), which is widely used in other fields of study, is deemed a suitable alternative and is recommended for measuring model performance in future bankruptcy prediction studies. Keywords: bankruptcy prediction, credit risk, machine learning
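A minimal sketch of that closing recommendation, assuming scikit-learn: score an imbalanced bankruptcy-style classifier with PR AUC (average precision) alongside ROC AUC, on synthetic data in place of the financial statements:

```python
# Evaluate a rare-event classifier with PR AUC rather than ROC AUC alone.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 3% bankrupt firms, mirroring the class imbalance kept in training.
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.97],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_te, scores), 3))
print("PR AUC :", round(average_precision_score(y_te, scores), 3))
```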

    Development of an R package to learn supervised classification techniques

    This TFG aims to develop a custom R package for teaching supervised classification algorithms, starting with the identification of requirements, including algorithms, data structures, and libraries. A strong theoretical foundation is essential for effective package design. Documentation will explain each function's purpose, accompanied by the necessary supporting materials. The package will include R scripts and data files in organized directories, complemented by a user manual for easy installation and usage, even for beginners. Built entirely from scratch without external dependencies, it is optimized for accuracy and performance. In conclusion, this TFG provides a roadmap for creating an R package to teach supervised classification algorithms, benefiting researchers and practitioners dealing with real-world challenges. Bachelor's Degree in Computer Engineering (Grado en Ingeniería Informática)
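The package itself is written in R; as a language-neutral illustration of the from-scratch, dependency-free style it targets, here is a minimal k-nearest-neighbours classifier in plain Python (a hypothetical example, not code from the package):

```python
# A dependency-free k-NN classifier: the kind of from-scratch
# implementation a teaching package would walk students through.
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated classes.
train_X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_X, train_y, (0.95, 0.9)))  # -> "b"
```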

    Pathology detection mechanisms through continuous acquisition of biological signals

    International Mention in the doctoral degree. Pattern identification is a widely known technology, used on a daily basis for both identification and authentication. Examples include biometric identification (fingerprint or facial), number plate recognition and voice recognition. However, when we move into the world of medical diagnostics this changes substantially. The field applies many recent innovations and technologies, but it is harder to find cases of pattern recognition applied to diagnosis, and the cases that do occur are always supervised by a specialist and performed in controlled environments. This behaviour is expected: in this field, a false negative (failure to identify a pathology when it does exist) can be critical and lead to serious consequences for the patient. This can be mitigated by configuring the algorithm to be safe against false negatives; however, doing so raises the false positive rate, which may increase the specialist's workload in the best case, or even result in treatment being given to a patient who does not need it. In many cases, then, validation of the algorithm's decision by a specialist is necessary; still, there may be cases where this validation is less essential, or where the first identification can be treated as a guideline to help the specialist. With this objective in mind, this thesis focuses on the development of an algorithm for the identification of lower-body pathologies. The identification is carried out by means of the way people walk (gait). Gait differs from one person to another, even making biometric identification possible through its use; however, when a person has a pathology, whether physical or psychological, their gait is affected, and this alteration generates a common pattern depending on the type of pathology. This thesis focuses exclusively on the identification of physical pathologies. Another important aspect of this thesis is that the different algorithms are created with portability in mind, so that users are not obliged to carry out the walks under excessive restrictions (in terms of both clothing and location). First, different algorithms are developed using different configurations of smartphones for database acquisition; in particular, configurations using 1, 2 and 4 phones are used. The phones are placed on the legs using special holders so that they cannot move freely. Once all the walks have been captured, the first step is to filter the signals to remove possible noise. The signals are then processed to extract the different gait cycles (each corresponding to two steps) that make up the walks. Once the feature extraction process is finished, part of the features are used to train different machine learning algorithms, which are then used to classify the remaining features. However, the evidence obtained through the experiments with the different configurations and algorithms indicates that it is not feasible to perform pathology identification using smartphones. This can mainly be attributed to three factors: the quality of the signals captured by the phones, the unstable sampling frequency, and the lack of synchrony between the phones. Secondly, due to the poor results obtained using smartphones, the capture device is changed to a professional motion acquisition system, and two types of algorithm are proposed, one based on neural networks and the other based on the algorithms used previously.
Firstly, the acquisition of a new database is proposed. To facilitate data capture, a procedure is established that lets the user walk in an unconstrained environment. Once all the data are available, the preprocessing is similar to that applied previously: the signals are filtered to remove noise and the different gait cycles that make up the walks are extracted. However, since the capture device provides information from several sensors and several locations, instead of using a common cut-off frequency, a cut-off frequency is set empirically for each signal and position. With the data ready, a recurrent neural network based on the literature is created as a first approximation to the problem. Given the feasibility of the neural network, different experiments are carried out with the aim of improving its performance. Finally, the other algorithm picks up the legacy of the first part of the thesis. As before, it is based on the parameterisation of the gait cycles and employs machine learning algorithms. Unlike raw time signals, parameterised cycles can contain spurious data; to eliminate them, the dataset undergoes a preparation phase (cleaning and scaling). The prepared dataset is then split in two: one part is used to train the algorithms, which are then used to classify the remaining samples. The results of these experiments validate the feasibility of this algorithm for pathology detection. Next, different experiments are carried out with the aim of reducing the amount of information needed to identify a pathology without compromising accuracy. From these experiments, it can be concluded that it is feasible to detect pathologies using only 2 sensors placed on one leg.
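A minimal sketch of this kind of pipeline, assuming scipy and scikit-learn; the signal, sampling rate, cut-off frequency and pathology labels are synthetic placeholders, and peak-to-peak segmentation stands in for the thesis's gait-cycle extraction:

```python
# Filter a gait-like signal, segment it into cycles at detected peaks,
# extract simple per-cycle features, and classify with an ML model.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks
from sklearn.ensemble import RandomForestClassifier

fs = 100.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 1.0 * t) + 0.3 * rng.normal(size=t.size)

# 1) Low-pass filter out high-frequency noise (cut-off is illustrative).
b, a = butter(4, 5.0 / (fs / 2), btype="low")
clean = filtfilt(b, a, signal)

# 2) Segment into cycles between successive peaks (~ two steps each).
peaks, _ = find_peaks(clean, distance=fs * 0.5)
cycles = [clean[s:e] for s, e in zip(peaks[:-1], peaks[1:])]

# 3) Parameterise each cycle with a few simple features.
feats = np.array([[c.mean(), c.std(), c.max() - c.min(), len(c) / fs]
                  for c in cycles])
labels = rng.integers(0, 2, size=len(feats))  # placeholder pathology labels

# 4) Train on part of the cycles, classify the rest.
clf = RandomForestClassifier(random_state=0).fit(feats[:40], labels[:40])
print("accuracy:", (clf.predict(feats[40:]) == labels[40:]).mean())
```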
Doctoral Programme in Electrical, Electronic and Automatic Engineering, Universidad Carlos III de Madrid. Chair: María del Carmen Sánchez Ávila. Secretary: Mariano López García. Panel member: Richard Matthew Gues