27 research outputs found

    Automated Feature Engineering for Classification Problems

    The study of feature generation has grown over the last years and is one of the biggest challenges in Machine Learning. Entirely dependent on domain knowledge, it is an area that, if done manually, is time-consuming and not scalable. In turn, meta-learning helps transfer learning across different domains and can benefit this area. We present an automated feature engineering approach that uses meta-learning to assist in the selection of features. Since we generate a large number of features, we use knowledge from 100 data sets of different domains to answer whether or not to create features for a given data set, and also which features to use. Our experiment showed that meta-learning can be used in the selection process: it can tell us whether or not to generate the set of automatic features for a given data set, obtaining 66.96% accuracy against an overall baseline of 50%, and statistically our accuracy is shown to be better than the baseline in 88% of the cases. Unfortunately, we did not obtain an excellent result at the base level when using only the features that were selected individually, but at the meta level we obtain 65.52% accuracy when predicting which individual features would supposedly improve model performance. Given that our overall baseline is 39%, we statistically showed that our accuracy is better than the baseline in 93% of the cases. The results show that meta-learning can aid the generation and selection of features; however, our approach can still be improved, with more precise predictions at the meta level and better results at the base level. Our code is available at https://github.com/guifeliper/automated-feature-engineering
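The core idea above, training a meta-model on many historical data sets to predict whether feature generation will pay off for a new one, can be sketched as follows. The meta-feature choices, the random-forest meta-learner, and the synthetic meta-dataset are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: a meta-learner that predicts whether automated
# feature generation is likely to help a given dataset. Meta-feature
# names and the classifier choice are assumptions for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def meta_features(X, y):
    """Compute simple dataset-level meta-features (illustrative subset)."""
    n_rows, n_cols = X.shape
    return np.array([
        n_rows,
        n_cols,
        n_rows / n_cols,             # instances-to-attributes ratio
        len(np.unique(y)),           # number of classes
        np.mean(np.std(X, axis=0)),  # average feature spread
    ])

# Meta-dataset: one row of meta-features per historical dataset, with a
# binary label saying whether generated features improved accuracy there.
# Here both rows and labels are randomly generated placeholders.
rng = np.random.default_rng(0)
meta_X = np.vstack([
    meta_features(rng.normal(size=(int(rng.integers(50, 500)),
                                   int(rng.integers(3, 20)))),
                  rng.integers(0, 3, size=10))
    for _ in range(100)
])
meta_y = rng.integers(0, 2, size=100)  # placeholder outcomes

meta_model = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)

# For a new dataset, decide up front whether to run feature generation.
X_new = rng.normal(size=(120, 8))
y_new = rng.integers(0, 2, size=120)
should_generate = meta_model.predict([meta_features(X_new, y_new)])[0]
```

The same pattern extends to the per-feature question in the abstract: instead of one binary label per dataset, the meta-dataset would carry one label per generated feature.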

    FeatGeNN: Improving Model Performance for Tabular Data with Correlation-based Feature Extraction

    Automated Feature Engineering (AutoFE) has become an important task for any machine learning project, as it can help improve model performance and yield more information for statistical analysis. However, most current approaches for AutoFE rely on manual feature creation or use methods that can generate a large number of features, which can be computationally intensive and lead to overfitting. To address these challenges, we propose a novel convolutional method called FeatGeNN that extracts and creates new features using correlation as a pooling function. Unlike traditional pooling functions like max-pooling, correlation-based pooling considers the linear relationship between the features in the data matrix, making it more suitable for tabular data. We evaluate our method on various benchmark datasets and demonstrate that FeatGeNN outperforms existing AutoFE approaches with respect to model performance. Our results suggest that correlation-based pooling can be a promising alternative to max-pooling for AutoFE in tabular data applications.
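To make the contrast with max-pooling concrete, here is one plausible reading of correlation-based pooling on a tabular data matrix: within each window of adjacent columns, keep the column whose absolute Pearson correlation with the target is highest. This is our own illustrative interpretation; FeatGeNN's actual pooling operator may differ in detail.

```python
# Minimal sketch of correlation-based pooling (illustrative, not
# FeatGeNN's exact operator): pool groups of adjacent features by
# keeping the one most correlated with the target.
import numpy as np

def corr_pool(X, y, window=2):
    """Pool each group of `window` adjacent columns of X, keeping the
    column with the largest absolute Pearson correlation with y."""
    pooled = []
    for start in range(0, X.shape[1], window):
        block = X[:, start:start + window]
        corrs = [abs(np.corrcoef(block[:, j], y)[0, 1])
                 for j in range(block.shape[1])]
        pooled.append(block[:, int(np.argmax(corrs))])
    return np.column_stack(pooled)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # y tracks column 0

X_pooled = corr_pool(X, y, window=2)  # halves the feature count
```

Unlike max-pooling, which only looks at magnitudes within a window, this operator uses the relationship between a column and the target, which is the property the abstract argues matters for tabular data.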

    Automated Data Preparation using Semantics of Data Science Artifacts

    Data preparation is critical for improving model accuracy. However, data scientists often work independently, spending most of their time writing code to identify and select relevant features, and to enrich, clean, and transform their datasets to train predictive models for solving a machine learning problem. Working in isolation from each other, they lack support to learn from what other data scientists have done on similar datasets. This thesis addresses these challenges by presenting a novel approach that automates data preparation using the semantics of data science artifacts. To this end, this work proposes KGFarm, a holistic platform for automating data preparation based on machine learning models trained using the semantics of data science artifacts, captured as a knowledge graph (KG). These semantics comprise datasets and pipeline scripts. KGFarm seamlessly integrates with existing data science platforms, effectively enabling scientific communities to automatically discover and learn from each other's work. KGFarm's models were trained on top of a KG constructed from the top-rated 1000 Kaggle datasets and the 13800 pipeline scripts with the highest number of votes. Our comprehensive evaluation uses 130 unseen datasets collected from different AutoML benchmarks to compare KGFarm against state-of-the-art systems on data cleaning, data transformation, feature selection, and feature engineering tasks. Our experiments show that KGFarm consumes significantly less time and memory than the state-of-the-art systems while achieving comparable or better accuracy. Hence, KGFarm effectively handles large-scale datasets and empowers data scientists to automate data preparation pipelines interactively.
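The "learn from what others did on similar datasets" idea can be illustrated with a toy stand-in: profile each historical dataset with a few summary statistics and recommend the preparation step that was applied to the nearest profile. The real system trains models over a knowledge graph of datasets and pipeline scripts; this nearest-neighbour lookup, and all values in it, are purely illustrative.

```python
# Toy stand-in for recommending a data-preparation step from similar
# historical datasets. Profiles, transform names, and the similarity
# measure are made-up assumptions, not KGFarm's actual models or KG.
import numpy as np

# Profiles of historical datasets: (rows, columns, missing-value ratio),
# each paired with the preparation step its pipeline applied.
profiles = np.array([
    [1000.0, 10.0, 0.30],
    [200.0, 50.0, 0.01],
    [50000.0, 8.0, 0.10],
])
transforms = ["impute_median", "standard_scale", "log_transform"]

def recommend(profile):
    """Return the transform used on the most similar known dataset."""
    # Normalise each column so no single scale dominates the distance.
    scale = profiles.max(axis=0)
    d = np.linalg.norm(profiles / scale - np.asarray(profile) / scale,
                       axis=1)
    return transforms[int(np.argmin(d))]

rec = recommend([900, 12, 0.25])  # closest to the first profile
```

A knowledge graph generalises this sketch by linking datasets, columns, and pipeline operations explicitly, so recommendations can be made per column rather than per dataset.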

    Toward Efficient Automated Feature Engineering

    Automated Feature Engineering (AFE) refers to automatically generating and selecting optimal feature sets for downstream tasks, and has achieved great success in real-world applications. Current AFE methods mainly focus on improving the effectiveness of the produced features but ignore the low-efficiency issue for large-scale deployment. Therefore, in this work, we propose a generic framework to improve the efficiency of AFE. Specifically, we construct the AFE pipeline in a reinforcement learning setting, where each feature is assigned an agent to perform feature transformation and selection, and the evaluation score of the produced features on downstream tasks serves as the reward to update the policy. We improve the efficiency of AFE from two perspectives. On the one hand, we develop a Feature Pre-Evaluation (FPE) Model to reduce the sample size and feature size, the two main factors undermining the efficiency of feature evaluation. On the other hand, we devise a two-stage policy training strategy that runs FPE on the pre-evaluation task as the initialization of the policy, to avoid training the policy from scratch. We conduct comprehensive experiments on 36 datasets covering both classification and regression tasks. The results show 2.9% higher performance on average and 2x higher computational efficiency compared to state-of-the-art AFE methods.
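The pre-evaluation idea, scoring a candidate feature cheaply on a subsample before any full evaluation, can be sketched as below. The subsample size, the logistic-regression scorer, and the gain criterion are our assumptions; the paper's FPE model is a learned component, not this direct subsampled evaluation.

```python
# Illustrative sketch of feature pre-evaluation by row subsampling:
# estimate a candidate feature's gain in CV accuracy on a small sample,
# so only promising candidates get the costly full evaluation.
# Thresholds and the scoring model are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cheap_score(X, y, candidate, n_rows=100, rng=None):
    """Pre-evaluate `candidate` on a row subsample: CV-accuracy gain."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.choice(len(y), size=min(n_rows, len(y)), replace=False)
    base = cross_val_score(LogisticRegression(max_iter=1000),
                           X[idx], y[idx], cv=3).mean()
    aug = np.column_stack([X[idx], candidate[idx]])
    return cross_val_score(LogisticRegression(max_iter=1000),
                           aug, y[idx], cv=3).mean() - base

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # nonlinear target
useful = X[:, 0] * X[:, 1]               # interaction feature: helps
noise = rng.normal(size=1000)            # random feature: does not

gain_useful = cheap_score(X, y, useful, rng=rng)
gain_noise = cheap_score(X, y, noise, rng=rng)
```

In the full framework this cheap estimate would act as a filter inside the reinforcement-learning loop, reserving full-data evaluation (the reward computation) for candidates whose estimated gain clears a threshold.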

    Automated machine learning in practice : state of the art and recent results

    © 2019 IEEE. A main driver behind the digitization of industry and society is the belief that data-driven model building and decision making can contribute to higher degrees of automation and more informed decisions. Building such models from data often involves the application of some form of machine learning. Thus, there is an ever-growing demand for a workforce with the necessary skill set to do so. This demand has given rise to a new research topic concerned with fitting machine learning models fully automatically – AutoML. This paper gives an overview of the state of the art in AutoML with a focus on practical applicability in a business context, and provides recent benchmark results of the most important AutoML algorithms.