7 research outputs found

    A framework for increasing the value of predictive data-driven models by enriching problem domain characterization with novel features

    The need to leverage knowledge through data mining has driven enterprises to demand more data. However, there is a gap between the availability of data and the application of extracted knowledge for improving decision support. In fact, more data do not necessarily imply better predictive data-driven marketing models, since the problem domain often requires a deeper characterization. Aiming at such characterization, we propose a framework drawing on three feature selection strategies, where the goal is to unveil novel features that can effectively increase the value of data by providing a richer characterization of the problem domain. These strategies involve encompassing context (e.g., social and economic variables), evaluating past history, and disaggregating the main problem into smaller but interesting subproblems. The framework is evaluated through an empirical analysis of a real bank telemarketing application. The results show the benefits of such an approach: the area under the receiver operating characteristic curve increased with each stage, improving on the previous model in terms of predictive performance. The work of P. Cortez was supported by FCT within the Project Scope UID/CEC/00319/2013. The authors would like to thank the anonymous reviewers for their helpful comments.

    Forecasting tomorrow’s tourist

    Purpose: This study aims to present a very recent literature review on tourism demand forecasting based on 50 relevant articles published between 2013 and June 2016. Design/methodology/approach: For searching the literature, the 50 most relevant articles according to Google Scholar ranking were selected and collected. Then, each of the articles was scrutinized according to three main dimensions: the method or technique used for analyzing data; the location of the study; and the covered timeframe. Findings: The most widely used modeling technique continues to be time series, confirming a trend identified prior to 2011. Nevertheless, artificial intelligence techniques, most notably neural networks, have clearly become more widely used in recent years for tourism forecasting. This is a relevant subject for journals related to other social sciences, such as Economics, and tourism data also constitute an excellent source for developing novel modeling techniques. Originality/value: The present literature review offers recent insights on the scientific literature on tourism forecasting, providing evidence on current trends and revealing interesting research gaps.

    A comparative analysis of classifiers in cancer prediction using multiple data mining techniques

    In recent years, the application of data mining methods in the health industry has received increased attention from both health professionals and scholars. This paper presents a data mining framework for detecting breast cancer based on real data from a hospital in Iran by applying association rules and the most commonly used classifiers. The former were adopted for reducing the size of the datasets, while the latter were chosen for cancer prediction. A k-fold cross-validation procedure was included for evaluating the performance of the proposed classifiers. Among the six classifiers used in this paper, the support vector machine achieved the best results, with an accuracy of 93%. It is worth mentioning that the proposed approach can be applied to detecting other diseases as well.

    Mutual information and sensitivity analysis for feature selection in customer targeting: a comparative study

    Feature selection is a highly relevant task in any data-driven knowledge discovery project. The present research focuses on analysing the advantages and disadvantages of using mutual information (MI) and data-based sensitivity analysis (DSA) for feature selection in classification problems, by applying both to a bank telemarketing case. A logistic regression model is built on the tuned set of features that each of the two techniques identifies as most influential on the success of a telemarketing contact, for a total of 13 features for MI and 9 for DSA. The latter performs better for lower values of false positives, while the former is slightly better for a higher false-positive ratio. Thus, MI becomes a better choice if the intention is to slightly reduce the cost of contacts without risking the loss of a high number of successes. However, DSA achieved good prediction results with fewer features.

    Sharing is Caring: Using Open Data To Improve Targeting Policies

    When it comes to predictive power, companies in a variety of sectors depend on having sufficient data to develop and deploy business analytics applications, for example, to acquire new customers. While there is a vast literature on enriching internal data sets with external data sources, it is still largely unclear whether and how open data can be used to enrich internal data sets to improve business analytics. We choose a particular business analytics problem, designing targeting policies to acquire new customers, to investigate how an internal data set of a German grocery supplier can be enriched with open data to improve targeting policies. Using the enriched data set, we can improve the response rate of several well-established targeting policies by more than 30% in back-testing. Based on these results, we encourage firms and researchers to use, leverage, and share open data to enhance business analytics.

    Factors influencing charter flight departure delay

    This study aims to identify the main factors leading to charter flight departure delay through data mining. The data sample analysed consists of 5,484 flights operated by a European airline between 2014 and 2017. The tuned dataset of 33 features was used for modelling departure delay (e.g., whether the flight was delayed by more than 15 minutes). The results showed the value of the proposed approach, with an area under the receiver operating characteristic curve of 0.831, and supported knowledge extraction through data-based sensitivity analysis. The features related to previous flight delay information were found to be the most influential on whether the current flight is delayed, which is consistent with the propagating effect of flight delays. However, neither the reason for the previous delay nor its duration was the most relevant. Instead, a computed feature indicating whether there were two or more registered reasons accounted for 33% of relevance. The contributions also include a broader data mining approach, supported by an extensive data understanding and preparation stage that used both proprietary and open access data sources to build a comprehensive dataset.

    Stripping customers' feedback on hotels evaluation through data mining

    The emergence of online review platforms such as TripAdvisor has given tourists tools to write their opinions and rate hotels with a quantitative score, and these platforms are increasingly used for analysis, mainly in the hotel industry. While numerous studies are based on users' textual comments, research on the score is rather scarce. This study presents a data mining approach for modeling the TripAdvisor score using 504 reviews published in 2015 for 21 hotels located on the Strip, Las Vegas, with two reviews collected per hotel for each month of the year. The location was selected for its high tourism impact, since the city depends on its hotels and casinos. Nineteen features characterizing the reviews, hotels and users were prepared and fed to a support vector machine for modeling the score. The results show that the model is a good approximation for predicting the score. A sensitivity analysis was therefore applied to the model to extract useful knowledge in the form of each feature's relevance for the score. The findings revealed that user features related to TripAdvisor membership experience play a key role in influencing the scores granted, clearly surpassing hotel features.