CASP-DM: Context Aware Standard Process for Data Mining
We propose an extension of the Cross-Industry Standard Process for Data
Mining (CRISP-DM) that addresses specific challenges of machine learning and
data mining concerning context handling and model reuse. This new, general
context-aware process model is mapped onto the CRISP-DM reference model,
proposing several new or enhanced outputs.
Process Framework for Subscriber Management and Retention in Nigerian Telecommunication Industry
Subscriber churn is a serious concern in the global telecommunication industry. Hence, a dominant
approach to subscriber management and retention is churn control, since it is cheaper to retain
an existing subscriber than to acquire a new one. Predictive modeling employs data mining
techniques to identify patterns indicating that a group of subscribers is likely to
churn in the near future. However, the effectiveness of a subscriber retention strategy in an
organization can be further boosted if the reason for churn and the timing of churn can also
be predicted.
In this paper, we propose a data mining process framework that can be used to predict
churn, determine when a subscriber is likely to churn, provide the reason why a subscriber
may churn, and recommend appropriate intervention strategies for customer retention, using
a combination of statistical and machine learning techniques. The experiment is carried
out using data from a major telecom operator in Nigeria.
Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection
In recent years, there have been many practical applications of anomaly
detection such as in predictive maintenance, detection of credit fraud, network
intrusion, and system failure. The goal of anomaly detection is to identify in
the test data anomalous behaviors that are either rare or unseen in the
training data. This is a common goal in predictive maintenance, which aims to
forecast the imminent faults of an appliance given abundant samples of normal
behaviors. Local outlier factor (LOF) is one of the state-of-the-art models
used for anomaly detection, but the predictive performance of LOF depends
greatly on the selection of hyperparameters. In this paper, we propose a novel,
heuristic methodology to tune the hyperparameters in LOF. A tuned LOF model
that uses the proposed method shows good predictive performance in both
simulations and real data sets.
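The LOF score itself can be computed directly from its definition. Below is a minimal pure-Python sketch of the standard LOF computation (k-distance, reachability distance, local reachability density, final factor) on a small hypothetical 2-D data set. It illustrates why the neighborhood size k is the hyperparameter the paper's tuning method targets; the fixed k=3 used here is an illustrative choice, not the proposed heuristic.

```python
import math

def knn(points, i, k):
    # indices of the k nearest neighbours of points[i], excluding itself
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]

def k_distance(points, i, k):
    # distance from points[i] to its k-th nearest neighbour
    return max(math.dist(points[i], points[j]) for j in knn(points, i, k))

def reach_dist(points, i, j, k):
    # reachability distance of i with respect to j
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    # local reachability density: inverse of the mean reachability distance
    nbrs = knn(points, i, k)
    return len(nbrs) / sum(reach_dist(points, i, j, k) for j in nbrs)

def lof(points, i, k):
    # LOF is ~1 for inliers and substantially > 1 for outliers
    nbrs = knn(points, i, k)
    return sum(lrd(points, j, k) for j in nbrs) / (len(nbrs) * lrd(points, i, k))

# hypothetical data: a tight cluster near the origin plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = [lof(points, i, k=3) for i in range(len(points))]
```

Scores near 1 mark inliers, while the distant point receives a much larger score. Tuning, as the abstract motivates, then amounts to choosing k (and a decision threshold) so that such separations remain stable on the data at hand.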
Using webcrawling of publicly available websites to assess E-commerce relationships
We investigate e-commerce success factors with respect to their impact on the success of commercial transactions between companies. Many e-commerce success factors have been introduced in the scientific literature. Most of them focus on the quality of companies' websites and are evaluated in terms of companies' success in the business-to-consumer (B2C) environment, where consumers choose their preferred e-commerce websites based on success factors such as website content quality, website interaction, and website customization. In contrast to previous work, this research focuses on using existing e-commerce success factors to predict the success of business-to-business (B2B) e-commerce. The introduced methodology is based on identifying semantic textual patterns representing success factors on the websites of B2B companies. The usefulness of the identified success factors in B2B e-commerce is evaluated by regression modeling. As a result, it is shown that some B2C e-commerce success factors also enable the prediction of B2B e-commerce success while others do not. This contributes to the existing literature on e-commerce success factors. Further, these findings are valuable for the creation of B2B e-commerce websites.
Re-mining item associations: methodology and a case study in apparel retailing
Association mining is the conventional data mining technique for analyzing market basket data, and it reveals the positive and negative associations between items. Although pricing and time information are an integral part of transaction data, they have not been integrated into market basket analysis in earlier studies. This paper proposes a new approach to mining price, time, and domain-related attributes through re-mining of association mining results. The underlying factors behind positive and negative relationships can be characterized and described through this second data mining stage. The applicability of the methodology is demonstrated through the analysis of data from a large apparel retail chain, and its algorithmic complexity is analyzed in comparison to existing techniques.
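The two-stage idea, first mining pairwise item associations and then re-mining the resulting rules against additional attributes such as price, can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's algorithm: the baskets, the price table, and the use of lift to separate positive from negative associations are all assumptions made for the example.

```python
from itertools import combinations
from collections import Counter

def mine_and_remine(baskets, prices):
    """Stage 1: pairwise association mining via lift.
    Stage 2 ('re-mining'): annotate each discovered rule with its
    direction and a price attribute for further analysis."""
    n = len(baskets)
    item_c = Counter(i for b in baskets for i in set(b))
    pair_c = Counter(frozenset(p) for b in baskets
                     for p in combinations(set(b), 2))
    rules = {}
    for pair, c in pair_c.items():
        a, b = sorted(pair)
        # lift > 1: items co-occur more often than expected (positive
        # association); lift < 1: less often than expected (negative)
        lift = (c / n) / ((item_c[a] / n) * (item_c[b] / n))
        rules[(a, b)] = {
            "lift": lift,
            "direction": "positive" if lift > 1 else "negative",
            "mean_price": (prices[a] + prices[b]) / 2,  # re-mined attribute
        }
    return rules

# hypothetical transactions and price table
baskets = [["shirt", "tie"], ["shirt", "tie"], ["shirt", "tie", "jeans"],
           ["jeans", "belt"], ["jeans", "belt"], ["shirt", "belt"]]
prices = {"shirt": 20, "tie": 15, "jeans": 40, "belt": 10}
rules = mine_and_remine(baskets, prices)
```

The re-mined attributes (here just the mean price of the pair) can then themselves be mined, e.g. to ask whether negative associations concentrate in particular price bands, which is the spirit of the second stage described above.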
Yields and qualities of pigeonpea varieties grown under smallholder farmers’ conditions in Eastern and Southern Africa
Pigeonpea is one of the few crops with high potential for resource-poor farmers due to its complementary resource use when intercropped with maize. A comprehensive three-year comparative study of the performance of six pigeonpea (Cajanus cajan) varieties was undertaken on farmers' fields in Eastern and Southern Africa, where intercropping with maize is normal practice. The varieties were tested for accumulation of dry matter (DM), nitrogen (N) and phosphorus (P) in all above-ground organs for three years under farmers' conditions. The study revealed that the most recently introduced variety, ICEAP 00040, outperformed all the other tested varieties (ICP 9145, ICEAP 00020, ICEAP 00053, ICEAP 00068, and a local variety called "Babati White") under farmer-managed conditions. The harvest indices (HI), ranging from 0.08 to 0.15 on a dry matter (DM) basis, were relatively low and unaffected (P>0.05) by environmental variation. The N harvest index (NHI) was 0.28 and the P harvest index (PHI) was 0.19. The better responses of ICEAP 00040 to favourable conditions could, however, only be realised in a minority of cases, as yields were generally low. These low yields remain a major challenge in African smallholder agriculture, as pulses play an important role in soil fertility maintenance as well as in household diets.
Detecting variable responses in time-series using repeated measures ANOVA: Application to physiologic challenges.
We present an approach to analyzing physiologic time trends recorded during a stimulus by comparing means at each time point using repeated measures analysis of variance (RMANOVA). The approach allows temporal patterns to be examined without an a priori model of the expected timing or pattern of response. The approach was originally applied to signals recorded from functional magnetic resonance imaging (fMRI) volumes-of-interest (VOI) during a physiologic challenge, but we have used the same technique to analyze continuous recordings of other physiological signals such as heart rate, breathing rate, and pulse oximetry. For fMRI, the method serves as a complement to whole-brain voxel-based analyses, and is useful for detecting complex responses within pre-determined brain regions, or as a post-hoc analysis of regions of interest identified by whole-brain assessments. We illustrate an implementation of the technique in the statistical software packages R and SAS. VOI time trends are extracted from conventionally preprocessed fMRI images. A time trend of average signal intensity across the VOI during the scanning period is calculated for each subject. The values are scaled relative to baseline periods, and time points are binned. In SAS, the procedure PROC MIXED implements the RMANOVA in a single step. In R, we present one option for implementing RMANOVA with the mixed-model function "lme". Model diagnostics and predicted means and differences are best performed with additional libraries and commands in R; we present one example. The ensuing results allow determination of significant overall effects, and of time-point-specific within- and between-group responses relative to baseline. We illustrate the technique using fMRI data from two groups of subjects who underwent a respiratory challenge. RMANOVA allows insight into the timing of responses and response differences between groups, and so is suited to physiologic testing paradigms eliciting complex response patterns.
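For readers without SAS or R at hand, the core computation behind a one-way within-subjects (repeated measures) ANOVA is small enough to sketch directly. The pure-Python version below partitions the total sum of squares into subject, time, and error components and forms the F statistic for the time effect; the data are hypothetical, and no sphericity correction (something PROC MIXED or lme would address via the covariance structure) is applied.

```python
def rm_anova(data):
    """One-way repeated-measures ANOVA.
    data[s][t] = measurement for subject s at time point t."""
    n = len(data)        # number of subjects
    t = len(data[0])     # number of time points
    grand = sum(sum(row) for row in data) / (n * t)
    subj_means = [sum(row) / t for row in data]
    time_means = [sum(data[s][j] for s in range(n)) / n for j in range(t)]
    # partition the total sum of squares
    ss_subj = t * sum((m - grand) ** 2 for m in subj_means)
    ss_time = n * sum((m - grand) ** 2 for m in time_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_subj - ss_time
    df_time, df_err = t - 1, (t - 1) * (n - 1)
    # F statistic for the within-subjects time effect
    f = (ss_time / df_time) / (ss_err / df_err)
    return f, df_time, df_err

# hypothetical binned time trends for three subjects at three time points
data = [[1, 2, 3], [2, 3, 4], [1, 3, 5]]
f, df_time, df_err = rm_anova(data)
```

Removing the subject sum of squares from the error term is what distinguishes the repeated-measures design from an ordinary one-way ANOVA, and is the reason the approach gains power from each subject serving as its own baseline.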
On-Disk Data Processing: Issues and Future Directions
In this paper, we present a survey of "on-disk" data processing (ODDP). ODDP,
which is a form of near-data processing, refers to the computing arrangement
where the secondary storage drives have the data processing capability.
Proposed ODDP schemes vary widely in terms of the data processing capability,
target applications, architecture and the kind of storage drive employed. Some
ODDP schemes provide only a specific but heavily used operation like sort
whereas some provide a full range of operations. Recently, with the advent of
Solid State Drives, powerful and extensive ODDP solutions have been proposed.
In this paper, we present a thorough review of architectures developed for
different on-disk processing approaches, along with current and future
challenges, and identify future directions that ODDP can take.
Next challenges for adaptive learning systems
Learning from evolving streaming data has become a 'hot' research topic in the last decade, and many adaptive learning algorithms have been developed. This research was stimulated by rapidly growing amounts of industrial, transactional, sensor and other business data that arrive in real time and need to be mined in real time. Under such circumstances, constant manual adjustment of models is inefficient and, with increasing amounts of data, becomes infeasible. Nevertheless, adaptive learning models are still rarely employed in business applications in practice. In the light of rapidly growing, structurally rich 'big data', a new generation of parallel computing solutions and cloud computing services, as well as recent advances in portable computing devices, this article aims to identify the current key research directions to be taken to bring adaptive learning closer to application needs. We identify six forthcoming challenges in designing and building adaptive learning (prediction) systems: making adaptive systems scalable, dealing with realistic data, improving usability and trust, integrating expert knowledge, taking into account various application needs, and moving from adaptive algorithms towards adaptive tools. These challenges are critical for evolving stream settings, as the process of model building needs to be fully automated and continuous.