34 research outputs found

    Scaling forecasting algorithms using clustered modeling

    Get PDF
    Cataloged from PDF version of article.Research on forecasting has traditionally focused on building more accurate statistical models for a given time series. The models are mostly applied to limited data due to efficiency and scalability problems. However, many enterprise applications require scalable forecasting on large number of data series. For example, telecommunication companies need to forecast each of their customers' traffic load to understand their usage behavior and to tailor targeted campaigns. Forecasting models are typically applied on aggregate data to estimate the total traffic volume for revenue estimation and resource planning. However, they cannot be easily applied to each user individually as building accurate models for large number of users would be time consuming. The problem is exacerbated when the forecasting process is continuous and the models need to be updated periodically. This paper addresses the problem of building and updating forecasting models continuously for multiple data series. We propose dynamic clustered modeling for forecasting by utilizing representative models as an analogy to cluster centers. We apply the models to each individual series through iterative nonlinear optimization. We develop two approaches: The Integrated Clustered Modeling integrates clustering and modeling simultaneously, and the Sequential Clustered Modeling applies them sequentially. Our findings indicate that modeling an individual's behavior using its segment can be more scalable and accurate than the individual model itself. The grouped models avoid overfits and capture common motifs even on noisy data. Experimental results from a telco CRM application show the method is efficient and scalable, and also more accurate than having separate individual models

    Div-BLAST: Diversification of sequence search results

    Get PDF
    Cataloged from PDF version of article.Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of sequences. They return results significantly similar to the query sequence and that are typically highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by introducing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produce non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods are able to achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAS

    Diversity based Relevance Feedback for Time Series Search

    Get PDF
    We propose a diversity based relevance feedback approach for time series data to improve the accuracy of search results. We first develop the concept of relevance feedback for time series based on dual-tree complex wavelet (CWT) and SAX based approaches. We aim to enhance the search quality by incorporating diversity in the results presented to the user for feedback. We then propose a method which utilizes the representation type as part of the feedback, as opposed to a human choosing based on a preprocessing or training phase. The proposed methods utilize a weighting to handle the relevance feedback of important properties for both single and multiple representation cases. Our experiments on a large variety of time series data sets show that the proposed diversity based relevance feedback improves the retrieval performance. Results confirm that representation feedback incorporates item diversity implicitly and achieves good performance even when using simple nearest neighbor as the retrieval method. To the best of our knowledge, this is the first study on diversification of time series search to improve retrieval accuracy and representation feedback. © 2013 VLDB Endowment

    Predicting optimal facility location without customer locations

    Get PDF
    Deriving meaningful insights from location data helps businesses make better decisions. One critical decision made by a business is choosing a location for its new facility. Optimal location queries ask for a location to build a new facility that optimizes an objective function. Most of the existing works on optimal location queries propose solutions to return best location when the set of existing facilities and the set of customers are given. However, most businesses do not know the locations of their customers. In this paper, we introduce a new problem setting for optimal location queries by removing the assumption that the customer locations are known. We propose an optimal location predictor which accepts partial information about customer locations and returns a location for the new facility. The predictor generates synthetic customer locations by using given partial information and it runs optimal location queries with generated location data. Experiments with real data show that the predictor can find the optimal location when sufficient information is provided. © 2017 Copyright held by the owner/author(s)

    LineKing: Coffee Shop Wait-Time Monitoring Using Smartphones

    Get PDF
    This article describes LineKing, a crowdsensing system for monitoring and forecasting coffee shop line wait times. LineKing consists of a smartphone component that provides automatic and accurate wait-time detection, and a cloud backend that uses the collected data to provide accurate wait-time estimation. LineKing is used on a daily basis by hundreds of users to monitor the wait-times of a coffee shop in the University at Buffalo, SUNY. The novel wait-time estimation algorithms of LineKing deployed at the cloud backend provide median absolute errors of less than 3 minutes. © 2002-2012 IEEE

    Fast k-NN classifier for documents based on a graph structure

    Get PDF
    In this paper, a fast k nearest neighbors (k-NN) classifier for documents is presented. Documents are usually represented in a high-dimensional feature space, where terms appeared on it are treated as features and the weight of each term reflects its importance in the document. There are many approaches to find the vicinity of an object, but their performance drastically decreases as the number of dimensions grows. This problem prevents its application for documents. The proposed method is based on a graph index structure with a fast search algorithm. It’s high selectivity permits to obtain a similar classification quality than exhaustive classifier, with a few number of computed distances. Our experimental results show that it is feasible the use of the proposed method in problems of very high dimensionality, such as Text Mining

    Scaling forecasting algorithms using clustered modeling

    Get PDF
    Research on forecasting has traditionally focused on building more accurate statistical models for a given time series. The models are mostly applied to limited data due to efficiency and scalability problems. However, many enterprise applications require scalable forecasting on large number of data series. For example, telecommunication companies need to forecast each of their customers’ traffic load to understand their usage behavior and to tailor targeted campaigns. Forecasting models are typically applied on aggregate data to estimate the total traffic volume for revenue estimation and resource planning. However, they cannot be easily applied to each user individually as building accurate models for large number of users would be time consuming. The problem is exacerbated when the forecasting process is continuous and the models need to be updated periodically. This paper addresses the problem of building and updating forecasting models continuously for multiple data series. We propose dynamic clustered modeling for forecasting by utilizing representative models as an analogy to cluster centers. We apply the models to each individual series through iterative nonlinear optimization. We develop two approaches: The Integrated Clustered Modeling integrates clustering and modeling simultaneously, and the Sequential Clustered Modeling applies them sequentially. Our findings indicate that modeling an individual’s behavior using its segment can be more scalable and accurate than the individual model itself. The grouped models avoid overfits and capture common motifs even on noisy data. Experimental results from a telco CRM application show the method is efficient and scalable, and also more accurate than having separate individual models. © 2014, Springer-Verlag Berlin Heidelberg

    Location Recommendations for New Businesses Using Check-in Data

    Get PDF
    Location based social networks (LBSN) and mobile applications generate data useful for location oriented business decisions. Companies can get insights about mobility patterns of potential customers and their daily habits on shopping, dining, etc.To enhance customer satisfaction and increase profitability. We introduce a new problem of identifying neighborhoods with a potential of success in a line of business. After partitioning the city into neighborhoods, based on geographical and social distances, we use the similarities of the neighborhoods to identify specific neighborhoods as candidates for investment for a new business opportunity. We present two solutions for this new problem: i) a probabilistic approach based on Bayesian inference for location selection along with a voting based approximation, and ii) an adaptation of collaborative filtering using the similarity of neighborhoods based on co-existence of related venues and check-in patterns. We use Foursquare user check-in and venue location data to evaluate the performance of the proposed approach. Our experiments show promising results for identifying new opportunities and supporting business decisions using increasingly available check-in data sets. © 2016 IEEE

    A large-scale sentiment analysis for Yahoo! Answers

    Get PDF
    Sentiment extraction from online web documents has recently been an active research topic due to its potential use in commercial applications. By sentiment analysis, we refer to the problem of assigning a quantitative positive/negative mood to a short bit of text. Most studies in this area are limited to the identification of sentiments and do not investigate the interplay between sentiments and other factors. In this work, we use a sentiment extraction tool to investigate the influence of factors such as gender, age, education level, the topic at hand, or even the time of the day on sentiments in the context of a large online question answering site. We start our analysis by looking at direct correlations, e.g., we observe more positive sentiments on weekends, very neutral ones in the Science & Mathematics topic, a trend for younger people to express stronger sentiments, or people in military bases to ask the most neutral questions. We then extend this basic analysis by investigating how properties of the (asker, answerer) pair affect the sentiment present in the answer. Among other things, we observe a dependence on the pairing of some inferred attributes estimated by a user's ZIP code. We also show that the best answers differ in their sentiments from other answers, e.g., in the Business & Finance topic, best answers tend to have a more neutral sentiment than other answers. Finally, we report results for the task of predicting the attitude that a question will provoke in answers. We believe that understanding factors influencing the mood of users is not only interesting from a sociological point of view, but also has applications in advertising, recommendation, and search. Copyright 2012 ACM

    Generating Time-Varying Road Network Data Using Sparse Trajectories

    Get PDF
    While research on time-varying graphs has attracted recent attention, the research community has limited or no access to real datasets to develop effective algorithms and systems. Using noisy and sparse GPS traces from vehicles, we develop a time-varying road network data set where edge weights differ over time. We present our methodology and share this dataset, along with a graph manipulation tool. We estimate the traffic conditions using the sparse GPS data available by characterizing the sparsity issues and assessing the properties of travel sequence data frequency domain. We develop interpolation methods to complete the sparse data into a complete graph dataset with realistic time-varying edge values. We evaluate the performance of time-varying and static shortest path solutions over the generated dynamic road network. The shortest paths using the dynamic graph produce very different results than the static version. We provide an independent Java API and a graph database to analyze and manipulate the generated time-varying graph data easily, not requiring any knowledge about the inners of the graph database system. We expect our solution to support researchers to pursue problems of time-varying graphs in terms of theoretical, algorithmic, and systems aspects. The data and Java API are available at: http://elif.eser.bilkent.edu.tr/roadnetwork. © 2016 IEEE
    corecore