Mining optimal item packages using mixed integer programming
Traditional methods for discovering frequent patterns in large databases assign equal weight to every item in the database. In the real world, managerial decisions are based on the economic values attached to item sets. In this paper, we introduce the value-based frequent item packages problem. Furthermore, we provide a mixed integer linear programming (MILP) model for the value-based optimization problem in the context of transaction data. The problem discussed in this paper is to find an optimal set of item packages (or item sets making up the whole transaction) that returns maximum profit to the organization under limited resources. The specification of this problem opens the way for applying existing and new MILP solution techniques to a number of practical decision problems. The model has been implemented and tested with real-life retail data. The test results are reported in the paper.
Proactive Empirical Assessment of New Language Feature Adoption via Automated Refactoring: The Case of Java 8 Default Methods
Programming languages and platforms improve over time, sometimes resulting in
new language features that offer many benefits. However, despite these
benefits, developers may not always be willing to adopt them in their projects
for various reasons. In this paper, we describe an empirical study where we
assess the adoption of a particular new language feature. Studying how
developers use (or do not use) new language features is important in
programming language research and engineering because it gives designers
insight into the usability of the language to create meaningful programs in that
language. This knowledge, in turn, can drive future innovations in the area.
Here, we explore Java 8 default methods, which allow interfaces to contain
(instance) method implementations.
Default methods can ease interface evolution, make certain ubiquitous design
patterns redundant, and improve both modularity and maintainability. A focus of
this work is to discover, through a scientific approach and a novel technique,
situations where developers found these constructs useful and where they did
not, and the reasons for each. Although several studies center around assessing
new language features, to the best of our knowledge, this kind of construct has
not been previously considered.
Despite their benefits, we found that developers did not adopt default
methods in all situations. Our study consisted of submitting pull requests
introducing the language feature to 19 real-world, open source Java projects
without altering original program semantics. This novel assessment technique is
proactive in that the adoption was driven by an automatic refactoring approach
rather than waiting for developers to discover and integrate the feature
themselves. In this way, we set forth best practices and patterns of using the
language feature effectively earlier rather than later and are able to possibly
guide (near) future language evolution. We foresee this technique to be useful
in assessing other new language features, design patterns, and other
programming idioms.
An Introduction to Programming for Bioscientists: A Python-based Primer
Computing has revolutionized the biological sciences over the past several
decades, such that virtually all contemporary research in the biosciences
utilizes computer programs. The computational advances have come on many
fronts, spurred by fundamental developments in hardware, software, and
algorithms. These advances have influenced, and even engendered, a phenomenal
array of bioscience fields, including molecular evolution and bioinformatics;
genome-, proteome-, transcriptome- and metabolome-wide experimental studies;
structural genomics; and atomistic simulations of cellular-scale molecular
assemblies as large as ribosomes and intact viruses. In short, much of
post-genomic biology is increasingly becoming a form of computational biology.
The ability to design and write computer programs is among the most
indispensable skills that a modern researcher can cultivate. Python has become
a popular programming language in the biosciences, largely because (i) its
straightforward semantics and clean syntax make it a readily accessible first
language; (ii) it is expressive and well-suited to object-oriented programming,
as well as other modern paradigms; and (iii) the many available libraries and
third-party toolkits extend the functionality of the core language into
virtually every biological domain (sequence and structure analyses,
phylogenomics, workflow management systems, etc.). This primer offers a basic
introduction to coding, via Python, and it includes concrete examples and
exercises to illustrate the language's usage and capabilities; the main text
culminates with a final project in structural bioinformatics. A suite of
Supplemental Chapters is also provided. Starting with basic concepts, such as
that of a 'variable', the Chapters methodically advance the reader to the point
of writing a graphical user interface to compute the Hamming distance between
two DNA sequences.
Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables,
numerous exercises, and 19 pages of Supporting Information; currently in
press at PLOS Computational Biology
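The Hamming distance mentioned in the abstract above can be illustrated in a few lines. This is a minimal sketch and not code from the primer itself; the function name `hamming_distance` and the choice to ignore letter case are assumptions made for illustration.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    # zip pairs up the characters position by position; True counts as 1.
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


if __name__ == "__main__":
    # The two sequences differ at positions 3 and 6 (1-based).
    print(hamming_distance("GATTACA", "GACTATA"))  # prints 2
```

The distance is only defined for sequences of equal length, so the sketch raises an error rather than silently truncating the longer input.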
Mixed-Integer Convex Nonlinear Optimization with Gradient-Boosted Trees Embedded
Decision trees usefully represent sparse, high dimensional and noisy data.
Having learned a function from this data, we may want to thereafter integrate
the function into a larger decision-making problem, e.g., for picking the best
chemical process catalyst. We study a large-scale, industrially-relevant
mixed-integer nonlinear nonconvex optimization problem involving both
gradient-boosted trees and penalty functions mitigating risk. This
mixed-integer optimization problem with convex penalty terms broadly applies to
optimizing pre-trained regression tree models. Decision makers may wish to
optimize discrete models to repurpose legacy predictive models, or they may
wish to optimize a discrete model that particularly well-represents a data set.
We develop several heuristic methods to find feasible solutions, and an exact,
branch-and-bound algorithm leveraging structural properties of the
gradient-boosted trees and penalty functions. We computationally test our
methods on a concrete mixture design instance and a chemical catalysis industrial
instance.
Online fulfillment: f-warehouse order consolidation and bops store picking problems
Fulfillment of online retail orders is a critical challenge for retailers since the legacy infrastructure and control methods are ill suited for online retail. The primary performance goal of online fulfillment is speed or fast fulfillment, requiring received orders to be shipped or ready for pickup within a few hours. Several novel numerical problems characterize fast fulfillment operations and this research solves two such problems. Order fulfillment warehouses (F-Warehouses) are a critical component of the physical internet behind online retail supply chains. Two key distinguishing features of an F-Warehouse are (i) Explosive Storage Policy – A unique item can be stored simultaneously in multiple bin locations dispersed through the warehouse, and (ii) Commingled Bins – A bin can stock several different items simultaneously. The inventory dispersion profile of an item is therefore temporal and non-repetitive. The order arrival process is continuous, and each order consists of one or more items. From the set of pending orders, efficient picking lists of 10-15 items are generated. A picklist of items is collected in a tote, which is then transported to a packaging station, where items belonging to the same order are consolidated into a shipment package. There are multiple such stations.
This research formulates and solves the order consolidation problem. At any time, a batch of totes is to be processed through several available order packaging stations. Tote assignment to a station determines whether an order will be shipped in a single package or in multiple packages. Reduced shipping costs are a key operational goal of an online retailer, and the number of packages is a determining factor. The decision variable is which station a tote should be assigned to, and the performance objective is to minimize the number of packages and balance the packaging station workload. This research first formulates the order consolidation problem as a mixed integer programming model, and then develops two fast heuristics (#1 and #2) plus two solutions derived from clustering algorithms. For small problems, heuristic #2 is on average within 4.1% of the optimal solution. For larger problems, heuristic #2 outperforms all other algorithms. The performance behavior of heuristic #2 is further studied as a function of several characteristics.
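The tote-to-station decision described above can be made concrete with a small brute-force sketch. It is not the thesis's mixed integer model or either of its heuristics. It assumes that an order needs one package per distinct station receiving any of its totes, that workload is measured as the number of totes per station, and that the two goals are combined with a hypothetical weight `balance_weight`. All function names are illustrative.

```python
from itertools import product


def count_packages(tote_orders, assignment):
    """Total packages, assuming one package per (order, station) pair in use."""
    stations_per_order = {}
    for tote, station in enumerate(assignment):
        for order in tote_orders[tote]:
            stations_per_order.setdefault(order, set()).add(station)
    return sum(len(stations) for stations in stations_per_order.values())


def workload_spread(assignment, n_stations):
    """Difference between the busiest and idlest station, in totes."""
    load = [0] * n_stations
    for station in assignment:
        load[station] += 1
    return max(load) - min(load)


def best_assignment(tote_orders, n_stations, balance_weight=0.5):
    """Enumerate every assignment; only feasible for very small instances."""
    best = None
    for assignment in product(range(n_stations), repeat=len(tote_orders)):
        cost = (count_packages(tote_orders, assignment)
                + balance_weight * workload_spread(assignment, n_stations))
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    # Each tote lists the orders that have at least one item in it.
    totes = [{"A", "B"}, {"A"}, {"B", "C"}, {"C"}]
    print(best_assignment(totes, n_stations=2))
```

Exhaustive enumeration grows as the number of stations raised to the number of totes, which is why fast heuristics are of interest for realistic batch sizes.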
S-Strategy fulfillment is a store-based solution for fulfilling online customer orders. The S-Strategy is driven by two key motivations: first, retailers have a network of stores where the inventory is already dispersed, and second, forward-positioned inventory is expected to be faster and more economical than a warehouse-based F-Strategy. Orders are picked from store inventory and the customer then picks them up from the store (BOPS). A BOPS store has two distinguishing features: (i) in addition to shelf stock, the layout includes a space-constrained back stock of selected items, and (ii) a set of dedicated pickers who are scheduled to fulfill orders. This research solves two BOPS-related problems: (i) back stock strategy: assignment of items located in the back stock, and (ii) picker scheduling: effect of the number of pickers and work hours. A continuous flow of incoming orders is assumed for both problems, and the objective is to minimize fulfillment time and labor cost. For the back-stock problem, an assignment rule based on order frequency, forward location and order basket correlations achieves a 17.6% improvement over a no-back-stock store, while a rule based only on order frequency achieves a 12.4% improvement. Additional experiments across a range of order baskets are reported.
On the design of R-based scalable frameworks for data science applications
This thesis is comprised of three papers "On the design of R-based scalable frameworks for data science applications". We discuss the design of conceptual and computational frameworks for the R language for statistical computing and graphics and build software artifacts for two typical data science use cases: optimization problem solving and large scale text analysis. Each part follows a design science approach. We use a verification method for the software frameworks introduced, i.e., prototypical instantiations of the designed artifacts are evaluated on the basis of real-world applications in mixed integer optimization (consensus journal ranking) and text mining (culturomics).
The first paper introduces an extensible object oriented R Optimization Infrastructure (ROI). Methods from the field of optimization play an important role in many techniques routinely used in statistics, machine learning and data science. Often, implementations of these methods rely on highly specialized optimization algorithms, designed to be only applicable within a specific application. However, in many instances recent advances, in particular in the field of convex optimization, make it possible to conveniently and straightforwardly use modern solvers instead with the advantage of enabling broader usage scenarios and thus promoting reusability. With ROI one can formulate and solve optimization problems in a consistent way. It is capable of modeling linear, quadratic, conic, and general nonlinear optimization problems. Furthermore, the paper discusses how extension packages can add additional optimization solvers, read/write functions and additional resources such as model collections. Selected examples from the field of statistics conclude the paper.
With the second paper we aim to answer two questions. First, it addresses how to construct suitable aggregates of individual journal rankings, using an optimization-based consensus ranking approach. Second, the presented application serves as an evaluation of the ROI prototype. Regarding the first research question, we apply the proposed method to a subset of marketing-related journals from a list of collected journal rankings. Next, the paper studies the stability of the derived consensus solution, and the degeneration effects that occur when excluding journals and/or rankings. Finally, we investigate the similarities and dissimilarities of the consensus with a naive meta-ranking and with individual rankings. The results show that, even though journals are not uniformly ranked, one may derive a consensus ranking with considerably high agreement with the individual rankings.
In the third paper we examine how we can extend the text mining package tm to handle large (text) corpora. This enables statisticians to answer many interesting research questions via statistical analysis or modeling of data sets that cannot be analyzed easily otherwise, e.g., due to software- or hardware-induced data size limitations. Adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing large data sets by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. The paper presents a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We evaluate the presented prototype on the basis of an application in culturomics and show that it can handle data sets of significant size efficiently.
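The MapReduce programming model referred to above can be sketched on a single machine. The example below is not the tm.plugin.dc or Hadoop API, which are R and Java based; it is a plain Python illustration of the map and reduce phases applied to term counting, and the names `map_document` and `reduce_counts` are assumptions.

```python
from collections import Counter
from functools import reduce


def map_document(text):
    """Map phase: turn one document into its own term-frequency table."""
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce phase: merge two partial term-frequency tables."""
    return left + right


def term_frequencies(documents):
    # Each map call touches a single document, so in a real cluster these
    # calls could run on different machines; here they run sequentially.
    partial_counts = [map_document(doc) for doc in documents]
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    corpus = ["Text mining at scale", "Large scale text corpora"]
    print(term_frequencies(corpus).most_common(2))
```

Because merging counters is associative, partial results can be combined in any grouping, which is the property that lets the reduce step be distributed.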
ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures
Software debugging has been shown to utilize upwards of 50% of developers’ time. Machine programming, the field concerned with the automation of software (and hardware) development, has recently made progress in both research and production-quality automated debugging systems. In this paper, we present ControlFlag, a system that detects possible idiosyncratic violations in software control structures. ControlFlag also suggests possible corrections in the event a true error is detected. A novelty of ControlFlag is that it is entirely self-supervised; that is, it requires no labels to learn about the potential idiosyncratic programming pattern violations. In addition to presenting ControlFlag’s design, we also provide an abbreviated experimental evaluation.