
    Separable Convex Optimization with Nested Lower and Upper Constraints

    We study a convex resource allocation problem in which lower and upper bounds are imposed on partial sums of allocations. This model is linked to a large range of applications, including production planning, speed optimization, stratified sampling, support vector machines, portfolio management, and telecommunications. We propose an efficient gradient-free divide-and-conquer algorithm, which uses monotonicity arguments to generate valid bounds from the recursive calls and to eliminate linking constraints based on information from the sub-problems. This algorithm does not require strict convexity or differentiability. It produces an $\epsilon$-approximate solution for the continuous problem in $\mathcal{O}(n \log m \log \frac{n B}{\epsilon})$ time and an integer solution in $\mathcal{O}(n \log m \log B)$ time, where $n$ is the number of decision variables, $m$ is the number of constraints, and $B$ is the resource bound. A complexity of $\mathcal{O}(n \log m)$ is also achieved for the linear and quadratic cases. These are the best complexities known to date for this important problem class. Our experimental analyses confirm the good performance of the method, which produces optimal solutions for problems with up to 1,000,000 variables in a few seconds. Promising applications to the support vector ordinal regression problem are also investigated.
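    To make the problem structure concrete, the short Python sketch below sets up a separable convex objective with nested lower and upper bounds on partial sums and hands it to a generic constrained solver. It is not the paper's divide-and-conquer algorithm; the quadratic objective, the random data, and the use of scipy.optimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import LinearConstraint, minimize

# Illustrative separable convex objective: f(x) = sum_i (x_i - a_i)^2.
rng = np.random.default_rng(0)
n = 8
a = rng.uniform(0.0, 10.0, size=n)

def f(x):
    return np.sum((x - a) ** 2)

# Nested constraints: L_k <= x_1 + ... + x_k <= U_k for every prefix k.
A = np.tril(np.ones((n, n)))                   # A @ x gives the partial sums
U = np.cumsum(rng.uniform(5.0, 10.0, size=n))  # nondecreasing upper bounds
L = 0.5 * U                                    # nondecreasing lower bounds

# A generic solver stands in for the paper's specialized algorithm.
res = minimize(f, x0=np.zeros(n), method="trust-constr",
               constraints=[LinearConstraint(A, L, U)],
               bounds=[(0.0, None)] * n)
print("allocation:", np.round(res.x, 3))
print("prefix sums feasible:",
      bool(np.all((A @ res.x >= L - 1e-6) & (A @ res.x <= U + 1e-6))))
```

    For instances with millions of variables, a generic solver like this would be far slower than the specialized near-linear-time methods the paper targets; the sketch only pins down the constraint structure.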

    CDCL(Crypto) and Machine Learning based SAT Solvers for Cryptanalysis

    Over the last two decades, we have seen a dramatic improvement in the efficiency of conflict-driven clause-learning Boolean satisfiability (CDCL SAT) solvers on industrial problems from a variety of applications such as verification, testing, security, and AI. The availability of such powerful general-purpose search tools as the SAT solver has led many researchers to propose SAT-based methods for cryptanalysis, including techniques for finding collisions in hash functions and breaking symmetric encryption schemes. A feature of all previously proposed SAT-based cryptanalysis work is that it is blackbox: the cryptanalysis problem is encoded as a SAT instance and a CDCL SAT solver is then invoked to solve that instance. A weakness of this approach is that the encoding thus generated may be too large for any modern solver to handle efficiently. Perhaps a more important weakness is that the solver is in no way specialized or tuned to the given instance. Finally, very little work has been done to leverage parallelism in the context of SAT-based cryptanalysis. To address these issues, we developed a set of methods that improve on state-of-the-art SAT-based cryptanalysis along three fronts. First, we describe an approach called CDCL(Crypto), inspired by the CDCL(T) paradigm, to tailor the internal subroutines of the CDCL SAT solver with domain-specific knowledge about cryptographic primitives. Specifically, we extend the propagation and conflict analysis subroutines of CDCL solvers with specialized code that has knowledge of the cryptographic primitive being analyzed by the solver. We demonstrate the power of this framework on two cryptanalysis tasks, algebraic fault attack and differential cryptanalysis, applied to the SHA-1 and SHA-256 cryptographic hash functions. Second, we propose a machine-learning-based parallel SAT solver that performs well on cryptographic problems relative to many state-of-the-art parallel SAT solvers. Finally, we use a formulation of SAT as Bayesian moment matching to address the heuristic initialization problem in SAT solvers.
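    As background for the blackbox paradigm that the thesis improves upon, the sketch below encodes a tiny toy constraint as CNF and invokes an off-the-shelf CDCL solver unchanged. The python-sat (pysat) package and the trivial encoding are assumptions chosen for illustration; they are unrelated to the CDCL(Crypto) extensions or to the actual hash-function encodings.

```python
# Blackbox SAT pipeline in miniature: encode the problem as CNF, then call a
# CDCL solver as-is.  Uses the python-sat (pysat) package as an illustrative
# choice; real cryptanalytic encodings involve millions of clauses.
from pysat.solvers import Glucose3

# Toy constraint over variables 1..3: (x1 XOR x2) AND (x2 -> x3).
cnf = [
    [1, 2],    # x1 OR x2
    [-1, -2],  # NOT x1 OR NOT x2  (together with the above: x1 XOR x2)
    [-2, 3],   # x2 implies x3
]

with Glucose3(bootstrap_with=cnf) as solver:
    if solver.solve():
        print("satisfying assignment:", solver.get_model())
    else:
        print("unsatisfiable")
```

    In the blackbox setting the solver never sees more than these clauses; the CDCL(Crypto) idea is to additionally expose the underlying primitive to the solver's propagation and conflict analysis.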

    Analyzing ranking data using decision tree

    Ranking/preference data arise in many applications in marketing, psychology, and politics. We establish a new decision tree model for the analysis of ranking data by adopting the concept of classification and regression trees [2]. We modify the existing splitting criteria, Gini and entropy, so that they can precisely measure the impurity of a set of ranking data. Two types of impurity measures for ranking data are introduced, namely n-wise and top-k measures. Minimal cost-complexity pruning is used to find the optimum-sized tree. In model assessment, the area under the ROC curve (AUC) is applied to evaluate tree performance. The proposed methodology is applied to analyze a partial ranking dataset of Inglehart's items collected in the 1993 International Social Science Programme survey; changes in the importance of item values with country, age, and level of education are identified.

    Postprint. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2008), Antwerp, Belgium, 15-19 September 2008. In Proceedings of ECML PKDD 2008, p. 139-15
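    To illustrate the kind of splitting criterion described above, here is a minimal sketch of a top-k Gini impurity that treats two rankings as the same class when their top-k items coincide. This is an assumed simplification for illustration; the paper's exact n-wise and top-k definitions may differ.

```python
from collections import Counter

def top_k_gini(rankings, k=1):
    """Gini impurity of a node of rankings, judged only by top-k prefixes.

    rankings: list of tuples, each an ordering of items, e.g. ('A', 'C', 'B').
    Two rankings count as the same class when their first k items coincide
    (an assumed simplification of a top-k impurity measure).
    """
    prefixes = Counter(tuple(r[:k]) for r in rankings)
    total = sum(prefixes.values())
    return 1.0 - sum((c / total) ** 2 for c in prefixes.values())

node = [('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('A', 'B', 'C')]
print(top_k_gini(node, k=1))  # top item is A three times, B once
print(top_k_gini(node, k=2))  # finer classes, typically higher impurity
```

    A splitting rule would compare this impurity before and after a candidate split, exactly as Gini is used in standard classification trees.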

    Statistical Methods of Data Integration, Model Fusion, and Heterogeneity Detection in Big Biomedical Data Analysis

    Interesting and challenging methodological questions arise from the analysis of Big Biomedical Data, where viable solutions are sought with the help of modern computational tools. In this dissertation, I look at problems in biomedical studies related to data integration, data heterogeneity, and related statistical learning algorithms. The overarching strategy throughout the dissertation research is rooted in the treatment of individual datasets, rather than individual subjects, as the elements of focus. Thus, I generalized some of the traditional subject-level methods to be tailored for the development of Big Data methodologies. Following an introductory overview in the first chapter, Chapter II concerns the development of fusion learning of model heterogeneity in data integration via a regression coefficient clustering method. The statistical learning procedure is built for generalized linear models and enforces an adjacent fusion penalty on ordered parameters (Wang et al., 2016). This is an adaptation of the fused lasso (Tibshirani et al., 2005) and an extension of the homogeneity pursuit (Ke et al., 2015), which considers only a single dataset. Using this method, we can identify regression coefficient heterogeneity across sub-datasets and fuse homogeneous subsets to greatly simplify the regression model, so as to improve statistical power. The proposed fusion learning algorithm (published as Tang and Song (2016)) allows the integration of a large number of sub-datasets, a clear advantage over traditional methods with stratum-covariate interactions or random effects. The method is useful for clustering treatment effects, so that outlying studies may be detected. We demonstrate our method with datasets from the Panel Study of Income Dynamics and from the Early Life Exposures in Mexico to Environmental Toxicants study. The method has also been extended to the Cox proportional hazards model to handle time-to-event responses.

    Chapter III, under the assumption of a homogeneous generalized linear model, focuses on the development of a divide-and-combine method for extremely large data that may be stored on distributed file systems. Using confidence distributions (Fisher, 1956; Efron, 1993), I develop a procedure to combine results from different sub-datasets, where the lasso is used to reduce model size in order to achieve numerical stability. The algorithm fits into the MapReduce paradigm and can be perfectly parallelized. To deal with the estimation bias incurred by lasso regularization, a de-biasing step is invoked so that the proposed method enjoys valid inference. The method is conceptually simple, and computationally scalable and fast, with numerical evidence illustrated in comparisons against the benchmark maximum likelihood estimator based on the full data and against other competing divide-and-combine-type methods. We apply the method to a large public dataset from the National Highway Traffic Safety Administration to identify risk factors of accident injury.

    In Chapter IV, I generalize the fusion learning algorithm given in Chapter II and develop a coefficient clustering method for correlated data in the context of generalized estimating equations. The motivation for this generalization is to assess model heterogeneity for the pattern mixture modeling approach (Little, 1993), where models are stratified by missing data patterns; this is one of the primary strategies in the literature for dealing with informative missing data mechanisms. My method aims to simplify the pattern mixture model by fusing homogeneous parameters under the generalized estimating equations (GEE, Liang and Zeger (1986)) framework.

    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145885/1/lutang_1.pd
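    To give a flavor of the divide-and-combine step in Chapter III, the numpy sketch below fits an ordinary least squares model on each sub-dataset (the Map step) and then combines the block estimates by inverse-variance weighting (the combine step). The homogeneous linear model, the simulated data, and the plain OLS fits are assumptions; the dissertation's actual procedure works with lasso-regularized GLMs, de-biasing, and confidence distributions, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_blocks, n_per = 3, 5, 2000
beta_true = np.array([0.5, -1.0, 2.0])

estimates, covariances = [], []
for _ in range(n_blocks):
    # Each "sub-dataset" is analyzed on its own (the Map step).
    X = rng.normal(size=(n_per, p))
    y = X @ beta_true + rng.normal(size=n_per)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                      # OLS estimate for this block
    sigma2 = np.sum((y - X @ b) ** 2) / (n_per - p)
    estimates.append(b)
    covariances.append(sigma2 * XtX_inv)       # estimated covariance of b

# Combine step: inverse-variance (precision) weighting of the block estimates.
precisions = [np.linalg.inv(V) for V in covariances]
V_comb = np.linalg.inv(sum(precisions))
beta_comb = V_comb @ sum(P @ b for P, b in zip(precisions, estimates))
print("combined estimate:", np.round(beta_comb, 3))
```

    Because each block only ships back a coefficient vector and a covariance matrix, the combine step is cheap and fits naturally into a MapReduce-style workflow.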

    Random Forests Based Rule Learning And Feature Elimination

    Much research combines data from multiple sources in an effort to understand the underlying problems, and it is important to identify and interpret the most relevant information from these sources. It is therefore beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretability while preserving prediction performance. We propose an efficient approach, combining rule extraction and feature elimination, based on 1-norm regularized random forests. This approach simultaneously extracts a small number of rules generated by random forests and selects important features. To evaluate the approach, we have applied it to several drug activity prediction data sets, microarray data sets, a seacoast chemical sensors data set, a Stockori flowering time data set, and three data sets from the UCI repository. The approach performs well compared to state-of-the-art prediction algorithms such as random forests in terms of predictive performance, and it generates only a small number of decision rules. Some of the extracted decision rules are significant for the problem being studied. The approach demonstrates high potential for both prediction performance and interpretation in real applications.
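    The interplay of forest-generated rules and a 1-norm penalty can be illustrated with a short scikit-learn sketch: each tree leaf is treated as a candidate rule, and an L1-regularized logistic regression keeps only a few of them. This is a simplified stand-in for the proposed approach; the synthetic dataset, the leaf-indicator encoding, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Grow a small forest; each leaf reached by a sample acts as a candidate rule.
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X, y)
leaves = forest.apply(X)                      # (n_samples, n_trees) leaf ids

# One-hot encode leaf membership so each rule becomes a binary feature.
enc = OneHotEncoder(handle_unknown="ignore")
R = enc.fit_transform(leaves)

# 1-norm regularization keeps only a small subset of the candidate rules.
selector = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector.fit(R, y)
kept = np.flatnonzero(selector.coef_[0])
print(f"{R.shape[1]} candidate rules, {kept.size} kept after L1 selection")
```

    The surviving coefficients point back to specific root-to-leaf paths, which can then be read off as human-interpretable decision rules.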

    On intelligible multimodal visual analysis

    Analyzing data is becoming an important skill in an increasingly digital world. Yet many users face knowledge barriers that prevent them from independently conducting their data analysis. To tear down some of these barriers, multimodal interaction for visual analysis has been proposed. Multimodal interaction through speech and touch enables not only experts but also novice users to effortlessly interact with this kind of technology. However, current approaches do not take user differences into account; in fact, whether visual analysis is intelligible ultimately depends on the user. To close this research gap, this dissertation explores how multimodal visual analysis can be personalized, taking a holistic view. First, an intelligible task space of visual analysis tasks is defined by considering personalization potentials. This task space provides an initial basis for understanding how effective personalization in visual analysis can be approached. Second, empirical analyses of speech commands in visual analysis, as well as of visualizations used in scientific publications, reveal further patterns and structures. These behavior-based findings help to better understand expectations towards multimodal visual analysis. Third, a technical prototype is designed based on the previous findings, enriching the visual analysis with a persistent dialogue and with transparency of the underlying computations; the conducted user studies show not only advantages but also underline the relevance of considering the user's characteristics. Finally, both communication channels, visualizations and dialogue, are personalized. Leveraging linguistic theory and reinforcement learning, the results highlight a positive effect of adjusting to the user; especially when the user's knowledge is exceeded, personalization helps to improve the user experience. Overall, this dissertation not only confirms the importance of considering the user's characteristics in multimodal visual analysis, but also provides insights into how an intelligible analysis can be achieved. By understanding the use of input modalities, a system can focus on the user's needs; by understanding preferences regarding the output modalities, the system can better adapt to the user. Combining both directions improves the user experience and contributes towards an intelligible multimodal visual analysis.