A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition
Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, because the overall accuracy is usually driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve the predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques: oversampling, under-sampling, and synthetic minority over-sampling (SMOTE). We paired these with four popular classification methods: logistic regression, decision trees, neural networks, and support vector machines. We used a large, feature-rich institutional student data set (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
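To make the synthetic over-sampling idea concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the study's implementation: the function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for illustration. Each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic minority points (illustrative sketch only).

    Each new point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two features each.
dropouts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
extra = smote(dropouts, n_new=6, k=2, seed=1)
print(len(dropouts) + len(extra), "minority samples after over-sampling")
```

In practice the synthetic points would be added to the training folds only, so that the holdout data keeps its original class distribution.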
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much
attention from researchers in the fields of multigrid methods and graph analysis.
Many optimization techniques have been developed for specific application fields
and computing architectures over the decades. The objective of this paper is to
provide a structured and comprehensive overview of the research on SpGEMM.
Existing optimization techniques have been grouped into different categories
based on their target problems and architectures. Covered topics include SpGEMM
applications, size prediction of result matrix, matrix partitioning and load
balancing, result accumulating, and target architecture-oriented optimization.
The rationales of different algorithms in each category are analyzed, and a
wide range of SpGEMM algorithms are summarized. This survey reviews
the progress and research status of SpGEMM optimization from
1977 to 2019. In addition, an experimental comparative study of
existing implementations on CPU and GPU is presented. Based on our findings, we
highlight future research directions and how future studies can leverage our
findings to encourage better design and implementation.
Comment: 19 pages, 11 figures, 2 tables, 4 algorithm
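Two of the topics listed in this abstract, result accumulation and size prediction of the result matrix, can be illustrated with a small row-wise sketch. It is not taken from the survey: the dict-of-dicts storage, the function names `spgemm` and `upper_bound_nnz`, and the example matrices are assumptions chosen for brevity.

```python
def spgemm(a, b):
    """Multiply sparse matrices stored as {row: {col: value}} dicts.

    Row i of the result is built by scanning the nonzeros a[i][k] and
    merging the scaled rows of b into a per-row sparse accumulator.
    """
    c = {}
    for i, a_row in a.items():
        acc = {}  # sparse accumulator for row i of the result
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c[i] = acc
    return c


def upper_bound_nnz(a, b):
    """Upper bound on nonzeros per result row: one per scalar product."""
    return {
        i: sum(len(b.get(k, {})) for k in a_row) for i, a_row in a.items()
    }


a = {0: {0: 1, 2: 2}, 1: {1: 3}}
b = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
print(spgemm(a, b))           # rows of the product
print(upper_bound_nnz(a, b))  # pessimistic per-row size estimate
```

In this example row 0 of the product holds a single nonzero although the bound is 2, because both partial products land in the same column. That gap between the bound and the true size is why the size of the result is estimated before memory is allocated.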
Convergence of Tomlin's HOTS algorithm
The HOTS algorithm uses the hyperlink structure of the web to compute a
vector of scores with which one can rank web pages. The HOTS vector is the
vector of the exponentials of the dual variables of an optimal flow problem
(the "temperature" of each page). The flow represents an optimal distribution
of web surfers on the web graph in the sense of entropy maximization.
In this paper, we prove the convergence of Tomlin's HOTS algorithm. We first
study a simplified version of the algorithm, which is a fixed point scaling
algorithm designed to solve the matrix balancing problem for nonnegative
irreducible matrices. The proof of convergence is general (nonlinear
Perron-Frobenius theory) and applies to a family of deformations of HOTS. Then,
we address the effective HOTS algorithm, designed by Tomlin for the ranking of
web pages. The model is a network entropy maximization problem generalizing
matrix balancing. We show that, under mild assumptions, the HOTS algorithm
converges with a linear convergence rate. The proof relies on a uniqueness
property of the fixed point and on the existence of a Lyapunov function.
We also show that the coordinate descent algorithm can be used to find the
ideal and effective HOTS vectors and we compare HOTS and coordinate descent on
fragments of the web graph. Our numerical experiments suggest that the
convergence rate of the HOTS algorithm may deteriorate when the size of the
input increases. We thus give a normalized version of HOTS with an
experimentally better convergence rate.
Comment: 21 page
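The matrix balancing problem mentioned in this abstract asks for a positive diagonal scaling D such that, in D A D^{-1}, the sum of row i equals the sum of column i for every i. The sketch below uses a simple coordinate-wise update for this problem. It is an illustration only, not Tomlin's HOTS iteration or the authors' code; the function names and the example matrix are assumptions, and the input is assumed to be a nonnegative irreducible matrix of size at least 2.

```python
import math


def balance(a, sweeps=500, tol=1e-12):
    """Find d > 0 so that B = diag(d) A diag(d)^-1 has equal row/column sums.

    Coordinate-wise sketch: for each index i, pick d[i] so that row i and
    column i of B have the same sum while the other entries of d are fixed.
    """
    n = len(a)
    d = [1.0] * n
    for _ in range(sweeps):
        change = 0.0
        for i in range(n):
            # Off-diagonal weight entering and leaving node i.
            inflow = sum(d[j] * a[j][i] for j in range(n) if j != i)
            outflow = sum(a[i][j] / d[j] for j in range(n) if j != i)
            new_di = math.sqrt(inflow / outflow)
            change = max(change, abs(new_di - d[i]))
            d[i] = new_di
        if change < tol:
            break
    return d


def row_and_column_sums(a, d):
    """Row and column sums of diag(d) A diag(d)^-1."""
    n = len(a)
    b = [[d[i] * a[i][j] / d[j] for j in range(n)] for i in range(n)]
    rows = [sum(b[i]) for i in range(n)]
    cols = [sum(b[i][j] for i in range(n)) for j in range(n)]
    return rows, cols


# Weighted 3-cycle: an irreducible nonnegative matrix.
a = [[0, 1, 0], [0, 0, 4], [2, 0, 0]]
d = balance(a)
print(row_and_column_sums(a, d))
```

The scaling is only determined up to a common positive factor, so the returned vector depends on the starting point.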
Towards Optimal Distributed Node Scheduling in a Multihop Wireless Network through Local Voting
In a multihop wireless network, it is crucial but challenging to schedule
transmissions in an efficient and fair manner. In this paper, a novel
distributed node scheduling algorithm, called Local Voting, is proposed. This
algorithm tries to semi-equalize the load (defined as the ratio of the queue
length over the number of allocated slots) through slot reallocation based on
local information exchange. The algorithm stems from the finding that the
shortest delivery time or delay is obtained when the load is semi-equalized
throughout the network. In addition, we prove that, with Local Voting, the
network system converges asymptotically towards the optimal scheduling.
Moreover, through extensive simulations, the performance of Local Voting is
further investigated in comparison with several representative scheduling
algorithms from the literature. Simulation results show that the proposed
algorithm achieves better performance than the other distributed algorithms in
terms of average delay, maximum delay, and fairness. Despite being distributed,
the performance of Local Voting is also found to be very close to a centralized
algorithm that is deemed to have the optimal performance.
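The load definition in this abstract (queue length divided by the number of allocated slots) and the idea of evening it out through slot reallocation between neighbours can be shown with a toy simulation. This is not the Local Voting protocol: the rule in `rebalance_step`, the sequential centralised loop, and the example network are assumptions, and interference and slot-conflict constraints are ignored entirely.

```python
def load(queue_len, slots):
    """Load as defined in the abstract: queue length over allocated slots."""
    return queue_len / slots


def rebalance_step(queues, slots, neighbours):
    """One toy round: a node may hand one slot to its most loaded neighbour.

    The slot is moved only if that lowers the larger of the two loads.
    """
    new_slots = dict(slots)
    for node in queues:
        if new_slots[node] <= 1:
            continue  # keep at least one slot so the load stays defined
        target = max(
            neighbours[node], key=lambda n: load(queues[n], new_slots[n])
        )
        before = max(
            load(queues[node], new_slots[node]),
            load(queues[target], new_slots[target]),
        )
        after = max(
            load(queues[node], new_slots[node] - 1),
            load(queues[target], new_slots[target] + 1),
        )
        if after < before:
            new_slots[node] -= 1
            new_slots[target] += 1
    return new_slots


# Hypothetical three-node network in which every node hears every other.
queues = {"a": 12, "b": 2, "c": 2}
slots = {"a": 2, "b": 3, "c": 3}
neighbours = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
for _ in range(2):
    slots = rebalance_step(queues, slots, neighbours)
    print(slots, {n: load(queues[n], slots[n]) for n in queues})
```

The total number of slots is conserved in every round; only their distribution changes.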
Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods
In this paper, we study matrix scaling and balancing, which are fundamental
problems in scientific computing, with a long line of work on them that dates
back to the 1960s. We provide algorithms for both these problems that, ignoring
logarithmic factors involving the dimension of the input matrix and the size of
its entries, both run in time $\widetilde{O}(m \log \kappa \log^2(1/\epsilon))$, where $\epsilon$ is the amount of error we are willing to
tolerate. Here, $\kappa$ represents the ratio between the largest and the
smallest entries of the optimal scalings. This implies that our algorithms run
in nearly-linear time whenever $\kappa$ is quasi-polynomial, which includes, in
particular, the case of strictly positive matrices. We complement our results
by providing a separate algorithm that uses an interior-point method and runs
in time $\widetilde{O}(m^{3/2} \log(1/\epsilon))$.
In order to establish these results, we develop a new second-order
optimization framework that enables us to treat both problems in a unified and
principled manner. This framework identifies a certain generalization of linear
system solving that we can use to efficiently minimize a broad class of
functions, which we call second-order robust. We then show that in the context
of the specific functions capturing matrix scaling and balancing, we can
leverage and generalize the work on Laplacian system solving to make the
algorithms obtained via this framework very efficient.
Comment: To appear in FOCS 201
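For readers unfamiliar with the matrix scaling problem itself, the sketch below uses the classical Sinkhorn-Knopp alternating normalization, which is a much simpler first-order baseline and not the box-constrained Newton or interior-point method of this paper. It assumes a square, strictly positive matrix; the function name `sinkhorn` and the parameters `sweeps` and `tol` (the error we are willing to tolerate) are illustrative choices.

```python
def sinkhorn(a, sweeps=1000, tol=1e-12):
    """Return x, y > 0 so that diag(x) A diag(y) is nearly doubly stochastic.

    Classical Sinkhorn-Knopp sketch for a square, strictly positive matrix:
    alternately rescale rows and columns until every row sums to 1 +/- tol.
    """
    n = len(a)
    x = [1.0] * n
    y = [1.0] * n
    for _ in range(sweeps):
        # Row step: make each row of diag(x) A diag(y) sum to 1.
        x = [1.0 / sum(a[i][j] * y[j] for j in range(n)) for i in range(n)]
        # Column step: make each column sum to 1 (slightly perturbs rows).
        y = [1.0 / sum(x[i] * a[i][j] for i in range(n)) for j in range(n)]
        row_err = max(
            abs(sum(x[i] * a[i][j] * y[j] for j in range(n)) - 1.0)
            for i in range(n)
        )
        if row_err < tol:
            break
    return x, y


a = [[1.0, 2.0], [3.0, 4.0]]
x, y = sinkhorn(a)
scaled = [[x[i] * a[i][j] * y[j] for j in range(2)] for i in range(2)]
print(scaled)
```

Each sweep touches every entry once, so the cost per sweep is linear in the number of entries; the number of sweeps needed is what more refined methods aim to reduce.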