Mathematics in the National Curriculum for Wales : Key Stages 2-4 = Mathemateg yng Nghwricwlwm Cenedlaethol Cymru : Cyfnodau Allweddol 2-4

Publication venue: Department for Children, Education, Lifelong Learning and Skills
Publication date: 01/01/2008

BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Author: Anderson M. R.
Bardenet R.
Bergstra J. S.
Bottou L.
Crankshaw D.
Derezinski M.
Drineas P.
Duchi J.
Feurer M.
Gittens A.
Le Q. V.
Lin C.-J.
Lucic M.
Maclaurin D.
Martens J.
Mozafari B.
Musco C.
Ogawa K.
Pedregosa F.
R
Recht B.
Salakhutdinov R.
Tieleman T.
Tipping M. E.
Weimer M.
Xing E.
Yang A. Y.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/12/2018

The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
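The sample-versus-full-data tradeoff described in this abstract can be illustrated with a small sketch. It is not BlinkML's algorithm and gives no probabilistic guarantee: it only fits the same maximum likelihood model (linear regression under Gaussian noise, solved in closed form) on a full synthetic dataset and on a uniform sample, then measures how far their predictions diverge. The data-generating line, the sample size, and the helper names fit_ols and max_prediction_gap are all assumptions made for illustration.

```python
import random


def fit_ols(points):
    """Closed-form least squares for y = slope * x + intercept.

    Under Gaussian noise this is the maximum likelihood estimate.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_prediction_gap(model_a, model_b, xs):
    """Largest absolute difference between two models' predictions on xs."""
    return max(
        abs((model_a[0] * x + model_a[1]) - (model_b[0] * x + model_b[1]))
        for x in xs
    )


# Synthetic data (an assumption): y = 2x + 1 plus unit Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(20000):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

full_model = fit_ols(data)                       # "full model"
approx_model = fit_ols(rng.sample(data, 1000))   # model trained on a 5% sample

# Compare predictions on a few probe points across the input range.
gap = max_prediction_gap(full_model, approx_model, [0.0, 2.5, 5.0, 7.5, 10.0])
print("full:", full_model, "approx:", approx_model, "max gap:", gap)
```

The sketch only reports the observed gap after the fact; a system like the one described would instead need to bound that gap before training on the full data.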

arXiv.org e-Print Archive

Crossref

Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models

Author: Anderson Linda
Baklanov Artem
Duer Alexander
Hanbury Allan
Lupu Mihai
Rekabsaz Navid
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

Volatility prediction, an essential concept in financial markets, has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In parallel to textual information, factual market data have been widely used as the mainstream approach to forecast market risk. We therefore study different fusion methods to combine text and market data resources. Our word embedding-based approach significantly outperforms state-of-the-art methods. In addition, we investigate the characteristics of the reports of the companies in different financial sectors.

arXiv.org e-Print Archive

Crossref

International Institute for Applied Systems Analysis (IIASA)

Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

Author: Tan Yufen
Publication venue: DigitalCommons@WayneState
Publication date: 01/01/2010

Research on similarity join techniques is a growing area of practical study, especially with the increasing electronic availability of vast amounts of digital data from more and more source systems. This research focuses on clustering-based pre-processing techniques to improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources remains a major challenge and a significant task in the digital information era. Dissimilar extracts may represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Discovering efficient and accurate approaches to determine the similarity of data objects or values is therefore of theoretical as well as practical significance. Semantic problems arise even with the concept of similarity itself, regarding its usage and foundation. Existing similarity join approaches often take a very specific view of similarity measures and pre-defined predicates, which reflects a narrow focus on the context of similarity for a given scenario. The predicates have been assumed to be a group of clustering [MSW 72] related attributes on the join. Identifying those entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focuses on string similarity joins, namely those based on the Levenshtein (edit) distance and q-grams. It proposes effective and efficient clustering-based pre-processing techniques that identify clustering-related predicates, based on either attribute value or data value, to improve existing similarity join techniques in enterprise data integration scenarios.
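The two string similarity measures named in this abstract can be sketched briefly. The code below is a generic illustration, not the thesis's method: it implements the standard dynamic program for Levenshtein distance, a q-gram Jaccard similarity (the padding characters '#' and '$' and the default q = 2 are assumptions), and a naive nested-loop join with a hypothetical max_dist threshold. The clustering-based pre-processing the abstract proposes is not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic program, one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """Set of q-grams of s, padded so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_jaccard(a, b, q=2):
    """Jaccard similarity of the two strings' q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def similarity_join(left, right, max_dist=1):
    """Naive nested-loop join: keep pairs within max_dist edits."""
    matches = []
    for l in left:
        for r in right:
            d = levenshtein(l, r)
            if d <= max_dist:
                matches.append((l, r, d))
    return matches


pairs = similarity_join(["Jon Smith"], ["John Smith", "Jane Doe"])
print(pairs)
print(qgram_jaccard("Jon Smith", "John Smith"))
```

The nested loop compares every pair, which is the cost that pre-processing steps such as clustering or q-gram filtering aim to reduce.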

Digital Commons@Wayne State University