849 research outputs found

    FlashProfile: A Framework for Synthesizing Data Profiles

    Get PDF
    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153153 tasks over 7575 large real datasets, we observe a median profiling time of only 0.7\sim\,0.7\,s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill.Comment: 28 pages, SPLASH (OOPSLA) 201

    Classification by feature partitioning

    Get PDF
    This paper presents a new form of exemplar-based learning, based on a representation scheme called jfaliirf parluinning, and a panitular implementation of this technique called CFF (for Classification by feature Partioning). Learning in CFP is accomplished by storing the objects separately in each (tenure dimension as disjoint sets of values called segments A segment is; expanded through generalization or specialized by dividing in into sub-segments. Cklassification is based on a weighted voting among the individual productions of the features, which are simply the class values of the segments corresponding to the values of a test instance fur each feature An empirical evaluation of CFP and its comparison with two other classification techniques, lhai consider each feature separately are given. © 1996 Kluwer Academic Publishers,

    Learning with feature partitions

    Get PDF
    Ankara : Department of Computer Engineering and Information Science and Institute of Engineering and Science, Bilkent Univ., 1993.Thesis (Master's) -- Bilkent University, 1993.Includes bibliographical references leaves 78-81This thesis presents a new methodology of learning from examples, based on feature partitioning. Classification by Feature Partitioning (CFP) is a particular implementation of this technique, which is an inductive, incremental, and supervised learning method. Learning in CFP is accomplished by storing the objects separately in each feature dimension as disjoint partitions of values. A partition, a basic unit of representation which is initially a point in the feature dimension, is expanded through generalization. The CFP algorithm specializes a partition by subdividing it into two subpartitions. Theoretical (with respect to PAC-model) and empirical evaluation of the CFP is presented and compared with some other similar techniques.Şirin, İzzetM.S

    Concept representation with overlapping feature intervals

    Get PDF
    This article presents a new form of exemplar-based learning method, based on overlapping feature intervals. In this model, a concept is represented by a collection of overlappling intervals for each feature and class. Classification with Overlapping Feature Intervals (COFI) is a particular implementation of this technique. In this incremental, inductive, and supervised learning method, the basic unit of the representation is an interval. The COFI algorithm learns the projections of the intervals in each feature dimension for each class. Initially, an interval is a point on a feature-class dimension; then it can be expanded through generalization. No specialization of intervals is done on feature-class dimensions by this algorithm. Classification in the COFI algorithm is based on a majority voting among the local predictions that are made individually by each feature. An evaluation of COFI and its comparison with similar other classification techniques is given

    Batch learning of disjoint feature intervals

    Get PDF
    Ankara : Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1996.Thesis (Master's) -- Bilkent University, 1996.Includes bibliographical references leaves 98-104.This thesis presents several learning algorithms for multi-concept descriptions in the form of disjoint feature intervals, called Feature Interval Learning algorithms (FIL). These algorithms are batch supervised inductive learning algorithms, and use feature projections of the training instances for the representcition of the classification knowledge induced. These projections can be generalized into disjoint feature intervals. Therefore, the concept description learned is a set of disjoint intervals separately for each feature. The classification of an unseen instance is based on the weighted majority voting among the local predictions of features. In order to handle noisy instances, several extensions are developed by placing weights to intervals rather than features. Empirical evaluation of the FIL algorithms is presented and compared with some other similar classification algorithms. Although the FIL algorithms achieve comparable accuracies with other algorithms, their average running times are much more less than the others. This thesis also presents a new adaptation of the well-known /s-NN classification algorithm to the feature projections approach, called A:-NNFP for k-Nearest Neighbor on Feature Projections, based on a majority voting on individual classifications made by the projections of the training set on each feature and compares with the /:-NN algorithm on some real-world and cirtificial datasets.Akkuş, AynurM.S

    Learning from imbalanced data in face re-identification using ensembles of classifiers

    Get PDF
    Face re-identification is a video surveillance application where systems for video-to-video face recognition are designed using faces of individuals captured from video sequences, and seek to recognize them when they appear in archived or live videos captured over a network of video cameras. Video-based face recognition applications encounter challenges due to variations in capture conditions such as pose, illumination etc. Other challenges in this application are twofold; 1) the imbalanced data distributions between the face captures of the individuals to be re-identified and those of other individuals 2) varying degree of imbalance during operations w.r.t. the design data. Learning from imbalanced data is challenging in general due in part to the bias of performance in most two-class classification systems towards correct classification of the majority (negative, or non-target) class (face images/frames captured from the individuals in not to be re-identified) better than the minority (positive, or target) class (face images/frames captured from the individual to be re-identified) because most two-class classification systems are intended to be used under balanced data condition. Several techniques have been proposed in the literature to learn from imbalanced data that either use data-level techniques to rebalance data (by under-sampling the majority class, up-sampling the minority class, or both) for training classifiers or use algorithm-level methods to guide the learning process (with or without cost sensitive approaches) such that the bias of performance towards correct classification of the majority class is neutralized. Ensemble techniques such as Bagging and Boosting algorithms have been shown to efficiently utilize these methods to address imbalance. However, there are issues faced by these techniques in the literature: (1) some informative samples may be neglected by random under-sampling and adding synthetic positive samples through upsampling adds to training complexity, (2) cost factors must be pre-known or found, (3) classification systems are often optimized and compared using performance measurements (like accuracy) that are unsuitable for imbalance problem; (4) most learning algorithms are designed and tested on a fixed imbalance level of data, which may differ from operational scenarios; The objective of this thesis is to design specialized classifier ensembles to address the issue of imbalance in the face re-identification application and as sub-goals avoiding the abovementioned issues faced in the literature. In addition achieving an efficient classifier ensemble requires a learning algorithm to design and combine component classifiers that hold suitable diversity-accuracy trade off. To reach the objective of the thesis, four major contributions are made that are presented in three chapters summarized in the following. In Chapter 3, a new application-based sampling method is proposed to group samples for under-sampling in order to improve diversity-accuracy trade-off between classifiers of the ensemble. The proposed sampling method takes the advantage of the fact that in face re-identification applications, facial regions of a same person appearing in a camera field of view may be regrouped based on their trajectories found by face tracker. A partitional Bagging ensemble method is proposed that accounts for possible variations in imbalance level of the operational data by combining classifiers that are trained on different imbalance levels. In this method, all samples are used for training classifiers and information loss is therefore avoided. In Chapter 4, a new ensemble learning algorithm called Progressive Boosting (PBoost) is proposed that progressively inserts uncorrelated groups of samples into a Boosting procedure to avoid loosing information while generating a diverse pool of classifiers. From one iteration to the next, the PBoost algorithm accumulates these uncorrelated groups of samples into a set that grows gradually in size and imbalance. This algorithm is more sophisticated than the one proposed in Chapter 3 because instead of training the base classifiers on this set, the base classifiers are trained on balanced subsets sampled from this set and validated on the whole set. Therefore, the base classifiers are more accurate while the robustness to imbalance is not jeopardized. In addition, the sample selection is based on the weights that are assigned to samples which correspond to their importance. In addition, the computation complexity of PBoost is lower than Boosting ensemble techniques in the literature for learning from imbalanced data because not all of the base classifiers are validated on all negative samples. A new loss factor is also proposed to be used in PBoost to avoid biasing performance towards the negative class. Using this loss factor, the weight update of samples and classifier contribution in final predictions are set according to the ability of classifiers to recognize both classes. In comparing the performance of the classifier systems in Chapter 3 and 4, a need is faced for an evaluation space that compares classifiers in terms of a suitable performance metric over all of their decision thresholds, different imbalance levels of test data, and different preference between classes. The F-measure is often used to evaluate two-class classifiers on imbalanced data, and no global evaluation space was available in the literature for this measure. Therefore, in Chapter 5, a new global evaluation space for the F-measure is proposed that is analogous to the cost curves for expected cost. In this space, a classifier is represented as a curve that shows its performance over all of its decision thresholds and a range of possible imbalance levels for the desired preference of true positive rate to precision. These properties are missing in ROC and precision-recall spaces. This space also allows us to empirically improve the performance of specialized ensemble learning methods for imbalance under a given operating condition. Through a validation, the base classifiers are combined using a modified version of the iterative Boolean combination algorithm such that the selection criterion in this algorithm is replaced by F-measure instead of AUC, and the combination is carried out for each operating condition. The proposed approaches in this thesis were validated and compared using synthetic data and videos from the Faces In Action, and COX datasets that emulate face re-identification applications. Results show that the proposed techniques outperforms state of the art techniques over different levels of imbalance and overlap between classes

    Classification with overlapping feature intervals

    Get PDF
    Ankara : Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1995.Thesis (Master's) -- -Bilkent University, 1995.Includes bibliographical references leaves 83-88.This thesis presents a new form of exemplar-based learning method, based on overlapping feature intervals. Classification with Overlapping Feature Intervals (COFI) is the particular implementation of this technique. In this incremental, inductive and supervised learning method, the basic unit of the representation is an interval. The COFI algorithm learns the projections of the intervals in each class dimension for each feature. An interval is initially a point on a class dimension, then it can be expanded through generalization. No specialization of intervals is done on class dimensions by this algorithm. Classification in the COFI algorithm is based on a majority voting among the local predictions that are made individually by each feature.Koç, Hakime ÜnsalM.S
    corecore