5 research outputs found

    Contextual Ranking of Database Query Results


    Feature Partitioning for the Co-Training Setting

    Supervised learning algorithms rely on the availability of labeled data. Labeled data is either scarce or involves substantial human effort in the labeling process. These two factors, along with the abundance of unlabeled data, have spurred research initiatives that exploit unlabeled data to boost supervised learning. Learning algorithms of this kind, which use unlabeled data alongside a small set of labeled data, are known as semi-supervised learning algorithms. Data characteristics, such as the presence of a generative model, provide the foundation for applying these learning algorithms. Co-training is one such algorithm; it leverages the existence of two redundant views of a data instance. Based on these two views, the co-training algorithm trains two classifiers using the labeled data. The small set of labeled data results in a pair of weak classifiers. With the help of the unlabeled data, the two classifiers alternately boost each other to achieve a high-accuracy classifier. The conditions that the co-training algorithm imposes on the data characteristics restrict its application to data that possesses a natural split of the feature set. In this thesis we study the co-training setting and propose to overcome the above-mentioned constraint by manufacturing feature splits. We pose and investigate the following questions: 1. Can a feature split be constructed for a dataset such that the co-training algorithm can be applied to it? 2. If a feature split can be engineered, would splitting the features into more than two partitions give a better classifier? In essence, does moving from co-training (2 classifiers) to k-training (k classifiers) help? 3. Is there an optimal number of views for a dataset such that k-training leads to an optimal classifier? The task of obtaining feature splits is approached by modeling the problem as a graph partitioning problem. Experiments are conducted on a breadth of text datasets. Results of k-training using constructed feature sets are compared with those of the expectation-maximization algorithm, which has been successful in a semi-supervised setting.
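
    The co-training loop described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the nearest-centroid classifier, the margin-based confidence score, and the names `centroid_fit`, `centroid_predict`, and `co_train` are all assumptions chosen to keep the example self-contained. It also assumes the two views are already given; the graph-partitioning step that constructs them is not shown.

```python
import math
from collections import Counter


def centroid_fit(X, y):
    """Train a nearest-centroid classifier: one mean vector per class."""
    counts = Counter(y)
    sums = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(model, x):
    """Return (label, confidence). Confidence is the distance margin
    between the nearest and second-nearest centroid (an assumption)."""
    dists = sorted((math.dist(x, c), label) for label, c in model.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view_a, view_b, labels, unlabeled_a, unlabeled_b,
             rounds=5, per_round=1):
    """Simplified co-training over two feature views.

    view_a / view_b hold the two views of the labeled instances;
    unlabeled_a / unlabeled_b hold the same two views of the unlabeled ones.
    """
    La, Lb, y = list(view_a), list(view_b), list(labels)
    pool = list(range(len(unlabeled_a)))  # indices still unlabeled
    for _ in range(rounds):
        if not pool:
            break
        model_a = centroid_fit(La, y)
        model_b = centroid_fit(Lb, y)
        # Each classifier in turn labels the examples it is most sure of;
        # those examples then join the training set shared by both views.
        for model, view in ((model_a, unlabeled_a), (model_b, unlabeled_b)):
            scored = sorted(
                ((centroid_predict(model, view[i]), i) for i in pool),
                key=lambda item: item[0][1],
                reverse=True,
            )
            for (label, _), i in scored[:per_round]:
                La.append(unlabeled_a[i])
                Lb.append(unlabeled_b[i])
                y.append(label)
                pool.remove(i)
    return centroid_fit(La, y), centroid_fit(Lb, y)
```

    Extending this to k-training would mean keeping k views and k classifiers in the inner loop instead of two; how the views are chosen is the question the thesis investigates.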

    Structural advances for pattern discovery in multi-relational databases

    With ever-growing storage needs and the drift towards very large relational storage settings, multi-relational data mining has become a prominent and pertinent field for discovering unique and interesting relational patterns. As a consequence, a whole suite of multi-relational data mining techniques is being developed. These techniques may either be extensions of existing single-table mining techniques or be developed from scratch. For the traditionalists, single-table mining algorithms can be applied to multi-relational settings by making inelegant and time-consuming joins of all target relations. However, complex relational patterns cannot be expressed in a single-table format and thus cannot be discovered this way. This work presents a new multi-relational frequent pattern mining algorithm termed Multi-Relational Frequent Pattern Growth (MRFP Growth). MRFP Growth is capable of mining multiple relations, linked with referential integrity, for frequent patterns that satisfy a user-specified support threshold. Empirical results on MRFP Growth performance and its comparison with state-of-the-art multi-relational data mining algorithms such as WARMR and Decentralized Apriori are discussed at length. MRFP Growth outperforms these two techniques in the number of patterns generated and in speed. The realm of multi-relational clustering is also explored in this thesis. A multi-Relational Item Clustering approach based on Hypergraphs (RICH) is proposed. Experimentally, RICH combined with MRFP Growth proves to be a competitive approach for clustering multi-relational data. The performance and quality of clusters generated by RICH are compared with other clustering algorithms. Finally, the thesis demonstrates the applied utility of the theoretical implications of the above-mentioned algorithms in an application framework for auto-annotation of images in an image database. The system is called CoMMA, which stands for Combining Multi-relational Multimedia for Associations.
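
    To make the notion of a support threshold over linked relations concrete, here is a brute-force sketch. It is not MRFP Growth, whose data structures the abstract does not describe; it simply counts, for each key of a target relation, which items from the linked relations co-occur. The function name `frequent_patterns`, its parameters, and the table layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_patterns(primary_keys, relations, min_support, max_size=2):
    """Naive frequent-pattern counting across relations.

    primary_keys: set of ids in the target relation.
    relations: {relation_name: [(foreign_key, item), ...]}, each linked
               to the target relation through its foreign key.
    Returns {pattern: support} for patterns meeting min_support, where
    support is the fraction of target ids that contain the pattern.
    """
    items_per_key = defaultdict(set)
    for name, rows in relations.items():
        for fk, item in rows:
            # Respect referential integrity: skip rows whose key
            # does not exist in the target relation.
            if fk in primary_keys:
                items_per_key[fk].add((name, item))

    counts = defaultdict(int)
    for items in items_per_key.values():
        for size in range(1, max_size + 1):
            for pattern in combinations(sorted(items), size):
                counts[pattern] += 1

    total = len(primary_keys)
    return {p: c / total for p, c in counts.items() if c / total >= min_support}
```

    Enumerating every combination grows exponentially with pattern size, which is the kind of cost that pattern-growth methods are designed to avoid.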