2 research outputs found

    Logico-linguistic semantic representation of documents

    The knowledge behind the gigantic pool of data remains largely unextracted. Techniques such as ontology design, RDF representations, hypernym extraction, etc. have been used to represent the knowledge. However, the areas of logic (FOPL) and linguistics (semantics) have not been explored in depth for this purpose. Search engines struggle to extract specific answers to queries because of the absence of structured domain knowledge. The current paper deals with the design of a formalism to extract and represent knowledge from the data in a consistent format. Combining logic and linguistics greatly eases knowledge translation from natural language and increases its precision. The results clearly indicate the effectiveness of the knowledge extraction and representation methodology developed, providing intelligence to machines for efficient analysis of data. The methodology helps machines to produce precise results in an efficient manner.

    Greedy Representative Selection for Unsupervised Data Analysis

    In recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need to develop fast and accurate algorithms to discover useful information hidden in the data. This need is even more acute for unsupervised data, which lacks information about the categories of different instances. This dissertation addresses a crucial problem in unsupervised data analysis, which is the selection of representative instances and/or features from the data. This problem can be generally defined as the selection of the most representative columns of a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be directly used for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as outlined below. First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the subset of selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time- and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms in comparison to the state-of-the-art methods for column subset selection. Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. The algorithm can also calculate a Nyström approximation for a large kernel matrix based on the subset of selected columns. In comparison to different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimum overhead in run time. Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance. Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed. The algorithm is an application of the greedy column subset selection method presented in this dissertation. Similarly, the features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task. Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and it presents a unified framework which allows the use of the greedy algorithms presented in this dissertation to solve different related problems.
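To make the greedy selection criterion concrete, the following is a minimal, naive sketch of greedy column subset selection in plain Python. It is not the dissertation's algorithm: the abstract describes a recursive formula for the reconstruction error that is not reproduced here, and this version instead recomputes every gain from an explicitly deflated residual matrix. The function name `greedy_css`, the helper `dot`, and the tolerance `tol` are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def greedy_css(A, k, tol=1e-12):
    """Naive greedy column subset selection (illustrative sketch only).

    A is a list of rows. Returns (selected column indices, squared
    Frobenius reconstruction error of A from the selected columns).
    """
    n_rows, n_cols = len(A), len(A[0])
    # Store the residual column-wise: E[j] is column j of the residual matrix.
    E = [[A[i][j] for i in range(n_rows)] for j in range(n_cols)]
    selected = []
    for _ in range(k):
        best_j, best_gain = None, tol
        for j in range(n_cols):
            norm_sq = dot(E[j], E[j])
            if norm_sq <= tol:
                continue  # column is already (numerically) in the selected span
            # Drop in squared error if column j is added:
            # ||E^T e_j||^2 / ||e_j||^2
            gain = sum(dot(E[j], E[c]) ** 2 for c in range(n_cols)) / norm_sq
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # nothing left to explain
        selected.append(best_j)
        e = E[best_j][:]
        norm_sq = dot(e, e)
        # Deflate: remove the component along e from every residual column.
        for c in range(n_cols):
            coef = dot(e, E[c]) / norm_sq
            E[c] = [x - coef * y for x, y in zip(E[c], e)]
    error = sum(x * x for col in E for x in col)
    return selected, error


# Columns: (1,0,0), (0,1,0), (1,1,0); the third best summarizes the other two.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0]]
print(greedy_css(A, 1))
print(greedy_css(A, 2))
```

Each iteration of this sketch costs on the order of (rows × columns²) operations because all gains are recomputed from scratch; avoiding that kind of recomputation is the purpose of a recursive error formula such as the one the abstract mentions.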