207 research outputs found

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Get PDF
    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on the said page, failing which it is considered irrelevant. Similarity is not considered even if a synonym of a term co-occurs on a web page. To resolve this challenge, this paper proposes a new methodology that integrates the Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding and the Recurrent Neural Network (RNN).The cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as an input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58

    Advanced Data Mining Techniques for Compound Objects

    Get PDF
    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the process of KDD is data mining which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steady growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space. The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. Therefore, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications. The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is the data mining in CAD parts, e.g. the use of hierarchic clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares them by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to perform several distance based data mining algorithms like the hierarchical clustering algorithm OPTICS to derive product hierarchies. The second important application is the classification and search for complete websites in the world wide web (WWW). A website is a set of HTML-documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods modelling websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing that maps single HTML-documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of HTML-documents and increases the accuracy of website retrieval. The second part of the thesis is concerned with the data mining in multi-represented objects. An important example application for this kind of complex objects are proteins that can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method for multi-represented objects is introduced that is based on the density based clustering algorithm DBSCAN. This method uses all representations that are provided to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. To map new objects into an ontology a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines

    NASA Tech Briefs, December 2011

    Get PDF
    Topics covered include: 1) SNE Industrial Fieldbus Interface; 2) Composite Thermal Switch; 3) XMOS XC-2 Development Board for Mechanical Control and Data Collection; 4) Receiver Gain Modulation Circuit; 5) NEXUS Scalable and Distributed Next-Generation Avionics Bus for Space Missions; 6) Digital Interface Board to Control Phase and Amplitude of Four Channels; 7) CoNNeCT Baseband Processor Module; 8) Cryogenic 160-GHz MMIC Heterodyne Receiver Module; 9) Ka-Band, Multi-Gigabit-Per-Second Transceiver; 10) All-Solid-State 2.45-to-2.78-THz Source; 11) Onboard Interferometric SAR Processor for the Ka-Band Radar Interferometer (KaRIn); 12) Space Environments Testbed; 13) High-Performance 3D Articulated Robot Display; 14) Athena; 15) In Situ Surface Characterization; 16) Ndarts; 17) Cryo-Etched Black Silicon for Use as Optical Black; 18) Advanced CO2 Removal and Reduction System; 19) Correcting Thermal Deformations in an Active Composite Reflector; 20) Umbilical Deployment Device; 21) Space Mirror Alignment System; 22) Thermionic Power Cell To Harness Heat Energies for Geothermal Applications; 23) Graph Theory Roots of Spatial Operators for Kinematics and Dynamics; 24) Spacesuit Soft Upper Torso Sizing Systems; 25) Radiation Protection Using Single-Wall Carbon Nanotube Derivatives; 26) PMA-PhyloChip DNA Microarray to Elucidate Viable Microbial Community Structure; 27) Lidar Luminance Quantizer; 28) Distributed Capacitive Sensor for Sample Mass Measurement; 29) Base Flow Model Validation; 30) Minimum Landing Error Powered-Descent Guidance for Planetary Missions; 31) Framework for Integrating Science Data Processing Algorithms Into Process Control Systems; 32) Time Synchronization and Distribution Mechanisms for Space Networks; 33) Local Estimators for Spacecraft Formation Flying; 34) Software-Defined Radio for Space-to-Space Communications; 35) Reflective Occultation Mask for Evaluation of Occulter Designs for Planet Finding; and 36) Molecular Adsorber Coatin

    Seventh Biennial Report : June 2003 - March 2005

    No full text

    Graph-Based Weakly-Supervised Methods for Information Extraction & Integration

    Get PDF
    The variety and complexity of potentially-related data resources available for querying --- webpages, databases, data warehouses --- has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most of the current IE and II methods, which can potentially be applied to the pro blem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration. Within IE, we focus on the problem of assigning semantic classes to entities. First we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve performance of state-of-the-art discriminative taggers. The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real world applications. We use Adsorption, a graph based label propagation algorithm, to significantly increase recall of an initial high-precision, low-recall pattern-based extractor by combining evidences from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. Additionally, we also show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases. Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. We also demonstrate that Q\u27s learning strategy is highly effective in combining the outputs of ``black box\u27\u27 schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema which has been necessary for most previous systems

    11th SC@RUG 2014 proceedings:Student Colloquium 2013-2014

    Get PDF

    11th SC@RUG 2014 proceedings:Student Colloquium 2013-2014

    Get PDF

    11th SC@RUG 2014 proceedings:Student Colloquium 2013-2014

    Get PDF

    11th SC@RUG 2014 proceedings:Student Colloquium 2013-2014

    Get PDF
    • …
    corecore