207 research outputs found
A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network
Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on the said page, failing which it is considered irrelevant. Similarity is not considered even if a synonym of a term co-occurs on a web page. To resolve this challenge, this paper proposes a new methodology that integrates the Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding and the Recurrent Neural Network (RNN). The cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as an input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58.
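The cosine-similarity feature and the two evaluation measures named above can be illustrated with a minimal sketch. The abstract does not define hr or ir, so the definitions used here (relevant pages over downloaded pages, and its complement) are an assumption, chosen because the reported values 0.42 and 0.58 sum to 1. The three-dimensional embeddings are hypothetical toy values, not output of the A-SGNS model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def harvest_rate(relevant_pages, downloaded_pages):
    """Assumed definition: share of downloaded pages judged relevant."""
    if downloaded_pages == 0:
        return 0.0
    return relevant_pages / downloaded_pages

def irrelevance_ratio(relevant_pages, downloaded_pages):
    """Assumed definition: share of downloaded pages judged irrelevant."""
    if downloaded_pages == 0:
        return 0.0
    return (downloaded_pages - relevant_pages) / downloaded_pages

# Hypothetical toy embeddings: a synonym pair and an unrelated word.
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_synonym = cosine_similarity(embeddings["car"], embeddings["automobile"])
sim_unrelated = cosine_similarity(embeddings["car"], embeddings["banana"])
```

With embeddings, a page that mentions only "automobile" still scores close to the topic word "car", which an exact-term TF-IDF match would miss.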
Advanced Data Mining Techniques for Compound Objects
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the process of KDD is data mining, which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space.
The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. Therefore, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications.
The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is the data mining in CAD parts, e.g. the use of hierarchic clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares them by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to perform several distance based data mining algorithms like the hierarchical clustering algorithm OPTICS to derive product hierarchies.
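To make the idea of comparing parts as sets of feature vectors concrete, the following is a minimal sketch of a distance between two multi-instance objects. The abstract does not name the metric it uses, so the symmetric Hausdorff distance below is a stand-in chosen for illustration only; the function names and the toy parts are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hausdorff_distance(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty sets of vectors.

    Stand-in for the (unnamed) multi-instance metric in the abstract.
    """
    if not set_a or not set_b:
        raise ValueError("both multi-instance objects need at least one vector")

    def directed(xs, ys):
        # Largest distance from any vector in xs to its nearest vector in ys.
        return max(min(euclidean(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))

# Two hypothetical parts, each decomposed into 2-D feature vectors.
part_a = [(0.0, 0.0), (1.0, 0.0)]
part_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]

distance = hausdorff_distance(part_a, part_b)
```

Any distance of this kind can be plugged into a distance-based algorithm such as OPTICS, since those algorithms only need pairwise distances rather than a fixed-length vector per object.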
The second important application is the classification and search for complete websites in the world wide web (WWW). A website is a set of HTML-documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods modelling websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing that maps single HTML-documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of HTML-documents and increases the accuracy of website retrieval.
The second part of the thesis is concerned with data mining in multi-represented objects. An important example of this kind of complex object is a protein, which can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all representations that are provided to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. To map new objects into an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines.
NASA Tech Briefs, December 2011
Topics covered include: 1) SNE Industrial Fieldbus Interface; 2) Composite Thermal Switch; 3) XMOS XC-2 Development Board for Mechanical Control and Data Collection; 4) Receiver Gain Modulation Circuit; 5) NEXUS Scalable and Distributed Next-Generation Avionics Bus for Space Missions; 6) Digital Interface Board to Control Phase and Amplitude of Four Channels; 7) CoNNeCT Baseband Processor Module; 8) Cryogenic 160-GHz MMIC Heterodyne Receiver Module; 9) Ka-Band, Multi-Gigabit-Per-Second Transceiver; 10) All-Solid-State 2.45-to-2.78-THz Source; 11) Onboard Interferometric SAR Processor for the Ka-Band Radar Interferometer (KaRIn); 12) Space Environments Testbed; 13) High-Performance 3D Articulated Robot Display; 14) Athena; 15) In Situ Surface Characterization; 16) Ndarts; 17) Cryo-Etched Black Silicon for Use as Optical Black; 18) Advanced CO2 Removal and Reduction System; 19) Correcting Thermal Deformations in an Active Composite Reflector; 20) Umbilical Deployment Device; 21) Space Mirror Alignment System; 22) Thermionic Power Cell To Harness Heat Energies for Geothermal Applications; 23) Graph Theory Roots of Spatial Operators for Kinematics and Dynamics; 24) Spacesuit Soft Upper Torso Sizing Systems; 25) Radiation Protection Using Single-Wall Carbon Nanotube Derivatives; 26) PMA-PhyloChip DNA Microarray to Elucidate Viable Microbial Community Structure; 27) Lidar Luminance Quantizer; 28) Distributed Capacitive Sensor for Sample Mass Measurement; 29) Base Flow Model Validation; 30) Minimum Landing Error Powered-Descent Guidance for Planetary Missions; 31) Framework for Integrating Science Data Processing Algorithms Into Process Control Systems; 32) Time Synchronization and Distribution Mechanisms for Space Networks; 33) Local Estimators for Spacecraft Formation Flying; 34) Software-Defined Radio for Space-to-Space Communications; 35) Reflective Occultation Mask for Evaluation of Occulter Designs for Planet Finding; and 36) Molecular Adsorber Coatin
Laboratory Directed Research and Development Program FY 2004 Annual Report
The Oak Ridge National Laboratory (ORNL) Laboratory Directed Research and Development (LDRD) Program reports its status to the U.S. Department of Energy (DOE) in March of each year. The program operates under the authority of DOE Order 413.2A, 'Laboratory Directed Research and Development' (January 8, 2001), which establishes DOE's requirements for the program while providing the Laboratory Director broad flexibility for program implementation. LDRD funds are obtained through a charge to all Laboratory programs. This report describes all ORNL LDRD research activities supported during FY 2004 and includes final reports for completed projects and shorter progress reports for projects that were active, but not completed, during this period. The FY 2004 ORNL LDRD Self-Assessment (ORNL/PPA-2005/2) provides financial data about the FY 2004 projects and an internal evaluation of the program's management process. ORNL is a DOE multiprogram science, technology, and energy laboratory with distinctive capabilities in materials science and engineering, neutron science and technology, energy production and end-use technologies, biological and environmental science, and scientific computing. With these capabilities ORNL conducts basic and applied research and development (R&D) to support DOE's overarching national security mission, which encompasses science, energy resources, environmental quality, and national nuclear security. As a national resource, the Laboratory also applies its capabilities and skills to the specific needs of other federal agencies and customers through the DOE Work For Others (WFO) program. Information about the Laboratory and its programs is available on the Internet at <http://www.ornl.gov/>. 
LDRD is a relatively small but vital DOE program that allows ORNL, as well as other multiprogram DOE laboratories, to select a limited number of R&D projects for the purpose of: (1) maintaining the scientific and technical vitality of the Laboratory; (2) enhancing the Laboratory's ability to address future DOE missions; (3) fostering creativity and stimulating exploration of forefront science and technology; (4) serving as a proving ground for new research; and (5) supporting high-risk, potentially high-value R&D. Through LDRD the Laboratory is able to improve its distinctive capabilities and enhance its ability to conduct cutting-edge R&D for its DOE and WFO sponsors. To meet the LDRD objectives and fulfill the particular needs of the Laboratory, ORNL has established a program with two components: the Director's R&D Fund and the Seed Money Fund. As outlined in Table 1, these two funds are complementary. The Director's R&D Fund develops new capabilities in support of the Laboratory initiatives, while the Seed Money Fund is open to all innovative ideas that have the potential for enhancing the Laboratory's core scientific and technical competencies. Provision for multiple routes of access to ORNL LDRD funds maximizes the likelihood that novel and seminal ideas with scientific and technological merit will be recognized and supported
Graph-Based Weakly-Supervised Methods for Information Extraction & Integration
The variety and complexity of potentially-related data resources available for querying --- webpages, databases, data warehouses --- has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most of the current IE and II methods, which can potentially be applied to the problem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. In this thesis, we develop techniques that generalize from limited human input, via weakly-supervised methods for IE and II. In particular, we argue that graph-based representation of data and learning over such graphs can result in effective and scalable methods for large-scale Information Extraction and Integration. Within IE, we focus on the problem of assigning semantic classes to entities. First we develop a context pattern induction method to extend small initial entity lists of various semantic classes. We also demonstrate that features derived from such extended entity lists can significantly improve performance of state-of-the-art discriminative taggers.
The output of pattern-based class-instance extractors is often high-precision and low-recall in nature, which is inadequate for many real-world applications. We use Adsorption, a graph-based label propagation algorithm, to significantly increase recall of an initial high-precision, low-recall pattern-based extractor by combining evidence from unstructured and structured text corpora. Building on Adsorption, we propose a new label propagation algorithm, Modified Adsorption (MAD), and demonstrate its effectiveness on various real-world datasets. We also show how class-instance acquisition performance in the graph-based SSL setting can be improved by incorporating additional semantic constraints available in independently developed knowledge bases.
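The general idea of graph-based label propagation can be sketched as follows. This is a deliberately simplified variant with clamped seed nodes and weighted neighbor averaging; it is not Adsorption or MAD themselves, whose details the abstract does not give. The graph, node names, and labels are hypothetical.

```python
def propagate_labels(edges, seeds, labels, iterations=20):
    """Spread seed labels over an undirected weighted graph.

    edges:  {(node_u, node_v): weight}
    seeds:  {node: label}, kept fixed (clamped) in every iteration
    labels: list of all label names
    Returns {node: {label: score}}.
    """
    # Build a symmetric adjacency map from the edge list.
    neighbors = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight
    for node in seeds:
        neighbors.setdefault(node, {})

    scores = {node: {label: 0.0 for label in labels} for node in neighbors}
    for node, label in seeds.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            if node in seeds or total == 0:
                # Seeds stay clamped; isolated nodes keep their scores.
                updated[node] = dict(scores[node])
                continue
            # Non-seed nodes take the weighted average of their neighbors.
            updated[node] = {
                label: sum(w * scores[n][label] for n, w in nbrs.items()) / total
                for label in labels
            }
        scores = updated
    return scores

# Hypothetical bipartite graph linking entities to shared context patterns.
edges = {
    ("paris", "ctx:capital of"): 1.0,
    ("berlin", "ctx:capital of"): 1.0,
    ("aspirin", "ctx:dose of"): 1.0,
    ("ibuprofen", "ctx:dose of"): 1.0,
}
seeds = {"paris": "city", "aspirin": "drug"}

scores = propagate_labels(edges, seeds, labels=["city", "drug"])
```

Here the unlabeled entities "berlin" and "ibuprofen" pick up labels through the contexts they share with the seeds, which is how propagation over such a graph can raise the recall of a small, high-precision seed list.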
Within Information Integration, we develop a novel system, Q, which draws ideas from machine learning and databases to help a non-expert user construct data-integrating queries based on keywords (across databases) and interactive feedback on answers. We also present an information need-driven strategy for automatically incorporating new sources and their information in Q. We also demonstrate that Q's learning strategy is highly effective in combining the outputs of "black box" schema matchers and in re-weighting bad alignments. This removes the need to develop an expensive mediated schema, which has been necessary for most previous systems.