4,477 research outputs found

    Automatic Classification of Text Databases through Query Probing

    Get PDF
    Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report initial exploratory experiments showing that our approach is a promising way to automatically characterize the contents of text databases accessible on the web. Comment: 7 pages, 1 figure
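    As a rough illustration of the probing strategy described above, the sketch below classifies a database by sending rule-derived queries and tallying the reported match counts. The query_database helper, the probe terms, and the category names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of query-probing classification, assuming a hypothetical
# query_database(db_url, query) helper that returns the number of matches
# the database's search interface reports for a single probing query.

from collections import defaultdict

# Hypothetical probing queries derived from rule-based classifier rules,
# e.g. a rule "ibm AND computer -> Computers" yields the query "ibm computer".
PROBES = {
    "Computers": ["ibm computer", "java programming"],
    "Health":    ["cancer treatment", "diabetes symptoms"],
    "Sports":    ["soccer world cup", "basketball playoffs"],
}

def classify_database(db_url, query_database):
    """Assign the database to the category whose probes match the most documents."""
    counts = defaultdict(int)
    for category, queries in PROBES.items():
        for q in queries:
            counts[category] += query_database(db_url, q)  # number of matches reported
    return max(counts, key=counts.get)
```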

    Classification-Aware Hidden-Web Text Database Selection

    Get PDF
    Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. State-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying.
    We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words.
    To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically: it selects the best categories for a query and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query.
    A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms. NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
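    A minimal sketch of the shrinkage idea mentioned above, assuming a fixed mixture weight rather than the adaptive, per-query decision the abstract describes; the function and variable names are illustrative.

```python
# Smooth a database's sparse, sampled word distribution toward its category's
# distribution so that words missed by the document sample still get
# nonzero probability in the content summary.

def shrink_summary(db_word_probs, category_word_probs, lam=0.7):
    """Mix database and category word distributions.

    db_word_probs / category_word_probs: dicts mapping word -> estimated P(word).
    lam: weight on the database's own (sparse) estimate; illustrative default.
    """
    vocab = set(db_word_probs) | set(category_word_probs)
    return {
        w: lam * db_word_probs.get(w, 0.0) + (1.0 - lam) * category_word_probs.get(w, 0.0)
        for w in vocab
    }
```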

    Search Interfaces on the Web: Querying and Characterizing

    Get PDF
    Current-day web search engines (e.g., Google) do not crawl and index a significant portion of the Web, and hence web users who rely on search engines alone are unable to discover and access a large amount of information in the non-indexable part of the Web. Specifically, dynamic pages generated from parameters that a user provides via web search forms (or search interfaces) are not indexed by search engines and cannot be found in searchers’ results. Such search interfaces provide web users with online access to myriad databases on the Web. To obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results: a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary object of study is the huge portion of the Web (hereafter referred to as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterization of the deep Web, finding and classifying deep web resources, and querying web databases.
    Characterizing the deep Web: Though the term deep Web was coined in 2000, which is a long time ago for any web-related concept or technology, we still do not know many important characteristics of the deep Web. Another concern is that existing surveys of the deep Web are predominantly based on studies of deep web sites in English. One can therefore expect the findings of these surveys to be biased, especially given the steady increase in non-English web content. In this way, surveying national segments of the deep Web is of interest not only to national communities but to the whole web community as well. In this thesis, we propose two new methods for estimating the main parameters of the deep Web. We use the suggested methods to estimate the scale of one specific national segment of the Web and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from this national segment of the Web.
    Finding deep web resources: The deep Web has been growing at a very fast pace, and it has been estimated that there are hundreds of thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been significant interest in approaches that allow users and computer applications to leverage this information. Most approaches assume that the search interfaces to the web databases of interest have already been discovered and are known to the query systems. However, such assumptions rarely hold, mostly because of the large scale of the deep Web: for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. Specifically, the I-Crawler is intentionally designed to be used in deep Web characterization studies and for constructing directories of deep web resources. Unlike almost all other approaches to the deep Web so far, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms.
    Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user, all the more so because the interfaces of conventional search engines are also web forms. At present, a user needs to manually provide input values to search interfaces and then extract the required data from the result pages. Manually filling out forms is cumbersome and infeasible for complex queries, yet such queries are essential for many web searches, especially in the area of e-commerce. Automating the querying and retrieval of data behind search interfaces is therefore desirable and essential for tasks such as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and extracting and integrating information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts, and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store the results of form queries. In addition, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and component design.
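    As a rough sketch of what a data model for search interfaces might look like, the example below represents a form as a set of labeled fields plus a submission target; the class names, field kinds, and build_query helper are illustrative assumptions, not the thesis's actual schema.

```python
# A toy representation of a web search interface: extracted field labels,
# field types, and a helper that serializes user-supplied values into a query.

from dataclasses import dataclass, field
from typing import Dict, List
from urllib.parse import urlencode

@dataclass
class FormField:
    name: str                 # parameter name used in the HTTP request
    label: str                # human-readable label extracted from the page
    kind: str = "text"        # e.g. "text", "select", "checkbox"
    options: List[str] = field(default_factory=list)

@dataclass
class SearchInterface:
    action_url: str
    method: str
    fields: Dict[str, FormField]

    def build_query(self, values: Dict[str, str]) -> str:
        """Serialize user-supplied values into a GET query string."""
        unknown = set(values) - set(self.fields)
        if unknown:
            raise ValueError(f"unknown fields: {unknown}")
        return f"{self.action_url}?{urlencode(values)}"
```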

    Research and Development Workstation Environment: the new class of Current Research Information Systems

    Get PDF
    Against the backdrop of the development of modern technologies in scientific research, a new class of Current Research Information Systems (CRIS) and related intelligent information technologies has arisen. It is called the Research and Development Workstation Environment (RDWE): comprehensive, problem-oriented information systems for supporting the scientific research and development lifecycle. This paper describes the design and development fundamentals of RDWE-class systems. The generalized information model of an RDWE-class system is represented as a composite web service defined by a three-tuple: a set of atomic web services, each of which can be designed and developed as a microservice or a desktop application so that it can also be used as independent software; a set of functions constituting the functional content of the Research and Development Workstation Environment; and, for each function of the composite web service, the subset of atomic web services required to implement it. In accordance with this fundamental information model of the RDWE class, a system was developed for supporting research in ontology engineering (the automated building of applied ontologies in an arbitrary domain) and in scientific and technical creativity (the automated preparation of application documents for patenting inventions in Ukraine). It is called the Personal Research Information System. A distinctive feature of such systems is that they can be oriented toward various types of scientific activity by combining a variety of functional services and adding new ones within the cloud-integrated environment. The main results of our work focus on enhancing the effectiveness of the scientist's research and development lifecycle in an arbitrary domain. Comment: In English, 13 pages, 1 figure, 1 table, added references in Russian. Published. Prepared for the special issue (UkrPROG 2018 conference) of the scientific journal "Problems of Programming" (Founder: National Academy of Sciences of Ukraine, Institute of Software Systems of NAS Ukraine)
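    A minimal sketch of the three-tuple view of a composite web service described above: a set of atomic services, a set of functions, and, for each function, the subset of atomic services needed to realize it. The service and function names are illustrative assumptions, not taken from the paper.

```python
# Toy model of the RDWE three-tuple: (atomic services, functions, realization map).

from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class CompositeService:
    atomic_services: Set[str]          # e.g. microservices or desktop applications
    functions: Set[str]                # the functional content of the environment
    realization: Dict[str, Set[str]]   # function -> atomic services required for it

    def services_for(self, function: str) -> Set[str]:
        """Return the atomic services needed to implement one function."""
        required = self.realization[function]
        missing = required - self.atomic_services
        if missing:
            raise ValueError(f"function {function!r} needs unavailable services {missing}")
        return required

# Illustrative usage: an ontology-engineering function backed by two atomic services.
rdwe = CompositeService(
    atomic_services={"ontology-builder", "document-store", "patent-drafter"},
    functions={"build-ontology", "prepare-patent-application"},
    realization={
        "build-ontology": {"ontology-builder", "document-store"},
        "prepare-patent-application": {"patent-drafter", "document-store"},
    },
)
print(rdwe.services_for("build-ontology"))
```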

    Identification and Classification of Industrial Automation Systems in IP Networks

    Get PDF
    Industrial Control Systems (ICS) are an essential part of the critical infrastructure of society and are becoming increasingly vulnerable to cyber attacks performed over computer networks. The introduction of remote access connections, combined with mistakes in automation system configurations, exposes ICSs to attacks coming from the public Internet. Insufficient IT security policies and weaknesses in the security features of automation systems increase the risk of a successful cyber attack considerably. In recent years the number of observed cyber attacks has been rising constantly, signaling the need for new methods for finding and protecting vulnerable automation systems. So far, search engines for Internet-connected devices, such as Shodan, have been a great asset in mapping the scale of the problem. In this thesis, methods are presented for identifying and classifying industrial control systems over IP-based networking protocols. A large portion of the protocols used in automation networks contain specific diagnostic requests for pulling identification information from a device. Port scanning methods combined with more elaborate service scan probes can be used to extract identifying data fields from an automation device. A model for the automated finding and reporting of vulnerable ICS devices is also presented. A prototype was created and tested with real ICS devices to demonstrate the viability of the model; the target set was gathered from Finnish devices directly connected to the public Internet. Initial results were promising, as devices or systems were identified with a 99% success rate. A specially crafted identification ruleset and detection database was compiled to work with the prototype. However, a more comprehensive detection library of ICS device types is needed before the prototype is ready to be used in different environments. In addition, features that help further assess device purpose and system criticality would be key improvements for future versions of the prototype.
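    As a rough illustration of the identification idea (port scanning plus service-level probing), the sketch below connects to a few ports commonly used by industrial protocols, reads any banner the service returns, and matches it against a small signature list. The port numbers are standard protocol defaults, but the signature strings and the rule set are illustrative placeholders, not the thesis's detection library.

```python
# Toy ICS identification probe: TCP connect, optional banner grab, signature match.

import socket

CANDIDATE_PORTS = [102, 502, 44818]   # e.g. S7comm, Modbus/TCP, EtherNet/IP defaults
SIGNATURES = {b"Siemens": "Siemens PLC", b"Moxa": "Moxa device"}  # hypothetical rules

def probe(host: str, port: int, timeout: float = 2.0):
    """Return a device label if the port is open and the banner matches a signature."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                banner = sock.recv(256)
            except socket.timeout:
                return f"open port {port}, no banner"
            for pattern, label in SIGNATURES.items():
                if pattern in banner:
                    return label
            return f"open port {port}, unrecognized banner"
    except OSError:
        return None  # closed, filtered, or unreachable

def identify(host: str):
    """Collect whatever the candidate-port probes reveal about one host."""
    return [r for p in CANDIDATE_PORTS if (r := probe(host, p))]
```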

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-Based Systems
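    As a small illustration of a template-based wrapper in the Web Data Extraction sense, the sketch below extracts records from pages that share a known, fixed layout; the tag and class names are placeholders for whatever a target site actually uses.

```python
# Toy wrapper: collect the text of known (tag, class) pairs from a templated page.

from html.parser import HTMLParser

class ListingWrapper(HTMLParser):
    """Collects the text of every <h2 class="title"> and <span class="price">."""
    TARGETS = {("h2", "title"), ("span", "price")}

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if (tag, cls) in self.TARGETS:
            self._current = (tag, cls)

    def handle_data(self, data):
        if self._current:
            self.records.append((self._current[1], data.strip()))
            self._current = None

wrapper = ListingWrapper()
wrapper.feed('<h2 class="title">Widget</h2><span class="price">9.99</span>')
print(wrapper.records)   # [('title', 'Widget'), ('price', '9.99')]
```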