
    Comparative Mining of B2C Web Sites by Discovering Web Database Schemas

    Discovering potentially useful and previously unknown historical knowledge from heterogeneous E-Commerce (B2C) web site contents to answer comparative queries such as “list all laptop prices from Walmart and Staples between 2013 and 2015 including make, type, screen size, CPU power, year of make” requires the difficult tasks of finding the schema of web documents from different web pages, extracting the target information, performing web content data integration, building a virtual or physical data warehouse, and mining from it. Automatic data extractors (wrappers) such as the WebOMiner system use data extraction techniques based on parsing the web page HTML source code into a document object model (DOM) tree and traversing the DOM for pattern discovery, in order to recognize and extract different web data types (e.g., text, image, links, and lists). Limitations of existing systems include their reliance on complicated matching techniques such as tree matching, non-deterministic finite state automata (NFA) and domain ontologies, and their inability to answer complex comparative, historical and derived queries. This thesis proposes the WebOMiner_S system, which applies web structure and content mining approaches to the DOM tree of the HTML code to simplify the WebOMiner data extraction process and make it more easily extendable. We propose to replace the NFA in WebOMiner with a frequent structure finder algorithm that uses regular expression matching with a Java XPath parser to dynamically discover the most frequent structure (the most frequently repeated block in the HTML code, represented by tags such as <div class="">) in the DOM tree. This approach eliminates the need for any supervised training or for updating the wrapper for each new B2C web page, making the approach simpler, more easily extendable and automated. Experiments show that WebOMiner_S achieves 100% precision and 100% recall in identifying product records, and 95.55% precision and 100% recall in identifying data columns.
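    The frequent structure finder itself is not spelled out in the abstract; the sketch below illustrates the general idea in Python, assuming the product records are repeated <div class="..."> blocks whose tag/class signature can simply be counted (the thesis uses a Java XPath parser; the function names and signature scheme here are illustrative, not the thesis implementation).

```python
# Minimal sketch of a frequent-structure finder: count the tag/class
# signatures of <div> blocks in a page and treat the most frequent one
# as the repeated product-record template. Assumes lxml is installed;
# names and the exact signature scheme are illustrative only.
from collections import Counter
from lxml import html


def block_signature(div):
    """Signature of a <div>: its class attribute plus the tags of its element children."""
    child_tags = ",".join(child.tag for child in div if isinstance(child.tag, str))
    return (div.get("class", ""), child_tags)


def most_frequent_structure(page_source):
    tree = html.fromstring(page_source)
    divs = tree.xpath("//div[@class]")           # candidate record blocks
    counts = Counter(block_signature(d) for d in divs)
    if not counts:
        return None, []
    signature, _ = counts.most_common(1)[0]      # most frequently repeated block
    records = [d for d in divs if block_signature(d) == signature]
    return signature, records
```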

    Optical tomography: Image improvement using mixed projection of parallel and fan beam modes

    Mixed parallel and fan beam projection is a technique used to increase image quality. This research focuses on enhancing image quality in optical tomography. Image quality is assessed by measuring the Peak Signal to Noise Ratio (PSNR) and Normalized Mean Square Error (NMSE). The findings of this research show that by combining parallel and fan beam projection, image quality can be increased by more than 10% in terms of the PSNR value and more than 100% in terms of the NMSE value compared to a single parallel beam.
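    For reference, PSNR and NMSE can be computed as below. This is a generic sketch using standard definitions (PSNR = 10·log10(MAX²/MSE), NMSE as the error energy normalized by the reference energy), not the paper's evaluation code, and it assumes 8-bit images (MAX = 255).

```python
# Generic PSNR and NMSE calculations for comparing a reconstructed image
# against a reference image. Assumes 8-bit grayscale arrays; illustrative
# sketch, not the paper's evaluation code.
import numpy as np


def psnr(reference, reconstructed, max_value=255.0):
    mse = np.mean((reference.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")                       # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def nmse(reference, reconstructed):
    err = np.sum((reference.astype(float) - reconstructed.astype(float)) ** 2)
    return err / np.sum(reference.astype(float) ** 2)
```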

    Automatic supervised information extraction of structured web data

    The overall purpose of this project is, in short, to create a system able to extract vital information from product web pages just as a human would: information such as the name of the product, its description, price tag, the company that produces it, and so on. At first glimpse this may not seem extraordinary or technically difficult, since web scraping techniques have existed for a long time (like the Python library Beautiful Soup, an HTML parser released in 2004). But let us think for a second about what it actually means to be able to extract desired information from any given web source: the way information is displayed can be extremely varied, not only visually but also semantically. For instance, some hotel booking web pages display all prices for the different room types at once, while medium-sized consumer products on websites like Amazon are offered as a main product in detail followed by smaller product recommendations further down the page, the latter being the preferred way of displaying assets for most retail companies. And each site has its own styling and search engine. With the above said, the task of mining valuable data from the web no longer sounds as easy as it first seemed. Hence the purpose of this project is to shine some light on the problem of Automatic Supervised Information Extraction of Structured Web Data. It is important to consider whether developing such a solution is really valuable at all: such an endeavour, both in time and computing resources, should lead to a useful end result, at least on paper, to justify it. The opinion of this author is that it does lead to a potentially valuable result. The targeted extraction of publicly available consumer-oriented content at large scale in an accurate, reliable and future-proof manner could provide an incredibly useful and large amount of data. This data, if kept updated, could create endless opportunities for Business Intelligence, although exactly which ones is beyond the scope of this work. A simple metaphor explains the potential value of this work: if an oil company were told where all the oil reserves on the planet are, it would still need to invest in machinery, workers and time to successfully exploit them, but half of the job would already have been done. As the reader will see in this work, the issue is tackled by building a somewhat complex architecture that ends in an Artificial Neural Network. A quick overview of the architecture is as follows: first, find the URLs that lead to the product pages containing the desired data inside a given site (like URLs that lead to ”action figure” products on ebay.com); second, for each URL, extract its HTML, take a screenshot of the page, and store this data in a suitable and scalable fashion; third, label the data that will be fed to the neural network (NN); fourth, prepare the aforementioned data to be input to the NN; fifth, train the NN; and sixth, deploy the NN to make [hopefully accurate] predictions.
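    Of the six stages listed, the HTML-and-screenshot capture step (stage two) is the most mechanical; a minimal sketch using Selenium is shown below. It assumes a local ChromeDriver installation and illustrative path/function names; it is not the project's actual code.

```python
# Sketch of stage two of the pipeline: for each product URL, save the
# page's HTML source and a screenshot to disk. Assumes Selenium and a
# ChromeDriver binary on PATH; illustrative only.
import pathlib
from selenium import webdriver


def capture_page(url, out_dir, page_id):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    driver = webdriver.Chrome()                   # assumes ChromeDriver is available
    try:
        driver.get(url)
        (out / f"{page_id}.html").write_text(driver.page_source, encoding="utf-8")
        driver.save_screenshot(str(out / f"{page_id}.png"))
    finally:
        driver.quit()
```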

    Explaining Trained Neural Networks with Semantic Web Technologies: First Steps

    The ever-increasing prevalence of publicly available structured data on the World Wide Web enables new applications in a variety of domains. In this paper, we provide a conceptual approach that leverages such data in order to explain the input-output behavior of trained artificial neural networks. We apply existing Semantic Web technologies in order to provide an experimental proof of concept.

    Searching for Ground Truth: a stepping stone in automating genre classification

    This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in genre characterisation and the prediction of obstacles which need to be overcome by an automated system, and to contribute to the process of creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modeling features in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.
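    The "ground truth" data described above comes from documents on which the three human labellers agree; a minimal sketch of deriving such a unanimously labelled subset is given below. The data layout and names are illustrative placeholders, not the paper's corpus format.

```python
# Sketch of building a ground-truth subset from three human labellers:
# keep only the documents on which all three agree on the genre label.
# Field names and toy data are illustrative placeholders.

def unanimous_subset(labels_by_document):
    """labels_by_document: dict mapping doc_id -> [label_1, label_2, label_3]."""
    agreed = {}
    for doc_id, labels in labels_by_document.items():
        if len(set(labels)) == 1:                 # all three labellers chose the same genre
            agreed[doc_id] = labels[0]
    return agreed


# Example usage with toy labels standing in for the fifteen genre classes:
labels = {
    "doc1": ["thesis", "thesis", "thesis"],
    "doc2": ["article", "slides", "article"],
}
print(unanimous_subset(labels))                   # {'doc1': 'thesis'}
```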