
    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
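    As an illustration of the RSS-plus-HTML idea summarised above, the sketch below (not the BlogForever implementation itself) uses a feed entry's summary to locate the matching post container in the page and then collects any schema.org microdata found inside it; the feed URL and the similarity threshold are assumptions.

```python
# Minimal sketch: use an RSS entry's summary as a noisy "label" to find the
# post container in the raw HTML, then pull microdata properties from it.
import difflib
import feedparser                     # pip install feedparser
import requests                       # pip install requests
from bs4 import BeautifulSoup         # pip install beautifulsoup4

def best_matching_block(html: str, reference_text: str):
    """Return the HTML element whose text most resembles the feed summary."""
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0.0
    for el in soup.find_all(["article", "div", "section"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        score = difflib.SequenceMatcher(None, reference_text, text).ratio()
        if score > best_score:
            best, best_score = el, score
    return best, best_score

def microdata_properties(element):
    """Collect schema.org microdata (itemprop/value pairs) inside a block."""
    return {
        node.get("itemprop"): node.get_text(" ", strip=True)
        for node in element.find_all(attrs={"itemprop": True})
    }

feed = feedparser.parse("https://example-blog.net/feed")       # hypothetical blog feed
for entry in feed.entries[:5]:
    html = requests.get(entry.link, timeout=10).text
    summary = entry.get("summary", entry.get("title", ""))
    block, score = best_matching_block(html, summary)
    if block is not None and score > 0.3:                      # assumed threshold
        print(entry.title, microdata_properties(block))
```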

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.
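    The survey does not prescribe an implementation, but the template-based "wrapper" at the core of many of the approaches it reviews can be illustrated with a few fixed XPath rules; the HTML snippet and expressions below are illustrative assumptions, not taken from the paper.

```python
# A wrapper in miniature: one XPath extraction rule per field, applied to
# pages that share a template. Snippet and rules are invented for illustration.
from lxml import html   # pip install lxml

page = """
<html><body>
  <div class="product"><span class="name">Espresso machine</span>
    <span class="price">129.90</span></div>
  <div class="product"><span class="name">Milk frother</span>
    <span class="price">24.50</span></div>
</body></html>
"""

tree = html.fromstring(page)
names  = tree.xpath('//div[@class="product"]/span[@class="name"]/text()')
prices = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
print(list(zip(names, prices)))   # [('Espresso machine', '129.90'), ('Milk frother', '24.50')]
```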

    DART: the distributed agent based retrieval toolkit

    Search engine technology is evolving from keyword-based indexing and classification of web resources towards more sophisticated techniques that take into account the meaning, context and usage of textual information. When answering a query, commercial search engines confront the user with a large number of results, many of them useless or only partially related to the request; the subsequent refinement, which consists of downloading and examining as many pages as possible while ignoring whatever lies beyond the first few, is left up to the user. Furthermore, architectures based on centralized indexes allow commercial search engines to control the advertisement of online information, in contrast to P2P architectures, which focus on user requirements and involve the end user in search engine maintenance and operation. To address these needs, new search engines should focus on three key aspects: semantics, geo-referencing, and collaboration/distribution. Semantic analysis increases the relevance of results. Geo-referencing of catalogued resources allows contextualisation based on the user's position. Collaboration distributes storage, processing, and trust over a world-wide network of nodes running on users' computers, removing bottlenecks and single points of failure. In this paper, we describe the studies, concepts and solutions developed in the DART project to introduce these three key features into a novel search engine architecture.
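    A minimal sketch of how the geo-referencing aspect could combine with textual relevance (not the DART ranking function, whose details the abstract does not give): a distance decay centred on the user's position is blended with a textual score, with the weight and half-distance chosen arbitrarily for illustration.

```python
# Blend a [0,1] textual relevance score with a geographic proximity score.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def geo_aware_score(text_score, resource_pos, user_pos, half_distance_km=50.0, w_geo=0.3):
    """Weighted mix of textual score and distance decay (weights are assumptions)."""
    d = haversine_km(*user_pos, *resource_pos)
    geo_score = 1.0 / (1.0 + d / half_distance_km)      # ~1.0 nearby, -> 0 far away
    return (1 - w_geo) * text_score + w_geo * geo_score

# Hypothetical catalogued resources: (title, textual relevance, (lat, lon))
resources = [
    ("Museum opening hours", 0.82, (45.07, 7.69)),
    ("Museum history essay",  0.90, (51.51, -0.13)),
]
user = (45.06, 7.66)   # assumed user position
for title, rel, pos in sorted(resources, key=lambda r: -geo_aware_score(r[1], r[2], user)):
    print(f"{geo_aware_score(rel, pos, user):.3f}  {title}")
```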

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. To this end, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) consisting of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.
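    For illustration only, the sketch below runs the kinds of steps the TPC bundles (sentence splitting, POS tagging, lemmatization, parsing, named entity recognition) using spaCy as a stand-in; the actual PANACEA modules and models are not assumed.

```python
# Stand-in NLP pipeline covering the TPC steps listed above, using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")          # small English model, assumed installed
text = ("PANACEA acquires monolingual and bilingual corpora from the web. "
        "The data are then normalized and annotated.")
doc = nlp(text)

for sent in doc.sents:                       # sentence splitting
    print("SENT:", sent.text)
    for tok in sent:
        # POS tag, lemma and dependency relation (parsing) per token
        print(f"  {tok.text:12} {tok.pos_:6} {tok.lemma_:12} {tok.dep_}")

for ent in doc.ents:                         # named entity recognition
    print("ENT:", ent.text, ent.label_)
```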

    Ontology Population for Open-Source Intelligence

    We present an approach based on GATE (General Architecture for Text Engineering) for the automatic population of ontologies from text documents. We describe some experimental results, which are encouraging in terms of correctly extracted ontology instances. We then focus on one phase of our pipeline and discuss a variant of it that aims at reducing the manual effort needed to generate the pre-defined dictionaries used in document annotation. Our additional experiments also show promising results in this case.
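    A minimal sketch of the dictionary (gazetteer) annotation step discussed above, outside of GATE: terms from a pre-defined dictionary are matched in a document and become candidate ontology instances. The class names and terms are illustrative assumptions.

```python
# Gazetteer-style annotation: dictionary matches become candidate instances.
import re
from collections import defaultdict

gazetteer = {                      # hypothetical dictionary: term -> ontology class
    "acme corp": "Organization",
    "john smith": "Person",
    "geneva": "Location",
}

def populate(text: str):
    """Return candidate ontology instances found in the text."""
    instances = defaultdict(set)
    lowered = text.lower()
    for term, onto_class in gazetteer.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            instances[onto_class].add(term)
    return dict(instances)

doc = "John Smith, a spokesperson for Acme Corp, gave a briefing in Geneva."
# e.g. {'Organization': {'acme corp'}, 'Person': {'john smith'}, 'Location': {'geneva'}}
print(populate(doc))
```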

    Building a scalable index and a web search engine for music on the Internet using Open Source software

    The Internet has made thousands of freely available music tracks with Creative Commons or Public Domain licenses accessible, and their number keeps growing every year. In practical terms, it is very difficult to browse this music collection, because it is large and dispersed across hundreds of websites. To put the music recommendation problem in context and identify the necessary building blocks, a case study of existing systems was carried out. This thesis focuses mainly on the problem of indexing this large collection of music: there is no database or index holding information about this music material, which makes research on the subject extremely difficult. To determine what software could help solve this problem, the state of the art in “Open Source tools for web crawling and indexing” was assessed. Based on its conclusions, a prototype was developed and implemented using the most appropriate software framework. The resulting solution proved capable of crawling web pages while parsing and indexing MP3 files. The produced index is available through a web search engine interface that can also return results in XML format. The results obtained lead to the conclusion that it is feasible to build a scalable index and web search engine for music on the Internet using Open Source software, as demonstrated by the proof of concept achieved with the working prototype.
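    In the spirit of the prototype described above, the sketch below crawls one page, collects MP3 links and builds a tiny inverted index over their anchor text; the seed URL is a placeholder, and the thesis's actual Open Source crawler/indexer stack is not reproduced here.

```python
# Toy crawl-and-index step: find MP3 links on a page and index their anchor text.
import re
from collections import defaultdict
from urllib.parse import urljoin
import requests                     # pip install requests
from bs4 import BeautifulSoup       # pip install beautifulsoup4

def crawl_mp3_links(seed_url: str):
    """Yield (absolute_url, anchor_text) for every MP3 link on the seed page."""
    soup = BeautifulSoup(requests.get(seed_url, timeout=10).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].lower().endswith(".mp3"):
            yield urljoin(seed_url, a["href"]), a.get_text(" ", strip=True)

def build_index(entries):
    """Inverted index: token -> set of MP3 URLs whose anchor text contains it."""
    index = defaultdict(set)
    for url, text in entries:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)
    return index

index = build_index(crawl_mp3_links("https://example.org/free-music"))  # placeholder seed
print(sorted(index.get("blues", [])))    # simple keyword lookup over the index
```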

    UniversityIE: Information Extraction From University Web Pages

    The amount of information available on the web is growing constantly. As a result, the problem of retrieving any desired information is getting more difficult by the day. To alleviate this problem, several techniques are currently being used, both for locating pages of interest and for extracting meaningful information from the retrieved pages. Information extraction (IE) is one such technology that is used for summarizing unrestricted natural language text into a structured set of facts. IE is already being applied within several domains such as news transcripts, insurance information, and weather reports. Various approaches to IE have been taken and a number of significant results have been reported. In this thesis, we describe the application of IE techniques to the domain of university web pages. This domain is broader than previously evaluated domains and has a variety of idiosyncratic problems to address. We present an analysis of the domain of university web pages and the consequences of having them input to IE systems. We then present UniversityIE, a system that can search a web site, extract relevant pages, and process them for information such as admission requirements or general information. The UniversityIE system, developed as part of this research, contributes three IE methods and a web-crawling heuristic that worked relatively well and predictably over a test set of university web sites. We designed UniversityIE as a generic framework for plugging in and executing IE methods over pages acquired from the web. We also integrated into the system a generic web crawler (built at the University of Kentucky), which we ported to Java, as well as an external word lexicon (WordNet) and a syntax parser (Link Grammar Parser).
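    The following is a hypothetical pattern-based extractor in the style of the task described above, not one of the three UniversityIE methods: it looks for admission-related sentences and pulls out simple facts (minimum GPA, application deadline) with regular expressions that are purely illustrative.

```python
# Cue words plus regular expressions over admission-related sentences.
import re
from bs4 import BeautifulSoup       # pip install beautifulsoup4

CUES = ("admission", "applic")      # crude substring cues (matches "application", "applicants")
PATTERNS = {
    "min_gpa":  re.compile(r"\bGPA of (\d\.\d{1,2})", re.I),
    "deadline": re.compile(r"\bdeadline (?:is|of)\s+([A-Z][a-z]+ \d{1,2})", re.I),
}

def extract_admission_facts(html: str):
    """Return {field: value} pairs found in admission-related sentences."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    facts = {}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not any(cue in sentence.lower() for cue in CUES):
            continue
        for field, pattern in PATTERNS.items():
            m = pattern.search(sentence)
            if m:
                facts.setdefault(field, m.group(1))
    return facts

page = "<p>Applicants must have a GPA of 3.0. The application deadline is March 15.</p>"
print(extract_admission_facts(page))   # {'min_gpa': '3.0', 'deadline': 'March 15'}
```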

    A Survey on Automatically Mining Facets for Web Queries

    In this paper, a detailed survey of different facet mining techniques, with their advantages and disadvantages, is carried out. A facet is a word or phrase that summarizes an important aspect of a web query. Researchers have proposed efficient techniques that significantly improve users' web search experience: users are satisfied when they find information relevant to their query in the top results. The objectives of the surveyed research are: (1) to present an automated solution that derives query facets by analyzing the text of the query; (2) to create a taxonomy of query refinement strategies for efficient results; and (3) to personalize search according to user interest.
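    One signal commonly used by the surveyed facet-mining techniques is that items co-occurring in lists across many top-ranked results for a query make good facet candidates; the sketch below implements that heuristic over hard-coded stand-in pages, with the document-frequency threshold as an assumption.

```python
# Rank list items by how many distinct result pages they appear in.
from collections import Counter
from bs4 import BeautifulSoup       # pip install beautifulsoup4

def list_items(html: str):
    """Lower-cased text of every <li> element in one result page."""
    soup = BeautifulSoup(html, "html.parser")
    return {li.get_text(" ", strip=True).lower() for li in soup.find_all("li")}

def candidate_facets(result_pages, min_docs=2):
    """Facet candidates: list items appearing in at least min_docs pages."""
    df = Counter()
    for page in result_pages:
        df.update(list_items(page))
    return [(item, n) for item, n in df.most_common() if n >= min_docs]

# Hypothetical top results for the query "watches"
pages = [
    "<ul><li>Rolex</li><li>Omega</li><li>Seiko</li></ul>",
    "<ul><li>Seiko</li><li>Omega</li><li>Casio</li></ul>",
    "<ol><li>Omega</li><li>Rolex</li></ol>",
]
print(candidate_facets(pages))   # e.g. [('omega', 3), ('rolex', 2), ('seiko', 2)]
```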
    • 

    corecore