2,569 research outputs found

    A Survey on Region Extractors from Web Documents

    Get PDF
    Extracting information from web documents has become a research area in which new proposals sprout out year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provide a complete picture, because they do not take region extractors into account. These tools are kind of preprocessors, because they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming a must to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Economía, Industria y Competitividad TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    Get PDF
    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

    Towards Comparative Web Content Mining using Object Oriented Model

    Get PDF
    Web content data are heterogeneous in nature; usually composed of different types of contents and data structure. Thus, extraction and mining of web content data is a challenging branch of data mining. Traditional web content extraction and mining techniques are classified into three categories: programming language based wrappers, wrapper (data extraction program) induction techniques, and automatic wrapper generation techniques. First category constructs data extraction system by providing some specialized pattern specification languages, second category is a supervised learning, which learns data extraction rules and third category is automatic extraction process. All these data extraction techniques rely on web document presentation structures, which need complicated matching and tree alignment algorithms, routine maintenance, hard to unify for vast variety of websites and fail to catch heterogeneous data together. To catch more diversity of web documents, a feasible implementation of an automatic data extraction technique based on object oriented data model technique, 00Web, had been proposed in Annoni and Ezeife (2009). This thesis implements, materializes and extends the structured automatic data extraction technique. We developed a system (called WebOMiner) for extraction and mining of structured web contents based on object-oriented data model. Thesis extends the extraction algorithms proposed by Annoni and Ezeife (2009) and develops an automata based automatic wrapper generation algorithm for extraction and mining of structured web content data. Our algorithm identifies data blocks from flat array data structure and generates Non-Deterministic Finite Automata (NFA) pattern for different types of content data for extraction. Objective of this thesis is to extract and mine heterogeneous web content and relieve the hard effort of matching, tree alignment and routine maintenance. Experimental results show that our system is highly effective and it performs the mining task with 100% precision and 96.22% recall value

    Enhanced web analytics for health insurance

    Get PDF
    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced AnalyticsNowadays companies need invest and improve on data solution implementation within most of the business workflows and processes, in order to differentiate the offer and stay ahead of their competitors. It’s becoming more and more important to take data driven decisions to boost profitability and improve the overall customer experience. In this way, strategies are defined not anymore on common beliefs and assumptions, but on contextualized and trustful insights. This reports describes the work that has been made during a 9-month internship, in order to provide the business with a new and improved solution for enhancing the web analytics tasks and supporting the improve of the online user digital experience. User-level data related to the website activity has been extracted at the highest granularity level. Afterwards, raw data have been cleaned and stored in an Analytical Base Table with which an initial data exploration has been made. After giving initial insights to the digital team, a predictive model has been developed in order to predict the probability of the users to buy the insurance product online. Finally, based on the initial data exploration and the model’s results, a set of recommendations has been built and provided to the digital department for their implementation in order to make the website more engaging and dynamic
    corecore