
    Cleaning Web pages for effective Web content mining.

    Web pages usually contain many noisy blocks, such as advertisements, navigation bars, and copyright notices. These noisy blocks can seriously affect web content mining because their contents are irrelevant to the main content of the page. Eliminating noisy blocks before performing web content mining is therefore important for improving mining accuracy and efficiency. A few existing approaches detect noisy blocks with exactly the same contents, but are weak at detecting near-duplicate blocks, such as navigation bars. This thesis proposes WebPageCleaner, a system that, given a collection of web pages from a web site, eliminates noisy blocks from those pages so as to improve the accuracy and efficiency of web content mining. WebPageCleaner detects both noisy blocks with exactly the same contents and those with near-duplicate contents. It is based on the observation that noisy blocks usually share common contents and appear frequently across a given web site. WebPageCleaner consists of three modules: block extraction, block importance retrieval, and cleaned-file generation. A vision-based technique is employed for extracting blocks from web pages. Blocks are assigned an importance degree according to block features such as block position and the level of similarity of block contents to each other. A collection of cleaned files containing the blocks with high importance degree is then generated and used for web content mining. The proposed technique is evaluated using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more efficient and accurate web page classification results than existing approaches.
    Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L5. Source: Masters Abstracts International, Volume: 45-01, page: 0359. Thesis (M.Sc.)--University of Windsor (Canada), 2006.
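    The abstract does not give the exact importance formula, but the core idea (blocks whose content recurs across many pages of the same site are likely noise) can be sketched roughly as below. The shingling scheme, the near-duplicate threshold, and the scoring are illustrative assumptions, not the thesis's actual algorithm.

```python
# Rough sketch: score blocks by how often near-duplicate copies of them appear
# on other pages of the same site. Frequently repeated blocks (navigation bars,
# footers, ad slots) receive low importance. All parameters are illustrative.
from collections import Counter
from itertools import combinations

def shingles(text, k=5):
    """Return the set of k-word shingles of a block's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def block_importance(pages, near_dup_threshold=0.8):
    """pages: list of pages, each a list of block-text strings.
    Returns a dict mapping (page_idx, block_idx) -> importance in [0, 1]."""
    blocks = [(p, i, shingles(text)) for p, page in enumerate(pages)
              for i, text in enumerate(page)]
    dup_count = Counter()
    for (p1, i1, s1), (p2, i2, s2) in combinations(blocks, 2):
        if p1 != p2 and jaccard(s1, s2) >= near_dup_threshold:
            dup_count[(p1, i1)] += 1
            dup_count[(p2, i2)] += 1
    n_pages = len(pages)
    # The more other pages a block is (near-)duplicated on, the lower its score.
    return {(p, i): max(0.0, 1.0 - dup_count[(p, i)] / max(n_pages - 1, 1))
            for p, i, _ in blocks}
```

    Blocks scoring above a chosen cut-off would then be written out as the "cleaned files" that feed the downstream classifier.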

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
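    One way the RSS feed and HTML representation can be combined, sketched below purely as an illustration and not as the BlogForever implementation, is to use an RSS entry's summary as an anchor for locating the matching content block in the rendered page. The scoring is a simple token-overlap heuristic; feedparser and BeautifulSoup are assumed, and the feed URL is hypothetical.

```python
# Illustrative sketch: find the HTML element whose text best matches an RSS
# entry's summary, and treat that element as the post's content block.
import feedparser
from bs4 import BeautifulSoup

def token_set(text):
    return set(text.lower().split())

def locate_content_block(html, rss_summary):
    """Return the HTML element whose text overlaps most with the RSS summary."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = token_set(rss_summary)
    best, best_score = None, 0.0
    for element in soup.find_all(["div", "article", "section"]):
        tokens = token_set(element.get_text(" ", strip=True))
        if not tokens:
            continue
        score = len(anchor & tokens) / len(anchor | tokens)
        if score > best_score:
            best, best_score = element, score
    return best

# feed = feedparser.parse("https://example-blog.com/feed")  # hypothetical URL
# block = locate_content_block(page_html, feed.entries[0].summary)
```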

    A Topic-Agnostic Approach for Identifying Fake News Pages

    Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how it is propagated, and how to counter its effect, it is necessary to first identify it. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change, so a classifier trained on content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time.
    Comment: Accepted for publication in the Companion Proceedings of the 2019 World Wide Web Conference (WWW'19 Companion). Presented in the 2019 International Workshop on Misinformation, Computational Fact-Checking and Credible Web (MisinfoWorkshop2019). 6 pages.
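    The key idea of being topic-agnostic is to classify on how a page is written and structured rather than what it is about. A minimal sketch of that idea follows, using a handful of hand-picked markup and surface-linguistic statistics and a linear classifier; the concrete feature list here is an assumption for illustration, not the feature set of the WWW'19 paper.

```python
# Hedged sketch of topic-agnostic features: layout/markup statistics plus
# shallow linguistic cues instead of word content, fed to a linear classifier.
import re
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression

def page_features(html):
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    words = text.split()
    return [
        len(soup.find_all("a")),                          # link count
        len(soup.find_all("img")),                        # image count
        len(soup.find_all("script")),                     # script count
        len(re.findall(r"[A-Z]{3,}", text)),              # all-caps runs
        text.count("!"),                                  # exclamation marks
        sum(len(w) for w in words) / max(len(words), 1),  # avg word length
    ]

def train(pages_html, labels):
    X = [page_features(h) for h in pages_html]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```

    Because none of these features encode vocabulary tied to specific events, a classifier trained on them is less sensitive to topic drift than a bag-of-words model.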

    MINDWALC : mining interpretable, discriminative walks for classification of nodes in a knowledge graph

    Background: Leveraging graphs for machine learning tasks can result in more expressive power, as extra information is added to the data by explicitly encoding relations between entities. Knowledge graphs are multi-relational, directed graph representations of domain knowledge. Recently, deep learning-based techniques have been gaining a lot of popularity. They can directly process this type of graph or learn a low-dimensional numerical representation. While it has been shown empirically that these techniques achieve excellent predictive performance, they lack interpretability. This is of vital importance in applications situated in critical domains, such as health care.
    Methods: We present a technique that mines interpretable walks from knowledge graphs that are highly informative for a given classification problem. The walks themselves are of a specific format that allows the creation of data structures enabling very efficient mining. We combine this mining algorithm with three different approaches in order to classify nodes within a graph. Each of these approaches excels on different dimensions, such as explainability, predictive performance, and computational runtime.
    Results: We compare our techniques to well-known state-of-the-art black-box alternatives on four benchmark knowledge graph data sets. Results show that our three presented approaches, in combination with the proposed mining algorithm, are at least competitive with the black-box alternatives, often outperforming them, while remaining interpretable.
    Conclusions: The mining of walks is an interesting alternative for node classification in knowledge graphs. As opposed to the current state of the art, which uses deep learning techniques, it results in inherently interpretable or transparent models without a sacrifice in terms of predictive performance.
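    The "specific format" of the walks can be summarised as a (vertex, depth) pair: an instance node matches the walk if some directed path of exactly that depth from the instance reaches the vertex. The sketch below only illustrates that matching test with a plain BFS over a networkx graph; the real MINDWALC implementation uses specialised data structures and an information-gain search over candidate walks, which are not reproduced here, and the example candidates are hypothetical.

```python
# Simplified sketch: binary walk features of the form (vertex, depth) for node
# classification in a directed knowledge graph. Not the MINDWALC implementation.
import networkx as nx

def vertices_at_depth(graph, start, max_depth):
    """Map depth d -> set of vertices reachable from `start` in exactly d hops."""
    frontier, result = {start}, {0: {start}}
    for d in range(1, max_depth + 1):
        frontier = {nbr for v in frontier for nbr in graph.successors(v)}
        result[d] = frontier
    return result

def walk_feature_matrix(graph, instances, candidates):
    """Binary features: does instance i match candidate walk (vertex, depth) j?"""
    max_depth = max(d for _, d in candidates)
    rows = []
    for node in instances:
        reach = vertices_at_depth(graph, node, max_depth)
        rows.append([1 if vertex in reach.get(depth, set()) else 0
                     for vertex, depth in candidates])
    return rows

# G = nx.DiGraph()  # knowledge graph with instance nodes added as vertices
# X = walk_feature_matrix(G, train_nodes, [("dbo:Scientist", 2), ("dbo:City", 3)])
```

    The resulting binary matrix can feed a decision tree or forest, which is what makes the learned model directly inspectable: each split corresponds to a human-readable walk.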

    Web page cleaning for web mining

    Ph.D. (Doctor of Philosophy)

    An Analysis of Predicting Job Titles Using Job Descriptions

    A job title is an all-encompassing, very short-form description that conveys all of the pertinent information relating to a job. The job title typically encapsulates - and should encapsulate - the domain, role, and level of responsibility of any given job. Significant value is attached to job titles, both internally within organisational structures and to individual job holders. Organisations map out all employees in an organogram on the basis of job titles. This has a bearing on issues such as salary, level and scale of responsibility, and employee selection. Employees draw value from their own job titles as a means of self-identity, and this can have a significant impact on their engagement and motivation. Classification of job titles based upon the details of the job is, however, a subjective human-resources exercise which risks bias and inconsistency. I propose instead that job title classification can be performed as a systematic, algorithm-based process using standard Natural Language Processing (NLP) together with supervised machine learning. In this paper, data (job descriptions) labelled with job titles was collected from a popular national job postings website (www.irishjobs.ie). The data went through several standard text pre-processing transformations, which are detailed below, in order to reduce the dimensionality of the corpus. Feature engineering was used to create data models of selected keyword sets characteristic of each job title, generated on the basis of term frequency. The models developed with the Random Forest and Support Vector Machine supervised learning algorithms were used to generate predictions for the 30 most frequently occurring job titles. The most successful model was the SVM linear-kernel model, with an accuracy of 71%, macro-averaged precision of 70%, macro-averaged recall of 67%, and a macro-averaged F-score of 66%. The Random Forest model performed less well, with an accuracy of 58%, macro-averaged precision of 56%, macro-averaged recall of 55%, and a macro-averaged F-score of 56%. The data model described here and the prediction performance obtained indicate that several particularities of the problem (its high dimensionality and the complexity of the feature engineering required to generate a data model with the correct keywords for each job) lead to data models that cannot provide optimal performance even when using powerful Machine Learning (ML) algorithms. The data model design could be improved using a wider data set (job descriptions collected from a variety of websites), thus optimising the set of keywords describing each job title. More complex and computationally expensive algorithms, based on deep learning, may also provide more refined and more accurate predictive models. No research was found during this study that specifically examined the classification of job titles using machine learning. However, other relevant literature on text classification via supervised learning was reviewed, which was useful in designing the models applied to this domain. While supervised ML techniques are commonly applied to text classification, including sentiment analysis, no similar study was found in the literature approaching the link between job titles and the corresponding required skills. Nevertheless, the work presented here describes a valid and practical approach to answering the proposed research question within the constraints of a limited data model and basic ML algorithms. Such an approach may prove a working base for designing future models for artificial intelligence applications.
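    The described pipeline (term-frequency features over job descriptions, restriction to the 30 most frequent titles, and a linear SVM) can be sketched in a few lines of scikit-learn. Note that the thesis builds hand-crafted keyword sets per title, which the plain CountVectorizer below stands in for only as a simplifying assumption, and the example call at the bottom uses hypothetical variable names.

```python
# Minimal sketch of the described approach: term-frequency text features and a
# linear SVM, trained on the 30 most frequent job titles.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def top_k_titles(descriptions, titles, k=30):
    """Keep only postings whose title is among the k most frequent titles."""
    keep = {t for t, _ in Counter(titles).most_common(k)}
    pairs = [(d, t) for d, t in zip(descriptions, titles) if t in keep]
    return [d for d, _ in pairs], [t for _, t in pairs]

def train_title_classifier(descriptions, titles):
    X, y = top_k_titles(descriptions, titles)
    model = make_pipeline(
        CountVectorizer(stop_words="english", lowercase=True, min_df=2),
        LinearSVC(),
    )
    model.fit(X, y)
    return model

# model = train_title_classifier(job_descriptions, job_titles)
# model.predict(["Senior accountant required for a growing finance team ..."])
```

    Swapping LinearSVC for a RandomForestClassifier reproduces the second configuration compared in the study.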
