
    Content-Based Book Recommending Using Learning for Text Categorization

    Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations. Comment: 8 pages, 3 figures, submission to the Fourth ACM Conference on Digital Libraries
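    The abstract does not name a particular learner, so the following is only a minimal sketch of the general idea, assuming scikit-learn: descriptions of books the user has already rated act as labeled training text for a categorizer, and unrated books are ranked by the predicted probability of a positive rating. All titles, descriptions, and ratings below are made-up placeholders.

```python
# Minimal sketch of content-based recommendation via text categorization (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Descriptions of books the user has already rated (1 = liked, 0 = disliked) -- placeholders.
rated_descriptions = [
    "A sweeping space opera about first contact and diplomacy.",
    "A detective untangles a murder in a rain-soaked city.",
    "Hard science fiction exploring generation ships and AI.",
    "A celebrity memoir about life on reality television.",
]
ratings = [1, 0, 1, 0]

# Text categorization model: bag-of-words features plus a naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(rated_descriptions, ratings)

# Score unrated books by the predicted probability of "like" and recommend the top ones.
unrated = {
    "Book A": "An astronaut stranded on Mars improvises to survive.",
    "Book B": "A courtroom thriller about corporate espionage.",
}
scores = {title: model.predict_proba([text])[0][1] for title, text in unrated.items()}
for title, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{title}: P(like) = {score:.2f}")
```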

    Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages

    Extracting geographical tags from webpages is a well-motivated application in many domains. In illicit domains with unusual language models, like human trafficking, extracting geotags with both high precision and recall is a challenging problem. In this paper, we describe a geotag extraction framework in which context, constraints and the openly available Geonames knowledge base work in tandem in an Integer Linear Programming (ILP) model to achieve good performance. In preliminary empirical investigations, the framework improves precision by 28.57% and F-measure by 36.9% on a difficult human trafficking geotagging task compared to a machine learning-based baseline. The method is already being integrated into an existing knowledge base construction system widely used by US law enforcement agencies to combat human trafficking. Comment: 6 pages, GeoRich 2017 workshop at the ACM SIGMOD conference
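    The abstract does not spell out the ILP, so the sketch below only illustrates the general shape of such a formulation, assuming the PuLP solver: binary variables select one knowledge-base entry per extracted mention, context scores form the objective, and an extra constraint stands in for global coherence. The candidates, scores, and coherence constraint are illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch of geotag selection as an ILP (PuLP assumed); data and constraints are illustrative.
import pulp

# Candidate (mention, knowledge-base entry) pairs with context-based scores -- placeholders.
candidates = {
    ("mention_1", "Springfield, IL"): 0.4,
    ("mention_1", "Springfield, MA"): 0.7,
    ("mention_2", "Boston, MA"): 0.9,
}

prob = pulp.LpProblem("geotag_selection", pulp.LpMaximize)
x = {c: pulp.LpVariable(f"x_{i}", cat="Binary") for i, c in enumerate(candidates)}

# Objective: pick the assignment with the highest total context score.
prob += pulp.lpSum(score * x[c] for c, score in candidates.items())

# Constraint: each mention is linked to at most one knowledge-base entry.
for mention in {m for m, _ in candidates}:
    prob += pulp.lpSum(x[c] for c in candidates if c[0] == mention) <= 1

# Toy stand-in for a global coherence constraint: discourage mixing resolutions from different states.
prob += x[("mention_1", "Springfield, IL")] + x[("mention_2", "Boston, MA")] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [c for c in candidates if x[c].value() == 1]
print(chosen)
```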

    User driven information extraction with LODIE

    Information Extraction (IE) is the technique for transforming unstructured or semi-structured data into a structured representation that can be understood by machines. In this paper we use a user-driven Information Extraction technique to wrap entity-centric Web pages. The user can select concepts and properties of interest from available Linked Data. Given a number of websites containing pages about the concepts of interest, the method exploits (i) recurrent structures in the Web pages and (ii) available knowledge in Linked Data to extract the information of interest from the Web pages.
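    As a rough illustration of the wrapper-induction idea described above, the sketch below (assuming lxml) locates a value already known from Linked Data in one page of a site, records its DOM path, and reuses that path on structurally similar pages from the same site. It is a simplification, not the actual LODIE pipeline; the HTML snippets and the "Blade Runner" value are made-up examples.

```python
# Simplified wrapper induction: anchor on a known Linked Data value, reuse its DOM path (lxml assumed).
from lxml import html


def induce_xpath(page_html: str, known_value: str) -> str | None:
    """Return the root-to-node path of the element whose text matches a value known from Linked Data."""
    tree = html.fromstring(page_html)
    for node in tree.iter():
        if node.text and node.text.strip() == known_value:
            return tree.getroottree().getpath(node)
    return None


def apply_xpath(page_html: str, xpath: str) -> list[str]:
    """Extract values from another page of the same site using the induced path."""
    tree = html.fromstring(page_html)
    return [n.text.strip() for n in tree.xpath(xpath) if n.text]


sample = "<html><body><div class='card'><h1>Blade Runner</h1><span>1982</span></div></body></html>"
other = "<html><body><div class='card'><h1>Alien</h1><span>1979</span></div></body></html>"

# "Blade Runner" is assumed to be a film title already present in Linked Data.
path = induce_xpath(sample, "Blade Runner")
print(path, apply_xpath(other, path))  # e.g. /html/body/div/h1 -> ['Alien']
```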

    Mining web sites using adaptive information extraction

    Adaptive Information Extraction systems (IES) are currently used by some Semantic Web (SW) annotation tools as support to annotation (Handschuh et al., 2002; Vargas-Vera et al., 2002). They are generally based on fully supervised methodologies requiring fairly intense domain-specific annotation. Unfortunately, selecting representative examples may be difficult, and annotations can be incorrect and time-consuming to produce. In this paper we present a methodology that drastically reduces (or even removes) the amount of manual annotation required when annotating consistent sets of pages. A very limited number of user-defined examples is used to bootstrap learning. Simple, high-precision (and possibly high-recall) IE patterns are induced from these examples; the patterns then discover further examples, which in turn yield further patterns, and so on.
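    The bootstrapping loop can be pictured roughly as follows, with a deliberately trivial pattern representation (a left word and a right word around each example). This is an assumption for illustration only, not the induction algorithm used by the cited systems.

```python
# Hedged sketch of bootstrapped pattern induction: seeds -> patterns -> new examples -> more patterns.
import re


def induce_patterns(corpus: str, examples: set[str]) -> set[tuple[str, str]]:
    """Collect (left-word, right-word) contexts around the known examples."""
    patterns = set()
    for ex in examples:
        for m in re.finditer(rf"(\w+)\s+{re.escape(ex)}\s+(\w+)", corpus):
            patterns.add((m.group(1), m.group(2)))
    return patterns


def match_patterns(corpus: str, patterns: set[tuple[str, str]]) -> set[str]:
    """Find new candidate fillers that occur in any learned context."""
    found = set()
    for left, right in patterns:
        for m in re.finditer(rf"{re.escape(left)}\s+(\w+(?:\s\w+)?)\s+{re.escape(right)}", corpus):
            found.add(m.group(1))
    return found


corpus = "the seminar by Alice Smith starts at noon . the seminar by Bob Jones starts at three"
examples = {"Alice Smith"}          # user-provided seed annotation (placeholder)
for _ in range(3):                  # bootstrap: patterns -> examples -> patterns ...
    patterns = induce_patterns(corpus, examples)
    examples |= match_patterns(corpus, patterns)
print(examples)                     # expected to also pick up 'Bob Jones'
```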

    Information Extraction in Illicit Domains

    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment. Comment: 10 pages, ACM WWW 2017
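    One way to picture the feature-agnostic idea is sketched below, assuming gensim and scikit-learn: word vectors are learned from the raw, unlabeled corpus, each candidate extraction is represented by the averaged vectors of its surrounding context words, and a light classifier is trained on the handful of seed annotations. The corpus, seeds, and attribute ("name") are placeholders, and this stand-in does not reproduce the paper's exact representation learner.

```python
# Hedged sketch: unsupervised word vectors + context averaging + a small supervised classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Raw, unlabeled sentences from the initial corpus (placeholders, repeated to give the toy model data).
sentences = [
    ["contact", "maria", "in", "downtown", "houston", "tonight"],
    ["new", "in", "town", "call", "crystal", "now"],
    ["visiting", "dallas", "this", "week", "ask", "for", "lexi"],
] * 50

w2v = Word2Vec(sentences, vector_size=32, min_count=1, window=2, epochs=20, seed=0)


def context_vector(tokens: list[str], idx: int, window: int = 2) -> np.ndarray:
    """Average the embeddings of words around position idx (the candidate itself excluded)."""
    ctx = tokens[max(0, idx - window):idx] + tokens[idx + 1:idx + 1 + window]
    vecs = [w2v.wv[t] for t in ctx if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


# A few seed annotations for one attribute: (sentence, candidate index, is-a-name?).
seeds = [
    (sentences[0], 1, 1),  # "maria"   -> name
    (sentences[0], 4, 0),  # "houston" -> not a name
    (sentences[1], 4, 1),  # "crystal" -> name
    (sentences[2], 1, 0),  # "dallas"  -> not a name
]
X = np.array([context_vector(s, i) for s, i, _ in seeds])
y = np.array([label for _, _, label in seeds])

clf = LogisticRegression().fit(X, y)

# Score a candidate token on an unseen sentence using its context alone.
new_sentence = ["ask", "for", "amber", "in", "austin"]
print(clf.predict_proba([context_vector(new_sentence, 2)])[0][1])
```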