A Model for Personalized Keyword Extraction from Web Pages using Segmentation
The World Wide Web caters to the needs of billions of users in heterogeneous
groups. Each user accessing the World Wide Web has specific interests and
expects the web to respond to those specific requirements. Making the web
react in this customized manner is achieved through personalization. This
paper proposes a novel model for extracting keywords from a web page with
personalization incorporated into it. The keyword extraction problem is
approached through web page segmentation, which simplifies the problem and
allows it to be solved effectively. The proposed model is implemented as a
prototype, and the experiments conducted on it empirically validate the
model's efficiency.
Comment: 6 Pages, 2 Figures
Morpes: A Model for Personalized Rendering of Web Content on Mobile Devices
With the tremendous growth of the information communication sector, mobile
phones have become the primary information communication devices. The
convergence of traditional telephony with modern web-enabled communication
on mobile devices has made communication more effective and simpler. As
mobile phones become a crucial means of accessing the contents of the
World Wide Web, which was originally designed for personal computers, a new
challenge has opened up: accommodating web content on smaller mobile
devices. This paper proposes an approach towards building a model for
rendering web pages on mobile devices. The proposed model is based on a
multi-dimensional web page segment evaluation model. The incorporation of
personalization in the proposed model makes the rendering user-centric. The
proposed model is validated with a prototype implementation.
Comment: 10 Pages, 2 Figures
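The core mechanism the abstract names is multi-dimensional segment evaluation driving what gets rendered on a small screen. A minimal sketch, assuming hypothetical dimension names and a simple "segment budget" standing in for device constraints:

```python
def render_for_mobile(segments, weights, budget=2):
    """Rank segments by a weighted multi-dimensional score and keep
    only as many as the device budget allows. Dimension names and the
    budget notion are illustrative, not the paper's exact model."""
    def score(seg):
        # Weighted sum over evaluation dimensions (e.g. relevance, layout).
        return sum(weights[d] * seg["scores"][d] for d in weights)
    ranked = sorted(segments, key=score, reverse=True)
    return [s["id"] for s in ranked[:budget]]
```

Personalization would enter by deriving the per-dimension weights from the user's profile rather than fixing them globally.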
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches are lacking as they rely on large amounts of
hand-crafted features for classification. This results in models that are
tailored to a specific distribution of web pages, e.g. from a certain time
frame, but lack in generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model.
Comment: WWW20 Demo paper
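The model's stated input is just the HTML tags and words of a page, treated as a token sequence to be labeled. A minimal sketch of that input representation, using the standard-library HTML parser (the neural tagger itself is not reproduced here):

```python
from html.parser import HTMLParser

class TagWordSequencer(HTMLParser):
    """Flatten a page into the (tag, word) token sequence a sequence
    labeler would consume; content/boilerplate labels would then be
    predicted per token."""
    def __init__(self):
        super().__init__()
        self.stack, self.tokens = ["root"], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        # Pair each word with its innermost enclosing tag.
        for word in data.split():
            self.tokens.append((self.stack[-1], word))
```

Because the representation needs no hand-crafted features, the same pipeline can run unchanged on pages from any time frame, which is what gives the approach its generalization claim.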
ENTITY EXTRACTION USING STATISTICAL METHODS USING INTERACTIVE KNOWLEDGE MINING FRAMEWORK
There are various kinds of valuable semantic information about real-world entities embedded in web pages and databases. Extracting and integrating this entity information from the Web is of great significance. Compared to traditional information extraction problems, web entity extraction needs to solve several new challenges to fully take advantage of the unique characteristics of the Web. In this paper, we introduce our recent work on statistical extraction of structured entities, named entities, entity facts, and relations from the Web. We also briefly introduce iKnoweb, an interactive knowledge mining framework for entity information integration. We use two novel web applications, Microsoft Academic Search (aka Libra) and EntityCube, as working examples.
A personalized web page content filtering model based on segmentation
In view of the massive content explosion on the World Wide Web through
diverse sources, content filtering tools have become mandatory. Filtering
the contents of web pages holds particular significance when pages are
accessed by minors. Traditional web page blocking systems follow a Boolean
methodology of either displaying the full page or blocking it completely.
With the increased dynamism of web pages, it has become common for
different portions of a web page to hold different types of content at
different time instances. This paper proposes a model to block contents at
a fine-grained level: instead of completely blocking the page, only those
segments which hold the contents to be blocked are blocked. The advantages
of this method over traditional methods are the fine-grained level of
blocking and the automatic identification of the portions of the page to be
blocked. Experiments conducted on the proposed model indicate 88% accuracy
in filtering out the segments.
Comment: 11 Pages, 6 Figures
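The fine-grained blocking idea, stripped to its essentials, is: classify each segment, and blank only the segments that trigger the filter. A minimal sketch, using a hypothetical keyword trigger in place of the paper's actual content classifier:

```python
def filter_page(segments, banned):
    """Block only the segments containing banned terms, leaving the
    rest of the page intact. The keyword match is a stand-in for a
    real content classifier."""
    out = []
    for text in segments:
        words = {w.lower().strip(".,!") for w in text.split()}
        out.append("[blocked]" if words & banned else text)
    return out
```

The contrast with Boolean blocking is visible in the output: the page survives with only the offending segment replaced, rather than disappearing entirely.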
Multidimensional Web Page Evaluation Model Using Segmentation And Annotations
The evaluation of web pages against a query is the pivot around which the
Information Retrieval domain revolves. The context-sensitive, semantic
evaluation of web pages is a non-trivial problem which needs to be
addressed. This research work proposes a model to evaluate web pages by
cumulating segment scores computed through a multidimensional evaluation
methodology. The proposed model is hybrid since it utilizes both structural
semantics and content semantics in the evaluation process. The score of a
web page is computed in a bottom-up process by evaluating individual
segments' scores through a multi-dimensional approach. The model also
incorporates an approach for segment-level annotation. The proposed model
is prototyped for evaluation; experiments conducted on the prototype
confirm the model's efficiency in semantic evaluation of pages.
Comment: 11 Pages, 4 Figures; International Journal on Cybernetics &
Informatics (IJCI), Vol.1, No.4, August 2012. arXiv admin note: substantial
text overlap with arXiv:1203.361
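The bottom-up scoring the abstract describes — a page score cumulated from multi-dimensional segment scores — can be sketched in a few lines. The dimension names below are illustrative assumptions, not the paper's actual dimensions:

```python
def page_score(segments, weights):
    """Cumulate a page score bottom-up: each segment is scored along
    several dimensions (here, hypothetical 'structure' and 'content'
    semantics), and the page score is the sum over segments."""
    total = 0.0
    for seg in segments:
        total += sum(weights[d] * seg[d] for d in weights)
    return total
```

Keeping the two semantic families as separate weighted dimensions is what makes the model "hybrid": either can be re-weighted without recomputing the other.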
Contextual Extraction of Ontological Concepts for the Semantic Web
Many research efforts concerned with annotation, data integration, web services, etc. rely on ontologies. The development of these applications depends on the conceptual richness of the ontologies. In this article, we present the extraction of ontological concepts from HTML documents. To improve this process, we propose an unsupervised hierarchical clustering algorithm called "Extraction de Concepts Ontologiques" (ECO), which applies the KMeans partitioning algorithm incrementally and is guided by a structural context. This context exploits the HTML structure as well as word position in order to optimize the weighting of each term and the selection of the semantically closest co-occurrent. Guided by this context, our algorithm adopts an incremental process that ensures successive refinement of each word's contexts. It also offers the choice between fully automatic and interactive execution. We experimented with our proposal on a French-language corpus from the tourism domain. The results showed that our algorithm improves the conceptual quality as well as the relevance of the extracted ontological concepts.
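ECO's building block is the KMeans partitioning algorithm, applied incrementally over term contexts. A plain KMeans sketch of that partitioning step, on small numeric feature vectors (the structural-context weighting that guides ECO is not reproduced here):

```python
def kmeans(points, k, iters=10):
    """Plain KMeans partitioning, the step ECO applies incrementally.
    Points are small numeric feature vectors; initialization is the
    first k points, a deliberate simplification."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute centroids; keep the old one if a cluster emptied.
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters
```

In ECO the feature vectors would be context-weighted term representations, and repeated re-clustering refines each word's context rather than running once over a static matrix.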
Filtered-page ranking: an approach to ranking previously filtered HTML documents
Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Computer Science, Florianópolis, 2016. Web page ranking algorithms can be created using content-based, structure-based, or user-search-based techniques. This research addresses a user-search-based approach applied to the ranking of previously filtered documents, which relies on a segmentation process to extract irrelevant content from documents before ranking. The process splits the document into three categories of blocks in order to fragment the document and eliminate irrelevant content.
The ranking method, called Filtered-Page Ranking (FPR), has two main steps: (i) irrelevant content extraction; and (ii) document ranking. The focus of the extraction step is to eliminate content unrelated to the user's query by means of the proposed Query-Based Blocks Mining (QBM) algorithm, creating a tree that is evaluated in the ranking process, so that ranking considers only relevant content. During the ranking step, the focus is to calculate the relevance of each document for a given query, using criteria established in information retrieval studies that give importance to specific parts of the document and to the highlighted features of some HTML elements. The proposal is compared to two baselines, the classic vector model and the CETR noise-removal algorithm; the results demonstrate that the irrelevant-content-removal algorithm improves the results and that the relevance criteria are pertinent to the process, producing web page rankings of better average quality than the classic vector model.
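The two-step pipeline the abstract describes — drop query-unrelated blocks, then rank on what survives — can be sketched compactly. The overlap test and coverage score below are hypothetical stand-ins for QBM and the dissertation's relevance criteria:

```python
def filter_and_rank(pages, query):
    """Two-step sketch of the FPR pipeline: (i) discard blocks with no
    query overlap (a stand-in for QBM), then (ii) rank pages by the
    fraction of query terms their surviving blocks cover (a stand-in
    for the dissertation's relevance criteria)."""
    q = set(query.lower().split())
    ranked = []
    for name, blocks in pages.items():
        # Step (i): keep only blocks sharing at least one query term.
        kept = [b for b in blocks if q & set(b.lower().split())]
        covered = (set().union(*(set(b.lower().split()) for b in kept))
                   if kept else set())
        # Step (ii): score by query-term coverage of surviving blocks.
        ranked.append((len(q & covered) / len(q), name))
    return [name for score, name in sorted(ranked, reverse=True)]
```

The point of filtering before ranking is that noise blocks (ads, navigation) can no longer dilute the relevance score of an otherwise on-topic page.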
A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining
This chapter addresses two crucial issues that arise when applying Web-mining techniques to extract relevant information. The first is the acquisition of useful knowledge from textual data; the second stems from the fact that a web page often presents a considerable amount of 'noise' relative to the sections that are truly informative for the user's purposes. The novel contribution of this work lies in a framework that can tackle both these tasks at the same time, supporting text summarization and page segmentation. The approach achieves this goal by exploiting semantic networks to map natural language into an abstract representation, which ultimately supports the identification of the topics addressed in a text source. A heuristic algorithm uses the abstract representation to highlight the relevant segments of text in the original document. The verification of the approach's effectiveness involved a publicly available benchmark, the DUC 2002 dataset, and satisfactory results confirmed the method's effectiveness.