1,054 research outputs found
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
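As a hedged illustration of the kind of structured-data extraction the report advocates, the sketch below pulls schema.org microdata out of a blog post's HTML with BeautifulSoup; it is not the BlogForever implementation, and the element handling is deliberately minimal.

# Illustrative only, not the BlogForever platform's code: collect schema.org
# microdata items (itemscope/itemprop) from a blog post's HTML.
from bs4 import BeautifulSoup

def extract_microdata(html):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for scope in soup.find_all(attrs={"itemscope": True}):
        item = {"type": scope.get("itemtype")}
        for prop in scope.find_all(attrs={"itemprop": True}):
            # Prefer machine-readable attribute values, fall back to visible text.
            value = prop.get("content") or prop.get("datetime") or prop.get_text(strip=True)
            item[prop["itemprop"]] = value
        items.append(item)
    return items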
Exploiting multimedia in creating and analysing multimedia Web archives
The data contained on the web and the social web are inherently multimedia and consist of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.
The Corpus Expansion Toolkit: finding what we want on the web
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis, and of the development of the CET, is to provide a solution for discovering desired content on the web from possibly unknown locations or within a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and to expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components, the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collect, store, filter and perform information extraction on the gathered documents. Using a small number of example 'seed' documents, the CET is able to expand the original corpus by finding more relevant documents from the web, and it provides a number of tools to support their analysis. This thesis presents a case study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies, covering a wide variety of domains and requirements to which the CET has been applied. These case studies illustrate three main use cases, listed below, in which the CET is applicable:
1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown
First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but it is unknown where relevant documents can be found on the web. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic, iteratively build a clear conception of a domain, and translate that conception into a computational system.
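A minimal sketch of the iterative expansion loop described above, assuming scikit-learn for the similarity filter; the function names, the threshold and the caller-supplied fetch_candidates step are illustrative placeholders rather than the CET's actual components.

# Hypothetical sketch, not the CET's actual pipeline: grow a seed corpus by
# keeping crawled candidates that are sufficiently similar to what we already have.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand_corpus(seed_docs, fetch_candidates, rounds=3, threshold=0.2):
    corpus = list(seed_docs)
    for _ in range(rounds):
        vectoriser = TfidfVectorizer(stop_words="english")
        seed_matrix = vectoriser.fit_transform(corpus)
        candidates = fetch_candidates(corpus)  # search/crawl step, supplied by the caller
        if not candidates:
            break
        # Keep candidates whose best similarity to the current corpus clears the threshold.
        sims = cosine_similarity(vectoriser.transform(candidates), seed_matrix).max(axis=1)
        corpus.extend(doc for doc, s in zip(candidates, sims) if s >= threshold)
        # A human-in-the-loop review of the newly accepted documents could be inserted here.
    return corpus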
Learning to Hash-tag Videos with Tag2Vec
User-given tags or labels are valuable resources for semantic understanding of visual media such as images and videos. Recently, a new type of labeling mechanism known as hash-tags has become increasingly popular on social media sites. In this paper, we study the problem of generating relevant and useful hash-tags for short video clips. Traditional data-driven approaches for tag enrichment and recommendation use direct visual similarity for label transfer and propagation. We attempt to learn a direct low-cost mapping from videos to hash-tags using a two-step training process. We first employ a natural language processing (NLP) technique, skip-gram models with neural network training, to learn a low-dimensional vector representation of hash-tags (Tag2Vec) using a corpus of 10 million hash-tags. We then train an embedding function to map video features to the low-dimensional Tag2Vec space. We learn this embedding for 29 categories of short video clips with hash-tags. A query video without any tag information can then be directly mapped to the vector space of tags using the learned embedding, and relevant tags can be found by performing a simple nearest-neighbor retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
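A hedged sketch of the two-step pipeline the abstract describes, using gensim for the skip-gram step and a linear regressor for the video-to-tag embedding; the tiny tag corpus, the random stand-in video features and all dimensions are placeholder assumptions, not the authors' setup.

# Illustrative two-step Tag2Vec-style pipeline (toy data, not the paper's models).
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import Ridge

# Step 1: skip-gram (sg=1) embeddings over hash-tag co-occurrences; the tags
# attached to one clip act as a single "sentence".
tag_sentences = [
    ["#puppy", "#dog", "#cute"],
    ["#skateboard", "#fail", "#funny"],
    ["#dog", "#fetch", "#funny"],
]
tag2vec = Word2Vec(tag_sentences, vector_size=16, sg=1, window=5, min_count=1, seed=0)

# Step 2: learn a linear embedding from video features into the Tag2Vec space.
rng = np.random.default_rng(0)
video_feats = rng.random((len(tag_sentences), 128))  # stand-in for CNN video features
targets = np.array([tag2vec.wv[tags[0]] for tags in tag_sentences])
embed = Ridge(alpha=1.0).fit(video_feats, targets)

# Tagging a new clip: map its features, then nearest-neighbour lookup among tag vectors.
query_vec = embed.predict(rng.random((1, 128)))[0]
print(tag2vec.wv.similar_by_vector(query_vec, topn=3))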
Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
Master's dissertation in Computer Science.
Continuous social and economic development has led over time to an increase in consumption, as well as greater demand from consumers for better and cheaper products. Hence, the selling price of a product assumes a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as the production/acquisition cost, the positioning of the product (e.g. an anchor product) and the strategies of competing companies. The work done by market analysts has changed drastically over the last years.
As the number of websites increases exponentially, the number of e-commerce websites also grows, and web page classification becomes more important in fields such as web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them inappropriate for use in a broader context. We introduce an ensemble of methods, together with a subsequent study of their results, to create a more generic and modular crawler and scraper for the detection of, and information extraction from, e-commerce web pages. The collected information may then be processed and used in the pricing decision. This framework goes by the name Prometheus and has the goal of extracting knowledge from e-commerce websites.
The process requires crawling an online store and gathering product pages. This implies that, given a web page, the framework must be able to determine whether it is a product page. In order to achieve this, we classify pages into three categories: catalogue, product and "spam". The page classification stage was addressed based on the HTML text as well as on the visual layout, featuring both traditional methods and deep learning approaches.
Once a set of product pages has been identified, we proceed to the extraction of the pricing information. This is not a trivial task due to the wide variety of ways in which a web page can be built. Furthermore, most product pages are dynamic in the sense that they are really a page for a family of related products. For instance, when visiting a shoe store, a particular model is probably available in a number of sizes and colours. Such a model may be displayed in a single dynamic web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.
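As a rough illustration of the page-classification stage, the sketch below trains a traditional text-only classifier over the three categories named above (catalogue, product, spam); the toy training pages and the choice of TF-IDF with logistic regression are assumptions made for illustration, not the Prometheus pipeline, which also exploits the visual layout and deep learning models.

# Illustrative text-based page classifier (toy data, not the Prometheus dataset).
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def page_text(html):
    # Strip tags so the classifier only sees the visible text of the page.
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

# Placeholder training data: raw HTML of pages labelled catalogue / product / spam.
train_html = [
    "<h1>Running shoes</h1><span class='price'>59.99</span> Add to cart",
    "<ul><li>Shoes</li><li>Shirts</li><li>Shorts</li></ul> Browse categories",
    "<p>Win a free iPhone!!! Click here</p>",
]
train_labels = ["product", "catalogue", "spam"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit([page_text(h) for h in train_html], train_labels)

# A crawler would call this on every fetched page and only scrape "product" pages.
print(classifier.predict([page_text("<h1>Trail boots</h1> 89.90 EUR Add to basket")]))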
Multi-dimensional data refining strategy for effective fine-tuning LLMs
Data is a cornerstone for fine-tuning large language models, yet acquiring suitable data remains challenging. The challenges encompass data scarcity, linguistic diversity, and domain-specific content. This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models. Crafting such a dataset, while accounting for linguistic intricacies and striking a balance between inclusivity and accuracy, demands meticulous planning. Our paper presents a multidimensional strategy that includes leveraging existing English-language datasets and developing customized data-crawling scripts with the assistance of generative AI tools. A fine-tuned LLM for Vietnamese, produced using the resulting datasets, demonstrated good performance when generating Vietnamese news articles from prompts. The study offers practical solutions and guidance for future fine-tuning of models in languages like Vietnamese.
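A hypothetical sketch of the crawl-and-refine step the paper describes; the URLs, the length threshold and the use of langdetect for the Vietnamese-language check are illustrative assumptions, not the authors' actual scripts.

# Hypothetical crawl-and-refine script: fetch pages, keep only sufficiently long
# Vietnamese text, and write JSONL records suitable for fine-tuning.
import json
import requests
from bs4 import BeautifulSoup
from langdetect import detect  # assumption: a simple language check is acceptable here

SEED_URLS = ["https://example.vn/news/1", "https://example.vn/news/2"]  # placeholders

def clean_article(url):
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    # Refining step: discard short pages and pages that are not in Vietnamese.
    if len(text.split()) < 200 or detect(text) != "vi":
        return None
    return {"url": url, "text": text}

with open("viet_finetune.jsonl", "w", encoding="utf-8") as out:
    for url in SEED_URLS:
        record = clean_article(url)
        if record:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")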
Determinants of Voluntary Organizations’ Attention on Facebook: The Case of Norwegian Voluntary Organizations
- …