
    Predicting content change on the web


    Changes in corporate websites and business activity: automatic classification of corporate webpages

    Every time a firm or institution performs an activity on the Web, this is registered, leaving a "digital footprint". Part of this digital footprint is reflected on their websites, as these officially represent them on the Web. We plan to automatically monitor the changes that periodically occur on a website and relate them to business activity. The aim of this paper is to propose a theoretical classification of corporate webpages that associates the changes occurring on them with the regular activity of the firms, and to evaluate the possibility of automatic categorization using classification models. To generate the classification, a significant number of current corporate webpages were analyzed, and four theoretical types of corporate webpages were distinguished. To evaluate the automatic categorization, a dataset of 1005 current corporate webpages was generated by manually labeling them, and their automatic categorization was evaluated using classification models. This work was partially supported by grant PID2019-107765RB-I00, funded by MCIN/AEI/10.13039/501100011033. Valenzuela Rubilar, JM.; Domenech, J.; Pont, A. (2022). Changes in corporate websites and business activity: automatic classification of corporate webpages. In: 4th International Conference on Advanced Research Methods and Analytics (CARMA 2022). Editorial Universitat Politècnica de València. 213-220. https://doi.org/10.4995/CARMA2022.2022.1509021322
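    A minimal sketch of how such an automatic categorization could be set up, assuming a manually labeled CSV of page texts with one of four page-type labels; the file name, column names, and the TF-IDF plus logistic-regression pipeline are illustrative assumptions, not the authors' exact setup:

    ```python
    # Hypothetical sketch: categorizing corporate webpages into four theoretical types.
    # Assumes a file "labeled_pages.csv" with columns "text" and "page_type" (4 classes).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    df = pd.read_csv("labeled_pages.csv")          # ~1005 manually labeled pages
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["page_type"], test_size=0.2, stratify=df["page_type"], random_state=0
    )

    # Bag-of-words features plus a linear classifier as a simple baseline.
    model = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    ```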

    LiveRank: How to Refresh Old Datasets

    This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (Web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call LiveRank a ranking of the old pages such that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on the PageRank can lead to efficient LiveRanks, for Web graphs as well as for online social networks.
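    A minimal sketch of a static, PageRank-based LiveRank, assuming the old snapshot is available as a directed graph and that an `is_alive` function wraps the online liveness query; both names are assumptions for illustration, not the paper's exact procedure:

    ```python
    # Hypothetical sketch: a static LiveRank built on PageRank.
    # Nodes of the old snapshot are queried in decreasing PageRank order
    # until a desired fraction of the still-alive nodes has been found.
    import networkx as nx

    def is_alive(node):
        """Placeholder for the online liveness test (e.g., an HTTP request)."""
        raise NotImplementedError

    def static_liverank(graph):
        """Order the old nodes by PageRank computed on the old snapshot."""
        scores = nx.pagerank(graph, alpha=0.85)
        return sorted(graph.nodes, key=lambda n: scores[n], reverse=True)

    def queries_for_fraction(order, target_fraction, total_alive):
        """Number of liveness queries needed to find target_fraction of the alive nodes."""
        found, queries = 0, 0
        for node in order:
            queries += 1
            if is_alive(node):
                found += 1
            if found >= target_fraction * total_alive:
                break
        return queries
    ```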

    Modeling and predicting temporal patterns of web content changes

    The technologies aimed at Web content discovery, retrieval and management face the compelling need of coping with the highly dynamic nature of Web content coupled with complex user interactions. This paper analyzes the temporal patterns of the content changes of three major news websites with the objective of modeling and predicting their dynamics. It has been observed that changes are characterized by a time-dependent behavior with large fluctuations and significant differences across hours and days. To explain this behavior, we represent the change patterns as time series. The trend and seasonal components of the observed time series capture the weekly and daily periodicity, whereas the irregular components account for the remaining fluctuations. Models based on trigonometric polynomials and ARMA components accurately reproduce the dynamics of the empirical change patterns and provide extrapolations into the future to be used for forecasting.
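    A minimal sketch of this kind of model, assuming an hourly series of change counts: deterministic trigonometric (Fourier) terms capture daily and weekly periodicity while an ARMA component absorbs the remaining fluctuations. The file name, seasonal periods, number of harmonics, and ARMA orders are illustrative assumptions, not the exact specification of the paper:

    ```python
    # Hypothetical sketch: seasonal change-rate model with Fourier terms + ARMA errors.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def fourier_terms(n, period, harmonics):
        """Deterministic sine/cosine regressors for one seasonal period."""
        t = np.arange(n)
        cols = {}
        for k in range(1, harmonics + 1):
            cols[f"sin_{period}_{k}"] = np.sin(2 * np.pi * k * t / period)
            cols[f"cos_{period}_{k}"] = np.cos(2 * np.pi * k * t / period)
        return pd.DataFrame(cols)

    y = pd.read_csv("hourly_changes.csv")["changes"]              # assumed hourly change counts
    exog = pd.concat([fourier_terms(len(y), 24, 3),                # daily periodicity
                      fourier_terms(len(y), 24 * 7, 3)], axis=1)   # weekly periodicity

    model = SARIMAX(y, exog=exog, order=(2, 0, 1)).fit(disp=False)

    # Forecast the next day using Fourier terms extended into the future.
    h = 24
    future = pd.concat([fourier_terms(len(y) + h, 24, 3),
                        fourier_terms(len(y) + h, 24 * 7, 3)], axis=1).iloc[-h:]
    print(model.forecast(steps=h, exog=future))
    ```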

    Towards a Timely Prediction of Earthquake Intensity with Social Media

    A growing number of people are turning to social media in the aftermath of emergencies to search for and publish critical and up-to-date information. Retrieval and exploitation of such information may prove crucial to decision makers in order to minimize the impact of disasters on the population and the infrastructure. Yet, to date, the task of automatically assessing the consequences of disasters has received little to no attention. Our work aims to bridge this gap, merging the theory behind statistical learning and predictive models with the data behind social media. Here we investigate the exploitation of Twitter data for the improvement of earthquake emergency management. We adopt a set of predictive linear models and evaluate their ability to map the intensity of worldwide earthquakes. The models build on a dataset of almost 5 million tweets and more than 7,000 globally distributed earthquakes. We run and discuss diagnostic tests and simulations on the generated models to assess their significance and avoid overfitting. Finally, we deal with the interpretation of the relations uncovered by the linear models and conclude by illustrating how the findings reported in this work can be leveraged by existing emergency management systems. Overall, the results show the effectiveness of the proposed techniques and allow the earthquake intensity to be estimated far earlier than conventional methods do. The employment of the proposed solutions can help understand where damage actually occurred, in order to define where to concentrate rescue teams and organize a prompt emergency response.
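    A minimal sketch of the predictive-linear-model idea, assuming a table with one row per earthquake containing tweet-derived features and a measured intensity label; the file name and feature names are illustrative assumptions, not the paper's actual feature set:

    ```python
    # Hypothetical sketch: linear model mapping tweet-derived features to earthquake intensity.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("earthquake_tweets.csv")     # assumed: one row per earthquake
    features = ["tweets_first_10min", "unique_users", "mean_tweet_length", "retweet_ratio"]
    X, y = df[features], df["intensity"]

    model = LinearRegression()
    # Cross-validation as a basic guard against overfitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("mean R^2:", scores.mean())

    model.fit(X, y)
    print(dict(zip(features, model.coef_)))       # inspect the uncovered relations
    ```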

    Measuring the semantic uncertainty of news events for evolution potential estimation

    The evolution potential estimation of news events can support the decision making of both corporations and governments. For example, a corporation could manage a public relations crisis in a timely manner if a negative news event about it is known in advance to have large evolution potential. However, existing state-of-the-art methods are mainly based on historical time series data, which makes them unsuitable for news events with limited historical data and bursty behavior. In this article, we propose a purely content-based method to estimate the evolution potential of news events. The proposed method considers a news event at a given time point as a system composed of different keywords, and the uncertainty of this system is defined and measured as the Semantic Uncertainty of the news event. At the same time, an uncertainty space is constructed with two extreme states: the most uncertain state and the most certain state. We believe that the Semantic Uncertainty correlates with the content evolution of news events, so it can be used to estimate their evolution potential. To verify the proposed method, we present detailed experimental setups and results measuring the correlation of the Semantic Uncertainty with the Content Change of news events on collected news event data. The results show that this correlation does exist and is stronger than the correlation between the value from the time-series-based method and the Content Change. Therefore, the Semantic Uncertainty can be used to estimate the evolution potential of news events.
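    One plausible reading of a keyword-based uncertainty measure is sketched below as the normalized entropy of an event's keyword distribution, with the uniform distribution as the most uncertain state and a single dominant keyword as the most certain state; this formulation is an assumption for illustration and not necessarily the authors' exact definition of Semantic Uncertainty:

    ```python
    # Hypothetical sketch: entropy-style uncertainty of a news event's keyword distribution.
    import math
    from collections import Counter

    def semantic_uncertainty(keywords):
        """Normalized entropy in [0, 1]: 1 ~ most uncertain state, 0 ~ most certain state."""
        counts = Counter(keywords)
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        if len(probs) <= 1:
            return 0.0
        entropy = -sum(p * math.log(p) for p in probs)
        return entropy / math.log(len(probs))   # divide by the entropy of the uniform state

    # Toy example: a diffuse event versus one dominated by a single keyword.
    event_t0 = ["quake", "rescue", "damage", "quake", "aid", "school", "power", "roads"]
    event_t1 = ["quake"] * 9 + ["aid"]
    print(semantic_uncertainty(event_t0), semantic_uncertainty(event_t1))
    ```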

    Prediction of new outlinks for focused Web crawling

    Discovering new hyperlinks enables Web crawlers to find new pages that have not yet been indexed. This is especially important for focused crawlers because they strive to provide a comprehensive analysis of specific parts of the Web, thus prioritizing the discovery of new pages over the discovery of changes in content. In the literature, changes in hyperlinks and content have usually been considered simultaneously. However, there is also evidence suggesting that these two types of changes are not necessarily related. Moreover, many studies about predicting changes assume that a long history of a page is available, which is unattainable in practice. The aim of this work is to provide a methodology for detecting new links effectively using a short history. To this end, we use a dataset of ten crawls at intervals of one week. Our study consists of three parts. First, we obtain insight into the data by analyzing empirical properties of the number of new outlinks. We observe that these properties are, on average, stable over time, but that there is a large difference between the emergence of hyperlinks towards pages within and outside the domain of a target page (internal and external outlinks, respectively). Next, we provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. These models include the features used earlier in the literature, as well as new features introduced in this work. We analyze the correlation between the features and investigate their informativeness. A notable finding is that, if the history of the target page is not available, our new features, which represent the history of related pages, are the most predictive of new links in the target page. Finally, we propose ranking methods as guidelines for focused crawlers to efficiently discover new pages; these methods achieve excellent performance with respect to the corresponding targets.
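    A minimal sketch of one of the three targets, the presence of new links, predicted from short-history features of a page and of related pages and then used to rank pages for the next crawl cycle; the file name, feature names, and logistic-regression baseline are illustrative assumptions, not the exact models of the paper:

    ```python
    # Hypothetical sketch: predicting the presence of new outlinks from a short crawl history
    # and ranking pages by that prediction to guide a focused crawler.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("weekly_crawls.csv")    # assumed: one row per page per weekly snapshot
    features = [
        "new_internal_links_prev_week",      # short history of the page itself
        "new_external_links_prev_week",
        "related_pages_new_links_prev_week", # history of related pages, useful when
    ]                                        # the target page has no history of its own
    X, y = df[features], df["has_new_links_next_week"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

    # Rank pages by predicted probability of new links for the next crawl cycle.
    df["score"] = clf.predict_proba(X)[:, 1]
    print(df.sort_values("score", ascending=False).head())
    ```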

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes on the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to the pages with the most new outlinks and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two previously unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
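    A minimal sketch of the 'look back, look around' idea with a Poisson-loss learner; the file name and feature names are illustrative assumptions, and scikit-learn's Poisson gradient boosting is used here as a stand-in for the NGBoost-with-Poisson setup mentioned in the abstract:

    ```python
    # Hypothetical sketch: LBLA-style features (recent new-outlink history of the page itself
    # and of its content-related pages) with a Poisson-loss gradient boosting learner.
    import pandas as pd
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_poisson_deviance

    df = pd.read_csv("crawl_history.csv")                                     # assumed dataset
    look_back = ["own_new_outlinks_t-1", "own_new_outlinks_t-2"]              # the page itself
    look_around = ["related_new_outlinks_t-1", "related_new_outlinks_t-2"]    # content-related pages
    X, y = df[look_back + look_around], df["new_outlinks_t"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = HistGradientBoostingRegressor(loss="poisson").fit(X_tr, y_tr)
    print(mean_poisson_deviance(y_te, model.predict(X_te)))
    ```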

    LiveRank: How to Refresh Old Crawls

    This paper considers the problem of refreshing a crawl. More precisely, given a collection of Web pages (with hyperlinks) gathered at some time, we want to identify a significant fraction of these pages that still exist at present time. The liveness of an old page can be tested through an online query at present time. We call LiveRank a ranking of the old pages such that active pages are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the alive pages when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on the PageRank can lead to efficient LiveRanks for Web graphs.
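    Since this version emphasizes the cost measure, here is a small sketch of how a LiveRank ordering could be scored offline, assuming the ground-truth liveness of each old page is known after the fact; the function and variable names are assumptions for illustration:

    ```python
    # Hypothetical sketch: offline evaluation of a LiveRank ordering.
    # cost(f) = number of queries needed to retrieve a fraction f of the alive pages,
    # so lower values mean a better ordering.

    def liverank_cost(order, alive, fractions=(0.25, 0.5, 0.75, 0.9)):
        """order: pages sorted by the LiveRank; alive: set of pages known to be still alive."""
        targets = {f: int(f * len(alive)) for f in fractions}
        costs, found = {}, 0
        for queries, page in enumerate(order, start=1):
            if page in alive:
                found += 1
            for f, needed in targets.items():
                if f not in costs and found >= needed:
                    costs[f] = queries
        return costs

    # Example: compare a PageRank-based ordering against a random one.
    # print(liverank_cost(pagerank_order, alive_pages))
    # print(liverank_cost(random_order, alive_pages))
    ```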