713 research outputs found

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on the page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two previously unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
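
    As a sketch of how the probabilistic avenue could be exercised, the snippet below fits NGBoost with a Poisson output distribution to toy "look back, look around"-style count features. It assumes the ngboost Python package; the two features and the synthetic data are illustrative stand-ins, not the paper's dataset.

```python
# Hedged sketch: NGBoost with a Poisson distribution for new-outlink counts.
# Feature construction and data are illustrative assumptions.
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Poisson

rng = np.random.default_rng(0)

# Toy "look back, look around" features per page:
#   col 0: new outlinks on the page itself in the previous crawl,
#   col 1: mean new outlinks on content-related pages in the previous crawl.
X = rng.poisson(lam=3.0, size=(500, 2)).astype(float)
y = rng.poisson(lam=1.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1])

model = NGBRegressor(Dist=Poisson, n_estimators=200, verbose=False)
model.fit(X, y)

# The predicted Poisson rate can double as a crawl-priority score.
print(model.predict(X[:5]))  # expected new-outlink counts, first five pages
```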

    Web Mining For Financial Market Prediction Based On Online Sentiments

    Financial market prediction is a critically important research topic in financial data mining because of its potential commercial applications and attractive profits. Previous studies in financial market prediction mainly focus on financial and economic indicators. Web information, as an information repository, has been used in customer relationship management and recommendation, but it is rarely considered useful for financial market prediction. In this paper, a combined web mining and sentiment analysis method is proposed to forecast financial markets using web information. In the proposed method, a spider is first employed to crawl tweets from Twitter. Second, OpinionFinder is used to mine the online sentiments hidden in the tweets. Third, new sentiment indicators are proposed and a stochastic time effective function (STEF) is introduced to integrate the daily sentiments. Fourth, support vector regressions (SVRs) are used to model the relationship between online sentiments and financial market prices. Finally, the selected model is used for financial market prediction. To validate the proposed method, the Standard and Poor's 500 Index (S&P 500) is used for evaluation. The empirical results show that the proposed forecasting method outperforms traditional forecasting methods and can also capture individual behavior in financial markets quickly and easily. These findings imply that the proposed method is a promising approach for financial market prediction.
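
    A minimal sketch of the last two steps, under stated assumptions: the paper's exact STEF is not reproduced, so a simple exponential time decay stands in for it, scikit-learn's SVR stands in for the SVR models, and all data is synthetic.

```python
# Hedged sketch: time-effective sentiment weighting plus SVR forecasting.
# The decay weighting is an assumption standing in for the paper's STEF.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
days = 300
sentiment = rng.normal(0.0, 1.0, days)           # daily aggregate sentiment
price = 100 + np.cumsum(rng.normal(0, 1, days))  # synthetic index levels

# Time-effective weighting: recent days count more than older ones.
decay = 0.05
weights = np.exp(-decay * np.arange(days)[::-1])
weighted_sent = sentiment * weights / weights.max()

# Features: yesterday's price and weighted sentiment; target: today's price.
X = np.column_stack([price[:-1], weighted_sent[:-1]])
y = price[1:]

model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X[:250], y[:250])
print(model.predict(X[250:255]))  # one-step-ahead forecasts
```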

    Temporal models for mining, ranking and recommendation in the Web

    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modeling along the time dimension: entity and event. Time crucially serves as the context for changes driven by happenings and phenomena (events) related to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task to support consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions in this thesis: (1) Query recommendation and document ranking in the Web: we address the issues of suggesting entity-centric queries and of ranking effectiveness around the happening time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former, and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia: we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, the link-based graph and content) and uses the collective attention from user navigation as supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives: we tackle the problem of discovering important documents along the time span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and the natural time lag in Web Archives when mining temporal authority. (4) Methods for enhancing predictive models at an early stage in the social media and clinical domains: we investigate several methods to control model instability and enrich the contexts of predictive models in the "cold-start" period, and we demonstrate their effectiveness for rumor detection and blood glucose prediction, respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations and boost ranking and recommendation effectiveness over time.
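
    As an illustration of the idea behind contribution (1), the sketch below blends the scores of hypothetical phase-specific temporal models (before, during and after an event) with soft, time-dependent weights so that rankings change smoothly across phase boundaries. The logistic weighting is an assumption for illustration, not the thesis's actual optimization framework.

```python
# Illustrative sketch: smooth blending of phase-specific model scores.
# The logistic phase weights are an assumption, not the thesis's method.
import numpy as np

def phase_weights(t, event_start, event_end, softness=5.0):
    """Soft membership of time t in the before/during/after phases."""
    before = 1.0 / (1.0 + np.exp((t - event_start) / softness))
    after = 1.0 / (1.0 + np.exp((event_end - t) / softness))
    during = 1.0 - before - after
    w = np.clip(np.array([before, during, after]), 0.0, None)
    return w / w.sum()

def blended_score(t, scores_by_phase, event_start=10, event_end=20):
    """Convex combination of the three phase models' scores at time t."""
    w = phase_weights(t, event_start, event_end)
    return float(w @ scores_by_phase)

# Example: one candidate query scored by three phase-specific models.
scores = np.array([0.2, 0.9, 0.4])  # before, during, after
for t in (5, 15, 25):
    print(t, round(blended_score(t, scores), 3))
```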

    Deep Learning based Densenet Convolution Neural Network for Community Detection in Online Social Networks

    Online Social Networks (OSNs) have become increasingly popular, with hundreds of millions of users in recent years. A community in a social network is a virtual group whose members share interests and activities and want to communicate. The growth of OSNs and their user numbers has also increased the need for community detection. Community structure is an important topological property of OSNs and plays an essential role in various dynamic processes, including the diffusion of information within the network. All such networks have a community structure, and finding communities is one of the most continually addressed research issues. However, traditional techniques do not discover communities of shared user interests well and therefore cannot detect active communities. To tackle this issue, this paper presents a Densenet Convolution Neural Network (DnetCNN) approach for community detection. Initially, we gather a dataset from the Kaggle repository. The dataset is then preprocessed to remove inconsistent and missing values. In addition, a User Behavior Impact Rate (UBIR) technique identifies user URL accesses, key terms and page accesses. After that, a Web Crawling Prone Factor Rate (WCPFR) technique is used to find malicious activity, using random forest and decision tree methods. Furthermore, a Spider Web Cluster Community based Feature Selection (SWC2FS) algorithm is used to choose the finest attributes in the dataset. Based on these attributes, community groups are found using the DnetCNN approach. The experimental results show better performance than other methods.
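
    The abstract does not specify the DnetCNN architecture, so the following is only an illustrative sketch of the core DenseNet idea, each layer consuming the concatenation of all earlier outputs, applied to selected user attributes with communities as class labels (PyTorch; all names and sizes are hypothetical).

```python
# Illustrative sketch of dense connectivity over selected user attributes.
# Not the paper's DnetCNN; architecture details are assumptions.
import torch
import torch.nn as nn

class DenseBlockMLP(nn.Module):
    def __init__(self, in_features, growth=16, layers=3, n_communities=4):
        super().__init__()
        self.layers = nn.ModuleList()
        width = in_features
        for _ in range(layers):
            self.layers.append(nn.Sequential(nn.Linear(width, growth), nn.ReLU()))
            width += growth  # dense connectivity: inputs accumulate
        self.head = nn.Linear(width, n_communities)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all previous features.
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.head(torch.cat(feats, dim=1))

model = DenseBlockMLP(in_features=10)
logits = model(torch.randn(8, 10))  # 8 users, 10 selected attributes
print(logits.argmax(dim=1))         # predicted community per user
```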

    Data analytics 2016: proceedings of the fifth international conference on data analytics


    Fair comparison of skin detection approaches on publicly available datasets

    Skin detection is the process of discriminating skin and non-skin regions in a digital image, and it is widely used in several applications ranging from hand gesture analysis to body-part tracking and face detection. Skin detection is a challenging problem that has drawn extensive attention from the research community; nevertheless, a fair comparison among approaches is very difficult due to the lack of a common benchmark and a unified testing protocol. In this work, we survey the most recent research in this field and propose a fair comparison among approaches using several different datasets. The major contributions of this work are an exhaustive literature review of skin color detection approaches; a framework to evaluate and combine different skin detector approaches, whose source code is made freely available for future research; and an extensive experimental comparison among several recent methods, which have also been used to define an ensemble that works well in many different problems. Experiments are carried out on 10 different datasets comprising more than 10000 labelled images: the experimental results confirm that the best method proposed here performs very well with respect to other stand-alone approaches, without requiring ad hoc parameter tuning. A MATLAB version of the framework for testing, and of the methods proposed in this paper, will be freely available from https://github.com/LorisNann
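
    The paper's own framework is MATLAB; as a hedged Python illustration of one classic stand-alone baseline that such comparisons typically include, the snippet below applies an explicit RGB thresholding rule (in the spirit of Peer et al.) to mark skin pixels.

```python
# Hedged sketch of a classic explicit-threshold skin detection baseline.
# Not the paper's framework; the rule is a well-known RGB heuristic.
import numpy as np

def skin_mask_rgb(image):
    """image: HxWx3 uint8 RGB array -> boolean skin mask."""
    img = image.astype(np.int32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (
        (r > 95) & (g > 40) & (b > 20)
        & ((img.max(axis=-1) - img.min(axis=-1)) > 15)  # enough colour spread
        & (np.abs(r - g) > 15) & (r > g) & (r > b)
    )

# Example on a random image; a real evaluation would use labelled datasets.
demo = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(skin_mask_rgb(demo).mean())  # fraction of pixels classified as skin
```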

    Motion Planning of Uncertain Ordinary Differential Equation Systems

    This work presents a novel motion planning framework, rooted in nonlinear programming theory, that treats uncertain fully and under-actuated dynamical systems described by ordinary differential equations. Uncertainty in multibody dynamical systems comes from various sources, such as system parameters, initial conditions, sensor and actuator noise, and external forcing. Treatment of uncertainty in design is of paramount practical importance because all real-life systems are affected by it, and poor robustness and suboptimal performance result if it is not accounted for in a given design. In this work, uncertainties are modeled using Generalized Polynomial Chaos and are solved for quantitatively using a least-squares collocation method. The computational efficiency of this approach enables the inclusion of uncertainty statistics in the nonlinear programming optimization process. As such, the proposed framework allows the user to pose, and answer, new design questions related to uncertain dynamical systems. Specifically, the new framework is explained in the context of forward, inverse, and hybrid dynamics formulations. The forward dynamics formulation, applicable to both fully and under-actuated systems, prescribes deterministic actuator inputs, which yield uncertain state trajectories. The inverse dynamics formulation is the dual of the forward dynamics formulation and is applicable only to fully-actuated systems; deterministic state trajectories are prescribed and yield uncertain actuator inputs. The inverse dynamics formulation is more computationally efficient, as it requires only algebraic evaluations and completely avoids numerical integration. Finally, the hybrid dynamics formulation is applicable to under-actuated systems, where it leverages the benefits of inverse dynamics for actuated joints and forward dynamics for unactuated joints; it prescribes actuated state and unactuated input trajectories, which yield uncertain unactuated states and actuated inputs. The benefits of the ability to quantify uncertainty when planning the motion of multibody dynamic systems are illustrated through several case studies. The resulting designs determine optimal motion plans, subject to deterministic and statistical constraints, for all possible systems within the probability space.
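
    A minimal sketch of the Generalized Polynomial Chaos step under stated assumptions: one standard-normal uncertain parameter, a probabilists' Hermite basis, coefficients fitted by least-squares collocation, and a toy response function standing in for the ODE system.

```python
# Hedged sketch: gPC expansion fitted by least-squares collocation.
# The scalar response below is a toy stand-in for the dynamical system.
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(2)

def model_output(xi):
    """Toy response of the system to the uncertain parameter xi."""
    return np.exp(0.3 * xi) + 0.1 * xi**2

degree = 4
xi = rng.standard_normal(200)     # collocation samples of the uncertainty
Phi = He.hermevander(xi, degree)  # Hermite basis evaluated at the samples
y = model_output(xi)
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Statistics follow from the chaos coefficients: for probabilists' Hermite
# polynomials, E[He_k^2] = k!, so variance = sum_k>=1 c_k^2 * k!.
norms = np.array([factorial(k) for k in range(degree + 1)], dtype=float)
mean = coeffs[0]
var = np.sum(coeffs[1:] ** 2 * norms[1:])
print(f"mean ~ {mean:.4f}, variance ~ {var:.4f}")
```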

    Agriculture 4.0 and beyond: Evaluating cyber threat intelligence sources and techniques in smart farming ecosystems

    The digitisation of agriculture, integral to Agriculture 4.0, has brought significant benefits while simultaneously escalating cybersecurity risks. With the rapid adoption of smart farming technologies and infrastructure, the agricultural sector has become an attractive target for cyberattacks. This paper presents a systematic literature review that assesses the applicability of existing cyber threat intelligence (CTI) techniques within smart farming infrastructures (SFIs). We develop a comprehensive taxonomy of CTI techniques and sources, specifically tailored to the SFI context, addressing the unique cyber threat challenges in this domain. A crucial finding of our review is the identified need for a virtual Chief Information Security Officer (vCISO) in smart agriculture. While the concept of a vCISO is not yet established in the agricultural sector, our study highlights its potential significance. The implementation of a vCISO could play a pivotal role in enhancing cybersecurity measures by offering strategic guidance, developing robust security protocols, and facilitating real-time threat analysis and response strategies. This approach is critical for safeguarding the food supply chain against the evolving landscape of cyber threats. Our research underscores the importance of integrating a vCISO framework into smart farming practices as a vital step towards strengthening cybersecurity. This is essential for protecting the agricultural sector in the era of digital transformation, ensuring the resilience and sustainability of the food supply chain against emerging cyber risks.

    Who wrote this scientific text?

    The IEEE bibliographic database contains a number of proven duplications, with indication of the original paper(s) copied. This corpus is used to test a method for the detection of hidden intertextuality (commonly named "plagiarism"). The intertextual distance, combined with a sliding window and various classification techniques, identifies these duplications with a very low risk of error. These experiments also show that several factors blur the identity of the scientific author, including variable group authorship and the high levels of intertextuality accepted, and sometimes desired, in scientific papers on the same topic.
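
    A hedged sketch of an intertextual distance in the spirit of Labbé's measure (the paper's exact variant and sliding-window machinery are not reproduced): rescale the longer text's word frequencies to the shorter text's length and sum the absolute frequency differences.

```python
# Hedged sketch of a Labbé-style intertextual distance between two texts.
from collections import Counter

def intertextual_distance(tokens_a, tokens_b):
    """Returns a value in [0, 1]; 0 means identical vocabulary use."""
    if len(tokens_a) > len(tokens_b):  # make A the shorter text
        tokens_a, tokens_b = tokens_b, tokens_a
    na, nb = len(tokens_a), len(tokens_b)
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    scale = na / nb                    # rescale B's counts to A's length
    vocab = set(fa) | set(fb)
    diff = sum(abs(fa[w] - fb[w] * scale) for w in vocab)
    return diff / (2 * na)

a = "the cat sat on the mat".split()
b = "the cat sat on the mat because the mat was warm".split()
print(round(intertextual_distance(a, b), 3))  # small: heavy overlap
```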