7 research outputs found

    PageRank optimization applied to spam detection

    We give a new link spam detection and PageRank demotion algorithm called MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked trusted and spam pages. We define the MaxRank of a page as the frequency of visits to this page by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of these hyperlinks. The cost function penalizes spam pages and hyperlink removals, and the goal is to determine a hyperlink deletion policy that minimizes this score. The MaxRank is interpreted as a modified PageRank vector and is used to sort web pages in place of the usual PageRank vector. The bias vector of this ergodic control problem, which is unique up to an additive constant, is a measure of the "spamicity" of each page and is used to detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to run experiments on the WEBSPAM-UK2007 dataset. We show that our algorithm outperforms both TrustRank and AntiTrustRank for spam and nonspam page detection. (Comment: 8 pages, 6 figures)
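
    The ergodic-control formulation lends itself to relative value iteration. Below is a toy sketch under stated assumptions, not the paper's scalable algorithm: the action at each page is the nonempty subset of outlinks to keep, the cost figures (`spam_cost`, `removal_cost`) are invented, and keep-sets are enumerated exhaustively, which is only feasible on tiny graphs. The returned bias vector plays the role of the per-page "spamicity" score.

    ```python
    from itertools import combinations

    def relative_value_iteration(outlinks, spam_cost, removal_cost=0.1,
                                 iters=500, tol=1e-9):
        """Toy sketch of the average-cost control problem in the abstract.
        outlinks : dict page -> list of successor pages
        spam_cost: dict page -> stage cost (high for seed spam pages,
                   low or negative for trusted ones) -- an assumption,
                   the paper's exact cost function is not reproduced here.
        Returns (bias, policy, average_cost)."""
        nodes = sorted(outlinks)
        h = {v: 0.0 for v in nodes}        # bias vector, unique up to a constant
        ref = nodes[0]                     # reference page for normalization
        gain = 0.0
        for _ in range(iters):
            new_h, policy = {}, {}
            for v in nodes:
                links = outlinks[v] or [v]     # dangling page: self-loop (assumption)
                best, best_keep = None, None
                for r in range(1, len(links) + 1):
                    for keep in combinations(links, r):
                        removed = len(links) - len(keep)
                        # stage cost + expected bias of a uniform click on kept links
                        q = (spam_cost[v] + removal_cost * removed
                             + sum(h[w] for w in keep) / len(keep))
                        if best is None or q < best:
                            best, best_keep = q, keep
                new_h[v], policy[v] = best, best_keep
            gain = new_h[ref]              # estimate of the optimal average cost
            new_h = {v: x - gain for v, x in new_h.items()}
            # damping guards against oscillation on periodic optimal chains
            new_h = {v: 0.5 * h[v] + 0.5 * new_h[v] for v in nodes}
            if max(abs(new_h[v] - h[v]) for v in nodes) < tol:
                h = new_h
                break
            h = new_h
        return h, policy, gain

    # Toy graph: page 2 is a seed spam page, page 0 a trusted one.
    outlinks = {0: [1, 2], 1: [0, 2], 2: [0]}
    spam_cost = {0: -1.0, 1: 0.0, 2: 5.0}
    bias, policy, avg_cost = relative_value_iteration(outlinks, spam_cost)
    print(bias)    # page 2 should carry the largest "spamicity" bias
    print(policy)  # hyperlinks pointing to page 2 should be dropped
    ```

    The stationary distribution of the walk under the returned policy would then play the role of the MaxRank vector used for ranking.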

    Models of content generation by news sources in a social media monitoring system

    Formation and analysis of models of content generation by news sources, aimed at the tasks of collecting and processing messages in a data monitoring and analysis system. Features of the forms of message generation that justify dividing sources into classes. Classes of sources, partitioned by the frequency and irregularity of their publications.
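
    The classes of sources, split by the frequency and irregularity of their publications, suggest a simple operationalization. A hedged sketch follows, with invented thresholds and the coefficient of variation of inter-post gaps standing in for "irregularity"; none of these specifics come from the paper.

    ```python
    import numpy as np

    def classify_source(timestamps, freq_threshold=10.0, cv_threshold=1.0):
        """Assign a news source to a coarse class by posting frequency and
        irregularity. `timestamps` are post times in hours, any order.
        Frequency = posts per day; irregularity = coefficient of variation
        (CV) of inter-post gaps; CV > 1 suggests bursty posting.
        Both thresholds are illustrative assumptions."""
        t = np.sort(np.asarray(timestamps, dtype=float))
        if len(t) < 3 or t[-1] == t[0]:
            return "insufficient-data"
        gaps = np.diff(t)
        posts_per_day = 24.0 * len(t) / (t[-1] - t[0])
        cv = gaps.std() / gaps.mean()     # dimensionless burstiness measure
        rate = "high-frequency" if posts_per_day >= freq_threshold else "low-frequency"
        shape = "bursty" if cv >= cv_threshold else "regular"
        return f"{rate}/{shape}"

    # A wire service posting hourly vs. a blog posting in rare bursts
    print(classify_source(np.arange(0, 240, 1.0)))    # high-frequency/regular
    print(classify_source([0, 1, 2, 100, 101, 200]))  # low-frequency/bursty
    ```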

    Decomposing ratings in service compositions

    An important challenge for service-based systems is to be able to select services based on feedback from service consumers and, therefore, to distinguish between good and bad services. However, ratings are normally provided for a service as a whole, without taking into consideration that services are often formed by a composition of other services. In this paper we propose an approach to decompose the rating given to a service composition into ratings for the individual services participating in that composition. The approach takes into consideration the rating provided for the service composition as a whole, past trust values of the services participating in the composition, and expected and observed QoS aspects of the services. A prototype tool has been implemented to illustrate and evaluate the work, and results of an experimental evaluation of the approach are also reported in the paper.
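
    The abstract does not give the decomposition rule itself, so the following is only a plausible sketch: spread the composite rating over the component services in proportion to how far each service's observed QoS fell short of expectation, damped by its past trust value. The function name, the 1-5 rating scale, and the weighting are all invented for illustration.

    ```python
    def decompose_rating(composite_rating, services):
        """Hedged sketch: split a composition-level rating (1-5 scale assumed)
        into per-service ratings. `services` maps name -> dict with a prior
        `trust` in [0, 1] and `expected_qos` / `observed_qos` (e.g., response
        time in ms; lower is better). Under-delivering, low-trust services
        absorb more of a bad rating. All weighting choices are assumptions."""
        # Shortfall: relative amount by which a service missed its promise.
        shortfall = {
            name: max(0.0, s["observed_qos"] - s["expected_qos"]) / s["expected_qos"]
            for name, s in services.items()
        }
        # Blame weight: shortfall amplified for services with low prior trust.
        weight = {name: (1.0 + shortfall[name]) * (1.0 - services[name]["trust"])
                  for name in services}
        total = sum(weight.values()) or 1.0
        # Blend each service's trust-implied rating with the composite rating,
        # pulling harder toward the composite where the blame weight is higher.
        return {
            name: (1 - weight[name] / total) * 5.0 * s["trust"]
                  + (weight[name] / total) * composite_rating
            for name, s in services.items()
        }

    # Example: a 2-star rating for a composition of three services, where
    # only the shipping service badly missed its expected response time.
    services = {
        "payments":  {"trust": 0.9, "expected_qos": 200, "observed_qos": 210},
        "shipping":  {"trust": 0.6, "expected_qos": 300, "observed_qos": 900},
        "catalogue": {"trust": 0.8, "expected_qos": 100, "observed_qos": 95},
    }
    print(decompose_rating(2.0, services))  # shipping's rating drops the most
    ```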

    Exploring microtonal matching

    Most research into music information retrieval thus far has examined only music from the Western tradition. However, music of other origins often conforms to different tuning systems, so there are problems both in representing this music and in finding matches to queries drawn from these diverse tuning systems. We discuss the issues associated with microtonal music retrieval and present some preliminary results from an experiment in applying scoring matrices to microtonal matching.
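
    The abstract mentions scoring matrices without detail, so this is a speculative sketch of one form they could take: pitches measured in cents, a score that rewards near-matches within a tolerance, and a Smith-Waterman-style local alignment over pitch sequences. The 50-cent tolerance, the linear ramp, and the gap penalty are all assumptions.

    ```python
    import numpy as np

    def pitch_score(a_cents, b_cents, tolerance=50.0):
        """Score two pitches given in cents above a common reference.
        Exact matches score +2, fading linearly to -2 at twice the
        tolerance; microtonal near-matches therefore score positively."""
        diff = abs(a_cents - b_cents)
        return 2.0 - 4.0 * min(diff, 2 * tolerance) / (2 * tolerance)

    def local_align(query, target, gap=-1.0):
        """Smith-Waterman-style local alignment of two pitch sequences,
        with pitch_score playing the role of the scoring matrix.
        Returns the best local alignment score."""
        m, n = len(query), len(target)
        H = np.zeros((m + 1, n + 1))
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                H[i, j] = max(0.0,
                              H[i - 1, j - 1] + pitch_score(query[i - 1], target[j - 1]),
                              H[i - 1, j] + gap,     # gap in target
                              H[i, j - 1] + gap)     # gap in query
        return H.max()

    # A quarter-tone (24-TET) phrase matched against a 12-TET phrase:
    # the 50-cent deviations still score as near-matches.
    quarter_tone = [0, 150, 350, 500]   # cents
    semitone = [0, 200, 400, 500]
    print(local_align(quarter_tone, semitone))
    ```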

    An Approach of QoS Evaluation for Web Services Design With Optimized Avoidance of SLA Violations

    Quality of service (QoS) is an official agreement that governs the contractual commitments between service providers and consumers with respect to various nonfunctional requirements, such as performance, dependability, and security. As more Web services become available for the construction of software systems based on service-oriented architecture (SOA), QoS has become a decisive factor for service consumers choosing among providers who offer similar services. QoS is usually documented in a service-level agreement (SLA) to ensure the functionality and quality of services and to define monetary penalties in case of any violation of the written agreement. Consequently, service providers have a strong interest in keeping their commitments so as to avoid or reduce the situations that may cause SLA violations. However, there is a noticeable shortage of tools that service providers can use either to quantitatively evaluate the QoS of their services for the prediction of SLA violations or to actively adjust their design to avoid SLA violations through optimized service reconfigurations. Developed in this dissertation research is an innovative framework that tackles the problem of SLA violations in three separate yet connected phases. For a given SOA system under examination, the framework employs sensitivity analysis in the first phase to identify factors that are influential to system performance; the impact of influential factors on QoS is then quantitatively measured with a metamodel-based analysis in the second phase. The results of these analyses are used in the third phase to search both globally and locally for optimal solutions via a controlled number of experiments. In addition to technical details, this dissertation includes experimental results demonstrating that this new approach can help service providers not only predict SLA violations but also avoid unnecessary increases in operational cost during service optimization.
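
    The abstract outlines the three phases without implementation detail; the sketch below improvises the first two under assumed specifics: a designed experiment over two configuration factors, a quadratic least-squares metamodel of response time, and standardized linear coefficients read off as first-order sensitivities. The factor names and the stand-in system are invented.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for the SOA system under examination (not from the dissertation).
    def observe_response_time(pool_size, cache_mb):
        return 500.0 / pool_size + 120.0 / np.sqrt(cache_mb) + rng.normal(0, 2)

    # Phase 1: a small factorial design over two configuration factors.
    pool = np.repeat([4.0, 8.0, 16.0, 32.0], 8)
    cache = np.tile([64.0, 128.0, 256.0, 512.0, 64.0, 128.0, 256.0, 512.0], 4)
    y = np.array([observe_response_time(p, c) for p, c in zip(pool, cache)])

    # Phase 2: quadratic metamodel y ~ b0 + b1*P + b2*C + b3*P^2 + b4*C^2 + b5*P*C,
    # fitted by least squares on standardized factors so coefficients compare.
    P = (pool - pool.mean()) / pool.std()
    C = (cache - cache.mean()) / cache.std()
    X = np.column_stack([np.ones_like(P), P, C, P ** 2, C ** 2, P * C])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(f"pool_size sensitivity: {beta[1]:+.2f} ms per std dev")
    print(f"cache_mb  sensitivity: {beta[2]:+.2f} ms per std dev")
    # Phase 3 would then search configurations whose predicted response time
    # stays inside the SLA bound, globally first and then locally.
    ```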

    Audio-Based Retrieval of Musical Score Data

    Given an audio query, such as a polyphonic musical piece, this thesis addresses the problem of retrieving matching (similar) musical score data from a collection of musical scores. There are different techniques for measuring similarity between musical pieces, such as metadata-based similarity measures, collaborative filtering, and content-based similarity measures. In this thesis we use the information in the digital music itself for similarity measurement, a technique known as content-based similarity measurement. First, we extract chroma features to represent musical segments. Chroma features capture both melodic and harmonic information and are robust to timbre variation. Tempo variation between performances of the same song may make them appear dissimilar; to address this, we extract beat sequences and combine them with chroma features to obtain beat-synchronous chroma features. Next, we use the Dynamic Time Warping (DTW) algorithm, which computes the DTW matrix between two feature sequences and calculates the cost of traversing from the starting point to the end point of the matrix. The lower the cost value, the more similar the musical segments are. The performance of DTW is improved by choosing suitable path constraints and path weights. We then implement the LSH algorithm, which first indexes the data and then searches for similar items. The processing time of LSH is shorter than that of DTW, and for a smaller fragment of query audio, say 30 seconds, LSH outperformed DTW. The performance of LSH depends on the number of hash tables, the number of projections per table, and the width of the projection. Both algorithms were applied to two data sets: RWC (where audio and MIDI are from the same source) and TUT (where audio and MIDI are from different sources). The contribution of this thesis is twofold: first, we propose a suitable feature representation of a musical segment for melodic similarity; then we apply two different similarity measure algorithms and enhance their performance. This thesis work also includes the development of a mobile application capable of recording audio from its surroundings and displaying its acoustic features in real time.
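
    A minimal sketch of the DTW cost described above, assuming 12-dimensional chroma frames and cosine distance; a Sakoe-Chiba band stands in for the "suitable path constraints" the thesis tunes, and the path weights are left uniform.

    ```python
    import numpy as np

    def dtw_cost(X, Y, band=10):
        """DTW cost between chroma sequences X (m x 12) and Y (n x 12),
        cosine distance per frame, Sakoe-Chiba band of half-width `band`
        as a simple path constraint. Lower cost means more similar.
        A sketch of the approach in the abstract, not the thesis code."""
        m, n = len(X), len(Y)
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
        Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-9)
        dist = 1.0 - Xn @ Yn.T             # pairwise cosine distances
        D = np.full((m + 1, n + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(max(1, i - band), min(n, i + band) + 1):
                D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j - 1],  # match
                                                   D[i - 1, j],      # insertion
                                                   D[i, j - 1])      # deletion
        return D[m, n] / (m + n)           # length-normalized traversal cost

    # The cost for a near-duplicate should be well below that of a random piece.
    rng = np.random.default_rng(1)
    query = rng.random((40, 12))
    near_duplicate = query + rng.normal(0, 0.05, query.shape)
    unrelated = rng.random((40, 12))
    print(dtw_cost(query, near_duplicate), "<", dtw_cost(query, unrelated))
    ```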

    Learning from Partially Labeled Data: Unsupervised and Semi-supervised Learning on Graphs and Learning with Distribution Shifting

    This thesis focuses on two fundamental machine learning problems: unsupervised learning, where no label information is available, and semi-supervised learning, where a small amount of labeled data is given in addition to unlabeled data. These problems arise in many real-world applications, such as Web analysis and bioinformatics, where a large amount of data is available but little or no labeled data exists. Obtaining classification labels in these domains is usually quite difficult because it involves either manual labeling or physical experimentation. This thesis approaches these problems from two perspectives: graph-based and distribution-based. First, I investigate a series of graph-based learning algorithms that exploit information embedded in different types of graph structures. These algorithms allow label information to be shared between nodes in the graph, ultimately communicating information globally to yield effective unsupervised and semi-supervised learning. In particular, I extend existing graph-based learning algorithms, currently based on undirected graphs, to more general graph types, including directed graphs, hypergraphs, and complex networks. These richer graph representations allow one to capture more naturally the intrinsic data relationships that exist, for example, in Web data, relational data, bioinformatics, and social networks. For each of these generalized graph structures I show how information propagation can be characterized by distinct random walk models, and then use this characterization to develop new unsupervised and semi-supervised learning algorithms. Second, I investigate a more statistically oriented approach that explicitly models a learning scenario where the training and test examples come from different distributions. This is a difficult situation for standard statistical learning approaches, since they typically assume that the distributions of the training and test sets are similar, if not identical. To achieve good performance in this scenario, I utilize unlabeled data to correct the bias between the training and test distributions. A key idea is to produce resampling weights for bias correction by working directly in a feature space, bypassing the problem of explicit density estimation. The technique can easily be applied to many different supervised learning algorithms, automatically adapting their behavior to cope with distribution shifting between training and test data.
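
    The first part of the abstract describes random-walk label propagation on graphs. A minimal sketch of the standard undirected case that the thesis generalizes, using the symmetrically normalized propagate-and-clamp iteration; the directed-graph, hypergraph, and complex-network variants would replace the walk operator below.

    ```python
    import numpy as np

    def label_propagation(W, labels, alpha=0.9, iters=200):
        """Semi-supervised label propagation on an undirected graph.
        W      : (n x n) symmetric nonnegative adjacency matrix
        labels : length-n array of class ids, -1 for unlabeled nodes
        alpha  : weight on neighbor opinions vs. re-injected seed labels"""
        n = len(labels)
        classes = sorted(c for c in set(labels) if c != -1)
        Y = np.zeros((n, len(classes)))       # one-hot seed matrix
        for i, c in enumerate(labels):
            if c != -1:
                Y[i, classes.index(c)] = 1.0
        d = W.sum(axis=1)                     # node degrees
        Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        S = Dinv @ W @ Dinv                   # normalized walk operator
        F = Y.copy()
        for _ in range(iters):
            F = alpha * (S @ F) + (1 - alpha) * Y   # propagate, clamp seeds
        return np.array(classes)[F.argmax(axis=1)]

    # Example: a 6-node chain labeled only at its two ends.
    W = np.zeros((6, 6))
    for i in range(5):
        W[i, i + 1] = W[i + 1, i] = 1.0
    labels = np.array([0, -1, -1, -1, -1, 1])
    print(label_propagation(W, labels))       # expected: [0 0 0 1 1 1]
    ```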