6 research outputs found

    Methods for demoting and detecting Web spam

    Web spamming subverts the ranking mechanisms of information retrieval in Web search engines: it maliciously manipulates either content or links with the intention of distorting Web search results. By altering the order of search results, spammers increase the difficulty and the time it takes Web users to retrieve relevant information. To improve the quality of Web search results, this thesis develops anti-Web-spam techniques that detect and demote Web spam via trust and distrust propagation and via Web spam classification.

    A comprehensive review of existing anti-Web-spam techniques, with emphasis on trust and distrust models and machine learning models, is presented. Several experiments demonstrate the vulnerability of ranking algorithms to Web spam. Two publicly available Web spam datasets are used throughout the thesis: WEBSPAM-UK2006 and WEBSPAM-UK2007.

    Two link-based trust and distrust algorithms are then presented: Trust Propagation Rank and Trust Propagation Spam Mass. Both semi-automatically detect and demote Web spam starting from a limited set of pages that human experts have evaluated as non-spam or spam. In the experiments, Trust Propagation Rank and Trust Propagation Spam Mass achieved improvements of up to 10.88% and 43.94%, respectively, over the benchmark algorithms.

    Thereafter, weight properties associated with the linkage between two Web hosts are introduced into the task of Web spam detection. Most prior studies use weight properties in ranking mechanisms; in this work they are incorporated into distrust-based algorithms to detect more spam. The experiments show that the weight properties improve existing distrust-based Web spam detection algorithms by up to 30.26% and 31.30% on the two aforementioned datasets.

    Although integrating weight properties yields significant gains in detecting Web spam, a distrust seed set propagation algorithm is presented to further enhance detection. The algorithm propagates distrust scores over a wider range to estimate the probability that other, unevaluated Web pages are spam. Experimental results show that it improves the distrust-based Web spam detection algorithms by up to 19.47% and 25.17% on the two datasets.

    Finally, an alternative machine learning classifier, a multilayer perceptron neural network, is proposed to further improve the detection rate of Web spam. In the experiments, the multilayer perceptron increased the detection rate by up to 14.02% and 3.53% over a conventional classifier, support vector machines. A mechanism for determining the number of hidden neurons of the multilayer perceptron is also presented, simplifying the design of the network structure.
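    The abstract does not give the internals of Trust Propagation Rank, but the general idea of link-based trust propagation from a small human-evaluated seed set can be sketched in a TrustRank-style power iteration. All names and parameters below are illustrative assumptions, not the thesis's algorithm:

```python
import numpy as np

def trust_rank(adj, seeds, beta=0.85, iters=50):
    """TrustRank-style trust propagation (illustrative sketch).

    adj[i, j] = 1 when page i links to page j.
    seeds: indices of human-verified non-spam pages.
    Trust flows along out-links from the seed set; pages that are
    far from any seed accumulate little trust and can be demoted.
    """
    n = adj.shape[0]
    # Row-normalise outgoing links; dangling pages keep zero rows.
    deg = adj.sum(axis=1, keepdims=True)
    T = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    # Static trust vector: uniform over the seed set, zero elsewhere.
    d = np.zeros(n)
    d[seeds] = 1.0 / len(seeds)
    t = d.copy()
    for _ in range(iters):
        t = beta * (T.T @ t) + (1 - beta) * d
    return t
```

    A spam-mass-style score could then be derived by comparing each page's plain PageRank with the trust it receives from the seeds: pages whose rank is mostly unsupported by trusted links are likely spam.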

    Utilizing Multi-modal Weak Signals to Improve User Stance Inference in Social Media

    Social media has become an integral component of daily life, with millions of items of content of various types released into social networks every day. This offers an interesting window into users' views on everyday life. Exploring the opinions of users in social media networks has long interested natural language processing researchers: knowing the opinion of a mass of users allows anyone to make informed policy or marketing decisions, which is exactly why comprehensive social opinions are desirable. The nature of social media is complex, however, so obtaining the social opinion is a challenging task. Diverse and complex as they are, social media networks typically mirror actual social connections on a digital platform: just as users make friends and companions in the real world, digital platforms enable them to form similar social connections.

    This work examines how to obtain a comprehensive social opinion from a social media network. Typical social opinion quantifiers infer opinions from the text contributions users make. This is currently challenging because the majority of users on social media consume content rather than express their opinions, which leaves no linguistic features and makes purely language-based methods impractical. We improve a method named stance inference, which can utilize multi-domain features to extract the social opinion, and we introduce a method that can expose users' opinions even when they have no on-topic content. We also show that introducing weak supervision into the otherwise unsupervised task of stance inference improves performance. The weak supervision enters the pipeline through hashtags: hashtags are contextual indicators added by humans and are much likelier to be topically related than the output of a topic model. Lastly, we introduce disentanglement methods for chronological social media networks, which allow the methods introduced above to be applied on these types of platforms.
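    The abstract does not specify how hashtags feed into the pipeline; one minimal way to use hashtags as weak supervision is to curate small seed sets of stance-bearing hashtags and let them vote on a user's label, abstaining when no seed tag appears. The seed tags and stance names below are hypothetical examples, not from the thesis:

```python
import re
from collections import Counter

# Hypothetical seed hashtags per stance; a real study would curate these
# per topic and likely expand them semi-automatically.
SEED_TAGS = {
    "favor": {"#voteyes", "#supportthebill"},
    "against": {"#voteno", "#killthebill"},
}

def weak_label(posts):
    """Assign a weak stance label to a user from hashtags in their posts.

    Returns the majority stance among seed hashtags, or None when the
    posts contain no seed hashtag (the weak labeller abstains).
    """
    votes = Counter()
    for text in posts:
        for tag in re.findall(r"#\w+", text.lower()):
            for stance, tags in SEED_TAGS.items():
                if tag in tags:
                    votes[stance] += 1
    return votes.most_common(1)[0][0] if votes else None
```

    Abstaining labels are the point of weak supervision here: the few users who do use seed hashtags anchor the otherwise unsupervised stance inference for everyone else.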

    Eighth Biennial Report : April 2005 – March 2007


    Adquisición y representación del conocimiento mediante procesamiento del lenguaje natural

    [Abstract] This thesis introduces a framework for information retrieval that combines natural language processing with domain knowledge, covering the whole process of creating, managing and querying a document collection. The approach automatically integrates linguistic knowledge into a formal model of semantic representation that the system can manipulate directly. This permits the construction of algorithms that simplify maintenance tasks, provide more flexible access to non-specialist users, and eliminate subjective components that lead to hardly predictable behaviour. Linguistic knowledge acquisition starts from a dependency parse based on a mildly context-sensitive grammatical formalism, combining computational efficiency with expressive power. The formal interpretation of the semantics rests on the notion of conceptual graph, which serves as the basis both for representing the collection and for the queries that interrogate it. In this context, the proposal addresses the automatic generation of these representations from the linguistic knowledge acquired from the texts; these representations constitute the starting point for indexing. Graph operations, together with the principles of projection and generalisation, are then used to compute and rank answers in a way that accounts for the intrinsic imprecision and incompleteness of retrieval. In addition, the visual nature of graphs permits the construction of user-friendly interfaces that reconcile precision and intuitiveness. Finally, the proposal also includes a formal evaluation framework. (The original entry carried Spanish and Galician versions of this same abstract, consolidated here into the English one.)
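    The projection operation mentioned above is the standard matching step of conceptual-graph retrieval: a query graph matches a document graph when every query triple is covered by a document triple whose concepts specialise the query's concepts. A minimal sketch, with a toy type hierarchy and triple encoding that are illustrative assumptions rather than the thesis's exact representation:

```python
# Conceptual graphs as sets of (concept, relation, concept) triples.
# A small hypothetical type hierarchy: a query concept matches any of
# its specialisations in a document graph.
HIERARCHY = {"animal": {"cat", "dog"}, "cat": set(), "dog": set()}

def specialises(general, specific):
    """True when `specific` equals `general` or descends from it."""
    if general == specific:
        return True
    return any(specialises(child, specific)
               for child in HIERARCHY.get(general, ()))

def projects(query, doc):
    """A query graph projects into a document graph when every query
    triple is matched by some doc triple with specialised concepts
    and an identical relation."""
    return all(
        any(rel == d_rel and specialises(a, d_a) and specialises(b, d_b)
            for (d_a, d_rel, d_b) in doc)
        for (a, rel, b) in query
    )
```

    Ranking, as the abstract notes, would then relax this boolean test, scoring partial projections so that imprecise or incomplete matches still surface in the answer list.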

    Analyse und Vorhersage der Aktualisierungen von Web-Feeds

    Feeds are used, among other things, to inform users in a uniform and aggregated format about updates or new posts on websites. Since feeds generally offer no notification functionality, interested parties must poll them regularly for updates. Techniques for doing so form the core of this work. Algorithms for predicting update times from the related domains of Web crawling and Web caching are reviewed and adapted to the specific requirements of the feed domain. A newly developed algorithm is then presented that, even without special configuration parameters and without a training phase, makes better predictions on average than the other algorithms considered. Based on an analysis of various metrics for judging prediction quality, a summarising quality measure is defined that allows algorithms to be compared by a single value. Furthermore, query-specific attributes of the feed formats are examined, and it is shown empirically that predicting changes from the partial history of a feed already yields better results than incorporating the values supplied by the service providers into the computation. The empirical evaluations are carried out on a broad, real-world feed dataset, which is made freely available to the scientific community to ease comparison with new algorithms.
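    The history-based prediction the abstract describes can be illustrated with a minimal sketch: estimate the next update as one mean inter-update interval after the last observed update, with a floor on the polling rate. This is an assumed baseline, not the thesis's own algorithm:

```python
from statistics import mean

def next_poll(update_times, min_interval=60.0):
    """Schedule the next poll of a feed from its observed update history.

    update_times: ascending timestamps (seconds) of observed updates.
    The next update is predicted one mean inter-update interval after
    the last observed one; polls never come closer than min_interval.
    """
    if not update_times:
        return min_interval          # no history yet: poll after the floor
    if len(update_times) < 2:
        return update_times[-1] + min_interval
    intervals = [b - a for a, b in zip(update_times, update_times[1:])]
    return update_times[-1] + max(mean(intervals), min_interval)
```

    Even this crude use of partial history adapts to each feed's actual posting rhythm, which is the property the thesis shows outperforms the update hints supplied by feed providers.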