1,844 research outputs found

    Modeling of Human Web Browsing Based on Theory of Interest-Driven Behavior

    The ability to generate human-like Web-browsing requests is essential for testing and optimizing WWW systems. This thesis proposes a new model of human browsing behavior, the HBB-IDT model, based on the theory of interest-driven human behavior; it does not assume the availability of server-side logs (i.e., previous browsing history). The defining features of the model are: (1) human browsing on the Internet is regarded as a dynamic interest-driven process; and (2) the user's browsing interests are linked to actual characteristics of the visited Web pages. Because the model does not rely on the existence of Web logs, it can be applied more generally than previously proposed data-mining approaches. Experimental results show that the probability of generating human-like browsing sequences is much higher with the HBB-IDT model than with a pre-set request list or random crawling.
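    The abstract does not give the model's equations, but its two defining features suggest the general shape of a simulator. Below is a minimal sketch assuming illustrative page feature vectors, a dot-product interest score, and an exponential interest-update rule; none of these specifics are taken from the thesis.

```python
import random

# Illustrative page features: topic-weight vectors (assumed, not from the paper).
PAGES = {
    "home":   {"news": 0.6, "sport": 0.2, "tech": 0.2},
    "match":  {"news": 0.1, "sport": 0.8, "tech": 0.1},
    "gadget": {"news": 0.1, "sport": 0.1, "tech": 0.8},
}

def interest_score(interest, page):
    """Dot product between the user's interest vector and page features."""
    return sum(interest.get(topic, 0.0) * w for topic, w in page.items())

def browse(interest, steps=10, decay=0.9):
    """Generate a browsing sequence: pick pages in proportion to current
    interest, then shift interest toward the visited page (dynamic process)."""
    sequence = []
    for _ in range(steps):
        pages = list(PAGES)
        weights = [interest_score(interest, PAGES[p]) for p in pages]
        page = random.choices(pages, weights=weights)[0]
        sequence.append(page)
        # Interest drifts toward what was just read and decays elsewhere.
        for topic in interest:
            interest[topic] = (decay * interest[topic]
                               + (1 - decay) * PAGES[page].get(topic, 0.0))
    return sequence

print(browse({"news": 0.5, "sport": 0.3, "tech": 0.2}))
```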

    Anomaly Detection over User Profiles for Intrusion Detection

    Intrusion detection systems (IDS) have often been used to analyse network traffic to help network administrators quickly identify and respond to intrusions. These detection systems generally operate over the entire network, identifying "anomalies" atypical of the network's normal collective user activities. We show that anomaly detection could also be host-based, so that the normal usage patterns of an individual user can be profiled. This enables the detection of masquerading intruders by comparing a learned user profile against the current session's profile. A prototype behavioural IDS applies the concept of anomaly detection to user behaviour and compares the effects of using multiple characteristics to profile users. Behaviour captured within the system consists of application usage, application performance (CPU and memory), the websites a user visits, the number of windows a user has open, and their typing habits. The results show that such a system is entirely feasible, that characteristics physically related to the user are more relevant to profiling behaviour, and that combining characteristics can significantly decrease the time taken to detect an intruder.
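    As a rough illustration of host-based profiling, the sketch below compares a session's feature vector against per-user means and standard deviations using a z-score test. The specific features, numbers, and threshold are assumptions for illustration, not the prototype's actual design.

```python
# Learned per-user profile: (mean, standard deviation) per characteristic.
PROFILE = {
    "cpu_percent":   (22.0, 6.0),
    "mem_mb":        (1400.0, 300.0),
    "open_windows":  (7.0, 2.5),
    "keystrokes_ps": (4.2, 1.1),
}

def anomaly_score(session):
    """Average absolute z-score of the session against the learned profile."""
    zs = [abs(session[f] - mu) / sigma for f, (mu, sigma) in PROFILE.items()]
    return sum(zs) / len(zs)

session = {"cpu_percent": 55.0, "mem_mb": 2600.0,
           "open_windows": 14.0, "keystrokes_ps": 1.0}
score = anomaly_score(session)
# An assumed threshold of 2.0 flags sessions far from the learned profile.
print(f"anomaly score {score:.2f} -> {'masquerader?' if score > 2.0 else 'normal'}")
```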

    Collective emotions online and their influence on community life

    E-communities, social groups interacting online, have recently become an object of interdisciplinary research. As with face-to-face meetings, Internet exchanges may include not only factual information but also emotional information - how participants feel about the subject discussed or about other group members. Emotions are known to affect interaction partners in offline communication in many ways. Could emotions in Internet exchanges affect others and systematically influence quantitative and qualitative aspects of the trajectory of e-communities? The development of automatic sentiment analysis has made large-scale emotion detection and analysis possible using text messages collected from the web. It is not clear whether emotions in e-communities primarily derive from individual group members' personalities or result from intra-group interactions, and whether they influence group activities. We show the collective character of affective phenomena on a large scale, as observed in 4 million posts downloaded from blogs, Digg and BBC forums. To test whether the emotions of a community member may influence the emotions of others, posts were grouped into clusters of messages with similar emotional valences. The frequency of long clusters was much higher than it would be if emotions occurred at random. The distributions of cluster lengths can be explained by preferential processes, because the conditional probabilities for consecutive messages grow as a power law with cluster length. For BBC forum threads, average discussion lengths were higher for larger absolute average emotional valence in the first ten comments, and the average amount of emotion in messages fell during discussions. Our results prove that collective emotional states can be created and modulated via Internet communication and that emotional expressiveness is the fuel that sustains some e-communities. Comment: 23 pages including Supporting Information; accepted to PLoS ONE.
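    The clustering analysis can be illustrated in a few lines: group consecutive posts of equal valence into runs and estimate the conditional probability that a run keeps growing, the quantity the paper reports as growing like a power law. The valence sequence below is a toy input, not the study's data.

```python
from itertools import groupby
from collections import Counter

# Toy valence sequence (-1 negative, 0 neutral, +1 positive); real input
# would come from sentiment analysis of the downloaded posts.
valences = [1, 1, 1, -1, -1, 1, 0, -1, -1, -1, -1, 1, 1, 0, 0, -1]

# Lengths of maximal runs ("clusters") of messages with the same valence.
runs = [len(list(group)) for _, group in groupby(valences)]
length_counts = Counter(runs)
print("cluster length distribution:", dict(length_counts))

# Conditional probability that a cluster of length >= n grows to >= n+1.
for n in range(1, max(runs)):
    at_least_n  = sum(c for l, c in length_counts.items() if l >= n)
    at_least_n1 = sum(c for l, c in length_counts.items() if l >= n + 1)
    if at_least_n:
        print(f"P(len >= {n+1} | len >= {n}) = {at_least_n1 / at_least_n:.2f}")
```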

    Supporting collocation learning with a digital library

    Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval and generation. Collocations are automatically identified in input documents using natural language processing techniques and are used both to enhance the presentation of the documents and as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 billion short phrases drawn from the web, from which 29 million collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus.
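    The paper does not specify the extraction method beyond "natural language processing techniques"; a common baseline for collocation identification is pointwise mutual information (PMI) over adjacent word pairs, sketched here with a toy corpus standing in for the 1.3-billion-phrase collection.

```python
import math
from collections import Counter

# Toy corpus; a real system would use the web-derived phrase collection.
corpus = ("strong tea strong coffee heavy rain heavy traffic "
          "strong tea heavy rain strong tea").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def pmi(w1, w2):
    """Pointwise mutual information: log2 of p(w1,w2) / (p(w1) * p(w2))."""
    p12 = bigrams[(w1, w2)] / (total - 1)
    p1, p2 = unigrams[w1] / total, unigrams[w2] / total
    return math.log2(p12 / (p1 * p2))

# Rank candidate collocations by PMI; a frequency cutoff avoids flukes.
candidates = [(pair, pmi(*pair)) for pair, c in bigrams.items() if c >= 2]
for pair, score in sorted(candidates, key=lambda x: -x[1]):
    print(pair, round(score, 2))
```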

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.
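    The dissertation's intentional surfer model is not spelled out in this abstract. As a hedged sketch of the general idea, the snippet below runs a PageRank-style power iteration whose transition probabilities come from aggregated user-session click counts rather than uniform link-following; the page names, click counts, and damping factor are all illustrative assumptions.

```python
import numpy as np

pages = ["A", "B", "C"]
# Transition counts aggregated from user sessions (toy numbers).
clicks = np.array([[0, 8, 2],
                   [5, 0, 5],
                   [1, 9, 0]], dtype=float)

# Row-normalize click counts into transition probabilities.
P = clicks / clicks.sum(axis=1, keepdims=True)

d, n = 0.85, len(pages)      # damping factor is an assumed value
rank = np.full(n, 1.0 / n)
for _ in range(100):          # power iteration toward the stationary vector
    rank = (1 - d) / n + d * rank @ P

print(dict(zip(pages, rank.round(3))))
```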

    Personalization of news web portal content using information extraction techniques and weighted Voronoi diagrams

    News web portals present information covering all aspects of our daily lives, organized in a predefined topic taxonomy, in both multimedia and textual formats. The information presented has a high refresh rate and as such offers a local as well as a global snapshot of the world. This thesis deals with information extraction techniques (applied to web news portals) and their use in the standardization of categorization schemes and the automatic classification of newly published content. Weighted Voronoi diagrams are proposed as the personalization method. The aim of the study is to create a virtual profile, at the individual level, based on the semantic value of the information in visited nodes (web pages formatted in HTML). The results can greatly contribute to the applicability of personalization data to specific information sources, including various web news portals. By creating a publicly available collection of prepared data, future research in this domain is also enabled. The scientific contributions of this doctoral thesis are therefore: a universal classification scheme based on ODP taxonomy data is developed; a method for extracting information about user preferences, based on the analysis of user behaviour while using the Web browser, is defined; and a personalization system based on weighted Voronoi diagrams is implemented.
    One way to address the problems caused by the overproduction of information is to personalize information sources, in our case the WWW environment, by creating virtual profiles based on an analysis of users' behavioural characteristics, with the goal of grading the importance of information on an individual basis. Personalization itself has mostly been applied in the field of information retrieval. A review of prior research highlights several approaches used to personalize available content: ontology-based approaches, contextual models and data mining; these dominate the surveyed literature. The literature analysis also revealed the lack of a uniform taxonomy of terms used to annotate information nodes. The prevailing approach to annotation is tagging systems based on user input. The surveyed work indicates that, for popular tags, users on different systems attach the same annotations to the same and/or similar objects; that the synonym problem exists but is negligible given a sufficient amount of data; and that annotations used by ordinary users and by domain experts overlap in 52% of cases. These findings point to the absence of a unified system for labelling an information node. Tagging systems carry a large amount of "information noise" because tagging is individual in nature and directly tied to the user's knowledge of the node's domain. As a potential remedy, the use of existing taxonomies defined by web directories is proposed. Of the several candidate web directories, the literature most frequently cites the ODP web directory as the highest-quality taxonomy for the hierarchical domain categorization of information nodes, and the use of ODP as a taxonomy is noted in several works studied during the preliminary research. Using the ODP taxonomy to classify information nodes makes it possible to determine domain membership, which in turn allows a membership value to be assigned to an information node for each domain.
    Given the complex structure of the ODP taxonomy (12 hierarchical levels, 17 top-level categories) and the large number of potential categories, the thesis proposes using the ODP taxonomy to classify an information node down to level 6. Alongside this guideline on how many hierarchical levels to use when analysing the ODP structure, the need for deep classification of documents is also stressed. The literature analysis shows that personalization has been approached primarily in the domain of information retrieval through WWW interfaces, while the personalization of information available through web portals remains poorly explored. Across the many works consulted during the preliminary research phase, various data sources were used for analysis: server log files, personal browsing history from browser log files, applications tracking the user's interaction with the system, cookies, and others. Data collected from one or more of these sources give insight into an individual user's movement within a defined information space and time frame. In the surveyed literature such data are used for personalization, but not at the individual level; instead, users are grouped into thematically similar clusters. The goal of this work is to test existing methods recognized as useful for further work and to extend them with weighted Voronoi diagrams in order to achieve personalization at the individual level. The use of weighted Voronoi diagrams for this purpose has not previously been reported in the literature and thus represents an innovation in the field of information personalization. The substantial body of work on recognizing usage patterns of information nodes, too large to cite in full, will also assist this process. The existence of a behavioural pattern, tied to long-term and/or short-term data on the user's movement through the information space, enables better filtering and personalization of the available information. Since the aim of this work is to demonstrate the possibility of individual personalization, weighted Voronoi diagrams are recognized as a promising tool for building a virtual semantic profile and personalizing information.
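    To make the personalization method concrete: in a multiplicatively weighted Voronoi diagram, a point belongs to the site with the smallest distance divided by the site's weight. The sketch below applies that rule to hypothetical profile categories placed in a 2-D semantic space; the coordinates and weights are assumptions for illustration, not values from the thesis.

```python
import math

# Illustrative profile "sites": a position in a 2-D semantic space plus a
# weight reflecting interest strength (the mapping to the plane is assumed).
sites = {
    "sport":   ((0.0, 0.0), 2.0),
    "tech":    ((4.0, 1.0), 1.0),
    "culture": ((1.0, 3.0), 1.5),
}

def assign(point):
    """Multiplicatively weighted Voronoi rule: a point belongs to the site
    minimizing Euclidean distance divided by the site's weight."""
    def weighted_dist(site):
        (x, y), w = sites[site]
        return math.dist(point, (x, y)) / w
    return min(sites, key=weighted_dist)

print(assign((1.0, 1.0)))   # drawn toward the heavier "sport" site
```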

    Concept-based Interactive Query Expansion Support Tool (CIQUEST)

    This report describes a three-year project (2000-03) undertaken in the Information Studies Department at The University of Sheffield and funded by Resource, The Council for Museums, Archives and Libraries. The overall aim of the research was to provide user support for query formulation and reformulation when searching large-scale textual resources, including those of the World Wide Web. More specifically, the objectives were: to investigate and evaluate methods for the automatic generation and organisation of concepts derived from retrieved document sets, based on statistical methods for term weighting; and to conduct user-based evaluations of the understanding, presentation and retrieval effectiveness of concept structures in selecting candidate terms for interactive query expansion. The TREC test collection formed the basis for the seven evaluative experiments conducted in the course of the project, which fell into four distinct phases. In the first phase, a series of experiments investigated techniques for concept derivation and hierarchical organisation and structure. The second phase was concerned with user-based validation of the concept structures. Results of phases 1 and 2 informed the design of the test system, and the user interface was developed in phase 3. The final phase entailed a user-based summative evaluation of the CiQuest system. The main findings demonstrate that concept hierarchies can be generated effectively from sets of retrieved documents and displayed to searchers in a meaningful way. The approach provides the searcher with an overview of the contents of the retrieved documents, which in turn facilitates the viewing of documents and the selection of the most relevant ones. Concept hierarchies are a good source of terms for query expansion and can improve precision. The extraction of descriptive phrases as an alternative source of terms was also effective. With respect to presentation, cascading menus were easy to browse for selecting terms and viewing documents. In conclusion, the project dissemination programme and future work are outlined.
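    The report's exact term-weighting scheme is not reproduced here; one standard way to organise retrieved-set terms into a hierarchy, in the style of subsumption approaches from the same era, is sketched below with a toy document set and an assumed 0.8 co-occurrence threshold.

```python
from collections import Counter

# Toy retrieved set (term sets per document); a real run would use TREC docs.
docs = [
    {"query", "expansion", "term", "weighting"},
    {"query", "expansion", "interactive"},
    {"query", "term", "weighting"},
    {"query", "interactive", "interface"},
]

df = Counter(t for d in docs for t in d)  # document frequency in the set

def subsumes(a, b):
    """Subsumption test: a is a parent of b if almost every document
    containing b also contains a, but not the other way around."""
    both = sum(1 for d in docs if a in d and b in d)
    return both / df[b] >= 0.8 and both / df[a] < 0.8

terms = [t for t, c in df.items() if c >= 2]  # ignore one-off terms
for parent in terms:
    children = [t for t in terms if t != parent and subsumes(parent, t)]
    if children:
        print(parent, "->", children)
```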

    Credibility analysis of textual claims with explainable evidence

    Despite being a vast resource of valuable information, the Web has been polluted by the spread of false claims. Increasing hoaxes, fake news, and misleading information on the Web have given rise to many fact-checking websites that manually assess doubtful claims. However, the rapid speed and large scale of misinformation spread have become a bottleneck for manual verification, calling for credibility assessment tools that can automate the verification process. Prior works in this domain make strong assumptions about the structure of the claims and the communities where they are made. Most importantly, the black-box techniques proposed in prior works cannot explain why a certain statement is deemed credible or not. To address these limitations, this dissertation proposes a general framework for automated credibility assessment that makes no assumptions about the structure or origin of the claims. Specifically, we propose a feature-based model, which automatically retrieves relevant articles about the given claim and assesses its credibility by capturing the mutual interaction between the language style of the relevant articles, their stance towards the claim, and the trustworthiness of the underlying web sources. We further enhance our credibility assessment approach with a neural-network-based model. Unlike the feature-based model, this model does not rely on feature engineering or external lexicons. Both models make their assessments interpretable by extracting explainable evidence from judiciously selected web sources. We use these models to develop a Web interface, CredEye, which enables users to automatically assess the credibility of a textual claim and to inspect the assessment by browsing through judiciously and automatically selected evidence snippets. In addition, we study the problem of stance classification and propose a neural-network-based model for predicting the stance of diverse user perspectives on controversial claims. Given a controversial claim and a user comment, our stance classification model predicts whether the comment supports or opposes the claim.
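    As a hedged sketch of the feature-based model's core idea, the snippet below combines per-article language-style, stance, and source-trust signals into a single credibility probability. The feature values, weights, and aggregation rule are illustrative assumptions rather than the dissertation's actual formulation.

```python
import math

# Per-article signals for one claim: (language-style objectivity in [0,1],
# stance toward the claim in {-1, +1}, source trustworthiness in [0,1]).
articles = [
    (0.9,  1.0, 0.8),
    (0.4, -1.0, 0.9),
    (0.7,  1.0, 0.3),
]

w_style, w_stance, bias = 1.0, 2.0, -0.5   # assumed; learned in practice

def credibility(claim_articles):
    """Each article votes with its stance, weighted by source trust and
    style objectivity; the average vote is squashed through a sigmoid."""
    votes = [w_style * style + w_stance * stance * trust
             for style, stance, trust in claim_articles]
    z = bias + sum(votes) / len(votes)
    return 1.0 / (1.0 + math.exp(-z))

print(f"P(claim credible) = {credibility(articles):.2f}")
```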