7 research outputs found

    Intelligent Support for Information Retrieval of Web Documents

    The main goal of this research was to investigate means of intelligent support for the retrieval of web documents. We propose the architecture of a web tool system, Trillian, which discovers users' interests without their interaction and uses them to search autonomously for related web content. Discovered pages are suggested to the user. Interest discovery is based on analysis of the documents the users have previously visited. We created a module for completely transparent tracking of the user's movement on the web, which logs both visited URLs and the contents of web pages. The post-analysis step is based on a variant of the suffix tree clustering algorithm. We focus primarily on the overall Trillian architecture design and the process of discovering topics of interest. We implemented an experimental prototype of Trillian and evaluated the quality, speed and usefulness of the proposed system. We have shown that clustering is a feasible technique for extracting interests from web documents. We consider the proposed architecture promising and suitable for future extensions.
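The clustering step can be illustrated with a rough sketch. It is not Trillian's implementation: it only imitates the base-cluster stage of suffix tree clustering, and it uses a plain dictionary of word n-grams in place of a real suffix tree. The function names, the `max_len` and `min_docs` parameters, and the scoring rule (document count times capped phrase length) are all assumptions made for illustration.

```python
from collections import defaultdict

def base_clusters(docs, max_len=3, min_docs=2):
    """Map each word phrase shared by at least `min_docs` documents
    to the set of document indices that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def score(phrase, doc_ids):
    # Assumed scoring: more documents and longer phrases rank higher.
    return len(doc_ids) * min(len(phrase), 6)

def top_interests(docs, k=3):
    """Return the k best-scoring shared phrases with their documents."""
    clusters = base_clusters(docs)
    ranked = sorted(clusters.items(), key=lambda kv: (-score(*kv), kv[0]))
    return [(" ".join(p), sorted(d)) for p, d in ranked[:k]]

if __name__ == "__main__":
    visited = [
        "suffix tree clustering of web pages",
        "web pages and suffix tree clustering",
        "cooking pasta at home",
    ]
    print(top_interests(visited, k=1))
```

A full suffix tree clustering pass would also merge base clusters whose document sets overlap heavily; that step is omitted here to keep the sketch short.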

    Mining Traversal Patterns from Weighted Traversals and Graph

    Many real-world problems can be modeled as a graph together with transactions that traverse it. For example, the link structure of web pages can be represented as a graph, and a user's path through those pages can be modeled as a transaction traversing that graph. Discovering important and valuable patterns from such traversal transactions is worthwhile. Previous work on finding these patterns proposed algorithms that simply look for frequent patterns without considering weights on the traversals or on the graph. The limitation of these algorithms is that they have difficulty discovering more reliable and accurate patterns. This thesis proposes two methods that discover patterns by taking into account weights attached to the traversals or to the vertices of the graph. The first method discovers frequent traversal patterns when the traversal information carries weights. Such weights may be, for example, the travel time between two cities or the time taken to move from one page to another when visiting a web site. To mine more accurate traversal patterns, the thesis uses confidence intervals from statistics: a confidence interval is computed from the weights attached to each edge over all traversals, and only traversals that fall within the interval are accepted as valid. Applying this method yields more reliable traversal patterns. The thesis also presents a method for determining priorities among the resulting patterns, using the patterns and the graph information, along with an algorithm for improving performance. The second method discovers weighted frequent traversal patterns when weights are attached to the vertices of the graph. Vertex weights may be, for example, the amount of information in, or the importance of, each document in a web site. In this problem, deciding whether a traversal pattern is frequent requires considering both how often the pattern occurs and the weights of the visited vertices. To this end, the thesis proposes an algorithm that uses the vertex weights to keep, rather than prune, at each mining step those candidate patterns that may still become frequent later. It also proposes an algorithm that reduces the number of candidate patterns to improve performance. Both methods were compared and analyzed in a variety of experiments in terms of execution time and the number of patterns generated. In summary, the thesis proposes new methods for discovering frequent traversal patterns when the traversals are weighted and when the vertices of the graph are weighted. Applying these methods to areas such as web mining should enable efficient restructuring of web sites, faster access to web documents, and personalized web documents for individual users.

    Abstract
    Chapter 1 Introduction
      1.1 Overview
      1.2 Motivations
      1.3 Approach
      1.4 Organization of Thesis
    Chapter 2 Related Works
      2.1 Itemset Mining
      2.2 Weighted Itemset Mining
      2.3 Traversal Mining
      2.4 Graph Traversal Mining
    Chapter 3 Mining Patterns from Weighted Traversals on Unweighted Graph
      3.1 Definitions and Problem Statements
      3.2 Mining Frequent Patterns
        3.2.1 Augmentation of Base Graph
        3.2.2 In-Mining Algorithm
        3.2.3 Pre-Mining Algorithm
        3.2.4 Priority of Patterns
      3.3 Experimental Results
    Chapter 4 Mining Patterns from Unweighted Traversals on Weighted Graph
      4.1 Definitions and Problem Statements
      4.2 Mining Weighted Frequent Patterns
        4.2.1 Pruning by Support Bounds
        4.2.2 Candidate Generation
        4.2.3 Mining Algorithm
      4.3 Estimation of Support Bounds
        4.3.1 Estimation by All Vertices
        4.3.2 Estimation by Reachable Vertices
      4.4 Experimental Results
    Chapter 5 Conclusions and Further Works
    Reference
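
The first method in this abstract, which uses a statistical interval over edge weights to decide which traversals count, might look roughly like the sketch below. The abstract does not give the exact interval formula, so the sketch assumes "mean plus or minus `z` population standard deviations per edge" as a stand-in. The data layout (a traversal as a list of `(edge, weight)` pairs) and the function names are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def edge_intervals(traversals, z=1.96):
    """Compute an assumed acceptance interval (mean +/- z * std) per edge."""
    weights = defaultdict(list)
    for traversal in traversals:
        for edge, weight in traversal:
            weights[edge].append(weight)
    intervals = {}
    for edge, ws in weights.items():
        m, s = mean(ws), pstdev(ws)
        intervals[edge] = (m - z * s, m + z * s)
    return intervals

def valid_traversals(traversals, z=1.96):
    """Keep only traversals whose every edge weight lies inside its interval."""
    iv = edge_intervals(traversals, z)
    return [
        t for t in traversals
        if all(iv[edge][0] <= weight <= iv[edge][1] for edge, weight in t)
    ]

if __name__ == "__main__":
    # Four ordinary page transitions and one unusually slow one.
    logs = [[(("A", "B"), 10)] for _ in range(4)] + [[(("A", "B"), 100)]]
    print(len(valid_traversals(logs)))
```

Frequent-pattern counting would then run on the filtered traversals only, so a page transition with an abnormal dwell time does not contribute to a pattern's support.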

    Mining of uncertain Web log sequences with access history probabilities

    An uncertain data sequence is a sequence of data items that exist with some level of doubt or probability. Each item in an uncertain sequence is represented by a label and a probability value between 0 and 1, referred to as its existential probability. Existing algorithms are either unsuitable or inefficient for discovering frequent sequences in uncertain data. This thesis presents a method for mining uncertain Web sequences that combines access history probabilities from several Web log sessions with features of the PLWAP web sequential miner. The method, the Uncertain Position Coded Pre-order Linked Web Access Pattern (U-PLWAP) algorithm, mines frequent sequential patterns in uncertain web logs. Whereas PLWAP considers only a single session of web logs, U-PLWAP draws on multiple sessions, from which the existential probabilities are generated. Experiments show that U-PLWAP is at least 100% faster than U-apriori and 33% faster than UF-growth. UF-growth also fails to take the order of the items into account, so U-PLWAP's results carry richer information.
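
To make existential probabilities concrete, the sketch below computes the expected support of an itemset, the measure used by U-Apriori-style miners: for each transaction that contains every item, multiply the items' probabilities, then sum over transactions. It is not U-PLWAP. It ignores item order, which is the limitation the abstract attributes to UF-growth. The dictionary-based data layout is an assumption for illustration.

```python
from math import prod

def expected_support(itemset, transactions):
    """Sum, over transactions containing every item, of the product of
    the items' existential probabilities (items assumed independent)."""
    total = 0.0
    for t in transactions:  # t maps item label -> existential probability
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total

if __name__ == "__main__":
    sessions = [
        {"a": 0.9, "b": 0.5},
        {"a": 0.4},
        {"a": 1.0, "b": 1.0},
    ]
    print(expected_support({"a", "b"}, sessions))
    print(expected_support({"a"}, sessions))
```

An itemset is then treated as frequent when its expected support reaches a user-chosen minimum threshold.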

    Mining Web Log Sequential Patterns with Position Coded Pre-Order Linked WAP-Tree

    Finding Generalized Path Patterns for Web Log Data Mining

    Data mining on web server logs involves determining frequently occurring access sequences. We examine the problem of finding traversal patterns in web logs, taking into account that accesses made purely for navigation, and irrelevant to the pattern, may be interleaved within access patterns. We define a general type of pattern that allows for such interleaving, and we present a level-wise algorithm for finding these patterns that is based on the underlying structure of the web site. The performance of the algorithm and its sensitivity to several parameters are examined experimentally with synthetic data.
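
The idea of tolerating interleaved navigational accesses can be shown with a small sketch that contrasts strict contiguous matching with gap-tolerant subsequence matching. This covers only the containment test and a support count. The level-wise candidate generation and the use of the site's link structure described in the abstract are not reproduced, and the function names are hypothetical.

```python
def contains_contiguous(session, pattern):
    """True if the pattern's pages appear consecutively in the session."""
    n = len(pattern)
    return any(session[i:i + n] == pattern
               for i in range(len(session) - n + 1))

def contains_with_gaps(session, pattern):
    """True if the pattern's pages appear in order, with other accesses
    allowed in between (ordinary subsequence containment)."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so each pattern page must be found after the previous one.
    return all(page in remaining for page in pattern)

def support(sessions, pattern, contains=contains_with_gaps):
    """Fraction of sessions that contain the pattern."""
    return sum(contains(s, pattern) for s in sessions) / len(sessions)

if __name__ == "__main__":
    sessions = [
        ["A", "X", "B", "C"],   # X is an interleaved navigational access
        ["A", "B", "Y", "C"],
        ["B", "A", "C"],
    ]
    pattern = ["A", "B", "C"]
    print(support(sessions, pattern))
    print(support(sessions, pattern, contains_contiguous))
```

With contiguous matching the pattern is found in none of the example sessions, while gap-tolerant matching finds it in two of the three.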