
20 research outputs found

Content-Based Image Retrieval under Various Deep Learning Environments

Author: Jang Young Kyun (장영균)
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2022

Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2022. Advisor: Nam Ik Cho (조남익).
Content-based image retrieval, which finds images relevant to a query in a huge database, is one of the fundamental tasks in computer vision. For fast and accurate retrieval, Approximate Nearest Neighbor (ANN) search approaches, represented by hashing and Product Quantization (PQ), have attracted attention in the image retrieval community. Since CNN-based deep learning has shown excellent performance in many computer vision tasks, both hashing-based and PQ-based image retrieval systems have adopted deep learning for improvement. This dissertation investigates image retrieval methods under various deep learning conditions in order to propose appropriate retrieval systems. Specifically, considering the purpose of image retrieval, supervised learning methods are proposed to develop deep hashing systems that retrieve semantically similar images, and semi-supervised and unsupervised learning methods are proposed to build deep product quantization systems that retrieve images that are both semantically and visually similar.
Moreover, considering the characteristics of image retrieval databases, face image sets with numerous class categories and general image sets whose images carry one or more labels are treated separately when building a retrieval system. First, supervised learning with the semantic labels given to images is introduced to build a hashing-based retrieval system. To address the difficulties of distinguishing face images, such as inter-class similarities (similar appearance between different persons) and intra-class variations (the same person with different poses, facial expressions, and illumination), the identity label of each image is used to derive discriminative binary codes. To further improve face image retrieval quality, the Similarity Guided Hashing (SGH) scheme is proposed, in which self-similarity learning with multiple data augmentation results is employed during training. For hashing-based general image retrieval, the Deep Hash Distillation (DHD) scheme is proposed, in which trainable hash proxies that serve as class-wise representatives are introduced to take advantage of supervised signals. In addition, a self-distillation scheme adapted for hashing is used to improve general image retrieval performance by appropriately exploiting the potential of augmented data. Second, semi-supervised learning that uses both labeled and unlabeled image data is investigated to build a PQ-based retrieval system. Although supervised deep methods show excellent performance, they fall short of expectations unless sufficient expensive label information is available. In addition, large amounts of unlabeled image data are excluded from training. To resolve this issue, a vector quantization-based semi-supervised image retrieval scheme, the Generalized Product Quantization (GPQ) network, is proposed.
A novel metric learning strategy that preserves semantic similarity between labeled data, and an entropy regularization term that fully exploits the inherent potential of unlabeled data, are employed to improve the retrieval system. This solution increases the generalization capacity of the quantization network, which makes it possible to overcome the previous limitations. Lastly, to enable the network to retrieve visually similar images on its own without any human supervision, an unsupervised learning algorithm is explored. Although deep supervised hashing and PQ methods achieve outstanding retrieval performance compared with conventional methods by fully exploiting label annotations, assigning precise labels to a vast amount of training data is painstaking, and the annotation process is error-prone. To tackle these issues, a deep unsupervised image retrieval method called the Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner, is proposed. A newly designed Cross Quantized Contrastive learning strategy jointly learns the PQ codewords and the deep visual representations by comparing individually transformed images (views). This allows the network to understand the image content and extract descriptive features so that visually accurate retrieval can be performed. Extensive image retrieval experiments on benchmark datasets confirm that the proposed methods yield outstanding results under various evaluation protocols. For supervised face image retrieval, SGH achieves the best retrieval performance on both low- and high-resolution face images, and DHD demonstrates its efficiency in general image retrieval experiments with state-of-the-art retrieval performance. For semi-supervised general image retrieval, GPQ shows the best retrieval results under protocols that use both labeled and unlabeled image data.
Finally, for unsupervised general image retrieval, the best retrieval scores are achieved with SPQ even without supervised pre-training, and visually similar images are observed to be successfully retrieved as search results.
Contents:
1 Introduction
1.1 Contribution; 1.2 Contents
2 Supervised Learning for Deep Hashing: Similarity Guided Hashing for Face Image Retrieval / Deep Hash Distillation for General Image Retrieval
2.1 Motivation and Overview for Face Image Retrieval; 2.1.1 Related Works; 2.2 Similarity Guided Hashing; 2.3 Experiments; 2.3.1 Datasets and Setup; 2.3.2 Results on Small Face Images; 2.3.3 Results on Large Face Images; 2.4 Motivation and Overview for General Image Retrieval; 2.5 Related Works; 2.6 Deep Hash Distillation; 2.6.1 Self-distilled Hashing; 2.6.2 Teacher loss; 2.6.3 Training; 2.6.4 Hamming Distance Analysis; 2.7 Experiments; 2.7.1 Setup; 2.7.2 Implementation Details; 2.7.3 Results; 2.7.4 Analysis
3 Semi-supervised Learning for Product Quantization: Generalized Product Quantization Network for Semi-supervised Image Retrieval
3.1 Motivation and Overview; 3.1.1 Related Work; 3.2 Generalized Product Quantization; 3.2.1 Semi-Supervised Learning; 3.2.2 Retrieval; 3.3 Experiments; 3.3.1 Setup; 3.3.2 Results and Analysis
4 Unsupervised Learning for Product Quantization: Self-supervised Product Quantization for Deep Unsupervised Image Retrieval
4.1 Motivation and Overview; 4.1.1 Related Works; 4.2 Self-supervised Product Quantization; 4.2.1 Overall Framework; 4.2.2 Self-supervised Training; 4.3 Experiments; 4.3.1 Datasets; 4.3.2 Experimental Settings; 4.3.3 Results; 4.3.4 Empirical Analysis
5 Conclusion
Abstract (In Korean)
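
For readers unfamiliar with the product quantization (PQ) that this abstract builds on, the following is a minimal sketch of classical, non-deep PQ in Python with NumPy. It is not the GPQ or SPQ network from the thesis: the codebooks here come from plain k-means rather than a trained network, the data is random, and all function names (`kmeans`, `train_pq`, `encode`, `asymmetric_distances`) are hypothetical names chosen for illustration.

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids (the codewords of one subspace)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):               # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def train_pq(X, n_sub, k):
    """Split the feature dimensions into n_sub blocks and learn one codebook per block."""
    return [kmeans(part, k) for part in np.split(X, n_sub, axis=1)]

def encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    parts = np.split(X, len(codebooks), axis=1)
    codes = [((p[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for p, c in zip(parts, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def asymmetric_distances(query, codes, codebooks):
    """Squared distance from an unquantized query to every encoded database item,
    computed with one small lookup table per subspace."""
    q_parts = np.split(query, len(codebooks))
    tables = [((c - q) ** 2).sum(axis=1) for q, c in zip(q_parts, codebooks)]
    return sum(tables[m][codes[:, m]] for m in range(len(codebooks)))

# Toy data: 200 random 8-D vectors, 4 subspaces, 16 codewords each (all assumed values).
rng = np.random.default_rng(0)
database = rng.standard_normal((200, 8)).astype(np.float32)
codebooks = train_pq(database, n_sub=4, k=16)
codes = encode(database, codebooks)        # each item is stored as 4 small integers
approx = asymmetric_distances(database[0], codes, codebooks)
ranking = np.argsort(approx)               # approximate nearest neighbours first
```

In this sketch each database vector is stored as four one-byte codes instead of eight floats, and a query is compared against the whole database through table lookups, which is the property that makes PQ attractive for ANN search.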

SNU Open Repository and Archive

Program analysis for Android security and reliability

Author: Rahaman Sydur
Publication venue: Digital Commons @ NJIT
Publication date: 31/08/2023

The recent, widespread growth and adoption of mobile devices have revolutionized the way users interact with technology. As mobile apps have become increasingly prevalent, concerns regarding their security and reliability have gained significant attention. The ever-expanding mobile app ecosystem presents unique challenges in ensuring the protection of user data and maintaining app robustness. This dissertation expands the field of program analysis with techniques and abstractions tailored explicitly to enhancing Android security and reliability. This research introduces approaches for addressing critical issues related to sensitive information leakage, device and user fingerprinting, mobile medical score calculators, and termination-induced data loss. Through a series of comprehensive studies that employ novel approaches combining static and dynamic analysis, this work provides valuable insights and practical solutions to these challenges. In summary, this dissertation makes the following contributions: (1) precise identifier leak tracking via a novel algebraic representation of leak signatures, (2) identifier processing graphs (IPGs), an abstraction for extracting and subverting user-based and device-based fingerprinting schemes, (3) interval-based verification of medical score calculator correctness, and (4) identification of potential data losses caused by app termination.

Digital Commons @ New Jersey Institute of Technology (NJIT)

Learning to hash for large scale image retrieval

Author: Moran Sean James
Publication venue: The University of Edinburgh
Publication date: 27/06/2016

This thesis is concerned with improving the effectiveness of nearest neighbour search. Nearest neighbour search is the problem of finding the most similar data-points to a query in a database, and is a fundamental operation that has found wide applicability in many fields. In this thesis the focus is placed on hashing-based approximate nearest neighbour search methods that generate similar binary hashcodes for similar data-points. These hashcodes can be used as the indices into the buckets of hashtables for fast search. This work explores how the quality of search can be improved by learning task specific binary hashcodes. The generation of a binary hashcode comprises two main steps carried out sequentially: projection of the image feature vector onto the normal vectors of a set of hyperplanes partitioning the input feature space followed by a quantisation operation that uses a single threshold to binarise the resulting projections to obtain the hashcodes. The degree to which these operations preserve the relative distances between the datapoints in the input feature space has a direct influence on the effectiveness of using the resulting hashcodes for nearest neighbour search. In this thesis I argue that the retrieval effectiveness of existing hashing-based nearest neighbour search methods can be increased by learning the thresholds and hyperplanes based on the distribution of the input data. The first contribution is a model for learning multiple quantisation thresholds. I demonstrate that the best threshold positioning is projection specific and introduce a novel clustering algorithm for threshold optimisation. The second contribution extends this algorithm by learning the optimal allocation of quantisation thresholds per hyperplane. In doing so I argue that some hyperplanes are naturally more effective than others at capturing the distribution of the data and should therefore attract a greater allocation of quantisation thresholds. 
The third contribution focuses on the complementary problem of learning the hashing hyperplanes. I introduce a multi-step iterative model that, in the first step, regularises the hashcodes over a data-point adjacency graph, which encourages similar data-points to be assigned similar hashcodes. In the second step, binary classifiers are learnt to separate opposing bits with maximum margin. This algorithm is extended to learn hyperplanes that can generate similar hashcodes for similar data-points in two different feature spaces (e.g. text and images). Individually, the performance of these algorithms is often superior to competitive baselines. I unify my contributions by demonstrating that learning hyperplanes and thresholds as part of the same model can yield an additive increase in retrieval effectiveness.
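
The two-step hashcode generation described in this abstract (projection onto the normal vectors of a set of hyperplanes, then binarisation with a single threshold) can be illustrated with a short Python sketch. It shows only the data-independent baseline, not the learned thresholds or hyperplanes that the thesis proposes: the hyperplanes are random, the threshold is fixed at zero, and the names `hash_codes` and `hamming_distance` are hypothetical.

```python
import numpy as np

def hash_codes(X, normals, threshold=0.0):
    """Step 1: project each row of X onto the hyperplane normal vectors.
    Step 2: binarise every projection with one shared threshold."""
    projections = X @ normals.T                  # shape: (n_points, n_bits)
    return (projections > threshold).astype(np.uint8)

def hamming_distance(a, b):
    """Number of bit positions in which two hashcodes differ."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
normals = rng.standard_normal((16, 8))           # 16 random hyperplanes in an 8-D feature space
x = rng.standard_normal(8)                       # a query point
near = x + 0.01 * rng.standard_normal(8)         # a slightly perturbed neighbour
far = -x                                         # the point opposite to x

codes = hash_codes(np.stack([x, near, far]), normals)
d_near = hamming_distance(codes[0], codes[1])
d_far = hamming_distance(codes[0], codes[2])     # every bit flips for the opposite point
```

Because `far` lies on the opposite side of every hyperplane through the origin, its code is the bitwise complement of the code of `x`, while the perturbed neighbour differs in at most a few bits. The thesis's argument is that choosing the thresholds and hyperplanes from the data distribution, instead of fixing them as here, preserves neighbourhood structure better.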

Edinburgh Research Archive

Towards Privacy and Security Concerns of Adversarial Examples in Deep Hashing Image Retrieval

Author: Xiao Yanru
Publication venue: ODU Digital Commons
Publication date: 01/12/2022

With the explosive growth of images on the internet, image retrieval based on deep hashing has attracted attention from both the research and industry communities. Empowered by deep neural networks (DNNs), deep hashing enables fast and accurate image retrieval on large-scale data. However, like deep learning in general, deep hashing remains vulnerable to specifically designed inputs called adversarial examples. By adding imperceptible perturbations to inputs, adversarial examples fool DNNs into making wrong decisions. The existence of adversarial examples not only raises security concerns for real-world deep learning applications, but also provides us with a technique to confront malicious applications. In this dissertation, we investigate privacy and security concerns in deep hashing image retrieval systems related to adversarial examples. Starting with a privacy concern, we take the users' side to protect private information in images, which adversaries can extract by retrieving similar images from image retrieval systems. Existing image processing-based privacy-preserving methods suffer from a trade-off between efficacy and usability. We propose a method that introduces imperceptible adversarial perturbations into original images to prevent them from being retrieved. Users upload protected adversarial images instead of the original images to preserve privacy while maintaining usability. Then we shift to the security concerns. We act as attackers, proactively providing adversarial images to retrieval systems. These adversarial examples are embedded to specific targets so that the user's retrieval results contain our unrelated adversarial images, e.g., users query with a “Husky dog” image but retrieve adversarial “dog food” images in the result. A transferability-based attack is proposed for black-box models. We improve black-box transferability by using random noise as the proxy in optimization, achieving a state-of-the-art success rate.
Finally, we take the retrieval system's side to mitigate the security concerns of adversarial attacks in deep hashing image retrieval. We propose a method that detects adversarial examples at inference time. By studying the unique adversarial behaviors in deep hashing image retrieval, we build the proposed method on criteria derived from these behaviors. The proposed method detects most of the adversarial examples with minimal overhead.

Old Dominion University

Expanding The NIF Ecosystem - Corpus Conversion, Parsing And Processing Using The NLP Interchange Format 2.0

Author: Brümmer Martin
Publication date: 26/02/2018

This work presents a thorough examination and expansion of the NIF ecosystem.

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Qucosa - Publikationsserver der Universität Leipzig

Theory-Guided Algorithm Design for Scalable Machine Learning

Author: Cao Yiting
Publication date: 01/05/2023

My thesis focuses on designing scalable machine learning algorithms that leverage theoretical advances in mathematics. In particular, I investigate two directions where scalability plays an important role: fair machine learning and randomized feature representations. In fair machine learning, my research concentrates on achieving individual fairness in the single-model and decoupled-model settings with minimum data labeling budgets. For randomized feature representations, I propose a model-agnostic framework for designing computationally efficient randomized machine learning algorithms with provable performance guarantees, which demonstrates that it is not necessary for individual models to be weakly trained before they are optimally ensembled. Furthermore, I contribute to the scalable estimation of the kernel matrix spectral norm. Specifically, I propose applying sketching techniques to efficiently estimate the spectral norm, theoretically derive the estimation error, and empirically demonstrate the estimation efficiency in a time-constrained setting.

SHAREOK repository

Automatic learning for the classification of chemical reactions and in statistical thermodynamics

Author: Latino Diogo Alexandre Rosa Serra
Publication venue: FCT - UNL
Publication date: 01/01/2008

This thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. Such a representation can be applied in the automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave similar results.
The use of supervised learning techniques improved the results: to 77% of correct assignments when an ensemble of ten FFNNs was used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer validation of classification systems and genome-scale reconstruction (or comparison) of metabolic pathways to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers that are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and to automatically assign EC numbers to reactions not yet officially classified.
In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors, and was submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number levels, respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs: EC numbers were assigned at the class, subclass and sub-subclass levels with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP/SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways.
Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway, that is, a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. The NN models show a remarkable ability to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task.
The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform as well as or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au(111) interface, which is the case studied in the present thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.

Repositório da Universidade Nova de Lisboa

Automatic Generation of Thematically Focused Information Portals from Web Data

Author: Sizov Sergej
Publication venue: Sonstige Einrichtungen (other institutions)
Publication date: 01/01/2005

Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for the automatic generation of thematically focused portals from Web data. The key component of the proposed Web retrieval framework is a thematically focused Web crawler that is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the most likely relevant Web sources for further downloads. We show that the human effort needed to prepare the focused crawl can be minimized by automatically extending the training dataset with additional training samples called archetypes. This thesis introduces the combination of classification results and link-based authority ranking methods for selecting archetypes, together with periodic re-training of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss results of comprehensive use-case studies and evaluations with the prototype system BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinements of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e. by leaving out some uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce a methodology of collaborative crawl postprocessing for multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of the thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles.
Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This goal is achieved by weighting edges in the link graph of the crawl and by adding virtual links between highly relevant documents of the topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.

Universaar

MPG.PuRe

Acronym