449 research outputs found

    Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling

    Spambot detection in online social networks is a long-standing challenge involving the study and design of detection techniques capable of efficiently identifying ever-evolving spammers. Recently, a new wave of social spambots has emerged, with advanced human-like characteristics that allow them to go undetected even by current state-of-the-art algorithms. In this paper, we show that efficient spambot detection can be achieved via an in-depth analysis of their collective behaviors, exploiting the digital DNA technique for modeling the behaviors of social network users. Inspired by its biological counterpart, the digital DNA representation encodes the behavioral lifetime of a digital account in a sequence of characters. We then define a similarity measure for such digital DNA sequences. We build upon digital DNA and the similarity between groups of users to characterize both genuine accounts and spambots. Leveraging this characterization, we design the Social Fingerprinting technique, which discriminates between spambots and genuine accounts in both a supervised and an unsupervised fashion. Finally, we evaluate the effectiveness of Social Fingerprinting and compare it with three state-of-the-art detection algorithms. Among the peculiarities of our approach are the possibility to apply off-the-shelf DNA analysis techniques to study online users' behaviors and the ability to rely on a limited number of lightweight account characteristics.
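The collective-behavior idea above can be sketched in a few lines. This is a hedged illustration: the three-letter alphabet, the encoding of actions, and the use of longest-common-substring length as the similarity measure are assumptions for this sketch, not the paper's exact definitions. Each account's timeline becomes a string, and near-duplicate automated behavior shows up as a long shared substring:

```python
# Hypothetical digital-DNA sketch: encode each account's behavioral lifetime
# as a string over a small alphabet (here A = tweet, C = reply, T = retweet;
# the alphabet is an assumption) and compare accounts by their longest
# common substring.

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by two DNA-like sequences."""
    # Classic dynamic-programming solution, O(len(a) * len(b)) time,
    # O(len(b)) memory.
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Two spambots with near-identical timelines share a long substring;
# a genuine account, with heterogeneous behavior, does not.
bot1 = "ACTACTACTACT"   # repetitive tweet/reply/retweet pattern
bot2 = "ACTACTACTTAC"
human = "ATCCATAACTTC"

print(longest_common_substring(bot1, bot2))   # long shared behavior
print(longest_common_substring(bot1, human))  # short shared behavior
```

Groups of accounts whose pairwise similarity is unusually high are candidate spambot groups; genuine users share only short substrings by chance.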

    Classification of time series patterns from complex dynamic systems

    Algumas aplicações da Inteligência Artificial em Biotecnologia (Some applications of Artificial Intelligence in Biotechnology)

    The present work is a review of neural networks. It begins with a short introduction to neural networks and fuzzy logic, a brief history, and then surveys applications of neural networks in biotechnology. The chosen sub-areas are solid-state fermentation optimization, DNA sequencing, molecular sequence analysis, quantitative structure-activity relationships, soft sensing, spectra interpretation, and data mining, each of which uses a particular kind of neural network, such as feedforward, recurrent, Siamese, or ART networks. Applications of neural networks to spectra interpretation and quantitative structure-activity relationships are direct applications to chemistry and, consequently, also to biochemistry and biotechnology. Soft sensing is a notable example for biotechnology: it is a method for estimating variables that normally cannot be measured directly. Solid-state fermentation was optimized, resulting in a strong increase in production efficiency.

    AP: Artificial Programming

    The ability to automatically discover a program consistent with a given user intent (specification) is the holy grail of Computer Science. While significant progress has been made on the so-called problem of Program Synthesis, a number of challenges remain, particularly for synthesizing richer and larger programs. This is in large part due to the difficulty of searching over the space of programs. In this paper, we argue that this challenge can be tackled by learning synthesizers automatically from a large amount of training data. We present a first step in this direction by describing a novel synthesis approach based on two neural architectures that tackle the two key challenges: learning to understand partial input-output specifications and learning to search over the space of programs. The first architecture, called the Spec Encoder, computes a continuous representation of the specification, whereas the second, called the Program Generator, incrementally constructs programs in a hypothesis space conditioned on the specification vector. The key idea of the approach is to train these architectures using a large set of (spec, P) pairs, where P denotes a program sampled from the DSL L and spec denotes the corresponding specification satisfied by P. We demonstrate the effectiveness of our approach on two preliminary instantiations. The first, called Neural FlashFill, corresponds to the domain of string manipulation programs similar to that of FlashFill. The second considers string transformation programs consisting of compositions of API functions. We show that a neural system is able to learn a large majority of programs from few input-output examples. We believe this new approach will not only dramatically expand the applicability and effectiveness of Program Synthesis, but also bring the Program Synthesis and Machine Learning research disciplines closer together.
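The (spec, P) training-data idea can be illustrated with a toy DSL. This is a hypothetical sketch: the four operations below are invented for illustration and are far simpler than a FlashFill-style DSL; the point is only how sampling a program and running it on random inputs yields an input-output specification to pair with it:

```python
# Sample (spec, P) training pairs from a tiny, invented string DSL:
# pick a program P at random, run it on random inputs, and record the
# resulting input/output examples as the specification spec.
import random

DSL = {
    "upper":   str.upper,
    "lower":   str.lower,
    "reverse": lambda s: s[::-1],
    "first3":  lambda s: s[:3],
}

def sample_pair(rng: random.Random, n_examples: int = 3):
    """Sample a program name from the DSL and build its I/O specification."""
    name = rng.choice(sorted(DSL))
    prog = DSL[name]
    inputs = ["".join(rng.choice("abcdeABCDE") for _ in range(5))
              for _ in range(n_examples)]
    spec = [(x, prog(x)) for x in inputs]
    return spec, name

rng = random.Random(0)
spec, name = sample_pair(rng)
print(name, spec)
```

A synthesizer is then trained to map spec back to a program consistent with it; at scale this replaces hand-crafted search with learned search.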

    Building a finite state automaton for physical processes using queries and counterexamples on long short-term memory models

    Most neural networks (NNs) are used as black-box functions: a network takes an input and produces an output, without the user knowing what rules and system dynamics produced that specific output. In some situations, such as safety-critical applications, being able to understand and validate models before applying them can be crucial. In this regard, some approaches for representing NNs in more understandable ways attempt to extract symbolic knowledge from the networks as interpretable and simple systems consisting of a finite set of states and transitions, known as deterministic finite-state automata (DFA). In this thesis, we consider a rule extraction approach developed by Weiss et al. that employs the exact learning method L* to extract DFA from recurrent neural networks (RNNs) trained to classify symbolic data sequences. Our aim has been to study the practicality of applying their rule extraction approach to more complex data based on physical processes consisting of continuous values. Specifically, we experimented with datasets of varying complexity, considering both the inherent complexity of the dataset itself and the complexity introduced by the different discretization intervals used to represent the continuous data values. The datasets in this thesis encompass sine wave prediction datasets, sequence value prediction datasets, and a safety-critical well-drilling pressure scenario generated with the well-drilling simulator OpenLab and the sparse identification of nonlinear dynamical systems (SINDy) algorithm. We observe that the rule extraction algorithm is able to extract simple and small DFA representations of LSTM models. On the considered datasets, the extracted DFA generally perform worse than the LSTM models they were extracted from, and performance decreases both with increasing problem complexity and with more discretization intervals. However, DFA extracted from datasets discretized using few intervals yield better results, and in some cases the algorithm extracts DFA that outperform their respective LSTM models. Master's thesis in informatics.
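The discretization step described above can be sketched as follows. The equal-width binning and the letter alphabet are assumptions for this illustration, not necessarily the thesis's exact scheme; the point is how a real-valued time series becomes a symbolic sequence that an RNN, and an extracted DFA, can consume:

```python
# Map continuous sensor values into n equal-width intervals over [lo, hi],
# turning a real-valued series into a symbolic sequence. More intervals
# mean a larger alphabet and, per the thesis, harder DFA extraction.
import math

def discretize(values, n_intervals: int, lo: float, hi: float) -> str:
    """Map each value in [lo, hi] to one of n_intervals symbols 'a', 'b', ..."""
    width = (hi - lo) / n_intervals
    symbols = []
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_intervals - 1)  # clamp boundary values
        symbols.append(chr(ord("a") + idx))
    return "".join(symbols)

# A sine-like series discretized with 4 intervals over [-1, 1]:
series = [math.sin(t / 5) for t in range(20)]
print(discretize(series, 4, -1.0, 1.0))
```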

    딥러닝 기반의 분자 특성 예측 연구 (A study on deep learning-based molecular property prediction)

    Doctoral dissertation, Seoul National University Graduate School, Interdisciplinary Program in Bioinformatics, College of Natural Sciences, August 2021 (advisor: Sungroh Yoon). Deep learning (DL) has advanced various fields, such as vision tasks, language processing, and the natural sciences. Recently, several remarkable results in computational chemistry were achieved by DL-based methods. However, a chemical system consists of diverse elements and their interactions, so it is not trivial to predict chemical properties, which are determined by intrinsically complicated factors. Consequently, conventional approaches usually depend on tremendous amounts of calculation for chemical simulations or predictions, which is cost-intensive and time-consuming. To address these issues, we studied deep learning for computational chemistry, focusing on chemical property prediction from molecular structure representations. A molecular structure is a complex of atoms and their arrangements, and a molecular property is determined by the interactions among all these components. Therefore, molecular structural representations are the key factor in chemical property prediction tasks. In particular, we explored public property prediction tasks in pharmacology, organic chemistry, and quantum chemistry. Molecular structures can be described as categorical sequences or geometric graphs. We utilized both representational formats for prediction tasks and achieved competitive model performance. Our studies verified that the molecular representation is essential for various tasks in chemistry, and that using an appropriate type of neural network for the representation type is significant for model predictability.
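The "molecular structure as categorical sequence" representation can be sketched with SMILES strings. This is a simplified illustration: real pipelines use proper SMILES tokenizers that handle multi-character tokens such as Cl or Br, whereas this sketch maps single characters to integer indices for a sequence model:

```python
# Turn a SMILES string into a sequence of integer token ids, building the
# vocabulary on the fly. A sequence model (RNN, Transformer, ...) would
# consume these ids; character-level tokenization is a simplification.

def encode_smiles(smiles: str, vocab: dict[str, int]) -> list[int]:
    """Map each SMILES character to its index, growing the vocab as needed."""
    ids = []
    for ch in smiles:
        if ch not in vocab:
            vocab[ch] = len(vocab)
        ids.append(vocab[ch])
    return ids

vocab: dict[str, int] = {}
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # SMILES for aspirin
print(encode_smiles(aspirin, vocab))
print(vocab)
```

The graph alternative mentioned in the abstract would instead parse the same string into atoms (nodes) and bonds (edges) for a graph neural network.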

    Mining protein loops using a structural alphabet and statistical exceptionality

    Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is an uneasy task. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. Results: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs, without restriction on loop length. This method is based on the structural alphabet HMM-SA, which simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments from a bank of 93,000 protein loops and grouped them according to their structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3,310 highly recurrent structural words out of 28,274 observed words. These structural words have low structural variability (mean RMSD of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. Conclusions: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA rather than on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and may decrease the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
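The word-mining step can be sketched as a sliding-window count over structural-letter strings. The alphabet and parameters below are illustrative assumptions (HMM-SA defines 27 structural letters, and a seven-residue fragment corresponds to four overlapping letters, hence the four-letter word length):

```python
# Slide a fixed-length window over each loop's structural-letter string,
# collect the resulting "structural words", and keep those observed more
# than a threshold number of times (the paper uses > 30; smaller here).
from collections import Counter

def mine_words(letter_strings, word_len: int = 4, min_count: int = 3):
    """Count fixed-length substrings across all strings; keep recurrent ones."""
    counts = Counter()
    for s in letter_strings:
        for i in range(len(s) - word_len + 1):
            counts[s[i:i + word_len]] += 1
    return {w: c for w, c in counts.items() if c > min_count}

# Toy loop bank: the word "abcd" recurs across loops and is reported.
loops = ["abcdabcda", "xabcdz", "abcdabcd"]
print(mine_words(loops, word_len=4, min_count=3))
```

Grouping fragments by exact word identity is what lets conventional sequence-analysis machinery replace costly structural alignment.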