5,514 research outputs found

    A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Induction

    This paper proposes a genetic programming (GP) framework for two major data mining tasks, namely classification and generalized rule induction. The framework emphasizes the integration between a GP algorithm and relational database systems. In particular, the fitness of individuals is computed by submitting SQL queries to a (parallel) database server. Advantages of this integration from a data mining viewpoint include scalability, data-privacy control, and automatic parallelization.
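
    As a flavor of the idea, here is a minimal sketch, assuming a hypothetical table named patients and an invented rule antecedent; the fitness of a GP individual encoding a rule is computed by translating the rule into a WHERE clause and issuing two COUNT queries (sqlite3 stands in for the parallel database server the paper targets).

        # Minimal sketch: rule fitness via SQL COUNT queries (table, columns
        # and rule are invented for illustration; not the paper's own code).
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE patients (age INT, bp INT, sick INT)")
        con.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                        [(70, 150, 1), (65, 140, 1), (30, 110, 0), (45, 120, 0)])

        def rule_fitness(predicate, target="sick = 1"):
            """Confidence of the rule 'predicate -> target' via two COUNT queries."""
            covered = con.execute(
                f"SELECT COUNT(*) FROM patients WHERE {predicate}").fetchone()[0]
            correct = con.execute(
                f"SELECT COUNT(*) FROM patients WHERE {predicate} AND {target}").fetchone()[0]
            return correct / covered if covered else 0.0

        # A GP individual might encode the antecedent 'age > 60 AND bp > 130':
        print(rule_fitness("age > 60 AND bp > 130"))  # 1.0 on this toy table

    Because each fitness evaluation is a plain SQL query, the database server can parallelize it transparently, which is where the scalability and automatic-parallelization advantages come from.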

    FAST: FAST Analysis of Sequences Toolbox.

    FAST (FAST Analysis of Sequences Toolbox) provides simple, powerful open-source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can simplify the documentation and reproducibility of bioinformatic protocols, supporting better transparency in biological data science. Interface self-consistency and conformity with the conventions of GNU, Matlab, Perl, BioPerl, R, and GenBank help make FAST easy and rewarding to learn. FAST automates numerical, taxonomic, and text-based sorting, selection and transformation of sequence records and alignment sites based on content, index ranges, descriptive tags, annotated features, and in-line calculated analytics, including composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics make FAST useful for molecular evolutionary analysis. FAST is portable, easy to install and secure, thanks to the relative maturity of its Perl and BioPerl foundations, with stable releases posted to CPAN. Development, as well as a publicly accessible Cookbook and Wiki, is hosted on the FAST GitHub repository at https://github.com/tlawrence3/FAST. The default data exchange format in FAST is Multi-FastA (specifically, a restriction of the BioPerl FastA format). Sanger and Illumina 1.8+ FastQ formatted files are also supported. FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.
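
    FAST itself is a suite of Perl command-line tools, so the following is only an illustrative Python analogue of what a fasgrep-style record filter does, using Biopython's SeqIO (the pattern is invented for the example; this is not FAST's implementation).

        # Illustrative analogue of a fasgrep-style filter: stream FastA records
        # from stdin and keep those whose description matches a regex.
        import re
        import sys

        from Bio import SeqIO  # Biopython; FAST proper builds on Perl/BioPerl

        pattern = re.compile(r"16S")  # example pattern, as one might pass to fasgrep
        matching = (rec for rec in SeqIO.parse(sys.stdin, "fasta")
                    if pattern.search(rec.description))
        SeqIO.write(matching, sys.stdout, "fasta")

    Chaining several such stream filters over pipes mirrors the grep/cut/tr composition style that FAST borrows from the GNU Textutils.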

    Automated Fixing of Programs with Contracts

    This paper describes AutoFix, an automatic debugging technique that can fix faults in general-purpose software. To provide high-quality fix suggestions and to enable automation of the whole debugging process, AutoFix relies on the presence of simple specification elements in the form of contracts (such as pre- and postconditions). Using contracts enhances the precision of dynamic analysis techniques for fault detection and localization, and for validating fixes. The only required user input to the AutoFix supporting tool is then a faulty program annotated with contracts; the tool produces a collection of validated fixes for the fault, ranked according to an estimate of their suitability. In an extensive experimental evaluation, we applied AutoFix to over 200 faults in four code bases of different maturity and quality (of implementation and of contracts). AutoFix successfully fixed 42% of the faults, producing, in the majority of cases, corrections of quality comparable to those competent programmers would write; the computational resources used were modest, with an average time per fix below 20 minutes on commodity hardware. These figures compare favorably to the state of the art in automated program fixing and demonstrate that the AutoFix approach is applicable to reducing the debugging burden in real-world scenarios.
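
    As a toy illustration of the contract-driven loop, consider a Python sketch in which assertions play the role of pre- and postconditions; AutoFix itself targets Eiffel code with native contracts, and the faulty routine, candidate fixes, and tests below are invented for the example.

        # Toy sketch: contracts expose a fault, then validate candidate fixes.
        def clamp(x, lo, hi):
            assert lo <= hi                    # precondition (contract)
            result = x if x > lo else lo       # BUG: the upper bound is ignored
            assert lo <= result <= hi          # postcondition (contract)
            return result

        try:
            clamp(42, 0, 10)
        except AssertionError:
            print("postcondition violated -> fault detected")

        candidate_fixes = [
            lambda x, lo, hi: max(x, lo),             # still ignores hi
            lambda x, lo, hi: min(max(x, lo), hi),    # respects both bounds
        ]
        tests = [(5, 0, 10), (-3, 0, 10), (42, 0, 10)]

        def validated(fix):
            # Keep a fix only if the postcondition holds on every test case.
            return all(lo <= fix(x, lo, hi) <= hi for x, lo, hi in tests)

        surviving = [f for f in candidate_fixes if validated(f)]
        print(f"{len(surviving)} of {len(candidate_fixes)} candidate fixes validated")

    The same contracts thus serve double duty, first exposing and localizing the fault and then acting as the oracle that accepts or rejects machine-generated corrections.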

    The first analytical expression to estimate photometric redshifts suggested by a machine

    We report the first analytical expression purely constructed by a machine to determine photometric redshifts ($z_{\rm phot}$) of galaxies. A simple and reliable functional form is derived using 41,214 galaxies from the Sloan Digital Sky Survey Data Release 10 (SDSS-DR10) spectroscopic sample. The method automatically dropped the $u$ and $z$ bands, relying only on $g$, $r$ and $i$ for the final solution. Applying this expression to another 1,417,181 SDSS-DR10 galaxies with measured spectroscopic redshifts ($z_{\rm spec}$), we achieved a mean bias $\langle (z_{\rm phot} - z_{\rm spec})/(1 + z_{\rm spec}) \rangle \lesssim 0.0086$ and a scatter $\sigma_{(z_{\rm phot} - z_{\rm spec})/(1 + z_{\rm spec})} \lesssim 0.045$ when averaged up to $z \lesssim 1.0$. The method was also applied to the PHAT0 dataset, confirming the competitiveness of our results against other methods from the literature. This is the first use of symbolic regression in cosmology, representing a leap forward in the astronomy-data-mining connection. (Accepted for publication in MNRAS Letters; 6 pages, 4 figures.)
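
    The two statistics quoted above are straightforward to compute once photometric and spectroscopic redshifts are in hand; a short sketch follows, with made-up arrays standing in for real catalog values (the machine-derived expression in $g$, $r$ and $i$ is not reproduced here).

        # Hedged sketch: the bias and scatter statistics from the abstract,
        # evaluated on small invented arrays rather than SDSS-DR10 data.
        import numpy as np

        z_spec = np.array([0.10, 0.25, 0.40, 0.55, 0.80])
        z_phot = np.array([0.11, 0.24, 0.42, 0.53, 0.83])  # e.g. from the fitted g,r,i form

        delta = (z_phot - z_spec) / (1.0 + z_spec)  # normalized residuals
        bias = delta.mean()     # <(z_phot - z_spec)/(1 + z_spec)>
        scatter = delta.std()   # sigma of the same quantity

        print(f"bias = {bias:+.4f}, scatter = {scatter:.4f}")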

    Tracking Cyber Adversaries with Adaptive Indicators of Compromise

    A forensics investigation after a breach often uncovers network and host indicators of compromise (IOCs) that can be deployed to sensors to allow early detection of the adversary in the future. Over time, the adversary will change tactics, techniques, and procedures (TTPs), which will also change the data generated. If the IOCs are not kept up-to-date with the adversary's new TTPs, the adversary will no longer be detected once all of the IOCs become invalid. Tracking the Known (TTK) is the problem of keeping IOCs, in this case regular expressions (regexes), up-to-date with a dynamic adversary. Our framework solves the TTK problem in an automated, cyclic fashion to bracket a previously discovered adversary. This tracking is accomplished through a data-driven approach of self-adapting a given model based on its own detection capabilities. In our initial experiments, we found that the true positive rate (TPR) of the adaptive solution degrades much less over time than that of the naive solution, suggesting that self-updating the model allows continued detection of positives (i.e., adversaries). The cost of this performance is in the false positive rate (FPR), which increases over time for the adaptive solution but remains constant for the naive solution. However, the difference in overall detection performance between the two methods, as measured by the area under the curve (AUC), is negligible. This result suggests that self-updating the model over time should be done in practice to continue detecting known, evolving adversaries. (Presented at the 4th Annual Conference on Computational Science & Computational Intelligence (CSCI'17), December 14-16, 2017, Las Vegas, Nevada, USA.)
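
    A minimal sketch of the adapt-and-retain cycle, with invented indicator strings; the adaptation rule below (rebuild one regex from the currently confirmed hits, letting numeric fields vary) is a stand-in for illustration, not the paper's actual algorithm.

        # Toy sketch: keep a regex IOC current against a drifting adversary.
        import re

        def rebuild(confirmed_hits):
            # Naive generalization: escape each sample and let digit runs vary,
            # since this toy adversary rotates numeric suffixes between cycles.
            alts = {re.sub(r"\d+", r"\\d+", re.escape(s)) for s in confirmed_hits}
            return re.compile("|".join(sorted(alts)))

        observed = ["evil-c2-01.example.com", "evil-c2-07.example.com"]
        ioc = rebuild(observed)

        # Next cycle the adversary shifts its naming TTP; the self-updated
        # IOC still brackets it, at the cost of a broader (riskier) pattern.
        print(bool(ioc.search("evil-c2-443.example.com")))  # True

    The broader pattern is exactly the TPR/FPR trade-off the experiments measure: the adapted IOC keeps catching the adversary, but matches more benign strings as it generalizes.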

    GenomeViz: visualizing microbial genomes

    BACKGROUND: An increasing number of microbial genomes are being sequenced and deposited in public databases. In addition, several closely related strains are being sequenced to understand the genetic basis of diversity and the mechanisms that lead to the acquisition of new genetic traits. These efforts have created a need to visualize microbial genomes and perform genome comparisons on a finer scale. We have developed GenomeViz to enable rapid visualization and subsequent comparison of several microbial genomes in an interactive environment. RESULTS: Here we describe a program that allows visualization of both qualitative and quantitative information from complete and partially sequenced microbial genomes. Using GenomeViz, data deriving from studies on genomic islands, gene/protein classifications, GC content, GC skew, whole-genome alignments, microarrays and proteomics may be plotted. Several genomes can be visualized interactively at the same time from a comparative genomic perspective, and publication-quality circular genome plots can be created. CONCLUSIONS: GenomeViz should allow researchers to perform visualization and comparative analysis of up to eight different microbial genomes simultaneously.
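
    As a flavor of the quantitative tracks involved, here is a hedged sketch computing one of them, GC skew, defined per window as (G - C)/(G + C), over an invented sequence; GenomeViz's role is to plot such values along the genome.

        # Hedged sketch: GC skew per non-overlapping window, one of the
        # quantitative data types GenomeViz can plot along a genome.
        def gc_skew(seq, window=8):
            skews = []
            for i in range(0, len(seq) - window + 1, window):
                w = seq[i:i + window].upper()
                g, c = w.count("G"), w.count("C")
                skews.append((g - c) / (g + c) if g + c else 0.0)
            return skews

        print(gc_skew("GGCATCCCGGGGATCC"))  # [-0.33..., 0.33...]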

    eCOMPAGT – efficient Combination and Management of Phenotypes and Genotypes for Genetic Epidemiology

    BACKGROUND: High-throughput genotyping and phenotyping projects in large epidemiological study populations require sophisticated laboratory information management systems. Most epidemiological studies include subject-related personal information, which needs to be handled with care by following data privacy protection guidelines. In addition, genotyping core facilities handling cooperative projects require a straightforward solution for monitoring the status and financial resources of the different projects. DESCRIPTION: We developed a database system for the efficient combination and management of phenotypes and genotypes (eCOMPAGT) derived from genetic epidemiological studies. eCOMPAGT securely stores and manages genotype and phenotype data and supports different user modes with different rights. Special attention was paid to the import of data from TaqMan and SNPlex genotyping assays; however, the database solution can be adapted to other genotyping systems by programming additional interfaces. Further important features are the scalability of the database and an export interface to statistical software. CONCLUSION: eCOMPAGT can store, administer and connect phenotype data with all kinds of genotype data, and it is available for download at http://dbis-informatik.uibk.ac.at/ecompagt.
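
    A minimal sketch of the data model such a system implies, with subjects joined to phenotype and genotype records in a relational store and an export query for statistical software; the schema below (via sqlite3) is invented for illustration, as the abstract does not describe eCOMPAGT's actual tables.

        # Invented phenotype/genotype linkage schema; not eCOMPAGT's real one.
        import sqlite3

        con = sqlite3.connect(":memory:")
        con.executescript("""
        CREATE TABLE subject   (id INTEGER PRIMARY KEY, pseudonym TEXT UNIQUE);
        CREATE TABLE phenotype (subject_id INT REFERENCES subject(id),
                                trait TEXT, value REAL);
        CREATE TABLE genotype  (subject_id INT REFERENCES subject(id),
                                snp TEXT, call TEXT);  -- e.g. a TaqMan import
        """)
        con.execute("INSERT INTO subject VALUES (1, 'S001')")
        con.execute("INSERT INTO phenotype VALUES (1, 'BMI', 24.3)")
        con.execute("INSERT INTO genotype VALUES (1, 'rs12345', 'AG')")

        # Joined export, ready for a statistics package:
        rows = con.execute("""
            SELECT s.pseudonym, p.trait, p.value, g.snp, g.call
            FROM subject s
            JOIN phenotype p ON p.subject_id = s.id
            JOIN genotype  g ON g.subject_id = s.id""").fetchall()
        print(rows)  # [('S001', 'BMI', 24.3, 'rs12345', 'AG')]

    Keeping only a pseudonym in the subject table reflects the data-privacy constraint the abstract emphasizes: personal identifiers stay outside the genotype/phenotype store.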