11 research outputs found

    Automatic Synthesis of Regular Expressions from Examples

    We propose a system for the automatic generation of regular expressions for text-extraction tasks. The user describes the desired task only by means of a set of labeled examples. The generated regexes may be used with common engines such as those of Java, PHP, Perl, and so on. Using the system does not require any familiarity with regular expression syntax. We performed an extensive experimental evaluation on 12 different extraction tasks applied to real-world datasets. We obtained very good results in terms of precision and recall, even in comparison to earlier state-of-the-art proposals. Our results are highly promising toward the achievement of a practical surrogate for the specific skills required for generating regular expressions, and significant as a demonstration of what can be achieved with GP-based approaches on modern IT technology.
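    The abstract describes evaluating candidate regexes against user-provided labeled examples. A minimal sketch of such an evaluation step, assuming an F-measure fitness over (text, expected match) pairs; the function name and scoring details are illustrative, not the authors' exact fitness:

```python
import re

def fitness(candidate_pattern, examples):
    """Score a candidate regex by precision/recall (F-measure) over
    labeled examples. `examples` is a list of (text, expected) pairs,
    where `expected` is the substring to extract, or None when nothing
    should match."""
    tp = fp = fn = 0
    regex = re.compile(candidate_pattern)
    for text, expected in examples:
        m = regex.search(text)
        found = m.group(0) if m else None
        if found is not None and found == expected:
            tp += 1
        elif found is not None:
            fp += 1          # matched, but the wrong substring
        elif expected is not None:
            fn += 1          # should have matched, did not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

examples = [("Call 555-1234 now", "555-1234"),
            ("no number here", None)]
print(fitness(r"\d{3}-\d{4}", examples))  # 1.0: extracts exactly the label
```

    A GP search would repeatedly mutate and recombine candidate patterns, keeping those with higher fitness on the labeled set.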

    Ingegnerizzazione di Algoritmi di Machine Learning

    No full text
    2009/2010
    Nowadays the available computing and information-storage resources have grown to a level that makes it easy to collect and preserve huge amounts of data. However, several organizations still lack the knowledge or the tools to turn these data into useful information. In this thesis we investigate several problems that can be solved effectively by means of machine learning techniques, ranging from web defacement detection to electricity price forecasting, from Support Vector Machines to Genetic Programming.

    We investigate a framework for web defacement detection meant to allow any organization to join the service by simply providing the URLs of the resources to be monitored along with the contact point of an administrator. Our approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them; in particular, it does not require any prior knowledge about those resources, thus being an unsupervised system. Furthermore, we test several machine learning algorithms normally used for anomaly detection on the web defacement detection problem.

    We present a scrolling system to be used on mobile devices to provide a more natural and effective user experience on small screens. We detect device motion by analyzing the video stream generated by the camera and then transform that motion into a scrolling of the content rendered on the screen. This way, the user experiences the device screen as a small movable window on a larger virtual view, without requiring any dedicated motion-detection hardware.

    As regards information retrieval, we present an approach to information extraction from multi-page printed documents; the approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. A key step in the understanding of printed documents is their classification based on the nature of the information they contain and their layout; we consider both a static and a dynamic scenario, in which document classes are/are not known a priori and new classes can/cannot appear at any time.

    Finally, we move to the edge of machine learning: Genetic Programming. The electric power market increasingly relies on competitive mechanisms taking the form of day-ahead auctions, in which buyers and sellers submit their bids in terms of prices and quantities for each hour of the next day. We propose a novel forecasting method based on Genetic Programming; a key feature of our proposal is the handling of outliers, i.e., regions of the input space rarely seen during learning.
    XXIII Ciclo
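    The defacement detector described above learns a profile of each monitored page and flags readings that deviate from it. A minimal sketch of that anomaly-detection idea, assuming simple numeric page features and a k-sigma threshold; the feature set and threshold rule are illustrative stand-ins for the thesis's richer detectors:

```python
import statistics

def build_profile(snapshots):
    """Learn a per-feature (mean, stdev) profile from feature vectors
    of known-good page snapshots (e.g., size, tag counts, link counts)."""
    cols = list(zip(*snapshots))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in cols]

def is_anomalous(profile, reading, k=3.0):
    """Flag a new snapshot whose features drift beyond k stdevs
    from the learned profile; no prior knowledge of the page needed."""
    for (mu, sigma), x in zip(profile, reading):
        if abs(x - mu) > k * max(sigma, 1e-9):
            return True
    return False

# illustrative features: (page size in KB, <img> count, link count)
good = [(120, 10, 33), (122, 10, 34), (119, 11, 33), (121, 10, 35)]
profile = build_profile(good)
print(is_anomalous(profile, (121, 10, 34)))  # False: within profile
print(is_anomalous(profile, (45, 1, 2)))     # True: likely defaced
```

    Because the profile is learned only from observed snapshots, the monitor stays fully decoupled from the monitored site, matching the unsupervised setting described above.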

    The Reaction Time to Web Site Defacements

    No full text
    Web site defacement has become a common threat for organizations exposed on the web. Several statistics indicate the number of incidents of this sort, but a crucial piece of information is still lacking: the typical duration of a defacement. Clearly, a defacement lasting one week is much more harmful than one lasting a few minutes. In this paper we present the results of a two-month monitoring activity that we performed over more than 62,000 defacements in order to figure out whether and when a reaction to the defacement is taken. We show that such time tends to be unacceptably long, in the order of several days, and with a long-tailed distribution. We believe our findings may improve the understanding of this phenomenon and highlight issues deserving attention from the research community.
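    The measurement above boils down to computing, per incident, the delay between the defacement being observed and a reaction being taken, then summarizing the distribution. A minimal sketch with made-up timestamps (not the paper's data), showing how a long tail shows up as a maximum far above the median:

```python
from datetime import datetime, timedelta
from statistics import median

def reaction_times(incidents):
    """Hours between a defacement being observed and its removal,
    for incidents where a reaction was observed at all."""
    return [(fixed - seen).total_seconds() / 3600
            for seen, fixed in incidents if fixed is not None]

t0 = datetime(2009, 1, 1)
# illustrative incidents: reaction delays of 2h, 30h, 70h, 70h, 700h
incidents = [(t0, t0 + timedelta(hours=h)) for h in (2, 30, 70, 70, 700)]
hours = reaction_times(incidents)
print(median(hours))               # typical delay: days, not minutes
print(max(hours) / median(hours))  # long tail: max far above median
```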

    Camera-based Scrolling Interface for Hand-held Devices

    No full text

    A Probabilistic Approach to Printed Document Understanding

    No full text
    We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We evaluated our proposal experimentally using 807 multi-page printed documents from different domains (invoices, patents, data sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
    Medvet, Eric; Bartoli, Alberto; Davanzo, Giorgio
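    The extraction step above is an argmax: among candidate block sequences, pick the one with the highest class-conditional probability. A minimal sketch under a simplifying independence assumption between blocks (the paper derives a more general form); the toy feature values and model tables are illustrative:

```python
import math

def sequence_log_prob(blocks, models):
    """Log-probability that a sequence of OCR blocks carries the
    target field, assuming (for this sketch only) independent
    per-position models. Each model maps a block feature to its
    probability, estimated by maximum likelihood from labeled samples."""
    if len(blocks) != len(models):
        return -math.inf
    return sum(math.log(model.get(feat, 1e-6))
               for model, feat in zip(models, blocks))

def extract(candidates, models):
    """Pick the candidate block sequence maximizing the probability."""
    return max(candidates, key=lambda seq: sequence_log_prob(seq, models))

# toy per-position models for a two-block field (label then value)
models = [{"label": 0.9, "other": 0.1},
          {"number": 0.8, "word": 0.2}]
candidates = [["label", "number"], ["other", "word"], ["label", "word"]]
print(extract(candidates, models))  # ['label', 'number']
```

    Unseen feature values get a small floor probability so a single unknown block does not zero out an otherwise good sequence.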


    Open World Classification of Printed Invoices

    No full text
    A key step in the understanding of printed documents is their classification based on the nature of the information they contain and their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open world setting is both realistic and highly challenging. We use an SVM-based classifier relying only on image-level features and a nearest-neighbor approach for detecting new classes. We assess our proposal on a real-world dataset composed of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment, so they are quite noisy, e.g., big stamps and handwritten signatures in unfortunate positions and the like. The experimental results are highly promising.
    Sorio, Enrico; Bartoli, Alberto; Davanzo, Giorgio; Medvet, Eric
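    The open-world part hinges on a rejection rule: classify into a known class only when the sample is close enough to it, otherwise declare a new class. A minimal sketch using a single nearest-centroid stage with a distance threshold; the paper combines an SVM with nearest-neighbor rejection, and all feature values and thresholds here are illustrative:

```python
import math

def centroid(vectors):
    """Mean feature vector of a class's samples."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def classify_open_world(x, class_samples, reject_distance):
    """Nearest-centroid classification over image-level features,
    rejecting as a new class when even the closest known class
    is farther than `reject_distance`."""
    best_label, best_dist = None, math.inf
    for label, samples in class_samples.items():
        d = math.dist(x, centroid(samples))
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > reject_distance:
        return "new-class"
    return best_label

classes = {"invoice-A": [[1.0, 0.1], [0.9, 0.2]],
           "invoice-B": [[0.1, 1.0], [0.2, 0.9]]}
print(classify_open_world([0.95, 0.15], classes, 0.5))  # invoice-A
print(classify_open_world([5.0, 5.0], classes, 0.5))    # new-class
```

    Once a sample is rejected, it can seed a new class whose samples refine its centroid over time, which is what makes the setting dynamic.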

    Human Colostrum and Breast Milk Contain High Levels of TNF-Related Apoptosis-Inducing Ligand (TRAIL).

    No full text
    Background: TNF-related apoptosis-inducing ligand (TRAIL) is a pleiotropic cytokine, which plays a key role in the immune system as well as in controlling the balance of apoptosis and proliferation in various organs and tissues.
    Objective: To investigate the presence and levels of soluble TRAIL in human colostrum and milk.
    Methods: The levels of soluble human TRAIL were measured in human colostrum (day 2 after delivery) and breast milk (day 5 after delivery). The presence of TRAIL was also measured in infant formula.
    Results: Levels of soluble TRAIL in the colostrum and mature human milk were, respectively, at least 400- and 100-fold higher than those detected in human serum. No TRAIL was detected in formula.
    Conclusion: Human soluble TRAIL is present at extremely high levels in human colostrum and human milk and might have a significant role in mediating the anti-cancer activity of human milk.