Automatic Synthesis of Regular Expressions from Examples
We propose a system for the automatic generation of regular expressions for text-extraction tasks. The user describes the desired task only by means of a set of labeled examples. The generated regexes may be used with common engines such as those of Java, PHP, Perl, and so on. Usage of the system does not require any familiarity with regular expression syntax. We performed an extensive experimental evaluation on 12 different extraction tasks applied to real-world datasets. We obtained very good results in terms of precision and recall, even in comparison with earlier state-of-the-art proposals. Our results are highly promising toward the achievement of a practical surrogate for the specific skills required for generating regular expressions, and significant as a demonstration of what can be achieved with GP-based approaches on modern IT technology.
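The paper evolves candidate regexes with Genetic Programming; the sketch below illustrates only the fitness step of such an approach, scoring a candidate against the user's labeled examples. The function and its F-measure-style score are illustrative assumptions, not the paper's actual fitness definition.

```python
import re

def fitness(pattern, examples):
    """Score a candidate regex over labeled examples.

    examples: list of (text, expected) pairs, where expected is the substring
    that should be extracted from text, or None if nothing should match.
    Returns the F-measure (harmonic mean of precision and recall).
    """
    tp = fp = fn = 0
    for text, expected in examples:
        m = re.search(pattern, text)
        found = m.group(0) if m else None
        if found is not None and found == expected:
            tp += 1
        elif found is not None:
            fp += 1
        if expected is not None and found != expected:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

examples = [
    ("order id: AB-1234", "AB-1234"),
    ("ref CD-9876 shipped", "CD-9876"),
    ("no identifier here", None),
]
print(fitness(r"[A-Z]{2}-\d{4}", examples))  # perfect extractor on these examples: 1.0
```

A GP loop would repeatedly mutate and recombine a population of candidate patterns, keeping those with the highest such score.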
Ingegnerizzazione di Algoritmi di Machine Learning (Engineering of Machine Learning Algorithms)
2009/2010
Nowadays the available computing and information-storage resources
have grown to a level that makes it easy to collect and preserve huge amounts
of data. However, several organizations still lack the knowledge
or the tools to turn these data into useful information.
In this thesis work we will investigate several issues that can be solved
effectively by means of machine learning techniques, ranging from web
defacement detection to electricity price forecasting, and from Support Vector Machines to Genetic Programming.
We will investigate a framework for web defacement detection meant
to allow any organization to join the service simply by providing the
URLs of the resources to be monitored along with the contact point
of an administrator. Our approach is based on anomaly detection and
allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them; in particular, it requires
no prior knowledge about those resources, thus being an unsupervised system. Furthermore, we will test on the web defacement
detection problem several machine learning algorithms normally used for anomaly
detection.
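A minimal sketch of the anomaly-detection idea behind such a monitoring service: build a profile of a page from snapshots taken while it is presumed genuine, then flag readings that deviate too much. The features and the z-score threshold here are hypothetical, not the framework's actual ones.

```python
import statistics

def build_profile(readings):
    # readings: feature vectors extracted from snapshots of a monitored page
    # (e.g., page size, number of links, number of scripts -- hypothetical features).
    # Returns per-feature (mean, std dev); a zero std dev is replaced by 1.0.
    cols = list(zip(*readings))
    return [(statistics.mean(c), statistics.pstdev(c) or 1.0) for c in cols]

def is_anomalous(profile, reading, threshold=3.0):
    # Flag the reading if any feature deviates more than `threshold`
    # standard deviations from the learned profile.
    return any(abs(x - mu) / sigma > threshold
               for x, (mu, sigma) in zip(reading, profile))

normal = [[10120, 45, 6], [10190, 44, 6], [10080, 46, 6], [10150, 45, 6]]
profile = build_profile(normal)
print(is_anomalous(profile, [10140, 45, 6]))   # ordinary snapshot: False
print(is_anomalous(profile, [312, 2, 0]))      # defaced-looking snapshot: True
```

The key property matching the thesis's setting is that only readings of the genuine page are needed, with no examples of defacements, i.e., the detector is unsupervised.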
We will present a scrolling system to be used on mobile devices to
provide a more natural and effective user experience on small screens.
We detect device motion by analyzing the video stream generated by the
camera and then transform the motion into a scrolling of the content
rendered on the screen. This way, the user experiences the device screen
as a small movable window on a larger virtual view, without requiring
any dedicated motion-detection hardware.
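The motion-to-scroll mapping described above can be sketched as follows, with an assumed gain factor and virtual-view bounds; how the displacement itself is estimated from the video stream is omitted.

```python
def update_scroll(scroll, motion, gain=1.5, bounds=(0, 0, 2000, 3000)):
    # scroll: current (x, y) position of the viewport on the virtual view.
    # motion: (dx, dy) displacement estimated between consecutive camera frames.
    # The viewport shifts with the device motion, scaled by `gain` and
    # clamped to the virtual view bounds (all values here are hypothetical).
    x = min(max(scroll[0] + gain * motion[0], bounds[0]), bounds[2])
    y = min(max(scroll[1] + gain * motion[1], bounds[1]), bounds[3])
    return (x, y)

pos = (100.0, 100.0)
pos = update_scroll(pos, (20, -10))
print(pos)  # (130.0, 85.0)
```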
As regards information retrieval, we will present an approach to information extraction from multi-page printed documents; the approach is
designed for scenarios in which the set of possible document classes, i.e.,
documents sharing similar content and layout, is large and may evolve
over time. Our approach is based on probability: we derived a general
form for the probability that a sequence of blocks contains the searched
information. A key step in the understanding of printed documents is
their classification based on the nature of the information they contain and
their layout; we will consider both a static and a dynamic scenario, in
which document classes are/are not known a priori and new classes can/cannot appear at any time.
Finally, we will move to the edge of machine learning: Genetic Programming. The electric power market increasingly relies on competitive mechanisms taking the form of day-ahead auctions, in which buyers
and sellers submit their bids in terms of prices and quantities for each
hour of the next day. We propose a novel forecasting method based on
Genetic Programming; a key feature of our proposal is the handling of
outliers, i.e., regions of the input space rarely seen during learning.
XXIII Ciclo
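One plausible way to realize the outlier handling mentioned for the forecasting method is to fall back to a simple baseline whenever the input lies in a region of the input space with few training samples. This guard is an illustrative assumption, not the thesis's actual mechanism.

```python
def guarded_forecast(x, model, train_inputs, baseline, k=3, radius=1.0):
    # If fewer than k training inputs lie within `radius` of x, treat x as
    # belonging to an outlier region and use a baseline forecaster instead
    # of the evolved GP expression (k and radius are hypothetical knobs).
    near = sum(1 for t in train_inputs if abs(t - x) <= radius)
    return model(x) if near >= k else baseline(x)

train_inputs = [1.0, 1.2, 1.5, 2.0, 2.2]
model = lambda x: 10 * x          # stand-in for an evolved GP expression
baseline = lambda x: 20.0         # e.g., the price at the same hour of the previous day
print(guarded_forecast(1.5, model, train_inputs, baseline))  # dense region: 15.0
print(guarded_forecast(9.0, model, train_inputs, baseline))  # outlier region: 20.0
```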
The Reaction Time to Web Site Defacements
Web site defacement has become a common threat for organizations exposed on the web.
Several statistics indicate the number of incidents of this sort, but a
crucial piece of information is still lacking: the
typical duration of a defacement. Clearly, a defacement lasting one week is much more harmful
than one lasting a few minutes. In this paper we present the results of a two-month monitoring
activity that we performed over more than 62,000 defacements in order to figure out
whether and when a reaction to the defacement is taken. We show that the reaction
time tends to be unacceptably long, in the order of several days, with a long-tailed
distribution. We believe our findings may
improve the understanding of this phenomenon and highlight issues
deserving attention by the research community.
A Probabilistic Approach to Printed Document Understanding
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time.
Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal on 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
Medvet, Eric; Bartoli, Alberto; Davanzo, Giorgio
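A toy sketch of the maximization step, finding the contiguous sequence of OCR blocks with the highest probability under an independence assumption. The scoring function is hypothetical; the paper derives the actual probability form and estimates its parameters from operator-labeled samples.

```python
import math

def best_sequence(blocks, score, max_len=3):
    """Return the contiguous block sequence with the highest probability.

    blocks: OCR blocks in reading order; score(block) is a per-block
    log-probability (hypothetical stand-in for the learned model).
    Under an independence assumption, sequence log-probabilities are
    sums of per-block scores.
    """
    best, best_lp = None, -math.inf
    for i in range(len(blocks)):
        for j in range(i + 1, min(i + max_len, len(blocks)) + 1):
            lp = sum(score(b) for b in blocks[i:j])
            if lp > best_lp:
                best, best_lp = blocks[i:j], lp
    return best

blocks = ["Invoice", "Total:", "123.45", "EUR"]
# toy score: numeric-looking blocks are likely to hold the searched amount
score = lambda b: 0.0 if b.replace(".", "").isdigit() else -5.0
print(best_sequence(blocks, score))  # ['123.45']
```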
Open World Classification of Printed Invoices
A key step in the understanding of printed documents is their classification based on the nature of the information they contain and their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open world setting is both realistic and highly challenging. We use an SVM-based classifier that relies only on image-level features, together with a nearest-neighbor approach for detecting new classes. We assess our proposal on a real-world dataset composed of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment, so they are quite noisy, e.g., big stamps and handwritten signatures in unfortunate positions and the like. The experimental results are highly promising.
Sorio, Enrico; Bartoli, Alberto; Davanzo, Giorgio; Medvet, Eric
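The nearest-neighbor rejection of unseen classes could be sketched as follows, with a hypothetical distance threshold standing in for whatever criterion the paper actually uses.

```python
import math

def classify_open_world(feature_vec, known, threshold=2.0):
    # known: {class_name: [feature vectors]} of already-seen documents.
    # If the nearest known document is farther than `threshold`, the
    # invoice likely belongs to a class never seen before.
    best_cls, best_d = None, math.inf
    for cls, vecs in known.items():
        for v in vecs:
            d = math.dist(feature_vec, v)
            if d < best_d:
                best_cls, best_d = cls, d
    return "NEW_CLASS" if best_d > threshold else best_cls

known = {"acme": [[0.1, 0.2], [0.15, 0.25]], "globex": [[0.9, 0.8]]}
print(classify_open_world([0.12, 0.22], known))  # acme
print(classify_open_world([5.0, 5.0], known))    # NEW_CLASS
```

A sample flagged as NEW_CLASS would seed a new class, so the set of classes can grow over time as the open-world setting requires.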
Human Colostrum and Breast Milk Contain High Levels of TNF-Related Apoptosis-Inducing Ligand (TRAIL).
Background: TNF-related apoptosis-inducing ligand (TRAIL) is a pleiotropic cytokine which plays a key role in the immune system as well as in controlling the balance of apoptosis and proliferation in various organs and tissues.
Objective: To investigate the presence and levels of soluble TRAIL in human colostrum and milk.
Methods: The levels of soluble human TRAIL were measured in human colostrum (day 2 after delivery) and breast milk (day 5 after delivery). The presence of TRAIL was also measured in infant formula.
Results: Levels of soluble TRAIL in the colostrum and mature human milk were, respectively, at least 400- and 100-fold higher than those detected in human serum. No TRAIL was detected in formula.
Conclusion: Human soluble TRAIL is present at extremely high levels in human colostrum and human milk and might have a significant role in mediating the anti-cancer activity of human milk.