On-Demand Data Integration (Integrazione di dati on-demand)
Companies and organizations depend heavily on their data to make informed business decisions. Therefore, guaranteeing high data quality is critical to ensure the reliability of data analysis. Data integration, which aims to combine data acquired from several heterogeneous sources to provide users with a unified and consistent view, plays a fundamental role in enhancing the value of the data at hand. In the past, when data integration involved a limited number of sources, ETL (extract, transform, load) established itself as the most popular paradigm: once collected, raw data is cleaned, then stored in a data warehouse to perform analysis on it. Nowadays, big data integration needs to deal with millions of sources; thus, the paradigm is increasingly moving towards ELT (extract, load, transform). A huge amount of raw data is collected and stored directly as is (e.g., in a data lake), and different users can then transform portions of it according to the task at hand. Hence, novel approaches to data integration need to be explored to address the challenges raised by this paradigm.
One of the fundamental building blocks for data integration is entity resolution (ER), which aims at detecting profiles that describe the same real-world entity, to consolidate them into a single consistent representation. ER is typically employed as an expensive offline cleaning step on the entire dataset before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing dataset. Similarly, when querying data lakes, we want to transform data on-demand and return results in a timely manner. Hence, we propose BrewER, a framework to evaluate SQL SP (select-project) queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, according to the priority defined by the user through the ORDER BY clause. For a wide range of applications (e.g., data exploration), a significant amount of resources can therefore be saved.
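To make the idea concrete, here is a minimal, self-contained Python sketch of priority-driven, on-demand entity resolution in the spirit described above. It is not BrewER's actual implementation or API: the toy camera records, the resolve_entity merge rule (longest name, minimum price), and the min-value lower bound used to schedule the work are all illustrative assumptions.

```python
import heapq

# Toy dirty data: blocking has already grouped candidate duplicates (an
# assumption of this sketch), so each block may describe one real-world camera.
blocks = {
    "canon_eos_2000d": [
        {"name": "Canon EOS 2000D", "price": 399.0},
        {"name": "Canon EOS2000D + 18-55mm Kit", "price": 379.0},
    ],
    "nikon_d3500": [
        {"name": "Nikon D3500", "price": 449.0},
        {"name": "Nikon D-3500 body only", "price": 429.0},
    ],
}

def resolve_entity(profiles):
    """Illustrative ER step: merge duplicate profiles into one record
    (longest name and minimum price as a naive conflict-resolution rule)."""
    return {
        "name": max((p["name"] for p in profiles), key=len),
        "price": min(p["price"] for p in profiles),
    }

def query_on_demand(blocks, where, order_key):
    """Emit cleaned entities one at a time, in ascending ORDER BY order.

    Dirty blocks enter a min-heap keyed by a lower bound of the ordering
    attribute; a block is resolved only when it reaches the top of the heap,
    and the resolved entity is re-inserted with its exact value, so results
    are emitted progressively and in order without cleaning everything upfront.
    """
    heap, counter = [], 0
    for profiles in blocks.values():
        bound = min(p[order_key] for p in profiles)
        heap.append((bound, False, counter, profiles))
        counter += 1
    heapq.heapify(heap)
    while heap:
        value, is_resolved, _, payload = heapq.heappop(heap)
        if is_resolved:
            yield payload                      # progressive, ordered emission
        else:
            entity = resolve_entity(payload)   # clean only this entity
            if where(entity):                  # apply the WHERE filter on clean data
                heapq.heappush(heap, (entity[order_key], True, counter, entity))
                counter += 1

# Sketch of "SELECT * FROM cameras WHERE price < 500 ORDER BY price":
for row in query_on_demand(blocks, where=lambda e: e["price"] < 500, order_key="price"):
    print(row)
```

The key design point illustrated here is that the expensive merge step runs only when a block reaches the top of the priority queue, so entities that the query never needs are never cleaned.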
Further, duplicates exist not only at the profile level, as is the case for ER, but also at the dataset level. In the ELT scenario, it is common for data scientists to retrieve datasets from the enterprise's data lake, perform transformations for their analysis, then store the new datasets back into the data lake. Similarly, in Web contexts such as Wikipedia, a table can be duplicated at a given point in time, with the copies then evolving independently, possibly leading to the emergence of inconsistencies. Automatically detecting duplicate tables would make it possible to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy, freeing up storage space or saving additional work for the editors. While dataset discovery research has developed efficient tools to retrieve unionable or joinable tables, the problem of detecting duplicate tables has been mostly overlooked in the existing literature. To fill this gap, we present Sloth, a framework to efficiently determine the largest overlap (i.e., the largest common subtable) between two tables. Detecting the largest overlap makes it possible to quantify the similarity between the two tables and to spot their inconsistencies.
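For intuition, the following is a naive, self-contained Python sketch of the largest-overlap problem: it enumerates every possible column mapping between two toy tables and keeps the mapping that maximises the shared area (common rows times mapped columns). The toy tables and the brute-force enumeration are illustrative assumptions; Sloth addresses the same problem with far more efficient techniques.

```python
from itertools import combinations, permutations
from collections import Counter

def largest_overlap(table_a, table_b):
    """Brute-force sketch of the largest common subtable between two tables
    (lists of equal-length rows): try every mapping between k columns of A
    and k columns of B, count the rows shared under that mapping (as a
    multiset), and maximise the overlap area = shared rows x k.
    Exponential in the number of columns; for illustration only."""
    cols_a, cols_b = range(len(table_a[0])), range(len(table_b[0]))
    best_area, best_mapping = 0, None
    for k in range(1, min(len(cols_a), len(cols_b)) + 1):
        for subset_a in combinations(cols_a, k):
            for subset_b in permutations(cols_b, k):
                proj_a = Counter(tuple(row[c] for c in subset_a) for row in table_a)
                proj_b = Counter(tuple(row[c] for c in subset_b) for row in table_b)
                shared = sum((proj_a & proj_b).values())   # multiset intersection
                if shared * k > best_area:
                    best_area, best_mapping = shared * k, (subset_a, subset_b)
    return best_area, best_mapping

# Two toy tables sharing the (city, country) columns on two rows: area = 2 x 2 = 4.
t1 = [["Rome", "IT", 2.8], ["Paris", "FR", 2.1], ["Berlin", "DE", 3.6]]
t2 = [["IT", "Rome"], ["FR", "Paris"], ["ES", "Madrid"]]
print(largest_overlap(t1, t2))   # -> (4, ((0, 1), (1, 0)))
```

The returned area makes the two tables directly comparable (a large area relative to their sizes signals a likely duplicate), while rows that match on some mapped columns but not others are natural candidates for inconsistencies.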
BrewER and Sloth represent novel solutions for performing big data integration in the ELT scenario, fostering the on-demand use of available resources and shifting this fundamental task towards a task-driven paradigm.
Task-Driven Big Data Integration
Data integration aims at combining data acquired from different autonomous sources to provide the user with a unified view of this data. One of the main challenges in data integration processes is entity resolution, whose goal is to detect the different representations of the same real-world entity across the sources, in order to produce a unique and consistent representation for it. The advent of big data has challenged traditional data integration paradigms, making the offline batch approach to entity resolution no longer suitable for several scenarios (e.g., when performing data exploration or dealing with datasets that change with a high frequency). Therefore, it becomes of primary importance to produce new solutions capable of operating effectively in such situations.
In this paper, I present some contributions made during the first half of my PhD program, mainly focusing on the design of a framework to perform entity resolution in an on-demand fashion, building on the results achieved by the progressive and query-driven approaches to this task. Moreover, I briefly describe two projects in which I took part as a member of my research group, touching on some real-world applications of big data integration techniques, and conclude with some ideas on the future directions of my research.
India’s Season of Dissent: An Interview with Poet Karthika Naïr
Ghazal: India’s Season of Dissent This year, this night, this hour, rise to salute the season of dissent. Sikhs, Hindus, Muslims—Indians, all—seek their nation of dissent. We the people of… they chant: the mantra that birthed a republic. Even my distant eyes echo flares from this beacon of dissent. Kolkata, Kasargod, Kanpur, Nagpur, Tripura… watch it spread, tip to tricoloured tip, then soar: the winged horizon of dissent. Dibrugarh: five hundred students face the CAA and lathi-wielding cops with T..
Evaluation of in-situ shrinkage and expansion properties of polymer composite materials for adhesive anchor systems by a novel approach based on digital image correlation
The curing reaction of thermosetting resins is associated with chemical shrinkage, which is overlapped with thermal expansion as a result of the exothermal enthalpy. Final material properties of the polymer are determined by this critical process. For adhesive anchor systems, the overall shrinkage behavior of the material is very important for the ultimate bond behavior between the adhesive and the borehole wall. An approach for the in-situ measurement of 3-dimensional shrinkage and thermal expansion with digital image correlation (DIC) is presented, overcoming the common limitation of DIC to solids. Two polymer-based anchor systems (filled epoxy, vinylester) were investigated and models were developed, showing good agreement with experimental results. Additionally, measurements with differential scanning calorimetry (DSC) provided supporting information about the curing reaction. The vinylester system showed higher shrinkage but a much faster reaction compared to the investigated epoxy.
Tissue Inhibitor of Metalloproteinase-3 (TIMP-3) induces FAS-dependent apoptosis in human vascular smooth muscle cells
Overexpression of Tissue Inhibitor of Metalloproteinases-3 (TIMP-3) in vascular smooth muscle cells (VSMCs) induces apoptosis and reduces the neointima formation occurring after saphenous vein interposition grafting or coronary stenting. In studies to address the mechanism of TIMP-3-driven apoptosis in human VSMCs, we find that TIMP-3 increased activation of caspase-8, and that apoptosis was inhibited by expression of Cytokine response modifier A (CrmA) and dominant negative FAS-Associated protein with Death Domain (FADD). TIMP-3-induced apoptosis did not cause mitochondrial depolarisation or increase activation of caspase-9, and was not inhibited by over-expression of B-cell Lymphoma 2 (Bcl2), indicating a mitochondria-independent/type-I death receptor pathway. TIMP-3 increased levels of the First Apoptosis Signal receptor (FAS), and depletion of FAS with shRNA showed that TIMP-3-induced apoptosis was FAS-dependent. TIMP-3 induced formation of the Death-Inducing Signalling Complex (DISC), as detected by immunoprecipitation and by immunofluorescence. Cellular FADD-like IL-1 converting enzyme-like Inhibitory Protein (c-FLIP) localised with FAS at the cell periphery in the absence of TIMP-3, and this localisation was lost on TIMP-3 expression, with c-FLIP adopting a perinuclear localisation. Although TIMP-3 inhibited FAS shedding, this did not increase total surface levels of FAS but instead increased FAS levels within localised regions at the cell surface. A Disintegrin And Metalloproteinase 17 (ADAM17) is inhibited by TIMP-3, and depletion of ADAM17 with shRNA significantly decreased FAS shedding. However, ADAM17 depletion did not induce apoptosis or replicate the effects of TIMP-3 by increasing localised clustering of cell surface FAS. ADAM17-depleted cells could activate caspase-3 when expressing levels of TIMP-3 that were otherwise sub-apoptotic, suggesting a partial role for ADAM17-mediated ectodomain shedding in TIMP-3-mediated apoptosis. We conclude that TIMP-3-induced apoptosis in VSMCs is highly dependent on FAS and is associated with changes in FAS and c-FLIP localisation, but is not solely dependent on shedding of the FAS ectodomain.
Deep intrauterine insemination in sow: results of a field trial
Traditional insemination techniques in pigs deposit a high number of spermatozoa (2 to 3×10⁹ spermatozoa) in a large volume of liquid (80-100 ml) into the cervix channel. The dose can be reduced markedly by depositing it directly into the uterine horn. Previous studies showed that fertility rate and litter size were not significantly different with 5 or 15×10⁷ spermatozoa in 10 ml into the uterus. The goal of this study was to determine the on-farm application and the reproductive performance of deep intrauterine insemination (Firflex® probe, MAGAPOR, Spain) in sows. Experiments were conducted under field conditions in 4 commercial pig farms in the North of Italy (September 2003 and March 2004). A total of 166 crossbred multiparous sows were randomly selected after weaning and assigned to one of the following groups: Group 1 – traditional insemination with 3×10⁹ sperm./dose, two inseminations per oestrus (n=94) and Group 2 – deep intrauterine insemination with 15×10⁷ sperm./dose, one insemination pe..
Entity Resolution On-Demand for Querying Dirty Datasets
Entity Resolution (ER) is the process of identifying and merging records that refer to the same real-world entity. ER is usually applied as an expensive cleaning step on the entire dataset before consuming it, yet the relevance of the cleaned entities ultimately depends on the user’s specific application, which may require only a small portion of them. We introduce BrewER, a framework designed to evaluate SQL SP queries on unclean data while progressively providing results as if they were obtained from cleaned data. BrewER aims at cleaning a single entity at a time, adhering to an ORDER BY predicate; thus, it inherently supports top-k queries and stop-and-resume execution. This approach can save a significant amount of resources for various applications. BrewER has been implemented as an open-source Python library and can be seamlessly employed with existing ER tools and algorithms. We thoroughly demonstrate its efficiency through an evaluation on four real-world datasets.
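As a hedged illustration of why progressive emission gives top-k and stop-and-resume behaviour essentially for free, the generator-based sketch below cleans each entity lazily, only when the consumer pulls the next result. The data, the cleaning stand-in, and the interface are hypothetical and do not reflect the actual BrewER API.

```python
from itertools import islice

def cleaned_results(dirty_entities, order_key="price"):
    """Stand-in for an on-demand ER pipeline (hypothetical, not the BrewER API):
    ordering here simply uses the dirty values as a proxy, while the expensive
    cleaning of each entity happens lazily, only when the next result is requested."""
    for entity in sorted(dirty_entities, key=lambda e: e[order_key]):
        print(f"cleaning {entity['name']} ...")   # expensive ER work would happen here
        yield entity

dirty = [
    {"name": "camera A", "price": 399.0},
    {"name": "camera B", "price": 449.0},
    {"name": "camera C", "price": 519.0},
]

results = cleaned_results(dirty)       # nothing has been cleaned yet
top_2 = list(islice(results, 2))       # top-k: only the first two entities get cleaned
remaining = list(results)              # stop-and-resume: cleaning continues on demand
```

Because results are produced as a lazy stream in ORDER BY order, stopping after k items bounds the cleaning cost, and keeping the iterator around lets execution resume later without redoing work.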