18 research outputs found

    Fusing Data with Correlations

    Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise into the data. We wish to automatically distinguish correct data from erroneous data in order to create a cleaner set of integrated data. Previous work has shown that a naïve voting strategy, which trusts data provided by the majority or by at least a certain number of sources, may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. Comment: Sigmod'201
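The naïve voting baseline that the abstract above criticizes can be sketched as follows. This is only an illustrative sketch, not the paper's method: the function name `majority_vote`, the `(source, item, value)` claim format, and the toy claims are all assumptions made for the example. It shows how two sources copying a wrong value from a third can outvote independent correct sources.

```python
from collections import Counter, defaultdict


def majority_vote(claims, min_votes=1):
    """Pick, for each item, the value claimed by the most sources.

    claims: iterable of (source, item, value) triples (hypothetical format).
    min_votes: minimum number of supporting sources required to accept a value.
    """
    votes = defaultdict(Counter)
    seen = set()
    for source, item, value in claims:
        # Count each source at most once per item.
        if (source, item) in seen:
            continue
        seen.add((source, item))
        votes[item][value] += 1

    result = {}
    for item, counter in votes.items():
        value, count = counter.most_common(1)[0]
        if count >= min_votes:
            result[item] = value
    return result


# Toy data: s3 and s5 are assumed to copy the wrong value from s2.
claims = [
    ("s1", "capital_of_australia", "Canberra"),
    ("s2", "capital_of_australia", "Sydney"),
    ("s3", "capital_of_australia", "Sydney"),
    ("s4", "capital_of_australia", "Canberra"),
    ("s5", "capital_of_australia", "Sydney"),
]
print(majority_vote(claims))
```

Because voting treats the three correlated sources as independent evidence, the wrong value wins here; accounting for such correlations is the gap the abstract addresses.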

    Quality-Aware Association Rule Mining

    International audience

    Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation

    International audience. In many applications, data mining and machine learning methods are extensively used to analyze Web data and discover actionable knowledge. But "dirty data" is a chronic plague that causes incorrect results and misleading conclusions, generally followed by inadequate decisions. To ensure the validity of output results and to avoid bias or data snooping, it is necessary to control not only the whole Web data analytics pipeline but, most importantly, the quality of Web data through appropriate data preparation and curation choices. For a given dataset and a given machine learning model, a plethora of data preprocessing techniques and alternative data cleaning strategies may lead to dramatically different outputs with unequal quality performance. It is therefore crucial to rely on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, ML model, and quality performance metric, the optimal sequence of preprocessing tasks such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
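The idea of using Q-Learning to choose an ordering of preprocessing tasks can be illustrated with a small tabular sketch. This is not the Learn2Clean implementation: the task names in `TASKS`, the scoring function `toy_quality` (a stand-in for training an ML model and measuring a quality metric), and all hyperparameters are assumptions chosen for the example.

```python
import random

# Hypothetical preprocessing tasks; each is applied exactly once.
TASKS = ("dedup", "impute", "normalize")


def toy_quality(sequence):
    """Stand-in for 'run the pipeline, train the model, measure quality'.

    Assumed toy rule: deduplicating before imputing and imputing before
    normalizing each add to the score.
    """
    score = 0.5
    if sequence.index("dedup") < sequence.index("impute"):
        score += 0.2
    if sequence.index("impute") < sequence.index("normalize"):
        score += 0.2
    return score


def remaining(state):
    """Tasks not yet applied in the partial sequence `state`."""
    return [t for t in TASKS if t not in state]


def learn_sequence(episodes=2000, alpha=0.5, gamma=1.0, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {}  # (state, action) -> estimated value; state is a tuple of tasks

    for _ in range(episodes):
        state = ()
        while len(state) < len(TASKS):
            actions = remaining(state)
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q.get((state, a), 0.0))
            next_state = state + (action,)
            done = len(next_state) == len(TASKS)
            # Reward is only observed once the full sequence has been applied.
            reward = toy_quality(next_state) if done else 0.0
            future = 0.0 if done else max(
                q.get((next_state, a), 0.0) for a in remaining(next_state)
            )
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = next_state

    # Greedy rollout of the learned Q-table.
    state = ()
    while len(state) < len(TASKS):
        action = max(remaining(state), key=lambda a: q.get((state, a), 0.0))
        state = state + (action,)
    return state


print(learn_sequence())
```

In a real setting the reward would come from evaluating the chosen ML model on the prepared data, which is far more expensive than this toy score, so the number of episodes would need to be budgeted accordingly.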

    Discovering Transition Pathways Towards Coviability with Machine Learning

    5 pages, 1 figure. International audience. Coviability refers to the multiple socio-ecological arrangements and governance structures under which humans and nature can coexist in functional, fair, and persistent ways. Transitioning to a coviable state in environmentally degraded and socially vulnerable territories is challenging. This paper presents an ongoing French-Brazilian joint research project combining machine learning, agroecology, and social sciences to discover coviability pathways that can be adopted and implemented by local populations in the North-East region of Brazil.

    Combining Ecological and Socio-Environmental Data and Networks to Achieve Sustainability

    Environmental degradation in Brazil has recently been amplified by the expansion of agribusiness, livestock, and mining activities, with dramatic repercussions on ecosystem functions and services. The anthropogenic degradation of landscapes has substantial impacts on indigenous peoples and small organic farmers whose lifestyles are intimately linked to diverse and functional ecosystems. Understanding how we can apply science and technology to benefit from biodiversity and promote socio-ecological transitions ensuring equitable and sustainable use of common natural resources is a critical challenge brought on by the Anthropocene. We present our approach to combining biodiversity and environmental data, supported by two funded research projects: DATAPB (Data of Paraíba), which develops tools for FAIR (Findable, Accessible, Interoperable and Reusable) data sharing for governance and educational projects, and the International Joint Laboratory IDEAL (artificial Intelligence, Data analytics, and Earth observation applied to sustAinability Lab), launched in 2023 by the French Institute for Sustainable Development (IRD, Institut de Recherche pour le Développement) and co-coordinated by the authors, with 50 researchers in 11 Brazilian and French institutions working on Artificial Intelligence and socio-ecological research in four Brazilian Northeast states: Paraíba, Rio Grande do Norte, Pernambuco, and Ceará (Berti-Equille and Raimundo 2023). As the keystone of these transdisciplinary projects, the concept-paradigm of socio-ecological coviability (Barrière et al. 2019) proposes that we should explore multiple ways by which relationships between humans and nonhumans (fauna, flora, natural resources) can reach functional and persistent states. Transdisciplinary approaches to agroecological transitions are urgently needed to address questions such as: How can researchers, local communities, and policymakers co-produce participatory diagnoses that depict the coviability of a territory? How can we conserve biodiversity and ecosystem functions, promote social inclusion, value traditional knowledge, and strengthen bioeconomies at local and regional scales? How can biodiversity, social and environmental data, and networks help local communities in shaping adaptation pathways towards sustainable agroecological practices? These questions require transdisciplinary approaches and effective collaboration among environmental, social, and computer scientists, with the involvement of local stakeholders (Biggs et al. 2012). As such, our methodology relies on two approaches. First, a large-scale study of socio-ecological determinants of coviability over nine states and 1794 municipalities in Northeast Brazil combines multiple data sources from IBGE (Instituto Brasileiro de Geografia e Estatística), IPEA (Instituto de Pesquisa Econômica Aplicada), MapBiomas, Brazil Data Cube, and our partners GBIF (Global Biodiversity Information Facility), INCT Odisseia (Observatory of the dynamics of the interactions between societies and their environments), and ICMBio (Instituto Chico Mendes de Conservação da Biodiversidade) to enable the computation of proxies and indicators of biodiversity structure, ecosystem functions, and socio-economic organization at different scales. We will perform exploratory data analysis and use artificial intelligence (Rolnick et al. 2022) to identify proxies for adaptability, resilience, and vulnerabilities. Second, a multilayer network approach for modeling the interplay between socio-ecological and governance systems will be designed and tested using adaptive network modeling (Raimundo et al. 2018). Beyond multilayer networks to model socio-ecological dynamics (Keyes et al. 2021), we will incorporate the evolution of the governance systems at the landscape scale and apply Latin Hypercube methods to explore the parameter space (Raimundo et al. 2014) and get a broad characterization of the model dynamics, with insights into how the interplay of coupled adaptive systems influences socio-ecological resilience under multiple ecological and socio-economic scenarios. The overall methodology and study case scenarios will be presented.
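The Latin Hypercube exploration of a parameter space mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the authors implement it or which model parameters they vary: the function name `latin_hypercube` and the two parameter ranges below are purely hypothetical. The sketch splits each parameter range into as many equal-width strata as there are samples, draws one point per stratum, and shuffles the strata independently per parameter.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points so each parameter's range is evenly stratified.

    bounds: list of (low, high) pairs, one per parameter (hypothetical ranges).
    Returns a list of n_samples tuples, one coordinate per parameter.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of the n equal-width strata of [0, 1).
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across parameters.
        rng.shuffle(points)
        columns.append([low + p * (high - low) for p in points])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


# Two made-up model parameters, e.g. a rate in [0, 1) and a size in [10, 20).
samples = latin_hypercube(5, [(0.0, 1.0), (10.0, 20.0)], seed=1)
for point in samples:
    print(point)
```

Compared with plain random sampling, this guarantees that every fifth of each parameter's range is visited exactly once with only five model runs, which is why the technique suits expensive simulations.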