Unsupervised String Transformation Learning for Entity Consolidation

Author: Abedjan Ziawasch
Deng Dong
Elmagarmid Ahmed
Ilyas Ihab F.
Li Guoliang
Madden Samuel
Ouzzani Mourad
Stonebraker Michael
Tang Nan
Tao Wenbo
Publication date: 30/07/2018

Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management (MDM) systems, can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same but formatted differently) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values can usually be transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and (2) the same transformation often appears repeatedly across different clusters (e.g., transposing the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way (i.e., they share a transformation). Then the groups are presented to a human for verification, and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state-of-the-art data wrangling tool.
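The grouping idea in this abstract (value pairs that share a transformation are collected across clusters) can be illustrated with a small Python sketch. This is not the authors' algorithm: it assumes a deliberately narrow, hypothetical notion of "transformation", namely a reordering of word tokens plus the literal separators of the target value. The names `signature` and `group_pairs` are made up for this illustration.

```python
import re
from collections import defaultdict
from itertools import permutations

TOKEN = re.compile(r"\w+")


def signature(src, dst):
    """Describe how dst is built from src, or return None if the simple
    token-reordering model (an assumption of this sketch) does not apply."""
    src_tokens = TOKEN.findall(src)
    dst_tokens = TOKEN.findall(dst)
    # Only handle values made of the same, non-repeated tokens.
    if not src_tokens or len(set(src_tokens)) != len(src_tokens):
        return None
    if sorted(src_tokens) != sorted(dst_tokens):
        return None
    # Position in src of each token of dst, e.g. (1, 0) for a name swap.
    order = tuple(src_tokens.index(t) for t in dst_tokens)
    # Literal text before, between, and after the tokens of dst.
    separators = tuple(TOKEN.split(dst))
    return order, separators


def group_pairs(clusters):
    """Group ordered value pairs from all clusters by shared signature."""
    groups = defaultdict(list)
    for cluster in clusters:
        values = list(dict.fromkeys(cluster))  # drop exact duplicates, keep order
        for src, dst in permutations(values, 2):
            sig = signature(src, dst)
            if sig is not None:
                groups[sig].append((src, dst))
    return dict(groups)


if __name__ == "__main__":
    clusters = [
        ["Mary Lee", "Lee, Mary"],
        ["John Smith", "Smith, John"],
        ["IBM", "I.B.M."],
    ]
    # Larger groups first: these would be the first candidates to show a reviewer.
    for sig, pairs in sorted(group_pairs(clusters).items(), key=lambda kv: -len(kv[1])):
        print(sig, pairs)
```

In this sketch each group corresponds to one yes/no question for a human reviewer; an approved group could then be applied to rewrite every source value in it to its target form. Pairs such as "IBM" and "I.B.M." fall outside the token-reordering model and are simply left ungrouped.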

arXiv.org e-Print Archive

Crossref

The consolidation phase: Survival strategies of farmers stabilizing and developing their businesses

Author: Rantamaki-Lahtinen Leena
Vare Minna

In earlier studies, past succession has been found to contribute positively to farm growth. However, there is a lack of information on how farms succeed after the starting phase. This study analyses how farmers who have recently started their farm enterprise differ from more experienced farmers in some key farm management areas, such as farm and farmer characteristics, strategic objectives and development plans. The data were collected by postal survey from the Salo region in South-Western Finland. In the study, farmers are divided into three groups according to the farmer’s age and experience. According to the results, early phase farmers are in certain areas better equipped than older generations. They have better education and better networks than others. Moreover, the younger entrepreneurs consider their networks more important than their senior colleagues do. As expected, early phase farmers had invested significantly more and have more liabilities than the others. In addition, the early phase farmers are also the most active in developing their farms. The late phase farmers were the least active, even if they were going to have a succession within the next few years. This might be problematic for the successor, too. However, in order to improve the viability of the whole farming sector, farms should be developed as a continuum.

Keywords: farm management, multivariate data analysis, Farm Management

Research Papers in Economics

Enriching ontological user profiles with tagging history for multi-domain recommendations

Author: Alani Harith
Cantador Iván
Castells Pablo
Fernandez Miriam
Szomszor Martin
Publication date: 01/01/2008

Many advanced recommendation frameworks employ ontologies of various complexities to model individuals and items, providing a mechanism for the expression of user interests and the representation of item attributes. As a result, complex matching techniques can be applied to support individuals in the discovery of items according to explicit and implicit user preferences. Recently, the rapid adoption of Web 2.0 and the proliferation of social networking sites have resulted in more and more users providing an increasing amount of information about themselves that could be exploited for recommendation purposes. However, the unification of personal information with ontologies using the contemporary knowledge representation methods often associated with Web 2.0 applications, such as community tagging, is a non-trivial task. In this paper, we propose a method for the unification of tags with ontologies by grounding tags to a shared representation in the form of WordNet and Wikipedia. We incorporate individuals' tagging history into their ontological profiles by matching tags with ontology concepts. This approach is preliminarily evaluated by extending an existing news recommendation system with user tagging histories harvested from popular social networking sites.

CiteSeerX

Southampton (e-Prints Soton)

Open Research Online (The Open University)

Biblos-e Archivo

SLO-aware Colocation of Data Center Tasks Based on Instantaneous Processor Requirements

Author: Boutin Eric
Goel Ashish
Wang Meng
Publication venue: Association for Computing Machinery (ACM)
Publication date: 05/09/2017

In a cloud data center, a single physical machine simultaneously executes dozens of highly heterogeneous tasks. Such colocation results in more efficient utilization of machines, but, when tasks' requirements exceed available resources, some of the tasks might be throttled down or preempted. We analyze version 2.1 of the Google cluster trace, which shows short-term (1 second) task CPU usage. Contrary to the assumptions made by many theoretical studies, we demonstrate that the empirical distributions do not follow any single distribution. However, high percentiles of the total processor usage (summed over at least 10 tasks) can be reasonably estimated by the Gaussian distribution. We use this result for a probabilistic fit test, called the Gaussian Percentile Approximation (GPA), for standard bin-packing algorithms. To check whether a new task will fit into a machine, GPA checks whether the resulting distribution's percentile corresponding to the requested service level objective (SLO) is still below the machine's capacity. In our simulation experiments, GPA resulted in colocations exceeding the machines' capacity with a frequency similar to the requested SLO.

Comment: Author's version of a paper published in ACM SoCC'1
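The fit test described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes each task is summarized by the mean and variance of its CPU usage, that task usages are independent (so variances add), and it uses a hypothetical function name `gpa_fits`. The abstract reports that the Gaussian estimate is reasonable when at least 10 tasks are summed.

```python
from math import sqrt
from statistics import NormalDist


def gpa_fits(placed_tasks, new_task, capacity, slo=0.99):
    """Return True if the new task passes a Gaussian percentile fit test.

    placed_tasks: list of (mean, variance) of CPU usage for tasks on the machine.
    new_task:     (mean, variance) of the candidate task.
    capacity:     machine CPU capacity, in the same units as the means.
    slo:          requested service level objective as a probability in (0, 1).
    """
    tasks = list(placed_tasks) + [new_task]
    total_mean = sum(mean for mean, _ in tasks)
    # Assumption of this sketch: independent tasks, so variances add up.
    total_var = sum(var for _, var in tasks)
    if total_var == 0:
        # Degenerate case: usage is constant, so compare the mean directly.
        return total_mean <= capacity
    # Percentile of the approximating Gaussian that corresponds to the SLO.
    percentile = NormalDist(total_mean, sqrt(total_var)).inv_cdf(slo)
    return percentile <= capacity


if __name__ == "__main__":
    placed = [(0.05, 0.0004)] * 10  # ten similar tasks, made-up numbers
    print(gpa_fits(placed, (0.05, 0.0004), capacity=1.0))
    print(gpa_fits(placed, (0.05, 0.0004), capacity=0.6))
```

A bin-packing heuristic such as first fit could call a check like this in place of a plain "sum of requested CPU is below capacity" test, trying machines in order until one accepts the task.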

arXiv.org e-Print Archive

Crossref