
    Scalable Data Integration for Linked Data

    Linked Data describes an extensive set of structured but heterogeneous data sources in which entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most existing link discovery approaches disregard the fact that, in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed that link equivalent entities of multiple data sources to construct integrated knowledge bases. In this context, a focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable the reuse of existing links. Unlike previous systems, we execute the complete data integration workflow on a distributed processing system. First, the LinkLion portal is introduced to provide existing links for new applications. These links act as a basis for a physical data integration process that creates a unified representation for equivalent entities from many data sources. We then propose a holistic clustering approach to form consolidated clusters for the same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that support the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus, we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster.
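
    As a rough illustration of the pairwise link discovery step described in this abstract (matching entities by the similarity of their properties), the following minimal Python sketch compares one label property with a Jaccard token similarity and emits a link when the score exceeds a threshold. The data sources, the property name and the 0.5 threshold are illustrative assumptions, not details from the dissertation.

        # Minimal pairwise link discovery sketch (illustrative, not the dissertation's system).
        def tokenize(value):
            return value.lower().split()

        def jaccard(a, b):
            """Jaccard similarity between two token sets."""
            a, b = set(a), set(b)
            return len(a & b) / len(a | b) if a | b else 0.0

        def discover_links(source_a, source_b, prop="label", threshold=0.5):
            """Compare one property of every entity pair across two sources
            and keep the pairs whose similarity reaches the threshold."""
            links = []
            for id_a, ent_a in source_a.items():
                for id_b, ent_b in source_b.items():
                    sim = jaccard(tokenize(ent_a[prop]), tokenize(ent_b[prop]))
                    if sim >= threshold:
                        links.append((id_a, id_b, sim))
            return links

        # Two toy data sources describing the same real-world city.
        dbpedia = {"db:Leipzig": {"label": "Leipzig Germany"}}
        geonames = {"gn:2879139": {"label": "Leipzig"}}
        print(discover_links(dbpedia, geonames))  # [('db:Leipzig', 'gn:2879139', 0.5)]

    The quadratic pairwise comparison in this sketch is exactly what becomes prohibitive with many intersecting sources, which motivates the holistic and distributed clustering approaches of the dissertation.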

    Clustering Approaches for Multi-source Entity Resolution

    Entity Resolution (ER) or deduplication aims at identifying entities, such as specific customer or product descriptions, in one or several data sources that refer to the same real-world entity. ER is of key importance for improving data quality and has a crucial role in data integration and querying. The previous generation of ER approaches focused on integrating records from two relational databases or performing deduplication within a single database. Nevertheless, in the era of Big Data the number of available data sources is increasing rapidly, so large-scale data mining or querying systems need to integrate data obtained from numerous sources. For example, in online digital libraries or e-shops, publications or products are incorporated from a large number of archives or suppliers across the world or within a specified region or country to provide a unified view for the user. This process requires data consolidation from numerous heterogeneous data sources, most of which are evolving. As the number of sources grows, data heterogeneity and velocity as well as the variance in data quality increase. Multi-source ER, i.e. finding matching entities in an arbitrary number of sources, is therefore a challenging task. Previous efforts for matching and clustering entities between multiple sources (> 2) mostly treated all sources as a single source. This approach precludes utilizing metadata or provenance information for enhancing the integration quality and leads to poor results because differences in the quality of the sources are ignored. The conventional ER pipeline consists of blocking, pair-wise matching of entities, and classification. In order to meet the new needs and requirements, holistic clustering approaches that are capable of scaling to many data sources are needed. Holistic clustering-based ER should further overcome the restriction of pairwise linking of entities by grouping entities from multiple sources into clusters. The clustering step aims at removing false links while adding missing true links across sources. Additionally, incremental clustering and repairing approaches need to be developed to cope with the ever-increasing number of sources and newly arriving entities. To this end, we developed novel clustering and repairing schemes for multi-source entity resolution. The approaches are capable of grouping entities from multiple clean (duplicate-free) sources, as well as handling data from an arbitrary combination of clean and dirty sources. The clustering schemes developed specifically for multi-source ER obtain superior results compared to general-purpose clustering algorithms. Additionally, we developed incremental clustering and repairing methods to handle evolving sources. The proposed incremental approaches are capable of incorporating new sources as well as new entities from existing sources. The more sophisticated approach is able to repair previously determined clusters and consequently yields improved quality and a reduced dependency on the insertion order of the new entities. To ensure scalability, parallel variants of all approaches are implemented on top of the Apache Flink framework, a distributed processing engine. The proposed methods have been integrated in a new end-to-end ER tool named FAMER (FAst Multi-source Entity Resolution system). The FAMER framework comprises Linking and Clustering components encompassing both batch and incremental ER functionalities. The output of the Linking part is recorded as a similarity graph in which each vertex represents an entity and each edge carries the similarity between two entities. This similarity graph is the input of the Clustering component. Comprehensive comparative evaluations show that the proposed clustering and repairing approaches for both batch and incremental ER achieve high quality while maintaining scalability.
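
    To make the similarity-graph representation concrete, the following minimal Python sketch (an illustration, not FAMER's actual implementation) stores the Linking output as weighted edges between entities from different sources and groups them with a simple connected-components clustering; the example entities, scores and the 0.8 threshold are assumed for illustration.

        from collections import defaultdict

        # Similarity graph: vertices are entities (prefixed by source), edges carry similarity scores.
        edges = [
            ("s1/iPhone 13", "s2/Apple iPhone13", 0.92),
            ("s2/Apple iPhone13", "s3/iphone-13", 0.88),
            ("s1/Galaxy S21", "s3/samsung-galaxy-s21", 0.90),
        ]

        def connected_components(edges, threshold=0.8):
            """Cluster entities as connected components over edges above the threshold (union-find)."""
            parent = {}
            def find(x):
                parent.setdefault(x, x)
                while parent[x] != x:
                    parent[x] = parent[parent[x]]  # path halving
                    x = parent[x]
                return x
            def union(a, b):
                parent[find(a)] = find(b)
            for a, b, sim in edges:
                if sim >= threshold:
                    union(a, b)
            clusters = defaultdict(set)
            for v in list(parent):
                clusters[find(v)].add(v)
            return list(clusters.values())

        print(connected_components(edges))
        # e.g. [{'s1/iPhone 13', 's2/Apple iPhone13', 's3/iphone-13'}, {'s1/Galaxy S21', 's3/samsung-galaxy-s21'}]

    Connected components is only the simplest baseline; the clustering and repairing schemes described above additionally exploit source metadata (e.g. that clean sources are duplicate-free) to split or repair such components.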

    Using Multiple Instance Learning techniques to rank maize ears according to their traits

    Multiple-Instance Learning (MIL) is a sub-field of machine learning. Its main goal is to make accurate predictions on new data based on a predictive model generated from a previously labeled group of bags of data, known as training data, each containing many instances. MIL has many important real-world applications such as image retrieval, text categorization and medical diagnosis. It is often difficult for crop breeders to predict yield by combining different yield components to produce better plants with superior performance. Data analysis is one area that is striving to give farmers an idea of their expected yield pre-harvest. Accurate early yield prediction will improve agricultural strategy planning and resource allocation and improve the management of maize ear cultivation, with a consequent increase in productivity. Most experiments on maize ear traits only considered ear evaluation and maize improvement without yield prediction. One of the experiments that included yield prediction was PR.NDCG, a measure developed to rank maize ears for the Sousa Valley Best Ear Competition. The focus of this work was to build and analyse intelligent regression models by running several MIL algorithms to predict and assign a real value to maize yield from randomly formed groups of varying sizes, built from the maize ear traits and soil parameters of a maize population dataset. Furthermore, this dissertation also ranked the models by their results and established relationships between the variables.
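
    A common baseline for the kind of MIL regression described in this abstract is to embed each bag (here, a group of maize ear instances) by aggregating its instance features and then fit an ordinary regressor on the bag-level features. The sketch below uses synthetic data and hypothetical trait dimensions purely for illustration; it is not the model developed in the dissertation.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)

        # Each bag is a variable-size group of instances (rows: ears, columns: hypothetical
        # ear/soil features); each bag carries one real-valued yield label.
        bags = [rng.normal(size=(rng.integers(5, 15), 3)) for _ in range(40)]
        yields = np.array([b.mean() * 2.0 + rng.normal(scale=0.1) for b in bags])  # synthetic labels

        def embed(bag):
            """Embed a bag by the mean and max of its instance features."""
            return np.concatenate([bag.mean(axis=0), bag.max(axis=0)])

        X = np.vstack([embed(b) for b in bags])
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, yields)

        new_bag = rng.normal(size=(8, 3))
        print("predicted yield:", model.predict(embed(new_bag).reshape(1, -1))[0])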

    Machine learning for managing structured and semi-structured data

    As the digitalization of the private, commercial, and public sectors advances rapidly, an increasing amount of data is becoming available. In order to gain insights or knowledge from these enormous amounts of raw data, a deep analysis is essential. The immense volume requires highly automated processes with minimal manual interaction. In recent years, machine learning methods have taken on a central role in this task. In addition to the individual data points, their interrelationships often play a decisive role, e.g. whether two patients are related to each other or whether they are treated by the same physician. Hence, relational learning is an important branch of research, which studies how to harness this explicitly available structural information between different data points. Recently, graph neural networks have gained importance. These can be considered an extension of convolutional neural networks from regular grids to general (irregular) graphs. Knowledge graphs play an essential role in representing facts about entities in a machine-readable way. While great efforts are made to store as many facts as possible in these graphs, they often remain incomplete, i.e., true facts are missing. Manual verification and expansion of the graphs is becoming increasingly difficult due to the large volume of data and must therefore be assisted or substituted by automated procedures which predict missing facts. The field of knowledge graph completion can be roughly divided into two categories: Link Prediction and Entity Alignment. In Link Prediction, machine learning models are trained to predict unknown facts between entities based on the known facts. Entity Alignment aims at identifying shared entities between graphs in order to link several such knowledge graphs based on some provided seed alignment pairs. In this thesis, we present important advances in the field of knowledge graph completion. For Entity Alignment, we show how novel active learning techniques can reduce the number of required seed alignments while maintaining performance. We also discuss the power of textual features and show that graph-neural-network-based methods have difficulties with noisy alignment data. For Link Prediction, we demonstrate how to improve the prediction for entities that are unknown at training time by exploiting additional metadata on individual statements, which is often available in modern graphs. Supported by results from a large-scale experimental study, we present an analysis of the effect of individual components of machine learning models, e.g. the interaction function or the loss criterion, on the task of link prediction. We also introduce a software library that simplifies the implementation and study of such components and makes them accessible to a wide research community, ranging from relational learning researchers to applied fields such as the life sciences. Finally, we propose a novel metric for evaluating ranking results, as used for both completion tasks. It allows for easier interpretation and comparison, especially in cases with different numbers of ranking candidates, as encountered in the de facto standard evaluation protocols for both tasks.
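
    To give a concrete idea of what an "interaction function" in knowledge graph link prediction looks like, the following minimal sketch scores triples with the DistMult interaction over randomly initialised embeddings. DistMult is one standard interaction function from the literature and is used here only as an example; the entities, relation and embedding size are illustrative assumptions, not the specific setup of the thesis.

        import numpy as np

        rng = np.random.default_rng(42)
        dim = 16
        entities = {"Berlin": 0, "Germany": 1, "Paris": 2}
        relations = {"capital_of": 0}

        # Randomly initialised embeddings; in practice they are trained on the known facts.
        E = rng.normal(size=(len(entities), dim))
        R = rng.normal(size=(len(relations), dim))

        def distmult_score(head, relation, tail):
            """DistMult interaction: sum_i h_i * r_i * t_i (higher means more plausible)."""
            return float(np.sum(E[entities[head]] * R[relations[relation]] * E[entities[tail]]))

        # Rank candidate tails for the query (Berlin, capital_of, ?).
        scores = {t: distmult_score("Berlin", "capital_of", t) for t in ("Germany", "Paris")}
        print(sorted(scores.items(), key=lambda kv: -kv[1]))

    Link prediction evaluation then ranks the true tail among all candidate entities, which is where rank-based metrics such as the one proposed in the thesis come into play.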

    Proceedings der 11. Internationalen Tagung Wirtschaftsinformatik (WI2013) - Band 1

    The two volumes represent the proceedings of the 11th International Conference on Wirtschaftsinformatik WI2013 (Business Information Systems). They include 118 papers from ten research tracks, a general track and the Student Consortium. The selection of all submissions was subject to a double-blind procedure with three reviews for each paper and an overall acceptance rate of 25 percent. The WI2013 was organized at the University of Leipzig between February 27th and March 1st, 2013 and followed the main themes Innovation, Integration and Individualization. The tracks in this volume are: Track 1: Individualization and Consumerization; Track 2: Integrated Systems in Manufacturing Industries; Track 3: Integrated Systems in Service Industries; Track 4: Innovations and Business Models; Track 5: Information and Knowledge Management.

    Letters from the War of Ecosystems – An Analysis of Independent Software Vendors in Mobile Application Marketplaces

    The recent emergence of a new generation of mobile application marketplaces has changed the business dynamics of mobile ecosystems. The marketplaces have gathered over a million applications by hundreds of thousands of application developers and publishers. Thus, software ecosystems, consisting of developers, consumers and the orchestrator, have emerged as part of the mobile ecosystem. This dissertation addresses the new challenges faced by mobile application developers in the new ecosystems through empirical methods. Using the theories of two-sided markets and business ecosystems as its basis, the thesis assesses monetization and value creation in the market as well as the impact of electronic Word-of-Mouth (eWOM) and developer multi-homing, i.e. contributing to more than one platform, in the ecosystems. The data for the study were collected by web crawling from the three biggest marketplaces: Apple App Store, Google Play and Windows Phone Store. The dissertation consists of six individual articles. The results of the studies show a gap in monetization among the studied applications, while a majority of applications are produced by small or micro-enterprises. The study finds only weak support for the impact of eWOM on the sales of an application in the studied ecosystem. Finally, the study reveals a clear difference in the multi-homing rates between the top application developers and the rest. As discussed in the thesis, this has an impact on future market analyses: it seems that the smart device market can sustain several parallel application marketplaces.

    Pedestrian Mobility Mining with Movement Patterns

    In street-based mobility mining, pedestrian volume estimation receives increasing attention, as it enables important applications such as billboard evaluation, attraction ranking and emergency support systems. In practice, empirical measurements are sparse due to budget limitations and constrained mounting options. Therefore, estimation of pedestrian quantity is required to perform pedestrian mobility analysis at unobserved locations. Accurate pedestrian mobility analysis is difficult to achieve due to the non-random path selection of individual pedestrians (resulting from motivated movement behaviour), which causes the pedestrian volumes to distribute non-uniformly across the traffic network. Existing approaches (pedestrian simulations and data mining methods) are hard to adjust to sensor measurements or require more expensive input data (e.g. high-fidelity floor plans or the total number of pedestrians in the site) and are thus infeasible. In order to achieve a mobility model that encodes pedestrian volumes accurately, we propose two methods within the regression framework that overcome the limitations of existing methods. These two methods incorporate not just topological information and episodic sensor readings, but also prior knowledge on movement preferences and movement patterns. The first is based on Least Squares Regression (LSR). The advantages of this method are the easy inclusion of route-choice heuristics and its robustness towards contradictory measurements. The second method is Gaussian Process Regression (GPR). The advantages of this method are the possibility of including expert knowledge on pedestrian movement and of estimating the uncertainty in predicting the unknown frequencies. Furthermore, the kernel matrix of the pedestrian frequencies returned by the method supports sensor placement decisions. Major benefits of the regression approach are (1) seamless integration of expert data and (2) simple reproduction of sensor measurements; further advantages are (3) invariance of the results against traffic network homeomorphism and (4) a computational complexity that depends not on the number of modelled pedestrians but on the complexity of the traffic network. We compare our novel approaches to a state-of-the-art pedestrian simulation (Generalized Centrifugal Force Model), existing data mining methods for traffic volume estimation (Spatial k-Nearest Neighbour) and commonly used graph kernels for Gaussian Process Regression (Squared Exponential, Regularized Laplacian and Diffusion Kernel) in terms of prediction performance, measured with the mean absolute error. Our methods showed significantly lower error rates. Since pattern knowledge is not easy to obtain, we present algorithms for pattern acquisition and analysis from Episodic Movement Data. The proposed analysis of Episodic Movement Data involves spatio-temporal aggregation of visits and flows, cluster analyses and dependency models. For pedestrian mobility data collection we further developed and successfully applied the recently emerged Bluetooth tracking technology. The introduced methods are combined into a system for pedestrian mobility analysis which comprises three layers. The Sensor Layer (1) monitors geo-coded sensor recordings of people's presence and passes this episodic movement data as input to the next layer. By using standardized Open Geospatial Consortium (OGC) compliant interfaces for data collection, we support seamless integration of various sensor technologies depending on the application requirements. The Query Layer (2) interacts with the user, who can request analyses for a given region and time interval. Results are returned to the user in OGC-conformant Geography Markup Language (GML) format. The user query triggers the Analysis Layer (3), which utilizes the mobility model for pedestrian volume estimation. The proposed approach is promising for location performance evaluation and attractor identification and was successfully applied in numerous industrial settings: the Zurich central train station, the zoo of Duisburg (Germany) and a football stadium (Stade des Costières Nîmes, France).
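
    The graph-kernel GPR idea sketched in this abstract can be illustrated as follows: define a kernel over the locations of a traffic network (here the Regularized Laplacian kernel mentioned above), condition on a few sensor counts and read off the Gaussian-process posterior mean for the unobserved locations. The toy network, the measured counts, the noise level and the kernel parameter are assumptions for illustration only.

        import numpy as np

        # Toy traffic network: 5 locations connected by walkable segments (adjacency matrix).
        A = np.array([
            [0, 1, 0, 0, 1],
            [1, 0, 1, 0, 0],
            [0, 1, 0, 1, 0],
            [0, 0, 1, 0, 1],
            [1, 0, 0, 1, 0],
        ], dtype=float)
        L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
        K = np.linalg.inv(np.eye(5) + 0.5 * L)  # Regularized Laplacian kernel, K = (I + sigma^2 L)^-1

        observed = [0, 2]                       # locations equipped with sensors
        counts = np.array([120.0, 80.0])        # measured pedestrian volumes
        unobserved = [1, 3, 4]
        noise = 1e-2

        # GP posterior mean for the unobserved locations (zero-mean prior kept for brevity;
        # in practice one would model a mean volume as well).
        K_oo = K[np.ix_(observed, observed)] + noise * np.eye(len(observed))
        K_uo = K[np.ix_(unobserved, observed)]
        mean = K_uo @ np.linalg.solve(K_oo, counts)
        print(dict(zip(unobserved, np.round(mean, 1))))

    The same kernel matrix also quantifies how informative a potential sensor location is about the rest of the network, which is the basis for the sensor placement support mentioned above.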

    FCAIR 2012 Formal Concept Analysis Meets Information Retrieval Workshop co-located with the 35th European Conference on Information Retrieval (ECIR 2013) March 24, 2013, Moscow, Russia

    Formal Concept Analysis (FCA) is a mathematically well-founded theory aimed at data analysis and classification. The area came into being in the early 1980s and has since spawned over 10,000 scientific publications and a variety of practically deployed tools. FCA allows one to build, from a data table with objects in rows and attributes in columns, a taxonomic data structure called a concept lattice, which can be used for many purposes, especially for Knowledge Discovery and Information Retrieval. The Formal Concept Analysis Meets Information Retrieval (FCAIR) workshop, co-located with the 35th European Conference on Information Retrieval (ECIR 2013), was intended, on the one hand, to attract researchers from the FCA community to a broad discussion of FCA-based research on information retrieval and, on the other hand, to promote the ideas, models, and methods of FCA in the Information Retrieval community.
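
    To make the notion of a concept lattice concrete, the following small Python sketch enumerates all formal concepts of a toy binary context (objects in rows, attributes in columns) by closing every attribute subset. The example context is invented, and the brute-force enumeration is only practical for very small contexts.

        from itertools import combinations

        # Toy formal context: objects and the attributes they have.
        context = {
            "duck":  {"flies", "swims"},
            "eagle": {"flies", "hunts"},
            "shark": {"swims", "hunts"},
        }
        attributes = {"flies", "swims", "hunts"}

        def extent(attr_set):
            """Objects having all attributes in attr_set."""
            return {o for o, attrs in context.items() if attr_set <= attrs}

        def intent(obj_set):
            """Attributes shared by all objects in obj_set."""
            return set.intersection(*(context[o] for o in obj_set)) if obj_set else set(attributes)

        # A formal concept is a pair (extent, intent) that is closed under the two derivation maps.
        concepts = {(frozenset(extent(set(c))), frozenset(intent(extent(set(c)))))
                    for r in range(len(attributes) + 1)
                    for c in combinations(sorted(attributes), r)}

        for ext, itt in sorted(concepts, key=lambda c: -len(c[0])):
            print(sorted(ext), "<->", sorted(itt))

    Ordering these concepts by inclusion of their extents yields the concept lattice that FCA-based knowledge discovery and retrieval methods operate on.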