58 research outputs found


    Embedding Techniques to Solve Large-scale Entity Resolution

    Entity resolution (ER) identifies and links records that belong to the same real-world entities, where an entity refers to any real-world object. It is a primary task in data integration. Accurate and efficient ER substantially impacts various commercial, security, and scientific applications. Often, datasets and databases lack unique identifiers for entities that would make the ER task easy; record matching therefore depends on entity-identifying attributes and approximate matching techniques. Efficiently handling large-scale data remains an open research problem as the volumes and velocities of modern data collections increase, and fast, scalable, real-time, approximate entity matching techniques that provide high-quality results are in high demand. This thesis proposes solutions to two challenges in large-scale ER: the lack of test datasets and the demand for fast indexing algorithms. The shortage of large-scale, real-world datasets with ground truth is a primary concern in developing and testing new ER algorithms. For many datasets, there is no ground truth or ‘gold standard’ information specifying whether two records correspond to the same entity. Moreover, obtaining test data for ER algorithms that use personal identifying keys (e.g., names, addresses) is difficult due to privacy and confidentiality issues. To address this challenge, we propose a numerical simulation model that produces realistic large-scale data for testing new methods when suitable public datasets are unavailable. An important finding of this work is the approximation of vectors that represent entity identification keys and their relationships, e.g., dissimilarities and errors. Indexing techniques reduce the search space and execution time in the ER process.
Building on the idea of approximate vectors of entity identification keys, we propose a fast indexing technique (Em-K indexing) suitable for real-time, approximate entity matching in large-scale ER. Our Em-K indexing method quickly produces an accurate block of candidate matches for a query record by searching an existing reference database. All our solutions are metric-based: we transform metric or non-metric spaces into a lower-dimensional Euclidean space, known as the configuration space, using multidimensional scaling (MDS). This thesis discusses how to modify MDS algorithms to solve various ER problems efficiently, and we propose highly efficient and scalable approximation methods that extend the MDS algorithm to large-scale datasets. We empirically demonstrate the improvements of our proposed approaches on several datasets with various parameter settings. The outcomes show that our methods can generate large-scale testing data, perform fast real-time and approximate entity matching, and effectively scale up the mapping capacity of MDS.
Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 202
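
As a rough, generic illustration of the configuration-space idea (not the thesis's Em-K algorithm itself), classical MDS can embed a pairwise-distance matrix into a low-dimensional Euclidean space, and a candidate block for a query point can then be read off by nearest-neighbour search; the function names and block size below are illustrative assumptions:

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Embed an n x n distance matrix D into k-dimensional Euclidean coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:k]            # keep the k largest
    return V[:, top] * np.sqrt(np.clip(w[top], 0, None))

def candidate_block(X: np.ndarray, query: np.ndarray, size: int = 5) -> np.ndarray:
    """Indices of the `size` reference points nearest to `query` in configuration space."""
    d = np.linalg.norm(X - query, axis=1)
    return np.argsort(d)[:size]
```

This sketch assumes all pairwise distances are precomputed; in the thesis's setting, the key scalability step is extending MDS so that out-of-sample query records can be mapped into the configuration space efficiently.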

    Interactive Feature Selection and Visualization for Large Observational Data

    Data can create enormous value in both scientific and industrial fields, especially by providing access to new knowledge and inspiring innovation. With massive increases in computing power, data storage capacity, and the capability to generate and collect data, scientific research communities are confronting a transformation in how large-scale, complex, high-resolution data sets are exploited for situation awareness and decision-making. Comprehensively analysing big data problems requires analyses of multiple aspects, including the effective selection of static and time-varying feature patterns that match the interests of domain users. To fully utilize the benefits of the ever-growing size of data and computing power in real applications, we propose a general feature analysis pipeline and an integrated system that is general, scalable, and reliable for interactive feature selection and visualization of large observational data for situation awareness. The central challenge tackled in this dissertation is how to effectively identify and select meaningful features in a complex feature space. Our research efforts address three aspects: 1. enabling domain users to better define their interests of analysis; 2. accelerating the process of feature selection; 3. comprehensively presenting the intermediate and final analysis results in a visualized way. For static feature selection, we developed a series of quantitative metrics that relate user interest to the spatio-temporal characteristics of features. For time-varying feature selection, we proposed the concept of a generalized feature set and used a generalized time-varying feature to describe the selection interest. Additionally, we provide a scalable system framework that manages both data processing and interactive visualization and effectively exploits computation and analysis resources.
The methods and the system design together actualized interactive feature selection on two representative large observational data sets, with large spatial and temporal resolutions respectively. The final results support big data analysis applications that combine statistical methods with high-performance computing techniques to visualize real events interactively.
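
The dissertation's actual interest metrics are not given in this abstract; as a purely hypothetical sketch of how a quantitative metric might tie user-supplied weights to spatio-temporal feature characteristics (the field names `extent` and `persistence` are invented for illustration):

```python
# Hypothetical feature records: each carries a normalized spatial extent
# and a normalized temporal persistence in [0, 1].
def interest_score(feature: dict, w_space: float = 0.6, w_time: float = 0.4) -> float:
    """Weighted mix of spatio-temporal characteristics as a stand-in 'user interest' metric."""
    return w_space * feature["extent"] + w_time * feature["persistence"]

def select_top_features(features: list, k: int) -> list:
    """Rank features by interest and keep the k best, as an interactive selection step might."""
    return sorted(features, key=interest_score, reverse=True)[:k]
```

An interactive system would expose the weights as user controls, so that re-ranking reflects the analyst's current interest.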

    Mining complex trees for hidden fruit : a graph–based computational solution to detect latent criminal networks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology at Massey University, Albany, New Zealand.

    The detection of crime is a complex and difficult endeavour. Public and private organisations focusing on law enforcement, intelligence, and compliance commonly apply the rational isolated actor approach premised on observability and materiality. This manifests largely as entity-level risk management, sourcing ‘leads’ from reactive covert human intelligence sources and/or proactive sources by applying simple rules-based models. Focusing on discrete observable and material actors ignores that criminal activity exists within a complex system deriving its fundamental structural fabric from the interactions between actors, with the most unobservable actors likely to be both criminally proficient and influential. The graph-based computational solution developed to detect latent criminal networks responds to the inadequacy of the rational isolated actor approach, which ignores the connectedness and complexity of criminality. The core computational solution, written in the R language, consists of novel entity resolution, link discovery, and knowledge discovery technology. Entity resolution enables the fusion of multiple datasets with high accuracy (mean F-measure of 0.986 versus competitors' 0.872), generating a graph-based expressive view of the problem. Link discovery comprises link prediction and link inference, enabling the high-performance detection (accuracy of ~0.8 versus ~0.45 for relevant published models) of unobserved relationships such as identity fraud. Knowledge discovery applies the “GraphExtract” algorithm to the fused graph to create a set of subgraphs representing latent functional criminal groups, and a mesoscopic graph representing how these criminal groups are interconnected. Latent knowledge is generated from a range of metrics, including the “Super-broker” metric and attitude prediction.
The computational solution has been evaluated on a range of datasets that mimic an applied setting, demonstrating a scalable (tested on ~18 million node graphs) and performant (~33 hours runtime on a non-distributed platform) solution that successfully detects relevant latent functional criminal groups in around 90% of sampled cases and enables contextual understanding of the broader criminal system through the mesoscopic graph and associated metadata. The augmented data assets generated provide a multi-perspective systems view of criminal activity that enables advanced, informed decision-making across the microscopic, mesoscopic, and macroscopic levels.
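
The abstract does not detail “GraphExtract” itself; as a hedged toy of the general pattern it describes (subgraphs as latent functional groups plus a mesoscopic group-to-group graph), one might form groups from strong ties and let the weak ties that cross groups define the mesoscopic graph; the edge weights and threshold below are assumptions:

```python
def connected_components(edges):
    """Union-find over an edge list; returns {node: component_root}."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return {n: find(n) for n in parent}

def extract_groups(weighted_edges, threshold=0.5):
    """Strong ties define latent groups; weak cross-group ties form the mesoscopic graph."""
    strong = [(a, b) for a, b, w in weighted_edges if w >= threshold]
    groups = connected_components(strong)
    meso = set()
    for a, b, w in weighted_edges:
        ga, gb = groups.get(a, a), groups.get(b, b)
        if ga != gb:
            meso.add(tuple(sorted((ga, gb))))
    return groups, meso
```

The real algorithm operates on a fused graph produced by entity resolution and link discovery; this toy only shows the two-level (group and group-to-group) output shape.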

    Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web

    The amount of Semantic Web data is growing rapidly today. Individual users, academic institutions and businesses have already published and are continuing to publish their data in Semantic Web standards, such as RDF and OWL. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. Furthermore, data published by each individual publisher may not be complete. This situation makes it difficult for end users to consume the available Semantic Web data effectively. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. In the Semantic Web, the owl:sameAs predicate is used to link two equivalent (coreferent) ontology instances. An important question is where these owl:sameAs links come from. Although manual interlinking is possible on small scales, when dealing with large-scale datasets (e.g., millions of ontology instances), automated linking becomes necessary. This dissertation summarizes contributions to several aspects of entity coreference research in the Semantic Web. First of all, by developing the EPWNG algorithm, we advance the performance of the state-of-the-art by 1% to 4%. EPWNG finds coreferent ontology instances from different data sources by comparing every pair of instances, and it achieves high precision and recall by appropriately collecting and utilizing instance context information domain-independently. We further propose a sampling and utility function based context pruning technique, which provides a runtime speedup factor of 30 to 75.
Furthermore, we develop an on-the-fly candidate selection algorithm, P-EPWNG, that enables the coreference process to run 2 to 18 times faster than the state-of-the-art on up to 1 million instances while only slightly sacrificing coreference F1-score. This is achieved by utilizing the matching histories of the instances to prune instance pairs that are unlikely to be coreferent. We also propose Offline, another candidate selection algorithm, that not only provides a similar runtime speedup to P-EPWNG but also achieves higher candidate selection and coreference F1-scores due to its more accurate filtering of true negatives. Unlike P-EPWNG, Offline pre-selects candidate pairs by comparing only their partial context information, selected in an unsupervised, automatic, and domain-independent manner. In order to handle truly heterogeneous datasets, a mechanism for automatically determining predicate comparability is proposed. Combining this property matching approach with EPWNG and Offline, our system outperforms state-of-the-art algorithms on the 2012 Billion Triples Challenge dataset on up to 2 million instances for both coreference F1-score and runtime. An interesting project, where we apply the EPWNG algorithm to assist cervical cancer screening, is discussed in detail. By applying our algorithm to a combination of different patient clinical test results and biographic information, we achieve higher accuracy compared to its ablations. We end this dissertation with a discussion of promising and challenging future work.
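
As a minimal sketch of the partial-context idea behind Offline (not the published algorithm; the token-overlap measure and threshold are stand-ins), candidate pairs can be pre-selected by comparing a single cheap context field before any expensive full comparison:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two short text values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_candidates(instances: dict, field: str, threshold: float = 0.3) -> list:
    """Keep only pairs whose cheap partial-context field is similar enough for full comparison."""
    ids = list(instances)
    return [(ids[i], ids[j])
            for i in range(len(ids)) for j in range(i + 1, len(ids))
            if jaccard(instances[ids[i]][field], instances[ids[j]][field]) >= threshold]
```

A full coreference system would then run its precise (and costly) context comparison only on the surviving pairs, which is where the runtime speedup comes from.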

    Strategy performance relationship in Europe

    This dissertation comprises three studies that examine the relationship between business strategy and the performance of European companies listed on the STOXX index. Building on the strategic typology developed by Miles and Snow, which is based on how companies solve three fundamental problems (entrepreneurial, engineering, and administrative), our research investigates whether this strategic orientation is independent of national culture, of the degree of national uncertainty and dynamism, and of digital/technological transformation. The first study investigates the importance of strategic orientation and national culture for the performance of listed companies in Europe. Analysing the strategic orientation of these companies for the period between 2012 and 2019, we find empirical evidence that strategy significantly affects firm performance. Moreover, the results show that national culture has a moderating influence on the strategy-performance linkage. The second study explores the strategic behaviour of listed companies for the period between 2015 and 2019 in a context of uncertainty and dynamism (national innovation capability). The analyses enable a better understanding of the strategic behaviour that best fits these contexts and different levels of performance.
To assess those different performance levels, we complement the research with an econometric model based on quantile regression, analysing the effect of the explanatory variables at various quantiles of the conditional performance distribution, at both low and high levels of performance. Finally, the third study assesses whether business strategy remains valid in the context of digital transformation, for the most technology-intensive companies listed on the STOXX600 index. Panel data results for the period between 2015 and 2019 show that the prospector strategy significantly affects firm performance. The findings also suggest that the presence of a chief digital officer (CDO) in the top management team has a significant positive impact on firms' market-value performance, since these top managers play an important role in business transformation, acting as enablers of digital transformation and helping to explore new opportunities in the future.
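
Quantile regression, used in the second study, minimizes the asymmetric "pinball" loss rather than squared error; a minimal sketch, assuming a constant model and a grid search (not the dissertation's econometric specification):

```python
import numpy as np

def pinball_loss(y: np.ndarray, c: float, tau: float) -> float:
    """Asymmetric 'pinball' loss; its minimizer over c is the tau-quantile of y."""
    e = y - c
    return float(np.mean(np.maximum(tau * e, (tau - 1) * e)))

def fit_quantile(y: np.ndarray, tau: float, grid: np.ndarray):
    """Grid-search the constant c minimizing pinball loss: a toy quantile 'regression'."""
    losses = [pinball_loss(y, c, tau) for c in grid]
    return grid[int(np.argmin(losses))]
```

In the actual quantile-regression setting, c is replaced by a linear function of the explanatory variables, so the estimated coefficients can differ across low and high quantiles of the conditional performance distribution.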

    The Social and Cultural Alienation of First and Second Generation Immigrant Youths: Interrogating Mainstream Bullying Discourse

    Bullying is a multidirectional, multilevel social problem that affects every member of a community; addressing it therefore requires a diverse, multidisciplinary approach. Even though immigrant youths are at least as prone to this phenomenon as their native-born peers, advertisements, media, and news outlets have portrayed youth bullying as a white subject. Socio-cultural differences, past experiences of political, social, and domestic violence, and the difficulties of integrating and accommodating to the unfamiliar lifestyle of a new country heavily affect the vulnerability of these ignored young immigrant populations. If bullying experts, bullying-prevention activists, and the justice system bracketed their prejudices and looked carefully and analytically at every violent incident involving youths, regardless of the victims' socio-cultural backgrounds and the colour of their skin, many other fatal bullying cases might be uncovered. In this thesis, I take a closer look at three fatal student cases: Kiranjit Nijjar, Hamid Aminzada, and Zaid Youssef and Michael Menjivar, as examples of this faulty view of the characteristics of the bully and the bullied.

    Sustainable Agricultural Practices-Impact on Soil Quality and Plant Health

    Agricultural practices involving the excessive use of chemical fertilizers and pesticides pose major risks to the environment and to human health. The development and adoption of sustainable, eco-friendly agricultural management to preserve and enhance the physical, chemical, and biological properties of soils and improve agroecosystem functions is a challenge for both scientists and farmers. The Special Issue entitled “Sustainable Agricultural Practices—Impact on Soil Quality and Plant Health” is a collection of 10 original contributions addressing the state of the art of sustainable agriculture and its positive impact on soil quality. The content of this Special Issue covers a wide range of topics, including the use of beneficial soil microbes, intercropping, organic farming and its effects on soil bacteria and nutrient stocks, the application of plant-based nematicides and zeolite amendments, sustainability in CH4 emissions, and the effects of irrigation, fertilization, environmental conditions, and land suitability on crop production.

    Health systems data interoperability and implementation

    Objective The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable, and exchangeable between healthcare providers. Data sources Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate sources, namely MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded with the same standards and were therefore not comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this data a good candidate for this study. From the identified data sources, laboratory, physical examination, vital signs, and behavioural data were used. Methods This research employed the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for the classification of structured data and the other for unstructured data. For the first experiment, edit distance, TF-IDF, and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and the other uncoded. Similar sets of data were classified as matches while dissimilar sets were classified as non-matches. The Soundex indexing method was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated through the ROC curve.
The second experiment was aimed at extracting patients' smoking status from a clinical corpus. A sequence-oriented classification algorithm, CRF, was used for learning related concepts from the given clinical corpus, with word embedding, random indexing, and word shape features used to capture meaning in the corpus. Results Having optimized all the model's parameters through v-fold cross-validation on a sampled training set of structured data, only 8 of the 24 features were selected for the classification task. RapidMiner was used to train and test all the classification algorithms. On the final run of the classification process, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% with suitably tuned parameters. These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. The unstructured data, on the other hand, were annotated via the UIMA Ruta scripting language and then trained through CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the “nonsmoker” class, 83.0% for “currentsmoker”, and 65.7% for “pastsmoker”. It was observed that as more relevant data were added, the performance of the classifier improved. The results show the need for FHIR resources for exchanging clinical data between healthcare institutions: FHIR is free, and it uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML, and Turtle for representing messages. Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration.
Conclusion This study has provided a method for a computer algorithm to learn a clinical coding standard and then apply that learned standard to unstandardized data, so that the data become easily exchangeable, comparable, and searchable, ultimately achieving data interoperability. Even though this study was applied on a limited scale, future work will explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data-scaling platforms.
Information Science, M. Sc. (Computing
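
The Soundex step described above can be sketched generically; this is the standard Soundex code plus a simple blocking pass, not the study's exact pipeline:

```python
def soundex(name: str) -> str:
    """Classic Soundex code: first letter plus three digits describing consonant sounds."""
    digits = {}
    for letters, d in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            digits[ch] = d
    name = name.lower()
    code = name[0].upper()
    prev = digits.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                     # h and w do not break runs of equal codes
            continue
        d = digits.get(ch, "")
        if d and d != prev:                # skip vowels and collapse repeated codes
            code += d
        prev = d
    return (code + "000")[:4]              # pad or truncate to 4 characters

def block_by_soundex(names):
    """Group records by Soundex key so only same-key pairs need full comparison."""
    blocks = {}
    for n in names:
        blocks.setdefault(soundex(n), []).append(n)
    return blocks
```

Blocking by key means only records sharing a Soundex code are compared in full, which is what reduces the number of potential comparisons.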