    Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach

    Abstract. One challenge for Linked Data is scalably establishing high-quality owl:sameAs links between instances (e.g., people, geographical locations, publications, etc.) in different data sources. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm for pruning the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values that are chosen using domain-independent unsupervised learning. We index the instances on the chosen predicates' literal values to efficiently look up similar instances. We evaluate our approach on two RDF and three structured datasets. We show that the traditional metrics don't always accurately reflect the relative benefits of candidate selection, and propose additional metrics. We show that our algorithm frequently outperforms alternatives and is able to process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well by applying our technique. Surprisingly, this high recall, low precision filtering mechanism frequently leads to higher F-scores in the overall system.
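    The indexing idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: the use of character bigrams, Jaccard similarity, the 0.5 threshold, and all function names are assumptions chosen for illustration, and the literal values are assumed to have been selected already.

```python
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased literal."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(literals, n=2):
    """Map each n-gram to the ids of the instances whose literal contains it."""
    index = defaultdict(set)
    for instance_id, literal in literals.items():
        for gram in char_ngrams(literal, n):
            index[gram].add(instance_id)
    return index


def candidate_pairs(literals, threshold=0.5, n=2):
    """Return instance pairs whose n-gram Jaccard similarity reaches the threshold.

    Only pairs that share at least one n-gram are ever compared, so the
    inverted index prunes most of the quadratic search space.
    """
    index = build_index(literals, n)
    grams = {i: char_ngrams(lit, n) for i, lit in literals.items()}
    compared = set()
    for ids in index.values():
        compared.update(combinations(sorted(ids), 2))
    selected = set()
    for a, b in compared:
        similarity = len(grams[a] & grams[b]) / len(grams[a] | grams[b])
        if similarity >= threshold:
            selected.add((a, b))
    return selected


# Hypothetical instances, keyed by id, with one literal value each.
people = {"p1": "John Smith", "p2": "Jon Smith", "p3": "Mary Jones"}
pairs = candidate_pairs(people)
```

    In this toy input only the pair ("p1", "p2") passes the threshold; a full coreference step would then compare just the selected pairs in detail.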

    Data Profiling to Reveal Meaningful Structures for Standardization

    Today many organisations and enterprises use data from several sources, either for strategic decision making or for other business goals such as data integration. Data quality problems hinder the effective and efficient use of such data. Tools have been built to clean and standardize data; however, the data must first be pre-processed by applying techniques and processes from statistical semantics, NLP, and lexical analysis. Data profiling employed these techniques to discover and reveal commonalities and differences in the inherent data structures, to present ideas for the creation of a unified data model, and to provide metrics for data standardization and verification. The IBM WebSphere tool was used to pre-process the dataset records through the design and implementation of rule sets, which were developed in QualityStage, and tasks, which were created in DataStage. The data profiling process generated a set of statistics (frequencies), token/phrase relationships (RFDs, GRFDs), and other findings in the dataset that provided an overall view of the data source's inherent properties and structures. Examining the data from a dataset (identifying violations of the normal forms and other data commonalities) and collecting the desired information provided useful statistics for data standardization and verification by enabling disambiguation and classification of data.
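    The frequency statistics mentioned in the abstract above can be sketched in a few lines. This is only an illustration in plain Python, not the QualityStage or DataStage rule sets used in the thesis; the tokenization rule and the sample address records are assumptions.

```python
import re
from collections import Counter


def token_frequencies(records):
    """Count how often each lowercased alphanumeric token occurs across records."""
    counts = Counter()
    for record in records:
        counts.update(re.findall(r"[a-z0-9]+", record.lower()))
    return counts


# Hypothetical free-text address records.
addresses = ["12 Main St", "7 Main Street", "12 Oak St"]
frequencies = token_frequencies(addresses)
```

    A profile like this shows, for example, that "st" and "street" both occur, which hints at a variant that a standardization rule would need to unify.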

    Creating Relational Data from Unstructured and Ungrammatical Data Sources

    Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web

    The amount of Semantic Web data is growing rapidly today. Individual users, academic institutions and businesses have already published and are continuing to publish their data in Semantic Web standards, such as RDF and OWL. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. Furthermore, data published by each individual publisher may not be complete. This situation makes it difficult for end users to consume the available Semantic Web data effectively. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. In the Semantic Web, the owl:sameAs predicate is used to link two equivalent (coreferent) ontology instances. An important question is where these owl:sameAs links come from. Although manual interlinking is possible on small scales, when dealing with large-scale datasets (e.g., millions of ontology instances), automated linking becomes necessary. This dissertation summarizes contributions to several aspects of entity coreference research in the Semantic Web. First of all, by developing the EPWNG algorithm, we advance the performance of the state-of-the-art by 1% to 4%. EPWNG finds coreferent ontology instances from different data sources by comparing every pair of instances and focuses on achieving high precision and recall by appropriately collecting and utilizing instance context information domain-independently. We further propose a sampling and utility function based context pruning technique, which provides a runtime speedup factor of 30 to 75. 
Furthermore, we develop an on-the-fly candidate selection algorithm, P-EPWNG, that enables the coreference process to run 2 to 18 times faster than the state-of-the-art on up to 1 million instances while making only a small sacrifice in the coreference F1-scores. This is achieved by utilizing the matching histories of the instances to prune instance pairs that are not likely to be coreferent. We also propose Offline, another candidate selection algorithm, which not only provides a runtime speedup similar to that of P-EPWNG but also helps to achieve higher candidate selection and coreference F1-scores due to its more accurate filtering of true negatives. Unlike P-EPWNG, Offline pre-selects candidate pairs by comparing only their partial context information, which is selected in an unsupervised, automatic and domain-independent manner. In order to handle highly heterogeneous datasets, a mechanism for automatically determining predicate comparability is proposed. Combining this property matching approach with EPWNG and Offline, our system outperforms state-of-the-art algorithms on the 2012 Billion Triples Challenge dataset on up to 2 million instances for both coreference F1-score and runtime. An interesting project, in which we apply the EPWNG algorithm to assist cervical cancer screening, is discussed in detail. By applying our algorithm to a combination of different patient clinical test results and biographic information, we achieve higher accuracy compared to its ablations. We end this dissertation with a discussion of promising and challenging future work.
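    As a small illustration of what the output of such a coreference process looks like, the sketch below serializes one owl:sameAs link as an N-Triples statement. The helper name and the example.org identifiers are hypothetical; only the owl:sameAs IRI is the standard one.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(subject_iri, object_iri):
    """Format one owl:sameAs link between two instance IRIs as an N-Triples line."""
    return f"<{subject_iri}> <{OWL_SAME_AS}> <{object_iri}> ."


# Hypothetical identifiers for the same entity in two data sources.
link = same_as_triple("http://example.org/a/1", "http://example.org/b/9")
```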

    A scientific-research activities information system

    Purpose - The purpose of this research is the development of a model, the implementation of a software prototype, and the verification of a system for identifying methodology mentions in scientific publications in a subdomain of automatic terminology extraction. To give scientists better insight into the methodologies in their fields, the extracted methodologies should be connected with the metadata of the publication from which they are extracted. For this reason, a further purpose of this research was the development of a system for the automatic extraction of metadata from scientific publications. Design/methodology/approach - Methodology mentions are categorized into four semantic categories: Task, Method, Resource/Feature and Implementation.
The system comprises two major layers: the first layer automatically identifies methodological sentences; the second layer highlights methodological phrases (segments). Extraction and classification of the segments was formalized as a sequence tagging problem, and four separate phrase-based Conditional Random Fields were used to accomplish the task. The system has been evaluated on a manually annotated corpus comprising 45 full-text articles. The system for the automatic extraction of metadata from scientific publications is based on classification. The metadata are classified into eight pre-defined categories: Title, Authors, Affiliation, Address, Email, Abstract, Keywords and Publication Note. Experiments were performed with standard classification models: Decision Tree, Naive Bayes, K-nearest Neighbours and Support Vector Machines. Findings - The results of the system for methodology extraction show an F-measure of 53% for the identification of both Task and Method mentions (with 70% precision), whereas the F-measures for Resource/Feature and Implementation identification were 60% (with 67% precision) and 75% (with 85% precision) respectively. As for the system for the automatic extraction of metadata, Support Vector Machines provided the best performance. The F-measure was over 85% for almost all of the categories and over 90% for most of them. Research limitations/implications - Both the system for the extraction of methodologies and the system for the extraction of metadata are applicable only to scientific papers in English. Practical implications - The proposed models can be used to gain insight into the development of a scientific discipline and also to create semantically rich research activity information systems. Originality/Value - The main original contributions are as follows. A novel model for the extraction of methodology mentions from scientific publications was developed.
The impact of the various types of features on the performance of the system was determined and presented. A fully automated system for the extraction of metadata for semantically rich research activity information systems was developed.
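    The abstract above reports F-measures together with precision but not recall. The sketch below shows how the two quantities relate, assuming the reported figure is the balanced F-measure (F1), which the abstract does not state explicitly; the function names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("these values are not consistent with a balanced F-measure")
    return f1 * precision / denominator


# Figures reported for Task and Method mentions: F-measure 53%, precision 70%.
recall_task_method = implied_recall(0.53, 0.70)
```

    Under the F1 assumption, a precision well above the F-measure implies a recall below it, which is the pattern in all three reported category results.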

    Finite state models in information extraction

    This dissertation is devoted to research in the scientific field called information extraction, which can be seen as a sub-area of artificial intelligence and which combines and uses techniques and achievements of several areas of computer science. The term "information extraction" will be used in two different contexts. In the first, the term refers to the scientific area, and the acronym IE is used in that case. In the second, the term refers to the process of extracting information from text. Besides a survey of the state of the art in IE, an original approach and method for information extraction based on finite state transducers is presented. As a result of the research and the work on the dissertation, a database of microbial phenotype and genotype characteristics for 2412 species and 873 genera has been created. The database is intended for research in bioinformatics and genetics. The method used to create the database and the database itself are described in detail in several papers published in journals or presented at international conferences (Pajić, 2011; Pajić et al. 2011a; Pajić et al. 2011b). Section 1 gives an introduction to IE, together with the history of the development of methods in this area. The classification of textual resources used for information extraction and the classification of the information itself are described.
At the end of Section 1, IE is compared with other related disciplines of computer science. Section 2 presents the parts of formal language theory and abstract automata on which the dissertation is based. The mutual relationship between these two areas and their connection with IE are described. The emphasis is put on finite state models and the methods based on them. These methods show higher precision than other methods for extracting information, and are indispensable in situations where the accuracy of data extracted from the text is of crucial importance. Some specific terms of information extraction (the language of the relevant information, the language of extracted information, and extraction rules) are defined from the perspective of formal language theory. The basic property of the transduction relation for a given extraction rule is formulated and proved. The language of information context is defined and its regularity is proven.
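    To make the notion of a finite state transducer concrete, here is a minimal sketch of a deterministic transducer that reads tokens and emits extracted values. It is not the dissertation's method: the state names, the single toy rule for a Gram stain property, and the output format are all hypothetical.

```python
def run_transducer(tokens, transitions, start, accepting):
    """Run a deterministic finite state transducer over a token sequence.

    transitions maps (state, input_token) to (next_state, output); an
    output of None emits nothing. Returns the list of emitted outputs,
    or None if the input is rejected.
    """
    state, emitted = start, []
    for token in tokens:
        key = (state, token)
        if key not in transitions:
            return None  # no transition defined: reject the input
        state, output = transitions[key]
        if output is not None:
            emitted.append(output)
    return emitted if state in accepting else None


# Hypothetical extraction rule: "gram positive|negative [bacterium]".
TRANSITIONS = {
    ("q0", "gram"): ("q1", None),
    ("q1", "positive"): ("q2", "gram_stain=positive"),
    ("q1", "negative"): ("q2", "gram_stain=negative"),
    ("q2", "bacterium"): ("q2", None),
}

extracted = run_transducer(["gram", "negative", "bacterium"], TRANSITIONS, "q0", {"q2"})
```

    Because the rule accepts only the exact patterns it encodes, anything it emits matches the rule, which is one way to read the precision claim in the abstract above.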