7 research outputs found

    A Vector Based Method of Ontology Matching

    Extended Vector Space Model with Semantic Relatedness on Java Archive Search Engine

    Using byte code as an information source is a novel approach that enables a Java archive search engine to be built without relying on any resource other than the Java archive itself [1]. Unfortunately, its effectiveness is not particularly high, since some relevant documents may not be retrieved because of vocabulary mismatch. In this research, a vector space model (VSM) is extended with semantic relatedness to overcome the vocabulary mismatch issue in a Java archive search engine. Aiming for the most effective retrieval model, several variants of the retrieval equations are also proposed and evaluated: summing all related terms, substituting a non-existing term with its most related term, logarithmic normalization, context-specific relatedness, and treating query-related documents as low-ranked retrieved documents. In general, semantic relatedness improves recall at the cost of reduced precision. We also propose a scheme that takes advantage of relatedness without being affected by its disadvantage (VSM plus treating non-retrieved documents as low-ranked retrieved documents using semantic relatedness). This scheme ensures that relatedness scores are ranked lower than standard exact-match scores. It yields 1.754% higher effectiveness than our standard VSM.
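The two-tier ranking idea in this abstract (exact-match results first, relatedness-only results appended below them) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the raw term-count weighting, the `relatedness` lookup table, and the names `cosine` and `rank` are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs, relatedness):
    """Rank documents: exact-match hits first, relatedness-only hits after.

    docs: {doc_id: [token, ...]}
    relatedness: {term: {related_term: score in (0, 1]}} (hypothetical table)
    """
    query = Counter(query_tokens)

    # Assumption: the query is rewritten with related terms, weighted by
    # their relatedness scores, and used only for documents that the
    # exact-match query fails to retrieve.
    related_query = Counter()
    for term, weight in query.items():
        for related_term, score in relatedness.get(term, {}).items():
            related_query[related_term] += weight * score

    exact, fallback = [], []
    for doc_id, tokens in docs.items():
        doc = Counter(tokens)
        score = cosine(query, doc)
        if score > 0:
            exact.append((score, doc_id))
        else:
            related_score = cosine(related_query, doc)
            if related_score > 0:
                fallback.append((related_score, doc_id))

    # Sort each tier by descending score; ties are broken by document id.
    exact.sort(key=lambda pair: (-pair[0], pair[1]))
    fallback.sort(key=lambda pair: (-pair[0], pair[1]))

    # Concatenation guarantees every relatedness hit ranks below every
    # exact-match hit, regardless of the raw scores.
    return [doc_id for _, doc_id in exact] + [doc_id for _, doc_id in fallback]
```

Keeping the two tiers in separate lists, rather than blending the scores, is what preserves the precision of the plain VSM at the top of the result list while still recovering documents lost to vocabulary mismatch.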

    Mining semantic relations from product features

    The project was developed at the Department of Information Systems (IfIS) of the Technical University of Braunschweig (TU Braunschweig). Its main objective is to extract semantic relations between the terms initially extracted from product features, which form semantic groups, and the terms that appear in the comments users write about the same products, with the aim of forming a larger semantic group. To achieve this, real comments were processed with natural language processing techniques, and classic information retrieval techniques such as LSI/LSA, PLSI/PLSA, and LDA were then applied. The bulk of the project analyzes how well these techniques, which were not designed for this purpose, achieve the stated objectives. The results show that LDA is the method best suited to these objectives, corroborating the initial premises.

    Bioinformatics platform for executing Federated SPARQL queries over ontological databases and detecting similar data by determining their semantic relatedness

    The importance of bioinformatics as an interdisciplinary field rests on the large volume of biological data that can be used and processed with current information technology. What is vital in bioinformatics today is the availability of data relevant to research, as well as the knowledge that such data already exist. An important prerequisite is that the necessary data are publicly available and integrated, and that mechanisms for searching them have been developed. To solve these problems, the bioinformatics community uses semantic web technologies. Many semantic repositories and software solutions have been developed, and they have contributed significantly to research activities in bioinformatics. However, these approaches often run into problems because many databases were developed in isolation, without respecting the basic standards of the bioinformatics community. These heterogeneous databases, which link many highly specialized and independent resources, often use different conventions, vocabularies, and formats for representing data. Current software solutions therefore face various challenges in searching for and discovering relevant data. In addition, many databases overlap, covering or concealing similar data and thus forming semi-homogeneous or homogeneous data sources. In such cases, the semantic correlation between the databases is often unclear, and appropriate data analysis methods must be applied to identify similar data. This dissertation is the result of research aimed at overcoming the shortcomings of existing solutions.
The dissertation presents a contribution to the development of a bioinformatics platform, comprising a number of original software approaches that underpin its key functionalities: executing Federated SPARQL queries over initial (and user-selected) databases to discover data relevant to bioinformatics research, and detecting similar data by determining the semantic relatedness of the data. Federated SPARQL queries are executed over databases that use the Resource Description Framework (RDF) as their data model. Query results can subsequently be filtered, which improves their significance. Filtering involves selecting specific properties (predicates) during the dynamic projection of the RDF database structure and executing dynamically generated star-shaped SPARQL queries. The algorithm developed for detecting similar data represents an original approach and is applied to instances of ontological databases. It uses the principles of ontology alignment, text mining, the vector space model for the mathematical representation of data, and the cosine similarity measure for numerically determining data similarity. The Platform is the result of several years of research within CPCTAS (Centre for PreClinical Testing of Active Substances) and the Laboratory for Cellular and Molecular Biology, part of the Institute of Biology and Ecology at the Faculty of Science, University of Kragujevac. The Laboratory's work covers one important subfield of bioinformatics: preclinical testing of bioactive substances (potential cancer drugs). The primary goal of the Platform is to make the Laboratory's research more productive and efficient. The Platform was validated on test and real bioinformatics data sources, indicating high utilization of resources.
Thanks to the Platform's effective methods, the way has been opened for new research in bioinformatics, as well as in any other field that involves ontological data modelling.
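To make the filtering step in this abstract concrete, the following Python sketch shows one way a star-shaped query could be generated from a list of selected predicates and wrapped in a SPARQL 1.1 `SERVICE` clause for federated execution. It is an illustration under stated assumptions, not the Platform's code: the function name `star_query`, the variable naming scheme (`?s`, `?o0`, `?o1`, ...), the `limit` parameter, and the endpoint URL in the usage note are all hypothetical, and the sketch does not validate the IRIs it is given.

```python
def star_query(endpoint, predicates, limit=100):
    """Build a federated, star-shaped SPARQL SELECT query as a string.

    "Star-shaped" means every triple pattern shares the same subject (?s),
    with one pattern per selected predicate. Inputs are assumed to be
    trusted, well-formed IRIs; no escaping is performed.
    """
    if not predicates:
        raise ValueError("at least one predicate is required")

    # One object variable per selected predicate: ?o0, ?o1, ...
    object_vars = [f"?o{i}" for i in range(len(predicates))]
    patterns = "\n".join(
        f"    ?s <{predicate}> {var} ."
        for predicate, var in zip(predicates, object_vars)
    )

    return (
        f"SELECT ?s {' '.join(object_vars)}\n"
        "WHERE {\n"
        f"  SERVICE <{endpoint}> {{\n"
        f"{patterns}\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )
```

For example, `star_query("http://example.org/sparql", ["http://www.w3.org/2000/01/rdf-schema#label", "http://purl.org/dc/terms/description"])` produces a query whose `SERVICE` block contains two triple patterns on `?s`, one for each predicate. The resulting string could then be sent to any SPARQL 1.1 processor that supports federated queries.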