7 research outputs found

    Space mission design ontology: extraction of domain-specific entities and concepts similarity analysis

    Expert Systems, computer programs able to capture human expertise and mimic experts' reasoning, can support the design of future space missions by assimilating accumulated knowledge and facilitating access to it. To organise these data, such a virtual assistant needs to understand the concepts characterising space systems engineering; in other words, it needs an ontology of space systems. Unfortunately, there is currently no official European space systems ontology. Developing an ontology is a lengthy and tedious process involving several human domain experts, and it is therefore prone to human error and subjectivity. Could the foundations of an ontology instead be semi-automatically extracted from unstructured data related to space systems engineering? This paper presents an implementation of the first layers of the Ontology Learning Layer Cake, an approach to semi-automatically generate an ontology. Candidate entities and synonyms are extracted from three corpora: a set of 56 feasibility reports provided by the European Space Agency, 40 publicly available books on space mission design, and a collection of 273 Wikipedia pages. Lexica of relevant space systems entities are semi-automatically generated with three different methods: a frequency analysis, a term frequency-inverse document frequency (TF-IDF) analysis, and a Weirdness Index filtering. The frequency-based lexicon of the combined corpora is then fed to a word embedding method, word2vec, to learn the context of each entity. With a cosine similarity analysis, concepts with similar contexts are matched.
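    To make the extraction step concrete, here is a minimal Python sketch of the Weirdness Index filtering described above. The corpora, the smoothing, and the threshold are toy stand-ins, not the paper's actual data or settings.

```python
# Weirdness Index: relative term frequency in the domain corpus divided by
# relative frequency in a general-language corpus; high values flag
# domain-specific candidate entities. Toy corpora stand in for the ESA
# reports, books, and Wikipedia pages.
from collections import Counter

def weirdness_index(domain_tokens, general_tokens):
    dom, gen = Counter(domain_tokens), Counter(general_tokens)
    n_dom, n_gen = len(domain_tokens), len(general_tokens)
    return {
        t: (c / n_dom) / ((gen[t] + 1) / (n_gen + 1))  # add-one smoothing for unseen terms
        for t, c in dom.items()
    }

domain = "the spacecraft payload and the orbit of the spacecraft".split()
general = "the cat sat on the mat of the house and the dog".split()

lexicon = {t for t, wi in weirdness_index(domain, general).items() if wi > 1.2}
print(sorted(lexicon))  # -> ['orbit', 'payload', 'spacecraft'] on this toy input
```

    The frequency and TF-IDF lexica can be built in the same shape; the retained terms would then be passed to a word-embedding model such as word2vec so that concepts with similar contexts can be matched by cosine similarity.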

    A Maximum-Entropy approach for accurate document annotation in the biomedical domain

    The growing volume of scientific literature on the Web and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of search and the quality of its results. Previous studies have shown that the use of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and takes one step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).
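    As an illustration of the classification view of annotation, the sketch below uses multinomial logistic regression, the standard realisation of a Maximum Entropy classifier, in a one-vs-rest setup over MeSH headings. The documents, headings, and hyperparameters are invented for the example and are not the paper's actual setup.

```python
# Maximum-Entropy (logistic regression) annotation of abstracts with MeSH
# headings, framed as multi-label classification: one binary MaxEnt model
# per candidate heading.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "insulin regulates glucose metabolism in diabetic patients",
    "tumor suppressor genes and cancer cell proliferation",
    "glucose tolerance tests in type 2 diabetes",
]
labels = [["Insulin", "Diabetes Mellitus"], ["Neoplasms"], ["Diabetes Mellitus"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # one indicator column per MeSH heading

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(docs, y)

pred = model.predict(["glucose levels in diabetic cancer patients"])
print(mlb.inverse_transform(pred))  # predicted MeSH annotations for the new abstract
```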

    A New Approach to Information Extraction in User-Centric E-Recruitment Systems

    In modern society, people rely heavily on information available online through various channels, such as websites, social media, and web portals. Examples include searching for product prices, news, weather, and jobs. This paper focuses on information extraction in e-recruitment, or job searching, which is increasingly used by a large population of users across the world. Given the enormous volume of information related to job descriptions and users' profiles, it is difficult to appropriately match a user's profile with a job description, and vice versa. Existing information extraction techniques are unable to extract contextual entities, so they fall short of extracting domain-specific information entities and consequently degrade the matching of user profiles with job descriptions. The work presented in this paper extracts entities from job descriptions using a domain-specific dictionary. The extracted information entities are enriched with knowledge using Linked Open Data. Furthermore, job context information is expanded using a job description domain ontology based on the contextual and knowledge information. The proposed approach appropriately matches users' profiles/queries and job descriptions; it is tested in various experiments on data from real-life job portals. The results show that the proposed approach enriches data extracted from job descriptions and can help users find more relevant jobs.
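    A minimal sketch of the two extraction steps named above: dictionary-based entity extraction followed by Linked Open Data enrichment. The skill dictionary is a toy stand-in, and the DBpedia Lookup endpoint and its query parameters are one public LOD service chosen for illustration; the paper's own enrichment source may differ.

```python
# Dictionary-based entity extraction from a job description, then LOD
# enrichment of each extracted entity via the public DBpedia Lookup service.
import json
import urllib.parse
import urllib.request

SKILL_DICTIONARY = {"java", "python", "sql", "project management"}  # toy domain dictionary

def extract_entities(job_description):
    """Return dictionary terms that occur in the job description (exact match)."""
    text = job_description.lower()
    return {term for term in SKILL_DICTIONARY if term in text}

def enrich_with_lod(entity):
    """Query DBpedia Lookup for candidate LOD resources describing the entity."""
    url = ("https://lookup.dbpedia.org/api/search?maxResults=1&query="
           + urllib.parse.quote(entity))
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

job = "We need a Python developer with SQL skills and project management experience."
for entity in sorted(extract_entities(job)):
    payload = enrich_with_lod(entity)
    print(entity, "->", json.dumps(payload)[:100], "...")  # inspect the raw LOD response
```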

    Context-based multimedia semantics modelling and representation

    The evolution of the World Wide Web, increased processing power, and greater network bandwidth have contributed to the proliferation of digital multimedia data. Since multimedia data has become a critical resource in many organisations, there is an increasing need for efficient access to it, in order to share and extract knowledge and ultimately use that knowledge to inform business decisions. Existing methods for multimedia semantic understanding are limited to computable low-level features, which raises the question of how to identify and represent the high-level semantic knowledge in multimedia resources. To bridge the semantic gap between multimedia low-level features and high-level human perception, this thesis seeks to identify the possible contextual dimensions in multimedia resources that can help in semantic understanding and organisation. It investigates the use of contextual knowledge to organise and represent the semantics of multimedia data, aimed at efficient and effective content-based semantic retrieval. A mixed-methods research approach incorporating both Design Science Research and Formal Methods was adopted for investigation and evaluation. A critical review of current approaches for multimedia semantic retrieval was undertaken and various shortcomings were identified. The objectives for a solution were defined, which led to the design, development, and formalisation of a context-based model for multimedia semantic understanding and organisation. The model relies on the identification of different contextual dimensions in multimedia resources to aggregate meaning and facilitate semantic representation, knowledge sharing, and reuse. A prototype system for multimedia annotation, CONMAN, was built to demonstrate aspects of the model and validate the research hypothesis, H₁. Towards providing richer and clearer semantic representation of multimedia content, the original contributions of this thesis to Information Science include: (a) a novel framework and formalised model for organising and representing the semantics of heterogeneous visual data; and (b) a novel S-Space model aimed at semantic organisation and discovery of visual information, which forms the foundations for automatic video semantic understanding.

    A framework for active software engineering ontology

    The passive structure of ontologies makes it ineffective to access and manage the knowledge captured in them. This research has developed a framework for an active Software Engineering Ontology based on a multi-agent system. It helps software development teams effectively access, manage, and share software engineering knowledge, as well as project information, enabling effective and efficient communication and coordination among teams. The framework has been evaluated through a prototype system in proof-of-concept experiments.

    A bioinformatics platform for executing Federated SPARQL queries over ontological databases and detecting similar data by determining their semantic relatedness

    The importance of bioinformatics, as an interdisciplinary field, rests on the large volume of biological data that can be used and processed with current information technology. What is of vital importance in bioinformatics today is the availability of data relevant to research, as well as the knowledge that such data already exist. An important prerequisite is that the necessary data are publicly available and integrated, and that mechanisms for searching them have been developed. To solve these problems, the bioinformatics community uses Semantic Web technologies. Many semantic repositories and software solutions have been developed in this respect, which have significantly supported research activities in bioinformatics. However, these approaches often face problems because many databases were developed in isolation, without respecting the basic standards of the bioinformatics community. These heterogeneous databases, which link a number of highly specialised and independent resources, often use different conventions, vocabularies, and formats for representing data. Current software solutions therefore face various challenges in searching for and discovering relevant data. In addition, many databases overlap, covering or concealing similar data and thus forming semi-homogeneous or homogeneous data sources. In such cases, the semantic correlation between such databases is often unclear, and appropriate data-analysis methods must be applied to identify similar data. This dissertation is the result of research aimed at overcoming the shortcomings of existing solutions. It presents a contribution to the development of a bioinformatics platform, embodied in a number of original software approaches that form the basis of its key functionalities: executing Federated SPARQL queries over initial (and user-selected) databases to discover data relevant to bioinformatics research, and detecting similar data by determining their semantic relatedness. Federated SPARQL queries are executed over databases that use the Resource Description Framework (RDF) as their data model. Query results can subsequently be filtered, which improves their relevance; filtering involves selecting specific properties (predicates) during a dynamic projection of the RDF database structure and executing dynamically generated star-shaped SPARQL queries. The algorithm developed for detecting similar data presents an original approach and is applied to instances of ontological databases. It uses the principles of ontology alignment, text mining, the vector space model for the mathematical representation of data, and the cosine similarity measure for the numerical determination of data similarity. The Platform is the result of long-term research within CPCTAS (Centre for PreClinical Testing of Active Substances) and the Laboratory for Cellular and Molecular Biology, part of the Institute of Biology and Ecology at the Faculty of Science, University of Kragujevac. The Laboratory's activity covers one of the important subfields of bioinformatics: preclinical testing of bioactive substances (potential cancer drugs). The primary goal of the Platform is to make the Laboratory's research more productive and efficient. The Platform was validated on test and real bioinformatics data sources, showing high resource utilisation. Its efficient methods open the way for new research in bioinformatics, as well as in any other field that involves ontological data modelling.
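    For concreteness, the sketch below issues a federated SPARQL query (a SPARQL 1.1 SERVICE clause) through the standard SPARQL HTTP protocol. The endpoints and vocabulary terms are illustrative public examples, not the dissertation's initial databases, and real endpoints may restrict cross-site federation.

```python
# A federated SPARQL query: the host endpoint evaluates the outer query and
# delegates the SERVICE block to a remote endpoint, as in SPARQL 1.1 federation.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # host endpoint that evaluates the query

QUERY = """
SELECT ?protein WHERE {
  SERVICE <https://sparql.uniprot.org/sparql> {  # remote half of the federated query
    ?protein a <http://purl.uniprot.org/core/Protein> .
  }
}
LIMIT 5
"""

params = urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request(
    ENDPOINT + "?" + params,
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    results = json.load(resp)

for binding in results["results"]["bindings"]:  # W3C SPARQL JSON results layout
    print(binding["protein"]["value"])
```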

    Data provisioning in simulation workflows

    Computer-based simulations are becoming more and more important, e.g., to imitate real-world experiments such as crash tests, which would otherwise be too expensive or not feasible at all. Simulation workflows may be used to control the interaction with simulation tools performing the necessary numerical calculations. The input data needed by these tools often come from diverse data sources that manage their data in a multiplicity of proprietary formats. Hence, simulation workflows additionally have to carry out many complex data provisioning tasks, which filter and transform heterogeneous input data so that the underlying simulation tools can properly ingest them. Furthermore, some simulations use different tools that need to exchange data with each other. Here, even more complex data transformations are needed to cope with the differences in data formats and data granularity expected by the tools involved. Nowadays, scientists conducting simulations typically have to design their simulation workflows on their own, implementing many low-level data transformations that realise the data provisioning for, and the data exchange between, simulation tools. In doing so, they waste time on workflow design, which keeps them from concentrating on their core issue, i.e., the simulation itself. This thesis introduces several novel concepts and methods that significantly alleviate the design of the complex data provisioning in simulation workflows. Firstly, it addresses the issue that most existing workflow systems offer multiple and diverse data provisioning techniques, so that scientists are frequently overwhelmed with selecting techniques appropriate for their workflows. This thesis discusses how to conquer the multiplicity and diversity of available techniques through their systematic classification. The resulting classes of techniques are then compared with each other with respect to relevant functional and non-functional requirements for data provisioning in simulation workflows. The major outcome of this classification and comparison is a set of guidelines that assist scientists in choosing proper data provisioning techniques. Another problem with existing workflow systems is that they often do not support all kinds of data resources or data management operations required by concrete computer-based simulations. This thesis therefore proposes extensions of conventional workflow languages that offer a generic solution to data provisioning in arbitrary simulation workflows. These extensions allow for specifying any data management operation that may be described via the query or command languages of the involved data resources, e.g., arbitrary SQL statements or shell commands. The proposed extensions of workflow languages still do not remove the burden from scientists of specifying many complex data management operations in low-level query and command languages. Hence, this thesis introduces a novel pattern-based approach that further enhances the abstraction support for simulation workflow design. Instead of specifying many workflow tasks, scientists only need to select a small number of abstract patterns to describe the high-level simulation process they have in mind. Moreover, scientists are familiar with the parameters to be specified for the patterns, because these parameters correspond to terms or concepts related to their domain-specific simulation methodology. A rule-based transformation approach offers flexible means to finally map high-level patterns onto executable simulation workflows. Another major contribution is a pattern hierarchy arranging different kinds of patterns according to clearly distinguished abstraction levels. This facilitates a holistic separation of concerns and provides a systematic framework for incorporating different kinds of persons and their various skills into workflow design, e.g., not only scientists, but also data engineers. Altogether, the pattern-based approach conquers the data complexity associated with simulation workflows, which allows scientists to concentrate on their core issue again, namely the simulation itself. The last contribution is a complementary optimisation method that increases the performance of local data processing in simulation workflows. This method introduces various techniques that partition the relevant local data processing tasks between the components of a workflow system in a smart way: such tasks are assigned either to the workflow execution engine or to a tightly integrated local database system. Corresponding experiments revealed that, even for a moderate data size of about 0.5 MB, this method can reduce workflow duration by nearly a factor of 9.
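    The following sketch illustrates the pattern-based idea in miniature: one abstract pattern, parameterised with domain-level terms, is rewritten by a rule into low-level data provisioning tasks. The pattern name, the rule, and the task strings are invented for illustration and do not reproduce the thesis's actual pattern hierarchy.

```python
# Rule-based rewriting of a high-level pattern into concrete workflow tasks:
# the scientist supplies domain-level parameters; the rules supply the
# low-level data provisioning steps.
from dataclasses import dataclass

@dataclass
class Pattern:
    """A high-level pattern selected by the scientist, parameterised with
    domain-level terms rather than low-level data management operations."""
    name: str
    params: dict

# One rewrite rule per pattern kind; a real rule hierarchy would rewrite
# patterns stepwise across abstraction levels until all tasks are executable.
RULES = {
    "SimulationDataTransfer": lambda p: [
        f"EXTRACT from {p.params['source']} (format={p.params['source_format']})",
        f"TRANSFORM {p.params['source_format']} -> {p.params['target_format']}",
        f"LOAD into {p.params['target']}",
    ],
}

def transform(pattern):
    """Map one abstract pattern onto a list of concrete workflow tasks."""
    return RULES[pattern.name](pattern)

pattern = Pattern("SimulationDataTransfer", {
    "source": "measurement-db", "source_format": "SQL rows",
    "target": "fem-solver", "target_format": "mesh input files",
})
for task in transform(pattern):
    print(task)
```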