
    SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering

    Version information plays an important role in spreadsheet understanding, maintenance, and quality improvement. However, end users rarely use version control tools to document spreadsheet version information. As a result, version information is missing, and different versions of a spreadsheet coexist as individual, similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of existing clustering approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, a corpus extracted from the Enron Corporation's email archive, in which the different versions of a spreadsheet are clustered into an evolution group. We observed that the versioned spreadsheets in each evolution group exhibit certain common features (e.g., similar table headers and worksheet names). Based on this observation, we propose an automatic clustering algorithm, SpreadCluster. SpreadCluster learns feature criteria from the versioned spreadsheets in VEnron and then automatically clusters spreadsheets with similar features into the same evolution group. We applied SpreadCluster to all spreadsheets in the Enron corpus. The evaluation shows that SpreadCluster clusters spreadsheets with higher precision and recall than the filename-based approach used by VEnron. Based on SpreadCluster's clustering result, we further created a new versioned-spreadsheet corpus, VEnron2, which is much larger than VEnron. We also applied SpreadCluster to two other spreadsheet corpora, FUSE and EUSES. The results show that SpreadCluster can cluster the versioned spreadsheets in these corpora with high precision. Comment: 12 pages, MSR 201
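    As a rough illustration of the feature-similarity idea (a minimal sketch, not the authors' implementation; the greedy grouping and the 0.6 threshold are invented stand-ins for the criteria SpreadCluster learns from VEnron), spreadsheets can be grouped into evolution groups by the Jaccard overlap of their table headers and worksheet names:

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_spreadsheets(sheets, threshold=0.6):
    """Greedily group spreadsheets whose feature sets (table headers
    plus worksheet names) overlap above a threshold. `sheets` maps a
    filename to its feature set; `threshold` stands in for the
    criteria learned from versioned examples."""
    groups = []  # each group is a list of filenames
    for name, features in sheets.items():
        for group in groups:
            rep = sheets[group[0]]  # compare against the group's first member
            if jaccard(features, rep) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

sheets = {
    "budget_v1.xls": {"Quarter", "Revenue", "Cost", "Sheet1"},
    "budget_final.xls": {"Quarter", "Revenue", "Cost", "Margin", "Sheet1"},
    "roster.xls": {"Name", "Shift", "Sheet1"},
}
print(cluster_spreadsheets(sheets))
# [['budget_v1.xls', 'budget_final.xls'], ['roster.xls']]
```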

    Enhanced SPARQL-based design rationale retrieval

    Design rationale (DR) is an important category of design knowledge, and its effective reuse depends on successful retrieval. In this paper, an ontology-based DR retrieval approach is presented that allows users to search by entering ordinary queries, such as questions in natural language. First, an ontology-based semantic model of DR is developed, based on an extended issue-based information system (IBIS) DR representation, in order to exploit the semantics embedded in DR, and a database of ontology-based DR is constructed that supports SPARQL queries. Second, two SPARQL query generation methods are proposed: the first automatically generates initial SPARQL queries from natural-language queries using template matching, and the second automatically generates initial SPARQL queries from DR record-based queries. In addition, keyword extension and optimization are performed to enhance the SPARQL-based retrieval. Third, a DR retrieval prototype system is implemented. The experimental results show the advantages of the proposed approach.
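    As a rough sketch of the template-matching step (not the paper's system; the ontology namespace, predicates, regex patterns, and `dr_records.ttl` file below are invented for the example), a natural-language question can be matched against patterns that fill slots in a prepared SPARQL skeleton:

```python
import re
from rdflib import Graph

# Hypothetical templates: a regex over the question paired with a
# SPARQL skeleton whose slot is filled from the regex capture group.
TEMPLATES = [
    (re.compile(r"what issues does (?P<artifact>\w+) address", re.I),
     """
     PREFIX dr: <http://example.org/dr#>
     SELECT ?issue WHERE {{
         ?issue a dr:Issue .
         ?issue dr:addressedBy dr:{artifact} .
     }}
     """),
]

def nl_to_sparql(question):
    """Return the first template whose pattern matches the question,
    with the captured slot substituted into the SPARQL skeleton."""
    for pattern, skeleton in TEMPLATES:
        m = pattern.search(question)
        if m:
            return skeleton.format(**m.groupdict())
    raise ValueError("no template matches: " + question)

g = Graph()
g.parse("dr_records.ttl")  # assumed ontology-based DR database
for row in g.query(nl_to_sparql("What issues does GearboxDesign address?")):
    print(row.issue)
```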

    A Process Modelling Framework Based on Point Interval Temporal Logic with an Application to Modelling Patient Flows

    This thesis applies a temporal theory to describe and model the patient journey in a hospital accident and emergency (A&E) department. The aim is to introduce a generic yet dynamic method applicable to any setting, including healthcare, since constructing a consistent process model can be instrumental in streamlining healthcare processes. Process modelling techniques currently used in healthcare, such as flowcharts, unified modelling language activity diagrams (UML AD), and business process modelling notation (BPMN), are intuitive but imprecise: they cannot fully capture the types of activities involved or the full extent of the temporal constraints to a degree where one could reason about the flows. Formal approaches such as Petri nets have also been reviewed to investigate their applicability to modelling healthcare processes. Moreover, current modelling standards offer no formal mechanism for scheduling patient flows, so healthcare relies on the critical path method (CPM) and the program evaluation and review technique (PERT), which also have limitations, e.g., the finish-start barrier. It is imperative to specify temporal constraints between the starts and/or ends of processes, e.g., that the beginning of a process A precedes the start (or end) of a process B, yet these approaches provide no mechanism for handling such temporal situations. A formal representation would support effective knowledge representation and quality enhancement concerning a process; it would also help uncover the complexities of a system and support modelling it consistently, which is not possible with the existing techniques. This thesis addresses these issues by proposing a framework that provides a knowledge base for modelling patient flows accurately, based on point interval temporal logic (PITL), which treats points and intervals as primitives. These objects constitute the formal description of a system: with the aid of the temporal theory's inference mechanism, the exhaustive temporal constraints derived from the components of the proposed axiomatic system serve as the knowledge base. The framework adopts a model-theoretic approach in which a theory is developed and treated as a model, while a corresponding instance is treated as its application. This approach helps identify the core components of the system and their precise operation in a real-life domain suited to the process-modelling issues specified in this thesis. Accordingly, I have evaluated the modelling standards' most-used terminologies and constructs to identify their key components, which also supports a generalisation of the standards' critical terms based on their ontology. The proposed set of generalised terms serves as an enumeration of the theory and subsumes the core modelling elements of the process modelling standards. The resulting catalogue presents a knowledge base for the business and healthcare domains, and its components are formally defined (semantics). Furthermore, resolution theorem proving is used to show the structural features of the theory (model) and to establish that it is sound and complete. Once soundness and completeness are established, the theory is instantiated by mapping its core components to their corresponding instances.
    Additionally, a formal graphical tool termed the point graph (PG) is used to visualise instances of the proposed axiomatic system. The PG facilitates modelling and scheduling patient flows and enables analysis of existing models for inaccuracies and inconsistencies, supported by a reasoning mechanism based on PITL. A transformation is then developed that maps the standards' core modelling components into an extended PG (PG*) based on the semantics given by the axiomatic system. A real-life case, the trauma patient pathway of the King's College Hospital A&E department, is used to validate the framework. It is divided into three patient flows depicting the journey of a patient with significant trauma who arrives at A&E, undergoes a procedure, and is subsequently discharged. The hospital staff relied on UML AD and BPMN to model these patient flows; an evaluation of their representations shows where the modelling standards fall short for patient flows. Finally, the patient flows are modelled using the developed approach, which supports enhanced reasoning and scheduling.
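    A minimal sketch of the point-interval idea (invented for illustration; the thesis's axiomatic system is far richer): each process occurrence is reduced to its two primitive points, and temporal constraints such as "A's start precedes B's start" become comparisons over those points, without imposing a finish-start barrier:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A process occurrence reduced to its two primitive points."""
    name: str
    start: float
    end: float

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError(f"{self.name}: start must not follow end")

def precedes(a: Interval, b: Interval) -> bool:
    """A's end point strictly precedes B's start point (Allen's 'before')."""
    return a.end < b.start

def start_precedes_start(a: Interval, b: Interval) -> bool:
    """Constraint without a finish-start barrier: A may still be
    running when B begins, as long as A began first."""
    return a.start < b.start

triage = Interval("triage", 0, 15)
xray = Interval("x-ray", 10, 40)
print(precedes(triage, xray))              # False: the intervals overlap
print(start_precedes_start(triage, xray))  # True: triage began first
```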

    A Model-driven Visual Analytic Framework for Local Pattern Analysis

    The ultimate goal of any visual analytic task is to make sense of the data and gain insights. Unfortunately, discovering useful information is becoming more challenging as data scale grows: human cognitive capabilities remain constant, whereas the scale and complexity of data do not. Meanwhile, visual analytics relies heavily on a human analyst in the loop, which strains the traditional human-driven workflow. It is almost impossible to show the user every detail while diving into a local region of the data to explain phenomena hidden there. For example, when exploring data subsets, it is important to determine which partitions of the data carry the most information; likewise, selecting the relevant subset of features is vital before any further analysis. Furthermore, fitting models locally to these subsets can yield valuable findings but can also introduce bias. In this work, a model-driven visual analytic framework is proposed to help identify interesting local patterns from these three angles. This dissertation tackles the corresponding subproblems under three topics: model-driven data exploration, model-driven feature analysis, and local model diagnosis. First, model-driven data exploration focuses on modeling subsets of data to identify the co-movement of time-series data within certain time partitions, an important application in domains such as medical science, finance, business, and engineering. Second, model-driven feature analysis discovers interesting feature subsets while analyzing local feature similarities; within a financial risk dataset collected by a domain expert, we found that feature correlations differ markedly across data partitions (i.e., small versus large companies). Third, local model diagnosis provides a tool to identify interesting local regression models in local regions of the data space, allowing analysts to model the whole data space with a set of local models while knowing their strengths and weaknesses. Together, the three tools provide an integrated solution for identifying interesting patterns within local subsets of data.
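    As a sketch of the local-model idea (a generic stand-in, not the dissertation's method; the two-partition synthetic data is invented for the example), one can fit a regression per data partition and compare each local fit against a single global model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data whose slope differs between two partitions,
# standing in for, e.g., small vs. large companies.
x = rng.uniform(0, 10, size=(200, 1))
part = (x[:, 0] > 5).astype(int)  # partition label
y = np.where(part == 0, 2 * x[:, 0], -1 * x[:, 0] + 15)
y = y + rng.normal(0, 1, size=200)

global_model = LinearRegression().fit(x, y)
print(f"global R^2: {global_model.score(x, y):.2f}")

for p in (0, 1):
    mask = part == p
    local = LinearRegression().fit(x[mask], y[mask])
    # A local R^2 far above the global one flags a region where a
    # dedicated local model explains the data much better.
    print(f"partition {p}: local R^2 {local.score(x[mask], y[mask]):.2f}")
```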

    Searching and mining in enriched geo-spatial data

    The emergence of new data collection mechanisms in geo-spatial applications, paired with a heightened tendency of users to volunteer information, provides an ever-increasing flow of data of high volume and complex nature, often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Data carrying additional information from multiple sources, such as probability distributions, textual or numerical attributes, social context, or multimedia content, can be called multi-enriched. Searching and mining this abundance of information poses many challenges if the data's full potential is to be realised. This thesis addresses several major issues in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods make significant contributions and have appeared in, or been accepted by, renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks, where traditional methods optimise results based on absolute and often singular metrics, i.e., finding the shortest paths by distance or the best trade-off between distance and travel time. Integrating additional aspects, such as qualitative or social data, by enriching the data model with knowledge derived from the sources mentioned above allows queries that fit a broader range of needs or preferences. This thesis presents two ways of incorporating multi-enriched data into road networks. In the first, a range of qualitative data sources is evaluated to derive knowledge about user preferences, which is then matched with locations represented in a road network and integrated into its components; several methods are presented for highly customisable path queries that incorporate a wide spectrum of data. In the second, a framework is described for resource distribution with reappearance in road networks, serving one or more clients and producing paths that provide maximum gain based on a probabilistic evaluation of the available resources; finding parking spots is one application. Social media trends are an emerging research area, giving insight into user sentiment and important topics. Such a trend consists of a burst of messages on a certain topic within a time frame, deviating significantly from the topic's average frequency. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes and predict a trend's future dissemination. Processing and querying uncertain data is particularly demanding, given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available, and queries do not easily scale to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model, eliminating the uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that a deterministic representation cannot capture.
This thesis presents techniques for clustering and query processing over uncertain data that take the additional information from uncertainty models into account while preserving scalability through a sampling-based approach; previous approaches could provide only one of the two. The given solutions enable the application of various existing clustering techniques and query types within a framework that manages the uncertainty.
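    A rough sketch of the sampling-based idea (a generic possible-worlds treatment, not the thesis's algorithms; the isotropic Gaussian uncertainty model and the parameters are assumed for the example): draw samples from each uncertain object's distribution, cluster each sampled world, and aggregate how often two objects land in the same cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Each uncertain object: a mean location and an isotropic std deviation.
means = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.3, 4.8]])
stds = np.array([0.3, 0.3, 0.3, 0.3])
n, n_worlds, k = len(means), 100, 2

# co[i, j] counts how often objects i and j share a cluster
# across sampled possible worlds.
co = np.zeros((n, n))
for _ in range(n_worlds):
    world = means + rng.normal(0, 1, size=means.shape) * stds[:, None]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(world)
    co += labels[:, None] == labels[None, :]

print(co / n_worlds)  # probability that each pair is co-clustered
```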

    RROVT: A Proposed Visualization Tool for Semantic Web Technologies

    Visualization is the graphical or semi-graphical representation of information that aids human comprehension of, and reasoning about, that information. Visualization tools are critically important for creating, querying, visualizing, and validating Semantic Web data. The Semantic Web, an infrastructure that enhances the visibility of knowledge on the web, is often referred to more specifically through the formats and technologies that enable it, including RDF, RDFS, and OWL. However, the lack of robust and efficient tools to visualize, analyze, and represent these technologies within time and space constraints remains a major challenge. In this study, Semantic Web technologies and their visualization tools are reviewed, and RROVT (RDF, RDFS, OWL Visualization Tool) is developed and proposed: a tool to evaluate and represent formal descriptions of the concepts, terms, and relationships of data models within a given knowledge domain, while managing the time and space complexity of publishing Semantic Web content efficiently. The performance of RROVT was investigated on samples of Semantic Web documents using RDF, RDFS, and OWL. The proposed tool showed a marked improvement over existing tools, as it aids comprehension of the syntax and semantics of the Semantic Web technologies investigated in this study. Keywords: Semantic Web, Visualization Tool, Semantic Web Technologies
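    A minimal sketch of graph-style RDF visualization (not RROVT itself; `ontology.ttl` is a placeholder input file): parse the document with rdflib, turn each triple into a labelled edge, and render the result as a node-link diagram:

```python
import matplotlib.pyplot as plt
import networkx as nx
from rdflib import Graph

g = Graph()
g.parse("ontology.ttl")  # assumed RDF/RDFS/OWL document

# Build a directed graph: one edge per triple, labelled by predicate.
nxg = nx.DiGraph()
for s, p, o in g:
    nxg.add_edge(str(s), str(o), label=g.namespace_manager.normalizeUri(p))

pos = nx.spring_layout(nxg, seed=42)
nx.draw(nxg, pos, with_labels=True, node_size=600, font_size=7)
nx.draw_networkx_edge_labels(
    nxg, pos, edge_labels=nx.get_edge_attributes(nxg, "label"), font_size=6)
plt.show()
```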

    Towards a fuzzy domain ontology extraction method for adaptive e-learning

    With the widespread applications of electronic learning (e-Learning) technologies to education at all levels, increasing number of online educational resources and messages are generated from the corresponding e-Learning environments. Nevertheless, it is quite difficult, if not totally impossible, for instructors to read through and analyze the online messages to predict the progress of their students on the fly. The main contribution of this paper is the illustration of a novel concept map generation mechanism which is underpinned by a fuzzy domain ontology extraction algorithm. The proposed mechanism can automatically construct concept maps based on the messages posted to online discussion forums. By browsing the concept maps, instructors can quickly identify the progress of their students and adjust the pedagogical sequence on the fly. Our initial experimental results reveal that the accuracy and the quality of the automatically generated concept maps are promising. Our research work opens the door to the development and application of intelligent software tools to enhance e-Learning
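    A toy sketch of one common way to extract fuzzy concept relations from forum text (a generic co-occurrence approach, not necessarily the paper's algorithm; the posts and normalization are invented for the example): the membership degree of a relation between two terms can be taken as their co-occurrence count normalized by the rarer term's frequency:

```python
from collections import Counter
from itertools import combinations

posts = [
    "recursion base case stack",
    "recursion stack overflow",
    "sorting complexity big-o",
    "recursion complexity analysis",
]

term_freq = Counter()
pair_freq = Counter()
for post in posts:
    terms = set(post.split())
    term_freq.update(terms)
    pair_freq.update(frozenset(p) for p in combinations(sorted(terms), 2))

def relatedness(a, b):
    """Fuzzy membership of the relation (a, b): co-occurrence count
    normalized by the rarer term's frequency, giving a value in [0, 1]."""
    pair = frozenset((a, b))
    return pair_freq[pair] / min(term_freq[a], term_freq[b])

print(relatedness("recursion", "stack"))       # 1.0: always co-occur
print(relatedness("recursion", "complexity"))  # 0.5
```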
