
    Regularization approaches in clinical biostatistics: a review of methods and their applications

    A range of regularization approaches have been proposed in the data sciences to overcome overfitting, to exploit sparsity, or to improve prediction. Using a broad definition of regularization, namely controlling model complexity by adding information in order to solve ill-posed problems or to prevent overfitting, we review a range of approaches within this framework, including penalization, early stopping, ensembling, and model averaging. Aspects of their practical implementation are discussed, including available R packages, and examples are provided. To assess the extent to which these approaches are used in medicine, we conducted a review of three general medical journals. It revealed that regularization approaches are rarely applied in practical clinical applications, with the exception of random effects models. Hence, we suggest a more frequent use of regularization approaches in medical research. In situations where other approaches also work well, the only downside of the regularization approaches is the increased complexity of conducting the analyses, which can pose challenges in terms of computational resources and expertise on the side of the data analyst. In our view, both can and should be overcome by investments in appropriate computing facilities and educational resources.
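    As a minimal, hedged illustration of penalization (one of the approaches reviewed above), the sketch below fits a cross-validated L2-penalized logistic regression. Python's scikit-learn is used only as a stand-in for the R packages discussed in the review, and the data are synthetic.

```python
# Minimal sketch of penalization as a regularization approach:
# an L2-penalized logistic regression with the penalty strength
# chosen by cross-validation. scikit-learn stands in for the R
# packages covered in the review; the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The penalty shrinks coefficients toward zero, i.e. it adds
# information that controls model complexity and curbs overfitting.
model = LogisticRegressionCV(Cs=10, penalty="l2", cv=5).fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```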

    Interpretable Machine Learning Model for Clinical Decision Making

    Despite machine learning models being increasingly used in medical decision-making and meeting classification predictive accuracy standards, they remain untrusted black boxes due to decision-makers' lack of insight into their complex logic. Therefore, it is necessary to develop interpretable machine learning models that will engender trust in the knowledge they generate and contribute to clinical decision-makers' intention to adopt them in the field. The goal of this dissertation was to systematically investigate the applicability of interpretable model-agnostic methods to explain predictions of black-box machine learning models for medical decision-making. As proof of concept, this study addressed the problem of predicting the risk of emergency readmission within 30 days of discharge for heart failure patients. Using a benchmark data set, supervised classification models of differing complexity were trained to perform the prediction task. More specifically, Logistic Regression (LR), Random Forests (RF), Decision Trees (DT), and Gradient Boosting Machines (GBM) models were constructed using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD). The precision, recall, and area under the ROC curve for each model were used to measure predictive accuracy. Local Interpretable Model-Agnostic Explanations (LIME) was used to generate explanations from the underlying trained models. LIME explanations were empirically evaluated using explanation stability and local fit (R²). The results demonstrated that the local explanations generated by LIME produced better estimates for the Decision Tree (DT) classifiers.
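    A hedged sketch of the LIME workflow described above, using the open-source lime package with a scikit-learn gradient boosting classifier; synthetic data stand in for the HCUP NRD, and all names are illustrative.

```python
# Sketch of generating a LIME explanation for a black-box classifier;
# synthetic data stand in for the HCUP NRD readmission records.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names,
    class_names=["no readmission", "readmission"], mode="classification")

# LIME perturbs the instance, fits a weighted local linear surrogate,
# and reports per-feature contributions for this single prediction.
exp = explainer.explain_instance(X[0], black_box.predict_proba, num_features=5)
print(exp.as_list())                 # (feature condition, weight) pairs
print("local fit R^2:", exp.score)   # the local-fit metric evaluated in the study
```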

    Diverse and proportional size-l object summaries for keyword search

    The abundance and ubiquity of graphs (e.g., online social networks such as Google+ and Facebook; bibliographic graphs such as DBLP) necessitates effective and efficient search over them. Given a set of keywords that can identify a Data Subject (DS), a recently proposed relational keyword search paradigm produces, as a query result, a set of Object Summaries (OSs). An OS is a tree structure rooted at the DS node (i.e., a tuple containing the keywords) with surrounding nodes that summarize all data held in the graph about the DS. OS snippets, denoted as size-l OSs, have also been investigated. Size-l OSs are partial OSs containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximizes the total importance score may result in uninformative size-l OSs, as very important nodes may be repeated in them, dominating other representative information. In view of this limitation, in this paper we investigate the effective and efficient generation of two novel types of OS snippets, namely diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. That is, apart from the importance of each node, we also consider its frequency in the OS and its repetitions in the snippets. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g. by asking DBLP authors (i.e. the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate the quality of the snippets that they produce.
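    The paper's exact algorithms are not reproduced here; the following is only an illustrative greedy heuristic for a diversity-aware size-l snippet, under the assumption that each candidate node carries a label and an importance score. A candidate's score is discounted by how often its label already appears in the partial snippet, so repeated very important nodes stop dominating.

```python
# Illustrative greedy heuristic (not the paper's algorithm) for a
# diversity-aware size-l snippet: discount a node's importance by the
# number of same-label nodes already selected.
from collections import Counter

def greedy_dsize_l(nodes, l):
    """nodes: list of (label, importance); returns l selected nodes."""
    selected, label_counts = [], Counter()
    candidates = list(nodes)
    while candidates and len(selected) < l:
        # Importance decays geometrically with the repetitions of the
        # same label already chosen, promoting diversity.
        best = max(candidates, key=lambda n: n[1] / (2 ** label_counts[n[0]]))
        selected.append(best)
        label_counts[best[0]] += 1
        candidates.remove(best)
    return selected

nodes = [("coauthor", 0.9), ("coauthor", 0.85), ("coauthor", 0.8),
         ("paper", 0.7), ("venue", 0.6)]
print(greedy_dsize_l(nodes, 3))  # mixes labels instead of three coauthors
```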

    Scalable Algorithms for the Analysis of Massive Networks

    Network analysis aims to unveil non-trivial insights from networked data by studying relationship patterns between the entities of a network. Among these insights, a popular one is to quantify the importance of an entity with respect to the others according to some criteria. Another one is to find the most suitable matching partner for each participant of a network, knowing the pairwise preferences of the participants to be matched with each other - known as Maximum Weighted Matching (MWM). Since the notion of importance is tied to the application under consideration, numerous centrality measures have been introduced. Many of these measures, however, were conceived in a time when computing power was very limited and networks were much smaller than today's, and thus scalability to large datasets was not considered. Today, massive networks with millions of edges are ubiquitous, and a complete exact computation of traditional centrality measures is often too time-consuming. This issue is amplified if the objective is to find the group of k vertices that is the most central as a group. Scalable algorithms to identify highly central (groups of) vertices on massive graphs are thus of pivotal importance for large-scale network analysis. In addition to their size, today's networks often evolve over time, which poses the challenge of efficiently updating results after a change occurs. Hence, efficient dynamic algorithms are essential for modern network analysis pipelines. In this work, we propose scalable algorithms for identifying important vertices in a network, and for efficiently updating them in evolving networks. In real-world graphs with hundreds of millions of edges, most of our algorithms require seconds to a few minutes to perform these tasks, a significant improvement over the state of the art. Further, we extend a state-of-the-art algorithm for MWM to dynamic graphs. Experiments show that our dynamic MWM algorithm handles updates in graphs with billions of edges in milliseconds.
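    As a small-scale illustration of the two problems (the thesis' own scalable algorithms are not reproduced here), the sketch below uses networkx to rank vertices by betweenness centrality and to compute a maximum weighted matching on toy graphs; on graphs with hundreds of millions of edges, such straightforward exact computation is exactly what becomes infeasible.

```python
# Toy illustration of centrality ranking and maximum weighted matching
# with networkx; the thesis develops algorithms that scale these tasks
# to graphs with hundreds of millions of edges.
import networkx as nx

G = nx.karate_club_graph()

# Exact betweenness centrality (Brandes' algorithm) runs in O(nm) time
# on unweighted graphs -- already impractical on massive networks,
# which motivates approximation and group-centrality algorithms.
bc = nx.betweenness_centrality(G)
top5 = sorted(bc, key=bc.get, reverse=True)[:5]
print("top-5 by betweenness:", top5)

# Maximum weighted matching on a small weighted graph.
H = nx.Graph()
H.add_weighted_edges_from([(0, 1, 3.0), (1, 2, 1.0), (2, 3, 4.0), (0, 3, 2.0)])
print("MWM:", nx.max_weight_matching(H))
```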

    Rental price estimation for apartment buildings in Greater Helsinki

    Determining the optimal rental price of an apartment typically requires a real estate agent to gauge the external and internal features of the apartment, as well as similar apartments in its vicinity. Hedonic pricing models that rely on regression are commonplace, but those that employ state-of-the-art machine learning methods are still not widespread. The purpose of this thesis is to investigate an optimal machine learning method for predicting property rental prices for apartments in the Greater Helsinki area. The project was carried out at the behest of a client in the real estate investing business. We review which external and inherent apartment features are the most suitable for making predictions, and engineer additional features that yield predictions with the least error within the Greater Helsinki area. Combining public demographic data from Tilastokeskus (Statistics Finland) with data from the online broker Oikotie Oy gives rise to a model that is comparable to contemporary commercial solutions offered in Finland. Using inverse distance weighting to interpolate a price for the coordinates of a new apartment also proved crucial for developing a performant model. After reviewing the models, the gradient boosting algorithm XGBoost was found to fare best for this regression task.
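    A hedged sketch of the two ingredients highlighted above: an inverse-distance-weighted (IDW) neighborhood-rent feature interpolated from nearby known rentals, fed into an XGBoost regressor. The data and feature set are synthetic stand-ins, not the thesis' actual pipeline.

```python
# Sketch of an IDW neighborhood-rent feature plus an XGBoost regressor.
# Synthetic data; not the thesis' actual features or pipeline.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
known_xy = rng.uniform(0, 10, size=(300, 2))   # apartment coordinates
known_size = rng.uniform(20, 120, 300)         # square meters
known_rent = 400 + 8 * known_size + 40 * known_xy[:, 0] + rng.normal(0, 30, 300)

def idw(query_xy, xy, values, power=2):
    """Inverse-distance-weighted interpolation of values at query_xy."""
    d = np.linalg.norm(xy - query_xy, axis=1)
    d = np.where(d == 0, 1e-9, d)              # guard exact coordinate matches
    w = 1.0 / d ** power
    return float(w @ values / w.sum())

# Neighborhood rent level for each apartment, excluding the apartment
# itself from the interpolation to avoid target leakage.
nbr_rent = np.array([idw(known_xy[i],
                         np.delete(known_xy, i, axis=0),
                         np.delete(known_rent, i))
                     for i in range(len(known_xy))])

X = np.column_stack([known_size, nbr_rent])
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X, known_rent)
print("predicted rents:", model.predict(X[:3]).round(0))
```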

    Real-Time Sensor Networks and Systems for the Industrial IoT

    The Industrial Internet of Things (Industrial IoT, IIoT) has emerged as the core construct behind the various cyber-physical systems constituting a principal dimension of the fourth Industrial Revolution. While initially born as the concept behind specific industrial applications of generic IoT technologies for the optimization of operational efficiency in automation and control, it quickly enabled the total convergence of Operational (OT) and Information Technologies (IT). The IIoT has now surpassed the traditional borders of automation and control functions in the process and manufacturing industry, shifting towards a wider domain of functions and industries, embraced under the dominant global initiatives and architectural frameworks of Industry 4.0 (or Industrie 4.0) in Germany, Industrial Internet in the US, Society 5.0 in Japan, and Made-in-China 2025 in China. As real-time embedded systems are quickly achieving ubiquity in everyday life and in industrial environments, and many processes already depend on real-time cyber-physical systems and embedded sensors, the integration of IoT with cognitive computing and real-time data exchange is essential for real-time analytics and for the realization of digital twins in smart environments and services under the various frameworks' provisions. In this context, real-time sensor networks and systems for the Industrial IoT encompass multiple technologies and raise significant design, optimization, integration, and exploitation challenges. The ten articles in this Special Issue describe advances in real-time sensor networks and systems that are significant enablers of the Industrial IoT paradigm. In this landscape, the domain of wireless networking technologies is, as expected, centrally positioned.

    Complex queries and complex data

    With the widespread availability of wearable computers, equipped with sensors such as GPS or cameras, and with the ubiquitous presence of micro-blogging platforms, social media sites, and digital marketplaces, data can be collected and shared on a massive scale. A necessary building block for taking advantage of this vast amount of information is the availability of efficient and effective similarity search algorithms, which find objects in a database that are similar to a query object. Due to the general applicability of similarity search across different data types and applications, the formalization of this concept and the development of strategies for evaluating similarity queries have evolved into an important field of research in the database community, the spatio-temporal database community, and others, such as information retrieval and computer vision. This thesis concentrates on a special instance of similarity queries, namely k-Nearest Neighbor (kNN) queries and their close relative, Reverse k-Nearest Neighbor (RkNN) queries. As a first contribution we provide an in-depth analysis of the RkNN join. While the problem of reverse nearest neighbor queries has received a vast amount of research interest, the problem of performing such queries in bulk has not seen an in-depth analysis so far. We first formalize the RkNN join, identifying its monochromatic and bichromatic versions and their self-join variants. After pinpointing the monochromatic RkNN join as an important and interesting instance, we develop solutions for this class, including a self-pruning and a mutual pruning algorithm. We then evaluate these algorithms extensively on a variety of synthetic and real datasets. From this starting point of similarity queries on certain data, we shift our focus to uncertain data, addressing nearest neighbor queries in uncertain spatio-temporal databases. Starting from the traditional definition of nearest neighbor queries and a data model for uncertain spatio-temporal data, we develop efficient query mechanisms that consider temporal dependencies during query evaluation. We define intuitive query semantics, aiming not only at returning the objects closest to the query but also their probability of being a nearest neighbor. After theoretically evaluating these query predicates, we develop efficient querying algorithms for them. Given the findings of this research on nearest neighbor queries, we extend the results to reverse nearest neighbor queries. Finally, we address the problem of querying large datasets containing set-based objects, namely image databases, where images are represented by (multi-)sets of vectors and additional metadata describing the position of features in the image. We aim at reducing the number of kNN queries performed during query processing and evaluate a modified pipeline that optimizes query accuracy with a small number of kNN queries. Additionally, as feature representations in object recognition are moving more and more from the real-valued domain to the binary domain, we evaluate efficient indexing techniques for binary feature vectors.
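    As a concrete baseline for the RkNN semantics discussed above (not the pruning algorithms developed in the thesis), note that q is a reverse k-nearest neighbor of p exactly when p counts q among its own k nearest neighbors. The brute-force sketch below checks this definition with a scipy kd-tree; all names are illustrative.

```python
# Brute-force baseline for monochromatic RkNN queries (not the thesis'
# pruning algorithms): q's reverse k-nearest neighbors are exactly the
# points that have q among their own k nearest neighbors.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(1000, 2))
tree = cKDTree(points)

def rknn(q_idx, k):
    """Indices of points having point q_idx among their k nearest neighbors."""
    result = []
    for i in range(len(points)):
        if i == q_idx:
            continue
        # Query k+1 neighbors because each point is its own nearest
        # neighbor at distance zero.
        _, nbrs = tree.query(points[i], k=k + 1)
        if q_idx in nbrs:
            result.append(i)
    return result

print("RkNN of point 0 with k=3:", rknn(0, 3))
```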