    SoK: Memorization in General-Purpose Large Language Models

    Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel in a wide range of tasks. A major part of this success is due to their huge training datasets and the unprecedented number of model parameters, which allow them to memorize large amounts of information contained in the training data. This memorization goes beyond mere language, and encompasses information only present in a few documents. This is often desirable since it is necessary for performing tasks such as question answering, and therefore an important part of learning, but also brings a whole array of issues, from privacy and security to copyright and beyond. LLMs can memorize short secrets in the training data, but can also memorize concepts like facts or writing styles that can be expressed in text in many different ways. We propose a taxonomy for memorization in LLMs that covers verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals. We describe the implications of each type of memorization - both positive and negative - for model performance, privacy, security and confidentiality, copyright, and auditing, and ways to detect and prevent memorization. We further highlight the challenges that arise from the predominant way of defining memorization with respect to model behavior instead of model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms. Throughout the paper, we describe potential risks and opportunities arising from memorization in LLMs that we hope will motivate new research directions.

    Análisis, caracterización y modelación 3D de fugas de agua en sistemas de abastecimiento de agua mediante imágenes de GPR

    Thesis by compendium. The efforts made by countries, together with world organizations such as the IWA (International Water Association), UN-Water and the WHO (World Health Organization), to mitigate environmental impact in the field of urban hydraulics are considered of vital importance. However, the scarcity of water resources in the world increases daily, driven by the constant increase in demand in the industrial, agricultural and urban sectors, which is in turn caused by population growth and climate change. Managers of water supply systems (WSSs) are challenged to meet the growing demand of the different sectors with the necessary quantity, quality and efficiency and, at the same time, to reduce waste and misuse of the resource. From this perspective, water leaks are the biggest problem faced by the managers of these utilities. Leaks in a network cause health, shortage, economic and environmental problems. Non-destructive inspection techniques should favor rapid identification of problems so that subsequent repair actions can be carried out on the network. This work uses GPR (ground penetrating radar) as a non-destructive inspection technique because it allows the subsoil to be explored without altering the medium, it is easy to apply, and it yields pseudo-images of the subsoil. One of the objectives of this document is to identify and extract features of a leak in a WSS from GPR images, with the ultimate aim of recreating leaks through 3D models. Laboratory tests were carried out under controlled conditions, emulating a plot of land in which a pipe with a small orifice simulating a water leak had been buried; after water was introduced into the system, GPR surveys were performed. Once the subsoil exploration was complete, and because the raw GPR images obtained are not easily interpretable by non-expert personnel, the images were subjected to data processing to ease their interpretation. This document presents two data processing methodologies that yield images from which both the system components and the leak and its extent can be identified. The methodologies applied are a multi-agent-based system (MABS) methodology and the variance filter, a methodology based on second-order statistical parameters. Subsequently, after these processing methodologies were applied to the images, the results were subjected to an analysis that facilitates the best choice while avoiding the expert's subjectivity. Under this concept, this document proposes the joint use of two multi-criteria decision-making (MCDM) methods. The Fuzzy Analytical Hierarchy Process (FAHP) is used first to weight the various evaluation criteria, in order to mitigate the uncertainty that characterizes experts' judgments, in conjunction with the ELECTRE III method, which produces the final ranking of alternatives in the most objective way possible. The results are satisfactory: they provide extensive knowledge of leaks and their interaction with the subsoil, and give guidelines for subsequently developing automation methodologies to locate, monitor and predict problems in WSSs. Part of this work has been developed with the support of the Fundación Carolina PhD and short-term scholarship program. Ocaña Levario, SJ. (2021). Análisis, caracterización y modelación 3D de fugas de agua en sistemas de abastecimiento de agua mediante imágenes de GPR [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/163677
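
The abstract above names a variance filter, "a methodology based on second-order statistical parameters", without giving details. As a rough illustration only, the sketch below shows a generic local-variance filter over a small grid of numbers; the window shape, the border handling, the function name `variance_filter`, and the toy `radargram` are all assumptions made for this example and are not taken from the thesis.

```python
from statistics import pvariance


def variance_filter(image, radius=1):
    """Return a map of local population variances.

    `image` is a list of equal-length rows of numbers (a hypothetical
    stand-in for a GPR radargram). Each output value is the variance of
    the (2*radius+1)-square window centered on that sample; windows are
    clipped at the image borders (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = pvariance(window)
    return out


# Toy data: a uniform background with one anomalous sample.
radargram = [
    [1, 1, 1, 1],
    [1, 1, 9, 1],
    [1, 1, 1, 1],
]
variance_map = variance_filter(radargram)
```

In this toy case the variance map is zero wherever the window sees only background and positive around the anomalous sample, which is the general reason a second-order statistic can make local disturbances stand out; how the thesis applies the idea to real GPR data is not specified in the abstract.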

    Synergistic Visualization And Quantitative Analysis Of Volumetric Medical Images

    The medical diagnosis process starts with an interview with the patient and continues with the physical exam. In practice, the medical professional may require additional screenings to reach a precise diagnosis. Medical imaging is one of the most frequently used non-invasive screening methods for gaining insight into the human body. Medical imaging is not only essential for accurate diagnosis; it can also enable early prevention. Medical data visualization refers to projecting the medical data into a human-understandable format on media such as 2D or head-mounted displays, without making any interpretation that may lead to clinical intervention. In contrast to medical visualization, quantification refers to extracting the information in the medical scan to enable clinicians to make fast and accurate decisions. Despite the extraordinary progress in both medical visualization and quantitative radiology, efforts to improve these two complementary fields are often performed independently, and their synergistic combination is under-studied. Existing image-based software platforms mostly fail to be used in routine clinics due to the lack of a unified strategy that guides clinicians both visually and quantitatively. Hence, there is an urgent need for a bridge connecting medical visualization and automatic quantification algorithms in the same software platform. In this thesis, we aim to fill this research gap by visualizing medical images interactively from anywhere, and performing a fast, accurate and fully automatic quantification of the medical imaging data. To this end, we propose several novel methods. 
Specifically, we solve the following sub-problems of the ultimate goal: (1) direct web-based out-of-core volume rendering, (2) robust, accurate, and efficient learning-based algorithms to segment highly pathological medical data, (3) automatic landmarking for aiding diagnosis and surgical planning, and (4) novel artificial intelligence algorithms to determine the sufficient and necessary data to derive large-scale problems.

    Advanced Fault Diagnosis and Health Monitoring Techniques for Complex Engineering Systems

    Over the last few decades, the field of fault diagnostics and structural health management has been experiencing rapid developments. The reliability, availability, and safety of engineering systems can be significantly improved by implementing multifaceted strategies of in situ diagnostics and prognostics. With the development of intelligent algorithms, smart sensors, and advanced data collection and modeling techniques, this challenging research area has been receiving ever-increasing attention in both fundamental research and engineering applications. This has been strongly supported by extensive applications ranging from aerospace, automotive, transport, manufacturing, and processing industries to defense and infrastructure industries.

    Explainable Predictive Maintenance

    Explainable Artificial Intelligence (XAI) fills the role of a critical interface fostering interactions between sophisticated intelligent systems and diverse individuals, including data scientists, domain experts, end-users, and more. It aids in deciphering the intricate internal mechanisms of "black box" Machine Learning (ML) models, rendering the reasons behind their decisions more understandable. However, current research in XAI primarily focuses on two aspects: ways to facilitate user trust, and ways to debug and refine the ML model. The majority of it falls short of recognising the diverse types of explanations needed in broader contexts, as different users and varied application areas necessitate solutions tailored to their specific needs. One such domain is Predictive Maintenance (PdM), an exploding area of research under the Industry 4.0 & 5.0 umbrella. This position paper highlights the gap between existing XAI methodologies and the specific requirements for explanations within industrial applications, particularly the Predictive Maintenance field. Despite explainability's crucial role, this subject remains a relatively under-explored area, making this paper a pioneering attempt to bring relevant challenges to the research community's attention. We provide an overview of predictive maintenance tasks and accentuate the need and varying purposes for corresponding explanations. We then list and describe XAI techniques commonly employed in the literature, discussing their suitability for PdM tasks. Finally, to make the ideas and claims more concrete, we demonstrate XAI applied in four specific industrial use cases: commercial vehicles, metro trains, steel plants, and wind farms, spotlighting areas requiring further research. Comment: 51 pages, 9 figures.

    Regmentation: A New View of Image Segmentation and Registration

    Image segmentation and registration have been the two major areas of research in the medical imaging community for decades and still are. In the context of radiation oncology, segmentation and registration methods are widely used for target structure definition such as prostate or head and neck lymph node areas. In the past two years, 45% of all articles published in the most important medical imaging journals and conferences have presented either segmentation or registration methods. In the literature, both categories are treated rather separately even though they have much in common. Registration techniques are used to solve segmentation tasks (e.g. atlas-based methods) and vice versa (e.g. segmentation of structures used in a landmark-based registration). This article reviews the literature on image segmentation methods by introducing a novel taxonomy based on the amount of shape knowledge being incorporated in the segmentation process. Based on that, we argue that all global shape prior segmentation methods are identical to image registration methods and that such methods thus cannot be characterized as either image segmentation or registration methods. Therefore, we propose a new class of methods that are able to solve both segmentation and registration tasks. We call it regmentation. Quantified on a survey of the current state-of-the-art medical imaging literature, it turns out that 25% of the methods are pure registration methods, 46% are pure segmentation methods and 29% are regmentation methods. The new view on image segmentation and registration provides a consistent taxonomy in this context and emphasizes the importance of regmentation in current medical image processing research and radiation oncology image-guided applications.

    Investigation related to multispectral imaging systems

    A summary of technical progress made during a five-year research program directed toward the development of operational information systems based on multispectral sensing and the use of these systems in earth-resource survey applications is presented. Efforts were undertaken during this program to: (1) improve the basic understanding of the many facets of multispectral remote sensing, (2) develop methods for improving the accuracy of information generated by remote sensing systems, (3) improve the efficiency of data processing and information extraction techniques to enhance the cost-effectiveness of remote sensing systems, (4) investigate additional problems having potential remote sensing solutions, and (5) apply the existing and developing technology for specific users and document and transfer that technology to the remote sensing community.

    A complete online SVM and case base reasoning in pipe defect dection with multisensory inspection gauge

    An in-line inspection (ILI) robot has come to be considered a necessity for performing non-destructive testing efficiently and economically. The detection of flaws that could lead to leakages in buried concrete pipes has been a great concern to the oil and gas industry and to water-resource-based industries. The major problem is the difficulty of modeling crack detection, because cracks are irregular and random and therefore not easily detected. Consequently, advanced modality systems have emerged. Common defect detection systems favor non-destructive testing methods that utilize one specific type of sensory data. Only a few systems focus on fusing different types of sensory data. Moreover, the decision mechanism in these systems requires sensors with heavy power consumption, configured with domain expertise. In addition, the outcome of the decision system is a consequence of rule-based settings rather than a mixture of learned features. This work studies defect detection with non-destructive testing methods using fused inspection sensors: light detection and ranging (LiDAR) and optical sensors. Studies on ILI robots are reviewed to construct an efficient gauge. A prototype robot has been designed and successfully operated in a lab-scale environment. Ultimately, the study proposes a replacement for the standard expert system in the form of a case-based reasoning (CBR) system, which is the crucial contribution of this thesis. Recent developments in CBR systems have led to an interest in machine learning (ML) approaches as replacements for traditional weighted distance methods. However, valuable information obtained during training is lost when moving on to other phases. The complete SVM-CBR system in this thesis therefore concentrates on closing this gap by presenting an effective mechanism for transferring information from phase to phase. 
This thesis proposes a full pipeline integration of CBR using a kernel method, the support vector machine (SVM). The SVM is the primary classification engine for the combined sensory data. Since the system requires a trained SVM model to be invoked in every phase, an online learning mechanism is used to update the model efficiently when a new case is added. The proposed full SVM-CBR integration has been successfully built into a pipe defect detection system. The results indicate a substantial improvement in the accuracy with which learned information is transferred.

    Smart Urban Water Networks

    This book presents the paper form of the Special Issue (SI) on Smart Urban Water Networks. The number and topics of the papers in the SI confirm the growing interest of operators and researchers in the new paradigm of smart networks, as part of the more general smart city. The SI showed that digital information and communication technology (ICT), with the implementation of smart meters and other digital devices, can significantly improve the modelling and the management of urban water networks, contributing to a radical transformation of the traditional paradigm of water utilities. The paper collection in this SI includes different crucial topics such as the reliability, resilience, and performance of water networks, innovative demand management, and the novel challenge of real-time control and operation, along with their implications for cyber-security. The SI collected fourteen papers that provide a wide perspective of solutions, trends, and challenges in the context of smart urban water networks. Some solutions have already been implemented in pilot sites (e.g., for water network partitioning, cyber-security, and water demand disaggregation and forecasting), while further investigations are required for other methods, e.g., the data-driven approaches for real-time control. In all cases, a new deal between academia, industry, and governments must be embraced to start the new era of smart urban water systems.