Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
The use of self-supervised pre-training has emerged as a promising approach
to enhance the performance of visual tasks such as image classification. In
this context, recent approaches have employed the Masked Image Modeling
paradigm, which pre-trains a backbone by reconstructing visual tokens
associated with randomly masked image patches. This masking approach, however,
introduces noise into the input data during pre-training, leading to
discrepancies that can impair performance during the fine-tuning phase.
Furthermore, input masking neglects the dependencies between corrupted patches,
increasing the inconsistencies observed in downstream fine-tuning tasks. To
overcome these issues, we propose a new self-supervised pre-training approach,
named Masked and Permuted Vision Transformer (MaPeT), that employs
autoregressive and permuted predictions to capture intra-patch dependencies. In
addition, MaPeT employs auxiliary positional information to reduce the
disparity between the pre-training and fine-tuning phases. In our experiments,
we employ a fair setting to ensure reliable and meaningful comparisons and
conduct investigations on multiple visual tokenizers, including our proposed
κ-CLIP, which directly employs discretized CLIP features. Our results
demonstrate that MaPeT achieves competitive performance on ImageNet, compared
to baselines and competitors under the same model setting. Source code and
trained models are publicly available at: https://github.com/aimagelab/MaPeT
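As a rough illustration of the permuted-prediction idea (a sketch only, not the authors' implementation), the snippet below shows how a random factorization order over patch tokens induces the attention mask used for permuted autoregressive prediction: each token may only attend to tokens that precede it in the sampled permutation. The function name and the tiny example are hypothetical.

```python
import numpy as np

def permuted_prediction_mask(num_patches: int, seed: int = 0):
    """Build an attention mask for permuted token prediction.

    The patch at permutation position i may attend only to patches that
    appear earlier in the sampled permutation, so every visual token is
    predicted autoregressively under a random factorization order.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_patches)            # random factorization order
    position_in_order = np.empty(num_patches, dtype=int)
    position_in_order[order] = np.arange(num_patches)

    # mask[i, j] == True  ->  patch i may attend to patch j
    mask = position_in_order[None, :] < position_in_order[:, None]
    return order, mask

if __name__ == "__main__":
    order, mask = permuted_prediction_mask(num_patches=6)
    print("prediction order:", order)
    print("attention mask:\n", mask.astype(int))
```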
Pre-processing, classification and semantic querying of large-scale Earth observation spaceborne/airborne/terrestrial image databases: Process and product innovations.
According to Wikipedia, “big data is the term adopted for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The big data challenges typically include capture, curation, storage, search, sharing, transfer, analysis and visualization”.
Proposed by the intergovernmental Group on Earth Observations (GEO), the visionary goal of the Global Earth Observation System of Systems (GEOSS) implementation plan for years 2005-2015 is the systematic transformation of multi-source Earth Observation (EO) “big data” into timely, comprehensive and operational EO value-adding products and services, submitted to the GEO Quality Assurance Framework for Earth Observation (QA4EO) calibration/validation (Cal/Val) requirements. To date, the GEOSS mission cannot be considered fulfilled by the remote sensing (RS) community. This is tantamount to saying that past and existing EO image understanding systems (EO-IUSs) have been outpaced by the rate of collection of EO sensory big data, whose quality and quantity are ever-increasing. This fact is supported by several observations. For example, no European Space Agency (ESA) EO Level 2 product has ever been systematically generated at the ground segment. By definition, an ESA EO Level 2 product comprises a single-date multi-spectral (MS) image radiometrically calibrated into surface reflectance (SURF) values corrected for geometric, atmospheric, adjacency and topographic effects, stacked with its data-derived scene classification map (SCM), whose thematic legend is general-purpose, user- and application-independent and includes quality layers, such as cloud and cloud-shadow. Since no GEOSS exists to date, present EO content-based image retrieval (CBIR) systems lack EO image understanding capabilities. Hence, no semantic CBIR (SCBIR) system exists to date either, where semantic querying is a synonym of semantics-enabled knowledge/information discovery in multi-source big image databases.
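For readers who prefer a concrete picture, the hypothetical container below mirrors the Level 2 product definition above (surface-reflectance image, scene classification map, and quality layers). The class and field names are illustrative only and are not part of any ESA specification.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class EOLevel2Product:
    """Illustrative container for an ESA EO Level 2 product as described above:
    a calibrated surface-reflectance image stacked with its scene classification
    map and per-pixel quality layers."""
    surf: np.ndarray                  # surface reflectance, shape (bands, rows, cols)
    scm: np.ndarray                   # scene classification map, shape (rows, cols)
    quality_layers: dict = field(default_factory=dict)  # e.g. {"cloud": mask, "cloud_shadow": mask}

# minimal usage with synthetic arrays
rows, cols, bands = 4, 4, 3
product = EOLevel2Product(
    surf=np.zeros((bands, rows, cols), dtype=np.float32),
    scm=np.zeros((rows, cols), dtype=np.uint8),
    quality_layers={"cloud": np.zeros((rows, cols), dtype=bool),
                    "cloud_shadow": np.zeros((rows, cols), dtype=bool)},
)
```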
In set theory, if set A is a strict superset of (or strictly includes) set B, then A ⊃ B. This doctoral project moved from the working hypothesis that SCBIR ⊃ computer vision (CV), where vision is a synonym of scene-from-image reconstruction and understanding, ⊃ EO image understanding (EO-IU) in operating mode, a synonym of GEOSS ⊃ ESA EO Level 2 product ⊃ human vision. Meaning that a necessary but not sufficient pre-condition for SCBIR is CV in operating mode, this working hypothesis has two corollaries. First, human visual perception, encompassing well-known visual illusions such as the Mach bands illusion, acts as a lower bound of CV within the multi-disciplinary domain of cognitive science, i.e., CV is conditioned to include a computational model of human vision. Second, a necessary but not sufficient pre-condition for the yet-unfulfilled GEOSS development is systematic generation at the ground segment of ESA EO Level 2 products.
Starting from this working hypothesis, the overarching goal of this doctoral project was to contribute to research and technical development (R&D) toward filling the analytic and pragmatic information gap from EO big sensory data to EO value-adding information products and services. This R&D objective was conceived to be twofold. First, to develop an original EO-IUS in operating mode, a synonym of GEOSS, capable of systematic ESA EO Level 2 product generation from multi-source EO imagery. EO imaging sources vary in terms of: (i) platform, either spaceborne, airborne or terrestrial, and (ii) imaging sensor, either: (a) optical, encompassing radiometrically calibrated or uncalibrated images, panchromatic or color images, either true- or false-color red-green-blue (RGB), multi-spectral (MS), super-spectral (SS) or hyper-spectral (HS) images, featuring spatial resolution from low (> 1 km) to very high (< 1 m), or (b) synthetic aperture radar (SAR), specifically, bi-temporal RGB SAR imagery.
The second R&D objective was to design and develop a prototypical implementation of an integrated closed-loop EO-IU for semantic querying (EO-IU4SQ) system as a GEOSS proof-of-concept in support of SCBIR. The proposed closed-loop EO-IU4SQ system prototype consists of two subsystems for incremental learning. A primary (dominant, necessary but not sufficient) hybrid (combined deductive/top-down/physical model-based and inductive/bottom-up/statistical model-based) feedback EO-IU subsystem in operating mode requires no human-machine interaction to automatically transform, in linear time, a single-date MS image into an ESA EO Level 2 product as initial condition. A secondary (dependent) hybrid feedback EO Semantic Querying (EO-SQ) subsystem is provided with a graphical user interface (GUI) to streamline human-machine interaction in support of spatiotemporal EO big data analytics and SCBIR operations. EO information products generated as output by the closed-loop EO-IU4SQ system monotonically increase their value-added with closed-loop iterations.
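A minimal sketch, assuming nothing beyond the two-subsystem description above, of how the closed-loop EO-IU4SQ coupling could be wired. The toy brightness threshold and class-fraction query are placeholders, not the system's actual algorithms, and all names are hypothetical.

```python
import numpy as np

class EOIUSubsystem:
    """Primary (EO-IU) stage: stands in for the automatic mapping of a single-date
    MS image to a Level 2 classification map (here a trivial brightness threshold)."""
    def process(self, ms_image: np.ndarray) -> np.ndarray:
        brightness = ms_image.mean(axis=0)
        return (brightness > brightness.mean()).astype(np.uint8)  # toy SCM

class EOSQSubsystem:
    """Secondary (EO-SQ) stage: stands in for semantic querying over the maps
    produced upstream."""
    def query(self, scm: np.ndarray, class_id: int) -> float:
        return float((scm == class_id).mean())  # fraction of pixels in the class

class EOIU4SQSystem:
    """Closed loop: EO-IU output feeds EO-SQ; in the real system, user feedback
    from EO-SQ would refine EO-IU across iterations (incremental learning)."""
    def __init__(self):
        self.eo_iu, self.eo_sq = EOIUSubsystem(), EOSQSubsystem()
    def run(self, ms_image: np.ndarray, class_id: int = 1) -> float:
        return self.eo_sq.query(self.eo_iu.process(ms_image), class_id)

print(EOIU4SQSystem().run(np.random.rand(4, 32, 32)))
```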
Historical Document Digitization through Layout Analysis and Deep Content Classification
Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents.
This paper presents a hybrid approach to layout segmentation as well as a strategy to classify document regions, which is applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach and a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and fed into a Random Forest classifier. Experiments are conducted on the first volume of the ``Enciclopedia Treccani'', a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
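The region-classification step described above (CNN features fed into a Random Forest) can be sketched as follows. The random feature vectors stand in for descriptors produced by a pretrained CNN, and the class labels are hypothetical; this is not the paper's pipeline, only the general recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(600, 128))   # stand-in for CNN region descriptors
labels = rng.integers(0, 4, size=600)    # e.g. text, image, caption, margin (illustrative)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```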
Preliminary Assessment of HABIT for Children with Unilateral Cerebral Palsy Using Fidelity Measures
Purpose/Hypothesis: The purpose of this study was to behaviorally code participants’ behaviors during a Hand Arm Bimanual Intensive Training (HABIT) camp. It was hypothesized that the HABIT program would elicit high levels of motor and social behaviors, using behavioral coding as a measure of fidelity.
Number of Subjects: Five children (mean age = 8.8 years, SD = 1.6 years; three female) diagnosed with unilateral cerebral palsy (CP) with right-side impairment. Participants were classified as Manual Ability Classification System (MACS) levels I-III.
Materials and Methods: The HABIT camp took place over a two-week period: ten days of intervention, four hours daily, for a total of 40 hours. Daily intervention was overseen by two therapists assisted by seven volunteers trained in HABIT key principles. A fidelity measurement was implemented to establish, through behavioral coding, whether participant behaviors were congruent with the intervention principles of HABIT. Video footage was collected at random intervals throughout the intervention to measure the duration of the following behaviors: right/left contact, right/left object manipulation, tasks [i.e., therapist-provided activities that either do (complex tasks) or do not (simple tasks) cognitively challenge the subject], social engagement with peers, and focused attention (i.e., when the subject focuses on an object while object exploration occurs). This preliminary report contains three random videos per participant, averaging approximately 30 minutes per video (total of 6.75 hours). Datavyu software was used to code behaviors [interrater reliability = 85.4% ± 8.6%]. The variables were summed as durations and normalized as percentages.
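As an aside for readers unfamiliar with this kind of coding, the toy snippet below shows one way coded intervals could be summed as durations and normalized as percentages of coded time. The intervals, behavior names and total are invented, not the study's data.

```python
# Illustrative only: coded behavior intervals (onset, offset in seconds)
# are summed per behavior and expressed as a percentage of coded time.
coded_intervals = {
    "right_contact": [(0, 42), (90, 150)],
    "left_contact": [(10, 80)],
    "social_engagement": [(0, 60), (120, 180)],
}
total_coded_seconds = 300.0

durations = {b: sum(end - start for start, end in spans) for b, spans in coded_intervals.items()}
percentages = {b: 100.0 * d / total_coded_seconds for b, d in durations.items()}
for behavior, pct in percentages.items():
    print(f"{behavior}: {pct:.1f}% of coded time")
```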
Results: On average, the percentage duration of contacts was relatively similar between the left (M = 63.8, SD = 11.7) and right (M = 46.1, SD = 12.5) hands. The percentage duration of object manipulation varied between the left (M = 20.0, SD = 10.9) and right (M = 5.7, SD = 5.8) hands. Children were engaged in simple tasks (e.g., playing with play dough) (M = 34.9, SD = 12.4) more often than complex tasks (e.g., a target game) (M = 20.7, SD = 12.8), though this varied by participant. Children were socially engaged with their peers (M = 51.0, SD = 12.9) and spent much of the time focused on an object while exploring it (M = 34.8, SD = 13.3).
Conclusions: Both hands performed a similar duration of contact. Manipulations differed greatly between hands, favoring the unaffected left hand. This may be due to participants’ MACS classifications and their use of the affected hand primarily for support. Simple tasks were performed more often than complex tasks, and social engagement with peers occurred most of the time. Clinical Relevance: This preliminary report of the 2022 HABIT camp suggests the intervention accomplishes its established high-intensity and engagement principles of intervention but may be limited in meeting challenging task goals. This study adds to existing research testing HABIT’s methodological approach to physical therapy intervention and to the use of fidelity measures in clinical settings relating to HABIT programming.
Using Landmarks for Explaining Entity Matching Models
The state-of-the-art approaches for performing Entity Matching (EM) rely on machine and deep learning models for inferring pairs of matching/non-matching entities. Although experimental evaluations demonstrate that these approaches are effective, their adoption in real scenarios is limited by the fact that they are difficult to interpret. Explainable AI systems have recently been proposed to complement deep learning approaches. Their application to the EM scenario is still new and requires addressing the specificity of this task, characterized by particular dataset schemas, describing a pair of entities, and imbalanced classes.
This paper introduces Landmark Explanation, a generic and extensible framework that extends the capabilities of a post-hoc perturbation-based explainer to the EM scenario. Landmark Explanation generates perturbations that take advantage of the particular schemas of EM datasets, thus producing explanations that are more accurate and more interesting for users than those generated by competing approaches.
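The sketch below, which is not the authors' code, illustrates one plausible reading of the landmark idea: one entity of the pair is held fixed as the landmark while attribute tokens of the other are randomly dropped, yielding schema-aware perturbed pairs that a surrogate explainer could then score with the EM model. All names and records are made up.

```python
import random

def perturb_pair(landmark: dict, target: dict, n_samples: int = 5, drop_p: float = 0.3, seed: int = 0):
    """Generate perturbed entity pairs: the landmark entity is left untouched,
    tokens of the target entity's attributes are randomly dropped."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        perturbed = {
            attr: " ".join(tok for tok in value.split() if rng.random() > drop_p)
            for attr, value in target.items()
        }
        samples.append((landmark, perturbed))
    return samples

e1 = {"name": "canon eos 5d mark iv", "price": "2499"}
e2 = {"name": "canon 5d mk iv dslr body", "price": "2450"}
for pair in perturb_pair(landmark=e1, target=e2, n_samples=3):
    print(pair)
```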
Novelty Detection with Autoencoders for System Health Monitoring in Industrial Environments
Predictive Maintenance (PdM) is the newest strategy for maintenance management in
industrial contexts. It aims to predict the occurrence of a failure to minimize unexpected downtimes
and maximize the useful life of components. In data-driven approaches, PdM makes use of Machine
Learning (ML) algorithms to extract relevant features from signals, identify and classify possible
faults (diagnostics), and predict the components’ remaining useful life (prognostics). The major
challenge lies in the high complexity of industrial plants, where both operational conditions change
over time and a large number of unknown modes occur. A solution to this problem is offered by
novelty detection, where a representation of the machinery's normal operating state is learned and
compared with online measurements to identify new operating conditions. In this paper, a systematic
study of autoencoder-based methods for novelty detection is conducted. We introduce an architecture
template, which includes a classification layer to detect and separate the operating conditions, and
a localizer for identifying the most influential signals. Four implementations, with different deep
learning models, are described and used to evaluate the approach on data collected from a test rig.
The evaluation shows the effectiveness of the architecture and that the autoencoders outperform the
current baselines.
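A minimal sketch of the underlying novelty-detection recipe (not the paper's architecture, which additionally includes the classification layer and localizer): an autoencoder is trained on healthy-state data only, a threshold is set on its reconstruction error, and samples exceeding it are flagged as new operating conditions. The synthetic data and network sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
healthy = torch.randn(512, 16)            # stand-in for normal-state signal features
novel = torch.randn(64, 16) + 3.0         # shifted distribution = unseen operating condition

model = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):                      # train on healthy data only
    opt.zero_grad()
    loss = loss_fn(model(healthy), healthy)
    loss.backward()
    opt.step()

with torch.no_grad():
    err_healthy = ((model(healthy) - healthy) ** 2).mean(dim=1)
    threshold = torch.quantile(err_healthy, 0.99)   # tolerate ~1% false alarms on training data
    err_novel = ((model(novel) - novel) ** 2).mean(dim=1)
    print("fraction flagged as novel:", (err_novel > threshold).float().mean().item())
```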
Visually Evoked Postural Responses (VEPRs) in Children with Vestibular Migraine
Vestibular migraine (VM) is the most common cause of episodic vertigo in children. Vertigo, nausea, dizziness and unsteadiness are often reported by children with migraine, and can precede, follow or be present simultaneously with headache. The aim of this study was to use posturography to investigate the visually evoked postural responses (VEPRs) of children with VM and compare them to data obtained from children with primary headache (M) and controls (C). Twenty children diagnosed with VM, nineteen children with M without aura and twenty healthy subjects were recruited in this cross-sectional study. Posturography was performed on a standardized stabilometric force platform (Svep-Politecnica) in the following conditions: open eyes (OE), closed eyes (CE) and during full-field horizontal optokinetic stimulation (OKN-S). Electronystagmography was performed simultaneously to analyze optokinetic reflex parameters. In the OE condition, no difference was found between groups with respect to body sway area. In contrast, this parameter increased in the two pathological groups with respect to controls in the CE condition. The optokinetic stimulation also induced a similar increase of body sway area in the M group relative to controls, but a further increase was elicited in the VM group. Electronystagmographic recording also revealed different optokinetic reflex parameters in the latter groups. This study disclosed an abnormal sensitivity of children with M and VM to full-field moving scenes and a consequent destabilization of posture, as documented by the abnormal VEPRs. Children with VM were particularly exposed to this risk. Possible clinical implications of these findings are discussed.
Segment-based simple-connectivity measure design and implementation
In developing different measures for the description of a segment’s shape, we noted that it would be useful to include a measure capable of quantifying the presence of holes. This was motivated by the following scenario. The measures we use to characterize a segment’s shape, such as RoundnessAndNoHole (also known as compactness), ConvexityAndNoHole and RectangularityAndNoHole are monotonically decreasing with the presence of holes, namely:
• RoundnessAndNoHole is high if Roundness is high and condition NoHole is true,
• ConvexityAndNoHole is high if Convexity is high and condition NoHole is true and, finally,
• RectangularityAndNoHole is high if Rectangularity is high and condition NoHole is true.
For example, a region with a perfectly round external boundary, but containing several holes, will present a low RoundnessAndNoHole measure. Were the holes not present in the region, it would instead feature a very high RoundnessAndNoHole measure. Besides these measures, our newly introduced version of a measure of elongatedness is also affected by the presence of holes, increasing as the number of holes increases.
In our study of satellite images, it is very common to find segments that contain holes, whether due to the underlying holes in the original observed structure or whether due to segmentation errors. In order to reason about these types of situations without having to change the definitions of the shape measures already in use (which are quite natural and intuitive), we introduce a new measure to quantify the presence of holes, which we call simple-connectivity. The simple-connectivity measure quantifies the extent to which a region is simply connected, i.e., the measure should be monotonically decreasing with holes whose cardinality increases or whose size increases (at fixed cardinality).

This work was supported in part by the National Aeronautics and Space Administration under Grant/Contract/Agreement No. NNX07AV19G issued through the Earth Science Division of the Science Mission Directorate.
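As a toy illustration only, and not the measure defined in this work, one crude proxy for how far a binary segment is from being simply connected is the ratio between its area and the area of the same segment with its holes filled, reported here together with the hole count. The example segment is synthetic.

```python
import numpy as np
from scipy import ndimage

segment = np.ones((20, 20), dtype=bool)
segment[5:8, 5:8] = False                  # carve two holes into a solid square
segment[12:14, 12:16] = False

filled = ndimage.binary_fill_holes(segment)
holes = filled & ~segment                  # pixels belonging to holes
_, n_holes = ndimage.label(holes)          # count connected hole components

area_ratio = segment.sum() / filled.sum()  # 1.0 when simply connected, smaller as holes grow
print(f"holes: {n_holes}, area ratio: {area_ratio:.3f}")
```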