COMET: A Recipe for Learning and Using Large Ensembles on Massive Data
COMET is a single-pass MapReduce algorithm for learning on large-scale data.
It builds multiple random forest ensembles on distributed blocks of data and
merges them into a mega-ensemble. This approach is appropriate when learning
from massive-scale data that is too large to fit on a single machine. To get
the best accuracy, IVoting should be used instead of bagging to generate the
training subset for each decision tree in the random forest. Experiments with
two large datasets (5GB and 50GB compressed) show that COMET compares favorably
(in both accuracy and training time) to learning on a subsample of data using a
serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble
evaluation which dynamically decides how many ensemble members to evaluate per
data point; this can reduce evaluation cost by 100X or more.
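The abstract does not spell out the Gaussian lazy-evaluation rule, so the sketch below only illustrates the general idea: evaluate ensemble members one at a time, track the running mean vote, and stop once a normal-approximation confidence interval around that mean clears the decision threshold. The function and parameter names (lazy_ensemble_predict, z, min_members) are illustrative assumptions, not the paper's code.

import numpy as np

def lazy_ensemble_predict(members, x, z=2.0, min_members=10):
    """Minimal sketch of lazy ensemble evaluation for one data point.
    `members` is a list of callables, each returning a score in [0, 1];
    evaluation stops early once a Gaussian confidence interval around the
    running mean vote lies entirely above or below the 0.5 threshold.
    (z and min_members are illustrative defaults, not from the paper.)"""
    votes = []
    for i, member in enumerate(members, start=1):
        votes.append(member(x))
        if i >= min_members:
            mean = np.mean(votes)
            sem = np.std(votes, ddof=1) / np.sqrt(i)  # standard error of the mean
            if mean - z * sem > 0.5 or mean + z * sem < 0.5:
                break  # interval clear of the decision boundary: stop evaluating
    return int(np.mean(votes) > 0.5), len(votes)

With a large ensemble, most points are decided after a small fraction of the members have been evaluated, which is where the claimed reduction in evaluation cost comes from.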
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community
In recent years, deep learning (DL), a re-branding of neural networks (NNs),
has risen to the top in numerous areas, including computer vision (CV), speech
recognition, and natural language processing. Whereas remote sensing (RS)
poses a number of unique challenges, primarily related to sensors and
applications, RS inevitably draws on many of the same theories as CV, e.g.,
statistics, fusion, and machine learning, to name a few. This means that the RS
community should be aware of, if not at the leading edge of, advancements
like DL. Herein, we provide the most comprehensive survey of state-of-the-art
RS DL research. We also review recent developments in the DL field that can
be applied to DL for RS. Namely, we focus on theories, tools, and challenges for
the RS community. Specifically, we focus on unsolved challenges and
opportunities as they relate to (i) inadequate data sets, (ii)
human-understandable solutions for modelling physical phenomena, (iii) Big
Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and
learning algorithms for spectral, spatial and temporal data, (vi) transfer
learning, (vii) an improved theoretical understanding of DL systems, (viii)
high barriers to entry, and (ix) training and optimizing DL models.
Comment: 64 pages, 411 references. To appear in Journal of Applied Remote Sensing.
Inter-annual stability of land cover classification: explorations and improvements
Land cover information is a key input to many earth system models, and thus accurate and consistent land cover maps are critically important to global change science. However, existing global land cover products show unrealistically high levels of year-to-year change. This thesis explores methods to improve accuracies for global land cover classifications, with a focus on reducing spurious year-to-year variation in results derived from MODIS data. In the first part of this thesis I use clustering to identify spectrally distinct sub-groupings within defined land cover classes, and assess the spectral separability of the resulting sub-classes. Many of the sub-classes are difficult to separate due to a high degree of overlap in spectral space.
In the second part of this thesis, I examine two methods to reduce year-to-year variation in classification labels. First, I evaluate a technique to construct training data for a per-pixel supervised classification algorithm by combining multiple years of spectral measurements. The resulting classifier achieves higher accuracy and lower levels of year-to-year change than a reference classifier trained using a single year of data. Second, I use a spatio-temporal Markov Random Field (MRF) model to post-process the predictions of a per-pixel classifier. The MRF framework reduces spurious label change to a level comparable to that achieved by a post-hoc heuristic stabilization technique. The timing of label change in the MRF-processed maps better matches disturbance events in reference data, whereas the heuristic stabilization produces label changes that lag several years behind disturbance events.
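The abstract describes the spatio-temporal MRF only at a high level. As a rough illustration of how a temporal smoothness term suppresses spurious year-to-year label change, the sketch below applies a temporal-only, Viterbi-style stabilization to one pixel's per-year class probabilities; the spatial terms of the thesis model are omitted and switch_cost is an assumed illustrative parameter.

import numpy as np

def stabilize_pixel_labels(class_probs, switch_cost=1.5):
    """Temporal-only sketch of MRF-style label stabilization for one pixel.
    class_probs is a (years, classes) array of per-year classifier
    probabilities; switch_cost penalizes any year-to-year label change.
    Solved exactly over time with Viterbi-style dynamic programming."""
    log_p = np.log(np.clip(class_probs, 1e-12, None))
    n_years, n_classes = log_p.shape
    score = log_p[0].copy()
    back = np.zeros((n_years, n_classes), dtype=int)
    for t in range(1, n_years):
        # staying in the same class is free; changing class costs switch_cost
        trans = score[:, None] - switch_cost * (1 - np.eye(n_classes))
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + log_p[t]
    labels = np.zeros(n_years, dtype=int)
    labels[-1] = score.argmax()
    for t in range(n_years - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels

# Example: a single noisy year that flips toward class 1 is smoothed away.
probs = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.3, 0.5, 0.2],
                  [0.7, 0.2, 0.1], [0.8, 0.1, 0.1]])
print(stabilize_pixel_labels(probs))  # -> [0 0 0 0 0]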
Incremental inference on higher-order probabilistic graphical models applied to constraint satisfaction problems
Thesis (PhD)--Stellenbosch University, 2022.
ENGLISH ABSTRACT: Probabilistic graphical models (PGMs) are used extensively in the probabilistic reasoning domain. They are powerful tools for solving systems of complex relationships over a variety of probability distributions, with applications such as medical and fault diagnosis, predictive modelling, object recognition, localisation and mapping, speech recognition, and language processing [5, 6, 7, 8, 9, 10, 11]. Furthermore, constraint satisfaction problems (CSPs) can be formulated as PGMs and solved with PGM inference techniques. However, the prevalent literature on PGMs shows that suboptimal PGM structures are primarily used in practice, along with a suboptimal formulation for constraint satisfaction PGMs.
This dissertation aimed to improve the PGM literature by providing accessible algorithms and tools for better PGM structures and inference procedures, focusing specifically on constraint satisfaction. To this end, it presents three
published contributions to the current literature:
a comparative study of cluster graph topologies against the prevalent factor graphs [1],
an application of cluster graphs in land cover classification in the field of cartography [2], and
a comprehensive integration of various aspects required to formulate CSPs as
PGMs and an algorithm to solve this formulation for problems too complex
for traditional PGM tools [3].
First, we present a means of formulating and solving graph colouring problems with probabilistic graphical models. In contrast to the prevailing literature,
which mostly uses factor graph configurations, we approach the problem from a cluster graph perspective, using the general-purpose cluster graph construction algorithm, LTRIP.
Our experiments indicate a significant advantage for cluster graphs over factor graphs, in terms of both accuracy and computational efficiency.
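For readers unfamiliar with the formulation, the toy sketch below shows how a graph colouring problem can be posed as a discrete PGM: one variable per node and a pairwise "not equal" factor per edge. It uses brute-force enumeration rather than the cluster graph and LTRIP machinery of the publication, so it illustrates only the model, not the method; all names are illustrative.

import itertools
import numpy as np

def colouring_factors(edges, n_colours):
    """One pairwise factor per edge: potential 1 for differing colours, 0 otherwise."""
    not_equal = 1.0 - np.eye(n_colours)   # factor table shared by every edge
    return {edge: not_equal for edge in edges}

def enumerate_colourings(n_nodes, edges, n_colours):
    """Brute-force inference over the factor product (feasible only for toy graphs)."""
    factors = colouring_factors(edges, n_colours)
    valid = []
    for assignment in itertools.product(range(n_colours), repeat=n_nodes):
        potential = np.prod([factors[(u, v)][assignment[u], assignment[v]]
                             for (u, v) in edges])
        if potential > 0:
            valid.append(assignment)
    return valid

# Example: a triangle admits exactly the 6 proper 3-colourings.
print(enumerate_colourings(3, [(0, 1), (1, 2), (0, 2)], 3))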
Secondly, we use these tools to solve a practical problem: land cover classification. This process is complex due to measurement errors, inefficient algorithms, and
low-quality data. We propose a PGM approach to boost geospatial classifications
from different sources while accounting for the effects of spatial distribution and inter-class dependencies (similar to graph colouring). Our PGM tools were shown to be
robust and were able to produce a diverse, feasible, and spatially consistent land cover classification even in areas of incomplete and conflicting evidence.
Lastly, in our third publication, we investigated and improved the PGM structures used for constraint satisfaction. It is known that tree-structured PGMs always yield an exact solution [12, p. 355], but they are usually impractical for interesting
problems due to exponential blow-up. We therefore developed the purge-and-merge algorithm to incrementally approximate a tree-structured PGM. This algorithm iteratively nudges a malleable graph structure towards a tree structure by selectively merging factors. The merging process is designed to avoid exponential
blow-up through sparse data structures from which redundancy is purged as the algorithm progresses. The algorithm was tested on constraint satisfaction puzzles such
as Sudoku, Fill-a-pix, and Kakuro, and it outperforms other PGM-based
approaches reported in the literature [13, 14, 15]. Overall, the research reported in
this dissertation contributed to developing a more optimised approach for higher-order probabilistic graphical models. Further studies should concentrate on applying purge-and-merge to problems closer to probabilistic reasoning than constraint
satisfaction and report its effectiveness in that domain.
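As a rough illustration of the merge step described above (not the thesis implementation), the sketch below represents each factor sparsely as a set of allowed assignments over its variable scope; merging two factors joins their scopes and keeps only mutually consistent assignments, purging the rest. Names and the example data are illustrative.

def merge_factors(scope_a, rows_a, scope_b, rows_b):
    """Sketch of one purge-and-merge style step: a factor is a variable scope
    plus a sparse set of allowed assignments. Merging joins the scopes and keeps
    only assignments that agree on the shared variables, purging everything else."""
    scope = list(dict.fromkeys(scope_a + scope_b))       # merged scope, order-preserving
    shared = [v for v in scope_a if v in scope_b]
    merged_rows = set()
    for ra in rows_a:
        a = dict(zip(scope_a, ra))
        for rb in rows_b:
            b = dict(zip(scope_b, rb))
            if all(a[v] == b[v] for v in shared):         # consistent on shared variables
                merged_rows.add(tuple({**a, **b}[v] for v in scope))
    return scope, merged_rows

# Example: two small constraint fragments over overlapping variables.
scope1, rows1 = ["x1", "x2"], {(1, 2), (2, 1), (1, 3)}
scope2, rows2 = ["x2", "x3"], {(2, 3), (1, 2), (3, 1)}
print(merge_factors(scope1, rows1, scope2, rows2))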
AFRIKAANSE OPSOMMING: Probabilistic graphical models (PGMs) are widely used for complex probabilistic problems. They are powerful tools for solving systems of complex relationships over a variety of probability distributions, with applications such as medical and fault diagnosis, predictive modelling, object recognition, localisation and mapping, speech recognition, and language processing [5, 6, 7, 8, 9, 10, 11]. Furthermore, constraint satisfaction problems (CSPs) can be formulated as PGMs and solved with PGM inference techniques. The prevailing literature on PGMs, however, shows that suboptimal PGM structures are mainly used in practice, along with a suboptimal PGM formulation for CSPs.
The aim of this dissertation is to improve the PGM literature through accessible algorithms and tools for better PGM structures and inference procedures, with a focus on CSP applications. To this end, the dissertation adds three published contributions to the current literature:
a comparative study of cluster graphs against the prevailing factor graphs [1],
a practical application of cluster graphs to land cover classification in the field of cartography [2], and
a comprehensive integration of the various aspects required to formulate CSPs as PGMs, together with an algorithm for this formulation for problems too complex for traditional PGM tools [3].
First, we present a way of formulating and solving graph colouring problems with PGMs. In contrast to the current literature, which mostly uses factor graphs, we approach the problem from a cluster graph perspective, using the automatic cluster graph construction algorithm, LTRIP. Our experiments show a significant preference for cluster graphs over factor graphs, in terms of both accuracy and computational efficiency.
Second, we use these tools to solve a practical problem: land cover classification. This process is complex due to measurement errors, inefficient algorithms, and low-quality data. We propose a PGM approach to reinforce geospatial classifications from different sources while accounting for the effects of spatial distribution and inter-class dependencies (similar to graph colouring). Our PGM tools proved robust and were able to produce a diverse, feasible, and spatially consistent land cover classification even in areas of incomplete and conflicting evidence.
Finally, in our third publication, we investigated and improved the PGM structures used for CSPs. It is known that tree structures always lead to an exact solution [12, p. 355], but they are usually impractical for interesting problems due to exponential blow-up. We therefore developed the purge-and-merge algorithm to incrementally approximate a tree structure. The algorithm reshapes a cluster graph step by step into a tree structure by selectively merging factors. The merging process is designed to avoid exponential blow-up by using sparse data structures from which the probability space is purged as the algorithm progresses. The algorithm was tested on CSP puzzles such as Sudoku, Fill-a-pix, and Kakuro and outperforms other PGM-based approaches reported in the literature [13, 14, 15]. Overall, this research contributed to the development of a more optimised approach for higher-order PGMs. Further studies should focus on applying purge-and-merge to problems closer to probabilistic reasoning than to CSPs and report on its effectiveness in that domain.