Validating a large geophysical data set: Experiences with satellite-derived cloud parameters

Abstract

We are validating the global cloud parameters derived from measurements by the satellite-borne HIRS2 and MSU atmospheric sounding instruments, and are using the analysis of these data as one prototype for studying large geophysical data sets in general. The HIRS2/MSU data set contains a total of 40 physical parameters, filling 25 MB/day; raw HIRS2/MSU data are available for a period exceeding 10 years. Validation involves developing a quantitative sense for the physical meaning of the derived parameters over the range of environmental conditions sampled. This is accomplished by comparing the spatial and temporal distributions of the derived quantities with similar measurements made using other techniques, and with model results. The data handling needed for this work is possible only with the help of a suite of interactive graphical and numerical analysis tools. Level 3 (gridded) data are the common form in which large data sets of this type are distributed for scientific analysis. We find that Level 3 data are inadequate for the data comparisons required for validation; Level 2 data (individual measurements in geophysical units) are needed. A sampling problem arises when individual measurements, which are not uniformly distributed in space or time, are used for the comparisons. Standard 'interpolation' methods involve fitting the measurements from each data set to surfaces, which are then compared. We are experimenting with formal criteria for selecting geographical regions, based upon the spatial frequency and variability of the measurements, that allow us to quantify the uncertainty due to sampling. As part of this project, we are also developing ways to keep track of the constraints placed on the output by assumptions made in the computer code. The need to work with Level 2 data introduces a number of other data handling issues, such as accessing data files across machine types, meeting large data storage requirements, accessing other validated data sets, maintaining processing speed and throughput for interactive graphical work, and addressing problems with graphical interfaces.
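
To make the Level 2 comparison and sampling-screening ideas above concrete, the following is a minimal sketch, not code from this project, of one way to compare two irregularly sampled Level 2 data sets: each set of scattered retrievals is binned onto a common latitude/longitude grid, and grid cells are retained only where both data sets are sampled densely enough and with low enough within-cell variability. The function names, the min_count and max_cell_std thresholds, and the use of simple binning in place of surface fitting are illustrative assumptions, not the formal criteria described in the abstract.

    # Hedged sketch (not the project's code): compare two irregularly sampled
    # Level 2 data sets by binning each onto a common latitude/longitude grid,
    # then screening grid cells by sample count and within-cell variability
    # before differencing. All names and thresholds are illustrative.

    import numpy as np
    from scipy.stats import binned_statistic_2d

    def grid_level2(lon, lat, values, lon_edges, lat_edges):
        # Bin scattered Level 2 retrievals onto a regular grid; return the
        # per-cell mean, sample count, and standard deviation.
        mean, _, _, _ = binned_statistic_2d(lon, lat, values, statistic="mean",
                                            bins=[lon_edges, lat_edges])
        count, _, _, _ = binned_statistic_2d(lon, lat, values, statistic="count",
                                             bins=[lon_edges, lat_edges])
        std, _, _, _ = binned_statistic_2d(lon, lat, values, statistic="std",
                                           bins=[lon_edges, lat_edges])
        return mean, count, std

    def compare_datasets(ds_a, ds_b, lon_edges, lat_edges,
                         min_count=5, max_cell_std=0.5):
        # Difference two gridded fields only where both are well sampled.
        # min_count and max_cell_std are hypothetical screening criteria
        # standing in for formal sampling-based region selection.
        mean_a, n_a, sd_a = grid_level2(*ds_a, lon_edges, lat_edges)
        mean_b, n_b, sd_b = grid_level2(*ds_b, lon_edges, lat_edges)
        usable = ((n_a >= min_count) & (n_b >= min_count)
                  & (sd_a <= max_cell_std) & (sd_b <= max_cell_std))
        diff = np.where(usable, mean_a - mean_b, np.nan)
        return diff, usable

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Synthetic scattered "cloud fraction" retrievals from two instruments.
        ds_a = (rng.uniform(-180, 180, 5000), rng.uniform(-60, 60, 5000),
                rng.uniform(0, 1, 5000))
        ds_b = (rng.uniform(-180, 180, 3000), rng.uniform(-60, 60, 3000),
                rng.uniform(0, 1, 3000))
        lon_edges = np.arange(-180.0, 181.0, 10.0)
        lat_edges = np.arange(-60.0, 61.0, 10.0)
        diff, usable = compare_datasets(ds_a, ds_b, lon_edges, lat_edges)
        print(f"{usable.sum()} of {usable.size} cells usable; "
              f"mean |difference| = {np.nanmean(np.abs(diff)):.3f}")

In a full comparison, the screening thresholds would come from the formal sampling criteria the project is developing, and the binning step could be replaced by fitting each data set to a surface before differencing, as described in the abstract.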
