Validation of Soil Moisture Data Products from the NASA SMAP Mission

NASA’s Soil Moisture Active Passive (SMAP) mission has been validating its soil moisture (SM) products since the start of data production on March 31, 2015. Prior to launch, the mission defined a set of criteria for core validation sites (CVS) that enable the testing of the key mission SM accuracy requirement (unbiased root-mean-square error <0.04 m³/m³). The validation approach also includes other (“sparse network”) in situ SM measurements, satellite SM products, model-based SM products, and field experiments. Over the past six years, the SMAP SM products have been analyzed with respect to these reference data, and the analysis approaches themselves have been scrutinized in an effort to best understand the products’ performance. Validation of the most recent SMAP Level 2 and 3 SM retrieval products (R17000) shows that the L-band (1.4 GHz) radiometer-based SM record continues to meet mission requirements. The products are generally consistent with SM retrievals from the ESA Soil Moisture and Ocean Salinity mission, although there are differences in some regions. The high-resolution (3-km) SM retrieval product, generated by combining Copernicus Sentinel-1 data with SMAP observations, performs within expectations. Currently, however, there is limited availability of 3-km CVS data to support extensive validation at this spatial scale. The most recent (version 5) SMAP Level 4 SM data assimilation product providing surface and root-zone SM with complete spatio-temporal coverage at 9-km resolution also meets performance requirements. The SMAP SM validation program will continue throughout the mission life; future plans include expanding it to forested and high-latitude regions.

C.N. is with EURAC-Institute for Earth Observation, Bolzano, Italy (e-mail: claudia.notarnicola@eurac.edu).

I. INTRODUCTION
NASA's Soil Moisture Active Passive (SMAP) mission has produced global soil moisture (SM) measurements since March 2015 [1]. SMAP uses the L-band (1.413 GHz) frequency to carry out the SM measurements because of its sensitivity to SM changes and relative insensitivity to confounding effects of surface roughness and vegetation [2]. As with other remotely sensed data products, the scientific value of these SM products is determined, in part, by how well their performance characteristics are known. The process of assessing the accuracy of a data product by independent means is called validation ([3], [4]). The SMAP mission established a rigorous validation program to verify that mission requirements are met and to provide information on the quality of the products to the community. The mission recognized the importance of the calibration and validation program early on, resulting in a comprehensive plan of validation activities during the prelaunch phase ([5]). In particular, the mission started engaging external partners and conducting validation exercises years before the launch. Moreover, the SMAP validation strategy benefited from two earlier missions that had a considerable focus on the validation of SM products: the JAXA AMSR-E (Advanced Microwave Scanning Radiometer-Earth Observing System) instrument launched by NASA on the Aqua satellite in 2002 ([6]), and the SMOS (Soil Moisture and Ocean Salinity) satellite launched by ESA (European Space Agency) in 2009 ([7]). AMSR-E validation efforts spurred the development of locally dense observation networks with surface SM measurements in hydrologic research watersheds for SM validation at the footprint scale of these satellites (tens of kilometers) (e.g., [8], [9], [10], [11], [12], [13], [14]). This trend continued with SMOS (e.g., [15], [16], [17], [18], [19]).
Consequently, when SMAP was launched in 2015, there was already a significant infrastructure of locally dense networks in place with respect to the remote sensing footprint size, due to these earlier efforts and active international cooperation.
The SMAP project evaluated these existing locally dense networks for their suitability as so-called SMAP core validation sites (CVS) and called for expanding these kinds of observations as much as possible, while also incorporating sparse networks (typically providing just one point-scale observation location within a footprint), other satellite data products, model-based products, and field experiments into the SM validation plan ( [5]). The AMSR-E and SMOS validation efforts utilized these components as well. In the US, the AMSR-E community led a series of field experiments that also included airborne observations (e.g., [20], [21], [22]). The experience gained from these experiments was invaluable for the subsequent SMAP validation experiments (Section III.F). SMOS SM validation plans [23] similarly included field experiments (e.g., [24]), sparse networks (e.g., [25]), and other approaches (e.g., [26]) in addition to dense networks (e.g., [27], [28], [18], [29]). In the 1990s, an effort was started to collect global SM measurements in a single database called the Global Soil Moisture Data Bank [30]. ESA and SMOS continued the development of a centralized repository via the ongoing collection of in situ SM observations into the International Soil Moisture Network (ISMN; [31]).
The SMAP project required the release of beta and validated versions of the SM data products after 6 and 12 months, respectively, from the start of science observations [5]. This timeline drove many decisions in the development of the validation plan and tools. Only reference data for the period after the start of the SMAP science observations on March 31, 2015 could be used for SMAP validation. Moreover, reference measurements needed to be available to SMAP with short temporal latency to facilitate validation of the beta and validated product versions in time for their public release. The mission's emphasis on this point ensured that there were data available during the first months of the validation period to meet the challenging timeline. This is also the main reason the SMAP validation team connected directly to the data providers and operated outside of established repositories, such as the International Soil Moisture Network (ISMN), that do not have such strict latency requirements. Furthermore, the bulk of the data processing tools were developed, and the data formats, transfer protocols, etc. were agreed upon, before the launch of SMAP. This arguably reduced the flexibility to include data sets that did not meet the constraints during the first year of the validation. However, once the most intensive phase of the validation was completed, the mission was able to increase flexibility and relax its previous requirements on the latency of the validation data.
A unique aspect of the SMAP SM validation is the need to assess the surface and root-zone (0-1 m depth) SM from the SMAP Level 4 (L4) product, which is based on the assimilation of SMAP brightness temperature (TB) observations into a land surface model [39]. To the extent possible, the SM validation strategy for the L4 product is similar to that of the directly retrieved Level 2 (L2) and Level 3 (L3) surface-only SM products (Section II), which is based on CVS, sparse networks, and other data sources. However, adaptation to the unique characteristics of the L4 product resulted in some differences between the validation of the L4 and L2/L3 products (Sections III and IV).
The Committee on Earth Observation Satellites (CEOS) has advanced a four-stage validation hierarchy, which has been adopted by many providers of satellite data products (https://lpvs.gsfc.nasa.gov). The validation stages increase with the breadth of the validation effort (Appendix A). SMAP was operating at validation Stage 1 during the first year of the mission, which implied that the assessment was conducted based on comparisons to in situ reference data collected at a small set of locations and over relatively short time periods. The SM products achieved Stage 2 shortly thereafter with the extension of the spatial and temporal scope and the publication of the first validation results in the peer-reviewed literature. Since then, SMAP has continued to expand the analysis, with significant contributions from the community, to achieve Stage 3 maturity. Stage 4 (the final stage) requires the Stage 3 level analysis to be updated systematically over time. Many of the SMAP validation analyses are currently updated on a yearly basis and released in the data product assessment reports (e.g., [32], [33]), which satisfies the key aspect of Stage 4 validation.
The experience of the AMSR-E, SMOS, and SMAP SM validation efforts contributed to two important community reference documents that outline best practices for SM validation along with guidance for future development of these practices [34], [35]. The SMAP SM validation approach is largely in line with these recommendations. This paper lays out the SMAP SM validation approach, describes the use of the different methodologies, discusses the uncertainty estimates associated with the validation analyses (Section III), and finally presents updated validation results for the most recent versions (R17000 for L2/L3 and Vv5030 for L4) of the SMAP SM products (Section IV).

II. SMAP SOIL MOISTURE DATA PRODUCTS
Since March 31, 2015, the SMAP mission has delivered data products containing instrument measurements (Level 1), geophysical SM retrievals (swath-based, L2, and daily composite, L3), and SM estimates from data assimilation of the instrument measurements into a land surface model (L4). On 7 July 2015, the SMAP radar malfunctioned and ceased operation. Prior to the radar malfunction, SMAP provided four different surface SM products and one root-zone SM product [1]: radiometer-based surface SM on a 36-km grid [36], radar-based surface SM on a 3-km grid [37], radiometer and radar combined surface SM on a 9-km grid [38], and surface and root-zone SM based on the assimilation of SMAP TB observations into the NASA Goddard Earth Observing System (GEOS) Catchment land surface model on a 9-km grid [39]. The grid used by the SMAP products is the version 2 Equal Area Scalable Earth (EASEv2) grid system [40], [41].
Following the failure of the radar, the mission introduced a new TB sampling approach and two new SM products. The SMAP 40° angle TB measurements have a 38-km resolution (defined by the half-power footprint on the Earth's surface of the radiometer antenna pattern); the radiometric resolution of the gridded TB is better than 0.5 K, and radio frequency interference (RFI) is filtered out of the measurements ([42], [43], [44]). The original radiometer sampling averaged the TB measurements over the 36-km EASEv2 grid cells using inverse distance weighting ([45]). The enhanced TB processing, developed after the radar malfunction, uses a Backus-Gilbert approach to sample measurements on the 9-km EASEv2 grid [46]. A new SM product was developed based on the enhanced TB product, which was also sampled onto the 9-km grid [47]. Because the spatial resolution of the TB measurement is considerably larger than the 9-km spacing of the sampling grid, the enhanced passive radiometer-based SM product (henceforth, PE) inverts the TB from a given 9-km grid cell into a SM estimate using ancillary data and parameters for a 33-km "aggregation domain" centered on the 9-km grid cell, thereby approximating the spatial resolution of the TB measurement. The second SM product introduced after the radar malfunction uses observations from the C-band radar on the Copernicus Sentinel-1a and 1b satellites to downscale the SMAP L-band radiometer TB measurements with an algorithm similar to that used by the original SMAP radar/radiometer combined product [38], and then derives SM from the downscaled TB field [48]. The SMAP/Sentinel-1 product (henceforth, SP) provides SM on 1-km and 3-km grids. The product uses the SMAP observations only when the Sentinel-1 measurements are available; therefore, the product covers the Earth in about 12 days (based on the Sentinel-1 repeat cycle), but the combined revisit interval of the two satellites is shorter for certain areas where data collection is prioritized (over Europe, for example) [48]. Table I summarizes the SMAP SM products.
The radiometer-based products (P and PE) include SM retrieved using three different algorithms: single channel vertical polarization (SCA-V), single channel horizontal polarization (SCA-H) and dual channel algorithms (DCA) [49]. Besides adding the enhanced and the SMAP/Sentinel-1 products described above, the mission has broadly improved all products over the years. Key improvements include the processing of ancillary data, such as the surface temperature [49], updating the SM retrieval algorithm for the DCA [50], and improving the modeling parameterization for the L4 product [51]. New versions of the SMAP SM products have been released approximately yearly with various enhancements and always accompanied by updated assessment reports (e.g., [32], [33]), plus a complete reprocessing of the data.
The SMAP baseline validation domain is defined by the product accuracy requirements of the mission. Surfaces with permanent ice and snow, urban areas, wetlands, and areas with above-ground vegetation water content greater than 5 kg/m² are excluded from the formal accuracy requirements and identified with a non-zero Retrieval Quality Flag (RQF) [1]. In recent years, the SMAP SM algorithm research has included improving the quality of SM retrievals in more densely vegetated regions, which has resulted in validation activities in forested areas [52], [53].

III. SMAP SOIL MOISTURE VALIDATION STRATEGY AND METHODOLOGIES
The SMAP SM validation strategy is driven by the mission validation requirements, the characteristics of the measured SM (accounting for natural variability in the horizontal, vertical and time dimensions), and the availability of high-quality reference data [5]. NASA required the SMAP mission to measure SM to within 0.04 m³/m³ accuracy, in an average aggregate sense, across the entire SM validation domain [1], [54], where accuracy is defined as the standard deviation of the error or unbiased root-mean-square error (ubRMSE, see Appendix B). Because the locally dense CVS SM monitoring networks provide the best available measurements of SM at the SMAP radiometer footprint scale (Section III.B), the mission chose the CVS data as the primary validation reference to establish that the accuracy requirement is met. Specifically, the product accuracy assessment is based on the average of the unbiased root-mean-square difference (ubRMSD) sampled at the CVS [56]. Other metrics used in the validation are the RMSD and the mean difference (MD) (see Appendix B). Owing to errors in the in situ measurements, the ubRMSD, RMSD and MD are conservative estimates of the true SMAP ubRMSE, RMSE and bias, respectively (Section III.A). The validation metrics further include the Pearson correlation (R; Appendix B) and the anomaly R computed using the departures from the multi-year, seasonally varying climatology computed for both the reference and SMAP SM [51]. Because the number of available SM CVS across the globe is limited, the validation strategy was complemented with additional data sources. These sources include geographically more extensive SM networks with only one, or very few, measurements within the footprint (i.e., sparse networks; Section III.C), other global satellite-based SM products (Section III.D), global land model-based estimates of SM (Section III.E), and field experiments (Section III.F).
Additionally, the validation of the L4 product included data assimilation diagnostics as an important element, reflecting the unique nature of L4 among the products (Section III.G). Each methodology has key features that are exploited in the continuing validation process to accomplish the most comprehensive validation possible for each product across time and space. The application of the methodologies and analysis approaches depends on the SMAP products. For example, there are few other high-resolution (<30 km) SM products, especially at the global scale, that can be compared with the SMAP active (A), active-passive (AP) and SP products. Triple collocation (TC) analysis has been used to support the use of the sparse networks and other satellite data products together with land model-based products for complementary assessments of the P and PE products [57], [58]. However, since the L4 product is based on merging radiometer observations into a land model, traditional TC analysis cannot be used for the L4 SM because there are not enough independent reference datasets. Instead, other instrumental variable approaches can be used to quantify the skill improvement from the assimilation of SMAP observations in L4 (relative to a model-only baseline; [59]). Table II summarizes the different validation methodologies and the analysis approaches applied with them for the different data products.
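The difference metrics named above can be illustrated with a short sketch. It assumes the conventional definitions (MD as the mean of the differences, RMSD as their root mean square, and ubRMSD as the RMSD after removal of the MD); the authoritative equations are those of Appendix B, and the function name `validation_metrics` is hypothetical.

```python
import numpy as np

def validation_metrics(smap, ref):
    """Return MD, RMSD, ubRMSD and Pearson R for paired SM series (m3/m3).

    Assumes conventional definitions; not taken from the mission software.
    """
    smap = np.asarray(smap, dtype=float)
    ref = np.asarray(ref, dtype=float)
    ok = np.isfinite(smap) & np.isfinite(ref)   # keep collocated valid pairs only
    d = smap[ok] - ref[ok]
    md = d.mean()                               # mean difference
    rmsd = np.sqrt(np.mean(d ** 2))             # root-mean-square difference
    # ubRMSD^2 = RMSD^2 - MD^2; clamp tiny negative round-off to zero
    ubrmsd = np.sqrt(max(rmsd ** 2 - md ** 2, 0.0))
    r = np.corrcoef(smap[ok], ref[ok])[0, 1]    # Pearson correlation
    return {"MD": md, "RMSD": rmsd, "ubRMSD": ubrmsd, "R": r}
```

In this sketch, a retrieval series that is offset from the reference by a constant 0.02 m³/m³ yields MD = RMSD = 0.02 m³/m³ and a ubRMSD of essentially zero, which is why the ubRMSD is described as bias-insensitive. An overall value for a product would then be the average of the per-site ubRMSD values.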
The SMAP satellite makes measurements in the morning and evening based on its 6 AM / 6 PM equator crossing sun-synchronous orbit configuration (Section II). The accuracy requirement for L2 SM products applies to the SM retrieved using the 6 AM (descending) SMAP overpasses because of the expected uniformity of the temperature across the soil-vegetation column [49], but the SM retrieved using the 6 PM overpasses is also validated (and actually has roughly equivalent accuracy performance to the AM overpasses). L4 SM is available and validated at 3-hourly intervals.
Currently, the data record of in situ comparisons with SMAP is over 6 years long. This allows a very detailed look into the performance of the SMAP products, including seasonal characteristics (e.g., [60], [61], [62]). Continuing data collection and validation are nevertheless important. CEOS considers the continuous monitoring of data consistency to be a key aspect of any validation program since it allows for the reliable detection of any potential anomalies. For example, SMAP experienced an operational anomaly from 19 June to 23 July 2019; once SMAP science measurements were available again after 23 July 2019, it was extremely important to have immediate access to concurrent validation data to verify that the SM retrieval performance remained unchanged following the operational anomaly.
The metrics computed with respect to in situ data are subject to sampling error and should always be provided together with statistical confidence intervals. Reference [34] summarizes the appropriate methodology to compute confidence intervals for each metric. While they provide a quantitative way of evaluating the statistical significance of the differences between different products and algorithm versions, it is important to emphasize that they do not provide the confidence with respect to the actual true value, but the confidence in the calculation of the difference between the two data sets (SMAP and the reference). This approach was adopted also by the SMAP validation team. The equations are summarized in Appendix B.
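As a generic illustration of how sampling uncertainty enters these metrics, the sketch below computes approximate 95% confidence intervals for R (via the Fisher z-transform) and for the MD (normal approximation). These are common textbook forms and are assumptions here, not necessarily the equations of Appendix B or [34]. In particular, the sketch treats the samples as independent, whereas SM time series are autocorrelated, which reduces the effective sample size. The function names are hypothetical.

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate CI for a Pearson correlation using the Fisher z-transform.

    n is the (effective) number of independent sample pairs.
    """
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)                 # Fisher transform of the sample correlation
    se = 1.0 / math.sqrt(n - 3)       # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def md_confidence_interval(md, sd_diff, n, z_crit=1.96):
    """Approximate CI for the mean difference given the std of the differences."""
    half = z_crit * sd_diff / math.sqrt(n)
    return md - half, md + half
```

Consistent with the caveat above, such intervals describe the sampling uncertainty of the SMAP-minus-reference statistics; they say nothing about the error of the reference itself.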
A. Horizontal and Vertical Variability of Soil Moisture
SM measurements are scale-dependent and must be interpreted in terms of their spatial support, spacing, and network extent [63]. The validation of SMAP SM with in situ measurements is complicated by the extremely different spatial support of the measurements (point-scale in situ sensor measurements vs. km-scale resolution of the SMAP products). Depending on the network, the spacing and extent of the in situ measurements may approximate the SMAP footprint, as is the case with the CVS. Perhaps most importantly, the spacing of the station measurements (number and distribution) must allow a reliable estimation of the average SM over the SMAP footprint. The required minimum number of point-scale sensors and their spacing is dictated by the spatial variability of SM within the area of interest and the desired accuracy for the estimate at that spatial scale (e.g., [64], [65], [66], [67], [68], [9], [10], [69], [70], [71]). For the SMAP CVS, Voronoi diagrams ([72]; or see Thiessen polygons in [73]) were chosen as the baseline upscaling approach to avoid geographical overweighting of clustered parts of the pixels [54]. Because of an extremely distinct soil texture gradient at the Carman CVS, an upscaling approach based on the soil texture distribution relative to the SM station locations was applied there [74]. Reference [75] presented the upscaling approach applied at the Twente CVS where a smaller number of continuously measuring stations are used to estimate the average SM based on a hydrological model and measurements from additional stations that do not cover the entire time period.
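The idea behind Voronoi (Thiessen polygon) upscaling can be sketched as follows. This is a minimal illustration, not the mission's implementation: it assumes planar station coordinates in km and a rectangular reference pixel, and it approximates each station's polygon area by rasterizing the pixel and assigning every raster point to its nearest station. The function names and the raster size `n` are hypothetical.

```python
import numpy as np

def thiessen_weights(station_xy, pixel_bounds, n=200):
    """Approximate Voronoi (Thiessen) area weights of stations within a pixel.

    station_xy   : (k, 2) station coordinates (km, planar; an assumption)
    pixel_bounds : (xmin, xmax, ymin, ymax) of the reference pixel
    n            : raster points per side used to approximate polygon areas
    """
    station_xy = np.asarray(station_xy, dtype=float)
    xmin, xmax, ymin, ymax = pixel_bounds
    # raster cell-center coordinates covering the pixel
    xs = xmin + (np.arange(n) + 0.5) * (xmax - xmin) / n
    ys = ymin + (np.arange(n) + 0.5) * (ymax - ymin) / n
    gx, gy = np.meshgrid(xs, ys)
    pts = np.column_stack([gx.ravel(), gy.ravel()])
    # squared distance from every raster point to every station
    d2 = ((pts[:, None, :] - station_xy[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(station_xy))
    return counts / counts.sum()

def upscale(sm, weights):
    """Weighted pixel-average SM, renormalizing over stations that report data."""
    sm = np.asarray(sm, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ok = np.isfinite(sm)
    return float(np.sum(weights[ok] * sm[ok]) / weights[ok].sum())
```

With two closely spaced stations near one edge of a 36-km pixel and a third station far from them, the two clustered stations share a small part of the area while the isolated station receives the largest weight. This is the behavior that avoids overweighting clustered parts of the pixel.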
The scale discrepancy presents a particular challenge for determining errors in large-scale SM products because the long-term mean SM at a randomly selected single point may be very different from that of the area-average SM. That is, point-scale SM measurements are typically biased with respect to area-average SM. Conversely, the time-varying component of SM typically has a large autocorrelation over long distances; that is, point measurements can better represent the SM temporal changes over domains of several km [76], [77]. Many studies of temporal stability of SM very effectively illustrate these differences of spatial and temporal evolution of SM (e.g., [78], [79], [80], [11], [81], [70]). Generally, the sparse networks lack adequate representation for resolving bias and RMSE at the scale of satellite SM retrieval footprints.
While arguably less severe than the challenges facing the sampling of bias [77], there are also challenges in estimating bias-insensitive metrics (e.g., ubRMSE, R) from sparse ground observations. Most notably, random spatial representativeness errors will spuriously inflate the sampled ubRMSD [82] and degrade the sampled estimates of R [57]. As a result, the recovery of absolute ubRMSE and R metrics acquired from in situ measurements, and especially from sparse networks, requires statistical upscaling techniques capable of estimating, and correcting for, the impact of random spatial representativeness errors (see Section III.C).
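The inflation of the sampled ubRMSD can be made explicit with a simple additive error model. The derivation below is a sketch under assumptions that are not taken from the text: the satellite estimate and the in situ reference each equal the true footprint-average SM plus a zero-mean error, and these errors are mutually uncorrelated. The symbols \theta, \varepsilon_s, \varepsilon_r and the product labels A and B are introduced here for illustration only; \varepsilon_r lumps together sensor error and spatial representativeness error.

```latex
\begin{align}
\theta_{s} &= \theta + \varepsilon_{s}, \qquad \theta_{r} = \theta + \varepsilon_{r}, \\
\mathrm{E}\!\left[\mathrm{ubRMSD}^{2}\right]
  &= \mathrm{Var}(\theta_{s} - \theta_{r})
   = \sigma_{\varepsilon_{s}}^{2} + \sigma_{\varepsilon_{r}}^{2}
   \;\ge\; \sigma_{\varepsilon_{s}}^{2} = \mathrm{ubRMSE}^{2}, \\
\mathrm{E}\!\left[\mathrm{ubRMSD}_{A}^{2}\right] - \mathrm{E}\!\left[\mathrm{ubRMSD}_{B}^{2}\right]
  &= \sigma_{\varepsilon_{A}}^{2} - \sigma_{\varepsilon_{B}}^{2}.
\end{align}
```

Under these assumptions, the reference error term adds the same amount to the squared ubRMSD of every product evaluated against the same reference, so it drops out of the difference between two products.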
In addition to determining absolute performance metrics, calibration/validation (cal/val) activities are often used to quantify the relative variation of metrics between, for example, two different retrieval techniques. It is worth noting that a robust bulk characterization of relative skill in terms of ubRMSE and R can generally be obtained directly from sparse network data without the application of upscaling techniques. While point-to-footprint upscaling errors can be large, they can be treated as random in nature and independent of retrieval errors. As a result, random spatial representativeness errors can be assumed to have an equal impact on ubRMSD and R calculated for multiple products and will not affect the assessment of relative metric differences between two products [83].
A consequential assumption in SMAP validation is that the L2/L3 products provide an estimate of the surface SM in the top 5 cm (on average) and within the grid cell boundaries. This is especially important in considering the breakdown of uncertainties. The response of a microwave radiometer varies depending on the SM content and its vertical distribution (e.g., [84], [85], [86], [87]). Accounting for this effect separately in the data product would introduce another set of uncertainties, but because of the assumption, the uncertainty caused by the variable sensing depth is embedded within the product uncertainty (e.g., [88], [75]). Hence, the validation of the surface SM products is done with respect to in situ measurements that correspondingly provide an estimate for the top 5 cm of the soil column with their own set of uncertainties. Most of the in situ measurements used for SMAP validation measure the SM at 5 cm depth using a probe that is installed horizontally in the soil, which captures the SM over an approximate depth range of 3-7 cm, missing the topmost layer of the soil.
Probes that have ~5 cm prongs and are inserted vertically also capture the topmost layer and provide a truer average of water content in the 0-5 cm soil column, particularly during rapid dry-down periods after rain events [89]. Vertical installation, however, makes sensors more vulnerable to surface disturbances and, depending on the sensor, may interfere with the water flow, result in inaccurate soil temperature compensation of the probe calibration, or invalidate the assumptions about the SM measurement along the sensor prongs [90]. The study in [91] found that the practical difference between these measurements depends on the soil clay content (increasing dependence on sensor orientation with increasing clay content).

B. Core Validation Sites
The SMAP mission engaged with investigators across the globe to provide data from dense networks. The networks were assessed before the SMAP launch according to the following criteria:
1. Number of sensors: N>8 for 36-km, N>5 for 9-km and N>3 for 3-km pixels (see [54]).
2. Geographical distribution: The sensors are not clustered in only one portion of the pixel but cover (approximately) the entire pixel (although not necessarily evenly, see the next requirement).
3. Spatial upscaling: An average SM can be established based on the measured SM and ancillary information (such as additional short-term observations), see Section III.A.
4. Calibration: The sensors have undergone a calibration using additional measurements, or the calibration is otherwise verified based on past measurements.
5. Quality assessment: The time-series of each sensor is valid (no dropouts, spikes, drifting, etc.).
6. Maturity: The network has been up and running for a sufficiently long period during which the overall consistency of the measurements was verified.
7. Latency: The data is made available for SMAP validation within 1 week (at most 1 month). This criterion was applied only in the early phase of the mission.
Based on these requirements, currently 15 sites provide reference data at the 33-km scale (or 36 km for the standard P product), 17 sites provide reference data at the 9-km scale, and 8 sites provide reference data at the 3-km scale. At the 9-km and 3-km scales, a few sites include more than one independent reference pixel, for a total of 22 (15) pixels at the 9-km (3-km) scale. Two of the original 36-km sites (Kyeamba and Bell Ville) [54] did not continue to provide data from a sufficient number of stations after the initial validation period and were therefore moved to the candidate site category. One site (HOBE; [92]) was added only after the initial validation period because it had not met the first-year latency requirement. The 9-km L4 product is validated primarily using 9-km reference pixels that were selected following largely the same principles as for L2 SM validation, resulting in a nearly identical set of 18 CVS with sufficient in situ measurements for surface SM validation. Only 7 CVS provide sufficient measurements for root-zone SM validation. Table III lists the CVS and candidate sites and Figure 1 shows their locations.
Most of the 33-km CVS have more than the minimum required number of measurement locations, which would suggest an uncertainty of less than 0.03 m³/m³ for the average in situ SM across the 33-km reference pixel [54]. However, [77] found that the variability of the SM caused the confidence interval (CI) for the MD to be greater than 0.03 m³/m³ at seven of the 15 analyzed CVS. The study also accounted for the distribution of the stations (spacing) so that clustered installations had less sampling power; using this adjustment, the MD CI exceeded 0.03 m³/m³ at nine CVS. The study found that these sites would need to add about eight stations on average to meet the CI goal. At four of the CVS, temporary SM sensors were installed that provided an additional 19-34 measurement stations over one season. These temporary measurements are useful as a reference for the permanent network measurements. As expected, SM measurements from the permanent and temporary networks were very well correlated overall, but the absolute SM difference ranged from 0.009 m³/m³ to 0.034 m³/m³ [93], which supports the finding by [77] that significant uncertainties in absolute SM remain even with relatively dense spatial sampling.
Even though the candidate validation sites did not satisfy all of the core site requirements, they still offered rich data sets. The 18 candidate sites provided data from six continents and for diverse land cover and climate conditions. The candidate sites could be applied to investigate SM anomalies, the impact of RFI on SM retrievals, and performance outside the validation domain (e.g., forests).
In Section IV, updated performance metrics at the CVS are presented using the six-year data record of the latest SMAP product version for each SM product. The metrics obtained for each CVS are averaged to derive an overall representative value.
Notes to Table III: SF stands for surface and RZ for root-zone. b) Some but not all sites provide data before 4/15 and/or after 3/21. The range does not account for station outages. c) Experiments including an L-band retrieval aspect and a large-scale sampling. d) Koeppen-Geiger climate classification scheme [119]. e) MODIS-based International Geosphere-Biosphere Program classification [120].
C. Sparse Networks
During the SMAP period (2015-present), a large number of in situ SM measurements are available from across the world, albeit with a larger concentration in North America (see Figure 1). Most of these measurements are from sparse networks and do not provide SM at the spatial scale of SMAP estimates (Section III.A). Nevertheless, they still provide useful information and greatly expand the spatial coverage of the in situ validation. The periodically released SMAP assessment reports include performance metrics computed using the sparse network measurements (e.g., [49]). These metrics are not used in an absolute sense but serve to give a general indication of retrieval performance and to track consistency between algorithm versions (see Section III.A). Table IV summarizes the networks used in the SMAP validation activities. Most of the networks use conventional probe-based measurements, but the PBO H2O network uses GPS reflectometry to derive SM [121] and the COSMOS network uses neutron measurements to derive SM [122], which have different spatial and vertical support compared to probes. The use of sparse networks together with other satellite-based and land model-based SM products in TC approaches has been studied extensively (e.g., [123], [124], [125]).
The basic principle has been to use TC to statistically characterize point-to-footprint upscaling errors and then apply this characterization to correct for the biasing impact of such error on satellite validation metrics [82]. However, despite this potential, work conducted during the SMAP cal/val project revealed significant limitations in the utility of TC for this purpose. First, TC analysis is insensitive to the presence of additive or multiplicative biases in a time series. Such biases can only be detected (and thus eliminated) if TC is given access to a perfectly calibrated data set (i.e., data lacking bias of any kind and degraded only via additive random noise). This assumption, unfortunately, is not satisfied by the sparse network SM observations [57]. As a result, TC cannot contribute directly to the specification of either RMSE (which is sensitive to both additive and multiplicative biases) or even ubRMSE (which is sensitive to multiplicative bias). Even for a bias-tolerant metric like R, TC is only truly trustworthy when applied to a SM anomaly time series, that is, after removing the multi-year mean (seasonally varying) SM climatology [57], [126]. Therefore, the utility of TC analysis for the calculation of absolute SMAP cal/val metrics from sparse networks is limited to anomaly R.
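For readers unfamiliar with TC, the covariance-based form of the estimator is sketched below. This is a generic illustration rather than the analysis code used in the cited studies. It assumes three collocated anomaly series (for example a satellite product, an in situ station, and a model) that are linearly related to a common truth and whose errors are mutually uncorrelated and uncorrelated with the truth. The function name `triple_collocation` is hypothetical.

```python
import numpy as np

def triple_collocation(x, y, z):
    """Covariance-based triple collocation for three collocated series.

    Returns (err_std, r2): the error standard deviation of each series in
    its own data space and its squared correlation with the unknown truth.
    Assumes linear scaling to the truth and mutually uncorrelated errors.
    """
    data = np.vstack([x, y, z]).astype(float)
    data = data[:, np.all(np.isfinite(data), axis=0)]  # common valid samples
    c = np.cov(data)                                   # 3 x 3 covariance matrix
    err_var, r2 = [], []
    for i, j, k in [(0, 1, 2), (1, 0, 2), (2, 0, 1)]:
        signal = c[i, j] * c[i, k] / c[j, k]   # variance explained by the truth
        err_var.append(c[i, i] - signal)       # remaining (error) variance
        r2.append(signal / c[i, i])
    # sampling noise can push estimates slightly outside their valid range
    return np.sqrt(np.clip(err_var, 0.0, None)), np.clip(r2, 0.0, 1.0)
```

The sketch makes the limitation discussed above visible: only covariances enter the estimator, so additive offsets have no effect and the error standard deviations are expressed in each data set's own (possibly mis-scaled) units. The squared correlation with the truth is the quantity that does not depend on the scaling.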
We computed bias-insensitive performance metrics with respect to the sparse networks with confidence intervals using the six-year data record for each product. The metrics obtained for each network location were averaged based on land cover to derive representative values for each major land cover class (Section IV).

D. Other Global Satellite Products
Other global satellite-based SM products can be compared directly with SMAP SM estimates by computing the performance metrics between the products at each grid point [132]. Such results do not indicate the correctness of the retrievals but rather the degree of consistency between the products; anomalies in the consistency across the globe can point to potential weak points in the algorithm that are not revealed by geographically limited in situ measurements.
The intercomparison can be developed further by applying TC with additional global information sources. Provided that certain statistical assumptions (e.g., mutual error independence) are met, TC can provide unbiased estimates of anomaly R even in the absence of ground-based observations. Using a triplet of SMAP (or SMOS) SM retrievals, ASCAT SM retrievals and surface SM estimates from a land surface model, [58] validated the assumptions underlying TC (over limited areas of the globe containing ground-based observations) and, subsequently, applied TC globally to obtain unbiased, 36-km estimates of anomaly R for SMAP, ASCAT and SMOS SM retrievals. Their results illustrated that SMAP retrievals are significantly outperforming their SMOS or ASCAT equivalents over a large fraction of the globe.
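As an illustration of how TC can yield a correlation estimate without a ground reference, the following Python sketch applies a covariance-based ("extended") TC estimator to a synthetic triplet. The estimator form, the synthetic error levels, and the function name `tc_correlations` are assumptions made for this sketch; the source does not specify which TC formulation was used in [58]. The sketch assumes mutually independent, zero-mean errors that are also independent of the true signal.

```python
import numpy as np

def tc_correlations(x, y, z):
    """Estimate the correlation of each product with the unknown truth
    from the 3x3 sample covariance matrix of the triplet."""
    c = np.cov(np.vstack([x, y, z]))
    r2_x = c[0, 1] * c[0, 2] / (c[0, 0] * c[1, 2])
    r2_y = c[0, 1] * c[1, 2] / (c[1, 1] * c[0, 2])
    r2_z = c[0, 2] * c[1, 2] / (c[2, 2] * c[0, 1])
    return np.sqrt(np.array([r2_x, r2_y, r2_z]))

# Synthetic anomaly triplet with different gains, offsets and noise levels
rng = np.random.default_rng(1)
n = 50000
truth = 0.04 * rng.standard_normal(n)
x = truth + 0.02 * rng.standard_normal(n)                # e.g., a radiometer product
y = 0.8 * truth + 0.05 + 0.03 * rng.standard_normal(n)   # e.g., a scatterometer product
z = 1.2 * truth - 0.02 + 0.04 * rng.standard_normal(n)   # e.g., a land model

r_tc = tc_correlations(x, y, z)
r_true = np.array([np.corrcoef(p, truth)[0, 1] for p in (x, y, z)])
print("TC estimate:       ", r_tc)
print("Against the truth: ", r_true)
```

The additive and multiplicative biases imposed on `y` and `z` do not affect the estimates, which is consistent with the bias insensitivity of TC noted in Section III.C.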
SMOS provides the most relevant satellite products for comparison with the SMAP SM because it uses the same L-band frequency as SMAP. Moreover, SMOS retrievals utilize multi-angle TB measurements to compensate for the vegetation effect, but otherwise the retrieval approaches are similar. The comparisons between SMAP and SMOS were updated over the SMAP lifetime (Section IV). The SMAP PE product on the 9-km grid allows for interpolation of the SM values to the SMOS grid with minimal loss of information because the 9-km SMAP grid spacing is less than half of the effective resolution of the product (Nyquist Sampling Theorem). Therefore, the metrics (ubRMSD, MD, RMSD, Pearson correlation) were computed over each SMOS grid point with sufficient valid retrievals with the SMAP PE (SCA-V algorithm) and SMOS L3 products after applying the quality flags for both products. If any data point used in the interpolation of the SMAP PE product was flagged, the result was also flagged.
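The flag-propagation rule described above can be sketched in Python as follows. The example uses bilinear interpolation in grid-index space on a small synthetic array; the interpolation method, the index-space simplification, and the function name `bilinear_with_flags` are assumptions for illustration only, since the source does not state which interpolation scheme was applied between the SMAP and SMOS grids.

```python
import numpy as np

def bilinear_with_flags(grid, flagged, row, col):
    """Bilinearly interpolate `grid` at fractional index (row, col).
    Returns NaN if any of the four contributing nodes is flagged.
    Assumes 0 <= row < nrows - 1 and 0 <= col < ncols - 1."""
    r0, c0 = int(np.floor(row)), int(np.floor(col))
    r1, c1 = r0 + 1, c0 + 1
    if flagged[r0:r1 + 1, c0:c1 + 1].any():
        return np.nan  # propagate the quality flag to the result
    fr, fc = row - r0, col - c0
    top = (1 - fc) * grid[r0, c0] + fc * grid[r0, c1]
    bottom = (1 - fc) * grid[r1, c0] + fc * grid[r1, c1]
    return (1 - fr) * top + fr * bottom

# Synthetic 3 x 3 SM field (m3/m3) with one flagged node
grid = np.array([[0.10, 0.20, 0.30],
                 [0.30, 0.40, 0.50],
                 [0.50, 0.60, 0.70]])
flagged = np.zeros(grid.shape, dtype=bool)
flagged[2, 2] = True

inside = bilinear_with_flags(grid, flagged, 0.5, 0.5)     # all four nodes valid
near_flag = bilinear_with_flags(grid, flagged, 1.5, 1.5)  # touches the flagged node
print(inside, near_flag)
```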

E. Global Model-based Products
Like other satellite products, land surface model-based SM products cannot be used as a direct reference for SMAP SM validation [133]. Nevertheless, land models capture the most relevant hydrological processes and SM dynamics based on a large set of input parameters, including precipitation, and therefore offer an additional source of SM information at global scale and at very high temporal resolution. Various land model products have been used in SMAP validation studies, especially to support TC analyses, including the GMAO Nature Run [57][58], MERRA-2 [34], ECMWF ERA [134], and ECMWF H-TESSEL [58]. These studies exploit the independence of the model products from remote and in situ measurement to compensate remote-sensing error metrics for the impact of random error in model and in situ reference products.

F. Field Experiments
Depending on their particular focus, field experiments have supported the testing of specific algorithm features under a limited set of conditions. These analyses can be supported by CVS data sets as they offer a longer-term reference set, albeit without the supporting measurements provided in a field experiment. For this reason, the SMAP Validation Experiment 2015 (SMAPVEX15) [103] and SMAPVEX16 [99] focused on the locations of the Walnut Gulch, South Fork and Carman CVS. The main objectives of these experiments were, respectively, to support the development and validation of the SM spatial disaggregation algorithm used by SMAP and to provide additional insight into algorithms over agricultural domains, where the analysis of the first-year retrievals revealed specific issues [36]. The SMAP Experiment-4 (SMAPEx-4) and SMAPEx-5 in Australia at the Yanco CVS were also designed to support the development and validation of SM downscaling algorithms [95].
Unfortunately, only SMAPEx-4 was executed before the SMAP radar malfunctioned; when SMAPVEX15 and SMAPEx-5 were conducted later in 2015, only the SMAP radiometer was operational. The Copernicus Sentinel-1 mission now used for the SP product did not make measurements over the SMAPVEX15 site at that time, but did cover the SMAPEx-5 site [135]. The topic is particularly important because the validation of the spatial disaggregation techniques is exceptionally difficult. The issue is related to the challenge of reliably characterizing the SM areal mean (discussed in Section III.A), since, in order to show that a disaggregation approach works as intended, absolute SM levels in neighboring grid cells need to be known accurately (see also Section IV.B). Furthermore, the difference between the average SM in the cells needs to be large enough for the disaggregation to have a meaningful impact on the retrieved SM pattern. Such cases are extremely difficult to capture during a short-term field experiment.
Airborne measurements providing high-resolution SM retrievals are well-suited for evaluating the effectiveness of downscaling approaches because they provide a spatially distributed SM reference (e.g., [136]). The data collected during SMAPVEX15 (despite not being able to test SMAP radar-based disaggregation) was useful for examining: sub-footprint spatial heterogeneity; discrepancies in 5-cm in situ sensor readings and SMAP measurements with the help of rain gauge records [103]; the effectiveness of a high-resolution hydrological model for SM validation [137], and surface roughness effects on SM retrievals [138].
Over agricultural areas, SM retrievals face rapidly changing vegetation and surface roughness conditions that may be large enough to disrupt L-band retrieval of SM (e.g., [61][60] [139]). In SMAPVEX16, the vegetation water content (VWC) and surface roughness were sampled at multiple fields within the South Fork and Carman CVS. VWC was measured multiple times over the growing season. While the VWC calibrated using the experiment data [109] shows significant differences with respect to the data used by the SMAP algorithm, the differences cannot by themselves explain the retrieval errors of the operational algorithm [99].

G. Assimilation Diagnostics
In any operational data assimilation system, model estimates are routinely confronted with the assimilated observations. The L4 algorithm computes, in 3-hourly intervals, the difference between the SMAP TB observations that are available during each 3-hour period and the corresponding model forecast TB [140]. These observation-minus-forecast (O-F) TB residuals encapsulate the new information provided by the SMAP observations to the modeling system; they consequently form the basis of the L4 SM analysis, which converts the O-F TB residuals into corrections to the modeled SM estimates (a.k.a. SM increments). Because the O-F TB difference involves only TB observations that have not contributed to the corresponding TB forecast, the statistics of the O-F TB residuals provide independent verification of the quality of the model's TB estimates and, by extension, the model's SM estimates within the assimilation system. Specifically, in a well-calibrated, unbiased assimilation system, the time series mean of the O-F TB residuals should be close to zero. Moreover, the typical magnitude of the O-F TB residuals (computed as their time series standard deviation) should be consistent with the error assumptions underpinning the assimilation system. Finally, a well-designed land data assimilation (parameterized with accurate measurements of both model and observation errors) will also minimize the temporal standard deviation of O-F TB residuals. This principle is particularly useful when evaluating new L4 algorithm versions. For example, the land model revisions in Version 4 of the L4 algorithm resulted in a reduction of the typical magnitude of the O-F TB residuals by 0.13 K compared to Version 3 [51].
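The two O-F diagnostics discussed above (time series mean and standard deviation) can be sketched in Python as follows. The array layout (time along the first axis), the use of NaN for time steps without observations, the synthetic values, and the function name `omf_diagnostics` are assumptions for illustration and do not represent the L4 processing code.

```python
import numpy as np

def omf_diagnostics(tb_obs, tb_fcst):
    """Time series mean and standard deviation of the O-F TB residuals [K].
    Inputs have shape (time, n_cells); NaN marks missing observations."""
    omf = np.asarray(tb_obs, dtype=float) - np.asarray(tb_fcst, dtype=float)
    mean_omf = np.nanmean(omf, axis=0)
    std_omf = np.nanstd(omf, axis=0, ddof=1)
    return mean_omf, std_omf

# Synthetic example: 1000 time steps at 4 grid cells, arbitrary 5 K residual spread
rng = np.random.default_rng(2)
tb_fcst = 260.0 + 10.0 * rng.standard_normal((1000, 4))
tb_obs = tb_fcst + 5.0 * rng.standard_normal((1000, 4))
tb_obs[::3, :] = np.nan  # time steps without an overpass

mean_omf, std_omf = omf_diagnostics(tb_obs, tb_fcst)
print("mean O-F [K]:", mean_omf)
print("std  O-F [K]:", std_omf)
```

A mean close to zero indicates an unbiased system in the sense described above, while the standard deviation can be compared against the error assumptions of the assimilation system.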

A. Radiometer-Based Product (PE)
This section presents updated validation results for the PE algorithm only. Results for the P algorithm are essentially the same (not shown). Figure 2 shows histograms of the validation metrics for the 6 AM overpasses computed over the CVS, with the average metric indicated by the vertical blue line. Metrics include the ubRMSD, MD, RMSD, R and anomaly R. The numbers in the plots list the average and median metrics, along with lower and upper bounds of the 95% confidence interval of the average metric (see Appendix B). The performance of the SCA-V (current baseline) and DCA algorithms is essentially the same (well within the confidence interval), while the SCA-H algorithm has a significantly larger MD and ubRMSD.
When categorized based on land cover, the grassland-dominated sites have markedly better performance metrics compared to agriculture-dominated sites (Table 5). Across all sites, the DCA performed slightly better than SCA-V (0.036 vs 0.038 m 3 /m 3 ubRMSD). This difference stems from the agricultural sites, where DCA is better at addressing rapid temporal variability in vegetation attenuation characteristics than SCA-V due to the latter's use of a prescribed NDVI-based VWC climatology.
The metrics were also compared to the soil texture and the mean and variance of the VWC at each site. The soil texture was based on the values obtained from in situ samples where available and from the global data set used for the SM retrieval otherwise. The only parameter pair that resulted in a meaningful correlation was the MD versus soil clay content. Figure 3 shows the scatterplot for the MD of each algorithm as a function of the clay content at the CVS. SCA-V exhibits the strongest correlation, while DCA has a somewhat weaker correlation; for SCA-H, the correlation is only marginally meaningful (P-value 0.016). The MD has the most uncertainty of the metrics as discussed in Section III.A, but the level of correlation with clay content is compelling, especially for SCA-V. Although the result seems to indicate a systematic bias in the SMAP SM, the effect may be linked to the vertical distribution of water in the top layers of the soil and the difference between the in situ sensor and the SMAP measurements as the clay content impacts the water retention in the soil. It is also possible that the clay content correlates with the upscaling uncertainty contributing to the observed relationship.
Next, Figure 4 shows histograms of ubRMSD and correlation vs. the sparse network measurements for each algorithm, broken down by grasslands and croplands. The SCA-H has somewhat weaker performance than SCA-V and DCA, particularly for croplands, consistent with the CVS results. All algorithms performed significantly worse for croplands compared to grasslands. As discussed in Section III.C, the representativeness errors over sparse network sites are expected to be large, which is reflected also in the level of the ubRMSD and R values, which are generally worse than for CVS. Furthermore, the sparse networks are particularly susceptible to representativeness errors over croplands, which have relatively higher heterogeneity compared to grasslands. Moreover, the in situ measurements in croplands are typically installed next to the actual fields to avoid interfering with the cultivation activities, which may exacerbate the problem. The metrics of the core site and sparse network comparisons for AM and PM overpasses are tabulated in Appendix C.

Finally, Figure 5 shows a comparison of the SMAP L3SMPE SCA-V with the SMOS L3 SM product from their morning overpasses. The quality flags of the products were cross applied before the comparisons. For reference, panels (a) and (b) show the time series of mean and standard deviation, respectively, for SMAP SM for the nearly 6-year period from 31 March 2015 to 13 March 2021. As expected, the mean and variation are small in arid and desert regions, such as the Sahara, the Arabian Peninsula, and western Australia. Large variations are seen in the Pampas, in the savannas south of the Sahel region, in the savannas south of the Congo rainforest and in eastern Australia. Panel (c) shows the number of valid SMAP-SMOS data pairs used in the comparison, indicating good coverage except in forested regions. Panel (d) shows the MD between the products.
SMAP has generally wetter features in the western Sahel, southern Congo rainforest and eastern India, for example. Panel (e) shows the ubRMSD between SMAP and SMOS. First, regions with low SM variability have naturally low ubRMSD, and regions with high variability are prone to have a high ubRMSD. Therefore, the patterns in SM variability (panel (b)) are, for the most part, repeated in those of the ubRMSD (panel (e)). However, the savannas south of the Sahel and in eastern Australia exhibit relatively low ubRMSD indicating a particularly good match between the SMAP and SMOS products there. Finally, panel (f) shows the correlation between products.
Significant parts of the globe have strong correlations, including regions with relatively low SM variability such as western Australia. Owing to the lack of underlying signal, low correlations are expected in areas with extremely low SM variability, such as the Arabian Peninsula and Sahara Desert. There are regions where the differences between the products draw more attention. For example, the Indian subcontinent has relatively large ubRMSD and low correlation, which are not directly explained by high or low variability, respectively, and relatively large and varying MDs. Based on a global TC analysis, [58] suggest that SMAP SM is more reliable than the SMOS L3 SM product (v300) in this region.

Figure 6 shows the histograms of the validation metrics over the CVS for the SP product at 9-km and 3-km resolution. Anomaly correlation was not computed because the number of SP retrievals does not allow for a reliable computation of the climatology. The number of sites used is less than shown in Table III. The smaller number of sites reflects the limited coverage by Sentinel-1; SP retrievals are also flagged when the site happens to be systematically on the edge of a Sentinel-1 data granule [48]. The average ubRMSD of 0.035 m 3 /m 3 at the 9-km scale is below the 0.04 m 3 /m 3 ubRMSE threshold of the requirement. The difference to the ubRMSD of the PE SCA-V and DCA products is within the statistical confidence intervals. The average MD is reasonable, but with large variation from site to site, which is reflected in the relatively large average RMSD value (compared to that of the SCA-V and DCA algorithms of the PE product). The correlation is in line with the good ubRMSD performance. Considering that the SP algorithm is susceptible to additional uncertainties because of the disaggregation scheme, the performance is overall very satisfactory at 9-km.
The narrow distribution of the individual site results around the average value, especially for ubRMSD and correlations, is also a good sign regarding the consistency of the performance.

B. SMAP/Sentinel-1 Combined Product (SP)
At the 3-km scale, the overall number of sites is smaller than at the 9-km scale. Additionally, as in the 9-km case, the number of sites used in the comparison is smaller than values shown in Table III because of the limited Sentinel-1 coverage. However, at 3-km, the site-to-site consistency observed in the 9-km evaluation breaks down. The reliability of the aggregate results is somewhat questionable, given the large dispersion of the site-specific results around the average values. However, the requirement of only three stations within the footprint for the 3-km sites seems particularly small (relative to requirements enforced at 9- and 33-km) and may be the primary reason for the large spread in the metrics. Naturally, some of the dispersion is due to the retrieval performance; however, the current reliability of 3-km results remains relatively low.

Figure 7 shows the sparse network metrics for the SP product for grasslands and croplands at the 3-km scale. The number of stations is not the same as in the PE comparison (Figure 4) because of the coverage of the SP product, as discussed above. The performance over grasslands is identical to that of the PE product. Over croplands, the correlation is very similar to that of the PE product and the average ubRMSD is somewhat smaller.

Figure 8 shows the spatial distribution of SM for a 400 km by 300 km area in Georgia, USA (a) based on the PE product (b) and the SP product at the 3-km resolution (c) and aggregated up to the 9-km resolution (d). Panel (b) illustrates the true resolution of the PE product (33 km) as the SM features are spatially smoothed over the 9-km grid. In contrast, the SM features of the 9-km SP product distinctly follow the 9-km grid, and the same is true for the 3-km resolution (d).
The observations are consistent with the SP algorithm principle, which relies on the TB observations to provide the information content on the absolute SM level at the coarse scale (33 km), while the backscatter observations provide the information content on the higher-resolution spatial variations starting from 1 km (not shown).
In principle, the smaller temporal ubRMSD obtained with the SP product over croplands implies that the 3-km SP product, through its disaggregation technique, compensates for some of the heterogeneity that the PE product cannot resolve. However, to evaluate this possibility quantitatively, the skill of the spatial disaggregation would need to be assessed using approaches focused on spatial measures of downscaling performance (as discussed in Section III.F).

Figure 9 summarizes the performance of the L4 surface and rootzone SM product across the CVS for the 9-km reference pixels (Table 3). Figure 9 is based on 3-hourly data and 9-km pixels at 18 core sites. When evaluated at the 33-km reference pixels (not shown), the ubRMSD for L4 surface SM drops to 0.037 m 3 /m 3 and is thus comparable to that of the L2SMPE product (Figure 2).

C. Assimilation Product (L4)
Next, Figure 11 summarizes the performance of L4 SM across 178 grassland and 94 cropland sparse network stations. Rootzone measurements are not available at all stations, and at one station each for surface and root-zone SM there is no anomaly correlation metric because the number of available measurements was not sufficient to compute the climatology needed for this metric. The average ubRMSD for L4 surface (root-zone) SM is ~0.056 (~0.038) m 3 /m 3 , with typical values ranging from 0.03 to 0.09 m 3 /m 3 for surface SM and from 0.02 to 0.07 m 3 /m 3 for root-zone SM. The surface SM ubRMSD is slightly higher for cropland than grassland stations, but the difference is much less pronounced than for the L2 product (Figure 4 and Figure 7); for L4 root-zone SM, the ubRMSD is nearly identical for grasslands and croplands. As expected, the ubRMSD at the sparse network stations is larger than at the CVS (Figure 9) due to enhanced levels of upscaling error associated with characterizing SM at the 9 km scale using only one or two point-scale sparse network measurements.
Typical values for the L4 correlation and anomaly correlation at the sparse network stations range from 0.5 to 0.9 at a large majority of the stations, with average metrics falling between 0.65 and 0.74 for surface and between 0.59 and 0.70 for rootzone SM (Figure 11). Interestingly, the L4 correlation and anomaly correlation skill is better by ~0.04 on average for cropland than for grassland stations, which is the opposite of the result seen for the L2 product (Figure 4 and Figure 7). Most of the sparse network stations are in the continental US, where the L4 algorithm benefits from high-quality land model background estimates owing to the dense network of precipitation gauges available to force the land surface model. Consequently, L4 SM should be less sensitive than the L2 retrievals to errors incurred in the challenging parameterization of the surface radiative transfer equations over cropland.
Finally, the L4 algorithm's consistency between the assimilated SMAP TB observations and the corresponding TB model forecasts was examined (Section III.G). Figure 10(a) shows a global map of the mean O-F TB residuals from the L4 algorithm, with a global average value of only 0.06 K and an average absolute value of just 0.29 K. The small values primarily reflect the impact of the climatological rescaling of the assimilated SMAP TB observations prior to their assimilation into the land surface model [140]. The L4 algorithm, through this TB rescaling, efficiently assimilates the time series anomaly information contained in the SMAP TB observations while ensuring that the analysis is unbiased. Whereas earlier versions of the L4 algorithm relied on SMOS TB observations to determine the rescaling parameters and resulted in mean absolute O-F TB values of ~0.6 K, only SMAP observations are used to compute the rescaling parameters in Version 5. This, together with improvements in the underlying modeling system, considerably improved the algorithm calibration. Figure 10   forcing used in the land modeling system. These errors were revealed by the assimilation of SMAP TB observations in the L4 algorithm [140], [59]. The global average O-F TB standard deviation is 5.5 K in the Version 5 algorithm, which represents a ~0.2 K reduction from the corresponding value in the Version 4 algorithm [51]. This reduction in the typical magnitude of the O-F TB residuals reflects improvements in the underlying land surface modeling system as well as in the calibration stability of the assimilated SMAP TB observations [141].

V. DISCUSSION
The CVS analysis illustrates that the performance of the current versions of the SMAP SM products is as expected based on earlier results. The PE product (and, by extension, the P product) meets the mission requirements by achieving ubRMSD of less than 0.04 m 3 /m 3 (with both SCA-V and DCA algorithms). The enhancements in the DCA algorithm [50] helped achieve a mean performance of less than 0.04 m 3 /m 3 ubRMSD also over agricultural areas, although some of the individual sites still exhibit performance not meeting expectations.
The temporal performance of the SP product at 9-km based on the CVS comparisons is satisfactory. The CVS comparisons at 3-km suffer from a lack of sites and the low number of sampling points within the 3-km reference pixels (even though meeting the original requirement of three stations). The validation of high-resolution and/or disaggregated SM products will need significantly more resources in the future for completing a full assessment of these products. The availability of sites meeting the original CVS requirements (Section III.B) at 3-km (or 1-km) is not adequate, and even sites meeting these criteria do not reliably capture the area-average SM. The skill of the spatial downscaling algorithm, as discussed in Section IV.B, is difficult to resolve. Improving the skill requires measurement setups, such as airborne field experiments (e.g., [142], [135]) or particularly dense measurement networks close to each other, which are not commonly available. This is a very significant aspect of the validation of spatially downscaled products that is often overlooked in algorithm assessments.
The performance of L4 surface SM compared well with the PE and SP products over the CVS. Like the SP product, the root-zone product comparisons also suffer from a lack of suitable reference sites, with only 7 independent CVS locations currently available.
As discussed in Section III.B, studies have found that to estimate reliably the absolute level of the area-average SM, the number of point measurements required seems to be larger than originally estimated. The SMAP criteria for the number of spatially distributed measurement locations were computed based on the relationship between SM variability in the area and the desired accuracy with a certain level of confidence presented in [67]. The original computation assumed 70% confidence, and for the 3-km and 9-km scales, a 0.05 m 3 /m 3 target accuracy with an assumption of a 0.05 m 3 /m 3 SM spatial variation within the pixel, which resulted in three and five measurement-location minimums, respectively [54]. With a 90% confidence, a 0.03 m 3 /m 3 target accuracy and spatial variation assumptions of 0.06 m 3 /m 3 and 0.07 m 3 /m 3 (which are more in line with literature, e.g., [67]) the corresponding  [77] (which also applied 90% confidence in the computations). This translates into a strong desire to not only see more CVS at different spatial scales but to see them deploy even denser networks at all scales to enable an accurate computation of the bias-sensitive SM metrics. For the bias-insensitive metrics (e.g., R or ubRMSE), the requirements are not as strict; the sampling currently available at the 33-km sites provides a solid reference for computing these metrics [77]. At smaller scales (9-km, 3-km, and even 1-km and below), the availability of sites with an adequate sampling, even for resolving the bias-insensitive metrics, is scarcer. Inadequate validation resources will hinder the development of the SM products overall because spatially representative validation references are needed to reveal the true performance of the algorithms; otherwise, the representativeness errors will dominate the comparisons and algorithm improvement will be difficult.
The SMAP SM products have been investigated in several other studies using various dense and sparse networks. Most of the networks are captured in this study, but there are a few additional ones, including the BIEBRZA-S-1 network in Poland [143], the RSMN network in Romania [144], CTP-SMTMN in China [145] and an agricultural area in China [146]. Several studies include complementary analysis approaches and comparisons to other spaceborne SM products over the in situ measurement sites, for example, [36], [147], [25], [143] and [146]. In all these studies, the SMAP performance over the study sites was rated very favorably with respect to the other products. Most of these studies did not include confidence intervals and the differences were small in some cases. In [146], SMAP was the only product to produce reasonable performance over a corn field in China. The study also highlighted the importance of the varying surface roughness in agricultural areas.
Several of these studies also assess the surface temperature used in the SM retrieval algorithm. These comparisons shed light on potential systematic errors arising from the estimation of the effective soil temperature needed for the inversion of the radiative transfer model [148], even though the effective soil temperature (based on GEOS model analysis soil temperature for the SMAP algorithms, [49]) generally differs from the physical soil temperature. This aspect is also particularly important for the consistency of the retrievals between the 6 AM and 6 PM overpasses. Different soil temperatures and the different vertical distribution of temperature in the soil-vegetation continuum can cause systematic differences in the retrievals even though their overall performance metrics are similar.
Capturing SM during or right after precipitation is important for many hydrological applications. Reference [149] quantified the retrieval degradation over CVS during and right after (high vertical gradient in SM) rain events, showed that the SMAP PE product maintains sensitivity to SM even during rain events, and suggested that flagging of rain events may be unnecessary to ensure SM retrieval quality.
The utility of the SMAP SM products has also been revealed through means other than comparisons to reference measurements. For example, [150] showed that the L4 product is consistent with SM condition surveys conducted by USDA National Agricultural Statistics Service volunteers, indicating the value of the SMAP observations in the prediction of crop yield by geographical area. Reference [151] showed the value of SMAP SM in clarifying water supply controls affecting ecosystem productivity and land-atmosphere CO2 exchange. Reference [152] showed that the SMAP SM improves evapotranspiration retrievals for water-limited regions. References [153] and [154] used an analysis of SMAP SM dynamics to investigate the coupling between SM and energy fluxes. Reference [155] showed the value of the SMAP SM data in improving hydrologic forecasts and [83] showed the SMAP products can achieve meaningful correlation between SM and near-surface air temperature.
During the first three years of the SMAP mission, the objective was to provide SM products that meet the requirements across the validation domain. Thereafter, the objective was broadened to include areas outside of the original validation domain, including forests. The mission is currently engaged in exploring the improvement and validation of SMAP SM products over forested areas through field experiments [52], [156], an added focus on forested candidate CVS (see Table III) and other networks, such as the National Ecological Observatory Network (NEON) [53]. Furthermore, the effort to expand the validation domain includes accounting for the complex soil composition of the boreal and arctic regions. One obstacle in addressing the retrieval issues in these areas has been the distortion of the global projection of the EASE v2 grid at high latitudes [157]. One solution is the use of the north polar grid projection, which is currently being implemented by the SMAP mission.

VI. CONCLUSION
The validation of six years of SMAP SM products demonstrates that they meet the accuracy requirements set for the mission. The DCA algorithm of the radiometer-based enhanced product (PE) exhibits the best performance, although the differences between the DCA and SCA-V are small. All of the algorithms show a relative degradation of the performance over croplands where the retrievals are challenged by rapidly changing vegetation and landscape heterogeneity. DCA is the only algorithm maintaining less than 0.04 m 3 /m 3 mean ubRMSD for the agriculturally dominated CVS. The PE product is also consistent with the SMOS L3 product across most parts of the globe. The validation of the 3-km SM product is hindered by the small number of high-quality validation pixels and the limited temporal and spatial coverage of Copernicus Sentinel-1 data. When aggregating the 3-km SM up to 9-km, the evaluation is more robust and the performance is satisfactory. The 9-km L4 product provides surface and rootzone SM, and the performance of both meets the mission accuracy requirement. Notably, L4 SM does not exhibit similar degradation of performance over croplands as the L2 products.
The SMAP validation program has fostered an increased use of in situ resources for SM validation. At the same time, studies have found that the spatial sampling requirements for the CVS may need to be even higher than originally planned for SMAP to accurately measure the area-average absolute SM [77]. Going forward, it would be important for the community to support efforts that aim at providing more accurate reference data at all spatial scales. Counterintuitively, the availability of reference data is more restrictive at smaller scales (1 and 3 km) than at coarser scales. Accurate reference data, designed to capture true SM conditions at a variety of spatial scales, are the only way to ensure continued improvement in the quality of satellite-based SM products.

ACKNOWLEDGMENTS
Funding for this work was provided by the NASA SMAP mission. The research described in this publication was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. Computational resources for L4 production were provided by the NASA High-End Computing program through the NASA Center for Climate Simulation. The University of Salamanca team involvement in this study was supported by the Spanish Ministry of Science, Innovation and Universities (project ESP2017-89463-C3-3-R), the Castilla y León Government (projects SA112P20 and CLU-2018-04) and the European Regional Development Fund (ERDF). SMAP radiometer and SM data products and SMAP core site validation data are available from the National Snow and Ice Data Center (https://nsidc.org/data/smap). SMAP radar products are available from the Alaska Satellite Facility (https://asf.alaska.edu/data-sets/sar-data-sets/smap/smap-dataand-imagery/). This research was supported by the U.S. Department of Agriculture, Agricultural Research Service. USDA is an equal opportunity provider and employer. This research was a contribution from the Long-Term Agroecosystem Research (LTAR) network. LTAR is supported by the United States Department of Agriculture.

APPENDIX A
CEOS has put forward a four-stage validation hierarchy, which has been adopted by many data providers (https://lpvs.gsfc.nasa.gov). The validation stage increases with increasing product maturity and extensiveness of the validation effort. It is a useful guide for assessing the progress of a validation program.

APPENDIX B
This Appendix describes the computation of the performance metrics and the statistical confidence intervals. The performance metrics are computed following [55]. The root-mean-square difference (RMSD) is defined as

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2}$$

where $x_i$ are the SMAP soil moisture samples, $y_i$ are the in situ soil moisture samples (either core validation site or sparse network), and $N$ is the number of samples. The mean difference (MD) is defined as

$$\mathrm{MD} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)$$

The unbiased RMSD (ubRMSD) is defined as

$$\mathrm{ubRMSD} = \sqrt{\mathrm{RMSD}^2 - \mathrm{MD}^2}$$

The Pearson correlation (R) is defined as

$$R = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}}$$

where the overbar denotes average.

The confidence intervals (CI) of the aforementioned metrics are calculated following [34]. The CI of the MD for one measurement location is based on $t^{N-1}_{\alpha/2}$, the value at $\alpha/2$ for the t-distribution with $N-1$ degrees of freedom. The CI of the ubRMSD for one measurement location is based on $\chi^{N-1}_{1-\alpha/2}$, the value at $1-\alpha/2$ for the $\chi^2$-distribution with $N-1$ degrees of freedom. A CI of the RMSD is likewise computed for one measurement location. The CI of R for one measurement location is computed using $F^{-1}$, the normal inverse function with mean 0 and standard deviation 1. In the calculation of the CI of R, the effective number of samples is computed using

$$\rho = \sqrt{\rho_x \rho_y}$$

in which $\rho_x$ and $\rho_y$ are the 1-lag autocorrelations of the SMAP and in situ SM samples, respectively. The confidence interval for the anomaly R is computed similarly as for R. For the 95% confidence intervals, $\alpha = 0.05$.
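For orientation, single-location intervals of this kind are often written in the textbook forms below. This is a sketch under stated assumptions, not a transcription of the expressions in [34]: the symbols $d_i = x_i - y_i$ and $s_d$ (the sample standard deviation of $d_i$), the two-sided form of the interval for the spread of the differences, and the Fisher z-transform for R are all assumed here. The effective number of samples $N_e$ is taken as given, and no interval for the RMSD is attempted. The subscripts on $t$ and $\chi$ denote quantile levels.

```latex
% Assumed textbook forms (independent, normally distributed differences).
% d_i = x_i - y_i; s_d is the sample standard deviation of d_i,
% so that s_d = ubRMSD * sqrt(N / (N - 1)).
\begin{align*}
\mathrm{CI}_{\mathrm{MD}} &= \mathrm{MD} \pm t^{N-1}_{\alpha/2}\,\frac{s_d}{\sqrt{N}}, \\
\mathrm{CI}_{\sigma_d} &= \left[\, s_d\sqrt{\frac{N-1}{\chi^{N-1}_{1-\alpha/2}}},\;
                               s_d\sqrt{\frac{N-1}{\chi^{N-1}_{\alpha/2}}} \,\right], \\
z &= \frac{1}{2}\ln\frac{1+R}{1-R}, \qquad
z_{L,U} = z \mp \frac{F^{-1}(1-\alpha/2)}{\sqrt{N_e-3}}, \\
\mathrm{CI}_{R} &= \left[\,\tanh z_L,\; \tanh z_U\,\right].
\end{align*}
```

Because the t-distribution is symmetric, the $\pm$ in the first line covers both bounds regardless of the sign of $t^{N-1}_{\alpha/2}$.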
For the average metrics, the confidence intervals of the separate locations are combined as follows: where Pj denotes the metric (MD, ubRMSD, RMSD, R, or anomaly R) whose confidence intervals are computed for site j, and M denotes the number of sites.
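The single-location metrics described in this Appendix (RMSD, MD, ubRMSD, and R) can be illustrated with a short Python sketch. The function names `validation_metrics` and `fisher_ci` are hypothetical and are not part of any SMAP software. The interval for R uses the standard Fisher z-transform, which is assumed here rather than taken from [34], and the effective number of samples `n_eff` is supplied by the caller because its exact expression is not reproduced in this sketch. The combination of intervals across sites is not attempted.

```python
import math
from statistics import NormalDist


def validation_metrics(x, y):
    """Return RMSD, MD, ubRMSD and Pearson R for paired samples.

    x: satellite soil moisture samples; y: in situ samples (same length).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    diffs = [xi - yi for xi, yi in zip(x, y)]
    md = sum(diffs) / n                              # mean difference
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)  # root-mean-square difference
    # ubRMSD^2 = RMSD^2 - MD^2; clamp at zero against rounding error.
    ubrmsd = math.sqrt(max(rmsd ** 2 - md ** 2, 0.0))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    r = sxy / math.sqrt(sxx * syy)
    return {"rmsd": rmsd, "md": md, "ubrmsd": ubrmsd, "r": r}


def fisher_ci(r, n_eff, alpha=0.05):
    """Fisher z-transform interval for a correlation r (assumed form).

    n_eff is the effective number of samples, provided by the caller.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n_eff <= 3:
        raise ValueError("n_eff must exceed 3")
    z = math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n_eff - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


if __name__ == "__main__":
    # Made-up illustrative values in m^3/m^3, not SMAP data.
    smap = [0.10, 0.20, 0.30, 0.40]
    in_situ = [0.12, 0.18, 0.33, 0.41]
    print(validation_metrics(smap, in_situ))
    print(fisher_ci(0.8, n_eff=50))
```

Passing `n_eff` explicitly keeps the autocorrelation correction separate from the interval itself, so a different effective-sample-size formula can be substituted without touching the rest of the code.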

APPENDIX C
A. Result Tables for Radiometer-Based Product (PE)
Table VII and Table VI show the PE product CVS metrics for the 6 AM and 6 PM overpasses, respectively. Table VIII shows the sparse network comparison results for the PE product.

Table IX and Table XI show the SP product CVS metrics for the 9-km and 3-km scales, respectively. Table X shows the sparse network comparison results for the SP product at the 3-km scale.