Evaluation of Global Gridded Precipitation and Temperature Datasets against Gauged Observations over the Upper Tekeze River Basin, Ethiopia

The availability of satellite and reanalysis climate datasets and their applicability have been greatly promoted in hydro-climatic studies. However, such climatic products are still subject to considerable uncertainties and an evaluation of the products is necessary for applications in specific regions. This study aims to evaluate the reliability of nine gridded precipitation and temperature datasets against ground-based observations in the upper Tekeze River basin (UTB) of Ethiopia from 1982 to 2016. Precipitation, maximum temperature (Tmax), minimum temperature (Tmin), and mean temperature (Tmean) were evaluated at daily and monthly timescales. The results show that the best estimates of precipitation are from the EartH2Observe, WFDEI, and ERA-Interim reanalysis data Merged and Bias-corrected for the Inter-Sectoral Impact Model Intercomparison Project (EWEMBI), and the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) datasets. The percentage biases and correlation coefficients (CCs) are within ± 15% and > 0.5, respectively, for both EWEMBI and CHIRPS at the two timescales. All products underestimate the drought conditions indicated by the standardized precipitation index (SPI), while the EWEMBI and CHIRPS datasets show higher agreement with the observations than other datasets. The Tmean estimates produced by the ECMWF Re-Analysis version 5 (ERA5) and the Climate Hazards Group InfraRed Temperature with Station data (CHIRTS) are the closest to the observations, with CCs of 0.65 and 0.55, respectively, at the daily timescale. The CHIRTS and EWEMBI datasets show better representations of Tmax (Tmin), with CCs of 0.69 (0.72) and 0.62 (0.68), respectively, at the monthly timescale. The temperature extremes are better captured by the ERA5 (Tmean), CHIRTS (Tmax), and EWEMBI (Tmin) datasets. 
The findings of this study provide useful information to select the most appropriate dataset for hydrometeorological studies in the UTB and could help to improve the regional representation of global datasets.


Introduction
Climate variables play essential roles in the components of the global water cycle. Specifically, rainfall has a dominant influence on the hydrological cycle and thus affects the modeling of extreme hydrological events (floods and droughts), water management, and natural disasters (landslides and earthquakes) (Tan et al., 2018; Gyasi-Agyei, 2020; Satgé et al., 2020). Similarly, temperature has a significant influence on heat and cold waves (Costa et al., 2020). However, the variability of precipitation and temperature in space and time makes them difficult to measure and estimate.
The temporal and spatial coverage of gauge observations and their quality standards strongly influence the accuracy of hydrometeorological studies (Li et al., 2020). The quality and stability of climatic data from ground meteorological stations may not be sufficient due to a limited number of stations, uneven spatial distribution, and vulnerability to human and environmental factors (Wang et al., 2020). These issues are more pronounced in developing countries (Boke, 2017). Economic and political problems often result in sparse and nonuniformly distributed gauge networks that are unable to capture temporal and spatial climatic variability (Le Coz and van de Giesen, 2020). For example, the recommended rain gauge density is one station per 600-900 km² for flatlands and one station per 100-250 km² for topographically rugged areas (Zeng et al., 2018). Africa has only 744 stations, of which only a quarter meet the international standards, whereas it would ideally require 10,000 uniformly distributed standard stations to capture its spatiotemporal climate variability (Satgé et al., 2020). Moreover, the gauges that exist in Africa are concentrated in a few regions, such as South Africa (Le Coz and van de Giesen, 2020). In Ethiopia, the gauge networks are very sparse and of low quality. For example, in the upper Tekeze River basin (UTB), one rain gauge station covers an area of approximately 1400 km², which is far sparser than the recommended density (Gebremicael et al., 2019). Thus, investigating relevant data sources that can capture the spatiotemporal climatic variability of the UTB is required for high-quality hydrometeorological studies.
Recently, the availability of high-quality gridded climatic dataset products (satellite and reanalysis-based) has increased, and these products are increasingly used in hydrological and water management studies (Hu et al., 2017; Fang et al., 2019; Azimi et al., 2020; Gebremicael et al., 2020; Wang et al., 2020). In this context, gridded climatic datasets at both regional and global scales have become essential alternative data sources (Li et al., 2020). This could be a result of the fast development of remote sensing and data assimilation technologies (Zhang and Ma, 2018). However, the performance of these datasets in representing ground-based observations and in hydrological modeling applications is inconsistent between different regions and basins.
Many studies have evaluated and validated different datasets against gauge measurements (Dembélé and Zwart, 2016; Sahlu et al., 2017; Aslami et al., 2019; Lockhoff et al., 2019; Ayoub et al., 2020; Islam and Cartwright, 2020). Moreover, satellite and reanalysis-based datasets have also been used to drive hydrological models in different basins around the world (Zhao et al., 2015; Tan et al., 2017; Li et al., 2018; Roy et al., 2018; Awange et al., 2019; Azimi et al., 2020). These studies suggested that the various datasets may have different applicability across regions. For instance, the climatic research unit (CRU) time series (TS) 2.1 shows good agreement over most parts of China as compared to other reanalysis rainfall estimates (Zhao and Fu, 2006). A similar study by Dembélé and Zwart (2016) indicated that the Climate Hazards Group InfraRed Precipitation with Stations (CHIRPSv8) precipitation dataset better captured the observed extremes in West Africa. Notable improvements of interpolated surface temperature have been achieved through topographic correction for the 40-yr ECMWF Re-Analysis (ERA-40) temperature dataset over China (Zhao et al., 2008). The different meteorological datasets are inevitably subject to uncertainties resulting from factors associated with climate zones, seasonal changes, ground-surface conditions, geographical positions, and the embedded algorithms used to derive the datasets (Fang et al., 2019; Gebremicael et al., 2019; Wang et al., 2020). Assessing the accuracy and uncertainties of historical meteorological dataset products against ground observations is therefore helpful for water resource and hydrological studies.
Some studies have already evaluated satellite- and reanalysis-based global datasets in Ethiopia, e.g., in the Lake Tana basin (Worqlul et al., 2014), the Gilgel Abbay watershed (Lakew et al., 2017), and the upper Blue Nile (Bayissa et al., 2017; Sahlu et al., 2017). These studies focused on rainfall datasets only in the upper Blue Nile basin. Gebremicael et al. (2019) evaluated eight precipitation products in the UTB. However, their study focused only on satellite rainfall products, despite the fact that temperature is also a key parameter for hydro-climatic studies in the region. Reanalysis-based climatic datasets often provide better spatial coverage than non-reanalysis-based datasets. Based on the above, our study aims to evaluate nine satellite and reanalysis rainfall and near-surface air temperature (SAT) datasets against gauge-observed data in the UTB. The UTB is representative of the mesoscale basins in northern Ethiopia, and the Tekeze River is the most important water source for irrigation and hydropower in both Ethiopia and downstream countries such as Sudan and Egypt (Fentaw et al., 2018; Gebremicael et al., 2019). With a comprehensive evaluation of the temperature and precipitation products, this study will contribute to UTB water resource management for the sustainable development of the basin region and the downstream countries.

Study area
The Tekeze River basin (Fig. 1), situated in the northwestern part of Ethiopia, is one of the major tributaries of the Nile River (Abrha, 2009). The Tekeze headwaters originate in the Meket Mountains near Lalibela, and the river flows northwards until it turns westward along the Ethiopia-Eritrea border, covering a distance of 600 km before it crosses the Ethiopia-Sudan border near Humera (MoWR, 1998). The basin is characterized by complex topography consisting of mountains, highlands, and lowlands with gently sloping terrain. The complex topography endows the basin with a high potential for hydropower production in the mountainous areas and for irrigation in the lowland areas.
The northern and eastern parts of the UTB are categorized as semi-arid, and the southern part is categorized as semi-humid (Fentaw et al., 2018). The annual rainfall ranges from 400 mm yr⁻¹ in the eastern parts to 1200 mm yr⁻¹ in the southwestern parts of the basin (Gebremicael et al., 2019). The rainy season is from June to September, during which more than 70% of the total annual rainfall occurs in the UTB. Tmax and Tmin vary from 3 to 21°C in the high-elevation areas (around the Semien Mountains) and from 19 to 43°C in the flat and low-elevation areas (around Humera). At the basin level, the average minimum and maximum temperatures (Tmin and Tmax) of the UTB are 11 and 32°C, respectively (Fentaw et al., 2018).

Gauge observations
Gauge observations of daily maximum and minimum SAT and precipitation were collected from the National Meteorological Agency of Ethiopia (NMAE) for 21 stations located inside and near the UTB (Fig. 1). These daily data span from 1980 to 2019, with the lengths of records varying from station to station. In this study, only the stations having at least 35 years of continuous records with less than 2.5% missing values are considered. Most of the meteorological stations are in the northeastern part of the UTB (mostly highland area), and only a few are in the western and southwestern parts (dominantly lowland areas) of the basin (Fig. 1). This may introduce uncertainties into the spatially aggregated values of the climate variables. Thus, the point data are not interpolated into gridded time series. However, for the purpose of evaluating/validating satellite and reanalysis climatic datasets, the distribution of stations can be considered sufficient for a point-to-pixel comparison.
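The screening rule stated above (at least 35 years of record and less than 2.5% missing values) can be illustrated with a minimal sketch. It assumes each station record is a list with one entry per calendar day and `None` for a missing value; the function name and the example records are hypothetical, and this is not necessarily how the screening was implemented in the study.

```python
def passes_screening(daily_values, min_years=35, max_missing_frac=0.025):
    """Return True if a station record is long and complete enough.

    daily_values: one entry per calendar day of the record; None marks a
    missing value (an assumed storage convention).
    """
    n_days = len(daily_values)
    if n_days < min_years * 365:  # record shorter than the required span
        return False
    n_missing = sum(1 for v in daily_values if v is None)
    return n_missing / n_days < max_missing_frac


# Hypothetical records: 36 years with about 1% gaps and with 5% gaps.
n = 36 * 365
good = [None if i % 100 == 0 else 1.0 for i in range(n)]
bad = [None if i % 20 == 0 else 1.0 for i in range(n)]
print(passes_screening(good), passes_screening(bad))
```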
Quality control, such as excluding outliers and homogenization, was applied to the gauge observation TS data. Monthly precipitation and SAT series were calculated for each gauge station and then tested for homogeneity using the Standard Normal Homogeneity Test. This test assumes that precipitation amounts at the station being tested (test station) and some regional average values are proportional to each other. In our study, the test was applied to a series of ratios comparing the observations at each test station with the averaged observations at its four nearest stations. This relationship is expressed as the ratio between the normalized precipitation values of the test station and those of a regional TS defined as a weighted average of the four neighboring reference stations. See Section 1 in the supplementary information for more details. The test indicated no homogeneity problems in the rainfall and temperature gauge observations, which are thus considered reliable for the statistical analyses after the screening criteria are applied.
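As an illustration of the ratio-based homogeneity test described above, the following sketch computes the single-shift Standard Normal Homogeneity Test statistic in its commonly used form, T(k) = k·z̄₁² + (n − k)·z̄₂², where z is the standardized ratio series. The function names, the weights, and the example series are hypothetical; the exact procedure used in the study is given in its supplementary information and may differ.

```python
import math


def ratio_series(test, references, weights):
    """Ratio of the test-station series to a weighted average of reference series."""
    total_w = sum(weights)
    regional = [
        sum(w * ref[i] for w, ref in zip(weights, references)) / total_w
        for i in range(len(test))
    ]
    return [t / r for t, r in zip(test, regional)]


def snht_statistic(ratios):
    """Return (T0, k): the maximum test statistic and the break position k."""
    n = len(ratios)
    mean = sum(ratios) / n
    std = math.sqrt(sum((q - mean) ** 2 for q in ratios) / (n - 1))
    z = [(q - mean) / std for q in ratios]  # standardized ratios
    best_t, best_k = 0.0, None
    for k in range(1, n):
        z1 = sum(z[:k]) / k            # mean before the candidate break
        z2 = sum(z[k:]) / (n - k)      # mean after the candidate break
        t = k * z1 ** 2 + (n - k) * z2 ** 2
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k


# Hypothetical ratio series with an artificial shift after the 10th value.
shifted = [1.0] * 10 + [1.2] * 10
t0, k0 = snht_statistic(shifted)
```

A series is flagged as inhomogeneous when T0 exceeds the tabulated critical value for the series length; a large T0 located at k points to a shift after the k-th value.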

Satellite-based and reanalysis datasets
We selected nine climatic datasets that include satellite, reanalysis, gauge-based temperature and rainfall observations, and the merged estimates based on the three types of datasets (Table 1). The choice of these products was mainly based on their temporal coverage, data integrity, availability of near-real-time data, accessibility, and popularity in the literature.
The CRU gridded TS (CRU TS) dataset is produced by the CRU at the University of East Anglia based on the World Meteorological Organization (WMO) global stations (Harris et al., 2014). Many studies have used this dataset in diverse research areas in Africa since it was first released in 2000 (e.g., Haile et al., 2020; Mahmood et al., 2020; Peng et al., 2020).
ERA5-land is a global reanalysis-based product produced by the Copernicus Climate Change Service (C3S; Maidment et al., 2013). Tmean at 2 m above the ground and the total precipitation were obtained from this dataset. These two variables are based on inputs from satellite radiances and in-situ data provided by the WMO information system (WIS). This product has been applied in Africa, e.g., in East Africa (Agutu et al., 2017) and Uganda (Maidment et al., 2013).
The CHIRPS data are a quasi-global rainfall dataset developed by the Climate Hazards Center (CHC) at the University of California, Santa Barbara and the US Geological Survey (USGS; Funk et al., 2015). This dataset is based on the CHC's precipitation climatology dataset and quasi-global geostationary thermal infrared satellite observations. Rainfall data from the NOAA Climate Forecast System and the TRMM 3B42 (Tropical Rainfall Measuring Mission level 3 output) product are also inputs for CHIRPS (Funk et al., 2015). This product has been widely used in East Africa, as reported in many studies (e.g., Bayissa et al., 2017; Lemma et al., 2019).
The Climate Hazards Group InfraRed Temperature with Station (CHIRTS) daily SAT dataset was developed by the CHC at the global scale (60°S-70°N) with a spatial resolution of 0.05° × 0.05° (approximately 5 km). This dataset contains daily maximum and minimum SAT and was produced through the disaggregation of monthly fields of the CHIRTS Tmax (CHIRTSmax1) TS and by synthesizing daily temperatures from ERA5 (Funk et al., 2019). The input data for CHIRTSmax1 are the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA2), CRU TS, and ground station data (from approximately 15,000 stations) from the Berkeley Earth database.
The Global Precipitation Climatology Centre (GPCC) product v2018 of global daily precipitation was produced based on rainfall data provided by national meteorological and hydrological station services, regional and global data collections, and WMO GTS (Global Telecommunication System) data (Schamm et al., 2014). More than 85,000 rain gauge stations across the globe were used in the GPCC dataset. This dataset has been widely used in Africa, e.g., in the Horn of Africa (Dinku et al., 2007; Funk et al., 2015) and sub-Saharan Africa (Harrison et al., 2019).
The Climate Prediction Center (CPC) dataset is part of a product suite from the CPC Unified Precipitation Project that is underway at the NOAA (Chen et al., 2008). The precipitation and temperature variables were produced by merging observations from gauge stations with precipitation estimates from several satellite-based (infrared and microwave) datasets.
NCEP Reanalysis-2, produced by the NOAA-CIRES Climate Diagnostics Center, is a new version of the NCEP Reanalysis-1 model that has been modified through the update of the physical process parameterization and the fixing of errors (Marques et al., 2009). NCEP Reanalysis-2 performed data assimilation using the Rapid Radiative Transfer Model (RRTM) developed by the Atmospheric and Environmental Research (AER) group using different data sources.
WFDEI was generated by using the same methodology as the WATCH (WATer and global CHange) Forcing Data by making use of the ERA-Interim reanalysis data (Weedon et al., 2014). The ERA-Interim data improved the temperature and precipitation data compared to WATCH. WFDEI has been widely used for studies on climate change and hydrological modeling (Liu et al., 2017; Lange, 2019).
EWEMBI is a newly compiled reference dataset whose name stands for EartH2Observe, WFDEI, and ERA-Interim reanalysis data Merged and Bias-corrected for the Inter-Sectoral Impact Model Intercomparison Project (EWEMBI) phase 2b (Lange, 2018). The sources of the precipitation and temperature data of the EWEMBI dataset are the WFDEI and GPCC v5 data over the land surface, and EartH2Observe forcing data (E2OBS) over the ocean (Lange, 2016).

Evaluation indices
The climatic variables of precipitation and mean, maximum, and minimum SAT were evaluated at daily and monthly timescales. The temperature and precipitation products with hourly temporal frequencies were aggregated to daily and monthly TS. Evaluation of the selected temperature and precipitation products was performed for the 35 yr covered by all the products. A point-to-pixel comparison approach was used to evaluate the products against their corresponding observations at each gauge station. In this approach, the precipitation and temperature values of a grid cell were compared with the observed gauge data located within that pixel. For evaluating such products at small spatial scales with varying climatic factors, this approach is more applicable in rugged topography and complex terrain such as the UTB than in areas with homogeneous topography (Gebremicael et al., 2019; Lemma et al., 2019; Li et al., 2020).
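The point-to-pixel comparison described above amounts to reading, for each station, the value of the grid cell that contains it. A minimal sketch is given below, assuming a regular latitude-longitude grid described by its cell-centre coordinates, so that the containing cell is the one with the nearest centre. The function names, grid, and station coordinates are hypothetical.

```python
def nearest_index(coords, value):
    """Index of the grid-cell centre closest to value along one axis."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def point_to_pixel(grid, lats, lons, st_lat, st_lon):
    """Value of the grid cell (grid[lat_index][lon_index]) containing the station."""
    return grid[nearest_index(lats, st_lat)][nearest_index(lons, st_lon)]


# Hypothetical 0.5-degree grid; each cell value encodes its own indices.
lats = [12.0, 12.5, 13.0, 13.5]
lons = [38.0, 38.5, 39.0, 39.5]
grid = [[10 * i + j for j in range(len(lons))] for i in range(len(lats))]
value = point_to_pixel(grid, lats, lons, st_lat=13.1, st_lon=38.4)
```

Because no spatial averaging or interpolation is involved, the gauge is compared with exactly one pixel, which keeps sparse and unevenly distributed station networks from being smoothed into areas they do not represent.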
The precipitation products were further evaluated for monthly cumulative precipitation and different rainfall intensity groups. The monthly standardized precipitation index (SPI) was used to compare the performances of the rainfall datasets in capturing historical drought events. SPI is an index for characterizing and monitoring drought conditions on a range of timescales based on the probability distribution of a long-term precipitation TS (McKee et al., 1995; Liu et al., 2012). The calculation of SPI includes a transformation of one frequency distribution (e.g., gamma) to another frequency distribution (normal or Gaussian). Specifically, a precipitation time series is first fitted with a gamma (or an incomplete beta) probability distribution and then transformed to a normal (or Gaussian) distribution. The gamma distribution has been widely used in previous studies because it is regarded as a reliable fit to rainfall data. The resulting normal distribution has a mean of zero and a standard deviation of one and can therefore be used to indicate dry or wet conditions. The detailed SPI calculation procedures can be found in the supplementary information. The SPI was calculated for 3-month (SPI3), 6-month (SPI6), 9-month (SPI9), and 12-month (SPI12) timescales for each precipitation product. A threshold value of −1 is usually used as an indicator of drought conditions for the SPI (Tefera et al., 2019). Hence, −1.0 ≥ SPI > −1.5 indicates moderate drought, −1.5 ≥ SPI > −2.0 indicates severe drought, and SPI ≤ −2.0 indicates extreme drought (Liu et al., 2012). The trend of extreme heat and the number of hot days per year (number of days above the extreme value) were also evaluated for the temperature datasets against the gauge-observed data. In this study, extreme temperature was defined as SAT > the 95th percentile for daily maximum and mean SAT and SAT < the 5th percentile for minimum SAT.
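The gamma-to-normal transformation behind the SPI can be sketched as follows. This is a simplified illustration, not the procedure of the study (which is given in its supplementary information): it assumes the gamma parameters are estimated with Thom's approximation to the maximum-likelihood solution, that the input totals all belong to the same calendar month, and that a value of exactly −2.0 is classed as extreme. All function names and the example totals are hypothetical.

```python
import math
from statistics import NormalDist


def rolling_sums(monthly, window):
    """k-month accumulated precipitation, one value per month once the window is full."""
    return [sum(monthly[i - window + 1:i + 1]) for i in range(window - 1, len(monthly))]


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    t = x / scale
    term = 1.0 / shape
    total = term
    n = 0
    while term > 1e-12 * total and n < 10000:
        n += 1
        term *= t / (shape + n)
        total += term
    return math.exp(-t + shape * math.log(t) - math.lgamma(shape)) * total


def spi(accumulated):
    """SPI values for a list of accumulated precipitation totals."""
    positive = [x for x in accumulated if x > 0]
    q = 1 - len(positive) / len(accumulated)  # probability of a zero total
    mean = sum(positive) / len(positive)
    a = math.log(mean) - sum(math.log(x) for x in positive) / len(positive)
    shape = (1 + math.sqrt(1 + 4 * a / 3)) / (4 * a)  # Thom's approximation
    scale = mean / shape
    out = []
    for x in accumulated:
        h = q + (1 - q) * gamma_cdf(x, shape, scale)
        h = min(max(h, 1e-6), 1 - 1e-6)  # keep the inverse normal finite
        out.append(NormalDist().inv_cdf(h))
    return out


def drought_class(value):
    """Drought category from the SPI thresholds quoted in the text."""
    if value <= -2.0:
        return "extreme"
    if value <= -1.5:
        return "severe"
    if value <= -1.0:
        return "moderate"
    return "none"


# Hypothetical accumulated totals (mm) for one calendar month over 12 years.
totals = [310, 420, 505, 280, 390, 610, 450, 150, 480, 530, 365, 440]
spi_values = spi(totals)
```

In practice the distribution is fitted separately for each calendar month and accumulation window (3, 6, 9, or 12 months), so that each SPI value expresses how unusual a total is for that time of year.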

Daily timescale
The statistical metrics from the comparison of precipitation products against the observed data across all gauging stations of the UTB are shown in Fig. 2. The EWEMBI dataset shows a reasonably good correspondence with the rain gauge observations at the daily timescale, with average values of 0.58, 2.3%, 6.4 mm, and 2.7 mm for CC, Pbias, RMSE, and MAE, respectively. This may be due to the climatological bias adjustments applied to EWEMBI. EWEMBI was produced by using the data sources E2OBS, WFDEI, and ERAI, which are bias-corrected by using GPCPv2.1 monthly precipitation totals over the ocean (Balsamo et al., 2015) and GPCCv5/v6 monthly precipitation totals over the land (Weedon et al., 2014). [...] were obtained for the CPC dataset. The CC (Fig. 2a), Pbias (Fig. 2b), RMSE (Fig. 2c), and MAE (Fig. 2d) values for each precipitation product except for EWEMBI, CHIRPS, and CPC indicate poor agreement with the majority of the rain gauge networks at the daily timescale. The larger gap between some of the datasets and the observed data at the daily timescale could be due to the complex and rugged topography in the UTB. This, in turn, could contribute to failures in detecting more localized convective rainfall events. Among the precipitation datasets, the ERA5 and NCEP products overestimate the rainfall at all stations. The ERA5 product resulted in a high CC (0.6) but largely overestimated the rainfall at all stations, with an average (range) value of Pbias reaching 150% (88% to 242%). This indicates the poor performance of the ERA5 product in capturing the ground rainfall over the basin, which is consistent with the previous study by Lemma et al. (2019). Likewise, NCEP strongly overestimates (Pbias = 98.6%) the daily observed rainfall. NCEP also shows relatively poor skill throughout the basin, with an average CC value of 0.33. The RMSE and MAE values for both the ERA5 and NCEP products are also very large compared to those of the other datasets (Fig. 2). The performances of the remaining precipitation products fall between those of the rainfall products discussed above.
Percentage of bias (Pbias): $P_{\mathrm{bias}} = \frac{\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i} \times 100\%$. Note: $x_i$ is the temperature or precipitation from gauge observations, $y_i$ is the corresponding value from the precipitation or temperature product, $n$ is the number of observations, and $\bar{x}$ and $\bar{y}$ are the averages of the gauge-observed and meteorological product values, respectively. The evaluation was performed at daily and monthly timescales for all the selected gauge stations.
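The four statistics used throughout this evaluation (CC, Pbias, RMSE, and MAE) can be sketched as follows, assuming their commonly used definitions: the Pearson correlation coefficient, the summed product-minus-gauge difference relative to the gauge total, the root-mean-square error, and the mean absolute error. The function name and the example values are hypothetical.

```python
import math


def evaluation_metrics(obs, est):
    """CC, Pbias (%), RMSE, and MAE of product estimates against gauge observations."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    cc = cov / math.sqrt(var_o * var_e)
    pbias = 100.0 * sum(e - o for o, e in zip(obs, est)) / sum(obs)
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / n)
    mae = sum(abs(e - o) for o, e in zip(obs, est)) / n
    return {"CC": cc, "Pbias": pbias, "RMSE": rmse, "MAE": mae}


# Hypothetical daily values (mm): the product is 1 mm too wet every day.
obs = [0.0, 2.0, 4.0, 6.0, 8.0]
est = [1.0, 3.0, 5.0, 7.0, 9.0]
m = evaluation_metrics(obs, est)
```

The example shows why the metrics are read together: a product with a constant wet bias still has a perfect correlation with the gauge, so a high CC alone (as reported for ERA5) does not imply an accurate rainfall amount.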
Figure 3 presents a comparison between the precipitation estimates and the observed rainfall using the spatial distribution of CC. The spatial distributions of CC, Pbias, RMSE, and MAE (Table S1a-c), indicated by the range values, show that the EWEMBI and CHIRPSv8 products have better agreement with ground rainfall, especially at the stations in the northeastern part of the basin. The NCEP, GPCC, and WFDEI products show poor agreement, with CC values less than 0.5 at almost all the stations.

Monthly timescale
The daily precipitation products were then evaluated at the monthly timescale in order to determine how much their performance, as measured by the indicators, improves from the daily to the monthly scale. At this timescale, they could also be compared with the well-known monthly precipitation product, CRU TS v4.03. Table 3 summarizes the performances of the different rainfall estimates using the various statistical indices at the monthly timescale. The statistical indices of all products in estimating the monthly rainfall are significantly improved compared to those at the daily timescale. As shown in Table 3, all products have average CC values greater than 0.7. This result agrees with previous studies showing improvements in detecting rainfall events using satellite and reanalysis-based datasets when the temporal scale was changed from daily to monthly (Roy et al., 2018; Ayoub et al., 2020; Islam and Cartwright, 2020). This could be due to the variabilities being counterbalanced and the errors offsetting each other when the data are aggregated from shorter to longer timescales. Even though the estimation accuracies are improved for most of the precipitation datasets, the EWEMBI, CHIRPS, CRU TS v4.03, and CPC datasets outperform the others, with higher CC values and lower Pbias, RMSE, and MAE values. In addition, the performance rank of these products at the monthly timescale is in line with the results obtained at the daily timescale.
As the observations have the highest accuracy at the station locations, an additional case was evaluated by interpolating the gridded data to the stations and comparing the interpolated values directly with the station observations. A slight performance improvement was found for most of the precipitation products (Table 4). CHIRPS (CC = 0.9) and EWEMBI (CC = 0.89) perform better than the others, while NCEP, ERA5, and WFDEI show relatively poor agreement with the ground observations.
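The interpolation method used for this additional case is not stated here. As one plausible illustration, the sketch below assumes bilinear interpolation from the four grid-cell centres surrounding a station on a regular latitude-longitude grid; the function name, grid, and coordinates are hypothetical.

```python
def bilinear(grid, lats, lons, lat, lon):
    """Bilinearly interpolate grid[lat_index][lon_index] to the point (lat, lon).

    lats and lons hold ascending cell-centre coordinates and must bracket
    the target point.
    """
    i = max(k for k in range(len(lats) - 1) if lats[k] <= lat)
    j = max(k for k in range(len(lons) - 1) if lons[k] <= lon)
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])   # fractional position north-south
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])   # fractional position east-west
    south = grid[i][j] * (1 - tx) + grid[i][j + 1] * tx
    north = grid[i + 1][j] * (1 - tx) + grid[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty


# Hypothetical 2 x 2 block of monthly rainfall (mm) around a station.
lats = [12.0, 13.0]
lons = [38.0, 39.0]
grid = [[0.0, 10.0], [20.0, 30.0]]
station_value = bilinear(grid, lats, lons, 12.5, 38.5)
```

Unlike the point-to-pixel approach, the interpolated value blends neighbouring cells, which can reduce the mismatch when a station lies near a cell boundary.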

Precipitation intensity
To further understand whether the different precipitation datasets could capture rainfall events within various intensity groups, we divided the daily precipitation intensity (PI) of the gauge data (OBS) into six groups (0 ≤ PI < 1, 1 ≤ PI < 5, 5 ≤ PI < 10, 10 ≤ PI < 20, 20 ≤ PI < 30, and PI ≥ 30 mm day⁻¹) following Wang et al. (2020). Figure 4 shows the total cumulative rainfall values of the eight precipitation products under different PI groups. The 0-1 and 1-5 mm day⁻¹ PI groups are overestimated by all precipitation products. However, it is clearly shown in Fig. 4 that ERA5 and NCEP highly overestimate all rainfall magnitudes (PI groups). In particular, ERA5 and NCEP fail to capture the PI groups with values below 20 mm day⁻¹ but capture rain gauge observations above 20 mm day⁻¹ reasonably well. This result is contrary to the findings of Islam and Cartwright (2020), who reported that ERA5 outperformed other datasets in predicting rainfall accumulation below 20 mm day⁻¹ but seriously underestimated higher rainfall values. This could be an indicator that the different datasets produce different results in different regional studies. Most of the remaining precipitation products, especially EWEMBI, CPC, and CHIRPS, captured the PI groups greater than 5 mm day⁻¹ better than ERA5 and NCEP. The precipitation products were further evaluated by using statistical metrics to examine whether the datasets can capture the rainfall of the different PI groups. The CC and Pbias indices indicate that the different precipitation datasets poorly captured almost all PI groups. However, CHIRPS (for 0 ≤ PI < 1 and 1 ≤ PI < 5), EWEMBI (for 5 ≤ PI < 10 and 10 ≤ PI < 20), and CPC (for 20 ≤ PI < 30 and PI ≥ 30 mm day⁻¹) showed relatively better performance, with higher CCs and lower Pbias values (Table 5), compared to the other datasets. This result is in agreement with a previous study by Wang et al. (2020), which showed lower capturing capacities of reanalysis and satellite rainfall products when rainfall events were grouped into different intensity ranges. The results of the statistical metrics (CC, Pbias, RMSE, and MAE) of all the precipitation datasets for the considered PI groups are summarized in the supplementary files (Table S2a-d).
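The grouping by precipitation intensity can be sketched as follows. The sketch assumes that each day is assigned to a group according to the gauge-observed amount and that observed and estimated rainfall are then accumulated within each group; the names and the example values are hypothetical.

```python
PI_LOWER_BOUNDS = [0, 1, 5, 10, 20, 30]  # lower bounds of the six groups, mm/day


def pi_group(p):
    """Index (0-5) of the intensity group containing the daily amount p."""
    idx = 0
    for k, lower in enumerate(PI_LOWER_BOUNDS):
        if p >= lower:
            idx = k
    return idx


def cumulative_by_group(obs, est):
    """Per-group [observed total, estimated total], grouped by observed intensity."""
    totals = [[0.0, 0.0] for _ in PI_LOWER_BOUNDS]
    for o, e in zip(obs, est):
        g = pi_group(o)
        totals[g][0] += o
        totals[g][1] += e
    return totals


# Hypothetical daily gauge and product values (mm/day).
obs = [0.0, 0.5, 3.0, 7.0, 15.0, 25.0, 40.0]
est = [0.2, 1.0, 4.0, 6.0, 12.0, 20.0, 30.0]
totals = cumulative_by_group(obs, est)
```

In this made-up example the product overestimates the two lightest groups and underestimates the heavier ones, a pattern that only becomes visible once the totals are split by intensity.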

Historical drought events
The ability of the different rainfall datasets to detect temporal drought was analyzed through comparisons with the drought indicated by the gauge-observed rainfall records (OBS). Figure 5 shows the monthly average SPI values over the UTB during 1982-2016 for each of the precipitation products using the 12-month SPI TS (SPI12). The SPI3, SPI6, and SPI9 TS are also given in the supplementary materials (Fig. S1). The discrepancies in the SPI results among the datasets were further examined against records of exceptional historical droughts during the study period. For example, there was a severe drought in Ethiopia and Sudan during 1984-1985, induced by persistent rainfall shortages, which caused many deaths and migrations (Degefu and Bewket, 2015; Haile et al., 2020). The observed data detected this case as an extreme drought (SPI < −2; Fig. 5). EWEMBI, CRU TS v4.03, and CHIRPS detected a severe drought, while NCEP, ERA5, WFDEI, and GPCC identified a moderate drought for that specific period. Furthermore, CPC did not detect a drought in this particular period. Similarly, during 2002-2003, there was a countrywide drought in Ethiopia that affected more than 14 million people and resulted in severe damage (Muller, 2014; Nicholson, 2017). Almost all products detected this drought event better than the drought event in 1984; the datasets detected drought events for this case from moderate to extreme, except GPCC and WFDEI, which detected a very light drought (0 > SPI > −1).
Figure 6 shows the statistical metrics (Pbias, CC, RMSE, and MAE) for the drought events (for which the observed SPI is ≤ −1) at the SPI3, SPI6, SPI9, and SPI12 TS for each product. Generally, all precipitation products underestimated the drought pattern compared to that detected in the observed data. This is revealed by the negative values of Pbias for all SPI TS (Fig. 6b). Drought estimates from EWEMBI, CRU TS v4.03, and CHIRPS are closer to the observed drought (SPI ≤ −1) for SPI3, SPI6, SPI9, and SPI12 than those of the other datasets. The Pbias and CC for EWEMBI, CRU TS v4.03, and CHIRPS are estimated as −28%, −33%, and −44%, and 0.71, 0.67, and 0.73, respectively, for SPI12. The drought metrics at the SPI12 TS driven by NCEP, GPCC, ERA5, and WFDEI are significantly underestimated compared to the drought observations. The underestimates are within the range from −50% (ERA5) to more than −112% (NCEP). This indicates that most precipitation products can reasonably detect the meteorological, agricultural, and hydrological droughts in the UTB. The Pbias (CC) values of EWEMBI are −40% (0.52), −37% (0.57), −32% (0.61), and −28% (0.71) for the drought metrics of the SPI3, SPI6, SPI9, and SPI12 TS, respectively. These values clearly show a consistently improving pattern with the increasing months of the SPI TS. This may imply an improvement in the capability of the precipitation products to detect rainfall events from daily to monthly timescales. However, there are some exceptions for WFDEI and NCEP; even though their correlations with the observed data increase with the SPI TS, their underestimation of drought grows as the SPI accumulation period increases from SPI3 to SPI6, SPI9, and SPI12 (Fig. 6). This may be attributed to the different retrieval methodologies and assumptions of the products. The results suggest that precipitation-based drought metrics can vary significantly with the choice of precipitation product, spatial resolution, and record length. These relationships also vary with the severity of drought events.
In summary, the EWEMBI, CRU TS v4.03, and CHIRPS datasets have shown better agreement with the observed drought at each SPI timescale (SPI3, SPI6, SPI9, and SPI12) across the basin than the other products. The relatively better performance of these products is also confirmed by their lowest Pbias, RMSE, and MAE values and highest CC values at the monthly timescale.

Temperatures at daily and monthly timescales
Comparisons based on the exact daily Tmean at each gauge station indicate that the estimates from the ERA5, CHIRTS, and EWEMBI products are in agreement with the corresponding observations. As presented in Fig. 7, the average CC values for the ERA5, CHIRTS, and EWEMBI products are 0.65, 0.55, and 0.55, respectively. The daily Tmean patterns from these products have consistent and better agreement with the ground Tmean than those of the NCEP and CPC products, which have large RMSE and MAE values. The discrepancies in the metrics are large across stations for NCEP and CHIRTS. In particular, the ranges of the metrics are large for NCEP, e.g., 1.62-6.38°C (MAE) and 2.04-6.82°C (RMSE). The CC metric was also calculated based on the anomaly values of Tmean. Similar to the exact values, the ERA5, CHIRTS, and EWEMBI products are in agreement with the corresponding observations of Tmean anomalies, with CC values of 0.62, 0.53, and 0.52, respectively (Fig. S2).
Daily Tmax and Tmin were evaluated for CHIRTS, CPC, and EWEMBI, and monthly Tmax and Tmin were evaluated for the same products plus CRU TS v4.03. Table 6 indicates the performances of each of these products in capturing their corresponding gauge values; the values shown are the average results of the metrics for all the stations within the basin. The Tmax and Tmin estimates of CHIRTS and EWEMBI are relatively more accurate at the daily scale than those of the other datasets. Likewise, the Tmax and Tmin estimates of CHIRTS, CRU TS v4.03, and EWEMBI indicate better agreement with the observed values at the monthly timescale (Table 6) than the other datasets. However, analogous to the precipitation products, all the temperature products show poorer performance at the daily timescale than at the monthly timescale. The better performance at the monthly timescale could be because the errors in the daily data offset each other when aggregated to monthly data.
The CHIRTS, CPC, and EWEMBI products overestimate Tmax and Tmin at both daily and monthly timescales. The CRU TS temperature product underestimates Tmin but overestimates Tmax (Table S3a-c). The temperature products capture the gauge-observed Tmean better than Tmax and Tmin, with relatively higher CC values (Fig. 7 and Table 6). In addition, SAT estimates from all products improve consistently as the temporal scale increases. This improvement could be because the temperature datasets capture the ground measurements relatively better at larger temporal scales than at smaller ones. The temporal variations in Tmean, Tmax, and Tmin at the seasonal timescale are given in the supplementary information (Table S4) for spring, autumn, winter, and summer. Generally, all the temperature products overestimate the seasonal daily Tmean, Tmax, and Tmin, except ERA5 (which underestimates summer Tmean) and CPC (which underestimates autumn Tmax).

Temperature extreme indices
Table 7 presents the daily 95th percentile values of Tmax and Tmean and the 5th percentile values of Tmin for all the SAT datasets. CPC and ERA5 are the closest to the gauge-observed 95th percentile values of Tmax and Tmean, respectively. The high SAT extremes (95th percentile) from all the products are larger than those of the observations. EWEMBI provides a relatively better estimate of the observed low extreme (5th percentile) of Tmin than the other products (Table 7).
In this study, SAT values are considered high extremes when they exceed the 95th percentile of the gauge SAT data (values greater than 28.4 and 21.1°C for Tmax and Tmean, respectively). Similarly, a low extreme is considered when Tmin is less than the 5th percentile of the gauge SAT (values < 8.2°C). In terms of the long-term annual changes in the 95th percentile of Tmax and Tmean presented in Fig. 8, each product consistently overestimates the SAT variables. With regard to the changing tendencies, OBS, CHIRTS, EWEMBI, ERA5, and NCEP show an increasing pattern, whilst the CPC product shows a significantly decreasing pattern. According to the CC results, the annual TS of the 95th percentile Tmean values of ERA5 and EWEMBI show comparatively better correlations with the observed data than the other datasets, with CC values of 0.71 and 0.66, respectively. CPC shows the lowest correlation with the observed data, with a CC value of 0.29 for the 95th percentile of Tmean (Fig. 8b). CHIRTS exhibits relatively better accuracy in the annual TS of the 95th percentile of Tmax (CC value of 0.67).
Although the estimated 95th percentile Tmax values of CPC are closer to the observed data than those of CHIRTS and EWEMBI, the time-series correlation of CPC is poor, with a CC value of less than 0.26 (Fig. 8a). The SAT datasets poorly match the temporal variability of the Tmin extremes (5th percentile of Tmin) in the corresponding observed values. EWEMBI performs relatively better (CC = 0.52) than CHIRTS (CC = 0.26) and CPC (CC = 0.30). Similarly, EWEMBI, CHIRTS, and CPC also overestimate the yearly Tmin extremes (Fig. 8).
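The annual percentile series compared in Fig. 8 can be built along the following lines. This is a sketch under stated assumptions: the text does not say which percentile estimator was used, so linear interpolation between order statistics is assumed here, and the function names and sample values are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics.

    The interpolation rule is an assumption; other estimators give slightly
    different values for short samples.
    """
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def annual_percentile(years, values, q):
    """Return {year: q-th percentile of that year's daily values}."""
    by_year = defaultdict(list)
    for y, v in zip(years, values):
        by_year[y].append(v)
    return {y: percentile(v, q) for y, v in sorted(by_year.items())}


# Synthetic daily Tmax values in degC (not study data), five days per year.
years = [2000] * 5 + [2001] * 5
tmax = [20.0, 22.0, 24.0, 26.0, 28.0, 21.0, 23.0, 25.0, 27.0, 29.0]

p95_by_year = annual_percentile(years, tmax, 95)
p05_by_year = annual_percentile(years, tmax, 5)
for year in p95_by_year:
    print(year, round(p95_by_year[year], 2), round(p05_by_year[year], 2))
```

Applying the same function to the gauge series and to each product gives two annual series per variable, which can then be correlated to obtain CC values of the kind reported above.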
The number of days per year on which each SAT dataset exceeds the 95th percentile of Tmax (28.4°C) and Tmean (21.1°C), and falls below the 5th percentile of Tmin (8.2°C), is also evaluated. The mean estimated number of days with Tmax above 28.4°C is 50, 113, 55, and 126 days per year for OBS, CHIRTS, CPC, and EWEMBI, respectively (Fig. 9a). All SAT products overestimate the number of days with Tmax greater than the observed 95th percentile value (28.4°C). Among them, the CPC estimate is the closest to the observed values. The highest daily Tmax, 37.7°C, is estimated by EWEMBI, followed by the observed data (36.7°C). Figure 9b shows the average number of days with Tmean values higher than 21.1°C; the highest value is estimated by CHIRTS (182.9 days yr⁻¹), and the lowest by ERA5 (56.6 days yr⁻¹), which is closer to the ground observations (49.8 days yr⁻¹) than the estimates of the other datasets. The largest daily Tmean value from the observed data is 30.8°C, whereas the largest daily Tmean from the different datasets ranges from 26.6°C (EWEMBI) to 31°C (CPC). Similarly, the average number of days per year with Tmin less than 8.2°C is 20.4, 2.5, 4.9, and 10.5 days for OBS, CHIRTS, CPC, and EWEMBI, respectively. The number of Tmin days below the 5th percentile value of the observed data is underestimated by all products, unlike the overestimation seen for Tmax and Tmean (Fig. 9c).
In the UTB, Tmean is better captured by ERA5, while Tmax and Tmin estimated by CHIRTS are in good agreement with the observed values. This implies that the best estimates of Tmean, Tmax, and Tmin come from different SAT products. This result agrees with the study of Nechita et al. (2019). The TS of SAT extremes of all the datasets, including the observed data, show increasing trends. This could be an indication of the impact of climate change in the UTB. This argument is also supported by Fentaw et al. (2018), whose climate change study in the UTB shows an increasing trend in SAT.
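The day counts summarized in Fig. 9 amount to counting, for each year, the days on which a series lies beyond a fixed gauge-derived threshold. A minimal sketch follows; the three thresholds are the values quoted in the text, while the function names and the short sample series are hypothetical.

```python
# Gauge-derived thresholds quoted in the text (degC).
TMAX_P95 = 28.4
TMEAN_P95 = 21.1
TMIN_P05 = 8.2


def days_beyond(years, values, threshold, above=True):
    """Count days per year strictly above (or strictly below) a threshold."""
    counts = {y: 0 for y in sorted(set(years))}
    for y, v in zip(years, values):
        if (v > threshold) if above else (v < threshold):
            counts[y] += 1
    return counts


def mean_days_per_year(counts):
    """Average the yearly counts, including years with zero extreme days."""
    return sum(counts.values()) / len(counts)


# Synthetic daily values in degC (not study data), three days per year.
years = [2001, 2001, 2001, 2002, 2002, 2002]
tmax = [27.0, 28.5, 30.1, 28.4, 26.0, 29.0]
tmin = [9.0, 8.1, 7.5, 8.2, 10.0, 6.0]

hot_days = days_beyond(years, tmax, TMAX_P95, above=True)
cold_days = days_beyond(years, tmin, TMIN_P05, above=False)
print("hot days per year: ", hot_days, "mean:", mean_days_per_year(hot_days))
print("cold days per year:", cold_days, "mean:", mean_days_per_year(cold_days))
```

Because the thresholds come from the gauge data, any warm bias in a product translates directly into more days above the Tmax and Tmean thresholds and fewer days below the Tmin threshold, consistent with the pattern reported for Fig. 9.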

Conclusions
In this study, nine satellite and reanalysis-based precipitation and SAT datasets were statistically evaluated over the UTB. The EWEMBI, ERA5-land, GPCC v2018, CPC, WFDEI, NCEP reanalysis-2, CRU TS v4.03, CHIRPSv8, and CHIRTS products were evaluated against observed data. Precipitation, Tmax, Tmin, and Tmean were evaluated against the ground measurements at daily and monthly timescales by using statistical indices including CC, Pbias, RMSE, and MAE. These products were further evaluated by using different precipitation intensity groups and drought indices. Accordingly, the following conclusions are derived from this study.
EWEMBI, CHIRPSv8, and CRU TS v4.03 perform well in representing daily and monthly precipitation. At the monthly timescale, EWEMBI (CC = 0.86) shows good agreement with the observed values, followed by the CHIRPS (CC = 0.85) and CRU TS v4.03 (CC = 0.85) products. The rainfall estimates from ERA5 and NCEP perform poorly because these products largely overestimate daily rainfall. The monthly precipitation estimates usually perform better than the daily estimates. The precipitation products overestimate the lower precipitation intensity events (less than 5 mm day⁻¹) but better estimate higher precipitation intensity events (greater than 20 mm day⁻¹). Moreover, the precipitation products underestimate historical drought events compared to the observed data. Specifically, the EWEMBI, CRU TS v4.03, and CHIRPS products show better agreement with the gauge-based drought (SPI ≤ −1) estimation for each SPI timescale. Overall, the EWEMBI, CHIRPSv8, and CRU TS v4.03 precipitation products represent the rainfall in the UTB better than the other products.
ERA5, CHIRTS, and EWEMBI provide relatively better SAT estimates than the other products. Tmean estimates from ERA5, CHIRTS, and EWEMBI show good agreement with gauge measurements. Similarly, Tmax and Tmin estimates of CHIRTS and EWEMBI at the daily timescale and of CHIRTS, CRU TS v4.03, and EWEMBI at the monthly timescale agree better with observed values over the UTB than those of the other products. The SAT products perform more poorly at the daily timescale than at the monthly timescale. In addition, Tmean is better represented than Tmax and Tmin by these products. SAT extremes are well represented by ERA5 and EWEMBI (for Tmean), CHIRTS (for Tmax), and EWEMBI (for Tmin), while the CPC product performs poorly in capturing the temperature extremes. The number of days with high temperature extremes (Tmax and Tmean) is overestimated by all products, whereas the number of days with low Tmin extremes is underestimated. The overall deviation between the SAT products and the observed values is higher in the Tmin estimates than in the Tmax and Tmean estimates. In summary, based on time-series agreement (daily and monthly) and the accuracy of the products in estimating temperature extremes, the ERA5, CHIRTS, and EWEMBI temperature products are relatively better than the other products at representing the gauge-observed values in the UTB.

Fig. 1. Map of the study area: (a) Ethiopia, (b) geographical locations and distribution of meteorological stations in the UTB, (c) average monthly maximum, mean, and minimum temperatures (Tmax, Tmean, and Tmin), and (d) mean monthly rainfall of the UTB for dry and wet seasons based on gauge records from 1980 to 2019.

Fig. 2. Boxplots of evaluation metrics for the seven precipitation products at the daily timescale during 1982-2016: (a) Pearson correlation coefficient (CC), (b) percentage of bias (Pbias), (c) root mean square error (RMSE), and (d) mean absolute error (MAE). Each boxplot also indicates the range of variation from minimum to maximum (vertical line), the quartiles and median (horizontal lines), mean (points), and outliers (+) of the precipitation estimates at the 21 stations.

Fig. 3. Spatial distributions of CCs for the seven daily precipitation products.
Fig. 4. Cumulative rainfall amounts under different PI groups during 1982-2016.

Fig. 5. Temporal variations in the monthly standardized precipitation index (SPI) over the UTB for a 12-month SPI timescale (SPI12) for all precipitation products.

Fig. 7. Boxplots showing comparisons of Tmean between the five product estimates and the ground measurements at the daily timescale: (a) Pearson CC, (b) mean absolute error (MAE), and (c) root mean square error (RMSE).

Fig. 8. Temporal changes in the annual 95th percentile values of (a) Tmax and (b) Tmean, and temporal changes in the annual 5th percentile values of (c) Tmin for the different SAT datasets.

Fig. 9. Comparisons between observed values and the various daily SAT datasets in the number of days estimated above the 95th percentile for Tmax and Tmean and below the 5th percentile for Tmin: (a) annual number of days with daily Tmax above 28.4°C, (b) annual number of days with daily Tmean above 21.1°C, and (c) annual number of days with Tmin below 8.2°C. In this study, values beyond the specified SAT thresholds (the 95th and 5th percentile values of the observed data) denote the occurrence of extreme SAT.

Table 1. Summary of dataset characteristics considered in this study
[…] Tmax, Tmin, Tmean; Funk et al. (2019)
Note: EWEMBI: the EartH2Observe, WFDEI, and ERA-Interim reanalysis data Merged and Bias-corrected for the Inter-Sectoral Impact Model Intercomparison Project; CRU: Climatic Research Unit; ERA5: ECMWF Re-Analysis version 5; GPCC: Global Precipitation Climatology Centre; CPC: Climate Prediction Center; NCEP: National Centers for Environmental Prediction; WFDEI: WATCH Forcing Data methodology applied to ERA-Interim data; CHIRPS: Climate Hazards Group InfraRed Precipitation with Station data; CHIRTS: Climate Hazards Group InfraRed Temperature with Station data. Pre denotes precipitation.

Table 2. Statistical indices used for evaluating multiple meteorological products
Lemma et al. (2019), Gebremicael et al. (2019)
[…] EWEMBI for the different gauge stations range from 0.48 to 0.67, −23.6% to 37%, 5.3 to 7.6 mm, and 2 to 3.5 mm, respectively. Compared with EWEMBI, the CHIRPS and CPC products perform well, with CC values greater than 0.5. The likely reason for the better performance of CHIRPSv8 could be its high spatial resolution (0.05°) compared to the other products and the CHIRPS algorithm, which combines the bias-corrected CHIRP with station observations. Similar studies in the upper Blue Nile basin, Ethiopia, by Sahlu et al. (2017), Gebremicael et al. (2019), and Lemma et al. (2019) also showed that the CHIRPSv8 product captures the ground observations better than other satellite products. The averages and ranges (in brackets) of CC, Pbias, RMSE, and MAE of CHIRPS for all stations are 0.57 (0.45 to 0.7), −4% (−24.4 to 22%), 7.1 mm (5.8 to 10.6 mm), and 2.8 mm (2.1 to 3.8 mm), respectively. Similarly, average values and ranges of 0.5 (0.3 to 0.7), −16.8% (−31.1 to 23.5%), 6.8 mm (5.5 to 7.7 mm), and 2.7 mm (1.8 to 3.6 mm) for CC, Pbias, RMSE, and MAE

Table 3. Ranges and means of the statistical metrics derived from the comparison between the different precipitation datasets and the observed data at the monthly timescale. The range values are shown in parentheses.

Table 4. Mean of the statistical metrics resulting from the comparison between the different interpolated gridded rainfall datasets and the observed data at the monthly timescale

Table 5. Statistical summary of EWEMBI, CPC, and CHIRPS precipitation products in capturing different PI groups

Table 6. Comparison of Tmax and Tmin between the four products and the observed values at daily and monthly timescales

Table 7. 95th percentile record for daily Tmax and Tmean and 5th percentile record for daily Tmin of the SAT products over the UTB during 1982-2016