
Lake Taihu is one of the largest freshwater lakes in China. It is located in the southeast portion of the Yangtze River Delta (30°55′–31°30′N, 119°55′–120°40′E), covering an area of 2338 km^{2} with an average water depth of 2 m. Following the rapid development of China’s economy in the previous 30 years, increasing volumes of industrial, agricultural, and domestic sewage from surrounding cities have been discharged continually into Lake Taihu. This process has caused the water to become eutrophic, resulting in frequent occurrence of largescale cyanobacterial blooms (Li et al., 2016; Shi et al., 2017).

The Department of Ecology and Environment of Jiangsu Province established 19 water quality buoy sites in Lake Taihu (as shown in Fig. 1) to automatically receive water quality data. Although few in number, these sites are generally distributed uniformly throughout the lake, and the observations can represent the water quality distribution in the lake. All the buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600). The observational data include 14 water quality parameters such as Chla concentration, algae density, and dissolved oxygen concentration. The observation instrument is calibrated once a week, and the actual water sample comparison experiment is carried out at the same time to ensure the stability and reliability of the data. The Chla selected in this study represent instantaneous values observed simultaneously with the time of overpass (1130 BT) of the GF1 satellite.

The GF1 satellite, which is the first highresolution satellite developed by China, was launched on 26 April 2013. This satellite is equipped with a 2mresolution panchromatic sensor, 8mresolution multispectral sensor, and four 16mresolution multispectral WFV sensors. This study uses the data acquired by the 16mresolution WFV sensors. Table 1 lists the basic parameters of the WFV sensors onboard GF1.
Spectral band Spectral range (μm) Spatial resolution (m) Swath width (km) Swaying capacity Revisit period (day) B_{1} 0.45–0.52 16 800 (4 cameras) ±35° 2 B_{2} 0.52–0.59 B_{3} 0.63–0.69 B_{4} 0.77–0.89 Table 1. Parameters of the GF1 WFV sensors
The GF1 WFV data are acquired from January 2018 to May 2019 and include the TOA reflectance at four bands. The 27 GF1 images selected for model training and crossvalidation covered 18 days (Table 2). Furthermore, the TOA reflectances observed by GF1 on 4 June and 13 December 2019 are also selected as model input data for increasing the node impurity (INIP) Chla retrieval. These data are used only for independent testing of the performance of the models, that is, not for training or crossvalidation purposes. All images underwent orthorectification, radiometric calibration, and image mosaicking processing.
Year Date Number of image 2018 12 January 2 5, 6, 13, and 23 February 6 8 and 28 April 2 15 and 23 May 3 25 June, 19 and 20 July 5 6 and 27 October 2 18 December 2 2019 17 and 24 January 3 Table 2. GF1 images selected for model training and crossvalidation
For comparison, Earth Observing System (EOS) Moderate Resolution Imaging Spectroradiometer (MODIS) data are also collected for retrieving cyanobacterial blooms. The commonly used normalized difference vegetation index (NDVI) is used to extract cyanobacterial bloom information from all available MODIS data collected during 2007–2018. In this paper, the use of cyanobacterial blooms in the area and the times of occurrence of two indicators, wherein the indicator area is determined by setting a threshold NDVI, and the times are determined based on the area (Li et al., 2016).

From all the measured Chla concentration acquired at the time of overflight of the GF1 satellite on the 18 days, excluding 152 missing and abnormal data (mainly the influence of the cloud), 190 matching data elements are obtained to form the Chla sample dataset for model training and crossvalidation purposes. Grouped by season, 57 spring (March–May), 30 summer (June–August), 20 autumn (September–November), and 83 winter (December–February) sample subsets of measured Chla are obtained. Additionally, the measured Chla on 4 June and 13 December 2019, are used to independently test the performance of the models. Figure 2 shows the frequency distributions of the Chla sample dataset. It can be seen that the distribution of Chla concentration is mainly concentrated between 4 and 10 mg m^{−3}, of which the winter dataset accounts for the highest proportion of 95.2%, followed by the whole year dataset with 88.4%, and the summer dataset with the lowest proportion of 73.4%. The maximum Chla concentration is 37.4 mg m^{−3}, which appears in autumn; and the minimum is 0.9 mg m^{−3}, which appears in spring. From the mean Chla concentration in the four seasons, the maximum is 9.6 mg m^{−3} in summer, followed by autumn (8.0 mg m^{−3}), and the minimum in winter (6.6 mg m^{−3}). The distribution range of the Chla sample dataset used for modeling is reasonable, and the seasonal concentration changes are also consistent with the actual situation, which can be used for model training and crossvalidation.
Figure 2. Statistical distributions of frequency for the Chla sample dataset constructed by (a) the whole year dataset, (b) the spring dataset, (c) the summer dataset, (d) the autumn dataset, and (e) the winter dataset. The percent value is the ratio of the count of each Chla to the total count of the samples in the dataset. All Chla widths are 2 mg m^{−3} for each panel.
The GF1 WFV TOA reflectance is used as the input for the RF model. It is considered reasonable to use the reflectance in the four GF1 WFV bands for water quality parameter retrieval because the reflectance of a single band is a complex function of many parameters (Kong et al., 2017). Following Fang et al. (2019), considering the 4 single bands and the main vegetation index commonly used to retrieve Chla, and including some other wave band combinations of the GF1 WFV, 39 variables constructed are selected as the latent variables for model filtering (Table 3).
Variable name Input Quantity B_{i} (single wave band) B_{1}, B_{2}, B_{3}, B_{4} 4 EVI (enhanced vegetation index) $ \mathrm{E}\mathrm{V}\mathrm{I}\mathrm{ }=\mathrm{ }2.5\mathrm{ }\times \dfrac{{B}_{4}{B}_{3}}{{B}_{4}+ 6 \times {B}_{3} 7.5 \times {B}_{1}+ 1}. $ 1 NDVI_{ (i, j)} (normalized vegetation index) NDVI_{ (1, 2)} = $ \left({B}_{1}{B}_{2}\right)/\left({B}_{1}+{B}_{2}\right) $, NDVI_{ (1, 3)} = $ \left({B}_{1}{B}_{3}\right)/\left({B}_{1}+{B}_{3}\right) $,
NDVI_{ (1, 4)} = $ \left({B}_{1}{B}_{4}\right)/\left({B}_{1}+{B}_{4}\right) $, NDVI_{ (2, 3)} = $ \left({B}_{2}{B}_{3}\right)/\left({B}_{2}+{B}_{3}\right) $,
NDVI_{ (2, 4)} = $ \left({B}_{2}{B}_{4}\right)/\left({B}_{2}+{B}_{4}\right) $, NDVI_{ (3, 4)} = $ \left({B}_{3}{B}_{4}\right)/\left({B}_{3}+{B}_{4}\right) $.6 DVI_{ (i, j)} (vegetation index difference) DVI_{ (1, 2)} = $ {B}_{1}{B}_{2} $, DVI_{ (1, 3)} = $ {B}_{1}{B}_{3} $, DVI_{ (1, 4)} = $ {B}_{1}{B}_{4} $,
DVI_{ (2, 3)} = $ {B}_{2}{B}_{3} $, DVI_{ (2, 4)} = $ {B}_{2}{B}_{4} $, DVI_{ (3, 4)} = $ {B}_{3}{B}_{4} $.6 RVI_{ (i, j)} (vegetation index ratio) RVI_{ (1, 2)} = $ {B}_{1}/{B}_{2} $, RVI_{ (1, 3)} = $ {B}_{1}/{B}_{3} $, RVI_{ (1, 4)} = $ {B}_{1}/{B}_{4} $,
RVI_{ (2, 3)} = $ {B}_{2}/{B}_{3} $, RVI_{ (2, 4)} = $ {B}_{2}/{B}_{4} $, RVI_{ (3, 4)} = $ {B}_{3}/{B}_{4} $.6 VI (other) VI_{ (4, 1, 2, 3)} = $ {B}_{4}/\left({B}_{1}+{B}_{2}+{B}_{3}\right) $, VI_{ (1, 2, 3, 4)} = $ {B}_{1}/\left({B}_{2}+{B}_{3}+{B}_{4}\right) $,
VI_{ (2, 1, 3, 4)} = $ {B}_{2}/\left({B}_{1}+{B}_{3}+{B}_{4}\right) $, VI_{ (3, 1, 2, 4)} = $ {B}_{3}/\left({B}_{1}+{B}_{2}+{B}_{4}\right) $,
VI_{ (1, 2, 3)} = $ {B}_{1}/\left({B}_{2}+{B}_{3}\right) $, VI_{ (1, 2, 4)} = $ {B}_{1}/\left({B}_{2}+{B}_{4}\right) $, VI_{ (1, 3, 4)} = $ {B}_{1}/\left({B}_{3}+{B}_{4}\right) $,
VI_{ (2, 1, 3)} = $ {B}_{2}/\left({B}_{1}+{B}_{3}\right) $, VI_{ (2, 1, 4)} = $ {B}_{2}/\left({B}_{1}+{B}_{4}\right) $, VI_{ (2, 3, 4)} = $ {B}_{2}/\left({B}_{3}+{B}_{4}\right) $,
VI_{ (3, 1, 2)} = $ {B}_{3}/\left({B}_{1}+{B}_{2}\right) $, VI_{ (3, 2, 4)} = $ {B}_{3}/\left({B}_{2}+{B}_{4}\right) $, VI_{ (3, 1, 4)} = $ {B}_{3}/\left({B}_{1}+{B}_{4}\right) $,
VI_{ (4, 2, 3)} = $ {B}_{4}/\left({B}_{2}+{B}_{3}\right) $, VI_{ (4, 1, 3)} = $ {B}_{4}/\left({B}_{1}+{B}_{3}\right) $, VI_{ (4, 1, 2)} = ${B}_{4}/\left({B}_{1}+{B}_{2}\right) .$16 B_{1} refers to the blue wave band (0.45–0.52 μm), B_{2} refers to the green wave band (0.52–0.59 μm), B_{3} refers to the red wave band (0.63–0.69 μm), and B_{4} refers to the nearinfrared wave band (0.77–0.89 μm). The “i, j” in B_{i}, NDVI_{(i, j)}, DVI_{(i, j)}, and RVI_{(i, j)} refer to the number of different bands respectively. Table 3. The 39 latent variables involved in the random forest modeling

The mathematical details and structure of the RF algorithm have been discussed elsewhere (Breiman, 2001; Iverson et al., 2008); therefore, only a brief introduction to the RF algorithm is given here. The RF is a type of supervised ensemble ML technique that uses multiple decision trees and bootstrap aggregation to provide a nonparametric, multivariable, and nonlinear regression. First, “k” features are selected at random from the total features and used to calculate the root node via the best split approach. Then, a tree is constructed by using the root node. Multiple randomly constructed trees generated from the above process are then used to build multiple decision trees. Finally, each prediction from the multiple trees is merged to obtain the average results (Liaw and Wiener, 2002).
The RF model used here is developed by incorporating insitu Chla and TOA reflectance to estimate the Chla concentration. The input variables included the insitu Chla, reflectance variables of band combinations of the GF1 WFV, and the latitude and longitude coordinates of the water quality buoy sites. The use of latitude and longitude as variables accounts for the spatiotemporal variation of the Chla. The performance of the models is compared by using different settings of the number of trees (n_{tree}) and the number of variable per level (m_{try}), and the optimal model performance is achieved when n_{tree} is assigned the value of 600 and m_{try} is assigned the value of onethird of the total number of variables of each model. It is noted that the RF is a supervised ML algorithm; thus, although the insitu Chla is critical for model fitting, it is not necessary for model application.
The Monte Carlo crossvalidation (MCCV) technique is used to assess the potential of model fitting and model robustness (Ghorbanzadeh et al., 2020). The MCCV technique is an asymptotically consistent method for model selection. It can avoid an unnecessary large model, decrease the risk of overfitting in model training, and has a relatively high probability for choosing the most appropriate model. Here, the sample dataset is split randomly into two subsample datasets: one subset containing 25% of the total samples is used to validate the model, and the other subset containing the remaining 75% of the samples is used to train the model. Such an independent model training and test procedure are repeated multiple times, and the average of these test results is taken as an indicator with which to verify the accuracy of the model. Several statistical indicators are used to quantitatively evaluate the model performance, that are, the coefficient of determination (R^{2}), the rootmeansquare error (RMSE), and the mean absolute percentage error (MAPE) between the crossvalidation predicted and observed Chla. The MAPE is calculated as follows:
$${\rm{ MAPE}}=\frac{1}{n}\sum _{i=1}^{n}\frac{\left{\rm Chla}^{\rm obs}\left(i\right){\rm Chla}^{\rm pre}\left(i\right)\right}{{\rm Chla}^{\rm obs}\left(i\right)}\times 100 \text%, $$ (1) where
$ n $ is the total number of samples, and${\rm Chla}^{\rm obs}$ and${\rm Chla}^{\rm pre}$ are the observed and predicated Chla, respectively. 
The selection of the most effective band plays an important role in accurate estimation of Chla (Goyens et al., 2013; Jiang et al., 2013). Some previous studies have determined the Chla spectral characteristics and sensitive bands of lake water bodies through statistical analysis or field measurement (Wu et al., 2009; Yang et al., 2011). However, for relatively turbid inland water bodies, owing to the presence of phytoplankton, suspended matter, dissolved organic matter, and many other substances that affect the absorption spectrum of Chla, all components mix and interact with each other such that their spectral characteristics are more complicated. The actual measured Chla spectral reflectance and absorption peaks between these water bodies will also have significant differences (Luo et al., 2017). Moreover, most such measurements have been undertaken in specific water bodies and therefore they lack extensive validation.
The RF algorithm can be used to quantitatively assess the relative importance of each variable to the model. Thus, certain irrelevant or redundant characteristic variables can be excluded from the initial large number of variables, and a small number of characteristic variables that contribute most to the model can be filtered to obtain a more accurate model. The contribution of each input variable in each model is evaluated based on two factors: the percentage for increasing the mean square error (IMSE) and INIP. As the percentage for IMSE or the INIP increases, the contribution of the variable to the Chla retrieval increases. Previous studies generally use only one of the indicators of IMSE and INIP to select the important variables, but this practice could be inaccurate. Therefore, a novel relative importance evaluation index (RIEI), which is developed by combining IMSE and INIP to measure the relative importance of the variables, is used to filter the important variables as model inputs. The RIEI is calculated as follows:
$$ {\rm{RIEI}}=\left(\frac{{\rm IMSE}_{i}{\rm IMSE}_{\rm min}}{{\rm IMSE}_{\rm max}{\rm IMSE}_{\rm min}}+\frac{{\rm INIP}_{i}{\rm INIP}_{\rm min}}{{\rm INIP}_{\rm max}{\rm INIP}_{\rm min}}\right)/2 , $$ (2) where
$ {\rm IMSE}_{i} $ is the IMSE value of the ith model of the 20 training models with the highest accuracy (HAMs) from MCCV,$ {\rm IMSE}_{\rm min} $ is the minimum of the 20 HAMs, and$ {\rm IMSE}_{\rm max} $ is the maximum of the 20 HAMs;$ {\rm INIP}_{i} $ is the ith INIP value of the 20 HAMs,$ {\rm INIP}_{\rm min} $ is the minimum of the 20 HAMs, and$ {\rm INIP}_{\rm max} $ is the maximum of the 20 HAMs.From Eq. (2), the RIEI value is proportional to the
$ {\rm IMSE}_{i} $ and$ {\rm INIP}_{i} $ , that is, the larger the$ {\rm IMSE}_{i} $ or$ {\rm INIP}_{i} $ , the larger the RIEI. Similarly, we can also conclude that as the RIEI increases, the contribution of the corresponding variable to Chla retrieval increases. 
First, a representative training dataset is needed for RF model training to estimate Chla concentration. The GF1 WFV TOA reflectance and the synchronously observed Chla for 18 days from January 2018 to May 2019 are used as the training dataset. The 190 measured Chla are selected and grouped by season to obtain 5 sample subsets for model training. Threequarters of the data are selected at random from each sample subset as the training sample set, and the remaining quarter of the data are taken as the validation sample set.
The selected important variables are considered as the input variables and the RF model is trained separately, where the value of parameter m_{try} is set as onethird of the number of characteristic variables, and it adopts the four values of 1, 2, 3, and 4 successively. The parameter n_{tree} adopts the value of 400, 500, or 600 based on the previous error analysis results. Each set of corresponding parameter combinations (m_{try}, n_{tree}) is repeated multiple times in the modeling procedure, and the parameter combination with the highest accuracy of each model is selected. A detailed flowchart of the process for determining the optimal RF model is shown in Fig. 3.

First, we select the band combinations based on the single index of either IMSE or INIP for comparison purposes. The abovedetermined parameters m_{try} and n_{tree} are used for modeling and optimization, and IMSE and INIP values and their ranking of each variable in each model are obtained separately, based on which the top ranked variables are regarded as the important variables of model input. For comparability of the results, we select the same number of important variables in each model. The result is that MOD_{Year}, MOD_{Spr}, MOD_{Sum}, MOD_{Aut}, and MOD_{Win} have 6, 4, 4, 9, and 4 important variables, respectively (Tables 4, 5).
Model Important variable Number MOD_{Year} DVI_{ (2, 3)}, VI_{ (2, 1, 3)}, RVI_{ (2, 3)}, NDVI_{ (2, 3)}, VI_{ (2, 1, 3, 4)}, VI_{ (2, 3, 4)} 6 MOD_{Spr} VI_{ (2, 1, 4)}, DVI_{ (2, 3)}, VI_{ (2, 1, 3)}, VI_{ (2, 1, 3, 4)} 4 MOD_{Sum} VI_{ (3, 1, 2)}, RVI_{ (1, 3)}, DVI_{ (2, 3)}, RVI_{ (2, 3)} 4 MOD_{Aut} VI_{ (2, 3, 4)}, RVI_{ (1, 2)}, VI_{ (1, 2, 3, 4)}, VI_{ (1, 2, 3)}, VI_{ (4, 1, 2)}, NDVI_{ (2, 4)}, VI_{ (1, 2, 4)}, VI_{ (1, 3, 4)}, DVI_{ (1, 2)} 9 MOD_{Win} RVI_{ (2, 3)}, RVI_{ (2, 4)}, VI_{ (2, 1, 3, 4)}, VI_{ (2, 3, 4)} 4 Table 4. Model filtering results based on the IMSE indicator
Model Important variable Number MOD_{Year} VI_{ (2, 1, 3, 4)}, VI_{ (2, 3, 4)}, VI_{ (2, 1, 3)}, B_{4}, VI_{ (2, 1, 4)}, DVI_{ (2, 3)} 6 MOD_{Spr} VI_{ (4, 1, 3)}, NDVI_{ (1, 4)}, VI_{ (4, 1, 2, 3)}, EVI 4 MOD_{Sum} RVI_{ (1, 3)}, VI_{ (3, 1, 2)}, VI_{ (2, 3, 4)}, VI_{ (1, 3, 4)} 4 MOD_{Aut} VI_{ (2, 1, 3)}, RVI_{ (2, 3)}, DVI_{ (1, 2)}, VI_{ (4, 1, 2, 3)}, VI_{ (4, 1, 2)}, DVI_{ (2, 3)}, VI_{ (4, 2, 3)}, B_{4}, NDVI_{ (2, 3)} 9 MOD_{Win} RVI_{ (2, 3)}, VI_{ (2, 3, 4)}, VI_{ (3, 1, 2)}, RVI_{ (2, 4)} 4 Table 5. Model filtering results based on the INIP indicator
From the results of the above two types of filtering, it can be found that the important variables of each model are significantly different. The degree of coincidence of the important variables of the five models is only 67%, 0, 50%, 22%, and 75%, and the ranking order of the variables is markedly different. Only the variables ranked first in model MOD_{Win} are the same, while the order of the variables of the remaining four models are all different. From this analysis, it is not difficult to conclude that it might be inappropriate to use only one of the indicators to measure the importance of the variables and to use this as a criterion for selecting important variables.
In the following, we determine the effective band combinations based on the RIEI. As above, the results of the evaluation of the importance of the variables in MOD_{Year}, MOD_{Spr}, MOD_{Sum}, MOD_{Aut}, and MOD_{Win} are obtained according to the RIEI. As an example, the ranking of the importance of the variables in MOD_{Sum} and MOD_{Win} is shown in Figs. 4a and 4b, respectively.
Figure 4. Importance of each input variable in the RF training models of (a) MOD_{Sum} and (b) MOD_{Win} derived from the RIEI. For each model, only the first 15 RIEI values of the 29 latent variables are shown as examples. The larger the RIEI value, the larger the contribution of the variable to the Chla concentration retrieval.
According to the results of the importance evaluation, the important characteristic variables of the five models are filtered out. The results, including MOD_{Year}, MOD_{Spr}, MOD_{Sum}, MOD_{Aut}, and MOD_{Win}, have 6, 8, 12, 5, and 5 important variables respectively (Table 6).
Model Important variable Number MOD_{Year} VI_{ (2, 1, 3, 4)}, VI_{ (2, 3, 4)}, VI_{ (2, 1, 3)}, B_{4}, VI_{ (2, 1, 4)}, DVI_{ (2, 3)} 6 MOD_{Spr} DVI_{ (1, 2)}, VI_{ (2, 1, 4)}, VI_{ (2, 1, 3, 4)}, RVI_{ (1, 2)}, VI_{ (2, 1, 3)}, DVI_{ (2, 3)}, VI_{ (1, 2, 3)}, RVI_{ (2, 3)} 8 MOD_{Sum} RVI_{ (1, 3)}, VI_{ (3, 1, 2)}, RVI_{ (2, 3)}, VI_{ (2, 3, 4)}, VI_{ (1, 3, 4)}, DVI_{ (2, 3)}, DVI_{ (1, 3)}, VI_{ (1, 2, 3, 4)}, VI_{ (2, 1, 3, 4)}, VI_{ (1, 2, 3)}, VI_{ (2, 1, 3)}, VI_{ (1, 2, 3, 4)} 12 MOD_{Aut} VI_{ (4, 1, 2)}, DVI_{ (1, 2)}, VI_{ (2, 3, 4)}, VI_{ (4, 1, 2, 3)}, VI_{ (1, 2, 3, 4)} 5 MOD_{Win} RVI_{ (2, 3)}, VI_{ (2, 3, 4)}, RVI_{ (2, 4)}, VI_{ (2, 1, 3, 4)}, DVI_{ (2, 3)} 5 Table 6. Model filtering results based on the RIEI indicator
From the band combinations in the five models, we can obtain some interesting findings. The first is that each of the models contains all four bands of GF1 WFV, indicating that each band of GF1 WFV contains the Chla information. In fact, the spectral ranges of 4 multispectral bands of GF1 WFV are very similar to that of the 4 bands of Landsat TM/ETM+ [i.e., blue (0.45–0.52 µm), green (0.52–0.60 µm), red (0.63–0.69 µm), and NIR (0.76–0.90 µm)]. They are the most suitable Landsat TM/ETM+ bands for characterizing Chla in complex coastal waters (Qi et al., 2014). Recent studies on the inversion of Chla also show that various combinations of the 4 GF1 WFV bands are correlated with Chla in various seasons or regions (Xie et al., 2019). Secondly, in the variables of the five models, there is only one singleband variable, and the rest is multiband combination variables, representing that the performance of the multiband combination methods is better than the singleband methods. This similar result has been confirmed by extensive previous studies (Cheng et al., 2013). In addition, we can find that each model contains more band ratio variables, such as ratio vegetation index [RVI _{(1, 2)}, RVI _{(1, 3)}, RVI_{ (2, 3)}, and RVI _{(2, 4)}], and some difference vegetation indices, such as DVI _{(1, 2)}, DVI_{ (2, 3)}, etc. Chla is the primary photosynthetic pigment in terrestrial green plants and phytoplankton in water, which is strongly absorbent of the blue (B_{1}) and red (B_{3}) spectral regions, and highly reflective of the green (B_{2}) and NIR (B_{4}) spectral regions, indicating a similarity between the spectral reflectance of algaecontaining water and terrestrial vegetation (Cheng et al., 2013). Therefore, it is reasonable to include some vegetation index or similar band ratio variables in the model. Various types of vegetation index and similar band ratios are used for Chla retrieval in previous studies (Cheng et al., 2013; Xie et al., 2019). However, the model band combinations used by different algorithms may vary owing to the great spatial and temporal changes in the biophysical characteristics of turbid waters. Moreover, due to the variety of sampling times and positions, even the models of the same water body may vary from different authors. And the band combinations from different authors may also differ, even if the same modelbuilding method is used (Cheng et al., 2013). Therefore, it is reasonable to believe that the RF models with the optimal band combinations of GF1 WFV are feasible to estimate Chla in Lake Taihu.
Although these findings are encouraging, the RF derived algorithm is also a black box model and has main drawback that it is not easy to give a clear interpretation of decision procedure. However, this drawback could be compensated to some extent by initially selecting variables with physical meaningful and selecting the optimal band combinations according to the variable importance.

The R^{2} values of MOD_{Year}, MOD_{Spr}, MOD_{Sum}, MOD_{Aut}, and MOD_{Win} with the highest accuracy, are listed in Table 7. Additional, the model training is also performed by using the variables filtered by IMSE and INIP, separately, and the results are also presented in Table 7 for comparison. Among the five RIEI trained models, the R^{2} values of MOD_{Year}, MOD_{Spr}, MOD_{Sum}, and MOD_{Aut} are all more than 0.9, R^{2} of MOD_{Win} is 0.89. In the IMSE trained models, only R^{2} of MOD_{Year} is more than 0.9, and only two of the INIP trained models (MOD_{Spr} and MOD_{Aut}) have R^{2} more than 0.9. The maximum R^{2} of the five RIEI trained models is 0.98, which is slightly smaller than the maximum value (0.99) of any INIP trained model, but markedly greater than the maximum value (0.91) of any IMSE trained model. The minimum R^{2} of the five RIEI trained models is 0.89, which is markedly greater than that of any model trained with the other two indicators. Moreover, among the five RIEI trained models, only R^{2} of MOD_{Aut} is slightly lower than that of any INIP trained model, and R^{2} values of the other models are higher than or equal to those of the models trained with the other two indicators. The results indicate that the fitting of the models trained by the RIEI is better than that achieved for the models trained by a single indicator.
Indicator MOD_{Year} MOD_{Spr} MOD_{Sum} MOD_{Aut} MOD_{Win} RIEI 0.91 0.95 0.91 0.98 0.89 IMSE 0.91 0.87 0.84 0.89 0.85 INIP 0.89 0.91 0.83 0.99 0.85 Table 7. Coefficient of determination (R^{2}) of the five RF models trained by using the RIEI and single IMSE and INIP indicators
To illustrate intuitively the performance of the models trained with the different indicators, scatter plots of the fitting results of the RF models are shown in Fig. 5. It can be seen that the performance of the RIEI trained models is relatively stable with the RMSEs of 1.24–2.67 mg m^{−3}. The RMSEs of the models are all less than 2 mg m^{−3}, except for that of MOD_{Sum}, and the MAPEs fall in the range of 16.83%–27.26%. However, the maximum RMSE of the IMSE trained models is 5.98 mg m^{−3}, and the RMSE of the INIP trained MOD_{Aut} is 5.14 mg m^{−3}. The RIEI trained models show that the performance is superior to the single indicator trained model.

The MCCV is used to validate the performance of the RF models. Comparison of the Chla derived from insitu measurements and that retrieved by the RF models (MOD_{Year}, MOD_{Spr}, MOD_{Sum}, MOD_{Aut}, and MOD_{Win}) is shown in Fig. 6. Despite the limited number of matchup data, the RF models derived Chla generally agree well with the insitu measurements. There is high correlation between the Chla retrieved by each model and the insitu measured values, and the RMSE is small. The MCCV R^{2} of the five models is in the range of 0.71–0.94 and the RMSEs are in the range of 1.40–4.30 mg m^{−3}. Among them, MOD_{Aut} has the highest accuracy (R^{2} = 0.94, RMSE = 1.66 mg m^{−3}) followed by MOD_{Spr} and MOD_{Sum} (R^{2} values of 0.88 and 0.88 and RMSE values of 2.94 and 2.59 mg m^{−3}, respectively). The accuracy of MOD_{Year} is lowest (R^{2} =^{ } 0.71) and its RMSE (4.30 mg m^{−3}) is markedly higher than that of the other four seasonal models. The MAPEs of the four seasonal models vary between 18.47% and 25.59%, substantially lower than the value of 38.32% of MOD_{Year}. This result shows that the RF models using these optimized GF1 WFV band combinations can accurately estimate the Chla in Lake Taihu. Among them, the performance of the model with all samples is obviously inferior to that of the seasonal models, and the performance of MOD_{Aut} is better than that of the other three seasonal models.
Figure 6. Scatter plots of the crossvalidation results of our five models: (a) MOD_{Year}, (b) MOD_{Spr}, (c) MOD_{Sum}, (d) MOD_{Aut}, and (e) MOD_{Win}.
Examples of the performance of the models for nonbloom and bloom cases are shown in Fig. 7. The two GF1 WFV RGB images composed of three channels B_{1}, B_{2}, and B_{3} show two situations: no blooms in winter (Fig. 7a) and blooms in autumn (Fig. 7b), respectively. The retrieved Chla distribution for the two cases is shown in Figs. 7c, d, respectively. For the nonbloom case on 13 February 2018, the spatial distribution of the Chla throughout the lake is reasonably uniform, with the maximum and minimum values being approximately 8 and 3 mg m^{−3}, respectively. In the case of algal bloom on 27 October 2018, the spatial distribution of the Chla is highly consistent with that of the cyanobacterial bloom, and spatial heterogeneity is substantial, with the maximum and minimum values being approximately 30 and 3 mg m^{−3}, respectively. It demonstrates that our models can estimate the Chla in a large range of water and reveal its distribution in details.
Figure 7. Performances of the Chla concentration retrieval RF models in cases without algal bloom (13 February 2018) and with algal bloom (27 October 2018): (a) and (b) GF1 WFV RGB images without algal bloom (in winter) and with cyanobacterial algal bloom (in autumn), respectively, (c) and (d) Chla concentration for the same two cases derived by using the RF models, respectively.
To further evaluate the performance of our RF models, the GF1 WFV TOA reflectance on 4 June and 13 December 2019, are selected as input data, and the insitu measured Chla at the corresponding times are taken as independent test datasets. These datasets are not used for model training or crossvalidation purposes. The two datasets also represent two different situations: blooms in summer (4 June 2019) and no blooms in winter (13 December 2019), and the Chla of these 2 days are retrieved from MOD_{Win} and MOD_{Sum}, respectively. Figure 8 shows the GF1 WFV RGB images, distribution of retrieved Chla, and scatter plots of the insitu measured and the models retrieved Chla. It can be seen in the case without algal blooms that the distribution of MOD_{Win} retrieved Chla is more uniform, whereas in the case with algal blooms, the distribution of MOD_{Sum} derived Chla is consistent with that of the cyanobacterial blooms. For MOD_{Win} and MOD_{Sum}, tested with independent samples, R^{2} are 0.49 and 0.61, RMSEs are 2.34 and 5.31 mg m^{−3}, and MAPEs are 32.74% and 31.83%, respectively. Overall, the models estimate the Chla well, although R^{2} values are reduced slightly, and RMSEs and MAPEs are also slightly larger.
Figure 8. Validations of the RF models using independent sample datasets in the case of no algal blooms (13 December 2019) and algal blooms (4 June 2019): (a) and (b) GF1 WFV RGB images, (c) and (d) distribution of Chla concentration retrieved by RF models (MOD_{Win} and MOD_{Sum}), and (e) and (f) scatter plots of Chla concentration derived from insitu measurements versus those retrieved by using the models, respectively.

To investigate the annual variation of Chla in Lake Taihu, the seasonal models are used to retrieve the 15day Chla concentration in 2018, and the average value in the entire lake on each day is obtained, as shown in Fig. 9. Owing to the lack of matched GF1 WFV images, the estimated Chla for certain months (March, August, September, and November) is missing, but it can still be seen from Fig. 9 that the Chla in the lake exhibits an obvious seasonal trend. Although the lake shows relatively high Chla during all seasons, the Chla is substantially higher during summer (June–August) and autumn (September–November) in comparison with spring (March–May) and winter (December–February). The average Chla concentration in spring, summer, autumn, and winter is 7.7, 9.6, 8.6, and 7.1 mg m^{−3}, respectively. This temporal trend is largely consistent with previous research (Shi et al., 2017). As a comparison, Fig. 10 shows the monthly average area of the cyanobacterial bloom derived from the NDVI model using all available MODIS images collected during 2007–2018. The higher temporal resolution of MODIS allows for more frequent monitoring of the cyanobacterial blooms, and the time series of the area of cyanobacterial blooms can characterize the longterm trends in the cyanobacterial dynamics of Lake Taihu. Generally, as the Chla in the lake increases, the possibility of a cyanobacterial bloom increases. For cyanobacterial bloom areas, a clear seasonal cycle is observed in Lake Taihu, whereby cyanobacterial blooms occurred much more often during summer and autumn and less frequently in spring and winter. The extent of cyanobacterial bloom areas are also considerably higher in summer and autumn than in spring and winter. Under normal circumstances, after a longterm bloomfree period in winter, surviving cyanobacteria float to the water surface during spring, which means the Chla increases gradually and the intensity of the cyanobacterial bloom increases substantially. However, we find that the areas of cyanobacterial blooms in June and July, which are the time of year affected primarily by the Jianghuai Meiyu (a longterm rainy weather occurs almost every year in this area), are smaller than in May (Fig. 10). Even under the influence of the Jianghuai Meiyu, summer is still considered the season with the largest intensity and greatest areal extent of cyanobacterial blooms. The retrieved Chla and the observed cyanobacterial blooms in Lake Taihu are mutually corroborated by the temporal change trend.

To analyze the annual spatial distribution of Chla in Lake Taihu, the retrieved Chla concentration for 15 days in 2018 are averaged pixel by pixel to obtain a composite map of the spatial distribution of Chla in Lake Taihu in 2018 (Fig. 11a). Additionally, Fig. 11b shows the number of cyanobacteria blooms in each pixel obtained from MODIS. Although the number of matched GF1 WFV images (15) is far below that of the MODIS images (134) with cyanobacterial blooms, it can be seen that the spatial distribution of Chla retrieved by our models generally has high consistency with that of the number of cyanobacterial blooms. The northwestern part of the lake is the area with the greatest average Chla concentration in the year. Correspondingly, this area is also the area in which cyanobacterial blooms occur most frequently and where the water quality is the poorest. The eastern coastal area, with the best water quality, has the lowest Chla and fewer occurrences of cyanobacterial blooms. It should be noted that due to the abundance of aquatic plants, the Chla in the southeast corner is also relatively high and remains higher throughout the year (except in winter; Zhang Z. et al., 2018). These results are generally consistent with those of previous research (Qi et al., 2014).
Figure 11. Spatial distributions of (a) Chla concentration retrieved from RF models and (b) the cumulative number of occurrences of cyanobacterial blooms derived from EOS/MODIS in Lake Taihu in 2018. It is worth noting that only cyanobacterial blooms are concerned in (b) and usually there is only aquatic plants and no cyanobacterial blooms in southeast of Lake Taihu.
Figure 12 shows the seasonal spatial distribution of the retrieved Chla in Lake Taihu in 2018. The Chla exhibits obvious seasonal variability, which the Chla is substantially higher during summer and autumn than in spring and winter. Except for winter, there are obvious spatial changes in Chla in spring, summer, and autumn. In spring, the northwestern part is the area with the highest Chla, followed by the southern coastal and some central areas of the lake. In summer, the Chla in most areas of the northwest, west, and south of the lake is obviously higher, and only a few areas along the east coast have relatively low Chla. Compared with summer, the area with high Chla substantially reduces in autumn and mainly locates in the western coastal and northern of the lake. In winter, the Chla in the entire lake is markedly reduced, and the spatial distribution is more uniform. This result is consistent with the actual situation. Notably, the areas of high Chla in the three seasons other than winter are located mainly in the northwestern part of the lake, which is attributable mainly to the large number of rivers entering the lake and high density of urban sewage outfalls in these areas. The high nutrient concentration leads to serious eutrophication of the water body, which is beneficial to the growth of algal organisms (Dai et al., 2016; Zhu et al., 2018). Additionally, the spatial distribution pattern of Chla is also affected by meteorological conditions. The low and uniform distributions of Chla in winter are mainly due to the low temperature that causes algae in the water to almost stop growing (Xiong et al., 2012). The prevailing southeasterly wind in summer and autumn causes algal organisms to become concentrated in the northwest region of the lake (Qin et al., 2004; Li et al., 2016). This confirmed that the Chla spatial distribution patterns derived from the RF models appear reasonable.

Table 8 summarizes the model performances of different algorithms reported in previous studies on estimation of Chla in Lake Taihu using satellite. The spatial resolutions of the different models range from 16 m to 1 km, with most of the studies having coarse spatial resolutions of greater than 16 m. The CV R^{2} of the different models varies from 0.47 to 0.94, with most R^{2} values less than 0.88. The seasonal RF models captures about 86% average of the variability in Chla concentration in the samplebased CV, which is larger than the samplebased CV R^{2} of almost other models. The RMSE values of the seasonal models are generally lower than that of the other models. Our RF derived models also shows that the performance is comparable or superior to some previous models based on different ML algorithms (Zhang et al., 2009; Xu et al., 2019). Overall, the seasonal models have a robust and superior performance in estimating Chla with an extremely high spatial resolution of 16 m.
Reference Model R^{2} RMSE (mg m^{−3}) MAPE (%) Spatial resolution Satellite sensor Zhang et al. (2009) SVM 0.94 4.75 15.91 1 km Terra/Aqua MODIS LRM 0.80 9.28 26.71 1 km Terra/Aqua MODIS PCA 0.79 7.88 22.62 1 km Terra/Aqua MODIS Qi et al. (2014) EOF 0.47 1.81 – 1 km Terra/Aqua MODIS Song et al. (2017) BRM 0.72 – – 100 m HJ1A HSI NDCI 0.71 – – 100 m HJ1A HSI WCI 0.73 – – 100 m HJ1A HSI Xu et al. (2019) RF 0.80 4.89 – 30 m HJ1B CCD SVM 0.77 5.21 – 30 m HJ1B CCD BPANN 0.75 6.19 – 30 m HJ1B CCD DL 0.72 5.59 – 30 m HJ1B CCD This study Seasonal 0.88/0.88 2.94/2.59 24.17/25.59 16 m GF1 WFV RF 0.94/0.74 1.66/1.40 19.73/18.47 SVM: support vector machine model, LRM: linear regression model, PCA: principal component analysis model, EOF: empirical orthogonal function model, BRM: band ratio model, NDCI: normalized difference chlorophyll index model, WCI: water Chla index model, BPANN: back propagation artificial neural network, DL: deep learning, HIS: hyper spectral imagery, CCD: charge coupled device. Table 8. Model performances reported in some previous studies on estimation of Chla concentration in Lake Taihu using satellite

The greatest advantage of the proposed models developed in this study is its high spatial resolution of 16 m. The GF1 satellites can monitor the details of small patches of inland lakes better than the 250m spatial resolution of MODIS. The retrieval results show that the GF1 WFV images describe more details of the spatial distribution pattern of Chla in Lake Taihu than the MODIS images (Qi et al., 2014; Lary et al., 2016). Several previous studies have also shown the good performance of GF1 WFV in retrieving Chla over inland lakes (Zhu et al., 2017; Xu et al., 2020). Another difference from previous studies is that this study directly captures the Chla information from GF1 WFV TOA reflectance without an AC algorithm. This may avoid the need to retrieve the waterleaving radiance through AC, which is typically prone to large errors in turbid and highbiomass waters. The seasonal and spatial distribution of Chla in Lake Taihu estimated by our models are consistent with the previous studies (Qi et al., 2014; Lary et al., 2016). In addition, the insitu Chla used in this study is measured by YSI6600 instrument and fluorescence analysis method (Liu et al., 2010), which has the advantages of automatic acquisition and the results can be determined within 30 minutes. In contrast, the commonly used spectrophotometric analysis method is much less efficient, the method cannot automatically obtain results, and it may take nearly a day. Overall, the seasonal models, with its spatial resolution of 16 m, can retrieve the Chla with higher accuracy and reflect its distribution pattern and trend in the entire Lake Taihu.

Here, we develop seasonal models to estimate Chla in inland lakes with a high spatial resolution of 16 m by using the direct measurements of GF1 WFV TOA reflectance. Compared with the MODIS based models, the proposed models have a much higher spatial resolution and can provide more details about variations in Chla in Lake Taihu. However, there are some limitations and potential room for model improvements. Similar to other ML based methods, the RF derived algorithm is also a black box model, which is not easy to give an easy interpretation of decision procedure. Therefore, it is necessary to select more variables with physical significance for model training, like the indices from red color and near infrared channels. Furthermore, the slight difference in the center wavelength of the four GF1 WFV cameras may affect the NDVI value. This difference may have been taken into account when training the models with artificial intelligence method, and the output results of the models seem to have a high accuracy. Nevertheless, we will further study the physical mechanisms of these subtle effects by using radiation transfer modes to obtain more accurate models in the future. The insitu Chla measured by the fluorescence analysis method is used in the model development, having relatively smaller value than that of the spectrophotometric method (Liu et al., 2010), this may cause the Chla retrieved by our models are slightly lower. It is clearly that future efforts should be dedicated to compare with widely adopted methods [such as high performance liquid chromatography (HPLC) and spectrophotometric method] and evaluate its potential impact on the algorithm development to further improve the accuracy of the model. In addition, the limited number of samples and model run results also suggest the need for further improvement of our models. We anticipate collecting more insitu measured Chla and GF1 WFV images and carrying out experiments for other lakes in the near future. On a broader scale, the approaches and findings of this study may be extended to other inland eutrophication lakes such as Lake Chaohu, Lake Dianchi, and Lake Erie. Our results will help managers and decisionmakers account for and modify their strategies for controlling water eutrophication and cyanobacterial blooms in response to future climate change and human impacts.
Spectral band  Spectral range (μm)  Spatial resolution (m)  Swath width (km)  Swaying capacity  Revisit period (day) 
B_{1}  0.45–0.52  16  800 (4 cameras)  ±35°  2 
B_{2}  0.52–0.59  
B_{3}  0.63–0.69  
B_{4}  0.77–0.89 