Estimation of Chlorophyll-a Concentration in Lake Taihu from Gaofen-1 Wide-Field-of-View Data through a Machine Learning Trained Algorithm


  • Corresponding author: Yachun LI, jsqxlyc@163.com
  • Funds:

    Supported by the National Key Research and Development Program of China (2018YFC1506500) and Foundation for Key Scientific Research of Jiangsu Meteorological Bureau (KZ202003)

  • doi: 10.1007/s13351-022-1146-y


  • The Gaofen-1 (GF-1) satellite carries a wide-field-of-view (WFV) imager that observes the earth environment in four solar reflective bands at a spatial resolution of 16 m. Chlorophyll-a (Chl-a) concentration data for Lake Taihu, China, from 2018 to 2019 were collected and collocated with GF-1 satellite data. This study develops a general and reliable method for estimating Chl-a concentration from GF-1 WFV data under turbid inland water conditions. The collocated data are classified by season and used in random forest (RF) regression to train models for retrieving the lake Chl-a concentration. A composite index is developed to select the most important variables in the models. In terms of the coefficient of determination (R2) between retrievals and observations, the models trained for each season perform better than the model trained on the whole-year data. Specifically, the R2 values in spring, summer, autumn, and winter are 0.88, 0.88, 0.94, and 0.74, respectively, whereas that for the whole-year data is only 0.71. The Chl-a concentration in Lake Taihu exhibits an obvious seasonal change: it is highest in summer, followed by autumn and spring, and lowest in winter. The spatial distribution of Chl-a concentration also varies with season, with high concentrations occurring mainly in the northwest of the lake. The temporal and spatial changes of Chl-a concentration are largely consistent with the changes in the areas and times of cyanobacteria blooms derived from Moderate Resolution Imaging Spectroradiometer (MODIS) data. The proposed algorithm can be operated without a priori knowledge of atmospheric conditions and water quality. Our study also demonstrates that GF-1 data are increasingly valuable for monitoring the Chl-a concentration of inland water bodies in China at a high spatial resolution.
  • Fig. 1.  Distribution of the 19 water quality buoy sites in Lake Taihu, China. All buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600) and provide in-situ measurements of Chl-a concentration required for model training and validation.

    Fig. 2.  Frequency distributions of the Chl-a sample datasets: (a) the whole-year dataset, (b) the spring dataset, (c) the summer dataset, (d) the autumn dataset, and (e) the winter dataset. The percentage is the ratio of the sample count in each Chl-a bin to the total sample count in the dataset. The bin width is 2 mg m−3 in every panel.

    Fig. 3.  Detailed flowchart of the process for determining the optimal RF model.

    Fig. 4.  Importance of each input variable in the RF training models of (a) MODSum and (b) MODWin derived from the RIEI. For each model, only the first 15 RIEI values of the 29 latent variables are shown as examples. The larger the RIEI value, the larger the contribution of the variable to the Chl-a concentration retrieval.

    Fig. 5.  Performances of the five RF models (MODYear, MODSpr, MODSum, MODAut, and MODWin) trained with the RIEI, and the single IMSE and INIP indicators, using a test dataset.

    Fig. 6.  Scatter plots of the cross-validation results of our five models: (a) MODYear, (b) MODSpr, (c) MODSum, (d) MODAut, and (e) MODWin.

    Fig. 7.  Performances of the Chl-a concentration retrieval RF models in cases without algal bloom (13 February 2018) and with algal bloom (27 October 2018): (a) and (b) GF-1 WFV RGB images without algal bloom (in winter) and with cyanobacterial algal bloom (in autumn), respectively, (c) and (d) Chl-a concentration for the same two cases derived by using the RF models, respectively.

    Fig. 8.  Validations of the RF models using independent sample datasets in the case of no algal blooms (13 December 2019) and algal blooms (4 June 2019): (a) and (b) GF-1 WFV RGB images, (c) and (d) distribution of Chl-a concentration retrieved by RF models (MODWin and MODSum), and (e) and (f) scatter plots of Chl-a concentration derived from in-situ measurements versus those retrieved by using the models, respectively.

    Fig. 9.  Temporal variation of Chl-a concentration in Lake Taihu in 2018 estimated by using the RF models. The red line is the 2-day moving average of the Chl-a concentration.

    Fig. 10.  Temporal variation in the monthly average area of cyanobacterial blooms in Lake Taihu derived from EOS/MODIS acquired during 2007–2018. The red line is the 2-month moving average of the cyanobacterial bloom area.

    Fig. 11.  Spatial distributions of (a) Chl-a concentration retrieved from RF models and (b) the cumulative number of occurrences of cyanobacterial blooms derived from EOS/MODIS in Lake Taihu in 2018. Note that (b) considers only cyanobacterial blooms; the southeast of Lake Taihu usually has only aquatic plants and no cyanobacterial blooms.

    Fig. 12.  Spatial distributions of Chl-a concentration in Lake Taihu, as estimated by the RF models: (a) spring, (b) summer, (c) autumn, and (d) winter.

    Table 1.  Parameters of the GF-1 WFV sensors

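    Spectral band | Spectral range (μm) | Spatial resolution (m) | Swath width (km) | Swaying capacity | Revisit period (day)
    --- | --- | --- | --- | --- | ---
    B1 | 0.45–0.52 | 16 | 800 (4 cameras) | ±35° | 2
    B2 | 0.52–0.59 | 16 | 800 (4 cameras) | ±35° | 2
    B3 | 0.63–0.69 | 16 | 800 (4 cameras) | ±35° | 2
    B4 | 0.77–0.89 | 16 | 800 (4 cameras) | ±35° | 2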

    Table 2.  GF-1 images selected for model training and cross-validation

    Year | Date | Number of images
    --- | --- | ---
    2018 | 12 January | 2
    2018 | 5, 6, 13, and 23 February | 6
    2018 | 8 and 28 April | 2
    2018 | 15 and 23 May | 3
    2018 | 25 June, 19 and 20 July | 5
    2018 | 6 and 27 October | 2
    2018 | 18 December | 2
    2019 | 17 and 24 January | 3
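
    The seasonal grouping of the Table 2 acquisition dates can be sketched as follows. This is only an illustration: the season boundaries (meteorological seasons, with December to February as winter) are an assumption that is consistent with the dates named in the captions of Figs. 7 and 8, and the names `SEASON_BY_MONTH`, `season_of`, and `group_by_season` are hypothetical, not taken from the paper.

```python
from collections import defaultdict
from datetime import date

# Assumption: Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def season_of(day: date) -> str:
    """Return the season label for one acquisition date."""
    return SEASON_BY_MONTH[day.month]


def group_by_season(days):
    """Group acquisition dates into a {season: [dates]} dictionary."""
    groups = defaultdict(list)
    for day in days:
        groups[season_of(day)].append(day)
    return dict(groups)


# Acquisition dates listed in Table 2.
TABLE2_DATES = [
    date(2018, 1, 12),
    date(2018, 2, 5), date(2018, 2, 6), date(2018, 2, 13), date(2018, 2, 23),
    date(2018, 4, 8), date(2018, 4, 28),
    date(2018, 5, 15), date(2018, 5, 23),
    date(2018, 6, 25), date(2018, 7, 19), date(2018, 7, 20),
    date(2018, 10, 6), date(2018, 10, 27),
    date(2018, 12, 18),
    date(2019, 1, 17), date(2019, 1, 24),
]
```

    Calling `group_by_season(TABLE2_DATES)` returns one list of dates per season, which could then feed a separate training set for each seasonal model.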

    Table 3.  The 39 latent variables involved in the random forest modeling

    Variable name | Input | Quantity
    --- | --- | ---
    Bi (single wave band) | B1, B2, B3, B4 | 4
    EVI (enhanced vegetation index) | $ \mathrm{EVI} = 2.5 \times \dfrac{B_4 - B_3}{B_4 + 6 \times B_3 - 7.5 \times B_1 + 1} $ | 1
    NDVI (i, j) (normalized difference vegetation index) | NDVI (1, 2) = $ (B_1 - B_2)/(B_1 + B_2) $, NDVI (1, 3) = $ (B_1 - B_3)/(B_1 + B_3) $, NDVI (1, 4) = $ (B_1 - B_4)/(B_1 + B_4) $, NDVI (2, 3) = $ (B_2 - B_3)/(B_2 + B_3) $, NDVI (2, 4) = $ (B_2 - B_4)/(B_2 + B_4) $, NDVI (3, 4) = $ (B_3 - B_4)/(B_3 + B_4) $ | 6
    DVI (i, j) (difference vegetation index) | DVI (1, 2) = $ B_1 - B_2 $, DVI (1, 3) = $ B_1 - B_3 $, DVI (1, 4) = $ B_1 - B_4 $, DVI (2, 3) = $ B_2 - B_3 $, DVI (2, 4) = $ B_2 - B_4 $, DVI (3, 4) = $ B_3 - B_4 $ | 6
    RVI (i, j) (ratio vegetation index) | RVI (1, 2) = $ B_1/B_2 $, RVI (1, 3) = $ B_1/B_3 $, RVI (1, 4) = $ B_1/B_4 $, RVI (2, 3) = $ B_2/B_3 $, RVI (2, 4) = $ B_2/B_4 $, RVI (3, 4) = $ B_3/B_4 $ | 6
    VI (other) | VI (4, 1, 2, 3) = $ B_4/(B_1 + B_2 + B_3) $, VI (1, 2, 3, 4) = $ B_1/(B_2 + B_3 + B_4) $, VI (2, 1, 3, 4) = $ B_2/(B_1 + B_3 + B_4) $, VI (3, 1, 2, 4) = $ B_3/(B_1 + B_2 + B_4) $, VI (1, 2, 3) = $ B_1/(B_2 + B_3) $, VI (1, 2, 4) = $ B_1/(B_2 + B_4) $, VI (1, 3, 4) = $ B_1/(B_3 + B_4) $, VI (2, 1, 3) = $ B_2/(B_1 + B_3) $, VI (2, 1, 4) = $ B_2/(B_1 + B_4) $, VI (2, 3, 4) = $ B_2/(B_3 + B_4) $, VI (3, 1, 2) = $ B_3/(B_1 + B_2) $, VI (3, 2, 4) = $ B_3/(B_2 + B_4) $, VI (3, 1, 4) = $ B_3/(B_1 + B_4) $, VI (4, 2, 3) = $ B_4/(B_2 + B_3) $, VI (4, 1, 3) = $ B_4/(B_1 + B_3) $, VI (4, 1, 2) = $ B_4/(B_1 + B_2) $ | 16

    B1 refers to the blue wave band (0.45–0.52 μm), B2 to the green wave band (0.52–0.59 μm), B3 to the red wave band (0.63–0.69 μm), and B4 to the near-infrared wave band (0.77–0.89 μm). The indices i and j in Bi, NDVI (i, j), DVI (i, j), and RVI (i, j) denote band numbers.
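
    The formulas in Table 3 can be evaluated for a single pixel with a short function. The sketch below is a hedged illustration rather than the authors' code: the function name `latent_variables` and the key format (for example `"VI(2,1,3)"`) are hypothetical, and the inputs are assumed to be positive band values so that no denominator is zero.

```python
from itertools import combinations


def latent_variables(b1, b2, b3, b4):
    """Return the 39 Table 3 variables for one pixel as a dict.

    Assumption: b1..b4 are positive values of the blue, green, red,
    and near-infrared bands, so every denominator is non-zero.
    """
    b = {1: b1, 2: b2, 3: b3, 4: b4}

    # Four single-band variables.
    feats = {f"B{i}": b[i] for i in b}

    # Enhanced vegetation index as written in Table 3.
    feats["EVI"] = 2.5 * (b4 - b3) / (b4 + 6 * b3 - 7.5 * b1 + 1)

    # Six band pairs (i < j), three indices per pair.
    for i, j in combinations(range(1, 5), 2):
        feats[f"NDVI({i},{j})"] = (b[i] - b[j]) / (b[i] + b[j])
        feats[f"DVI({i},{j})"] = b[i] - b[j]
        feats[f"RVI({i},{j})"] = b[i] / b[j]

    # "VI (other)": one band divided by the sum of two or three
    # of the remaining bands, giving 4 * (3 + 1) = 16 variables.
    for i in b:
        others = [k for k in b if k != i]
        for size in (2, 3):
            for combo in combinations(others, size):
                key = "VI(" + ",".join(str(k) for k in (i,) + combo) + ")"
                feats[key] = b[i] / sum(b[k] for k in combo)

    return feats
```

    For instance, `latent_variables(0.10, 0.08, 0.06, 0.04)` yields a dictionary with 39 entries whose key `"RVI(1,2)"` holds B1/B2.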

    Table 4.  Model filtering results based on the IMSE indicator

    Model | Important variable | Number
    --- | --- | ---
    MODYear | DVI (2, 3), VI (2, 1, 3), RVI (2, 3), NDVI (2, 3), VI (2, 1, 3, 4), VI (2, 3, 4) | 6
    MODSpr | VI (2, 1, 4), DVI (2, 3), VI (2, 1, 3), VI (2, 1, 3, 4) | 4
    MODSum | VI (3, 1, 2), RVI (1, 3), DVI (2, 3), RVI (2, 3) | 4
    MODAut | VI (2, 3, 4), RVI (1, 2), VI (1, 2, 3, 4), VI (1, 2, 3), VI (4, 1, 2), NDVI (2, 4), VI (1, 2, 4), VI (1, 3, 4), DVI (1, 2) | 9
    MODWin | RVI (2, 3), RVI (2, 4), VI (2, 1, 3, 4), VI (2, 3, 4) | 4

    Table 5.  Model filtering results based on the INIP indicator

    Model | Important variable | Number
    --- | --- | ---
    MODYear | VI (2, 1, 3, 4), VI (2, 3, 4), VI (2, 1, 3), B4, VI (2, 1, 4), DVI (2, 3) | 6
    MODSpr | VI (4, 1, 3), NDVI (1, 4), VI (4, 1, 2, 3), EVI | 4
    MODSum | RVI (1, 3), VI (3, 1, 2), VI (2, 3, 4), VI (1, 3, 4) | 4
    MODAut | VI (2, 1, 3), RVI (2, 3), DVI (1, 2), VI (4, 1, 2, 3), VI (4, 1, 2), DVI (2, 3), VI (4, 2, 3), B4, NDVI (2, 3) | 9
    MODWin | RVI (2, 3), VI (2, 3, 4), VI (3, 1, 2), RVI (2, 4) | 4

    Table 6.  Model filtering results based on the RIEI indicator

    Model | Important variable | Number
    --- | --- | ---
    MODYear | VI (2, 1, 3, 4), VI (2, 3, 4), VI (2, 1, 3), B4, VI (2, 1, 4), DVI (2, 3) | 6
    MODSpr | DVI (1, 2), VI (2, 1, 4), VI (2, 1, 3, 4), RVI (1, 2), VI (2, 1, 3), DVI (2, 3), VI (1, 2, 3), RVI (2, 3) | 8
    MODSum | RVI (1, 3), VI (3, 1, 2), RVI (2, 3), VI (2, 3, 4), VI (1, 3, 4), DVI (2, 3), DVI (1, 3), VI (1, 2, 3, 4), VI (2, 1, 3, 4), VI (1, 2, 3), VI (2, 1, 3), VI (1, 2, 3, 4) | 12
    MODAut | VI (4, 1, 2), DVI (1, 2), VI (2, 3, 4), VI (4, 1, 2, 3), VI (1, 2, 3, 4) | 5
    MODWin | RVI (2, 3), VI (2, 3, 4), RVI (2, 4), VI (2, 1, 3, 4), DVI (2, 3) | 5

    Table 7.  Coefficient of determination (R2) of the five RF models trained by using the RIEI and single IMSE and INIP indicators

    Indicator | MODYear | MODSpr | MODSum | MODAut | MODWin
    --- | --- | --- | --- | --- | ---
    RIEI | 0.91 | 0.95 | 0.91 | 0.98 | 0.89
    IMSE | 0.91 | 0.87 | 0.84 | 0.89 | 0.85
    INIP | 0.89 | 0.91 | 0.83 | 0.99 | 0.85

    Table 8.  Model performances reported in some previous studies on estimation of Chl-a concentration in Lake Taihu using satellite

    Reference | Model | R2 | RMSE (mg m−3) | MAPE (%) | Spatial resolution | Satellite sensor
    --- | --- | --- | --- | --- | --- | ---
    Zhang et al. (2009) | SVM | 0.94 | 4.75 | 15.91 | 1 km | Terra/Aqua MODIS
    Zhang et al. (2009) | LRM | 0.80 | 9.28 | 26.71 | 1 km | Terra/Aqua MODIS
    Zhang et al. (2009) | PCA | 0.79 | 7.88 | 22.62 | 1 km | Terra/Aqua MODIS
    Qi et al. (2014) | EOF | 0.47 | 1.81 |  | 1 km | Terra/Aqua MODIS
    Song et al. (2017) | BRM | 0.72 |  |  | 100 m | HJ-1A HSI
    Song et al. (2017) | NDCI | 0.71 |  |  | 100 m | HJ-1A HSI
    Song et al. (2017) | WCI | 0.73 |  |  | 100 m | HJ-1A HSI
    Xu et al. (2019) | RF | 0.80 | 4.89 |  | 30 m | HJ-1B CCD
    Xu et al. (2019) | SVM | 0.77 | 5.21 |  | 30 m | HJ-1B CCD
    Xu et al. (2019) | BPANN | 0.75 | 6.19 |  | 30 m | HJ-1B CCD
    Xu et al. (2019) | DL | 0.72 | 5.59 |  | 30 m | HJ-1B CCD
    This study | Seasonal RF (spring/summer) | 0.88/0.88 | 2.94/2.59 | 24.17/25.59 | 16 m | GF-1 WFV
    This study | Seasonal RF (autumn/winter) | 0.94/0.74 | 1.66/1.40 | 19.73/18.47 | 16 m | GF-1 WFV

    SVM: support vector machine model, LRM: linear regression model, PCA: principal component analysis model, EOF: empirical orthogonal function model, BRM: band ratio model, NDCI: normalized difference chlorophyll index model, WCI: water Chl-a index model, BPANN: back propagation artificial neural network, DL: deep learning, HSI: hyperspectral imager, CCD: charge coupled device.
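
    The three scores compared in Table 8 can be computed from paired observations and retrievals as sketched below. The page does not state which definition of R2 the authors used; the sketch assumes the common form 1 − SS_res/SS_tot, and a squared correlation coefficient would give different values when the retrievals are biased. The function names are hypothetical.

```python
import math


def r_squared(obs, pred):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, pred):
    """Root-mean-square error, in the units of the inputs (e.g., mg m-3)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error (%); observations must be non-zero."""
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(obs, pred)) / len(obs)
```

    With illustrative values `obs = [10, 20, 30, 40]` and `pred = [12, 18, 33, 39]` (made up for this example, not from the paper), `mape(obs, pred)` evaluates to 10.625.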
  • [1]

    Breiman, L., 2001: Random forests. Mach. Learn., 45, 5–32. doi: 10.1023/A:1010933404324.
    [2]

    Cheng, C. M., Y. C. Wei, G. N. Lyu, et al., 2013: Remote estimation of chlorophyll-a concentration in turbid water using a spectral index: A case study in Taihu Lake, China. J. Appl. Remote Sens., 7, 073465. doi: 10.1117/1.JRS.7.073465.
    [3]

    Dai, X. L., P. Q. Qian, L. Ye, et al., 2016: Changes in nitrogen and phosphorus concentrations in Lake Taihu, 1985–2015. J. Lake Sci., 28, 935–943. doi: 10.18307/2016.0502. (in Chinese)
    [4]

    Fan, Y. Z., W. Li, C. K. Gatebe, et al., 2017: Atmospheric correction over coastal waters using multilayer neural networks. Remote Sens. Environ., 199, 218–240. doi: 10.1016/j.rse.2017.07.016.
    [5]

    Fang, X. R., Z. F. Wen, J. L. Chen, et al., 2019: Remote sensing estimation of suspended sediment concentration based on random forest regression model. J. Remote Sens., 23, 756–772. doi: 10.11834/jrs.20197498. (in Chinese)
    [6]

    Ghorbanzadeh, O., H. Shahabi, F. Mirchooli, et al., 2020: Gully erosion susceptibility mapping (GESM) using machine learning methods optimized by the multi-collinearity analysis and K-fold cross-validation. Geomat. Nat. Hazards Risk, 11, 1653–1678. doi: 10.1080/19475705.2020.1810138.
    [7]

    Gordon, H. R., and M. H. Wang, 1994a: Retrieval of water-leaving radiance and aerosol optical thickness over the oceans with SeaWiFS: A preliminary algorithm. Appl. Opt., 33, 443–452. doi: 10.1364/AO.33.000443.
    [8]

    Gordon, H. R., and M. H. Wang, 1994b: Influence of oceanic whitecaps on atmospheric correction of ocean-color sensors. Appl. Opt., 33, 7754–7763. doi: 10.1364/AO.33.007754.
    [9]

    Gordon, H. R., O. B. Brown, R. H. Evans, et al., 1988: A semianalytic radiance model of ocean color. J. Geophys. Res. Atmos., 93, 10,909–10,924. doi: 10.1029/JD093iD09p10909.
    [10]

    Goyens, C., C. Jamet, and T. Schroeder, 2013: Evaluation of four atmospheric correction algorithms for MODIS-aqua images over contrasted coastal waters. Remote Sens. Environ., 131, 63–75. doi: 10.1016/j.rse.2012.12.006.
    [11]

    Harvey, E. T., S. Kratzer, and P. Philipson, 2015: Satellite-based water quality monitoring for improved spatial and temporal retrieval of chlorophyll-a in coastal waters. Remote Sens. Environ., 158, 417–430. doi: 10.1016/j.rse.2014.11.017.
    [12]

    He, J. Y., Y. J. Chen, J. P. Wu, et al., 2020: Space-time chlorophyll-a retrieval in optically complex waters that accounts for remote sensing and modeling uncertainties and improves remote estimation accuracy. Water Res., 171, 115403. doi: 10.1016/j.watres.2019.115403.
    [13]

    IOCCG, 2006: Remote Sensing of Inherent Optical Properties: Fundamentals, Tests of Algorithms, and Applications. Reports of the International Ocean-Colour Coordinating Group, Z. P. Lee, Ed., IOCCG, Dartmouth, 122 pp.
    [14]

    IOCCG, 2010: Atmospheric Correction for Remotely-Sensed Ocean-Colour Products. Reports of International Ocean-Color Coordinating Group, M. Wang, Ed., IOCCG, Dartmouth, 132 pp.
    [15]

    Iverson, L. R., A. M. Prasad, S. N. Matthews, et al., 2008: Estimating potential habitat for 134 eastern US tree species under six climate scenarios. Forest Ecol. Manage., 254, 390–406. doi: 10.1016/j.foreco.2007.07.023.
    [16]

    Jamet, C., H. Loisel, C. P. Kuchinke, et al., 2011: Comparison of three SeaWIFS atmospheric correction algorithms for turbid waters using AERONET-OC measurements. Remote Sens. Environ., 115, 1955–1965. doi: 10.1016/j.rse.2011.03.018.
    [17]

    Jiang, G. J., L. Zhou, R. H. Ma, et al., 2013: Remote sensing retrieval for chlorophyll-a concentration in turbid case II waters (II): Application on MERIS image. J. Infrared Millim. Waves, 32, 372–378. doi: 10.3724/SP.J.1010.2013.00372. (in Chinese)
    [18]

    Kong, X. Y., Y. Y. Sun, R. G. Su, et al., 2017: Real-time eutrophication status evaluation of coastal waters using support vector machine with grid search algorithm. Mar. Pollut. Bull., 119, 307–319. doi: 10.1016/j.marpolbul.2017.04.022.
    [19]

    Lary, D. J., A. H. Alavi, A. H. Gandomi, et al., 2016: Machine learning in geosciences and remote sensing. Geosci. Front., 7, 3–10. doi: 10.1016/j.gsf.2015.07.003.
    [20]

    Li, Y. C., X. P. Xie, X. Hang, et al., 2016: Analysis of wind field features causing cyanobacteria bloom in Taihu Lake combined with remote sensing methods. China Environ. Sci., 36, 525–533. doi: 10.3969/j.issn.1000-6923.2016.02.032. (in Chinese)
    [21]

    Li, Y. M., J. Z. Huang, Y. C. Wei, et al., 2006: Inversing chlorophyll concentration of Taihu Lake by analytic model. J. Remote Sens., 10, 169–175. (in Chinese)
    [22]

    Liaw, A., and M. Wiener, 2002: Classification and regression by randomForest. R News, 2(3), 18–22. Available online at https://www.academia.edu/20101897/Classification_and_Regression_by_randomForest. Accessed on 14 January 2022.
    [23]

    Liu, Y., Y. W. Chen, and J. M. Deng, 2010: Discussion on accuracy and errors for phytoplankton chlorophyll-a concentration analysis using YSI (Multi-parameter water analyzer). J. Lake Sci., 22, 965–968. (in Chinese)
    [24]

    Luo, J. M., Y. W. Huo, and X. Q. Han, 2017: Inversion of chlorophyll a concentration in offshore Ⅱ waters using HJ satellite data—example in the north of the Luanhe Delta. Haiyang Xuebao, 39, 117–129. doi: 10.3969/j.issn.0253-4193.2017.04.012. (in Chinese)
    [25]

    Matthews, M. W., 2011: A current review of empirical procedures of remote sensing in inland and near-coastal transitional waters. Int. J. Remote Sens., 32, 6855–6899. doi: 10.1080/01431161.2010.512947.
    [26]

    Mouw, C. B., S. Greb, D. Aurin, et al., 2015: Aquatic color radiometry remote sensing of coastal and inland waters: Challenges and recommendations for future satellite missions. Remote Sens. Environ., 160, 15–30. doi: 10.1016/j.rse.2015.02.001.
    [27]

    Pahlevan, N., B. Smith, J. Schalles, et al., 2020: Seamless retrievals of chlorophyll-a from Sentinel-2 (MSI) and Sentinel-3 (OLCI) in inland and coastal waters: A machine-learning approach. Remote Sens. Environ., 240, 111604. doi: 10.1016/j.rse.2019.111604.
    [28]

    Palmer, S. C. J., P. D. Hunter, T. Lankester, et al., 2015: Validation of envisat meris algorithms for chlorophyll retrieval in a large, turbid and optically-complex shallow lake. Remote Sens. Environ., 157, 158–169. doi: 10.1016/j.rse.2014.07.024.
    [29]

    Qi, L., C. M. Hu, H. T. Duan, et al., 2014: An EOF-based algorithm to estimate chlorophyll a concentrations in Taihu Lake from MODIS land-band measurements: Implications for near real-time applications and forecasting models. Remote Sens., 6, 10,694–10,715. doi: 10.3390/rs61110694.
    [30]

    Qin, B. Q., W. P. Hu, and W. M. Chen, 2004: Process and Mechanism of Environmental Changes of the Taihu Lake. Science Press, Beijing, 136–137.
    [31]

    Shi, K., Y. L. Zhang, Y. Q. Zhou, et al., 2017: Long-term MODIS observations of cyanobacterial dynamics in Lake Taihu: Responses to nutrient enrichment and meteorological factors. Sci. Rep., 7, 40326. doi: 10.1038/srep40326.
    [32]

    Shi, W., and M. H. Wang, 2007: Detection of turbid waters and absorbing aerosols for the MODIS ocean color data processing. Remote Sens. Environ., 110, 149–161. doi: 10.1016/j.rse.2007.02.013.
    [33]

    Singh, R. K., and P. Shanmugam, 2014: A novel method for estimation of aerosol radiance and its extrapolation in the atmospheric correction of satellite data over optically complex oceanic waters. Remote Sens. Environ., 142, 188–206. doi: 10.1016/j.rse.2013.12.001.
    [34]

    Song, T., W. L. Zhou, J. Z. Liu, et al., 2017: Evaluation on distribution of chlorophyll-a content in surface water of Taihu Lake by hyperspectral inversion models. Acta Sci. Circumst., 37, 888–899. doi: 10.13671/j.hjkxxb.2016.0438. (in Chinese)
    [35]

    Wang, M. H., 2007: Remote sensing of the ocean contributions from ultraviolet to near-infrared using the shortwave infrared bands: Simulations. Appl. Opt., 46, 1535–1547. doi: 10.1364/AO.46.001535.
    [36]

    Wang, M. H., W. Shi, and L. D. Jiang, 2012: Atmospheric correction using near-infrared bands for satellite ocean color data processing in the turbid western Pacific region. Opt. Express, 20, 741–753. doi: 10.1364/OE.20.000741.
    [37]

    Wang, M. H., X. M. Liu, L. Q. Tan, et al., 2013: Impacts of VIIRS SDR performance on ocean color products. J. Geophys. Res. Atmos., 118, 10,347–10,360. doi: 10.1002/jgrd.50793.
    [38]

    Wu, C. Q., Z. F. Yang, Q. Wang, et al., 2009: A reverse method of chlorophyll-a based on dynamic apex. J. Lake Sci., 21, 223–227. doi: 10.18307/2009.0210. (in Chinese)
    [39]

    Xie, T. T., Y. Z. Chen, W. F. Lu, et al., 2019: Comparison and analysis of chlorophyll-a retrieval model in the lower reaches of Minjiang River based on GF-1 WFV image. Acta Sci. Circumst., 39, 4276–4283. doi: 10.13671/j.hjkxxb.2019.0180. (in Chinese)
    [40]

    Xiong, W., X. Qian, R. Ye, et al., 2012: Eco-model based analysis of Lake Taihu cyanobacteria growth factors. J. Lake Sci., 24, 698–704. doi: 10.18307/2012.0509. (in Chinese)
    [41]

    Xu, M., H. X. Liu, R. Beck, et al., 2019: Regionally and locally adaptive models for retrieving chlorophyll-a concentration in inland waters from remotely sensed multispectral and hyperspectral imagery. IEEE Trans. Geosci. Remote Sens., 57, 4758–4774. doi: 10.1109/TGRS.2019.2892899.
    [42]

    Xu, N., F. Deng, B. Q. Liu, et al., 2021: Changes in the urban surface thermal environment of a Chinese coastal city revealed by downscaling MODIS LST with random forest algorithm. J. Meteor. Res., 35, 759–774. doi: 10.1007/s13351-021-0023-4.
    [43]

    Xu, P. F., F. Mao, P. B. Jin, et al., 2020: Spatial–temporal variations of chlorophyll-a in Qiandao lake using GF1_WFV data. China Environ. Sci., 40, 4580–4588. doi: 10.3969/j.issn.1000-6923.2020.10.045. (in Chinese)
    [44]

    Xu, Y., X. Y. Dong, and J. J. Wang, 2019: Use of remote multispectral imaging to monitor chlorophyll-a in Taihu Lake: A comparison of four machine learning models. Journal of Hydroecology, 40, 48–57. doi: 10.15928/j.1674-3075.2019.04.007. (in Chinese)
    [45]

    Yajima, H., and J. Derot, 2018: Application of the random forest model for chlorophyll-a forecasts in fresh and brackish water bodies in Japan, using multivariate long-term databases. J. Hydroinform., 20, 206–220. doi: 10.2166/hydro.2017.010.
    [46]

    Yang, T., H. Zhang, Q. Wang, et al., 2011: Retrieving for chlorophyll-a concentration and suspended substance concentration based on HJ-1A HIS image. Environ. Sci., 32, 3207–3214. (in Chinese)
    [47]

    Zhang, M. H., H. Su, and B. W. Ji, 2018: Retrieving nearshore chlorophyll-a concentration using MODIS time-series images in the Fujian Province (China). Acta Sci. Circumst., 38, 4831–4839. doi: 10.13671/j.hjkxxb.2018.0343. (in Chinese)
    [48]

    Zhang, S. B., F. Z. Weng, and Y. Wei, 2020: A multivariable approach for estimating soil moisture from microwave radiation imager (MWRI). J. Meteor. Res., 34, 732–747. doi: 10.1007/s13351-020-9203-x.
    [49]

    Zhang, Y. C., X. Qian, Y. Qian, et al., 2009: Application of SVM on Chl-a concentration retrievals in Taihu Lake. China Environ. Sci., 29, 78–83. doi: 10.3321/j.issn:1000-6923.2009.01.016. (in Chinese)
    [50]

    Zhang, Z., M. Zhang, W. Xiao, et al., 2018: Analysis of temporal and spatial variations in NDVI of aquatic vegetation in Lake Taihu. J. Remote Sens., 22, 324–334. doi: 10.11834/jrs.20186495. (in Chinese)
    [51]

    Zhou, L., R. H. Ma, H. T. Duan, et al., 2011: Remote sensing retrieval for chlorophyll-a concentration in turbid case Ⅱ waters (Ⅰ): The optimal model. J. Infrared Millim. Waves, 30, 531–536.
    [52]

    Zhu, G. W., B. Q. Qin, Y. L. Zhang, et al., 2018: Variation and driving factors of nutrients and chlorophyll-a concentrations in northern region of Lake Taihu, China, 2005–2017. J. Lake Sci., 30, 279–295. doi: 10.18307/2018.0201. (in Chinese)
    [53]

    Zhu, Y. F., L. Zhu, J. G. Li, et al., 2017: The study of inversion of chlorophyll a in Taihu based on GF-1 WFV image and BP neural network. Acta Sci. Circumst., 37, 130–137. doi: 10.13671/j.hjkxxb.2016.0275. (in Chinese)
  • Yachun LI and Xin HANG.pdf

  • 1. Jiangsu Climate Center, Jiangsu Meteorological Bureau, Nanjing 210008
  • 2. Nanjing Joint Institute for Atmospheric Sciences, Nanjing 210009
    • Lake eutrophication and harmful algal blooms are common ecological problems of water ecosystems globally (Dai et al., 2016; Zhu et al., 2017). Because chlorophyll-a (Chl-a) is the most abundant pigment in phytoplankton and algae, accurate observation of its concentration is of great importance in both assessing the degree of eutrophication of water bodies and promoting water environment management and ecological protection (Harvey et al., 2015).

      Lake Taihu is the third largest freshwater lake in China and a typical large shallow lake. With ongoing climate warming and eutrophication, the frequency, intensity, and annual duration of cyanobacteria blooms have been increasing for decades. Therefore, there is an urgent need for effective monitoring and management of water quality in Lake Taihu, and for a broader understanding of the optical, biological, and ecological processes and phenomena in all fresh inland waters.

      Although in-situ measurements can provide details on inland lake water quality, they have limitations, in particular a lack of spatial coverage. With much greater temporal and spatial coverage, various satellite sensors have been widely used to estimate Chl-a in large water bodies (He et al., 2020). In principle, optical radiometric measurements of clear oceanic waters allow for accurate monitoring of ocean color, which is primarily governed by Chl-a and its accessory pigments. Remotely sensed aquatic signals recorded at the top of the atmosphere (TOA) are bulk optical properties arising from the absorption and scattering of solar photons in the atmosphere, within the water column, and at the air–water interface. After the atmospheric effects are removed, the TOA signal is reduced to the remote sensing reflectance, from which in-water optical properties are quantified (Wang et al., 2013; Pahlevan et al., 2020). Therefore, Chl-a is usually obtained in two stages: (1) application of an atmospheric correction (AC) algorithm to generate normalized water-leaving radiance spectra, and (2) generation of Chl-a from the obtained water-leaving radiance spectra (Gordon and Wang, 1994a; Wang et al., 2012). However, the standard Chl-a retrieval algorithms originally developed for open ocean waters tend to fail when applied to more turbid inland and coastal waters, whose optical properties are strongly influenced by non-covarying concentrations of non-algal particles and colored dissolved organic matter (IOCCG, 2006; Matthews, 2011; Palmer et al., 2015). In addition, the continentality of the atmosphere overlying inland and coastal waters and the proximity of the adjacent land surface mean that standard approaches to AC over ocean waters are not always reliable (Mouw et al., 2015; Palmer et al., 2015). There is therefore a clear need to develop and validate atmospheric and in-water algorithms specifically for turbid inland and coastal waters.

      Many efforts have been made in recent years to retrieve Chl-a in coastal and turbid inland waters. Multiple algorithms have been developed to retrieve Chl-a from multispectral and hyperspectral images over coastal and inland waters (Xu et al., 2019), including empirical models (Zhou et al., 2011; Qi et al., 2014), semi-analytical algorithms (Gordon et al., 1988; Li et al., 2006), and machine learning (ML) models (Jiang et al., 2013; Zhu et al., 2017; He et al., 2020). Estimating Chl-a in turbid and severely eutrophic lakes with traditional empirical or semi-analytical models is difficult because these models rely on blue–green wavelengths, where water constituents other than phytoplankton often dominate the optical properties and the satellite-derived water-leaving radiance often contains substantial uncertainties (Palmer et al., 2015; Shi et al., 2017). ML, a subset of artificial intelligence, can mine data features in depth through nonlinear and complex calculations and has shown superior performance (Lary et al., 2016). Multiple ML algorithms have been used to estimate Chl-a, such as the neural network (Zhu et al., 2017), support vector machine (Zhang et al., 2009; Kong et al., 2017), and random forest (RF; Zhang et al., 2018). The RF is an ensemble ML algorithm (Zhang et al., 2020; Xu et al., 2021) that provides a nonparametric, nonlinear, and multivariate regression analysis with high performance when estimating and predicting Chl-a (Yajima and Derot, 2018; Zhang et al., 2018).
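
      As a rough illustration of the ensemble idea behind RF regression, the sketch below combines the two ingredients described by Breiman (2001), bootstrap resampling and random feature subsets, using one-split regression trees (stumps) and averaging their predictions. It is a deliberately simplified toy, not the authors' implementation: real RF models grow much deeper trees, and the names `fit_stump`, `fit_forest`, and `predict` and all parameter defaults are hypothetical.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the candidate features.

    Returns (feature, threshold, left_mean, right_mean).
    """
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds lie midway between distinct feature values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all candidate features constant): predict the mean.
        mean = sum(targets) / len(targets)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]


def fit_forest(rows, targets, n_trees=50, n_features=1, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(all_features, n_features)    # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps for one input row."""
    preds = [lm if row[f] <= th else rm for f, th, lm, rm in forest]
    return sum(preds) / len(preds)
```

      In a retrieval setting, each row would hold the selected band-derived variables for one collocated sample and each target the matching in-situ Chl-a value; in practice a tested library implementation would be preferable to a toy like this.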

      Previous studies have generally used an AC algorithm to remove the atmospheric and surface effects from satellite observations in order to retrieve the water-leaving radiance, which can be used to produce water color products such as the Chl-a concentration. However, removing a large atmospheric signal and accurately deriving the very small signal from the water is the major challenge under turbid inland lake conditions (Wang, 2007; IOCCG, 2010). Conventional and modified AC algorithms (Gordon and Wang, 1994b; Shi and Wang, 2007) may produce questionable results in turbid inland lakes, mainly because of the difficulty in deriving normalized water-leaving radiance spectra (Jamet et al., 2011; Goyens et al., 2013; Singh and Shanmugam, 2014; Fan et al., 2017).

      Therefore, we introduce an RF-based approach for Chl-a retrieval over Lake Taihu. This alternative approach does not attempt to remove atmospheric effects, but extracts the Chl-a information directly from the TOA reflectance observed by satellites. This may avoid the need to retrieve the water-leaving radiance through an AC, which is typically prone to large errors in turbid and high-biomass waters. In addition, Gaofen-1 (GF-1) is the first high-resolution earth observation satellite of China and is equipped with four 16-m-resolution multispectral wide-field-of-view (WFV) sensors. The high resolution of GF-1 is well suited to the inversion of water quality parameters. However, there are currently few studies on the use of the GF-1 satellite for Chl-a retrieval in inland lakes (Zhu et al., 2017). The primary objective of this study is to develop a model for Chl-a retrieval at the high spatial resolution of 16 m from GF-1 WFV. A fine-resolution and high-frequency Chl-a product would benefit small-scale water environmental and ecological studies such as those in Lake Taihu. The remainder of the manuscript is structured as follows. In Section 2, the satellite data and in-situ measurements used in the study are presented. In Section 3, the ML process adopted for Chl-a concentration retrieval is described. The retrieval results are then described in Section 4. A discussion is presented in Section 5, followed by the conclusions in Section 6.

    2.   Data
    • Lake Taihu is one of the largest freshwater lakes in China. It is located in the southeast portion of the Yangtze River Delta (30°55′–31°30′N, 119°55′–120°40′E), covering an area of 2338 km2 with an average water depth of 2 m. Following the rapid development of China’s economy in the previous 30 years, increasing volumes of industrial, agricultural, and domestic sewage from surrounding cities have been discharged continually into Lake Taihu. This process has caused the water to become eutrophic, resulting in frequent occurrence of large-scale cyanobacterial blooms (Li et al., 2016; Shi et al., 2017).

    • The Department of Ecology and Environment of Jiangsu Province established 19 water quality buoy sites in Lake Taihu (Fig. 1) to record water quality data automatically. Although few in number, these sites are distributed fairly uniformly throughout the lake, and the observations can represent the water quality distribution in the lake. All the buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600). The observational data include 14 water quality parameters such as Chl-a concentration, algae density, and dissolved oxygen concentration. The instrument is calibrated once a week, and a comparison against actual water samples is carried out at the same time to ensure the stability and reliability of the data. The Chl-a values selected in this study are instantaneous values observed at the time of the GF-1 satellite overpass (1130 BT).

      Figure 1.  Distribution of the 19 water quality buoy sites in Lake Taihu, China. All buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600) and provide in-situ measurements of Chl-a concentration required for model training and validation.

    • The GF-1 satellite, which is the first high-resolution satellite developed by China, was launched on 26 April 2013. This satellite is equipped with a 2-m-resolution panchromatic sensor, 8-m-resolution multispectral sensor, and four 16-m-resolution multispectral WFV sensors. This study uses the data acquired by the 16-m-resolution WFV sensors. Table 1 lists the basic parameters of the WFV sensors onboard GF-1.

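      | Spectral band | Spectral range (μm) | Spatial resolution (m) | Swath width (km) | Swaying capacity | Revisit period (day) |
      |---|---|---|---|---|---|
      | B1 | 0.45–0.52 | 16 | 800 (4 cameras) | ±35° | 2 |
      | B2 | 0.52–0.59 | 16 | 800 (4 cameras) | ±35° | 2 |
      | B3 | 0.63–0.69 | 16 | 800 (4 cameras) | ±35° | 2 |
      | B4 | 0.77–0.89 | 16 | 800 (4 cameras) | ±35° | 2 |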

      Table 1.  Parameters of the GF-1 WFV sensors

      The GF-1 WFV data are acquired from January 2018 to May 2019 and include the TOA reflectance at four bands. The 27 GF-1 images selected for model training and cross-validation covered 18 days (Table 2). Furthermore, the TOA reflectances observed by GF-1 on 4 June and 13 December 2019 are also selected as model input data for Chl-a retrieval. These data are used only for independent testing of the performance of the models, that is, not for training or cross-validation purposes. All images underwent orthorectification, radiometric calibration, and image mosaicking.

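      | Year | Date | Number of images |
      |---|---|---|
      | 2018 | 12 January | 2 |
      | 2018 | 5, 6, 13, and 23 February | 6 |
      | 2018 | 8 and 28 April | 2 |
      | 2018 | 15 and 23 May | 3 |
      | 2018 | 25 June, 19 and 20 July | 5 |
      | 2018 | 6 and 27 October | 2 |
      | 2018 | 18 December | 2 |
      | 2019 | 17 and 24 January | 3 |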

      Table 2.  GF-1 images selected for model training and cross-validation

      For comparison, Earth Observing System (EOS) Moderate Resolution Imaging Spectroradiometer (MODIS) data are also collected for retrieving cyanobacterial blooms. The commonly used normalized difference vegetation index (NDVI) is used to extract cyanobacterial bloom information from all available MODIS data collected during 2007–2018. In this paper, two indicators of cyanobacterial blooms are used, namely the bloom area and the number of bloom occurrences; the area is determined by setting an NDVI threshold, and the number of occurrences is determined based on the area (Li et al., 2016).

    • From all the Chl-a concentrations measured at the time of overflight of the GF-1 satellite on the 18 days, 152 missing or abnormal records (caused mainly by cloud) are excluded, leaving 190 matched records that form the Chl-a sample dataset for model training and cross-validation. Grouped by season, the dataset comprises 57 spring (March–May), 30 summer (June–August), 20 autumn (September–November), and 83 winter (December–February) samples. Additionally, the Chl-a measured on 4 June and 13 December 2019 is used to independently test the performance of the models. Figure 2 shows the frequency distributions of the Chl-a sample dataset. The Chl-a concentration is mainly concentrated between 4 and 10 mg m−3; the winter dataset has the highest proportion in this range (95.2%), followed by the whole year dataset (88.4%), and the summer dataset has the lowest proportion (73.4%). The maximum Chl-a concentration is 37.4 mg m−3, which appears in autumn, and the minimum is 0.9 mg m−3, which appears in spring. Among the seasonal mean Chl-a concentrations, the maximum is 9.6 mg m−3 in summer, followed by autumn (8.0 mg m−3), and the minimum is in winter (6.6 mg m−3). The range of the Chl-a sample dataset is reasonable, and the seasonal concentration changes are consistent with the actual situation, so the dataset can be used for model training and cross-validation.

      Figure 2.  Frequency distributions of the Chl-a sample dataset for (a) the whole year dataset, (b) the spring dataset, (c) the summer dataset, (d) the autumn dataset, and (e) the winter dataset. The percent value is the ratio of the sample count in each Chl-a bin to the total sample count in the dataset. The Chl-a bin width is 2 mg m−3 in each panel.

      The GF-1 WFV TOA reflectance is used as the input for the RF model. It is considered reasonable to use the reflectance in the four GF-1 WFV bands for water quality parameter retrieval because the reflectance of a single band is a complex function of many parameters (Kong et al., 2017). Following Fang et al. (2019), the four single bands, the main vegetation indices commonly used to retrieve Chl-a, and some other band combinations of the GF-1 WFV are considered, giving 39 constructed variables that serve as the latent variables for model filtering (Table 3).

      | Variable name | Input | Quantity |
      |---|---|---|
      | Bi (single wave band) | B1, B2, B3, B4 | 4 |
      | EVI (enhanced vegetation index) | $ \mathrm{EVI} = 2.5 \times \dfrac{B_{4}-B_{3}}{B_{4}+6 \times B_{3}-7.5 \times B_{1}+1} $ | 1 |
      | NDVI (i, j) (normalized difference vegetation index) | NDVI (1, 2) = $ (B_{1}-B_{2})/(B_{1}+B_{2}) $, NDVI (1, 3) = $ (B_{1}-B_{3})/(B_{1}+B_{3}) $, NDVI (1, 4) = $ (B_{1}-B_{4})/(B_{1}+B_{4}) $, NDVI (2, 3) = $ (B_{2}-B_{3})/(B_{2}+B_{3}) $, NDVI (2, 4) = $ (B_{2}-B_{4})/(B_{2}+B_{4}) $, NDVI (3, 4) = $ (B_{3}-B_{4})/(B_{3}+B_{4}) $ | 6 |
      | DVI (i, j) (difference vegetation index) | DVI (1, 2) = $ B_{1}-B_{2} $, DVI (1, 3) = $ B_{1}-B_{3} $, DVI (1, 4) = $ B_{1}-B_{4} $, DVI (2, 3) = $ B_{2}-B_{3} $, DVI (2, 4) = $ B_{2}-B_{4} $, DVI (3, 4) = $ B_{3}-B_{4} $ | 6 |
      | RVI (i, j) (ratio vegetation index) | RVI (1, 2) = $ B_{1}/B_{2} $, RVI (1, 3) = $ B_{1}/B_{3} $, RVI (1, 4) = $ B_{1}/B_{4} $, RVI (2, 3) = $ B_{2}/B_{3} $, RVI (2, 4) = $ B_{2}/B_{4} $, RVI (3, 4) = $ B_{3}/B_{4} $ | 6 |
      | VI (other) | VI (4, 1, 2, 3) = $ B_{4}/(B_{1}+B_{2}+B_{3}) $, VI (1, 2, 3, 4) = $ B_{1}/(B_{2}+B_{3}+B_{4}) $, VI (2, 1, 3, 4) = $ B_{2}/(B_{1}+B_{3}+B_{4}) $, VI (3, 1, 2, 4) = $ B_{3}/(B_{1}+B_{2}+B_{4}) $, VI (1, 2, 3) = $ B_{1}/(B_{2}+B_{3}) $, VI (1, 2, 4) = $ B_{1}/(B_{2}+B_{4}) $, VI (1, 3, 4) = $ B_{1}/(B_{3}+B_{4}) $, VI (2, 1, 3) = $ B_{2}/(B_{1}+B_{3}) $, VI (2, 1, 4) = $ B_{2}/(B_{1}+B_{4}) $, VI (2, 3, 4) = $ B_{2}/(B_{3}+B_{4}) $, VI (3, 1, 2) = $ B_{3}/(B_{1}+B_{2}) $, VI (3, 2, 4) = $ B_{3}/(B_{2}+B_{4}) $, VI (3, 1, 4) = $ B_{3}/(B_{1}+B_{4}) $, VI (4, 2, 3) = $ B_{4}/(B_{2}+B_{3}) $, VI (4, 1, 3) = $ B_{4}/(B_{1}+B_{3}) $, VI (4, 1, 2) = $ B_{4}/(B_{1}+B_{2}) $ | 16 |

      Note: B1 refers to the blue wave band (0.45–0.52 μm), B2 to the green wave band (0.52–0.59 μm), B3 to the red wave band (0.63–0.69 μm), and B4 to the near-infrared wave band (0.77–0.89 μm). The indices i and j in Bi, NDVI (i, j), DVI (i, j), and RVI (i, j) denote band numbers.

      Table 3.  The 39 latent variables involved in the random forest modeling
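
      The construction of the 39 latent variables in Table 3 can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function name `build_candidate_variables`, the dictionary keys, and the sample reflectance values are assumptions introduced here, and the four inputs are assumed to be the TOA reflectances of the WFV bands at one pixel.

```python
from itertools import combinations

def build_candidate_variables(b1, b2, b3, b4):
    """Return the 39 candidate variables of Table 3 for one pixel.

    b1..b4 are assumed to be TOA reflectances of the blue, green, red,
    and near-infrared WFV bands (hypothetical inputs for illustration).
    """
    band = {1: b1, 2: b2, 3: b3, 4: b4}
    out = {f"B{i}": band[i] for i in band}                       # 4 single bands
    out["EVI"] = 2.5 * (b4 - b3) / (b4 + 6 * b3 - 7.5 * b1 + 1)  # 1 variable
    for i, j in combinations(band, 2):                           # 6 band pairs, 3 indices each
        out[f"NDVI({i},{j})"] = (band[i] - band[j]) / (band[i] + band[j])
        out[f"DVI({i},{j})"] = band[i] - band[j]
        out[f"RVI({i},{j})"] = band[i] / band[j]
    for i in band:                                               # 16 "other" ratios
        others = [k for k in band if k != i]
        # one band divided by the sum of the other three bands
        key = f"VI({i},{','.join(map(str, others))})"
        out[key] = band[i] / sum(band[k] for k in others)
        # one band divided by the sum of two of the other bands
        for j, k in combinations(others, 2):
            out[f"VI({i},{j},{k})"] = band[i] / (band[j] + band[k])
    return out

# Hypothetical reflectances, not values from the study
variables = build_candidate_variables(0.10, 0.12, 0.08, 0.05)
print(len(variables))  # 39
```

      The count follows from 4 + 1 + 3 × 6 + 4 × (1 + 3) = 39, which matches the quantities listed in Table 3.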

    3.   Methodology
    • The mathematical details and structure of the RF algorithm have been discussed elsewhere (Breiman, 2001; Iverson et al., 2008); therefore, only a brief introduction is given here. The RF is a supervised ensemble ML technique that uses multiple decision trees and bootstrap aggregation to provide a nonparametric, multivariable, and nonlinear regression. First, k features are selected at random from the full feature set, and the root node is determined from them by the best-split approach. A tree is then grown from the root node in the same way. Repeating this process on random samples of the training data yields multiple decision trees. Finally, the predictions of the individual trees are averaged to obtain the result (Liaw and Wiener, 2002).
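
      The steps described above (bootstrap sampling, random feature selection at each split, and averaging over trees) can be illustrated with a small pure-Python sketch. It is a simplified teaching example under stated assumptions, not the implementation used in this study: the function names, the stopping rules (`min_leaf`, `max_depth`), and the toy data are all introduced here for illustration.

```python
import random
from statistics import mean

def _sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, mtry, rng, min_leaf=2, depth=0, max_depth=8):
    """Grow one regression tree; each split tries only `mtry` random features."""
    if len(y) < 2 * min_leaf or depth >= max_depth or len(set(y)) == 1:
        return mean(y)                                   # leaf: average target
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        for t in sorted(set(row[j] for row in X))[:-1]:  # candidate thresholds
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return mean(y)
    _, j, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  mtry, rng, min_leaf, depth + 1, max_depth)
    return (j, t, grow(left), grow(right))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node

def fit_forest(X, y, ntree=50, mtry=None, seed=0):
    """Train `ntree` trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    mtry = mtry or max(1, len(X[0]) // 3)                # one-third of the features
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(y)) for _ in y]         # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)

# Toy data (hypothetical): two predictors that both increase with the target
X = [[x, x * x] for x in range(20)]
y = [3.0 * x + 1.0 for x in range(20)]
forest = fit_forest(X, y, ntree=50, seed=1)
low, high = predict_forest(forest, X[0]), predict_forest(forest, X[-1])
```

      Because every leaf value is an average of training targets, the forest prediction always lies within the range of the observed targets, which is one reason RF retrievals cannot extrapolate beyond the Chl-a range seen in training.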

      The RF model used here is developed by combining in-situ Chl-a and TOA reflectance to estimate the Chl-a concentration. The in-situ Chl-a serves as the training target, and the predictor variables include the reflectance variables built from band combinations of the GF-1 WFV and the latitude and longitude coordinates of the water quality buoy sites. The use of latitude and longitude as variables accounts for the spatiotemporal variation of the Chl-a. The performance of the models is compared under different settings of the number of trees (ntree) and the number of variables tried at each split (mtry), and the optimal model performance is achieved when ntree is set to 600 and mtry is set to one-third of the total number of variables of each model. It is noted that the RF is a supervised ML algorithm; thus, although the in-situ Chl-a is critical for model fitting, it is not necessary for model application.

      The Monte Carlo cross-validation (MCCV) technique is used to assess the potential of model fitting and model robustness (Ghorbanzadeh et al., 2020). The MCCV technique is an asymptotically consistent method for model selection. It can avoid an unnecessarily large model, decrease the risk of over-fitting in model training, and has a relatively high probability of choosing the most appropriate model. Here, the sample dataset is split randomly into two subsets: one subset containing 25% of the total samples is used to validate the model, and the other subset containing the remaining 75% of the samples is used to train the model. This independent training and test procedure is repeated multiple times, and the average of the test results is taken as the indicator of model accuracy. Several statistical indicators are used to quantitatively evaluate the model performance, namely the coefficient of determination (R2), the root-mean-square error (RMSE), and the mean absolute percentage error (MAPE) between the cross-validation predicted and observed Chl-a. The MAPE is calculated as follows:

      $$ {\rm MAPE}=\frac{1}{n}\sum _{i=1}^{n}\frac{\left|{\rm Chla}^{\rm obs}\left(i\right)-{\rm Chla}^{\rm pre}\left(i\right)\right|}{{\rm Chla}^{\rm obs}\left(i\right)}\times 100\%, $$ (1)

      where $ n $ is the total number of samples, and ${\rm Chla}^{\rm obs}$ and ${\rm Chla}^{\rm pre}$ are the observed and predicted Chl-a, respectively.
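
      As a worked illustration of the three indicators, the following Python sketch computes MAPE as in Eq. (1) together with RMSE and R2. The text does not state which definition of R2 is used, so the form 1 − SS_res/SS_tot below is an assumption; the function names and the three sample values are hypothetical and are not data from this study.

```python
from math import sqrt

def mape(observed, predicted):
    """Mean absolute percentage error of Eq. (1), in percent."""
    n = len(observed)
    return 100.0 * sum(abs(o - p) / o for o, p in zip(observed, predicted)) / n

def rmse(observed, predicted):
    """Root-mean-square error, in the units of the inputs."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical Chl-a values in mg m-3
obs = [4.0, 8.0, 10.0]
pre = [5.0, 6.0, 10.0]
print(round(mape(obs, pre), 2), round(rmse(obs, pre), 2), round(r_squared(obs, pre), 2))
```

      For these values the relative errors are 25%, 25%, and 0%, so the MAPE is about 16.67%. Because MAPE divides by the observed value, low observed concentrations weigh heavily in it.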

    • The selection of the most effective bands plays an important role in the accurate estimation of Chl-a (Goyens et al., 2013; Jiang et al., 2013). Some previous studies have determined the Chl-a spectral characteristics and sensitive bands of lake water bodies through statistical analysis or field measurement (Wu et al., 2009; Yang et al., 2011). However, relatively turbid inland water bodies contain phytoplankton, suspended matter, dissolved organic matter, and many other substances that affect the absorption spectrum of Chl-a, and these components mix and interact with each other such that the spectral characteristics are more complicated. The measured Chl-a spectral reflectance and absorption peaks also differ significantly among water bodies (Luo et al., 2017). Moreover, most such measurements have been undertaken in specific water bodies and therefore lack extensive validation.

      The RF algorithm can be used to quantitatively assess the relative importance of each variable to the model. Thus, irrelevant or redundant variables can be excluded from the initial large set, and the small number of variables that contribute most to the model can be retained to obtain a more accurate model. The contribution of each input variable in each model is evaluated based on two factors: the percentage increase in the mean square error (IMSE) and the increase in node impurity (INIP). As the IMSE or the INIP increases, the contribution of the variable to the Chl-a retrieval increases. Previous studies generally use only one of the two indicators to select the important variables, but this practice could be inaccurate. Therefore, a novel relative importance evaluation index (RIEI), which combines IMSE and INIP to measure the relative importance of the variables, is used to select the important variables as model inputs. The RIEI is calculated as follows:

      $$ {\rm{RIEI}}=\left(\frac{{\rm IMSE}_{i}-{\rm IMSE}_{\rm min}}{{\rm IMSE}_{\rm max}-{\rm IMSE}_{\rm min}}+\frac{{\rm INIP}_{i}-{\rm INIP}_{\rm min}}{{\rm INIP}_{\rm max}-{\rm INIP}_{\rm min}}\right)/2 , $$ (2)

      where $ {\rm IMSE}_{i} $ is the IMSE value of the ith model of the 20 training models with the highest accuracy (HAMs) from MCCV, $ {\rm IMSE}_{\rm min} $ is the minimum of the 20 HAMs, and $ {\rm IMSE}_{\rm max} $ is the maximum of the 20 HAMs; $ {\rm INIP}_{i} $ is the ith INIP value of the 20 HAMs, $ {\rm INIP}_{\rm min} $ is the minimum of the 20 HAMs, and $ {\rm INIP}_{\rm max} $ is the maximum of the 20 HAMs.

      From Eq. (2), the RIEI value increases linearly with $ {\rm IMSE}_{i} $ and $ {\rm INIP}_{i} $, that is, the larger the $ {\rm IMSE}_{i} $ or $ {\rm INIP}_{i} $, the larger the RIEI, and the RIEI always lies between 0 and 1. Accordingly, as the RIEI increases, the contribution of the corresponding variable to Chl-a retrieval increases.
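
      Equation (2) amounts to min-max scaling each indicator and averaging the two scaled values. The sketch below applies it to paired lists of IMSE and INIP values. The function names and the three sample values are hypothetical, and the assumption here is that the minimum and maximum are taken over the same set of values that is being ranked.

```python
def min_max(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]

def riei(imse, inip):
    """Relative importance evaluation index of Eq. (2) for paired indicators."""
    return [(a + b) / 2 for a, b in zip(min_max(imse), min_max(inip))]

# Hypothetical importance values for three variables
scores = riei([10.0, 20.0, 30.0], [1.0, 3.0, 2.0])
print(scores)  # [0.0, 0.75, 0.75]
```

      In this toy case the second and third variables tie on RIEI even though each indicator alone would rank them differently, which illustrates how the combined index balances the two criteria.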

    • First, a representative training dataset is needed for RF model training to estimate Chl-a concentration. The GF-1 WFV TOA reflectance and the synchronously observed Chl-a for 18 days from January 2018 to May 2019 are used as the training dataset. The 190 measured Chl-a values are used as a whole year sample set and are also grouped by season, giving five sample sets for model training. Three-quarters of the data are selected at random from each sample set as the training sample set, and the remaining quarter is taken as the validation sample set.
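
      The repeated random 75%/25% split can be sketched as a generic routine that accepts any fitting and prediction functions. This is a minimal sketch under assumptions: the number of repetitions is not stated in the text, so `n_repeats` is a placeholder, and the straight-line model and toy samples are hypothetical stand-ins for the RF model and the matched Chl-a dataset.

```python
import random
from math import sqrt
from statistics import mean

def mccv(samples, fit, predict, score, n_repeats=100, train_fraction=0.75, seed=0):
    """Monte Carlo cross-validation: average score over repeated random splits."""
    rng = random.Random(seed)
    n_train = int(train_fraction * len(samples))
    scores = []
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)                       # new random split each repeat
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)
        observed = [target for _, target in test]
        predicted = [predict(model, x) for x, _ in test]
        scores.append(score(observed, predicted))
    return mean(scores)

def fit_line(train):
    """Least-squares straight line, used here only as a stand-in model."""
    xs, ys = [x for x, _ in train], [t for _, t in train]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (t - my) for x, t in train) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def rmse(observed, predicted):
    return sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))

# Hypothetical samples: (predictor, target) pairs on an exact line
samples = [(float(x), 2.0 * x + 1.0) for x in range(20)]
cv_rmse = mccv(samples, fit_line, predict_line, rmse, n_repeats=50)
```

      Because the test samples in each repeat are never used for fitting, the averaged score reflects predictive skill rather than goodness of fit.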

      The selected important variables are used as the input variables, and the RF model is trained separately for each sample set. The parameter mtry is nominally one-third of the number of characteristic variables, and the four values 1, 2, 3, and 4 are tried in turn. The parameter ntree takes the value 400, 500, or 600 based on the previous error analysis results. The modeling procedure is repeated multiple times for each parameter combination (mtry, ntree), and the combination with the highest accuracy is selected for each model. A detailed flowchart of the process for determining the optimal RF model is shown in Fig. 3.
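
      The parameter search described above can be sketched as a small grid search. This is an illustrative sketch only: `evaluate` is a hypothetical callable standing in for one train-and-validate run of the RF model, the number of repetitions is an assumption, and the toy scoring function is invented so that the example runs on its own.

```python
from itertools import product
from statistics import mean

def select_parameters(evaluate, mtry_values=(1, 2, 3, 4),
                      ntree_values=(400, 500, 600), n_repeats=10):
    """Return (score, mtry, ntree) for the combination with the highest mean score."""
    best = None
    for mtry, ntree in product(mtry_values, ntree_values):
        # repeat the modeling procedure and average the accuracy
        score = mean(evaluate(mtry, ntree, repeat) for repeat in range(n_repeats))
        if best is None or score > best[0]:
            best = (score, mtry, ntree)
    return best

def toy_evaluate(mtry, ntree, repeat):
    """Hypothetical stand-in for one training run that returns an R2-like score."""
    return 0.9 - 0.05 * abs(mtry - 2) - 0.0001 * abs(ntree - 600)

best_score, best_mtry, best_ntree = select_parameters(toy_evaluate)
print(best_mtry, best_ntree)  # 2 600
```

      In practice `evaluate` would train the RF on a random 75% subset and return the validation accuracy, so that the search and the cross-validation share the same splits.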

      Figure 3.  Detailed flowchart of the process for determining the optimal RF model.

    4.   Results
    • First, we select the band combinations based on a single index, either IMSE or INIP, for comparison purposes. The above-determined parameters mtry and ntree are used for modeling and optimization, and the IMSE and INIP values and rankings of each variable in each model are obtained separately; the top-ranked variables are regarded as the important input variables. For comparability of the results, the same number of important variables is selected for a given model under both indicators. As a result, MODYear, MODSpr, MODSum, MODAut, and MODWin have 6, 4, 4, 9, and 4 important variables, respectively (Tables 4, 5).

      | Model | Important variable | Number |
      |---|---|---|
      | MODYear | DVI (2, 3), VI (2, 1, 3), RVI (2, 3), NDVI (2, 3), VI (2, 1, 3, 4), VI (2, 3, 4) | 6 |
      | MODSpr | VI (2, 1, 4), DVI (2, 3), VI (2, 1, 3), VI (2, 1, 3, 4) | 4 |
      | MODSum | VI (3, 1, 2), RVI (1, 3), DVI (2, 3), RVI (2, 3) | 4 |
      | MODAut | VI (2, 3, 4), RVI (1, 2), VI (1, 2, 3, 4), VI (1, 2, 3), VI (4, 1, 2), NDVI (2, 4), VI (1, 2, 4), VI (1, 3, 4), DVI (1, 2) | 9 |
      | MODWin | RVI (2, 3), RVI (2, 4), VI (2, 1, 3, 4), VI (2, 3, 4) | 4 |

      Table 4.  Model filtering results based on the IMSE indicator

      | Model | Important variable | Number |
      |---|---|---|
      | MODYear | VI (2, 1, 3, 4), VI (2, 3, 4), VI (2, 1, 3), B4, VI (2, 1, 4), DVI (2, 3) | 6 |
      | MODSpr | VI (4, 1, 3), NDVI (1, 4), VI (4, 1, 2, 3), EVI | 4 |
      | MODSum | RVI (1, 3), VI (3, 1, 2), VI (2, 3, 4), VI (1, 3, 4) | 4 |
      | MODAut | VI (2, 1, 3), RVI (2, 3), DVI (1, 2), VI (4, 1, 2, 3), VI (4, 1, 2), DVI (2, 3), VI (4, 2, 3), B4, NDVI (2, 3) | 9 |
      | MODWin | RVI (2, 3), VI (2, 3, 4), VI (3, 1, 2), RVI (2, 4) | 4 |

      Table 5.  Model filtering results based on the INIP indicator

      The two types of filtering yield markedly different important variables for each model. The degree of coincidence of the important variables for MODYear, MODSpr, MODSum, MODAut, and MODWin is only 67%, 0%, 50%, 22%, and 75%, respectively, and the ranking order of the variables also differs markedly. Only in MODWin is the first-ranked variable the same under both indicators; in the remaining four models the order differs throughout. This suggests that it might be inappropriate to use only one of the indicators to measure the importance of the variables and to select important variables on that basis.
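
      The coincidence percentages quoted above can be reproduced directly from Tables 4 and 5 by counting the variables shared between the two selections for each model. The short Python check below does this; the helper names are introduced here for illustration, and the variable lists are transcribed from the two tables.

```python
def parse(text):
    """Turn a semicolon-separated list of variable names into a set."""
    return set(text.split("; "))

# Important variables selected by IMSE (Table 4)
imse = {
    "MODYear": parse("DVI(2,3); VI(2,1,3); RVI(2,3); NDVI(2,3); VI(2,1,3,4); VI(2,3,4)"),
    "MODSpr": parse("VI(2,1,4); DVI(2,3); VI(2,1,3); VI(2,1,3,4)"),
    "MODSum": parse("VI(3,1,2); RVI(1,3); DVI(2,3); RVI(2,3)"),
    "MODAut": parse("VI(2,3,4); RVI(1,2); VI(1,2,3,4); VI(1,2,3); VI(4,1,2); "
                    "NDVI(2,4); VI(1,2,4); VI(1,3,4); DVI(1,2)"),
    "MODWin": parse("RVI(2,3); RVI(2,4); VI(2,1,3,4); VI(2,3,4)"),
}
# Important variables selected by INIP (Table 5)
inip = {
    "MODYear": parse("VI(2,1,3,4); VI(2,3,4); VI(2,1,3); B4; VI(2,1,4); DVI(2,3)"),
    "MODSpr": parse("VI(4,1,3); NDVI(1,4); VI(4,1,2,3); EVI"),
    "MODSum": parse("RVI(1,3); VI(3,1,2); VI(2,3,4); VI(1,3,4)"),
    "MODAut": parse("VI(2,1,3); RVI(2,3); DVI(1,2); VI(4,1,2,3); VI(4,1,2); "
                    "DVI(2,3); VI(4,2,3); B4; NDVI(2,3)"),
    "MODWin": parse("RVI(2,3); VI(2,3,4); VI(3,1,2); RVI(2,4)"),
}

def overlap_percent(a, b):
    """Share of variables in `a` that also appear in `b`, rounded to whole percent."""
    return round(100 * len(a & b) / len(a))

coincidence = {model: overlap_percent(imse[model], inip[model]) for model in imse}
print(coincidence)
# {'MODYear': 67, 'MODSpr': 0, 'MODSum': 50, 'MODAut': 22, 'MODWin': 75}
```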

      In the following, we determine the effective band combinations based on the RIEI. As above, the results of the evaluation of the importance of the variables in MODYear, MODSpr, MODSum, MODAut, and MODWin are obtained according to the RIEI. As an example, the ranking of the importance of the variables in MODSum and MODWin is shown in Figs. 4a and 4b, respectively.

      Figure 4.  Importance of each input variable in the RF training models of (a) MODSum and (b) MODWin derived from the RIEI. For each model, only the first 15 RIEI values of the 39 latent variables are shown as examples. The larger the RIEI value, the larger the contribution of the variable to the Chl-a concentration retrieval.

      According to the results of the importance evaluation, the important characteristic variables of the five models are selected. MODYear, MODSpr, MODSum, MODAut, and MODWin retain 6, 8, 12, 5, and 5 important variables, respectively (Table 6).

      | Model | Important variable | Number |
      |---|---|---|
      | MODYear | VI (2, 1, 3, 4), VI (2, 3, 4), VI (2, 1, 3), B4, VI (2, 1, 4), DVI (2, 3) | 6 |
      | MODSpr | DVI (1, 2), VI (2, 1, 4), VI (2, 1, 3, 4), RVI (1, 2), VI (2, 1, 3), DVI (2, 3), VI (1, 2, 3), RVI (2, 3) | 8 |
      | MODSum | RVI (1, 3), VI (3, 1, 2), RVI (2, 3), VI (2, 3, 4), VI (1, 3, 4), DVI (2, 3), DVI (1, 3), VI (1, 2, 3, 4), VI (2, 1, 3, 4), VI (1, 2, 3), VI (2, 1, 3), VI (1, 2, 3, 4) | 12 |
      | MODAut | VI (4, 1, 2), DVI (1, 2), VI (2, 3, 4), VI (4, 1, 2, 3), VI (1, 2, 3, 4) | 5 |
      | MODWin | RVI (2, 3), VI (2, 3, 4), RVI (2, 4), VI (2, 1, 3, 4), DVI (2, 3) | 5 |

      Table 6.  Model filtering results based on the RIEI indicator

      The band combinations in the five models yield several findings. First, each model draws on all four bands of GF-1 WFV, indicating that each band carries Chl-a information. In fact, the spectral ranges of the four multispectral bands of GF-1 WFV are very similar to those of the four bands of Landsat TM/ETM+ [i.e., blue (0.45–0.52 µm), green (0.52–0.60 µm), red (0.63–0.69 µm), and NIR (0.76–0.90 µm)], which are the most suitable Landsat TM/ETM+ bands for characterizing Chl-a in complex coastal waters (Qi et al., 2014). Recent studies on the inversion of Chl-a also show that various combinations of the four GF-1 WFV bands are correlated with Chl-a in various seasons or regions (Xie et al., 2019). Second, only one single-band variable (B4) appears among the variables of the five models, and the rest are multi-band combinations, suggesting that multi-band combinations perform better than single bands. A similar result has been reported in previous studies (Cheng et al., 2013). In addition, each model contains several band ratio variables, such as the ratio vegetation indices RVI (1, 2), RVI (1, 3), RVI (2, 3), and RVI (2, 4), and some difference vegetation indices, such as DVI (1, 2) and DVI (2, 3). Chl-a, the primary photosynthetic pigment in terrestrial green plants and in phytoplankton, absorbs strongly in the blue (B1) and red (B3) spectral regions and reflects strongly in the green (B2) and NIR (B4) spectral regions, so the spectral reflectance of algae-containing water resembles that of terrestrial vegetation (Cheng et al., 2013). Therefore, it is reasonable to include vegetation indices or similar band ratio variables in the model. Various types of vegetation indices and similar band ratios have been used for Chl-a retrieval in previous studies (Cheng et al., 2013; Xie et al., 2019). However, the band combinations used by different algorithms may vary owing to the large spatial and temporal changes in the biophysical characteristics of turbid waters. Moreover, because sampling times and positions vary, models of the same water body may differ between authors, and the band combinations may differ even when the same model-building method is used (Cheng et al., 2013). Therefore, it is reasonable to believe that the RF models with the optimal band combinations of GF-1 WFV are suitable for estimating Chl-a in Lake Taihu.

      Although these findings are encouraging, the RF-derived algorithm is a black box model, and its main drawback is that the decision procedure is not easy to interpret. However, this drawback can be compensated to some extent by initially selecting physically meaningful variables and then selecting the optimal band combinations according to the variable importance.

    • The R2 values of MODYear, MODSpr, MODSum, MODAut, and MODWin with the highest accuracy are listed in Table 7. Additionally, model training is also performed by using the variables filtered by IMSE and INIP separately, and the results are presented in Table 7 for comparison. Among the five RIEI trained models, the R2 values of MODYear, MODSpr, MODSum, and MODAut all exceed 0.9, and the R2 of MODWin is 0.89. Among the IMSE trained models, only MODYear has R2 above 0.9, and only two of the INIP trained models (MODSpr and MODAut) have R2 above 0.9. The maximum R2 of the five RIEI trained models is 0.98, which is slightly smaller than the maximum (0.99) of the INIP trained models but markedly greater than the maximum (0.91) of the IMSE trained models. The minimum R2 of the five RIEI trained models is 0.89, which is markedly greater than the minimum of the IMSE (0.84) and INIP (0.83) trained models. Moreover, among the five RIEI trained models, only the R2 of MODAut is slightly lower than that of its INIP trained counterpart, and the R2 values of the other models are higher than or equal to those of the models trained with the other two indicators. The results indicate that the models trained with the RIEI fit better than those trained with a single indicator.

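      | Indicator | MODYear | MODSpr | MODSum | MODAut | MODWin |
      |---|---|---|---|---|---|
      | RIEI | 0.91 | 0.95 | 0.91 | 0.98 | 0.89 |
      | IMSE | 0.91 | 0.87 | 0.84 | 0.89 | 0.85 |
      | INIP | 0.89 | 0.91 | 0.83 | 0.99 | 0.85 |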

      Table 7.  Coefficient of determination (R2) of the five RF models trained by using the RIEI and single IMSE and INIP indicators

      To illustrate the performance of the models trained with the different indicators, scatter plots of the fitting results of the RF models are shown in Fig. 5. The performance of the RIEI trained models is relatively stable, with RMSEs of 1.24–2.67 mg m−3. The RMSEs are all less than 2 mg m−3 except for that of MODSum, and the MAPEs fall in the range of 16.83%–27.26%. In contrast, the maximum RMSE of the IMSE trained models is 5.98 mg m−3, and the RMSE of the INIP trained MODAut is 5.14 mg m−3. The RIEI trained models thus outperform the models trained with a single indicator.

      Figure 5.  Performances of the five RF models (MODYear, MODSpr, MODSum, MODAut, and MODWin) trained with the RIEI, and the single IMSE and INIP indicators, using a test dataset.

    • The MCCV is used to validate the performance of the RF models. The Chl-a derived from in-situ measurements is compared with that retrieved by the RF models (MODYear, MODSpr, MODSum, MODAut, and MODWin) in Fig. 6. Despite the limited number of matchup data, the Chl-a derived by the RF models generally agrees well with the in-situ measurements. The correlation between the Chl-a retrieved by each model and the in-situ measured values is high, and the RMSE is small. The MCCV R2 of the five models is in the range of 0.71–0.94 and the RMSEs are in the range of 1.40–4.30 mg m−3. Among them, MODAut has the highest R2 (R2 = 0.94, RMSE = 1.66 mg m−3), followed by MODSpr and MODSum (R2 values of 0.88 and 0.88 and RMSE values of 2.94 and 2.59 mg m−3, respectively). The accuracy of MODYear is the lowest (R2 = 0.71), and its RMSE (4.30 mg m−3) is markedly higher than that of the four seasonal models. The MAPEs of the four seasonal models vary between 18.47% and 25.59%, substantially lower than the value of 38.32% for MODYear. This result shows that the RF models using these optimized GF-1 WFV band combinations can estimate the Chl-a in Lake Taihu with good accuracy. The model trained with all samples performs clearly worse than the seasonal models, and MODAut performs better than the other three seasonal models.

      Figure 6.  Scatter plots of the cross-validation results of our five models: (a) MODYear, (b) MODSpr, (c) MODSum, (d) MODAut, and (e) MODWin.

      Examples of the performance of the models for non-bloom and bloom cases are shown in Fig. 7. The two GF-1 WFV RGB images, composed of the three channels B1, B2, and B3, show two situations: no blooms in winter (Fig. 7a) and blooms in autumn (Fig. 7b). The retrieved Chl-a distributions for the two cases are shown in Figs. 7c, d, respectively. For the non-bloom case on 13 February 2018, the spatial distribution of the Chl-a throughout the lake is reasonably uniform, with maximum and minimum values of approximately 8 and 3 mg m−3, respectively. In the algal bloom case on 27 October 2018, the spatial distribution of the Chl-a is highly consistent with that of the cyanobacterial bloom, and the spatial heterogeneity is substantial, with maximum and minimum values of approximately 30 and 3 mg m−3, respectively. This demonstrates that our models can estimate the Chl-a over a large water area and reveal its distribution in detail.

      Figure 7.  Performances of the Chl-a concentration retrieval RF models in cases without algal bloom (13 February 2018) and with algal bloom (27 October 2018): (a) and (b) GF-1 WFV RGB images without algal bloom (in winter) and with cyanobacterial algal bloom (in autumn), respectively, (c) and (d) Chl-a concentration for the same two cases derived by using the RF models, respectively.

      To further evaluate the performance of our RF models, the GF-1 WFV TOA reflectances on 4 June and 13 December 2019 are selected as input data, and the in-situ measured Chl-a at the corresponding times is taken as the independent test dataset. These data are not used for model training or cross-validation. The two dates also represent two different situations, blooms in summer (4 June 2019) and no blooms in winter (13 December 2019), and the Chl-a on these 2 days is retrieved with MODSum and MODWin, respectively. Figure 8 shows the GF-1 WFV RGB images, the distribution of retrieved Chl-a, and scatter plots of the in-situ measured versus model-retrieved Chl-a. In the case without algal blooms, the distribution of the Chl-a retrieved by MODWin is more uniform, whereas in the case with algal blooms, the distribution of the Chl-a derived by MODSum is consistent with that of the cyanobacterial blooms. For MODWin and MODSum tested with independent samples, the R2 values are 0.49 and 0.61, the RMSEs are 2.34 and 5.31 mg m−3, and the MAPEs are 32.74% and 31.83%, respectively. Overall, the models estimate the Chl-a reasonably well, although the R2 values are lower and the RMSEs and MAPEs are larger than in cross-validation.

      Figure 8.  Validations of the RF models using independent sample datasets in the case of no algal blooms (13 December 2019) and algal blooms (4 June 2019): (a) and (b) GF-1 WFV RGB images, (c) and (d) distribution of Chl-a concentration retrieved by RF models (MODWin and MODSum), and (e) and (f) scatter plots of Chl-a concentration derived from in-situ measurements versus those retrieved by using the models, respectively.
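The three validation statistics quoted for the independent test can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes R2 is computed as 1 − SS_res/SS_tot (the section does not say whether this or the squared correlation coefficient is used), and the sample values are made up.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(observed, predicted):
    """Mean absolute percentage error (%), relative to the observations."""
    ratios = [abs(o - p) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up Chl-a values (mg m-3) purely for illustration; not data from the study.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print(r_squared(observed, predicted))
print(rmse(observed, predicted))
print(mape(observed, predicted))
```

Because MAPE divides by the observed value, it weights errors at low concentrations more heavily than RMSE does, which is one reason the winter and summer models can have similar MAPEs but quite different RMSEs.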

    • To investigate the annual variation of Chl-a in Lake Taihu, the seasonal models are used to retrieve the Chl-a concentration on 15 days in 2018, and the average value over the entire lake on each day is obtained, as shown in Fig. 9. Owing to the lack of matched GF-1 WFV images, the estimated Chl-a for certain months (March, August, September, and November) is missing, but Fig. 9 still shows that the Chl-a in the lake exhibits an obvious seasonal trend. Although the lake shows relatively high Chl-a in all seasons, the Chl-a is substantially higher during summer (June–August) and autumn (September–November) than in spring (March–May) and winter (December–February). The average Chl-a concentrations in spring, summer, autumn, and winter are 7.7, 9.6, 8.6, and 7.1 mg m−3, respectively. This temporal trend is largely consistent with previous research (Shi et al., 2017). For comparison, Fig. 10 shows the monthly average area of cyanobacterial blooms derived from the NDVI model using all available MODIS images collected during 2007–2018. The higher temporal resolution of MODIS allows more frequent monitoring of cyanobacterial blooms, and the time series of bloom area can characterize the long-term trends in the cyanobacterial dynamics of Lake Taihu. Generally, as the Chl-a in the lake increases, the likelihood of a cyanobacterial bloom increases. The bloom areas in Lake Taihu show a clear seasonal cycle, with cyanobacterial blooms occurring much more often during summer and autumn and less frequently in spring and winter. The extent of the bloom areas is also considerably larger in summer and autumn than in spring and winter. Under normal circumstances, after a long bloom-free period in winter, surviving cyanobacteria float to the water surface during spring, so that the Chl-a increases gradually and the intensity of the cyanobacterial bloom increases substantially. However, we find that the areas of cyanobacterial blooms in June and July, the time of year affected primarily by the Jianghuai Meiyu (a prolonged rainy period that occurs almost every year in this area), are smaller than in May (Fig. 10). Even under the influence of the Jianghuai Meiyu, summer is still considered the season with the greatest intensity and areal extent of cyanobacterial blooms. The retrieved Chl-a and the observed cyanobacterial blooms in Lake Taihu thus corroborate each other in their temporal trends.

      Figure 9.  Temporal variation of Chl-a concentration in Lake Taihu in 2018 estimated by using the RF models. The red line is the 2-day moving average of the Chl-a concentration.

      Figure 10.  Temporal variation in the monthly average area of cyanobacterial blooms in Lake Taihu derived from EOS/MODIS acquired during 2007–2018. The red line is the 2-month moving average of the cyanobacterial bloom area.
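The seasonal grouping and the moving average used in Figs. 9 and 10 can be sketched as below. The month-to-season mapping follows the definitions given in the text (spring March–May, summer June–August, autumn September–November, winter December–February); the function names and the sample lake-mean values are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Season definitions as stated in the text.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def season_of(day):
    """Return the season (and hence the seasonal model) for an acquisition date."""
    return SEASON_BY_MONTH[day.month]


def seasonal_means(lake_means):
    """Average lake-wide daily means (dict of date -> mg m-3) per season."""
    groups = defaultdict(list)
    for day, value in lake_means.items():
        groups[season_of(day)].append(value)
    return {season: mean(values) for season, values in groups.items()}


def moving_average(values, window=2):
    """Trailing moving average over consecutive samples."""
    return [mean(values[i - window + 1:i + 1]) for i in range(window - 1, len(values))]


# Hypothetical lake-wide means for three acquisition dates (not study data).
lake_means = {date(2018, 1, 5): 6.0, date(2018, 2, 13): 8.0, date(2018, 7, 1): 10.0}
print(seasonal_means(lake_means))
print(moving_average(list(lake_means.values()), window=2))
```

With only 15 scenes unevenly spread over the year, a seasonal mean computed this way weights each available date equally, so months without matched images simply do not contribute.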

    • To analyze the annual spatial distribution of Chl-a in Lake Taihu, the Chl-a concentrations retrieved for 15 days in 2018 are averaged pixel by pixel to obtain a composite map of the spatial distribution of Chl-a in Lake Taihu in 2018 (Fig. 11a). Additionally, Fig. 11b shows the number of cyanobacterial blooms in each pixel obtained from MODIS. Although the number of matched GF-1 WFV images (15) is far below that of the MODIS images with cyanobacterial blooms (134), the spatial distribution of Chl-a retrieved by our models is generally consistent with that of the number of cyanobacterial blooms. The northwestern part of the lake has the greatest annual average Chl-a concentration. Correspondingly, this is also the area in which cyanobacterial blooms occur most frequently and where the water quality is the poorest. The eastern coastal area, with the best water quality, has the lowest Chl-a and fewer occurrences of cyanobacterial blooms. It should be noted that, owing to the abundance of aquatic plants, the Chl-a in the southeast corner is also relatively high throughout the year (except in winter; Zhang Z. et al., 2018). These results are generally consistent with those of previous research (Qi et al., 2014).

      Figure 11.  Spatial distributions of (a) Chl-a concentration retrieved from the RF models and (b) the cumulative number of occurrences of cyanobacterial blooms derived from EOS/MODIS in Lake Taihu in 2018. Note that only cyanobacterial blooms are counted in (b); the southeast of Lake Taihu usually has only aquatic plants and no cyanobacterial blooms.
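The pixel-by-pixel averaging behind the composite map in Fig. 11a can be illustrated with a small sketch. It assumes that pixels without a valid retrieval in a given scene (for example cloud or land) are stored as NaN and skipped; the text does not describe how such pixels were handled, so this masking rule and the function name `composite_mean` are assumptions.

```python
import math


def composite_mean(scenes):
    """Average a list of equally sized 2-D grids pixel by pixel, ignoring NaN."""
    rows, cols = len(scenes[0]), len(scenes[0][0])
    composite = [[math.nan] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            valid = [s[r][c] for s in scenes if not math.isnan(s[r][c])]
            if valid:  # leave NaN where no scene has a valid value
                composite[r][c] = sum(valid) / len(valid)
    return composite


# Two tiny made-up 2x2 "scenes" (mg m-3); NaN marks an invalid pixel.
nan = math.nan
scene_a = [[4.0, nan], [6.0, 8.0]]
scene_b = [[6.0, nan], [nan, 10.0]]
print(composite_mean([scene_a, scene_b]))
```

On real rasters the same operation is usually done on a stacked array, for instance with `numpy.nanmean` along the scene axis.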

      Figure 12 shows the seasonal spatial distribution of the retrieved Chl-a in Lake Taihu in 2018. The Chl-a exhibits obvious seasonal variability, being substantially higher during summer and autumn than in spring and winter. Except in winter, the Chl-a also shows obvious spatial variation. In spring, the northwestern part is the area with the highest Chl-a, followed by the southern coastal and some central areas of the lake. In summer, the Chl-a in most areas of the northwest, west, and south of the lake is obviously higher, and only a few areas along the east coast have relatively low Chl-a. Compared with summer, the area with high Chl-a shrinks substantially in autumn and is located mainly in the western coastal and northern parts of the lake. In winter, the Chl-a in the entire lake is markedly reduced, and the spatial distribution is more uniform. This result is consistent with the actual situation. Notably, the areas of high Chl-a in the three seasons other than winter are located mainly in the northwestern part of the lake, which is attributable mainly to the large number of rivers entering the lake and the high density of urban sewage outfalls in these areas. The high nutrient concentration leads to serious eutrophication of the water body, which favors the growth of algae (Dai et al., 2016; Zhu et al., 2018). Additionally, the spatial distribution pattern of Chl-a is affected by meteorological conditions. The low and uniform Chl-a in winter is due mainly to the low temperature, which causes algae in the water to almost stop growing (Xiong et al., 2012). The prevailing southeasterly wind in summer and autumn causes algae to become concentrated in the northwestern region of the lake (Qin et al., 2004; Li et al., 2016). These explanations support the view that the Chl-a spatial distribution patterns derived from the RF models are reasonable.

      Figure 12.  Spatial distributions of Chl-a concentration in Lake Taihu, as estimated by the RF models: (a) spring, (b) summer, (c) autumn, and (d) winter.

    5.   Discussion
    • Table 8 summarizes the performance of different algorithms reported in previous studies on the estimation of Chl-a in Lake Taihu using satellite data. The spatial resolutions of the different models range from 16 m to 1 km, with the previous studies all using resolutions coarser than 16 m (30 m to 1 km). The CV R2 of the different models varies from 0.47 to 0.94, with most R2 values less than 0.88. The seasonal RF models capture on average about 86% of the variability in Chl-a concentration in the sample-based CV, which is larger than the sample-based CV R2 of almost all other models. The RMSE values of the seasonal models are generally lower than those of the other models. Our RF-derived models also perform comparably with or better than some previous models based on different ML algorithms (Zhang et al., 2009; Xu et al., 2019). Overall, the seasonal models show robust and superior performance in estimating Chl-a at a high spatial resolution of 16 m.

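      | Reference | Model | R2 | RMSE (mg m−3) | MAPE (%) | Spatial resolution | Satellite sensor |
      | --- | --- | --- | --- | --- | --- | --- |
      | Zhang et al. (2009) | SVM | 0.94 | 4.75 | 15.91 | 1 km | Terra/Aqua MODIS |
      | Zhang et al. (2009) | LRM | 0.80 | 9.28 | 26.71 | 1 km | Terra/Aqua MODIS |
      | Zhang et al. (2009) | PCA | 0.79 | 7.88 | 22.62 | 1 km | Terra/Aqua MODIS |
      | Qi et al. (2014) | EOF | 0.47 | 1.81 | | 1 km | Terra/Aqua MODIS |
      | Song et al. (2017) | BRM | 0.72 | | | 100 m | HJ-1A HSI |
      | Song et al. (2017) | NDCI | 0.71 | | | 100 m | HJ-1A HSI |
      | Song et al. (2017) | WCI | 0.73 | | | 100 m | HJ-1A HSI |
      | Xu et al. (2019) | RF | 0.80 | 4.89 | | 30 m | HJ-1B CCD |
      | Xu et al. (2019) | SVM | 0.77 | 5.21 | | 30 m | HJ-1B CCD |
      | Xu et al. (2019) | BPANN | 0.75 | 6.19 | | 30 m | HJ-1B CCD |
      | Xu et al. (2019) | DL | 0.72 | 5.59 | | 30 m | HJ-1B CCD |
      | This study | Seasonal RF (spring/summer) | 0.88/0.88 | 2.94/2.59 | 24.17/25.59 | 16 m | GF-1 WFV |
      | This study | Seasonal RF (autumn/winter) | 0.94/0.74 | 1.66/1.40 | 19.73/18.47 | 16 m | GF-1 WFV |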
      SVM: support vector machine model, LRM: linear regression model, PCA: principal component analysis model, EOF: empirical orthogonal function model, BRM: band ratio model, NDCI: normalized difference chlorophyll index model, WCI: water Chl-a index model, BPANN: back propagation artificial neural network, DL: deep learning, HSI: hyperspectral imagery, CCD: charge coupled device.

      Table 8.  Model performances reported in some previous studies on the estimation of Chl-a concentration in Lake Taihu using satellite data
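The figure of about 86% quoted in the discussion of Table 8 appears to be the simple mean of the four seasonal cross-validation R2 values listed for this study (spring, summer, autumn, winter). This reading is inferred from the numbers rather than stated explicitly in the text:

```latex
\overline{R^2} = \frac{0.88 + 0.88 + 0.94 + 0.74}{4} = \frac{3.44}{4} = 0.86
```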

    • The greatest advantage of the models developed in this study is their high spatial resolution of 16 m. GF-1 WFV can resolve small patches in inland lakes better than MODIS with its 250-m spatial resolution. The retrieval results show that the GF-1 WFV images describe more details of the spatial distribution pattern of Chl-a in Lake Taihu than the MODIS images (Qi et al., 2014; Lary et al., 2016). Several previous studies have also shown the good performance of GF-1 WFV in retrieving Chl-a over inland lakes (Zhu et al., 2017; Xu et al., 2020). Another difference from previous studies is that this study derives the Chl-a information directly from GF-1 WFV TOA reflectance without an AC algorithm. This avoids the need to retrieve the water-leaving radiance through AC, which is typically prone to large errors in turbid and high-biomass waters. The seasonal and spatial distributions of Chl-a in Lake Taihu estimated by our models are consistent with previous studies (Qi et al., 2014; Lary et al., 2016). In addition, the in-situ Chl-a used in this study is measured with a YSI6600 instrument using the fluorescence analysis method (Liu et al., 2010), which allows automatic acquisition and yields results within 30 minutes. In contrast, the commonly used spectrophotometric analysis method is much less efficient: it cannot obtain results automatically and may take nearly a day. Overall, the seasonal models, with their spatial resolution of 16 m, can retrieve the Chl-a with high accuracy and reflect its distribution pattern and trend across the whole of Lake Taihu.
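To give a sense of scale for the resolution comparison, the ratio of nominal pixel footprints can be worked out directly. This is back-of-the-envelope arithmetic that assumes square pixels at the nominal 16 m, 250 m, and 1 km resolutions and ignores off-nadir pixel growth:

```latex
\left(\frac{250\ \mathrm{m}}{16\ \mathrm{m}}\right)^{2} = 15.625^{2} \approx 244,
\qquad
\left(\frac{1000\ \mathrm{m}}{16\ \mathrm{m}}\right)^{2} = 62.5^{2} \approx 3906
```

Under these assumptions, one 250-m MODIS pixel covers roughly 244 GF-1 WFV pixels, and one 1-km pixel covers roughly 3900.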

    • Here, we develop seasonal models to estimate Chl-a in inland lakes at a high spatial resolution of 16 m by directly using GF-1 WFV measurements of TOA reflectance. Compared with MODIS-based models, the proposed models have a much higher spatial resolution and can provide more detail about variations in Chl-a in Lake Taihu. However, there are some limitations and room for improvement. Like other ML-based methods, the RF-derived algorithm is a black-box model whose decision procedure is not easy to interpret. Therefore, it is necessary to select more variables with physical significance for model training, such as indices based on the red and near-infrared channels. Furthermore, the slight differences in the center wavelengths of the four GF-1 WFV cameras may affect the NDVI value. These differences may have been accounted for implicitly when training the models with the artificial intelligence method, and the model outputs appear to have high accuracy. Nevertheless, we will further study the physical mechanisms of these subtle effects by using radiative transfer models to obtain more accurate models in the future. The in-situ Chl-a used in model development is measured by the fluorescence analysis method, which gives relatively smaller values than the spectrophotometric method (Liu et al., 2010); this may cause the Chl-a retrieved by our models to be slightly low. Clearly, future efforts should be dedicated to comparing the fluorescence measurements with widely adopted methods [such as high performance liquid chromatography (HPLC) and the spectrophotometric method] and evaluating the potential impact on algorithm development, to further improve the accuracy of the models. In addition, the limited number of samples and the model results also suggest the need for further improvement of our models. We anticipate collecting more in-situ measured Chl-a and GF-1 WFV images and carrying out experiments for other lakes in the near future. On a broader scale, the approaches and findings of this study may be extended to other eutrophic inland lakes such as Lake Chaohu, Lake Dianchi, and Lake Erie. Our results will help managers and decision-makers adapt their strategies for controlling water eutrophication and cyanobacterial blooms in response to future climate change and human impacts.
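As an illustration of an index built from the red and near-infrared channels, the standard NDVI definition, (NIR − red)/(NIR + red), can be sketched as below. The sketch assumes that GF-1 WFV band B3 is the red channel and B4 the near-infrared channel; this section only states that B1 to B3 form the RGB composite, so the band assignment is labeled as an assumption, and the reflectance values are hypothetical.

```python
import math


def ndvi(red, nir):
    """Normalized difference vegetation index from red and NIR reflectance."""
    denom = nir + red
    if denom == 0:
        return math.nan  # undefined when both reflectances are zero
    return (nir - red) / denom


# Hypothetical TOA reflectances for B3 (red, assumed) and B4 (NIR, assumed).
print(ndvi(red=0.25, nir=0.75))  # 0.5
```

Because the index is a ratio of band differences, a small shift in the center wavelength of either band between cameras changes both the numerator and the denominator, which is why such shifts can propagate into the NDVI value.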

    6.   Conclusions
    • Four seasonal Chl-a estimation models for Lake Taihu with a high spatial resolution of 16 m are developed based on the RF algorithm by directly using GF-1 WFV measurements of TOA reflectance and relevant in-situ measured water quality data. A novel comprehensive index for evaluating variable importance (RIEI) is used instead of a single index (IMSE or INIP) to determine the optimal GF-1 bands and band combinations for improving retrieval accuracy. Compared with MODIS images, which have a coarse spatial resolution of 250 m, the proposed models estimate Chl-a at a fine spatial resolution (16 m) very well, especially in the seasons with high Chl-a. The four seasonal models perform well in the sample-based cross-validation. MODAut performs best (R2 of 0.94 and RMSE of 1.66 mg m−3), followed by MODSpr and MODSum (R2 values of 0.88 and 0.88 and RMSE values of 2.94 and 2.59 mg m−3, respectively). The results of independent sample tests show that these models capture the fine features of the Chl-a distribution pattern and amplitude over Lake Taihu. The proposed models also reveal the temporal and spatial variation characteristics of Chl-a in Lake Taihu, which are related mainly to the distribution of water pollution and the annual climatic cycle.

      The results of this study also illustrate the unique advantages of GF-1 WFV data in relation to the observation of Chl-a, and GF-1 WFV data are becoming increasingly important for monitoring China’s inland water bodies with high spatial resolution. Our study provides a new perspective regarding the acquisition of high-quality, high-resolution Chl-a using remote sensing technology. The application of the high-resolution Chl-a concentration provided in this study could further improve the prediction accuracy of cyanobacterial blooms. Furthermore, this study provides the government with information on the spatial distribution and temporal variations of Chl-a in Lake Taihu, which is of great importance regarding the ecological management of the water and ecosystems of Lake Taihu.

      Acknowledgments. The authors would like to thank the Jiangsu Environmental Monitoring Center for providing the in-situ measured data of Chl-a concentration in Lake Taihu. The authors thank Mrs. Rongrong Hang for providing IT support and James Buxton MSc for editing the English text of this manuscript.
