Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India

Nayak, H.; Silva, J.V.; Parihar, C.M.; Krupnik, T.J.; Sena, D.R.; Kakraliya Suresh Kumar; Jat, H.S.; Sidhu, H.S.; Sharma, P.C.; Jat, M.L.; Sapkota, T.B.

The increasing availability of complex, geo-referenced on-farm data demands analytical frameworks that can guide crop management recommendations. Recent developments in interpretable machine learning techniques offer opportunities to use these methods in agronomic studies. Our objectives were two-fold: (1) to assess the performance of different machine learning methods in explaining on-farm wheat yield variability in the Northwestern Indo-Gangetic Plains of India, and (2) to identify the most important drivers and interactions explaining wheat yield variability. A suite of fine-tuned machine learning models (ridge and lasso regression, classification and regression trees, k-nearest neighbor, support vector machines, gradient boosting, extreme gradient boosting, and random forest) was statistically compared using R², root mean square error (RMSE), and mean absolute error (MAE). The best performing model was then fine-tuned again using a grid search approach for the bias-variance trade-off. Three post-hoc, model-agnostic techniques were used to interpret the best performing model: variable importance (a variable was considered “important” if shuffling its values increased or decreased the model error considerably), interaction strength (based on Friedman’s H-statistic), and two-way interaction (i.e., how much of the total variability in wheat yield was explained by a particular two-way interaction). Model outputs were compared against empirical data to contextualize results and provide a blueprint for future analysis in other production systems. Tree-based and decision boundary-based methods outperformed regression-based methods in explaining wheat yield variability. Random forest was the best performing method in terms of goodness-of-fit, precision, and accuracy, with RMSE, MAE, and R² ranging from 367 to 470 kg ha⁻¹, 276 to 345 kg ha⁻¹, and 0.44 to 0.63, respectively. Random forest was then used for selection of important variables and interactions.
The most important management variables explaining wheat yield variability were nitrogen application rate and crop residue management, whereas the average of monthly cumulative solar radiation during February and March (coinciding with the reproductive phase of wheat) was the most important biophysical variable. The effect size of these variables on wheat yield ranged from 227 kg ha⁻¹ for nitrogen application rate to 372 kg ha⁻¹ for cumulative solar radiation during February and March. Important interactions affecting wheat yield were also detected in the data, namely between crop residue management and disease management, and between nitrogen application rate and seeding rate. For instance, farmers’ fields with moderate disease incidence yielded 750 kg ha⁻¹ less when crop residues were removed than when crop residues were retained. Similarly, wheat yield response to residue retention was higher under low seed and nitrogen application rates. As an inductive research approach, the appropriate application of interpretable machine learning methods can be used to extract agronomically actionable information from large-scale farmer field data.
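
The abstract names three goodness-of-fit metrics (R², RMSE, MAE) and a shuffle-based notion of variable importance. The following is a minimal Python sketch of those ideas, not the authors' code: the toy model, the feature names (`n_rate`, `unrelated`), the synthetic data, and the helper `permutation_importance` are all hypothetical and chosen only for illustration. Importance is measured here as the mean increase in RMSE after shuffling one column.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Return (baseline RMSE, mean RMSE increase per column after shuffling it)."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return baseline, importances


# Hypothetical stand-in for a fitted model: yield depends on the first
# feature only; the second feature is ignored on purpose.
def toy_model(row):
    n_rate, unrelated = row
    return 4000.0 + 10.0 * n_rate


# Synthetic data (assumption): not taken from the study.
data_rng = random.Random(42)
X = [[data_rng.uniform(100, 200), data_rng.uniform(0, 1)] for _ in range(200)]
y = [toy_model(row) + data_rng.gauss(0, 50) for row in X]

predictions = [toy_model(row) for row in X]
baseline_rmse, importances = permutation_importance(toy_model, X, y)

print("RMSE:", round(rmse(y, predictions), 1))
print("MAE:", round(mae(y, predictions), 1))
print("R2:", round(r2(y, predictions), 3))
print("Permutation importance (RMSE increase):", importances)
```

In this sketch the ignored feature gets an importance of zero because shuffling it leaves every prediction unchanged, while shuffling the feature the model relies on inflates the error. Any fitted model exposing a row-wise `predict` function could be substituted for `toy_model`.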

Bibliographic Details
Main Authors:	Nayak, H., Silva, J.V., Parihar, C.M., Krupnik, T.J., Sena, D.R., Kakraliya Suresh Kumar, Jat, H.S., Sidhu, H.S., Sharma, P.C., Jat, M.L., Sapkota, T.B.
Format:	Article biblioteca
Language:	English
Published:	Elsevier 2022
Subjects:	AGRICULTURAL SCIENCES AND BIOTECHNOLOGY, FORESTS, MACHINE LEARNING, WHEAT, YIELDS, CROP RESIDUES
Online Access:	https://hdl.handle.net/10883/22163

id	dig-cimmyt-10883-22163
record_format	koha
institution	CIMMYT
collection	DSpace
country	México
countrycode	MX
component	Bibliographic
access	Online
databasecode	dig-cimmyt
tag	biblioteca
region	North America
libraryname	CIMMYT Library
language	English
topic	AGRICULTURAL SCIENCES AND BIOTECHNOLOGY FORESTS MACHINE LEARNING WHEAT YIELDS CROP RESIDUES
format	Article
topic_facet	AGRICULTURAL SCIENCES AND BIOTECHNOLOGY FORESTS MACHINE LEARNING WHEAT YIELDS CROP RESIDUES
author	Nayak, H. Silva, J.V. Parihar, C.M. Krupnik, T.J. Sena, D.R. Kakraliya Suresh Kumar Jat, H.S. Sidhu, H.S. Sharma, P.C. Jat, M.L. Sapkota, T.B.
author_sort	Nayak, H.
title	Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India
publisher	Elsevier
publishDate	2022
url	https://hdl.handle.net/10883/22163
spelling	dig-cimmyt-10883-22163 2024-01-22T15:37:20Z
2022-09-02T00:25:13Z; 2022-09-02T00:25:13Z; 2022; Article; Published Version
https://hdl.handle.net/10883/22163
10.1016/j.fcr.2022.108640
English
https://www.sciencedirect.com/science/article/pii/S0378429022002118?via%3Dihub#sec0120
Nutrition, health & food security; Transforming Agrifood Systems in South Asia; Resilient Agrifood Systems
CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS); United States Agency for International Development (USAID); Bill & Melinda Gates Foundation (BMGF)
https://hdl.handle.net/10568/127194
CIMMYT manages Intellectual Assets as International Public Goods. The user is free to download, print, store and share this work. In case you want to translate or create any other derivative work and share or distribute such translation/derivative work, please contact CIMMYT-Knowledge-Center@cgiar.org indicating the work you want to use and the kind of use you intend; CIMMYT will contact you with the suitable license for that purpose.
Open Access
India; Amsterdam (Netherlands); Elsevier; 287; 0378-4290; Field Crops Research; 108640
