Using machine learning for crop yield prediction in the past or the future

Morales, Alejandro; Villalobos, Francisco J.; Ministerio de Ciencia e Innovación (España)

The use of machine learning (ML) in agronomy has increased exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance of the models, particularly when they are built for yield forecasting. In this study, we explore the effect of the choice of predictive algorithm, amount of data, and data partitioning strategy on predictive performance, using synthetic datasets from biophysical crop models. We simulated sunflower and wheat data using OilcropSun and Ceres-Wheat from DSSAT for the period 2001-2020 in 5 areas of Spain. Simulations were performed for farms differing in soil depth and management. The dataset of simulated farm yields was analyzed with different algorithms (regularized linear models, random forest, artificial neural networks) as a function of seasonal weather, management, and soil. The analysis was performed with Keras for neural networks and R packages for all other algorithms. Data were partitioned for training and testing in chronological order (i.e., older data for training, newest data for testing) to compare the algorithms' ability to predict future yields by extrapolating from past data. The Random Forest algorithm performed better (Root Mean Square Error, RMSE, 35-38%) than artificial neural networks (37-141%) and regularized linear models (64-65%) and was easier to execute. However, even the best models showed a limited advantage over the predictions of a sensible baseline (average yield of the farm in the training set), which had an RMSE of 42%. Errors in seasonal weather forecasting were not taken into account, so real-world performance is expected to be even closer to the baseline.
Applications of AI algorithms for yield prediction should always include a comparison with the best guess to evaluate whether the increase in predictive power compensates for the additional cost of the data required by the model. Random partitioning of data for training and validation should be avoided in models for yield forecasting. Crop models validated for the region and cultivars of interest may be used before actual data collection to establish the potential advantage, as illustrated in this study.
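
The two recommendations in the abstract (partition by time rather than at random, and compare against a farm-average baseline) can be illustrated with a minimal sketch. This is not the authors' code, which used R and Keras; it is plain Python, and the record layout (`farm`, `year`, `yield`), the function names, and the toy yield values are all hypothetical. It also assumes that the percentage RMSE is the RMSE divided by the mean observed yield, which the abstract does not state.

```python
import math
from collections import defaultdict

# Hypothetical toy records; the yield values are made up for illustration only.
records = [
    {"farm": "A", "year": 2001, "yield": 2.0},
    {"farm": "A", "year": 2002, "yield": 4.0},
    {"farm": "A", "year": 2003, "yield": 4.0},
    {"farm": "B", "year": 2001, "yield": 1.0},
    {"farm": "B", "year": 2002, "yield": 3.0},
    {"farm": "B", "year": 2003, "yield": 2.0},
]

def chronological_split(rows, first_test_year):
    """Older seasons go to training, newer seasons to testing (no shuffling)."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test

def farm_mean_baseline(train_rows):
    """Baseline prediction: average training-set yield of each farm."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train_rows:
        sums[r["farm"]] += r["yield"]
        counts[r["farm"]] += 1
    return {farm: sums[farm] / counts[farm] for farm in sums}

def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed normalization)."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return 100.0 * math.sqrt(mse) / (sum(observed) / len(observed))

train, test = chronological_split(records, first_test_year=2003)
baseline = farm_mean_baseline(train)
observed = [r["yield"] for r in test]
predicted = [baseline[r["farm"]] for r in test]
baseline_rrmse = relative_rmse(observed, predicted)
print(f"Baseline relative RMSE: {baseline_rrmse:.1f}%")
```

Any candidate model would be scored with the same `relative_rmse` on the same test seasons, so its gain over `baseline_rrmse` can be weighed against the cost of collecting the extra inputs.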

Bibliographic Details
Main Authors:	Morales, Alejandro; Villalobos, Francisco J.
Other Authors:	Ministerio de Ciencia e Innovación (España)
Format:	Article (library)
Language:	English
Published:	Frontiers Media 2023-03-30
Subjects:	Wheat, DSSAT, Crop simulation model, Machine learning, Neural network, Sunflower
Online Access:	http://hdl.handle.net/10261/349494 http://dx.doi.org/10.13039/501100004837 http://dx.doi.org/10.13039/501100011033 http://dx.doi.org/10.13039/501100011011 https://api.elsevier.com/content/abstract/scopus_id/85153341851

id	dig-ias-es-10261-349494
record_format	koha
institution	IAS ES
collection	DSpace
country	Spain
countrycode	ES
component	Bibliographic
access	Online
databasecode	dig-ias-es
tag	library
region	Southern Europe
libraryname	Biblioteca del IAS España
language	English
author2	Ministerio de Ciencia e Innovación (España)
author_facet	Ministerio de Ciencia e Innovación (España) Morales, Alejandro Villalobos, Francisco J.
format	article
topic_facet	Wheat DSSAT Crop simulation model Machine learning Neural network Sunflower
author	Morales, Alejandro Villalobos, Francisco J.
author_sort	Morales, Alejandro
title	Using machine learning for crop yield prediction in the past or the future
publisher	Frontiers Media
publishDate	2023-03-30
url	http://hdl.handle.net/10261/349494 http://dx.doi.org/10.13039/501100004837 http://dx.doi.org/10.13039/501100011033 http://dx.doi.org/10.13039/501100011011 https://api.elsevier.com/content/abstract/scopus_id/85153341851
timestamp	2024-05-14T20:48:44Z
funders	Ministerio de Ciencia e Innovación (España); Junta de Andalucía; Agencia Estatal de Investigación (España)
funding	This work was funded by Ministerio de Ciencia e Innovación, Spain, through grant PCI2019–103621, associated with the MAPPY project (JPI-Climate ERA-NET, AXIS call), and the "María de Maeztu" program for centers and units of excellence in research and development [grant number CEX2019–000968-M]. Publication costs were funded by Grupo PAIDI AGR-119, Junta de Andalucía. With funding from the Spanish government through the "Severo Ochoa Centre of Excellence" accreditation (CEX2019–000968-M).
review	Peer reviewed
dates	2024-03-06T19:51:50Z; 2024-03-06T19:51:50Z; 2023-03-30
type	article (http://purl.org/coar/resource_type/c_6501)
citation	Frontiers in Plant Science 14: 1128388 (2023)
identifiers	http://hdl.handle.net/10261/349494; 10.3389/fpls.2023.1128388; 1664-462X; 37063228; 2-s2.0-85153341851
language_code	en
grants	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/CEX2019–000968-M; info:eu-repo/grantAgreement/AEI//PCI2019–103621
version	Publisher's version
dataset	The underlying dataset has been published as supplementary material of the article on the publisher platform at DOI 10.3389/fpls.2023.1128388 (https://doi.org/10.3389/fpls.2023.1128388)
availability	Yes; open; application/pdf
