By Muhammad Mohsin Raza
Blog Editor: Lisa Rothmann
We are happy to kick-start a new blog section called OPP Research Compendia. We aim to encourage plant pathologists to engage in reproducible research practices by providing examples developed by the plant pathology community. In each post, the importance of the research and the main steps of the workflow are briefly summarized, and links are provided to data, scripts and other research outcomes that help readers understand, reproduce and communicate the research findings.
Meet the researcher
In this inaugural post, we showcase a reproducible example of research conducted by Dr. Muhammad Mohsin Raza, a recent graduate from Iowa State who worked under the supervision of Dr. Leonor Leandro. He describes his research, the tools he used and how to reproduce his work. His project is a wonderful narrative, and one of many ways to build a research compendium that encourages reproducibility of the analysis and communication of the findings. These efforts help us plant pathologists (interdisciplinary researchers at our core) shape the impact we have on society through the data we convert into information that can be put into practice.
The plant disease on target
Sudden death syndrome (SDS) is a disease of significant economic importance to soybeans, especially in the United States, where it occurs in 23 of the 28 soybean-producing states and substantially reduces soybean yield. A recent study on the economic impacts of soybean diseases in the U.S. estimated that, from 1996 to 2016, the country lost 6.75 billion dollars to SDS. A key component of effective SDS management is early and accurate detection in soybean fields. Traditional scouting methods, which rely on ground-based visual assessments, are time-consuming, labor-intensive, and often destructive. A more time- and cost-effective alternative for monitoring and quantifying the distribution of SDS in soybean fields is therefore needed. With this in mind, we hypothesized that high-resolution (3 m) satellite imagery can detect SDS in large soybean plots before symptoms are observed visually. After 3 years of research, a paper titled “Exploring the Potential of High-Resolution Satellite Imagery for the Detection of Soybean Sudden Death Syndrome” was recently published as open access in the journal Remote Sensing. The general research workflow, tools and outcomes are shown in the figure below:
Download the data: Figshare
We collected both ground truth and remote sensing data from a soybean field experiment conducted at Iowa State University’s Marsden Farm, located in Boone County, Iowa. We chose this site because it has shown a wide range of SDS foliar incidence since 2010. The site is thus representative of a field-level soybean production system with the potential to develop SDS symptoms significant enough to ensure a successful monitoring campaign.
The ground truth data were obtained as foliar disease incidence, rated visually in the field. Because SDS is rated on the whole plant, we conducted inter- and intra-rater agreement tests, as a training exercise with experienced raters, to ensure accuracy and precision. PlanetScope (PS) satellite images were acquired from Planet Labs Inc., a private imaging company based in San Francisco, CA, USA. Planet Labs provides a free subscription to students and researchers working in academia. PS satellites provide daily 4-band images with red, green, blue (RGB) and near-infrared (NIR) bands at 3-m spatial resolution. In addition, the images provided by PS are geo-rectified (i.e., processed to remove distortions caused by tilt and terrain), radiometrically and atmospherically corrected, and projected.
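To illustrate how 4-band pixels can be paired with plot-level ground truth, here is a minimal sketch that averages each band over the pixels falling inside each plot. It is not the script deposited on Figshare: the function name `plot_band_means`, the band order, the tiny pixel grid and the plot labels are all hypothetical, and a real workflow would read the image and plot boundaries with a GIS library.

```python
from collections import defaultdict
from statistics import mean

# Assumed band order for this sketch; check the metadata of a real image.
BANDS = ("blue", "green", "red", "nir")


def plot_band_means(pixels, plot_ids):
    """Average each band over the pixels that belong to each plot.

    pixels   -- grid (list of rows) of (blue, green, red, nir) reflectance tuples
    plot_ids -- grid of the same shape holding a plot label per pixel,
                or None for pixels outside every plot
    """
    grouped = defaultdict(list)
    for pixel_row, id_row in zip(pixels, plot_ids):
        for values, plot_id in zip(pixel_row, id_row):
            if plot_id is not None:
                grouped[plot_id].append(values)
    return {
        plot_id: {band: mean(v[i] for v in values) for i, band in enumerate(BANDS)}
        for plot_id, values in grouped.items()
    }


# Hypothetical 2 x 3 pixel image covering two plots, "A" and "B".
pixels = [
    [(0.04, 0.06, 0.05, 0.40), (0.04, 0.06, 0.05, 0.42), (0.05, 0.08, 0.09, 0.30)],
    [(0.04, 0.07, 0.05, 0.44), (0.05, 0.08, 0.09, 0.28), (0.90, 0.90, 0.90, 0.90)],
]
plot_ids = [
    ["A", "A", "B"],
    ["A", "B", None],  # the last pixel lies outside both plots and is ignored
]

plot_means = plot_band_means(pixels, plot_ids)
for plot_id, band_means in sorted(plot_means.items()):
    print(plot_id, band_means)
```

Each plot then has one value per band, which can be matched with the incidence rating recorded for that plot on the ground.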
The analytical approach
Download the scripts: Figshare
For classification, we used Random Forest, a powerful and robust machine learning algorithm. Random Forest works by generating many decision trees and aggregating their results to make final predictions. We chose this method because it can handle massive, multidimensional datasets, and it is robust to multicollinearity and over-fitting. Additionally, Random Forest evaluates the predictive importance of input features, such as the reflectance wavebands in our study, and so supports feature selection for subsequent analysis.
We performed Random Forest classification using open source libraries: randomForest and randomForestSRC in R, and Scikit-learn in Python. For mapping, we used ArcGIS Pro, the professional desktop GIS application from Esri (Environmental Systems Research Institute, based in Redlands, CA, USA).
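As a rough illustration of this kind of analysis, the sketch below trains a Scikit-learn Random Forest on plot-level band values and reports the importance of each band. It is not the code from the Figshare repository. The data are synthetic, and the assumed band shifts for diseased plots (lower NIR, higher red), the sample size and the hyperparameters are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
bands = ["blue", "green", "red", "nir"]
n_plots = 200  # hypothetical number of plot-level observations

# Synthetic labels: 1 = diseased plot, 0 = healthy plot.
y = rng.integers(0, 2, size=n_plots)

# Synthetic reflectance per plot. The shifts applied to diseased plots
# are an assumption made for illustration, not a result of the study.
X = rng.normal(loc=[0.04, 0.07, 0.05, 0.40],
               scale=[0.01, 0.01, 0.01, 0.02],
               size=(n_plots, len(bands)))
X[:, 3] -= 0.08 * y  # lower NIR in diseased plots (assumed)
X[:, 2] += 0.03 * y  # higher red in diseased plots (assumed)

# Hold out 30% of the plots to test the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit many trees; the forest aggregates their votes into one prediction.
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = dict(zip(bands, model.feature_importances_))

print(f"Test accuracy: {accuracy:.2f}")
for band, score in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{band}: {score:.3f}")
```

The importance scores sum to 1, so they can be read as the relative contribution of each band to the classification. With real data, the ranking would guide which bands to keep for subsequent analysis.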
Community impacts and contribution
We obtained promising results indicating that high-resolution satellite imagery combined with the Random Forest algorithm has the potential to detect SDS in soybean fields even before the onset of visible foliar symptoms, i.e., before the disease can be detected by manual scouting. This approach may facilitate large-scale monitoring of SDS (and possibly other economically important soybean diseases). This information is useful for guiding recommendations for site-specific management in current and future seasons.
To better communicate the overall approach and the findings, I used an ArcGIS story map to build an interactive tour of the problem statement and the workflow of data collection, automated data processing and analysis for SDS detection. Story maps can be developed for free.
Motivation for reproducible research
I started my PhD with a passion for Epidemiology, Remote Sensing and Geographic Information Systems (GIS). I was thrilled to learn Big Data analytics tools and explore their application in Plant Pathology. In this journey, my adviser and other lab members fully supported me, although it was a new research venture for our lab. Learning new skills is essential for personal development and for creating and achieving new goals. Plus, sharing these skills with others not only fosters vision in them but also deepens our own knowledge and gives rise to new opportunities and collaborations.
While reviewing the literature, I was always fascinated by the techniques and tools researchers used to collect and analyze data. However, their methods and analyses were often irreproducible because the data and code had not been shared. I realized the need for reproducibility early in my PhD and pledged to make my data and code public so that others can run and learn the tools I used for data analysis. Thus, along with writing my dissertation, I deposited all of my data and code in Iowa State University’s data repository on figshare. This makes all the components of the analysis (data extraction from satellite images; training, tuning and testing Random Forest models; and evaluating the quality of SDS predictions in soybean plots) available to our (future) selves, research communities and the public.
Communication is a crucial component when describing scientific findings to the general public. I am enthusiastic about presenting my research in an interactive and visually enriched way so that the audience can understand it fully. Therefore, I presented my research as two extension videos and a story map so that laypeople can also understand what I am doing and why it matters for society. I urge other students and researchers to also believe in reproducibility and make their data and code public so that we can see and learn the tools researchers are using.
Summary/links of the outcomes
- PhD Dissertation
- Original Article
- Research compendium (data + scripts)
- ArcGIS story map
- Outreach video 1 and video 2
We hope this blog section provides encouragement and support to those who are apprehensive about, just starting, or already conducting (graduate) research in reproducible research workflows.