How Much Do We (Plant Pathologists) Value Openness and Transparency?

R4PlantPath Reproducible Research

Our most recent paper examines code and data sharing practices in plant pathology and shares some ideas for what we can do to improve.

Adam Sparks https://adamhsparks.netlify.app , Emerson Del Ponte https://emersondelponte.netlify.app/
2023-03-27

We (Emerson Del Ponte and Adam Sparks) started this initiative, Open Plant Pathology, in early January 2018 with the idea of creating a community in which plant pathologists could come together to share resources and ideas and encourage a freer exchange of information, code and data. One of the reasons was that, a few years earlier, we had started analysing randomly selected plant pathology papers. Initially we looked at 300 papers published from 2012 until 2018, but the work, done with Kaique Alves, Zachary Foster and Nik Grünwald, later grew to encompass 450 papers published from 2012 until 2021 and was published in Phytopathology® in January (Sparks et al. 2023b). What we found across 21 journals that were dedicated to plant pathology research, or that published specialised sections or articles in the field, was not surprising but still disappointing. As a discipline, we simply do not make much, if any, effort to help ensure that others can easily reproduce our work after it is published (Sparks et al. 2023b).

We found that most articles were not reproducible according to our scoring system and failed to take advantage of open science and reproducibility methods that would benefit both the authors and the readers. To wit, the vast majority of articles we looked at made no attempt to share code or data, scoring “0” in our system (Figure 1).

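# Companion package for the paper, plus plotting libraries
library(Reproducibility.in.Plant.Pathology)
library(ggplot2)
library(patchwork)

# Import the article scores
rrpp <- import_notes()

# Panel A: number of articles at each code availability score
a <- ggplot(rrpp, aes(x = as.factor(comp_mthds_avail))) +
  geom_bar(fill = "black") +
  ylab("Count") +
  xlab("Article Score") +
  ggtitle("Code")

# Panel B: number of articles at each data availability score
b <- ggplot(rrpp, aes(x = as.factor(data_avail))) +
  geom_bar(fill = "black") +
  ylab("Count") +
  xlab("Article Score") +
  ggtitle("Data")

# Place the panels side by side, tag them A and B,
# and apply the same theme to both (patchwork's `&` operator)
p <- a + b
p <- p +
  plot_annotation(tag_levels = "A") &
  theme_light()

p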
Figure 1: Aggregated article scores for each of the two categories evaluated. (A) displays ‘Code availability’, where ‘0’ was ‘Not available or not mentioned in the publication’; ‘1’ was ‘Available upon request to the author’; ‘2’ was ‘Online, but inconvenient or non-permanent (e.g., login needed, paywall, FTP server, personal lab website that may disappear, or may have already disappeared)’; and ‘3’ was ‘Freely available online to anonymous users for foreseeable future (e.g., archived using Zenodo, dataverse or university library or some other proper archiving system)’; ‘NA’ indicates that no code was created to conduct the work that was detectable. (B) shows ‘Data availability’, where ‘0’ was ‘Not available or not mentioned in the publication’; ‘1’ was ‘Available upon request to the author’; ‘2’ was ‘Online, but inconvenient or non-permanent (e.g., login needed, paywall, FTP server, personal lab website that may disappear, or may have already disappeared)’; and ‘3’ was ‘Freely available online to anonymous users for foreseeable future (e.g., archived using Zenodo, dataverse or university library or some other proper archiving system)’; ‘NA’ indicates that no data were generated, e.g., a methods paper. Figure reproduced from Sparks et al. (2023b) under a Creative Commons Licence using code found in Sparks et al. (2023a).

That’s a pretty shocking, but unsurprising figure.
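
If you would rather see the share of articles at each score than read counts off the bars, a minimal sketch along the following lines may help. The helper `score_share()` and the vector `toy_scores` are our own illustrative names, not part of the package, and the toy values are made up purely to show the mechanics; they are not the study's results.

```r
# Percentage of articles at each score, keeping NA as its own category.
# `scores` is any vector of article scores (0 to 3, or NA).
# Hypothetical helper for illustration; it is not part of the package.
score_share <- function(scores) {
  counts <- table(scores, useNA = "ifany")
  round(100 * prop.table(counts), 1)
}

# Made-up scores for illustration only; these are NOT the study's data.
toy_scores <- c(0, 0, 0, 0, 0, 0, 1, 3, NA, NA)
score_share(toy_scores)

# With the package loaded as in the figure code, the same helper could
# be applied to the real columns:
# rrpp <- import_notes()
# score_share(rrpp$comp_mthds_avail)
# score_share(rrpp$data_avail)
```

Keeping `NA` as a separate category matters here because, in our scoring, `NA` means that no code or data were generated at all, which is different from code or data that exist but are not shared.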

We get it. It’s just one more thing (or several) that you have to do when you’re prepping that paper for submission. We mean, why bother ensuring that your code and data are available? The paper describes everything, and if anyone has any questions they can just contact you, right?

Except it isn’t that easy. At least not for the readers. Stodden et al. (2018) examined

“a random sample of 204 scientific papers published in the journal Science after the implementation of their policy in February 2011. We found that we were able to obtain artifacts from 44% of our sample and were able to reproduce the findings for 26%. We find this policy—author remission of data and code postpublication upon request—an improvement over no policy, but currently insufficient for reproducibility.”

The whole article is available from PNAS and it’s well worth a read if you’re at all interested, which we assume that you are if you’re reading this blog post. Going further, Tedersoo et al. (2021) published “Data sharing practices and data availability upon request differ across scientific disciplines”, saying, “We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals.” We know this all too well from personal experience. We have tried and failed to get data missing from papers in order to reproduce work or models that we were interested in; the authors either didn’t respond or, if they did, were dismissive and didn’t or weren’t able to provide what we were looking for.

In fact, it seems that it’s not that we don’t want or intend to share; rather, we say that we want to but then we don’t (Watson 2022).

But looking at Figure 1, in plant pathology we mostly don’t even mention that the data or code are available on request (a score of “1”).

So what is it then? Just no time? Lack of know-how? We just don’t care? Or maybe we haven’t been provided with enough training to use the tools and enough information to realise how beneficial it really is. There are some good examples in the community, though: Open Wheat Blast shows what can be achieved when scientists collaborate openly.

Now that we’ve quantified the problem, we would like to see more of these. Feel free to contact any of us directly or through Mastodon, Twitter, Slack (an open invite) or GitHub. We’re here to help; after all, it’s our mission.

Colophon

This post was constructed using R Version 4.4.0 (R Core Team 2022).

─ Session info ─────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       macOS Sonoma 14.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Australia/Perth
 date     2024-04-29
 pandoc   3.1.13 @ /opt/homebrew/bin/ (via rmarkdown)

─ Packages ─────────────────────────────────────────────────────────
 ! package                            * version date (UTC) lib source
 P bslib                                0.7.0   2024-03-29 [?] CRAN (R 4.4.0)
 P cachem                               1.0.8   2023-05-01 [?] CRAN (R 4.4.0)
 P cellranger                           1.1.0   2016-07-27 [?] CRAN (R 4.4.0)
 P cli                                  3.6.2   2023-12-11 [?] CRAN (R 4.4.0)
 P colorspace                           2.1-0   2023-01-23 [?] CRAN (R 4.4.0)
 P crayon                               1.5.2   2022-09-29 [?] CRAN (R 4.4.0)
 P digest                               0.6.35  2024-03-11 [?] CRAN (R 4.4.0)
 P distill                              1.6     2023-10-06 [?] CRAN (R 4.4.0)
 P downlit                              0.4.3   2023-06-29 [?] CRAN (R 4.4.0)
 P dplyr                                1.1.4   2023-11-17 [?] CRAN (R 4.4.0)
 P evaluate                             0.23    2023-11-01 [?] CRAN (R 4.4.0)
 P fansi                                1.0.6   2023-12-08 [?] CRAN (R 4.4.0)
 P farver                               2.1.1   2022-07-06 [?] CRAN (R 4.4.0)
 P fastmap                              1.1.1   2023-02-24 [?] CRAN (R 4.4.0)
 P generics                             0.1.3   2022-07-05 [?] CRAN (R 4.4.0)
 P ggplot2                            * 3.5.1   2024-04-23 [?] CRAN (R 4.4.0)
 P glue                                 1.7.0   2024-01-09 [?] CRAN (R 4.4.0)
 P gtable                               0.3.5   2024-04-22 [?] CRAN (R 4.4.0)
 P highr                                0.10    2022-12-22 [?] CRAN (R 4.4.0)
 P hms                                  1.1.3   2023-03-21 [?] CRAN (R 4.4.0)
 P htmltools                            0.5.8.1 2024-04-04 [?] CRAN (R 4.4.0)
 P jquerylib                            0.1.4   2021-04-26 [?] CRAN (R 4.4.0)
 P jsonlite                             1.8.8   2023-12-04 [?] CRAN (R 4.4.0)
 P knitr                                1.46    2024-04-06 [?] CRAN (R 4.4.0)
 P labeling                             0.4.3   2023-08-29 [?] CRAN (R 4.4.0)
 P lifecycle                            1.0.4   2023-11-07 [?] CRAN (R 4.4.0)
 P magrittr                             2.0.3   2022-03-30 [?] CRAN (R 4.4.0)
 P memoise                              2.0.1   2021-11-26 [?] CRAN (R 4.4.0)
 P munsell                              0.5.1   2024-04-01 [?] CRAN (R 4.4.0)
 P patchwork                          * 1.2.0   2024-01-08 [?] CRAN (R 4.4.0)
 P pillar                               1.9.0   2023-03-22 [?] CRAN (R 4.4.0)
 P pkgconfig                            2.0.3   2019-09-22 [?] CRAN (R 4.4.0)
 P R6                                   2.5.1   2021-08-19 [?] CRAN (R 4.4.0)
 P readODS                              2.2.0   2024-02-01 [?] CRAN (R 4.4.0)
 P readr                                2.1.5   2024-01-10 [?] CRAN (R 4.4.0)
 P Reproducibility.in.Plant.Pathology * 1.0.1   2024-04-26 [?] Github (openplantpathology/Reproducibility_in_Plant_Pathology@13ce7f6)
 P rlang                                1.1.3   2024-01-10 [?] CRAN (R 4.4.0)
 P rmarkdown                            2.26    2024-03-05 [?] CRAN (R 4.4.0)
 P rstudioapi                           0.16.0  2024-03-24 [?] CRAN (R 4.4.0)
 P sass                                 0.4.9   2024-03-15 [?] CRAN (R 4.4.0)
 P scales                               1.3.0   2023-11-28 [?] CRAN (R 4.4.0)
 P sessioninfo                          1.2.2   2021-12-06 [?] CRAN (R 4.4.0)
 P stringi                              1.8.3   2023-12-11 [?] CRAN (R 4.4.0)
 P tibble                               3.2.1   2023-03-20 [?] CRAN (R 4.4.0)
 P tidyselect                           1.2.1   2024-03-11 [?] CRAN (R 4.4.0)
 P tzdb                                 0.4.0   2023-05-12 [?] CRAN (R 4.4.0)
 P utf8                                 1.2.4   2023-10-22 [?] CRAN (R 4.4.0)
 P vctrs                                0.6.5   2023-12-01 [?] CRAN (R 4.4.0)
 P withr                                3.0.0   2024-01-16 [?] CRAN (R 4.4.0)
 P xfun                                 0.43    2024-03-25 [?] CRAN (R 4.4.0)
 P yaml                                 2.3.8   2023-12-11 [?] CRAN (R 4.4.0)
 P zip                                  2.3.1   2024-01-27 [?] CRAN (R 4.4.0)

 [1] /Users/adamsparks/Developer/GitHub/openplantpathology/OpenPlantPathology/renv/library/macos/R-4.4/aarch64-apple-darwin20
 [2] /Users/adamsparks/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
 [3] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library

 P ── Loaded and on-disk path mismatch.

────────────────────────────────────────────────────────────────────
References

R Core Team. 2022. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
Sparks, A. H., Grünwald, N., Foster, Z., and Del Ponte, E. M. 2023a. Openplantpathology/reproducibility_in_plant_pathology: Phytopathology 1.0.0.
Sparks, A. H., Del Ponte, E. M., Alves, K., Foster, Z. S. L., and Grünwald, N. 2023b. Openness and computational reproducibility in plant pathology: Where we stand and a way forward. Phytopathology.
Stodden, V., Seiler, J., and Ma, Z. 2018. An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences. 115:2584–2589.
Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., et al. 2021. Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data. 8.
Watson, C. 2022. Many researchers say they’ll share data but don’t. Nature. 606:853–853.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/openplantpathology/OpenPlantPathology, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".