Is ecology ready for big code and the errors it hides?

I recently was informed that a paper in a glossy big name journal that I had paid attention to was retracted due to a coding error. I’m not going to name the paper because that is not the point. Quite the opposite – I commend authors who do the right thing, because mistakes happen to all of us.

Rather, I want to return to a point briefly raised in my earlier post and refreshed by the recent retraction. Ecology is increasingly turning into a field in which a single paper often hinges on a big pile of code (which I’ll define as many hundreds to thousands of lines of code). There are still plenty of field-based papers that can be analyzed simply using point-and-click statistical software. But the big-code papers are certainly a growing fraction. So my question is: what are we going to do about the fact that ecology is increasingly advancing through “big code” papers? The solution is NOT individual-based – closing our eyes and then excoriating the unlucky souls who actually get caught in an error. Like all fields that have taken error seriously (especially medicine and transportation, where errors can be fatal), we need to treat this as a community problem that falls on all of us.

Broadly, I think we need to look at solutions in two areas: peer review and training.

I don’t have great solutions for peer review. There has been a big push in the past decade to make code and data available to reviewers (following the push the previous decade to make code and data available to readers). This is great. But to date, the high-profile errors I am aware of were all caught at the reading stage, not the peer-review stage (of course the peer-review stage is less public, so this may be misleading). Given that we are in a peer-review crisis*, I don’t think it is a serious suggestion to tell peer reviewers that it is now their job to read 1,000 lines of computer code line by line, and to play with and debug that code, on top of reading the 400 lines of prose in the main paper. And in fairness, this level of technical oversight has never been part of peer review – peer reviewers review high-level descriptions of statistical approaches, but they have never been expected to verify every step in the statistical process. Nor are they expected to pull voucher specimens and confirm taxonomic accuracy. That’s just not the role of peer review. Add in the brokenness of peer review and I don’t think this is where the solutions lie. Although if somebody has a great idea, I would love to hear it.

I think the examples of statistics and taxonomic ID in the last paragraph are telling. We trust the author because we know the author has received extensive training in these areas and has learned what standards of certainty and caution to apply in these endeavors.

We DO NOT have this culture around code. Most scientists learn coding in a sink-or-swim, iterate-until-it-works fashion. And almost no scientists know or have borrowed the key techniques in widespread use in the software industry, which deals with MUCH more complicated code (when I was in the software industry I was part of a team of 5 people working on a code base with 1,000,000 lines of code). So I think training around big-code best practices is the key missing piece. It’s not super obvious how to inject this training because, by the very nature of being a new problem, the senior scientists training junior scientists are also weak in this domain.

Here are a few quick ideas on best practices for training scientists to be more (though never 100%) error-free.

Actually teach coding – most departments now teach stats through one or more courses, but coding is still supposed to come by osmosis. It often starts with scripting stats, and slowly evolves into using if/then/else conditionals, while loops, etc. Often it doesn’t advance much further into more advanced principles like factoring code into functions or encapsulation (using objects, functions, or other ways to avoid global variables).
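To make the encapsulation point concrete, here is a minimal R sketch (the surveys data frame and the site_richness function are invented purely for illustration) contrasting a script-style calculation that leans on the global workspace with a small function that receives its inputs explicitly:

```r
# Toy data frame, invented for the example
surveys <- data.frame(
  site    = c("A", "A", "B", "B"),
  species = c("sp1", "sp2", "sp1", "sp3")
)

# Script style: this line silently depends on whatever 'surveys' happens to
# be sitting in the global workspace when it runs
richness <- length(unique(surveys$species))

# Encapsulated: the function receives everything it needs as an argument,
# so there is no hidden dependence on global state
site_richness <- function(obs) length(unique(obs$species))
richness <- site_richness(surveys)
```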

Use code factoring – following on the last point, most students launch into big code from scripting, where it is common to take 5-10 lines and paste them over and over again, then tweak those lines for different species, etc. This introduces enormous opportunities for errors. It is so much better to factor the code and make a function that is called over and over, so those 5-10 lines appear only once and errors in them only need to be fixed in one place. That 1,000,000-line project I mentioned was manageable because it was highly factored, about 5 levels deep (i.e. layers of functions on top of layers of functions).
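Here is a hedged toy illustration in R of what factoring buys you (the data frame and the per-species mean are invented for the example): the copy-paste version repeats one line with hand edits per species, which is exactly where pasted-but-not-edited bugs creep in, while the factored version puts the calculation in a single function.

```r
# Invented toy data: abundances of three species
obs <- data.frame(
  species   = rep(c("sp1", "sp2", "sp3"), each = 4),
  abundance = c(3, 5, 2, 4, 10, 12, 9, 11, 1, 0, 2, 1)
)

# Copy-paste style: the same line pasted and hand-edited per species
m_sp1 <- mean(obs$abundance[obs$species == "sp1"])
m_sp2 <- mean(obs$abundance[obs$species == "sp2"])
m_sp3 <- mean(obs$abundance[obs$species == "sp2"])   # classic pasted-but-not-edited bug

# Factored: the calculation appears exactly once, so a fix happens in one place
mean_abundance <- function(data, sp) mean(data$abundance[data$species == sp])
means <- sapply(unique(obs$species), mean_abundance, data = obs)
```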

Use defensive coding – once you have a function, does the code do sanity checks on the parameters being passed into it? Does it use assert statements/tests to make sure things are progressing reasonably?
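As a minimal sketch of what this can look like in R (the Shannon diversity function here is invented purely as a stand-in), stopifnot() turns silently bad inputs into an immediate, loud failure:

```r
# Illustrative function (a Shannon diversity index) with sanity checks up front
shannon <- function(counts) {
  stopifnot(
    is.numeric(counts),   # right type
    length(counts) > 0,   # non-empty
    all(counts >= 0),     # no negative abundances
    sum(counts) > 0       # at least one individual observed
  )
  p <- counts[counts > 0] / sum(counts)   # drop zeros so log() is defined
  -sum(p * log(p))
}

shannon(c(10, 5, 0, 2))   # runs and returns a sensible value
# shannon(c(-1, 5))       # stops immediately with an informative error
```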

Use hand-calculated walk-throughs – this involves all of the above steps and more. Take a function, set a breakpoint on a simple test case (e.g. a few data points), step line by line through the code, verifying it is doing what you intended and paralleling it with hand calculations to see if you get the same answer. Or just put a billion print statements scattered through the code so you can follow what is happening every few lines, and then follow along by hand. Now of course you are not going to hand-verify every calculation; otherwise there would be no point in coding it. But hand-check at least one species or one site or one something.
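Here is one possible shape for this in R, using a chatty version of the illustrative Shannon function from above on a case small enough to work by hand: two species with one individual each give proportions of 0.5 and 0.5, so the answer should be log(2), about 0.6931. A breakpoint (browser() in R, or an IDE breakpoint) would serve the same purpose as the print statements.

```r
# A chatty version of the illustrative Shannon calculation: print the
# intermediate values so they can be checked against hand arithmetic
shannon_verbose <- function(counts) {
  cat("counts:     ", counts, "\n")
  p <- counts[counts > 0] / sum(counts)
  cat("proportions:", round(p, 4), "\n")   # compare these to hand-worked values
  H <- -sum(p * log(p))
  cat("H:          ", round(H, 4), "\n")
  H
}

# Hand calculation: H = -(0.5*log(0.5) + 0.5*log(0.5)) = log(2) = 0.6931...
shannon_verbose(c(1, 1))
```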

Test boundary cases – use parameters to simplify your code and see if it matches already-known results. Can you dial dispersal down to zero and watch drift take over? Or up to 100%, in which case it should act like one giant population? Can you look at the smallest geographic range and the largest geographic range and see if those produce sensible results? Can you run a scenario where there is no variance within groups (or between groups)? Etc.
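To illustrate, here is a deliberately over-simplified, hypothetical two-patch model in R; the point is not the model itself but the habit of running code at parameter values where the answer is already known.

```r
# Hypothetical two-patch model, invented only to illustrate boundary-case checks;
# 'dispersal' is the fraction of each patch that moves to the other patch each step
simulate_patches <- function(n0, dispersal, steps = 50) {
  n <- n0
  for (t in seq_len(steps)) {
    moved <- dispersal * n
    n <- n - moved + rev(moved)   # emigrants leave, immigrants arrive from the other patch
  }
  n
}

# Boundary case 1: no dispersal, so the patches should stay exactly where they started
simulate_patches(c(100, 10), dispersal = 0)     # expect 100, 10

# Boundary case 2: half of each patch swaps every step (fully mixed),
# so the two patches should converge to the same size
simulate_patches(c(100, 10), dispersal = 0.5)   # expect 55, 55
```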

Try alternative methods – use two different R packages that are supposed to do the same thing. Use two different algorithms. Are the results qualitatively similar?
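A tiny R sketch of the idea, here using two independent routes to the same regression slope (a fitted linear model versus the textbook covariance-over-variance formula) rather than two packages:

```r
# Fake data, invented for the example
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)

# Route 1: the slope from a fitted linear model
slope_lm <- unname(coef(lm(y ~ x))["x"])

# Route 2: the textbook formula, slope = cov(x, y) / var(x)
slope_hand <- cov(x, y) / var(x)

all.equal(slope_lm, slope_hand)   # TRUE (up to floating point) if the two routes agree
```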

Noodle your results – spend time getting to know what is in your data and outputs. Print out a graph for every single species, even if looping and summarizing over species is the main point. Eyeball those graphs. Change parameters. Do they move results in the expected direction? Dive into intermediate results for sanity checks.
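One cheap way to do this in R is to dump a diagnostic plot per species into a single PDF and flip through it by eye; the toy data frame below is invented for the example.

```r
# Invented toy data: yearly abundances for three species
obs <- data.frame(
  species   = rep(c("sp1", "sp2", "sp3"), each = 4),
  year      = rep(2001:2004, times = 3),
  abundance = c(3, 5, 2, 4, 10, 12, 9, 11, 1, 0, 2, 1)
)

# One page per species in a single PDF, purely for eyeballing
pdf("species_checks.pdf")
for (sp in unique(obs$species)) {
  d <- obs[obs$species == sp, ]
  plot(d$year, d$abundance, type = "b",
       main = sp, xlab = "Year", ylab = "Abundance")
}
dev.off()
```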

Use code walk throughs – do this with your lab (or with an advisor or a peer). We spend a lot of time discussing with colleagues our writing, our presentations, and our statistical methods. We don’t spend a lot of time discussing our code. That’s a problem. While I suggested it is unrealistic to expect peer reviewers to take on the burden of reading code, it is entirely realistic and appropriate to ask our co-authors and our labmates to read our code, just like we ask them to read ugly first drafts of papers.

Use automated testing – write some extra code that runs your core scientific code with input that produces known output. Then automatically run that “test harness” every day to make sure the code hasn’t quietly broken. Add more test scenarios as your code advances.
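As a minimal sketch, here is what such a test harness could look like using the testthat package (one option among several), with the illustrative Shannon function from earlier and hand-worked expected answers. Saved as a script, it can be re-run with Rscript after every change, or on a schedule.

```r
library(testthat)

# The illustrative function under test (in real use it would live in its own
# file and be loaded with source())
shannon <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

test_that("shannon matches hand-worked answers", {
  expect_equal(shannon(c(1, 1)), log(2))         # two equally common species
  expect_equal(shannon(c(5, 0, 0)), 0)           # a single species: no diversity
  expect_equal(shannon(c(2, 2, 2, 2)), log(4))   # four equally common species
})
```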

Use GAI when it is good enough – someday tools like ChatGPT may be able to help here. I am skeptical that they are currently good enough at finding subtle errors in complex code to address today’s big-code errors. But I figured it is worth speculating about the future.

As you can see from the hyperlinks, at least four of these ideas are common in the software industry and I think are useful in scientific coding (not all software engineering practices are). A quick google can pull up more information than you ever wanted to know about them.

Now you may be saying to yourself, that sounds like a lot of work. It’s a totally different vision than struggling line by line to reach the top of the mountain where your code runs all the way through once and produces a result and then you celebrate. Yes. Yes it is. That is exactly the point. It is extra work to set up controls and replication in an experiment or to file voucher specimens too, but we train everybody that this is part of the rigor of science. We need to have the same kind of training and expectations around big code. And rigor in big coding is nothing more or less than spending a lot of time with the code AFTER it seems finished to be sure it is correct. Readers of papers should have a right to assume this level of diligence was used on the big code papers they read**.

In an ideal world, we could briefly summarize the safe big-code practices we used in our methods sections, so that those could be peer reviewed. Methods sections are for establishing that scientifically normed and rigorous approaches were followed. That’s how it works with statistics – we write enough to convince the reader that we knew what the pitfalls were and paid attention to them, while not giving every single line of code. While I like this idea, not only have I not done it myself, I don’t think I’ve ever seen it done. And indeed I mostly think that if I started talking about the boundary cases I tested and the intermediate results I looked at, people would read me as an inexperienced or unconfident coder, rather than the opposite. So there probably needs to be culture change here too.

That’s my philosophy and proposed solution. What do you think? What is yours? How could we peer review rigor of software development without actually expecting a peer reviewer to read the code? Or should we expect them to read the code?

*(Since COVID, finding reviewers for papers has been tedious and slow – way more nos than yeses, despite knowing that most of those saying no are still submitting lots of papers.)

**And to be clear, I am not expressing an opinion on whether that time was or wasn’t invested in some of the cases where errors were caught post-publication; errors will still slip through even when extensive caution is exercised.

HT – thanks to Jacquelyn Gill for conversations inspiring this blog post.
