A successful paper reproduction project

A standard part of our natural language courses, and I believe of other ML-based science courses too, is reproducing papers. From several papers on offer, the students are invited to pick one and reproduce its work as faithfully as possible. This is great practice for several reasons:

  • the students get a blueprint for building complex systems and learn the details of their inner workings

  • in the case of a successful reproduction it can serve as infrastructure to be extended with students’ own experiment ideas

  • the science and the paper itself are challenged - are the results reproducible?

I’ve been lucky to have this task more than once, and each time it has been very insightful. I really enjoyed my reproduction project this semester, and it turned out to be a very successful one in general. I am sure that a great deal of that is because I had a wonderful and competent team to work with, and also because we are getting more experienced with the matter.

Bringing my years of software engineering experience with me into these (late) studies, I also get to benefit from the modern software development practices I picked up along the way.

The project

I took only one course this semester because I was working at Explosion (the makers of spaCy) for its whole duration, and I feel very lucky to have picked such a great one! The course was called Language, Vision and Interaction. We focused on models that work at the intersection of natural language processing and computer vision, while also learning how to model dialogue. We used the GuessWhat?! dataset and explored some more recent papers featuring it.

The project was structured to give us a broad overview of the field. We started by reading a large survey paper, followed by many more recent papers. To help us retain the knowledge, and to not rely solely on motivation to get through that giant reading list, we were asked to pick a subset of the papers and present it to each other.

This is the final outcome.

Picking the right paper

Even before taking on the task of presenting the subset of papers, my group kept a little document with reproduction ideas which got updated with each paper read. By the time the presentations were over and we needed to make our pick, we already had a solid list to look at.

Besides interest in the idea behind a paper and its specific implementation, we ranked reproducibility potential equally high. Papers are definitely not all made equal in this regard. Some helpful hints for judging this potential:

  • Does the paper have a reproducibility section? How informative is it?

  • Is the code in some accessible public repository?

  • Are the authors known to be responsive and approachable?

The process

This might not be the definitive guide, and your style of delivery may differ a bit, but here are a few points that might find a place in your process.

Project roadmap

Even though I had to go through quite the mental shift to adjust to working in research settings and conditions, I still always think about delivering the minimum stable and most useful product first, budgeting the time to thoroughly test each change and accounting for the documentation overhead.

Below is our very simple roadmap for this project. Read it as prioritised phases rather than a strict sequence, since this type of work is never linear and certain things always need more attention. The several points in the “Prep” phase could be tackled individually, which we did, and each point would then move through the process until its end:

  • we would make it “run”

  • conduct all our experiments with the model that is running and document them in W&B

  • immediately write the documentation in the README for this part

  • take notes for the report in a separate document.

Again, the team was absolutely brilliant when it came to taking responsibility and doing their part at each moment, which was very lucky. This shows that a roadmap doesn’t require fancy tools or extremely detailed planning, just direction and accountability.

Making it just “run”

If you are doing a reproduction project, you could be reproducing code written the same year the reproduction takes place, two or three years earlier, or sometimes, if you are particularly unlucky, even more. In machine learning years, that is ancient.

You open this old code with its relics that make no sense from a 2022-informed perspective, and it is so easy to fall into the trap of jumping to improvements immediately. Making it just “run” sounds unglamorous, but that is exactly what a reproduction project asks for.

Sometimes a lot of code changes cannot be avoided because there is no proper backward compatibility. Maybe the code cannot run on your GPU machine because its old software versions clash with the installed CUDA. Maybe you will need to work on efficiency to make it fit your limited computational resources. In that case, good luck! Remember to adjust your roadmap too once you find all of this out.

Whether you have a lot of roadblocks or just a few, they can prove to be major ones: outdated documentation that makes you read a lot of code, missing data files that can only be obtained by contacting the authors, or very persistent errors from very old code that you can barely figure out by googling. Plan the most time for just making it “run”, no matter how many models are in question.

One small practical tip: when there are two or three people working on a project, pushing directly to the main branch might seem acceptable. However, when you are constantly introducing crucial changes on such a short timeframe, you will benefit more from working on branches and only merging meaningful, tested PRs. Any change can then be easily reverted, and the project moves forward in a more controlled manner.
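As an illustration, the whole flow can be sketched in a throwaway repository. The branch name, file, and commit messages below are made up for the example, not taken from our project:

```shell
#!/bin/sh
# Sketch of the branch-and-merge workflow in a temporary repository.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main .
git config user.email "student@example.com"
git config user.name "Student"

echo "original research code" > train.py
git add train.py && git commit -q -m "Import original code"

# Each crucial change lives on its own branch...
git checkout -q -b fix/cuda-version
echo "patched for newer CUDA" >> train.py
git add train.py && git commit -q -m "Fix CUDA version clash"

# ...and is merged back into main only once it has been tested.
git checkout -q main
git merge -q --no-ff fix/cuda-version -m "Merge tested CUDA fix"
git log --oneline
```

Using `--no-ff` keeps an explicit merge commit per PR, so a problematic change can be reverted in one step with `git revert -m 1 <merge-commit>`.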

Documenting early

Good documentation is crucial to the project. Documentation can also quickly pile up and become overwhelming, or fall out of date and become confusing. It might not seem like a priority in the frantic exploration phase, but it is important to keep it up to date from the very beginning.

Because we initialised and structured the documentation even before writing any code or having anything running, it was easy to fill it in as the project went on. I worked on four different machines, and the environment replication files we set up, the links to all the data needed, the detailed copy-paste-ready commands, and the troubleshooting sections made every setup go through without issues or wasted time.
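An environment replication file can be as simple as a pinned conda `environment.yml` checked into the repository. The sketch below is hypothetical; the name, channels, and versions are illustrative, not our actual setup:

```yaml
# Hypothetical environment.yml; package names and versions are examples.
name: reproduction-project
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.7        # match the Python version the original code targets
  - pytorch=1.4       # pinned to stay compatible with the old research code
  - pip
  - pip:
      - -r requirements.txt   # remaining pinned packages, frozen with pip
```

Then `conda env create -f environment.yml` gives every new machine the same environment in one command.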

One easy way to keep the documentation up to date is to require it with every new PR: when a PR changes something documented, it cannot be merged without the corresponding documentation update. Here is an example of such a PR in this project.

Contacting the authors

I will preface this by saying that we were incredibly lucky with the authors. We had both data and parts of the code that weren’t online sent to us, and we were supported in our theoretical musings too.

If you ask meaningful questions and don’t expect anyone to do the work for you, I am sure that most researchers would be happy to chat about their work. It might potentially save you a lot of time.

Prioritisation, flexibility and time for reflection

Prioritising getting the exact results featured in the paper means there will be sufficient time for reflection and interpretation. Changing as little as possible in the code before getting those initial results also makes the interpretation simpler. Machine learning models are sensitive to so many different changes: in the data, the configuration, or the code. Making a lot of changes before attempting to reproduce would invite a lot of speculation about why the results differ.

Prioritisation, and re-prioritisation along the way as you learn, may call for flexibility in the initial plans. We managed to fulfil all of our plans except running one of the five models we attempted. The problem was our computational resources, of which we had considerably fewer than the researchers who wrote the code, and finding a workaround was taking quite a bit longer. As we prioritised the results and their interpretation, it became clear at some point that they would suffer if we kept putting our time into making that model run. So we decided to let it go and shift it to phase 2 of the project, where we extended the paper’s explorations with our own experiments and ideas. Our reproduction was still great and meaningful.

Using a platform like W&B

Your experiments will be logged with all their particularities, and you will have meaningful graphs to include in the report or simply to learn from. Once you create a project the whole team can access, these results are immediately shared with everyone. There are many other benefits, but these alone are enough to consider it.

Contributing back to the original research repositories

Oh, why not! And may I say, please do! Research projects run on very rushed deadlines, in a small, hasty timeframe before the researchers run off to the next thing. You will be helping the students after you avoid the same fate and have the code just “run” in the year 2022, and you will also be doing science a favour by giving a bit back to the community working on these problems.

If this is not enticing enough, maybe the networking aspect of submitting code to people relevant to your research interests will be, or simply having your name as a contributor on a frequented repository, which might look nice to future employers.

We gave back to this repository; the contributions to the other repository we used are still pending.

Conclusion

Thanks for reading, and good luck with your paper reproduction! In these kinds of projects, science meets project management meets software engineering practice, and it helps to take note of all three equally. I hope some of these experiences were useful to you.

This is the first post in my series inspired by this particular project. Here is an article about phase two, extending the reproduction project with our own experiments and ideas, and a quick technical guide on how to modernise, improve and optimise an older research code base.
