Beyond paper reproduction: adding your own experiments and reflections

Recently I had the chance to do a paper reproduction project as part of a course I took this semester at university. The second phase of the project involved extending the reproduced work with our own experiments, improvements and reflections. With only a month for everything from ideation to writing the report, the real challenge was how to make the best use of the existing infrastructure.

In this short article, I share a bit of our experience, what made it enjoyable instead of stressful and how we managed to get meaningful results in the end despite the computational and time constraints. It is divided into three parts: some general hints about picking experiments, a more detailed report section about what this meant in practice for our project, and a short section at the end on reflections, result interpretations and report writing.

Picking experiments

As I was writing this section, it kept growing far beyond any other part of this article. Therefore, I organised it into short bullet-pointed paragraphs, each opening with its main idea. So, what can help when we need to pick meaningful experiments to conduct on a pretty tight schedule, in a knowledge area we are just beginning to get our footing in?

  • Authors’ own hints. It is customary for authors to mention, at the end of their paper, a few open research ideas and perspectives they didn’t have time to explore. Often these are things you already thought of while looking at the data and the various setups and models they explored, or they might be new to you. The authors know all the nooks and crannies of their subject, so unless they already have a follow-up paper out that covers these suggestions, it makes sense to explore them first.

  • “Huh?!” moments. Reproducing the paper gives us the unique perspective of questioning its building parts. Has there been a moment of “huh?!” when exploring any part of the project? If so, those details that elicited that kind of response might be a good foundation for finding what other ideas to try.

  • The wider research field. Inspiration can also come from the larger field. Researching a bit more widely than the immediate narrow field might bring in some fresh outlooks, new methods to try or connections to make.

  • Respecting the time constraints. What can we realistically achieve in the given amount of time? Do we only need to manipulate the data, or do we need to make model and infrastructure changes in our reproduced project? Do we have a running version of all the underlying infrastructure so that we can start training right away, or are we still lacking some crucial work there? And lastly, how long do the models need to train given the computational resources we have? Sitting down and calculating the time we need to get the results we want, plus a buffer for optimisation and fine-tuning if applicable, is something our future selves will really appreciate when trying to submit the paper on time.

  • Team competences. Doing something creative might require a certain level of comfort with the subject matter to be able to expand its boundaries. It is a good idea to go over the team's strengths and see where exactly you can make the biggest contribution. If you take on a major model restructuring while the team overall is still struggling with (advanced) machine learning concepts or debugging, it might not be in your best interest given the time.

How these hints translate into practice: the example of our project

We reproduced the paper: Greco, C., Testoni, A., & Bernardi, R. (2020). Which Turn do Neural Models Exploit the Most to Solve GuessWhat? Diving into the Dialogue History Encoding in Transformers and LSTMs. NL4AI@AI*IA (link).

The paper explores how permutations of the dialogue history (reversing it, removing the last turn) influence task success, which in this case is the ability of the model to correctly guess the target object in the image while playing the GuessWhat?! game. Both blind (language-only) and multimodal (language + vision) LSTM- and Transformer-based models were used.

Experiments

When looking at the dialogue data, we noticed that the “raw category” (e.g. ball) of the target object is included for each game. Later, when looking at the code, we also saw that this category is propagated to various places across the models. We found that interesting and wondered about it (our “huh?!” moment). At the end of the paper, the authors themselves mention that it would be worthwhile to explore how much the models depend on the raw category (authors’ own hints). This seemed worth exploring, and upon looking further we found additional research suggesting that the models are indeed overly reliant on it. Since we already had the infrastructure fully set up, we decided to make this our main experiment. Interestingly, the authors are currently working on publishing a paper with very similar experiments, so we got to compare notes.

Another experiment we picked was testing how shuffling the dialogue history impacts task success. This idea came from two sides. The first hint was that reversing the history didn’t obliterate the performance of RoBERTa, which showed just a slight dip in task success. From a linguistic point of view, and from what we know about how humans learn and process language, this didn’t make much sense. Since reversing the history still preserves some order, we wanted to shuffle it to see what happens. The second hint came from reading literature like this paper and similar ones from the wider field.
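To make the difference between the two perturbations concrete, here is a minimal sketch in Python. It is not the code from our project: the representation of a dialogue as a list of (question, answer) turns and the function names reverse_history and shuffle_history are assumptions made only for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical representation: one turn is a (question, answer) pair.
Turn = Tuple[str, str]


def reverse_history(turns: List[Turn]) -> List[Turn]:
    """Return the turns in reverse order; relative order is still recoverable."""
    return list(reversed(turns))


def shuffle_history(turns: List[Turn], seed: int) -> List[Turn]:
    """Return a randomly permuted copy of the turns without touching the input.

    A per-call seeded generator keeps the permutation reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(turns)
    rng.shuffle(shuffled)
    return shuffled


# Made-up example dialogue, only to show the calls.
dialogue = [
    ("is it a person?", "no"),
    ("is it a ball?", "yes"),
    ("is it on the left?", "yes"),
]

print(reverse_history(dialogue))
print(shuffle_history(dialogue, seed=0))
```

Seeding each shuffle is a design choice worth keeping in a real setup, because it lets every model be evaluated on exactly the same shuffled dialogues.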

We decided to name our paper “What do the GuessWhat?! Guesser models learn and what do they learn it from?” to be able to unite these two separate ideas into one report. You can read our final report about our experiments here.

Planning and perspective

When we are working with models that need to train for a long time, like LXMERT, the biggest bottleneck is all those hours and days of training on the limited computational resources that we have as students. If we want to obtain results for all the experiments (including tweaking) and still have sufficient time to properly reflect on and interpret them, we have to start early. If there is only a month ahead, that means (almost) immediately. For us, LXMERT ran for almost 2 full days for 30 epochs, so it was crucial to focus on making it work as early as possible and as the highest priority. This also meant deciding on an epoch cut-off, based both on the results and on a pragmatic, resource-based perspective (respecting time constraints).
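The kind of back-of-the-envelope calculation behind such a cut-off can be sketched in a few lines of Python. Only the rough figure of 48 hours for 30 epochs comes from our LXMERT run; the compute budget, the number of runs and the buffer in the example call are hypothetical numbers chosen for illustration, as are the function names.

```python
import math


def hours_per_epoch(total_hours: float, epochs: int) -> float:
    """Average training time per epoch, taken from one measured run."""
    return total_hours / epochs


def max_epochs_per_run(budget_hours: float, per_epoch_hours: float,
                       runs: int, buffer_fraction: float) -> int:
    """Largest epoch cut-off that lets all runs fit into the compute budget.

    A fraction of the budget is held back as a buffer for reruns and tweaking.
    """
    usable_hours = budget_hours * (1 - buffer_fraction)
    return math.floor(usable_hours / (runs * per_epoch_hours))


# Measured: roughly 48 hours for 30 epochs (an upper bound for "almost 2 days").
per_epoch = hours_per_epoch(48, 30)

# Hypothetical plan: 240 hours of compute, 6 training runs, 25% kept as buffer.
cutoff = max_epochs_per_run(budget_hours=240, per_epoch_hours=per_epoch,
                            runs=6, buffer_fraction=0.25)
print(f"{per_epoch:.1f} hours per epoch, cut-off at {cutoff} epochs per run")
```

With these made-up numbers the full 30 epochs would not fit for every run, which is exactly the situation where a cut-off has to be weighed against the results.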

Of course, we wanted to scratch a bunch of code and system itches that we had while reproducing, like modernising, optimising, and improving the usability of the code base. During the reproduction project we neatly documented all of these thoughts, so for the final project plan we just prioritised and pruned them a bit. You can find our list of small improvements on the 4th slide here as an example.

The image below is the big-picture final project roadmap we prepared in July, right at the end of the reproduction project. More on how to read it follows underneath.

Experiments vs code improvements

As mentioned above, the results and experiments have priority. The code improvements are nice to have, but not necessary for the project.

The “small improvements”, which were code improvements, are pictured in grey above all points in the roadmap, without a specific place. They appear right after the prep phase and span all the way close to the report phase, leaving a little gap until the end. For us, that meant that they were fully optional to begin with. Finishing all the remaining infrastructure issues and running the experiments took precedence. In a way, we treated the experiments and the code improvements as two separate tracks.

For university group projects we are always encouraged to work together on all the points and not silo people into one task. That is all well and good, and everyone should learn the whole picture of the project, but at some point multiple focus points appear that require people to split up. We found that as long as we keep communication frequent and open, and document and discuss all changes and progress, no one gets siloed even though they have their own temporary focus. This is very much how a team would work in industry, and it allows everyone to use their strengths to the benefit of the project (team competences).

Reflections, interpretations and report writing

There are many articles out there on this topic by people with way more experience than me, so I will mention just three important points.

Communicating with other researchers

For me, the continuous communication with the authors and other researchers was the highlight of the project. I enjoyed learning about their thought processes behind certain experiments, then taking those thoughts back and talking them over in our little group. The exchange of existing literature that each of us discovered was also a kind of engagement with other ways of thinking. I found everyone very approachable and nice, and I am so glad we started the conversations early. I definitely recommend reaching out with any questions and doubts about other people’s work.

Being flexible

There is so much we don’t know about a field when we first approach it. Depending on the subject, there can be an enormous amount of literature out there, some of it even conflicting. It is frustrating to have to change the initial assumptions and sometimes even end up with results that don’t say much, but that doesn’t cancel out all the work that has been done.

Writing early

Prioritising getting the results and allowing enough time for interpretations and reflections also gives us the chance to start writing early. Even if details in the paper change, the big insights that need proper discussion and careful justification will already be taking shape.

Conclusion

I hope you manage to have at least a bit of fun while making these projects despite the overall stressful timelines and pressure during the submission months 🤞. I’d love to hear from others about what kind of tips you have for this part of the project yourselves. There are two social media buttons below where you could find me.
