Johan Philips holds MA degrees in Informatics (Computer Science) and Artificial Intelligence from KU Leuven. In 2012 he obtained a PhD in robotics and has since worked as a postdoctoral researcher at the KU Leuven robotics lab. Recently, Johan obtained a permanent position within the Research Management staff as a Research Expert on Reproducible and Trustworthy Science, with a focus on automation and technology to improve trustworthiness.
What is research data management and why should people care?
To give you my brief definition: research data management is about how to obtain, keep and preserve your data in an orderly fashion. In our lab, the interest in research data management infrastructures and best practices has grown bottom-up. We are, for instance, involved in many research projects that run on a European scale and deal with applications of robotics in a range of different contexts. This means we have to communicate with many different partners and stakeholders, and the management of the data that are generated or that need to be analyzed becomes increasingly complex.
This trend can of course also be noticed outside of our own lab context. Research in general is taking on increasingly large proportions: research teams are growing, people tend to generate more data, and experiments are being conducted on a much larger scale. So thinking about how to properly manage the data that are at the base of these projects is an important future challenge.
Can you give some examples of projects that you have undertaken to facilitate the management of research data?
Because of my computer science background, my role in many projects has been to design, set up and manage the required software. This for instance included implementing and calibrating the tools required to store the data properly, and creating databases to store metadata.
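One way such a metadata store can be set up is with a lightweight relational database. The sketch below uses SQLite from Python's standard library; the schema and field names are my own illustration, not the lab's actual design:

```python
import sqlite3

# Illustrative schema only: the interview does not describe the real
# metadata model, so table and field names here are assumptions.
conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE experiment_metadata (
        id          INTEGER PRIMARY KEY,
        experiment  TEXT NOT NULL,   -- experiment identifier
        researcher  TEXT NOT NULL,   -- who ran it
        recorded_at TEXT NOT NULL,   -- ISO 8601 timestamp
        data_path   TEXT NOT NULL,   -- where the raw data live
        notes       TEXT
    )
""")

conn.execute(
    "INSERT INTO experiment_metadata "
    "(experiment, researcher, recorded_at, data_path, notes) "
    "VALUES (?, ?, ?, ?, ?)",
    ("grasp-trial-01", "jdoe", "2017-09-12T10:30:00",
     "/data/grasp/01", "baseline run"),
)
conn.commit()

# Later, anyone in the lab can find the data behind an experiment.
rows = conn.execute(
    "SELECT experiment, data_path FROM experiment_metadata "
    "WHERE researcher = ?",
    ("jdoe",),
).fetchall()
print(rows)  # [('grasp-trial-01', '/data/grasp/01')]
```

Even a minimal store like this keeps the link between raw data files and the experiments that produced them, which is exactly what gets lost when metadata live only in a researcher's head.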
Within our department, I have recently rolled out a research data and research software management infrastructure, MechCloud. This platform allows researchers to safely store their research data and professionalize the management of their research software. To achieve this, I looked at the evolution of different techniques that are also used in professional IT companies, such as DevOps – a methodology that promotes rapid prototyping by connecting Development with Operations.
Together with a small team of about 4 people, including the local IT office, I designed and set up MechCloud, and our first release was in September 2017. Since then, interest has grown fast and currently it is being used by another ten to twelve research groups.
When it comes to educating other researchers in the field of RDM, I do not believe that every scholar should have all of the knowledge to set up their own data management infrastructures. Rather, we choose to give courses on how researchers can use the software that we provide. This to me is crucial to the role of so-called data stewards or research software engineers: they should be people who have a background in both the research as well as the technological side of things. This allows them to quickly help researchers. If no such help is available, it is still possible for scholars to build their own infrastructures, but these tend to stop working as soon as they leave the institution or finish the project.
Can you explain some of the ties between research data management and the reproducibility crisis in science? What solutions might research data management bring to this situation?
I would describe this relationship as follows: in order to make research reproducible, you need proper research data management. In the field of engineering for instance, research frequently follows the pattern of formulating a hypothesis, doing experiments that produce data, and eventually analyzing those data.
Our lab for instance does a lot of experiments. If the data that come out of these experiments are not well managed, you produce results that cannot be generalized. Another problem that might arise in those cases is that research continuity is at stake. If a researcher leaves the lab, he or she might leave behind some results, but more is required to ensure that the research can be continued afterwards.
I also want to point out that reproducibility is about much more than data. Ensuring that your research is reproducible is also intrinsically tied to many other methodological aspects. Yet data are a key part of it, and research methodology and data management mutually influence each other.
Another interesting connection here is the relationship with Open Science. In my opinion, research data management is an important pillar of open science in the sense that there are some data that you might also want to share openly. However, we need to be clear about how we structure this relationship. A lot of different topics and themes like ‘openness’ and ‘RDM’ are being thrown around in this space, and while there are interesting interactions between them, they often cannot be equated with each other.
Based on the ongoing rationalization and automation of research that can be associated with research data management, how far are we removed from seeing the first AI lab assistants?
In the research context that I am familiar with, automation is a big part of the process. We see automation as a toolkit that we provide to researchers. To implement this service, we use what is called a DevOps approach. This means that we maintain a close link between development cycles and operations cycles. DevOps as such allows us to do rapid development and prototyping in research.
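The development–operations link can be sketched as a toy pipeline: every change passes through the same automated stages, and only changes that survive the development checks reach operations. The stage names and checks below are illustrative assumptions, not MechCloud's actual setup or a real CI tool:

```python
# Toy model of a DevOps-style pipeline. Stages and checks are
# illustrative assumptions for this sketch.

def stage_test(commit):
    """Development side: automated checks gate every change."""
    return commit.get("tests_pass", False)

def stage_deploy(commit, environment):
    """Operations side: only commits that passed the tests are released."""
    environment.append(commit["version"])

def pipeline(commit, environment):
    if stage_test(commit):
        stage_deploy(commit, environment)
    # failing commits never reach the shared platform

env = []  # what is currently deployed
pipeline({"version": "v1", "tests_pass": True}, env)
pipeline({"version": "v2", "tests_pass": False}, env)  # broken build
pipeline({"version": "v3", "tests_pass": True}, env)
print(env)  # only the passing builds were deployed: ['v1', 'v3']
```

The point of the gating is that development and operations stay in lockstep: a prototype can change rapidly, yet the shared platform only ever sees versions that passed the automated checks.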
The same methodology can be used to support research processes if it is matched with the research life cycle. This is where my software background fits in well. I take a look at professional IT solutions and see how these can be integrated into the research process.
Based on these developments, I would say that digital research assistants are a possibility. They could definitely help make the toolkit that we have in mind more advanced.
Apart from data, there is also software. How can we help researchers manage their digital tools?
In terms of management I would put code next to data. What I see is that the sciences are taking a computational turn. In the not-so-distant past a lot of scientific work was conducted on paper, think for instance of ‘manually’ solving equations. As more and more scientists resort to computers, broader knowledge is required of the software and the IT that is used to do computations. Many analyses are currently done using scripting, but not all researchers have been trained to do this. So we are thinking about how we can make researchers’ lives easier. This mostly means ensuring that they can focus on their research questions rather than having to develop new software to conduct their analyses.
In an ideal case, a researcher would thus have a background in their own research domain, as well as some computational or software skills. Another increasingly important factor in this domain is statistics. Some people do statistics based around p-values, but they do not really know what those values actually mean in their case. To me, this seems like a growing issue and it could be cited as a negative example of the reliance on technology. Today’s technology makes it so easy to generate p-values or ANOVA tests: you put in your data and a script spits out a number. But what do those numbers tell us if we do not fundamentally understand their meaning?
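The point about understanding rather than merely generating a p-value can be made concrete with a permutation test, where the p-value is literally a proportion you can count. A minimal sketch in plain Python; the measurements are made up for illustration:

```python
import random
import statistics

# Hypothetical measurements from two experimental conditions
# (all numbers are invented for this illustration).
group_a = [5.1, 4.9, 5.4, 5.0, 5.3, 5.2]
group_b = [4.6, 4.8, 4.5, 4.9, 4.7, 4.4]

observed = statistics.mean(group_a) - statistics.mean(group_b)

# A permutation test makes the meaning of a p-value concrete: if the
# group labels were arbitrary (the null hypothesis), how often would a
# difference at least as large as the observed one arise by chance?
# That proportion IS the p-value.
random.seed(42)  # fixed seed for reproducible runs
pooled = group_a + group_b
n_extreme = 0
n_perms = 10_000
for _ in range(n_perms):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:6]) - statistics.mean(pooled[6:])
    # round() guards against floating-point noise at the boundary
    if round(abs(diff), 9) >= round(abs(observed), 9):
        n_extreme += 1

p_value = n_extreme / n_perms
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

Because the p-value here is just the fraction of label shufflings that produce a difference as large as the observed one, its meaning stays visible instead of being hidden inside a statistics package.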
What does the future hold for your research?
One important part of my research will consist of what could be called ‘meta-science’. This means doing research into what the proper methodologies are for this digital age. We are confronted with digitization and emerging technologies across sectors and we need to figure out how methods can be adjusted to match this day and age. This also means that we have to answer many research questions related to the life cycle of data. What information do we need, what are the different steps that researchers take in their processes and how can we document this? Then, once the results have been obtained, how can these be reproduced?
When it comes to research data management in general, I also believe that academia is trailing behind. Think for instance of the fact that European laws concerning data protection have already been in place for a year, while the way some researchers manage their data could still be vastly improved. Many scholars are struggling with this.
Research data management can benefit researchers, but might it also serve the general public?
Research data management can help build transparency in research. The public puts a lot of money into research, but sometimes it can be difficult to see the results of this investment. A clearer view is needed of how the money is spent. This could for instance be achieved by pointing out the ethical code that every researcher has to follow, namely that published results should be reproducible.
More concretely, in my domain, it would be possible to instrument the entire research process. This way, people could get a better idea of how results were obtained from the data. Another point that needs to be addressed is people’s tendency to keep results hidden ahead of publication in order to beat their academic competition. I do believe that we should keep making a clear distinction here between how the private sector operates and how academia works. Academia should strive to be as open as possible, and as closed as necessary.
If you had to recommend one book to the readers, which book would that be?
I would recommend The Inmates Are Running the Asylum by Alan Cooper. This book investigates why so many technological innovations are so user-unfriendly. One of the ideas it explores is that the way engineers design user interfaces can be completely different from what a user would intuitively expect. The integration of computers into appliances has been a key factor in this process. Take the example of a scale. An analog scale is very easy to use: you simply step on it and read your weight from a dial. A digital scale is much more difficult to use, as you have to press a series of buttons or adjust a number of settings before you can actually use it. Another example is a digital alarm clock: pressing a button has a different function than pressing and holding it, or pressing two buttons simultaneously. Cooper calls this mode confusion between the user and the appliance’s internal state “cognitive friction”. I very much like that term and frequently use it at work whenever a demo or setup goes wrong because the user misinterprets the system. I think this illustration of how appliances have become user-unfriendly is also relevant for the current research context, which is growing increasingly complex. Some of Cooper’s solutions, such as using personas, could likewise benefit system development in research.