Delivering large data science projects remotely: lessons learnt
Doing data science while working from home is challenging: here is what we have learnt...
2020 will be remembered by all of us, mostly for the Coronavirus pandemic and all the consequences it brought along: a national lockdown, an economic crisis, social distancing, and challenges to the way we do, well, everything. At Artesia, we have been fortunate enough for our business to remain largely unaffected by the economic crisis, but we have had to move to a complete working-from-home approach during lockdown(s), and the majority of our staff have been working remotely since March and will continue to do so for the foreseeable future. This coincided with some major projects: in particular, Artesia began working on the demand forecasts for the Water Resource Management Plans 2024 (WRMP24) for 11 water companies, covering both Household (HH) and Non-Household (NHH) water demand. These would be challenging projects under normal circumstances, but coordinating large data science projects while working remotely brought additional challenges for our Data Science Team, which we have overcome by adopting new technologies and new working paradigms. Here are some of the lessons we have learnt from working on large data science projects remotely:
Version control refers to the practice of keeping a master version of a project’s code in a remote repository, which each developer synchronises with a local copy on their own computer. This is possible thanks to version control management software. Artesia has worked with GitLab for a long time, but until now not all staff had used it to its full potential. The advantage of version control is two-fold: on the one hand, every previous version of a piece of code can be recovered, allowing the team to experiment with different possibilities and roll back without losing a single line of code, and without needing to save multiple files for the same piece of code; on the other hand, it allows multiple developers to work on the same project, and potentially on the same files, simultaneously, merging the different versions and resolving any conflicts in a smart and efficient way. This has always been an important tool for our Data Science Team, but resolving coding problems using GitLab is now a fundamental part of our everyday work at Artesia.
Functional programming has become a buzzword when it comes to big data, and it is often explained with obscure technical jargon, but it is actually a relatively simple concept (although the implementation can be far from simple). In traditional programming, code is written to accomplish a specific task for the given data. If the data and/or the problem changes, a similar but separate piece of code is written. When there are many datasets to test and many similar problems to address, this becomes confusing and redundant. Functional programming aims to write a single piece of code that takes the data and the instructions as inputs and is “smart” enough to work out what needs to be done in each specific case to return the desired output. It is a black-box approach, where the black box can be as sophisticated as necessary. To explain it with a culinary metaphor: traditional programming is like baking different cakes separately, one by one; functional programming is like building a cooking robot that can bake the right type of cake if you give it the right ingredients and settings. It takes longer to build the robot, but it is much more efficient and automated. The reason this is becoming a key feature of programming in the big data era is that it allows for parallelisation: if you give a “robot” to each of your processing cores, or to multiple computers, you can “bake multiple cakes” at the same time. This way, even large datasets and large numbers of problems can be processed and analysed much faster and more efficiently. In a remote-working setting, the same “robot” can serve many people in different places, who can test different data and problems at the same time without the risk of doing things slightly differently (as all the settings and data can be centralised).
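To make the “cooking robot” metaphor concrete, here is a minimal sketch in Python (all names and settings below are illustrative, not our actual project code): a single generic function takes the data and the instructions, and a pool of worker processes applies it to many datasets in parallel.

```python
# Minimal sketch of the "cooking robot" idea: one generic function that
# takes the data and the instructions, applied in parallel over many tasks.
# The operation names and datasets here are purely illustrative.
from multiprocessing import Pool
from statistics import mean

def process(task):
    """Generic worker: applies the requested operation to the given data."""
    data, instructions = task
    operations = {
        "total": sum,      # aggregate demand
        "average": mean,   # average demand
        "peak": max,       # peak demand
    }
    return operations[instructions["operation"]](data)

if __name__ == "__main__":
    # Each "cake": a dataset paired with the settings describing what to bake.
    tasks = [
        ([10, 12, 14], {"operation": "average"}),
        ([10, 12, 14], {"operation": "peak"}),
        ([3, 4, 5], {"operation": "total"}),
    ]
    with Pool() as pool:  # one "robot" per available processing core
        results = pool.map(process, tasks)
    print(results)  # [12, 14, 12]
```

Because `process` is self-contained (its output depends only on its inputs), the same function can be handed to any number of cores, machines, or colleagues, and everyone gets identical results from identical settings.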
Planning is not a data-science-specific task, but it is an essential one. The combination of working remotely and the size of the projects we have recently been working on has heightened the need to plan every detail meticulously. Every single file that is produced needs an agreed, standard name; every spreadsheet column needs the correct header; every file containing code needs to follow the correct structure, so that when files are passed from person to person the process works smoothly. This is partly a consequence of functional programming as well. In order to achieve the required level of planning, at Artesia we have learnt that initial planning needs to be as accurate as possible, but no one can plan every single detail of a project from the beginning. Therefore, planning has to be treated as an ongoing activity, with the plan revisited and refined as the project progresses.
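Conventions like these are easiest to enforce automatically. As a sketch of the idea, a short script can check that every output file follows the agreed naming pattern before it is passed on (the pattern below is hypothetical, not our actual WRMP24 standard):

```python
# Illustrative only: check that output files follow an agreed naming
# convention, e.g. "<company>_<hh|nhh>_<deliverable>_v<version>.csv".
# The pattern is a made-up example, not an actual project standard.
import re

FILENAME_PATTERN = re.compile(r"^[a-z]+_(hh|nhh)_[a-z0-9-]+_v\d+\.csv$")

def check_filenames(filenames):
    """Return the filenames that break the agreed convention."""
    return [name for name in filenames if not FILENAME_PATTERN.match(name)]

bad = check_filenames([
    "acme_hh_demand-forecast_v2.csv",  # follows the convention
    "Forecast FINAL (2).csv",          # does not
])
print(bad)  # ['Forecast FINAL (2).csv']
```

Running a check like this before handing files over catches naming drift early, so the downstream person (or function) never has to guess what a file contains.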
Operating in a large and dynamic data science team during these last few months has not been without its challenges. But by accepting that some of these changes may permanently alter our working paradigms, we can use these lessons learnt to make subtle changes that ensure the same level of collaboration, rigour and communication that we’ve become accustomed to at Artesia, even whilst working remotely. We don’t yet have all of the answers, and it’s likely that lockdown part 2 will deliver more lessons. However, our openness and willingness to adapt should ensure that we come out the other side as the same successful and dynamic team, eager to tackle any data science problem with enthusiasm.