Delivering large data science projects remotely: lessons learnt

2020 will be remembered by all of us, mostly for the Coronavirus pandemic and the consequences it brought with it: a national lockdown, an economic crisis, social distancing, and challenges to the way we do, well, everything. At Artesia, we have been fortunate that our business has remained largely unaffected by the economic crisis, but we have had to move to a complete working-from-home approach during the lockdown(s), and the majority of our staff have been working remotely since March and will continue to do so for the foreseeable future. This coincided with some major projects: in particular, Artesia began working on the demand forecasts for the Water Resource Management Plans 2024 (WRMP24) for 11 water companies, covering both Household (HH) and Non-Household (NHH) water demand. These would be challenging projects under normal circumstances, but coordinating large data science projects while working remotely brought additional challenges for our Data Science Team, which we have overcome by adopting new technologies and new ways of working. Here are some of the lessons we have learnt from working on large data science projects remotely:

Version control

Version control is the practice of keeping the master version of a project's code in a remote repository that synchronises with the local copies on each developer's computer, made possible by version control software. Artesia has worked with GitLab for a long time, but not all staff had used it to its full potential until now. The advantages of version control are two-fold. On the one hand, every version of a piece of code can be recovered, allowing the team to experiment with different approaches and roll back without losing a single line of code, and without saving multiple files for the same piece of code. On the other hand, it allows multiple developers to work on the same project, and potentially on the same files, simultaneously, integrating their different versions and resolving any conflicts in a smart and efficient way. This has always been an important tool for our Data Science Team, but resolving coding problems through GitLab is now a fundamental part of our everyday work at Artesia.

Functional programming

Functional programming has become a buzzword when it comes to big data, and it is often explained with obscure technical jargon, but it is actually a relatively simple concept (although the implementation can be far from simple). In traditional programming, code is written to accomplish a specific task for given data; if the data or the problem changes, a similar but separate piece of code is written. When there are many datasets to test and many similar problems to address, this becomes confusing and redundant. Functional programming aims to write a single piece of code that takes the data and the instructions as input and is “smart” enough to work out what needs to be done in each specific case to return the desired output. It is a black-box approach, where the black box can be as sophisticated as necessary. To use a culinary metaphor: traditional programming is like baking different cakes separately, one by one; functional programming is like building a cooking robot that can bake the right type of cake if you give it the right ingredients and settings. It takes longer to build the robot, but it is much more efficient and automated. The reason this is becoming a key feature of programming in the big data era is that it allows for parallelisation: if you give a “robot” to each of your processor cores, or to multiple computers, you can “bake multiple cakes” at the same time. This way, even large datasets and large numbers of problems can be processed and analysed much faster and more efficiently. In a remote-working setting, it means the same “robot” can be shared with many people in different places, who can test different data and problems at the same time without the risk of doing things slightly differently (as all the settings and data can be centralised).
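The “robot” idea above can be sketched in a few lines of Python. This is a minimal illustration, not Artesia's actual code: one pure function (the robot) takes a dataset and a dictionary of instructions, and the same function is then mapped over many jobs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# One "robot": a pure function that takes the data and the instructions
# and returns the finished product. Because it has no side effects, many
# copies can run at once without interfering with each other.
def summarise(job):
    data, settings = job
    values = [v * settings.get("scale", 1) for v in data]
    if settings.get("stat") == "mean":
        return sum(values) / len(values)
    return sum(values)

# Many "cakes": different datasets and different instructions,
# all handled by the same function.
jobs = [
    ([1, 2, 3], {"stat": "sum"}),
    ([4, 5, 6], {"stat": "mean"}),
    ([7, 8, 9], {"stat": "sum", "scale": 2}),
]

# Hand a "robot" to each worker and bake the cakes in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(summarise, jobs))
# results == [6, 5.0, 48]
```

For CPU-heavy analysis a process pool (one worker per core) would replace the thread pool, but the pattern is the same: because every job carries its own data and settings, the central function behaves identically for everyone, wherever they are working.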

Planning

Planning is not a data-science-specific task, but it is an essential one. The combination of working remotely and the size of the projects we have recently been working on has exacerbated the need to plan every detail meticulously. Every file that is produced needs a standard, agreed name; every spreadsheet column needs the correct header; every file containing code needs to follow the correct structure, so that when files are passed from person to person the process works smoothly. This is also a consequence of functional programming: centralised code can only handle inputs that follow the agreed conventions. At Artesia we have learnt that initial planning needs to be as accurate as possible, but that no one can plan every single detail of a project from the beginning. Therefore:

  • A clear hierarchy of decision making is necessary: when a new aspect of the process needs to be decided, who will take responsibility for deciding the best way to do it? This can save numerous meeting-hours spent agreeing whether spreadsheet headers should be upper or lower case.
  • Sometimes initiative is sufficient: some decisions have no right or wrong answer, and if one person proposes a new standard, the others will follow.
  • It is important to keep centralised records of all decisions and standards, so that everyone can consult them when in doubt.
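One way to keep such standards centralised is to record them in code, so that files are checked against the agreed conventions rather than against anyone's memory. The sketch below is purely hypothetical (the file-name pattern and headers are invented for illustration, not Artesia's actual standards):

```python
import re

# Hypothetical centralised standards: the agreed file-name pattern and
# required spreadsheet headers are written down once, in one place.
FILENAME_PATTERN = re.compile(r"^wrmp24_(hh|nhh)_[a-z]+_v\d+\.csv$")
REQUIRED_HEADERS = ["company", "year", "demand_mld"]

def check_file(name, headers):
    """Return a list of problems; an empty list means the file conforms."""
    problems = []
    if not FILENAME_PATTERN.match(name):
        problems.append(f"bad file name: {name}")
    missing = [h for h in REQUIRED_HEADERS if h not in headers]
    if missing:
        problems.append(f"missing headers: {missing}")
    return problems

check_file("wrmp24_hh_forecast_v2.csv", ["company", "year", "demand_mld"])  # []
check_file("Forecast FINAL.csv", ["Company"])  # reports both problems
```

A check like this can run automatically whenever a file is handed from one person to the next, turning the “centralised record of standards” from a document people must remember into a test the files must pass.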

What now?

Operating in a large and dynamic data science team over these last few months has not been without its challenges. But by accepting that some of these changes may become permanent alterations to the way we work, we can use these lessons learnt to ensure the same level of collaboration, rigour and communication that we have become accustomed to at Artesia, even while working remotely. We don't yet have all the answers, and it is likely that lockdown part 2 will bring more lessons. However, our openness and willingness to adapt should ensure that we are in the best position to come out the other side as the same successful and dynamic team, eager to tackle any data science problem with enthusiasm.
