Context is king

It probably comes as no surprise that data scientists, mathematicians and statisticians, like people in many other careers, face numerous day-to-day challenges: lack of data, difficulties with version control, poor data quality and model overfitting, for instance. All of these can affect the quality of the results, and they are the issues most people associate with the work of a “data scientist” or “statistician”.

However, one could argue that there is a more pertinent challenge, one that we face almost every day and certainly on every project, and that even good data scientists may not appreciate: context is everything.

To explain, I am going to use a well-known example from World War II, taken from the book “How Not to Be Wrong” by Jordan Ellenberg, along with extracts from “Black Box Thinking” by Matthew Syed.

In 1943 the US military were losing too many aircraft to enemy fighters. The planes were often subjected to enemy fire, and the military knew that armour was the answer. The problem was that too much armour makes a plane heavier, and heavier planes are less manoeuvrable and use more fuel. Equally, too little armour leaves the planes vulnerable to enemy hits. Somewhere, there is an optimum.

A Hungarian mathematician by the name of Abraham Wald was tasked with determining the precise level and position of the armour required.

In order to help Abraham, the military provided him with data they thought might be useful. Often, when the planes returned, they were covered in bullet holes, so they provided data on the number of bullet holes per section of plane, indicating the areas that were most commonly hit by enemy fire.

Immediately, the military sought to strengthen the most commonly damaged parts of the planes to reduce the number that were shot down. This meant that the additional armour would mainly cover the fuselage.

Abraham instantly disagreed. “The armour shouldn’t go where the bullet holes are, it should go where the bullet holes aren’t. Around the engines and cockpit”.

He had realised that the military were only considering data from the planes that actually returned. The reason planes were coming back with fewer hits to the engine was that the planes hit in the engine weren’t coming back! The bullet holes on the returning planes revealed the parts of the aircraft that needed the least armour, because planes hit there could still make it home.

Jordan Ellenberg goes further to explain that “if you were to visit the recovery room at a battlefield hospital, you’d see many more people with bullet holes in their limbs than people with bullet holes in their chests. That’s not because people don’t get shot in the chest; it’s because the people who get shot in the chest don’t typically make the recovery room”.

The absence of information is often not information of absence. The key is sometimes in the missing data. These are examples of what is called ‘survivorship bias’.
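The effect can be reproduced with a small simulation. The following is a minimal sketch, not a reconstruction of Wald's analysis: the section names, the loss probabilities and the assumption that every section is equally likely to be hit are all invented for illustration.

```python
import random

# Hypothetical chance that a single hit to each section brings the plane down.
# These numbers are invented for illustration; they are not historical data.
LOSS_PROB_PER_HIT = {"fuselage": 0.05, "wings": 0.05, "engine": 0.60, "cockpit": 0.50}


def simulate(n_planes=10_000, hits_per_plane=4, seed=0):
    """Return (hits on all planes, hits on returned planes, number returned)."""
    rng = random.Random(seed)
    sections = list(LOSS_PROB_PER_HIT)
    all_hits = dict.fromkeys(sections, 0)
    returned_hits = dict.fromkeys(sections, 0)
    n_returned = 0
    for _ in range(n_planes):
        # Assumption: every section is equally likely to be hit.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() >= LOSS_PROB_PER_HIT[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                returned_hits[s] += 1
        if survived:
            n_returned += 1
    return all_hits, returned_hits, n_returned


def shares(counts):
    """Convert hit counts into each section's share of the total."""
    total = sum(counts.values())
    return {section: count / total for section, count in counts.items()}


all_hits, returned_hits, n_returned = simulate()
print("planes returned:", n_returned)
print("share of hits, all planes:     ", shares(all_hits))
print("share of hits, returned planes:", shares(returned_hits))
```

Across all planes the hits are spread roughly evenly, but among the planes that return the engine and cockpit appear to be hit far less often. Anyone who only sees the returning planes would conclude, wrongly, that those are the safest places to leave unprotected.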

A good data scientist will handle data carefully. They will look for relationships and correlations and apply machine learning models in their sleep. A great data scientist will question the data. They will separate fact from assumption, because they know how much this affects the outcome. They will strive to understand all of the potential variables and use domain knowledge and expert opinion to apply logic and reasoning to the result. A great data scientist knows not only that the model is robust, but also that it makes sense in the context of the original problem.

To illustrate, our data scientists spend a lot of time analysing water meter data, and it is not uncommon to have to look at distributions of flow to inform our analyses. Typically, no flow is recorded below a certain threshold. A good data scientist will look at the distribution and spend time checking units, outliers and erroneous data before deciding whether the data is valid, and may conclude that the threshold is a genuine feature of the data. A great data scientist will spend extra time understanding why it is there. Is it that households have a near-zero chance of using water at such low flow rates? It is more likely that the meter has a physical limitation and cannot detect flows this low. Both scenarios explain the data, but only the latter makes sense in the context of how the data was collected.
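The meter example can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of any real meter or dataset: the starting flow of 10 litres per hour, the exponential distribution of true flows and all function names are hypothetical.

```python
import random

# Hypothetical starting flow, in litres per hour, below which the meter
# registers nothing. The value is invented for illustration.
START_FLOW = 10.0


def simulate_true_flows(n=10_000, mean_flow=50.0, seed=1):
    """Draw hypothetical true flow rates (L/h) from an exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_flow) for _ in range(n)]


def meter_reading(flow, start_flow=START_FLOW):
    """Record the flow only if it is at or above the meter's starting flow."""
    return flow if flow >= start_flow else 0.0


true_flows = simulate_true_flows()
recorded = [meter_reading(f) for f in true_flows]
nonzero = [r for r in recorded if r > 0]

# Share of true flows that happened but never appear in the recorded data.
missed = sum(1 for f in true_flows if 0 < f < START_FLOW) / len(true_flows)
print("smallest true flow:    ", min(true_flows))
print("smallest recorded flow:", min(nonzero))
print("share of true flows the meter missed:", missed)
```

The recorded data shows a clean gap below the starting flow even though low flows occur in the simulated households. The gap is produced by the instrument, which is why knowing how the meter works matters as much as the distribution itself.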

At Artesia, our data scientists are great data scientists because of the emphasis we place on the context of the data in delivering the right solution. We achieve this through collaboration with our industry experts, who are equally important to our success. By looking beyond the data, challenging our assumptions and separating fact from assumption, we ensure that our solutions are not only robust, accurate and technically sound (all the things one expects from a data scientist) but also logical, and that they draw on every available source of information, including sources that may not initially seem relevant. Sometimes it is the missing information that makes all the difference in understanding what is truly going on.
