A Google search for ‘data science’ yields 74.1 million results as of today. Riding that buzz, many organizations believe that merely promoting ‘data science’ as an offering, or as a job requirement, will keep them relevant and market-savvy. Few things could be further from the truth.
Don’t get us wrong. Research suggests that the hype around data science is justified. In 2015, PwC predicted demand for 2.7 million data science and analytics roles by 2020. In 2017, Forbes quoted IBM’s prediction of a 28% increase in demand, also by 2020. Between 2013 and 2019, demand for data scientists soared by 344%, a number likely to rise as ML and AI capabilities grow. We simply believe the approach some organizations take to data science is headed in the wrong direction. Far from saying their data investment is misspent, we believe it needs to be redirected.
#1 Revisiting The Data Management Process: Remember Carl Sagan’s caution against bad questions? The same applies when planning outcomes from data management. The emphasis for most firms seems to be on the output—the end-product, the numbers, the charts, the results.
This approach ignores the age-old GIGO principle (garbage in, garbage out): poor input will produce poor output.
The authenticity and usefulness of any data science project depend on the data points captured at the beginning.
Although data gathering and cleansing are emphasized in both the CRISP-DM and ASUM-DM models, in practice, do we ask enough questions before embarking on a study? Is enough done to recruit the right participants? Are responses cleansed to minimize the incidence of user bias?
As an excellent article published in Parametric Press puts it: “…for every neural network that can defeat Jeopardy champions and outplay Go masters, there are other well-documented instances where these algorithms have produced highly disturbing results. Facial-analysis programs were found to have an error rate of 20 to 34 percent when trying to determine the gender of African-American women compared to an error rate of less than one percent for white men. ML algorithms used to predict which criminals are most likely to reoffend tended to incorrectly flag black defendants as being high risk at twice the rate of white defendants. A word embedding model used to help machines determine the meaning of words based on their similarity likewise associated men with being computer programmers and women with homemakers.”
Studies suggest that data wrangling is an underrepresented component in data science spend. One source suggests data wrangling consumes 50-80% of a data science team’s time, leaving very little for analysis or engineering. And the costs of error, as we see above, are great.
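To see why wrangling eats so much of a team’s time, consider even a minimal cleansing pass over raw survey responses. The sketch below is illustrative only; the field names, validation rules, and sample data are hypothetical assumptions, not drawn from any particular study:

```python
# Minimal, hypothetical cleansing pass over raw survey responses.
# Field names and validation rules are illustrative assumptions.

def clean_responses(raw):
    seen_ids = set()
    cleaned = []
    for row in raw:
        # Drop rows missing required fields (a non-sampling error source).
        if not row.get("respondent_id") or row.get("age") is None:
            continue
        # Drop duplicate submissions from the same respondent.
        if row["respondent_id"] in seen_ids:
            continue
        seen_ids.add(row["respondent_id"])
        # Reject out-of-range values rather than guessing a correction.
        if not (18 <= row["age"] <= 100):
            continue
        # Normalize free-text categorical answers for consistent grouping.
        row["region"] = row.get("region", "").strip().lower()
        cleaned.append(row)
    return cleaned

raw = [
    {"respondent_id": "a1", "age": 34, "region": " North "},
    {"respondent_id": "a1", "age": 34, "region": "north"},  # duplicate
    {"respondent_id": "a2", "age": 7,  "region": "south"},  # out of range
    {"respondent_id": None, "age": 51, "region": "east"},   # missing id
    {"respondent_id": "a3", "age": 45, "region": "East"},
]
print(len(clean_responses(raw)))  # → 2
```

Note that three of the five raw rows never reach analysis, and each rejection rule had to be designed and justified before a single chart could be drawn. Multiply that across dozens of fields and sources, and the 50-80% figure starts to look plausible.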
So, while the business may be clear on the problem to be solved, it starts a vicious circle of directionless data modelling by:
- Asking the wrong questions
- Making the wrong assumptions about both the respondents and the type of data required
- Hypothesizing incorrectly from inaccurate baselines
- Targeting the wrong respondents
- Collecting the wrong type of data, with the potential for sampling and non-sampling errors
- Modelling the data based on those assumptions
- Producing results that may be quantitatively sound, but paint the wrong picture of the scenario
- Interpreting misleading ‘insights’ based on the data
Business managers often underestimate just how misleading these insights can be. Although many biases feed into insights, survivorship bias is a big one, illustrated famously by Abraham Wald’s WWII analysis of damage patterns on returning aircraft. Not only does it disguise the real problem, but it also shifts focus (and budget) away from the issues that really matter.
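A toy simulation makes Wald’s point concrete. All the numbers below are illustrative assumptions: hits land uniformly across three areas, but engine hits are far more often fatal, so engines look deceptively undamaged in the sample of planes that made it home:

```python
import random

random.seed(0)

# Toy survivorship-bias simulation; all probabilities are illustrative.
AREAS = ["fuselage", "wings", "engine"]
FATALITY = {"fuselage": 0.1, "wings": 0.1, "engine": 0.8}

all_hits = {a: 0 for a in AREAS}       # what actually happened
returned_hits = {a: 0 for a in AREAS}  # what the analyst gets to see

for _ in range(10_000):
    area = random.choice(AREAS)
    all_hits[area] += 1
    if random.random() > FATALITY[area]:  # plane survives and returns
        returned_hits[area] += 1

# Among returning planes, engine damage looks rare -- not because engines
# are rarely hit, but because engine hits rarely come home.
print(all_hits)
print(returned_hits)
```

An analyst who only sees `returned_hits` would conclude the engine barely needs protection, when it is precisely the area where hits are fatal. The same trap awaits any study built only on the respondents, customers, or transactions that “came back.”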
#2 Overlapping Roles, Underutilized Capabilities: While observing some technology exporters at a conference, I was surprised by how often their project manager said “data scientist” or “BI analyst” when he clearly meant “market researcher.”
He wasn’t alone. Even organizations tend to use “data engineer”, “data scientist” and “data analyst” interchangeably, although the responsibilities of each are quite different, and demand different expertise.
Data engineers are concerned with the ‘behind-the-scenes’ data game—the part where development and DevOps come into action. They’re the ones who put the code into play. Data scientists deal with the quantitative aspect of data science, the “Maths/Stats” bit. Data analysts put it all together, deriving meaning and telling stories with data.
When it comes to building a data team, most investments are directed downstream—i.e. towards the data scientists and the data analysts. An analogy might help explain why this is an issue: Imagine you’re trying to improve the quality of meals from your existing kitchen. You keep buying more and more exotic ingredients, while doing little to maintain or enhance the stove or oven. Data engineering is the stove/oven in question. Sure, data scientists and analysts need the inflow of working capital. But data engineering capabilities? Those are fixed assets and need to be treated that way.
While building a data science team, many businesses also fail to foresee emerging needs and roles. They fail to account for how automation will eliminate the need for some resources while accelerating demand for others. For instance, as early as 2015, automated prediction was a hot topic in data science circles. It hasn’t lost its momentum, but the forecast that automated prediction models would free up “80%” of data scientists’ time is yet to be realized.
Organizations have also been decrying the extreme compartmentalization within their data science/business intelligence functions—and the marked lack of softer strengths, like empathy and imagination, especially amongst the quants. This year, HBR published an article about how compartmentalization within data science is actually bad for business: it increases coordination costs, wait times and more.
#3 The Most Significant Data Science Investments Are Hardest To Measure: For all the hype surrounding data breaches and system vulnerabilities, it’s surprising how little organizations do proactively to build robust, secure and fluid data management cultures within themselves.
According to Security Intelligence, the global average cost of a data breach is now $3.92 million. Yet only 39% of organizations express any confidence in their control mechanisms for handling data breach incidents. Another finding, published in the same article, is that a survey of businesses with 500 or fewer employees found the majority have no dedicated cybersecurity staff or an incident response plan. Only 7 percent of SMB CEOs say a cyberattack is “very likely,” despite the fact that 67 percent of smaller organizations were targeted over the past year.
Building that culture means more than making the hard “infrastructural” investments in storage and processing, which are certainly important. Equally important are the “soft” aspects of data management—that is, recognizing the sanctity of data, especially personal data in the wake of the GDPR. It means creating structural resilience via DPOs, control mechanisms and even data erasure policies.
Is your organization doing enough to train people on securing their data assets? What control does the organization exercise over access via portable devices? Is there any quality benchmark associated with compliance?
Also, and harder still to develop and manage, are data governance structures. If restrictions on the movement of data are treated like a bottleneck instead of a responsibility, data governance will be a failed practice. Data collection and storage are both going to get harder in the wake of internet privacy concerns and new data protection regulations around the world. Have organizations—and by that I mean retailers, FMCGs and other B2C practitioners—done enough to fortify their BI structures against this?
Building a culture that truly treats data like an asset is a top-down job. Disciplining infringement and making an example of both violators and champions gets a lot of space in management rhetoric, but not enough in practice, mainly because data-friendly systems are viewed as a hindrance rather than an intrinsic part of gleaning insights and developing data-led business direction.
Data science jobs will grow. Its capabilities will grow, and naturally, the spend companies direct towards their data programs will grow too. To ensure the value derived justifies the spend, decision makers need to be tolerant of failure and false starts. More than that, they need to be open-minded about the road ahead.