A Critique of the Data Science Process

Oct 21, 2020

An Iceberg Representing the Data Science Process

If you take a look at the iceberg that the Titanic hit, you will find that it doesn’t really seem that big. It stood only modestly high above the water, while the Titanic itself stood at 53.3 meters. That iceberg is what I compare the data science process to whenever I am faced with the recurring question, “What do data scientists actually do?”

Many people claim that data science is relatively easy or that it doesn’t require much work, and to be honest, if I hadn’t had such an up-close-and-personal look at it, I would have been inclined to agree. This is mostly because people only see the results that data science gives.

To the untrained eye, it seems like data scientists don’t really do much except pull out their computer, manipulate data, press a magic button (mostly just the enter key), and produce detailed predictions. It seems like magic, but from personal experience I can tell you that this magic has a lot going on in the background: data cleaning and formatting, exploratory data analysis, and feature engineering, just to mention a few.

This ‘magic’ is the result of a deep understanding of a blend of different sciences. You know how they say “jack of all trades, master of none”? Normally, a master of one is better than a jack of all trades, but here these ‘jacks’ hold the ability to combine others’ mastery to improve functionality.

Here, we’ll look in detail at the data science process and the critiques directed toward it to explain why I compare the process to an iceberg.

What’s Behind the Curtain of Data Science?

Data science is mostly powered by many different sciences, but the most important bits are the four major components:

1. Data Strategy

2. Data Engineering

3. Data Analysis/Modelling

4. Data Visualization/Operationalization

These mostly involve the use of statistics (regression analysis and classification), creation of new features (feature engineering), programming (mostly Python and R, but may include other languages as well), data visualization using Tableau, and building machine learning algorithms[h1].

When applied systematically, they yield predictive results, such as natural disaster prediction and relief models[h2], driver drowsiness detection, web traffic time forecasting, fake news detection, and other exciting projects too numerous to mention.

The Data Science Process

What Data Scientists Do & What Is Expected From Us

One of the most common complaints that I have heard (and experienced) as a data scientist is people not really understanding what we do and what they can use us for. For example, on your first day at work, you might be given the task of optimizing sales funnels or conversions.

This isn’t part of what you do, right? What do you analyze, where do you begin, and what sort of timeframe are you working with? Speaking as a data analyst, I suggest you buck up and face this ambiguity head-on. Simply stick to your data science process and see where it takes you.

Let’s apply the data science process here quickly to see how you’d tackle the sales funnel problem.

Applying the Data Science Process

Chances are that the manager who asked you to do such an ambiguous task is not technical, so arguing isn’t really a good idea. What they see is the tip of the iceberg. It’s up to you to navigate through what lies under the water’s surface.

You’ll be defining the problem as well as solving it. Here is a typical workflow to consider:

· Define the problem. The problem usually isn’t clear at first and has to be framed as a problem statement. This helps to pin down both the problem and the business benefit that solving it would bring.

· Collect raw data. Once you are familiar with the problem, find out where the relevant data currently comes from, why it is being collected, and what it is used for. Collect all relevant raw data and the resources you need to make it usable.

· Data wrangling. Convert the collected raw data into usable data for further analysis.

· Explore. Now that you have contextual data, it’s time to look closely at the iceberg that is the data science process. Find any obvious and obscure correlations or trends.

· Go all data science-y. Here, you unleash your data science beast. From simple regression analysis all the way to complicated forecasting models like neural nets, you can implement anything relevant to your goal. Look for valuable insights, and when you find something that helps you achieve your task, automate it and teach it to the machine (machine learning).

· Communicate results. Now that you know what the problem is and have a potential solution, you have two options. Either give your manager an overview of the problem and recommend an action, or present all your analysis and technical results and explain them.
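To make the workflow above concrete, here is a minimal sketch of it applied to the sales funnel task. Everything in it is an assumption made for illustration: the event log, the stage names (visit, signup, purchase), and the helper functions are invented, and a real project would pull data from actual systems and use far richer analysis.

```python
from collections import Counter

# Hypothetical raw event log of (user_id, funnel_stage); invented for illustration.
RAW_EVENTS = [
    ("u1", "visit"), ("u1", "signup"), ("u1", "purchase"),
    ("u2", "visit"), ("u2", "signup"),
    ("u3", "visit"),
    ("u4", "visit"), ("u4", "signup"), ("u4", "purchase"),
    ("u5", "Visit "),   # messy casing and whitespace
    ("u5", "visit"),    # becomes a duplicate once cleaned
    (None, "visit"),    # missing user id
]

# Problem definition (assumed): which funnel step loses the most users?
STAGES = ["visit", "signup", "purchase"]


def wrangle(events):
    """Data wrangling: normalize stage names, drop bad rows and duplicates."""
    cleaned = set()
    for user, stage in events:
        if user is None:
            continue
        stage = stage.strip().lower()
        if stage in STAGES:
            cleaned.add((user, stage))
    return cleaned


def explore(cleaned):
    """Explore: count distinct users who reached each stage."""
    per_stage = Counter(stage for _, stage in cleaned)
    return [per_stage[s] for s in STAGES]


def conversion_rates(stage_counts):
    """Analyze: conversion rate between each pair of consecutive stages."""
    rates = {}
    pairs = zip(STAGES, STAGES[1:], stage_counts, stage_counts[1:])
    for prev, nxt, n_prev, n_next in pairs:
        rates[f"{prev}->{nxt}"] = n_next / n_prev if n_prev else 0.0
    return rates


def communicate(rates):
    """Communicate: name the weakest step in one plain sentence."""
    worst = min(rates, key=rates.get)
    return f"Biggest drop-off: {worst} ({rates[worst]:.0%} convert)"


cleaned = wrangle(RAW_EVENTS)
counts = explore(cleaned)
rates = conversion_rates(counts)
summary = communicate(rates)
print(counts)
print(summary)
```

The one-line summary is the tip of the iceberg that the manager sees; the wrangling and exploration functions are the part below the waterline.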

The last step, communicating results, is where most data scientists face critique: the impression that they are doing what anyone else could do, but taking more time to do it. The following section discusses this in more detail.

The Critical View on Data Scientists

There are several critiques of data science, from both technical and non-technical perspectives.

· The data you collect has massive data storage requirements

· Manually spotting and labeling data is a tedious process and takes a lot of time

· The answer is there, but the technique and the model aren’t self-explanatory

· Data is often biased. If the training data isn’t neutral, results may be very biased, especially with the K Nearest Neighbor algorithms[h3]

· Algorithms don’t ordinarily collaborate
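The point about non-neutral training data can be illustrated with a small sketch. The code below is a bare-bones K Nearest Neighbor classifier written from scratch, not any particular library's implementation, and the one-dimensional data, the labels "A" and "B", and the choice of k = 5 are all assumptions made for the example. It classifies the same query point twice: once with a training set dominated by class "A", and once with a balanced one.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Skewed training set (hypothetical): eight "A" points, only two "B" points.
skewed = [((float(x), 0.0), "A") for x in range(8)]
skewed += [((10.0, 0.0), "B"), ((11.0, 0.0), "B")]

# Balanced training set (hypothetical): five points per class.
balanced = [((float(x), 0.0), "A") for x in range(3, 8)]
balanced += [((float(x), 0.0), "B") for x in range(10, 15)]

# The query sits right next to the "B" points in both data sets.
query = (9.6, 0.0)

biased_result = knn_predict(skewed, query, k=5)
balanced_result = knn_predict(balanced, query, k=5)
print(biased_result, balanced_result)
```

With the skewed set there are only two "B" points to vote, so the three nearest "A" points outvote them even though the query lies in "B" territory. With the balanced set the same query is labeled "B".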

Data science is a huge, separate science that, if used on smaller projects such as the one mentioned above, isn’t worth the investment. It takes time and considerable resources to come to a conclusion, which would even then be fallible.

To ensure the data science process is well worth the effort, you need a high-end data repository, a strong labeling algorithm, self-explanatory models, collaborative learning algorithms, and more.

Non-technical individuals usually see just the tip of the data science process iceberg and conclude that data science isn’t that big of a deal, or that useful. Only you, who took a deep dive into the waters surrounding said iceberg, know how deep it goes.

Think of the Titanic: just 10% of the iceberg’s mass was visible. If even 25% of its total mass had been above the sea, the Titanic’s crew would have seen it from afar and maneuvered out of its way. The other 90% of the iceberg wasn’t seen by anyone, yet it was supporting the 10% above. Similarly, people often see just 10% (or, most often, even less) of what the data science process looks like. The other 90% is something only you and I are aware of.

[h1] Link to “KMEANS and DBSCAN clustering”

[h2] Link to “Natural Disaster Prediction and Relief Models”

[h3] Link to “3. K Nearest Neighbor (KNN) Algorithms — A Quick Overview”