Why industrial data is often unusable (and how to fix it)

In many industrial companies, data is everywhere:

  • Sensors
  • Machines
  • Supervision systems
  • Historical databases
  • Excel files
  • etc…

There’s a lot of talk about Industry 4.0, artificial intelligence, and digital transformation.
Numerous studies, particularly those by McKinsey & Company, highlight the considerable potential of data-driven approaches in industry.

On paper, it looks like a real revolution.

But as soon as an engineer starts working concretely with this data, the reality is often very different.

👉 The data exists… but it’s difficult to use.
👉 And sometimes, it’s simply unusable.

Contrary to popular belief, the problem in industry is not a lack of data, but its quality.

As IBM emphasizes in its work on data quality, incomplete or inconsistent data can lead to flawed decisions and significant losses in value.

[Image: a magnifying glass highlighting the quality of industrial data]

The most frequent problems

In practice, engineers very often encounter:

  • incorrectly calibrated sensors,
  • missing values,
  • inconsistent units,
  • different acquisition frequencies,
  • ambiguous variable names,
  • manually modified Excel files.

Taken individually, these problems seem minor.
But taken together, they make any analysis complex… or even impossible.

A very common example

Imagine that you want to analyze the temperature of a piece of industrial equipment.

The data comes from several sources:

  • a monitoring system
  • a CSV export
  • an Excel history

You quickly discover that:

  • some values are missing,
  • the timestamps are not aligned,
  • the units change depending on the source,
  • some columns have been modified manually.

👉 Result: before you even start the analysis, you spend hours cleaning the data.

This observation is widely shared in the data world: much of the time on a project goes into preparing the data, a point that comes up in many of the educational resources offered on Kaggle.
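To make the temperature example concrete, here is a minimal sketch using pandas. Everything in it is hypothetical: the two small tables, the column names (`temp_c`, `temp_f`, `temp_c_export`), and the 10-second matching tolerance are assumptions chosen for illustration, not values from a real installation. It converts one source from Fahrenheit to Celsius, then pairs each monitoring reading with the nearest export reading.

```python
import pandas as pd

# Hypothetical readings from a monitoring system, in degrees Celsius.
monitoring = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:01:00", "2024-01-01 00:02:00"]
    ),
    "temp_c": [20.0, None, 21.0],  # one value is missing
})

# Hypothetical CSV export of the same equipment, in degrees Fahrenheit,
# with timestamps a few seconds off the monitoring system's clock.
export = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:05", "2024-01-01 00:01:04", "2024-01-01 00:02:03"]
    ),
    "temp_f": [68.0, 68.9, 69.8],
})

# Harmonize units first: convert Fahrenheit to Celsius.
export["temp_c_export"] = (export["temp_f"] - 32.0) * 5.0 / 9.0

# Pair each monitoring row with the nearest export row, but only if the
# two timestamps are at most 10 seconds apart (assumed tolerance).
aligned = pd.merge_asof(
    monitoring.sort_values("timestamp"),
    export.sort_values("timestamp")[["timestamp", "temp_c_export"]],
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("10s"),
)

# Count what is still missing instead of silently filling it.
missing_count = int(aligned["temp_c"].isna().sum())
print(aligned)
print("Missing monitoring values:", missing_count)
```

Once the two sources sit side by side in the same unit, disagreements and gaps become visible, and you can decide deliberately how to handle them.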

Why this problem is underestimated

Most discussions about data in industry focus on:

  • algorithms,
  • machine learning,
  • artificial intelligence

But all these approaches are based on one fundamental principle:
👉 a model cannot be better than the data it uses

If the data is inconsistent or incomplete, the results will be:

  • unreliable
  • difficult to interpret
  • even misleading

The temptation to jump straight to models

In many projects, there’s a desire to move too quickly:

  • building a model
  • training an algorithm
  • testing a predictive approach

But this approach often fails.

Why?

👉 Because the most important step has been neglected:
understanding the data.

Understand before calculating

Before any analysis, ask yourself these questions:

  • Where does the data come from?
  • How is it collected?
  • What transformations has it undergone?
  • What errors are possible?

It may seem simple.
But it’s often what makes all the difference.

👉 This is also the philosophy of De Facto Data:
understand before calculating.
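Part of this understanding can be supported by a quick profile of each signal before any analysis. The sketch below is one possible approach with pandas, not a standard method. The function name `profile_signal`, the column names, and the sample values are hypothetical.

```python
import pandas as pd

def profile_signal(df: pd.DataFrame, time_col: str, value_col: str) -> dict:
    """Summarize basic facts about one time series before analyzing it."""
    ordered = df.sort_values(time_col)
    # Time between consecutive samples reveals gaps and irregular acquisition.
    intervals = ordered[time_col].diff().dropna()
    return {
        "rows": len(ordered),
        "missing_share": float(ordered[value_col].isna().mean()),
        "duplicate_timestamps": int(ordered[time_col].duplicated().sum()),
        "median_interval": intervals.median(),
        "max_interval": intervals.max(),
        "min_value": ordered[value_col].min(),
        "max_value": ordered[value_col].max(),
    }

# Hypothetical sample: one missing value and an 8-minute gap in acquisition.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:01",
        "2024-01-01 00:02", "2024-01-01 00:10",
    ]),
    "temperature_c": [20.0, 20.5, None, 85.0],
})

report = profile_signal(sample, "timestamp", "temperature_c")
for key, value in report.items():
    print(f"{key}: {value}")
```

A profile like this does not answer where the data comes from or what transformations it has undergone, but it quickly shows which questions to ask the people who know the process.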

How to improve the situation?

Good news: simple actions can already make a huge difference.

1. Document data sources

Each dataset should include:

  • data origin
  • units
  • acquisition frequency
  • possible transformations
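One lightweight way to keep this documentation next to the data is a small metadata record saved alongside each dataset. The sketch below uses only the Python standard library. The class name `SignalMetadata`, its fields, and the example values are hypothetical, chosen to mirror the four items above.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class SignalMetadata:
    """Minimal description of one signal (hypothetical structure)."""
    name: str
    origin: str               # where the data comes from
    unit: str                 # physical unit of the values
    sampling_period_s: float  # acquisition period, in seconds
    transformations: list[str] = field(default_factory=list)

# Hypothetical example values.
meta = SignalMetadata(
    name="furnace_outlet_temperature",
    origin="monitoring system, CSV export",
    unit="degC",
    sampling_period_s=60.0,
    transformations=["converted from degF", "resampled to 1 min"],
)

# Serialize to JSON so the description can be stored next to the data file.
metadata_json = json.dumps(asdict(meta), indent=2)
print(metadata_json)
```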

2. Standardize formats

Consistent formats help avoid many errors:

  • standardized timestamps
  • consistent units
  • explicit variable names
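As an illustration, the three points above could be applied in one small function. This is a sketch with pandas under stated assumptions: the raw column names (`Date`, `T1`), the source time zone (`Europe/Paris`), and the fact that `T1` is recorded in Fahrenheit are all hypothetical.

```python
import pandas as pd

# Hypothetical raw export: local-time strings, a vague column name,
# and a temperature recorded in degrees Fahrenheit.
raw = pd.DataFrame({
    "Date": ["2024-03-01 08:00:00", "2024-03-01 08:01:00"],
    "T1": [212.0, 213.8],
})

# Assumed conventions for this example.
COLUMN_NAMES = {"Date": "timestamp_utc", "T1": "furnace_temperature_c"}
SOURCE_TIMEZONE = "Europe/Paris"

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with explicit names, UTC timestamps, and Celsius values."""
    out = df.rename(columns=COLUMN_NAMES)
    # Standardized timestamps: parse, attach the source zone, convert to UTC.
    out["timestamp_utc"] = (
        pd.to_datetime(out["timestamp_utc"], format="%Y-%m-%d %H:%M:%S")
        .dt.tz_localize(SOURCE_TIMEZONE)
        .dt.tz_convert("UTC")
    )
    # Consistent units: Fahrenheit to Celsius.
    out["furnace_temperature_c"] = (out["furnace_temperature_c"] - 32.0) * 5.0 / 9.0
    return out

clean = standardize(raw)
print(clean)
```

Putting the unit in the column name is a design choice, not a rule, but it makes it harder to mix Celsius and Fahrenheit by accident.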

3. Automate cleaning

  • Anomaly detection
  • Duplicate removal
  • Harmonization

👉 You can use Python with the pandas library for this.
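Here is a minimal sketch of what such automation might look like with pandas. The function name `clean_signal`, the column names, and the threshold of 5 are assumptions. The anomaly rule is a robust z-score based on the median and the median absolute deviation, which is one common choice among many and may not suit every signal.

```python
import pandas as pd

def clean_signal(df: pd.DataFrame, value_col: str = "temperature_c",
                 threshold: float = 5.0) -> pd.DataFrame:
    """Drop duplicate timestamps and flag values far from the median."""
    # Duplicate removal: keep the first row for each timestamp.
    out = (
        df.drop_duplicates(subset="timestamp", keep="first")
        .sort_values("timestamp")
        .reset_index(drop=True)
    )
    # Anomaly detection: robust z-score from the median and the
    # median absolute deviation (MAD).
    median = out[value_col].median()
    mad = (out[value_col] - median).abs().median()
    if mad == 0:
        # No spread around the median: this rule cannot rank the values.
        out["is_anomaly"] = False
    else:
        robust_z = 0.6745 * (out[value_col] - median) / mad
        out["is_anomaly"] = robust_z.abs() > threshold
    return out

# Hypothetical readings: one duplicated timestamp and one implausible value.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:01",
        "2024-01-01 00:02", "2024-01-01 00:03", "2024-01-01 00:04",
    ]),
    "temperature_c": [20.0, 20.2, 20.2, 19.9, 250.0, 20.1],
})

cleaned = clean_signal(readings)
print(cleaned)
```

The sketch flags suspicious values rather than deleting them, so that someone who knows the process can decide whether a spike is a sensor fault or a real event.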

4. Monitor quality over time

  • Sensor drift
  • System evolution
  • Human error

👉 Data quality is a continuous process, not a one-off action.
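As one example of such monitoring, sensor drift can be checked by comparing a sensor with a trusted reference over time. The sketch below uses only the Python standard library and assumes that a reference measurement exists, which is not always the case. The function name, the window size, the sample values, and the 0.5 limit are hypothetical.

```python
from statistics import mean

def drift_against_reference(sensor: list[float], reference: list[float],
                            window: int = 5) -> float:
    """Change in mean offset (sensor - reference) between the first
    and the last `window` samples."""
    if len(sensor) != len(reference):
        raise ValueError("sensor and reference must have the same length")
    if len(sensor) < 2 * window:
        raise ValueError("need at least two full windows of data")
    offsets = [s - r for s, r in zip(sensor, reference)]
    return mean(offsets[-window:]) - mean(offsets[:window])

# Hypothetical data: the sensor starts in agreement with the reference,
# then reads about one degree too high.
reference = [50.0] * 10
sensor = [50.0, 50.1, 49.9, 50.0, 50.0, 50.9, 51.0, 51.1, 51.0, 51.0]

DRIFT_LIMIT = 0.5  # assumed acceptable drift, in the sensor's unit

drift = drift_against_reference(sensor, reference)
needs_recalibration = abs(drift) > DRIFT_LIMIT
print(f"Drift: {drift:.2f}, recalibration needed: {needs_recalibration}")
```

Run on a schedule, a check like this turns data quality into a routine measurement instead of a one-off audit.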

An opportunity for engineers

Working with industrial data can sometimes be frustrating, but it is also an opportunity.

Engineers who know how to:

  • Structure the data
  • Automate the processes
  • Improve the reliability of the analyses

👉 quickly become indispensable.

So, how exactly can we save time?

If you work with technical data, you’ve probably already experienced this:

copying and pasting data for hours to produce reports…

👉 Good news: it doesn’t have to be this way.

In the next article, I will show you how to process 50 test reports in 10 seconds.

➡️ Read the article: 50 test reports, 10 seconds: stop copy-pasting, start analyzing
