Why industrial data is often unusable (and how to fix it)
In many industrial companies, data is everywhere:
- Sensors
- Machines
- Supervision systems
- Historical databases
- Excel files
- etc…
There’s a lot of talk about Industry 4.0, artificial intelligence, and digital transformation.
Numerous studies, particularly those by McKinsey & Company, highlight the considerable potential of data-driven approaches in industry.
On paper, it looks like a real revolution.
But as soon as an engineer starts working concretely with this data, the reality is often very different.
👉 The data exists… but it’s difficult to use.
👉 And sometimes, it’s simply unusable.
Contrary to popular belief, the problem in industry is not a lack of data, but its quality.
As IBM emphasizes in its work on data quality, incomplete or inconsistent data can lead to flawed decisions and significant losses in value.

The most frequent problems
In practice, engineers very often encounter:
- incorrectly calibrated sensors,
- missing values,
- inconsistent units,
- different acquisition frequencies,
- ambiguous variable names,
- manually modified Excel files.
Taken individually, these problems seem minor.
But taken together, they make any analysis complex… or even impossible.
A very common example
Imagine that you want to analyze the temperature of a piece of industrial equipment.
The data comes from several sources:
- a monitoring system
- a CSV export
- an Excel history
You quickly discover that:
- some values are missing,
- the timestamps are not aligned,
- the units change depending on the source,
- some columns have been modified manually.
👉 Result: before you even start the analysis, you spend hours cleaning the data.
This observation is widely shared among data practitioners: a large share of project time goes into preparing data, a point that many educational resources on Kaggle also make.
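To make the example concrete, here is a minimal sketch of that cleaning step with pandas. The two small tables, the column names (`temp_c`, `temp_f`, `temp_c_export`) and the 5-second tolerance are all hypothetical; real exports will need their own values.

```python
import pandas as pd

# Hypothetical data: the same temperature seen by two sources.
monitoring = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00:00", "2024-01-01 08:01:00", "2024-01-01 08:02:00",
    ]),
    "temp_c": [20.0, None, 21.0],  # one missing value
})
csv_export = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00:02", "2024-01-01 08:01:03", "2024-01-01 08:02:01",
    ]),
    "temp_f": [68.0, 68.9, 69.8],  # different unit, shifted timestamps
})

# Put both sources in the same unit (degrees Celsius).
csv_export["temp_c_export"] = (csv_export["temp_f"] - 32) * 5 / 9

# Match each monitoring row to the nearest export row within 5 seconds.
aligned = pd.merge_asof(
    monitoring.sort_values("timestamp"),
    csv_export[["timestamp", "temp_c_export"]].sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("5s"),
)

# Make missing values explicit instead of silently filling them.
aligned["monitoring_missing"] = aligned["temp_c"].isna()
```

The manually modified columns are the one problem this sketch does not address: code alone cannot tell which cells were edited by hand, which is why documenting the source matters.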
Why this problem is underestimated
Most discussions about data in industry focus on:
- algorithms,
- machine learning,
- artificial intelligence
But all these approaches are based on one fundamental principle:
👉 a model cannot be better than the data it uses
If the data is inconsistent or incomplete, the results will be:
- unreliable
- difficult to interpret
- even misleading
The temptation to jump straight to models
In many projects, there’s a desire to move too quickly:
- building a model
- training an algorithm
- testing a predictive approach
But this approach often fails.
Why?
👉 Because the most important step has been neglected:
understanding the data.
Understand before calculating
Before any analysis, ask yourself these questions:
- Where does the data come from?
- How is it collected?
- What transformations has it undergone?
- What errors are possible?
It may seem simple.
But it’s often what makes all the difference.
👉 This is also the philosophy of De Facto Data:
understand before calculating.
How to improve the situation?
Good news: simple actions can already make a huge difference.
1. Document data sources
Each dataset should include:
- data origin
- units
- acquisition frequency
- possible transformations
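One lightweight way to keep this documentation is a small description file stored next to each dataset. The sketch below is only an illustration: the `DatasetDescription` class, its field names and the example values are hypothetical, not a standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetDescription:
    """Minimal description of one dataset (hypothetical structure)."""
    name: str
    origin: str                  # where the data comes from
    unit: str                    # physical unit of the measurement
    acquisition_period_s: float  # time between two samples, in seconds
    transformations: list[str] = field(default_factory=list)


furnace_temp = DatasetDescription(
    name="furnace_temperature",
    origin="monitoring system export",
    unit="degC",
    acquisition_period_s=60.0,
    transformations=["converted from degF", "resampled to 1 min"],
)

# Serialize the description so it can be saved alongside the data file.
description_json = json.dumps(asdict(furnace_temp), indent=2)
```

Even a file this small answers the four questions above for whoever opens the dataset next.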
2. Standardize formats
Consistent formats help avoid many errors:
- standardized timestamps
- consistent units
- explicit variable names
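As a rough sketch, the three points above can be handled by a couple of small helper functions. The timestamp formats, the unit labels and the variable name `furnace_temperature_degC` are assumptions made for the example, and the code assumes that source clocks are already in UTC, which should be checked for each source.

```python
from datetime import datetime, timezone

# Hypothetical timestamp formats found across sources; adapt to your exports.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S")


def to_utc_iso(raw: str) -> str:
    """Parse a timestamp in any known format and return it as ISO 8601.

    Assumption: source clocks are already in UTC. If they use local time,
    convert explicitly instead of relabeling.
    """
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next known format
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to degrees Celsius from a declared unit."""
    if unit == "degC":
        return value
    if unit == "degF":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unknown temperature unit: {unit!r}")


# Explicit names: equipment, quantity, and unit in the variable name.
row = {
    "timestamp_utc": to_utc_iso("03/01/2024 08:15"),
    "furnace_temperature_degC": to_celsius(212.0, "degF"),
}
```

Raising an error on an unknown format or unit is deliberate: a loud failure is easier to fix than a silently wrong value.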
3. Automate cleaning
- Anomaly detection
- Duplicate removal
- Harmonization
👉 You can use Python and its pandas library for this.
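Here is a minimal sketch of such a cleaning step with pandas. The function name `clean_temperature`, the column names and the thresholds are hypothetical, and the rolling-median check is just one simple way to spot isolated spikes, not a universal rule.

```python
import pandas as pd


def clean_temperature(df, low=-50.0, high=500.0, max_jump=10.0):
    """Return a deduplicated, sorted copy with an 'anomaly' flag column.

    Column names ('timestamp', 'temp_c') and thresholds are hypothetical.
    """
    out = (
        df.drop_duplicates(subset="timestamp", keep="first")
        .sort_values("timestamp")
        .reset_index(drop=True)
    )
    # Physical range check; missing values are not flagged here.
    out_of_range = out["temp_c"].notna() & ~out["temp_c"].between(low, high)
    # Compare each value with a centered rolling median to spot spikes.
    median = out["temp_c"].rolling(window=5, center=True, min_periods=1).median()
    spike = (out["temp_c"] - median).abs() > max_jump
    out["anomaly"] = out_of_range | spike
    return out


raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 08:01", "2024-01-01 08:01",
        "2024-01-01 08:02", "2024-01-01 08:03", "2024-01-01 08:04",
        "2024-01-01 08:05",
    ]),
    "temp_c": [20.0, 20.5, 20.5, 21.0, 250.0, 21.5, 22.0],
})
cleaned = clean_temperature(raw)
```

The function flags suspicious values rather than deleting them, so an engineer can still decide whether a spike is a sensor fault or a real event.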
4. Monitor quality over time
- Sensor drift
- System evolution
- Human error
👉 Data quality is a continuous process, not a one-off action.
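For sensor drift, a simple recurring check can already help. The sketch below assumes that a trusted reference measurement is available for comparison, which is not always the case; the function name `drift_alert`, the window size and the tolerance are hypothetical.

```python
from statistics import mean


def drift_alert(sensor, reference, window=5, tolerance=0.5):
    """Flag drift when the mean offset over the last `window` samples
    exceeds `tolerance`. Returns (alert, recent_offset).

    Assumes `reference` is a trusted measurement of the same quantity.
    """
    if len(sensor) != len(reference):
        raise ValueError("sensor and reference must have the same length")
    if len(sensor) < window:
        raise ValueError("not enough samples for the requested window")
    offsets = [s - r for s, r in zip(sensor, reference)]
    recent_offset = mean(offsets[-window:])
    return abs(recent_offset) > tolerance, recent_offset


# Hypothetical readings: the sensor slowly drifts away from the reference.
reference = [20.0] * 10
sensor = [20.0, 20.1, 19.9, 20.0, 20.2, 20.6, 20.7, 20.8, 20.9, 21.0]
alert, offset = drift_alert(sensor, reference)
```

Run on a schedule, a check like this turns data quality into the continuous process described above rather than a one-off cleanup.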
An opportunity for engineers
Working with industrial data can sometimes be frustrating, but it is also an opportunity.
Engineers who know how to:
- Structure the data
- Automate the processes
- Improve the reliability of the analyses
👉 quickly become indispensable.
So, how exactly can we save time?
If you work with technical data, you’ve probably already experienced this:
copying and pasting data for hours to produce reports…
👉 Good news: it doesn’t have to be that way.
In the next article, I will show you how to transform 50 test reports in 10 seconds.
➡️ Read the article: 50 test reports, 10 seconds: stop copy-pasting, start analyzing
