Why industrial data is often unusable (and how to fix it)
In many industrial companies, data is everywhere:
- Sensors
- Machines
- Supervision systems
- Historical databases
- Excel files
On paper, it looks like a gold mine.
We talk about Industry 4.0, AI, predictive maintenance, but when an engineer actually starts working with this data, the reality is often very different.
The data exists, but it is difficult to use and sometimes simply unusable.
Contrary to popular belief, the problem in industry is not a lack of data. The problem is the quality of the data.
Let’s take a few common examples:
- incorrectly calibrated sensors,
- missing values,
- inconsistent units,
- data recorded at different frequencies,
- ambiguous variable names,
- manually modified Excel files
and so on!
These problems may seem minor, but when they accumulate, they make any analysis complex or even impossible.
A very common example
Imagine that you want to analyze the temperature of a piece of industrial equipment. The data comes from several sources:
- a monitoring system
- a CSV export
- an Excel history
You quickly discover that:
- some values are missing,
- the timestamps are not aligned,
- the units change depending on the source,
- some columns have been modified manually.
Before you even begin the analysis, you spend several hours cleaning the data. This phenomenon is so common that many engineers spend more time preparing data than analyzing it!
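To make this concrete, here is a minimal sketch of that cleaning work in pandas. The column names, timestamps, and values are invented for illustration: one source logs in Celsius every 5 minutes with a gap, the other in Fahrenheit on misaligned timestamps.

```python
import pandas as pd
import numpy as np

# Hypothetical source 1: monitoring system, Celsius, with a missing value
scada = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10"]),
    "temp_c": [60.2, np.nan, 61.0],
})

# Hypothetical source 2: CSV export, same signal but in Fahrenheit,
# recorded at different (misaligned) timestamps
export = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:02", "2024-01-01 00:07"]),
    "temp_f": [140.4, 141.1],
})

# 1. Harmonize units: convert Fahrenheit to Celsius
export["temp_c"] = (export["temp_f"] - 32) * 5 / 9

# 2. Align timestamps: resample both sources onto a common 5-minute grid
scada = scada.set_index("timestamp").resample("5min").mean()
export = export.set_index("timestamp")[["temp_c"]].resample("5min").mean()

# 3. Combine sources (scada takes priority) and fill short remaining gaps
merged = scada.combine_first(export).interpolate(limit=1)
print(merged)
```

Three trivial-looking steps, yet each one hides a decision (which source wins, which grid, how far to interpolate) that has to be made before any analysis starts.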
Why this problem is underestimated
Most discussions about industrial data focus on:
- algorithms,
- machine learning,
- artificial intelligence.
But these technologies are all based on a simple principle:
a model cannot be better than the data it uses
If the data is inconsistent, incomplete, or poorly structured, the results will be unreliable. In some cases, they may even be misleading.
The temptation to jump straight to models
It is very tempting to quickly move on to the next step:
- building a model
- training an algorithm
- testing a predictive approach
But in many industrial projects, this approach fails.
Why?
Because the most important step has been overlooked:
understanding the data.
Understand before calculating
Before launching complex analyses, it is essential to answer a few simple questions:
- Where does the data come from?
- How is it collected?
- What transformations has it undergone?
- What errors are possible?
This step may seem basic, but it prevents many problems down the line.
This is also De Facto Data’s philosophy:
understand before calculating.
How to improve the situation
Fortunately, there are several simple practices for improving the quality of industrial data.
1. Document data sources
Each dataset should be accompanied by a clear description:
- data origin
- units used
- acquisition frequency
- any transformations
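One lightweight way to do this is to keep a small metadata record next to each dataset. The field names below are purely illustrative; adapt them to your own conventions:

```python
# Hypothetical metadata record accompanying a dataset.
# All names and values here are examples, not a fixed schema.
metadata = {
    "source": "reactor_3_scada_export",        # data origin
    "unit": "degC",                            # units used
    "sampling_rate": "1min",                   # acquisition frequency
    "transformations": [                       # any transformations applied
        "resampled from 10s to 1min (mean)",
        "outliers above 200 degC removed",
    ],
}
```

Even a plain dictionary or a text file like this answers most of the questions from the previous section without anyone having to ask the original author.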
2. Standardize formats
Consistent formats help avoid many errors:
- standardized timestamps
- consistent units
- explicit variable names
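These three conventions can all be applied once, at ingestion. A minimal sketch, assuming a raw export with an ambiguous column name and day-first local timestamps (both invented for the example):

```python
import pandas as pd

# Hypothetical raw export: cryptic names, day-first timestamps, Fahrenheit
raw = pd.DataFrame({
    "TS": ["01/02/2024 10:00", "01/02/2024 10:01"],
    "T1": [140.0, 141.0],
})

# Standardized timestamps: parse with an explicit format, store as UTC
raw["timestamp"] = pd.to_datetime(raw["TS"], format="%d/%m/%Y %H:%M", utc=True)

# Consistent units: convert Fahrenheit to Celsius once, at the boundary
raw["temperature_c"] = (raw["T1"] - 32) * 5 / 9

# Explicit variable names: keep only the clean, self-describing columns
clean = raw[["timestamp", "temperature_c"]]
print(clean)
```

The key idea is that the conversion happens exactly once, at the edge of the pipeline, so that everything downstream can trust the format.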
3. Automate cleaning
Certain repetitive operations can be automated:
- detecting outliers
- removing duplicates
- harmonizing formats
This saves time and reduces human error.
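As a sketch of what such automation can look like, here is a small reusable cleaning step. The 3.5 threshold and the median-based outlier score are illustrative choices (a robust score is used because the mean and standard deviation are themselves distorted by the outliers you are trying to find):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicate rows (e.g. a file exported twice)
    df = df.drop_duplicates()
    # Flag outliers with a robust, median-based score
    med = df["value"].median()
    mad = (df["value"] - med).abs().median()
    robust_z = 0.6745 * (df["value"] - med) / mad
    return df[robust_z.abs() < 3.5]

# Hypothetical sensor readings with one absurd value and a duplicated row
data = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=5, freq="1min"),
    "value": [10.0, 10.5, 10.2, 10.4, 500.0],
})
data = pd.concat([data, data.iloc[[1]]], ignore_index=True)

cleaned = clean(data)
```

Once written, a function like this runs identically on every file, which is exactly what manual cleaning in Excel cannot guarantee.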
4. Regularly check data quality
Data quality must be monitored over time.
Sensors can drift.
Systems can evolve.
Regular checks prevent problems from accumulating.
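Such checks can also be scripted and run periodically. Below is a hypothetical quality report; the column names and the expected sampling step are assumptions for the example:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, expected_step: str = "1min") -> dict:
    # Time differences between consecutive readings
    gaps = df["timestamp"].diff().dropna()
    return {
        "missing_values": int(df["value"].isna().sum()),
        "largest_gap": gaps.max(),
        "irregular_sampling": bool((gaps != pd.Timedelta(expected_step)).any()),
    }

# Hypothetical readings: one missing value and a 4-minute gap
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:05"]),
    "value": [21.3, None, 21.8],
})
report = quality_report(readings)
print(report)
```

Running a report like this on a schedule turns "the sensor has been drifting for three months" into an alert on day one.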
An opportunity for engineers
Working with industrial data can sometimes be frustrating, but it is also an opportunity.
Engineers who know how to:
- understand data
- structure information
- automate analyses
can transform this data into powerful decision-making tools.
Going further
If you regularly work with technical data and certain tasks take you several hours a week, you can start with a simple exercise.
Identify a repetitive task in your workflow:
- cleaning files
- preparing reports
- consolidating data
Then figure out how to improve it.
To help you take that first step, I’ve created a 30-day challenge.
The goal is simple:
automate a repetitive task related to your data.
You can join it here: 30-day challenge
Welcome to De Facto Data.
