I attended a webinar hosted by Talend on the theme "Cleaning data in open source." The topic particularly interested me, since I had discovered this tool on a recent project with the same objective. Here is a short summary of what was said during the 45-minute session:
What is data quality?
This is indeed a vast subject: it covers measuring data quality, cleansing poor-quality data, and more. In general, data quality rests on three properties: accuracy, completeness and consistency.
Data quality is assessed at several levels:
- Technical: referential integrity and data types; these checks are mainly enforced by the DBMS itself
- Logical: the data agrees with the functional specifications (each piece of data belongs to a specific domain)
- Semantic: the data is correctly formatted (addresses, phone numbers ...)
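To make the three levels concrete, here is a minimal Python sketch of one check per level on a single record. Nothing in it comes from the webinar or from Talend: the field names, the `KNOWN_COUNTRY_CODES` reference set, the age range and the phone pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical reference set standing in for a foreign-key target table.
KNOWN_COUNTRY_CODES = {"FR", "DE", "US"}

# Deliberately simple, assumed pattern: optional "+", then 8 to 15 digits.
PHONE_PATTERN = re.compile(r"^\+?\d{8,15}$")


def check_record(record):
    """Return a list of (level, message) problems found in one record."""
    problems = []
    age = record.get("age")

    # Technical level: data type and referential integrity.
    if not isinstance(age, int):
        problems.append(("technical", "age is not an integer"))
    if record.get("country") not in KNOWN_COUNTRY_CODES:
        problems.append(("technical", "country is not in the reference set"))

    # Logical level: the value must fall in the domain given by the spec
    # (the 0 to 130 range is an assumed functional rule).
    if isinstance(age, int) and not 0 <= age <= 130:
        problems.append(("logical", "age is outside the allowed range"))

    # Semantic level: the value must be correctly formatted.
    phone = str(record.get("phone", "")).replace(" ", "")
    if not PHONE_PATTERN.match(phone):
        problems.append(("semantic", "phone number is badly formatted"))

    return problems
```

In a real system the technical checks would normally live in the DBMS as column types and foreign keys; they appear here only so that the three levels can be compared side by side.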
Why clean the data?
Clearly, you need quality data to make the right decisions. The most annoying part is that poor-quality data then spreads through the system, from application to application, like a virus. That is why it is essential to control our data upstream, and therefore to build these cleaning routines into the integration process.
Finally, we were treated to a Talend demonstration on this issue, and I must say I was amazed. In addition to all the basic components, Talend offers powerful tools to extract duplicate records or, using reference data, to correct records based on spelling or pronunciation.
To illustrate this, take the example of a very fine name: "Gregory".
Using these components, we can correct names such as "Gregor", "Grgory", or worse, before reinjecting them into a database. Records that Talend could not handle can be written to a CSV file, an XML file or a separate table for manual processing.
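The same idea can be sketched outside Talend with Python's standard library. This is not how Talend's components work internally; it is only an illustration of correcting against reference data by spelling similarity, using `difflib`. The `REFERENCE_NAMES` list, the 0.8 cutoff and the function names are assumptions made for the example.

```python
import csv
import difflib
import io

# Hypothetical reference data; in practice this would come from a trusted table.
REFERENCE_NAMES = ["Gregory", "Nicolas", "Sophie"]


def correct_name(raw, reference=REFERENCE_NAMES, cutoff=0.8):
    """Return the closest reference name, or None if nothing is close enough."""
    by_lower = {name.lower(): name for name in reference}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


def clean_names(raw_names):
    """Split the input into corrected names and rejects for manual review."""
    cleaned, rejected = [], []
    for raw in raw_names:
        fixed = correct_name(raw)
        if fixed is None:
            rejected.append(raw)
        else:
            cleaned.append(fixed)
    return cleaned, rejected


def rejects_to_csv(rejected):
    """Serialize the rejects as CSV text, one raw value per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["raw_name"])
    for raw in rejected:
        writer.writerow([raw])
    return buffer.getvalue()
```

With these assumed settings, `clean_names(["Gregor", "Grgory", "Xyz"])` maps the first two values to "Gregory" and sends "Xyz" to the reject list, which `rejects_to_csv` can then serialize for manual processing. The cutoff is the important design choice: set too low, it silently "corrects" names that are merely similar, so anything uncertain is better routed to the rejects.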
In summary: a comprehensive ETL tool that saves many hours of custom development, and a very useful conference for gaining perspective on best practices for using the platform.


