Data cleansing

I was able to attend a webinar hosted by Talend on the theme "Cleaning data in open source." The topic was of particular interest to me, since I had come across this tool on a recent project with the same objective. Here is a short summary of what was said during the 45-minute session:

What is data quality?

This is a vast subject indeed ... It covers measuring data quality, cleansing poor-quality data, and more. In general, data quality rests on three things: accuracy, completeness and consistency.

Data quality is assessed at several levels:

  • Technical: referential integrity and data types; this control is mainly enforced by the DBMS itself
  • Logical: data conforms to the functional specifications (each value belongs to a specific domain)
  • Semantic: data is correctly formatted (addresses, phone numbers ...)
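
As a minimal sketch of these three levels (this was not shown in the webinar; the field names, the list of allowed countries and the phone format are assumptions chosen purely for illustration), a record can be checked level by level in Python:

```python
import re

# Hypothetical rules, chosen only for illustration.
ALLOWED_COUNTRIES = {"FR", "BE", "CH"}               # logical level: allowed domain
PHONE_PATTERN = re.compile(r"^0\d(\s?\d{2}){4}$")    # semantic level: 10 digits, optional spaces


def check_record(record):
    """Return the list of quality problems found in one record (a dict)."""
    problems = []
    # Technical level: the value must have the expected type.
    if not isinstance(record.get("age"), int):
        problems.append("age: not an integer")
    # Logical level: the value must belong to its functional domain.
    if record.get("country") not in ALLOWED_COUNTRIES:
        problems.append("country: outside the allowed domain")
    # Semantic level: the value must be correctly formatted.
    if not PHONE_PATTERN.match(str(record.get("phone", ""))):
        problems.append("phone: badly formatted")
    return problems
```

In practice the technical checks would usually be left to the DBMS, as noted above; they appear here only to show the three levels side by side.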

Why clean the data?

It is obvious that you need quality data if you want to make the right decisions. The most annoying part is that bad data then spreads through the system, from application to application, like a virus. That is why it is essential to control data upstream and, therefore, to build these cleaning routines into the integration process.

Finally, we were treated to a Talend demonstration on this topic, and I must say I was impressed. In addition to all the "basic" components, Talend offers powerful tools to extract duplicate records or, using reference data, to correct records based on spelling or pronunciation.

To illustrate this, consider the example of a very beautiful name, "Gregory" :-) Using these components, we can correct names like "Gregor", "Grgory", or worse, before reinjecting them into a database. We can also route the data that Talend could not handle to a CSV file, an XML file or another table for manual processing.
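
To give a rough idea of the principle outside of Talend, here is a minimal Python sketch. It does not reproduce Talend's components or its matching algorithm: it simply uses the standard `difflib` module to compare each name with a reference list, and writes whatever cannot be matched to a CSV file for manual processing. The reference list, the 0.8 threshold and the function names are assumptions made for the example.

```python
import csv
import difflib

REFERENCE_NAMES = ["Gregory", "Olivier"]  # hypothetical reference data


def correct_name(raw, reference=REFERENCE_NAMES, cutoff=0.8):
    """Return the closest reference name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(raw, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def clean_names(raw_names, reject_path):
    """Correct what can be corrected; write the rest to a CSV file for manual review."""
    cleaned, rejected = [], []
    for raw in raw_names:
        fixed = correct_name(raw)
        if fixed is None:
            rejected.append(raw)
        else:
            cleaned.append(fixed)
    with open(reject_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["raw_name"])
        writer.writerows([name] for name in rejected)
    return cleaned, rejected
```

For example, `clean_names(["Gregor", "Grgory", "Zzz"], "rejects.csv")` would correct the first two names to "Gregory" and send "Zzz" to the reject file. The comparison is purely on spelling and is case-sensitive; a match on pronunciation would need a phonetic algorithm instead.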

In summary, a comprehensive ETL tool that saves many hours of custom development. And a very useful conference for gaining more perspective on best practices for using the platform.

  • http://www.smile.fr Olivier Cousin

    These data cleaning functions ("data cleansing") are, to my knowledge, rarely used in real life. In most cases they address problems specific to call centers, where an agent enters "Dupont" with a T instead of "Dupond" with a D. Hence the arrival of many Duponts and Duponds in the database, a number of which are the same person. Another example is the formatting of phone numbers or postal codes. We find these same issues with data entry forms on the web or at the counter.
    In my opinion, this data cleaning has only a marginal impact on decision-making.
    It probably has much more value in transactional mode, where any error or uncertainty is detected instantly and a list of suggestions is available at the time of data entry. This is an important contribution to the practice of B2C CRM.
    But this assumes that the data cleansing tool operates in real time (response time < 1000 ms), with a number of simultaneous users that can reach dozens or more at peak times. Are we there?
    The rest is just "show-off" to me.
    Olivier Cousin - SMILE Consulting

  • Gregory http://www.ippon.fr Sti & e

    Many applications use files in various formats (CSV, Excel, ...) generated by people or by other applications. In that case, it is worthwhile to use Talend so that the data is cleaned prior to insertion in the database. I gave an example for spell checking, but cleaning can go further: setting missing values to zero, for example, concatenating several fields, standardizing values ... The point is to adapt "external" data to the application's database.
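
To make these operations concrete, here is a small Python sketch of the kind of row-level cleaning described above. It is not a Talend job; the column names (`first_name`, `last_name`, `amount`, `country`) are hypothetical, and each row is assumed to be a dict of strings, as produced for instance by `csv.DictReader`.

```python
def clean_row(row):
    """Adapt one externally produced row (a dict of strings) to the target schema."""
    cleaned = {}
    # Concatenate several fields into one, skipping empty parts.
    parts = (row.get("first_name", "").strip(), row.get("last_name", "").strip())
    cleaned["full_name"] = " ".join(p for p in parts if p)
    # Set a missing numeric value to zero.
    amount = row.get("amount", "").strip()
    cleaned["amount"] = float(amount) if amount else 0.0
    # Standardize values: trim whitespace and normalize case.
    cleaned["country"] = row.get("country", "").strip().upper()
    return cleaned
```

Each rule is deliberately simple; the point is only that every incoming row goes through the same adaptation step before it reaches the database.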