Cleaning data and integrating it from several sources is a major pain
We often need to combine several related data sets before we can run an analysis. The differences between them may be subtle, but this is where most of the cost lies. The 80/20 rule seems to apply here: some clean data is obtained rather quickly, while 80% of the effort is spent on cleaning the dirtiest 20% of the data. Sometimes the task looks as simple as merging two tables with slightly different spellings of names or different date formats. Sometimes the sources use coding systems that do not match. ETL tools based on relational schemas offer little help in discovering gaps and anomalies, and they are not flexible enough to handle exceptional cases. Worst of all, we cannot get an overview of how difficult the cleaning will be before we have actually tried to do it, which forces us to start over several times.
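To make the "overview before cleaning" idea concrete, here is a minimal sketch of a pre-merge profile, using only the Python standard library. It is an illustration under assumptions, not a description of any particular tool: the function names (profile_join, profile_dates), the list of date formats, the 0.8 similarity cutoff and the sample values are all hypothetical. It only counts how many names match exactly, nearly, or not at all, and which date formats occur, so that we get a rough sense of the effort before attempting the merge.

```python
from collections import Counter
from datetime import datetime
import difflib

# Hypothetical list of date formats to probe; extend it to match the real sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def detect_date_format(value):
    """Return the first format in DATE_FORMATS that parses value, else None."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def profile_join(left_names, right_names, cutoff=0.8):
    """Classify each left name as exact, near (fuzzy) or unmatched against the right names."""
    right = sorted({normalize_name(n) for n in right_names})
    report = {"exact": [], "near": [], "unmatched": []}
    for raw in left_names:
        name = normalize_name(raw)
        if name in right:
            report["exact"].append(raw)
            continue
        # The cutoff is an assumed similarity threshold, not a recommended value.
        close = difflib.get_close_matches(name, right, n=1, cutoff=cutoff)
        if close:
            report["near"].append((raw, close[0]))
        else:
            report["unmatched"].append(raw)
    return report


def profile_dates(values):
    """Count how many values fall under each detected date format (None = unparsed)."""
    return Counter(detect_date_format(v) for v in values)


# Made-up sample values, for illustration only.
report = profile_join(["Acme Corp", "Jon Smith", "Umbrella"],
                      ["acme corp", "John Smith", "Hooli"])
date_counts = profile_dates(["2021-03-04", "04/03/2021", "March 4, 2021"])
for bucket, items in report.items():
    print(bucket, len(items))
print(dict(date_counts))
```

The sizes of the "near" and "unmatched" buckets, and the number of dates that parse under no known format, give a rough estimate of how much manual work the dirtiest part of the data will need. A real profile would also have to cover mismatched coding systems, which this sketch does not attempt.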