Data Cleansing Tools

It is very common to receive information in a format that cannot easily be formatted and processed as you want it to be. For example, you may have an unstructured list of information that you need in a table; or data in a table that you need as XML; or data in a web page (HTML) that you need in Excel. So many information professionals maintain a tool-kit that allows them to manipulate all sorts of information in different formats.

Here is a small list of tools aimed at cleansing and structuring data.

  • Text editors [Free and commercial; All platforms] Often the first port of call for manipulating un- or semi-structured text. Anything beyond simple search/replace and manual editing is likely to use Regular Expressions (see below).
  • Spreadsheets [Excel – Commercial; Windows, Mac. OpenOffice – Free; Windows, Linux, Mac. Many similar free and commercial software tools] This is often my second tool of choice as they are more structured than plain text but not as restrictive as a full database.
  • Regular Expressions – Quick Reference. Often incomprehensible but very powerful text transformation “language”. Commonly found in good text editors though it is sometimes easier to use one of the web training tools as long as the text to transform is not too big. Well worth learning as many, complex transformations can be built up using a series of regular expressions and they are available from many tools. Also available in any decent code editor or IDE, Microsoft Office (via VBA code), PHP, JavaScript, Java, Linux command line tools, etc.
  • Google Refine [Free software; Windows, Linux, Mac] Run locally to manipulate tabular data. Quite powerful but not especially well documented. Part of the Freebase project. Support is provided by the mailing list. Can save and re-run, easily undo steps. Cannot run in batch mode as a transformation engine. Useful when transformations and cleaning in a Spreadsheet don’t cut it.
  • DataWrangler (Alpha) [Free; Web] From Stanford University, they have a demo version online. Similar to Refine.
  • TextPipe Pro [Commercial US$ 30-400; Windows] “industrial strength text transformation, conversion, cleansing and extraction workbench”

Once you have structured data, there are a number of types of tools that are used to extract, transform and process data – ETL (Extract, Transform and Load), Business Information (BI) and Data Mining. Typically though, these are enterprise class tools and only really worth bothering with when you have large-scale data and need repeatability. Here is a small list of such tools for reference.

  • XSLT [Free and Commercial, All platforms] This is an XML transformation language. It is used for transforming an XML scheme into another one. Perhaps may be thought of as regular expressions for XML.
  • SQL [] Structured Query Language is the standard language for manipulating relational databases.
  • Microsoft PowerPivot for Excel 2010 [Free?; Windows] Desktop BI add-in for Excel 2010.
  • Pentahoe [Free and Commercial]
  • RapidMiner [Free and Commercial]
  • Orange [] Visual programming, visualisation, data analysis
  • Talend Open Studio [Free and Commercial] Aimed at data integration
  • CloverETL [Free and Commercial]
  • IBM Data Stage [Commercial]
  • Ab Inicio [Commercial]

It might also be worth looking at the BI Verdict web-page for the latest analysis of the strongest players amongst the BI vendors and products