It is very common to receive information in a format that cannot easily be formatted and processed as you want it to be. For example, you may have an unstructured list of information that you need in a table; or data in a table that you need as XML; or data in a web page (HTML) that you need in Excel. So many information professionals maintain a tool-kit that allows them to manipulate all sorts of information in different formats.
Here is a small list of tools aimed at cleansing and structuring data.
- Text editors [Free and commercial; All platforms] Often the first port of call for manipulating un- or semi-structured text. Anything beyond simple search/replace and manual editing is likely to use Regular Expressions (see below).
- Spreadsheets [Excel – Commercial; Windows, Mac. OpenOffice – Free; Windows, Linux, Mac. Many similar free and commercial software tools] This is often my second tool of choice as they are more structured than plain text but not as restrictive as a full database.
- Google Refine [Free software; Windows, Linux, Mac] Run locally to manipulate tabular data. Quite powerful but not especially well documented. Part of the Freebase project. Support is provided by the mailing list. Can save and re-run, easily undo steps. Cannot run in batch mode as a transformation engine. Useful when transformations and cleaning in a Spreadsheet don’t cut it.
- DataWrangler (Alpha) [Free; Web] From Stanford University, they have a demo version online. Similar to Refine.
- TextPipe Pro [Commercial US$ 30-400; Windows] “industrial strength text transformation, conversion, cleansing and extraction workbench”
Once you have structured data, there are a number of types of tools that are used to extract, transform and process data – ETL (Extract, Transform and Load), Business Information (BI) and Data Mining. Typically though, these are enterprise class tools and only really worth bothering with when you have large-scale data and need repeatability. Here is a small list of such tools for reference.
- XSLT [Free and Commercial, All platforms] This is an XML transformation language. It is used for transforming an XML scheme into another one. Perhaps may be thought of as regular expressions for XML.
- SQL  Structured Query Language is the standard language for manipulating relational databases.
- Microsoft PowerPivot for Excel 2010 [Free?; Windows] Desktop BI add-in for Excel 2010.
- Pentahoe [Free and Commercial]
- RapidMiner [Free and Commercial]
- Orange  Visual programming, visualisation, data analysis
- Talend Open Studio [Free and Commercial] Aimed at data integration
- CloverETL [Free and Commercial]
- IBM Data Stage [Commercial]
- Ab Inicio [Commercial]
It might also be worth looking at the BI Verdict web-page for the latest analysis of the strongest players amongst the BI vendors and products