It can be frustrating when you can see useful information in the ‘free text’ column…
During data cleaning & preparation activities, it is not uncommon to identify columns that contain free text data, where patterns of words are observable, but not in a form able to be used for analysis. For example, the data sample shown here contains a potentially useful column of Comments, where the word ‘wire’ is appearing often.
The ‘how-to’ practice shown below is a manual workaround and admittedly a last resort, however, when you are left with no other options, you just have to make the best of what you have got!
The discovery of free text columns in datasets is often an ironic twist, as this data is, more often than not, the very data that is needed to investigate & solve the problem at hand!
I most often encounter free text in the ‘comments’ field; for example associated with process delay records, customer surveys, and safety incident reports. A similar situation exists when free text ‘description fields’ are used within transactional data; for example Purchase Order description, product delivery instructions, and sales refund notes.
The free text within Posts on Socials are also very useful data to analyse, as the content of what people are posting about is often just as important as the associated user demographics and page statistics.
The analyst is then faced with the ‘free text dilemma’; spend more time collecting new data, which will take delay conducting the data analysis and also require a concerted effort to engage others and complete the collection of fresh data, or make-do with what is at hand to try and bring extract some value from what has already been collected.
The ‘how-to’ workaround to wrangle ‘free text’ data into a more useful form
SigmaXL & Minitab software provide useful column formatting tools; cells can be colour-coded when matching a key word, and a search can then be conducted on each cell colour to ‘Find and Replace’ the free text with standardised text. Follow the steps using this Purchase Order data example.
The Purchase Order data includes a column of “description”, (Column 5 highlighted), this is free text of what the Purchase Order was issued for. While not in a useable form as-is, it would be very useful to identify patterns of purchases, to understand where volume discounts with Suppliers could be leveraged for example. A quick scan of the column reveals there are repeated words, the word “Labour” for example, however as the free text is not standardised, it is difficult to summarise the column by totalling up the number of times each word has appeared.
It is going to require some manual effort to make this data useful, and in every case some judgement of the balance between the effort needed to clean the historical data, (while accepting that the integrity of the information will be compromised to some degree) vs the effort & elapsed time it would take to collect fresh data with a known integrity needs to be made.
Note: Before starting it is good practice to create a copy of your data in its original raw form, incase it all goes pair-shaped and you need to start again. My preferred practice here is to duplicate the worksheet, renaming it as ‘Raw’.
SigmaXL allows the native Excel function to conditionally format the cells using ‘text that contains’ option, which will conditionally format the cells in the column with a chosen fill colour. (If you are using Minitab, colour coding of cells can be achieved using conditional formatting by ‘text that contains’, which will provide the same outcome).
At this point, scan down the column, colour coding makes it easy to identify if the cells highlighted are infact the ones you intended to identify using ‘text that contains’. Repeat the conditional formatting, using a different fill colour each time, to highlight other cells that contain other desired text fields. This will rely on some keen observation and diligence on your part, to repeatedly scan through the free text in the column and identify the observable patterns of words you wish to include.
The final step is to standardise the free text within each conditional format colour. This is easily done in SigmaXL by sorting the data by each cell colour, then using copy/paste in your desired standard text, just remember to select all columns before sorting! (Minitab provides a ‘Find and Replace’ function to search for each conditional format colour, and replacing the free text with your preferred standardised text).
Repeat the text standardisation for each of the conditional format colours used. There will be cells where no conditional format ‘treatment’ has been applied during the Find and Replace effort, these remain in original free text form, and can be ignored. I often conduct a final scan down the column & recheck these for any new word patterns that might be useful, before using the Find and Replace function to set the free text in these cells to ‘Other’
Its now very easy to conduct analysis on this column, a great starting point is Pareto Charts to highlight the comparative totals for each category. The Purchase Order data in this case also includes the spend amount, so Pareto Charts on Frequency (count) and also Sum (amount) would provide useful insights.
While manipulating free text data in this way is a workaround, it provides a useful and practical way to ‘deal with the reality’ when working with less than ideal datasets. Techniques to clean up free text, such as this one, are more than just a useful ‘tool’ to have at hand when analysing data. More often than not, its likely no-one has examined the free text content previously, so new insights and patterns are just waiting to be found, and the effort invested will most likely be well rewarded!