Word clouds sometimes get a hard time as a visualization tool (and sometimes justifiably so), but they can be an effective way to look for errors in large text-based datasets.

Figure (wordcloud1): Output showing barley varieties. The size of each name represents the number of times that variety is mentioned in a collation of pedigree data.

In this example we used the online word cloud service Wordle (http://www.wordle.net). After collecting as much pedigree data as we could on trialled barley cultivars, we counted the number of times each cultivar appears in a pedigree. This gives an overall indication of the relative importance of a variety in the UK breeding process, in that it shows which cultivars have been most widely used in crosses: the larger the name appears in the visualization, the more often it has been used.
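The counting step described above could be sketched roughly as follows. This is a minimal illustration, not the code used for this work: the pedigree format (parents separated by " x ", with brackets for grouping), the function name `count_parents`, and the placeholder cultivar names are all assumptions made for the example.

```python
import re
from collections import Counter


def count_parents(pedigrees):
    """Count how often each cultivar name appears as a parent in a pedigree."""
    counts = Counter()
    for pedigree in pedigrees:
        # Assumed format: parents separated by " x ", brackets used for grouping.
        cleaned = re.sub(r"[()]", " ", pedigree)
        for name in re.split(r"\s+x\s+", cleaned, flags=re.IGNORECASE):
            name = name.strip()
            if name:
                counts[name] += 1
    return counts


# Placeholder pedigree strings for illustration only; not real pedigree data.
example_pedigrees = [
    "Alpha x Beta",
    "(Alpha x Gamma) x Beta",
    "Alpha x Betta",
]

counts = count_parents(example_pedigrees)

# One "name<TAB>count" line per cultivar, most frequently used first.
for name, n in counts.most_common():
    print(f"{name}\t{n}")
```

The resulting name and count pairs are the kind of weighted word list that a word cloud tool can size names by. Note that the deliberately misspelt "Betta" gets its own separate count, which is exactly what makes such errors visible in the cloud.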

What is interesting about this is that it is a quick and dirty method of identifying problematic data. When we took over half a million rows of data and plotted them in this way, it was immediately obvious that there were spelling mistakes in variety names, something that would be more difficult to spot when browsing through spreadsheets. A misspelt name shows up as its own, usually much smaller, word alongside the correct spelling.
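The visual check can be followed up programmatically. The sketch below is not the method used here, which relied on the word cloud itself; it is one possible complement that flags rarely seen names closely resembling a frequently seen one, using `difflib.get_close_matches` from the Python standard library. The function name `suspect_spellings`, the thresholds, and the example counts are assumptions made for illustration.

```python
from difflib import get_close_matches


def suspect_spellings(counts, max_rare=1, cutoff=0.8):
    """Map each rare name to the most similar common name, if one is close enough.

    counts   -- dict of variety name -> number of occurrences
    max_rare -- names seen at most this many times are treated as rare
    cutoff   -- similarity threshold between 0 and 1 (higher is stricter)
    """
    common = [name for name, n in counts.items() if n > max_rare]
    suspects = {}
    for name, n in counts.items():
        if n <= max_rare:
            matches = get_close_matches(name, common, n=1, cutoff=cutoff)
            if matches:
                suspects[name] = matches[0]
    return suspects


# Placeholder counts for illustration only; not real data.
example_counts = {"Alpha": 3, "Beta": 2, "Gamma": 1, "Betta": 1}

suspects = suspect_spellings(example_counts)
for rare, likely in suspects.items():
    print(f"{rare} may be a misspelling of {likely}")
```

A flagged pair is only a candidate for manual checking: a rare name can be a genuinely different variety with a similar name, so the thresholds would need tuning against real data.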

Figure (wordcloud2): Output with cultivars sized based on the number of data points in a large trials dataset.

The second example shown here is similar to the pedigree contribution example above but uses National List trials data instead. In this example, the larger the variety name, the more trials data exist for it. Varieties that were historically used as controls, such as Halcyon, are immediately obvious because of their larger representation in the visualization.

So while word clouds are often overused and heavily criticised in the data visualization community, they do have their uses!

 
For further information on this work please contact Paul Shaw (paul.shaw@hutton.ac.uk) from the James Hutton Institute.