
Inf@Vis!

The digital magazine of InfoVis.net

John Wilder Tukey: in memoriam
by Juan C. Dürsteler [message nº 6]

Last July 25, John Wilder Tukey died at the age of 85. Tukey was one of the great statistical talents of the 20th century and had a notable influence on Information Visualisation.

Perhaps his best-known contribution is the Fast Fourier Transform (FFT) algorithm, which he developed with James Cooley. Nevertheless, Tukey contributed to modern statistics in many other ways. Among them, Exploratory Data Analysis (EDA) is one of the most outstanding.

His book Exploratory Data Analysis (1977) is the classic reference on the topic. EDA is a philosophy of statistical data exploration that is essentially graphical in nature. For this reason it is sometimes confused with graphical statistics, although EDA goes far beyond this.

It's worth looking at the excellent web site on EDA in the statistics handbook of the National Institute of Standards & Technology to get an idea of the scope and interest of this development.

The interest of EDA, for our purposes, lies in the power that graphics add to statistical tools. Graphics are a great help in understanding the meaning of the data.

It's worth spending some time looking at some of the plots invented by Tukey, such as the Box-and-Whisker Plot or the Stem-and-Leaf Diagram, among others.

In the Stem & Leaf diagram, each data element shows its own value and, at the same time, occupies a space, so that we simultaneously obtain the profile of a univariate distribution and a presentation of the data themselves. Moreover, repeated information is reduced to a minimum.

As an example of this, I have prepared a train timetable from a leaflet for the Castelldefels-Barcelona (Sants) line, picked up at Renfe's railway station.

Originally, the timetable occupies a table of 10 rows by 9 columns, plus an additional "widow" column for the 22:38 train: a total of 91 fields in hh.mm format, or 455 characters.

Original timetable Castelldefels -> Barcelona-Sants 
5.03  7.32   9.02  11.07  13.32  15.07  16.50  18.32  20.07  22.38
6.02  7.37   9.07  11.32  13.37  15.20  17.02  18.37  20.20
6.18  7.50   9.24  11.37  13.50  15.32  17.07  18.50  20.32
6.37  8.02   9.32  12.02  14.02  15.37  17.20  19.02  20.37
6.48  8.05   9.37  12.07  14.07  15.50  17.32  19.07  20.50
6.55  8.20  10.02  12.32  14.20  16.02  17.37  19.20  21.02
7.02  8.24  10.07  12.37  14.32  16.07  17.50  19.32  21.07
7.07  8.32  10.32  13.02  14.37  16.20  18.02  19.37  21.20
7.20  8.37  10.37  13.07  14.50  16.32  18.07  19.50  21.32
7.25  8.51  11.02  13.20  15.02  16.37  18.20  20.02  21.37

In the Stem & Leaf diagram we represent the hours to the left of the separation bar | and the minutes of each train departure to the right. The train frequency can be easily deduced from the length of the rows. Moreover, it's very easy to identify the departure pattern of the trains.
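
As a minimal sketch of this construction, the following Python fragment groups a few of the departure times by hour (the stem) and lists their minutes (the leaves). The names departures, stem_and_leaf and render are only illustrative choices, and only an excerpt of the timetable is used to keep the example short.

```python
from collections import defaultdict

# A short excerpt of the departures, as "h.mm" strings taken from the leaflet.
departures = ["6.18", "5.03", "6.02", "6.55", "6.37", "6.48", "22.38"]


def stem_and_leaf(times):
    """Group 'h.mm' strings by hour (stem) and collect the minutes (leaves)."""
    rows = defaultdict(list)
    for t in times:
        hour, minute = t.split(".")
        rows[int(hour)].append(minute)
    # Sort the stems numerically and the two-digit leaves within each stem.
    return {h: sorted(rows[h]) for h in sorted(rows)}


def render(rows):
    """Format each row as 'hh | mm mm ...'."""
    return [f"{h:02d} | {' '.join(leaves)}" for h, leaves in rows.items()]


for line in render(stem_and_leaf(departures)):
    print(line)
# 05 | 03
# 06 | 02 18 37 48 55
# 22 | 38
```

The length of each printed row is the number of trains in that hour, which is what lets the diagram double as a frequency profile.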

Castelldefels -> Barcelona-Sants Stem & Leaf diagram
05 | 03
06 | 02 18 37 48 55
07 | 02 07 20 25 32 37 50
08 | 02 05 20 24 32 37 51
09 | 02 07 24 32 37
10 | 02 07 32 37
11 | 02 07 32 37
12 | 02 07 32 37
13 | 02 07 20 32 37 50
14 | 02 07 20 32 37 50
15 | 02 07 20 32 37 50
16 | 02 07 20 32 37 50
17 | 02 07 20 32 37 50
18 | 02 07 20 32 37 50
19 | 02 07 20 32 37 50
20 | 02 07 20 32 37 50
21 | 02 07 20 32 37
22 | 38

Castelldefels -> Barcelona-Sants reduced Stem & Leaf diagram
                     05 | 03
                     06 | 02 18 37 48 55
                     07 | 02 07 20 25 32 37 50
                     08 | 02 05 20 24 32 37 51
                     09 | 02 07 24 32 37
               10 11 12 | 02 07 32 37
13 14 15 16 17 18 19 20 | 02 07 20 32 37 50
                     21 | 02 07 20 32 37
                     22 | 38

Furthermore, since the departure pattern is exactly the same at some hours (for instance between 13 and 20), we can reduce the timetable even further without losing any information, while increasing its clarity and ease of use (see the reduced Stem & Leaf diagram above).
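
The reduction can be sketched in Python by merging consecutive hours whose leaves are identical. This is only one possible way to do it, not necessarily how the diagram was prepared. The dictionary full is transcribed from the complete diagram, and reduce_rows is a hypothetical helper name.

```python
from itertools import groupby

# Complete diagram: hour (stem) -> minutes (leaves), transcribed from the article.
full = {
    5: "03",
    6: "02 18 37 48 55",
    7: "02 07 20 25 32 37 50",
    8: "02 05 20 24 32 37 51",
    9: "02 07 24 32 37",
    21: "02 07 20 32 37",
    22: "38",
}
full.update({h: "02 07 32 37" for h in range(10, 13)})        # hours 10 to 12
full.update({h: "02 07 20 32 37 50" for h in range(13, 21)})  # hours 13 to 20


def reduce_rows(rows):
    """Merge runs of consecutive stems that share exactly the same leaves."""
    merged = []
    for leaves, hours in groupby(sorted(rows), key=rows.get):
        stems = " ".join(f"{h:02d}" for h in hours)
        merged.append((stems, leaves))
    return merged


reduced = reduce_rows(full)
width = max(len(stems) for stems, _ in reduced)
for stems, leaves in reduced:
    # Right-align the stems so that the separator bars line up.
    print(f"{stems:>{width}} | {leaves}")
```

Because groupby only merges adjacent keys, hours are combined only when they are consecutive, so the reduced diagram still reads in time order.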

In the end we have 59 two-digit fields that add up to 118 characters, plus the separator bars. This is almost 4 times fewer characters than the 455 of the original timetable. Less space and more clarity.
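
These counts can be checked with a few lines of Python. The sketch below assumes, as the text does, that every field of the original table takes the 5 characters of the hh.mm format. The list reduced is transcribed from the reduced diagram, and the variable names are illustrative.

```python
# Reduced diagram as (stems, leaves) rows, transcribed from the article.
reduced = [
    ("05", "03"),
    ("06", "02 18 37 48 55"),
    ("07", "02 07 20 25 32 37 50"),
    ("08", "02 05 20 24 32 37 51"),
    ("09", "02 07 24 32 37"),
    ("10 11 12", "02 07 32 37"),
    ("13 14 15 16 17 18 19 20", "02 07 20 32 37 50"),
    ("21", "02 07 20 32 37"),
    ("22", "38"),
]

# Original table: 10 rows x 9 columns plus the lone 22:38 entry.
original_fields = 10 * 9 + 1
original_chars = original_fields * len("hh.mm")

# Every stem and every leaf in the reduced diagram is one two-digit field.
reduced_fields = sum(
    len(stems.split()) + len(leaves.split()) for stems, leaves in reduced
)
reduced_chars = reduced_fields * 2

print(original_fields, original_chars)            # 91 455
print(reduced_fields, reduced_chars)              # 59 118
print(round(original_chars / reduced_chars, 2))   # 3.86
```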

This tells us that an appropriate arrangement of data can be doubly informative, showing both the values and their distribution, and that graphic representation can contribute enormously to pattern perception and to the understanding of the nature of phenomena.

Can anybody imagine looking at the evolution of sales without representing it graphically? Who hasn't seen the evolution of the stock exchange as a wavy line? Or a histogram of the frequency of an illness by population age?

Surely many of us are using EDA on a daily basis without even knowing it.

Links of this issue:

http://www-groups.dcs.st-and.ac.uk/~history/Mathematicians/Tukey.html  
http://www.itl.nist.gov/div898/handbook/eda/eda.htm  
http://www.itl.nist.gov/  
http://mathworld.wolfram.com/Box-and-WhiskerPlot.html  
http://www.renfe.es  
© Copyright InfoVis.net 2000-2014