

The digital magazine of InfoVis.net

TextArc, visualising text.
by Juan C. Dürsteler [message nº 103]

Visualising the structure of the raw text of a document helps greatly in its analysis and complements techniques like computational linguistics by using the pattern-finding capability of the human brain.
TextArc view of Alice in Wonderland by Lewis Carroll. You can see the double spiral that contains the full text of the book. In the middle you find the most frequently used words, located at the "center of gravity" of their positions in the text.
Image courtesy of Digital Image Design Inc.

Reading a book is an inspiring but lengthy process, and having to deal with many of them is by no means an easy task. Search engines now make it easier to find the right book or the right part of a book, but they don’t help us understand the text, nor do they let us discover patterns and concepts in arbitrary text.

TextArc is an experimental tool that produces an alternative visualisation of a text. It was designed by W. Bradford Paley, of Digital Image Design Incorporated, with the idea of allowing the user to “get an overview of a medium-sized body of raw text, eg. the amount one could receive in one single day”, meaning raw ASCII text such as e-mail, news, etc.

Indices, summaries, concordances, lexicons and other structured lists have been available and used for many years. Computational linguistics has produced many interesting techniques capable of automatically producing summaries and abstracts and of identifying key ideas.

Graphical techniques have also been developed to show the prevalence of certain words in large collections of documents. Examples are Kohonen maps and treemaps (see issues 39 and 51). We have already seen other techniques devoted to showing focus and context in a single view (issues 3 and 85).

TextArc, unlike other approaches, takes into account the original linear order of the text. To do this, it shows the entire text as two concentric spirals on the screen, made up of many lines written in a one-pixel-high font.

Each line corresponds to its counterpart in the text, including all its words. Spacing, chapters, sections, typography, poetry and all the “geometrical” features of a text are preserved so that they become visual landmarks helping the user to identify particular sections of the text.
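As a rough illustration of how lines can be placed in reading order around the edge of the drawing, here is a minimal Python sketch. It is not TextArc's code: the function names, the start at twelve o'clock, the clockwise direction and the radii are all assumptions chosen for the example.

```python
import math

def spiral_position(index, total, turns=1.0, outer_radius=1.0, inner_radius=0.8):
    """Return (x, y) for line `index` of `total` on a clockwise spiral.

    The start at twelve o'clock, the clockwise direction and the radii are
    illustrative assumptions, not TextArc's actual parameters.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = index / total                  # 0.0 for the first line, just under 1.0 for the last
    angle = 2.0 * math.pi * turns * fraction  # angle swept so far, measured clockwise from the top
    # The radius shrinks linearly so successive turns do not overlap.
    radius = outer_radius - (outer_radius - inner_radius) * fraction
    x = radius * math.sin(angle)              # sin for x and cos for y so that angle 0 points up
    y = radius * math.cos(angle)
    return x, y

def layout_lines(lines, **kwargs):
    """Pair every line of text with its position, preserving the original order."""
    total = len(lines)
    return [(spiral_position(i, total, **kwargs), line) for i, line in enumerate(lines)]
```

With the y axis pointing up, the first line lands at the top of the drawing and later lines follow clockwise, so a position on the periphery corresponds directly to a position in the text.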

The spiral occupies only the periphery of the drawing, leaving the centre for the most used words (see the attached drawings). Words that appear more than once are drawn inside the spiral at their average position, the “gravity centre” of the different places in the text where they occur.
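The “gravity centre” idea can be sketched as a plain average of coordinates. The Python below is an illustration under stated assumptions, not the tool's implementation: `word_centroids`, the `min_count` parameter and the simple lower-case tokenisation are hypothetical, and the line positions are taken as given.

```python
import re
from collections import defaultdict

def word_centroids(lines, positions, min_count=2):
    """Map each repeated word to the mean (x, y) of its occurrences.

    `positions[i]` is the (x, y) of `lines[i]` on the drawing. Every occurrence
    counts, so a word used twice on one line weighs that line twice.
    The regular expression is a simplistic tokeniser chosen for the example.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # word -> [sum of x, sum of y, count]
    for line, (x, y) in zip(lines, positions):
        for word in re.findall(r"[a-z']+", line.lower()):
            entry = sums[word]
            entry[0] += x
            entry[1] += y
            entry[2] += 1
    # Keep only words seen at least `min_count` times and divide by the count.
    return {
        word: (sx / n, sy / n)
        for word, (sx, sy, n) in sums.items()
        if n >= min_count
    }
```

A word whose occurrences all sit on the left of the drawing gets a centroid on the left, while a word spread evenly around the spiral ends up near the middle.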

A word that occurs more often on the left side of the spiral than on the right, for example, will be drawn closer to the left side. By selecting one of these words with the mouse we can see lines that link the word to the places where it occurs in the text, and the text lines where it appears are highlighted in the outer spiral. Pointing to a line in the drawing shows its contents.
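Linking a selected word to the places where it occurs amounts to looking it up in an index from words to line numbers. The following Python sketch shows one way to build such an index; the names `build_line_index` and `lines_for` and the tokenisation are assumptions made for the example, not part of TextArc.

```python
import re
from collections import defaultdict

def build_line_index(lines):
    """Map each word to the ascending list of 0-based line numbers where it occurs."""
    index = defaultdict(list)
    for number, line in enumerate(lines):
        # A set records each line once, even if the word repeats on that line.
        for word in set(re.findall(r"[a-z']+", line.lower())):
            index[word].append(number)
    return dict(index)

def lines_for(word, lines, index):
    """Return (line_number, text) pairs for one selected word, ignoring case."""
    return [(n, lines[n]) for n in index.get(word.lower(), [])]
```

Selecting "Rabbit" would then return every line to highlight, and each returned line number identifies the end point of one link on the spiral.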

Links of a word. The selected word, "Rabbit", shows its distribution by means of lines that link it to its appearances in the text laid out as a spiral and in the extracted text overlaid on the screen. The lines where the word appears are highlighted in green. Image courtesy of Digital Image Design Inc.
Concordances. Concordances show how many times each word is used. A thesaurus can be built in which you can look up each word and its associated frequency. Concordant words are drawn in red. Image courtesy of Digital Image Design Inc.

Words get bolder and brighter in proportion to their use in the text. In the printed version, type size encodes frequency of use. This intriguing piece of Java code has more features than we have space to describe here. It’s worth playing with it using any text from Project Gutenberg.
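The frequency encoding can be sketched as a linear mapping from word counts to a brightness level. This Python example is only an illustration: `word_brightness`, the 0.2 floor that keeps rare words visible and the tokenisation are assumptions, and the actual scale used by TextArc is not stated here.

```python
import re
from collections import Counter

def word_brightness(text, min_level=0.2, max_level=1.0):
    """Map each word to a brightness between min_level and max_level.

    Brightness grows linearly with the word's count, and the most frequent
    word gets max_level. The floor `min_level` is an assumption of this sketch.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: min_level + (max_level - min_level) * count / top
        for word, count in counts.items()
    }
```

The same counts could drive type size instead of brightness, as in the printed version, by swapping the output range for a range of point sizes.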

Particularly interesting is the front end for searching texts in Project Gutenberg’s database. Once you have selected the text you are interested in, don’t forget to drop it on the appropriate box to see it in TextArc mode.

After playing for some time with this elegant tool on several Project Gutenberg texts, a few impressions stand out. TextArc provides an unusual way to look at text. You can locate the relevant terms, search for word associations and build lists of the most used words in an instant. Seeing which characters appear most often in a novel, and where in the text they appear, is simple and very intuitive.

You can see, for example, that in one book the most used word appears in only 3 chapters, while in other books the most used word is scattered more or less regularly throughout the text. Loading a large text and building the picture does take a while, but it is a price worth paying to get into the full text in a “random access mode” that lets you analyse the document both visually and effectively.

I’m not sure whether this tool is the right one for indexing information on everybody’s desktop. End users will tell us once it is released. What is certain is that its elegant clock metaphor and the ease with which it reveals patterns in documents make it a very good example of a fine piece of Information Visualisation.

See also issue number 25, which covers software visualisation. To some extent TextArc shares features with SeeSoft, a software visualisation tool.

Links of this issue:

http://www.textarc.org/Stills.html   Digital Image Design Inc. image gallery
http://www.textarc.org   TextArc website
http://www.didi.com   Digital Image Design Incorporated
http://www.infovis.net/printMag.php?num=39&lang=2   Number 39 Kohonen Maps
http://www.infovis.net/printMag.php?num=51&lang=2   Number 51 Treemaps
http://www.infovis.net/printMag.php?num=3&lang=2   Number 3 Bifocal Displays
http://www.infovis.net/printMag.php?num=85&lang=2   Number 85 Focus and Context
http://www.textarc.org/Thousands.html   TextArc Visualisation with texts of Project Gutenberg
http://www.infovis.net/printMag.php?num=25&lang=2   Number 25 Software Visualisation
© Copyright InfoVis.net 2000-2014