
Inf@Vis!

The digital magazine of InfoVis.net

Web Content Mining Visualisation
by Juan C. Dürsteler [message nº 175]

The visualisation of the content of the worldwide web is possibly the most complex part of web mining visualisation, not only because of the vast and diverse nature of the contents but also because of the complexity of their semantics. In this article we try to scratch the surface of this problem by showing some content search and visualisation techniques.

In article number 172 we discussed the three paradigms that support web mining: Structure, Usage and Contents. Today we focus on visualising the contents of the web.

The worldwide web, the web for short, is most probably the largest information repository in existence. Its deceptive ease of access through any browser hides how intrinsically difficult it is to find what we are looking for.

As far as visualisation is concerned, the generic problem of content mining is how to represent the vast amount of information so that what is relevant becomes immediately comprehensible to us, most of whom lack any visual training, let alone training in computer science.

Two well-known facts come to the rescue:

  • the large bandwidth of the human visual system, which lets us take in large amounts of information in just a few seconds.

  • the possibility of representing data so that it matches our perceptual skills in a natural way, a fact closely related to the previous one.

For example, face recognition, a typically difficult and complex task, takes a human being just a split second. We need only seconds to orient ourselves effectively in a complex map that may contain hundreds of thousands of references.

The key lies in converting the information space into a representation that the user can perceive (ideally in an optimal way). This is usually called "rendering".

We are especially interested in visual rendering, the kind that uses the visual channel to convey information. Following Kimani, Catarci and Cruz (chapter 5 of Visualizing the Semantic Web*), visual rendering has three main components:

  • Visual encoding. This is the set of techniques that enable data to be represented on a screen or on paper. It comprises the use of shapes, colours, areas and other visual variables that make it possible to map data variables onto graphic variables.

    Beyond these elementary variables, it's quite usual to use compound structures like graphs and networks that show the connective structure of the data in a more or less direct way. It's not very common to find a visual encoding without a visual metaphor supporting it. As examples of visual encoding onto a 3D map of the Earth, consider the images of the traceroutes of packets through the Internet, depicted using lines, text and colour.

    Geoboy: Representation of the routes followed by data packets.
    Source: Image as can be seen in the Atlas of Cyberspaces website.
    XtraceRoute: Another example of a visual encoding showing traceroutes through the Internet.
    Source: Image as can be seen in the Atlas of Cyberspaces website.

     

  • Visual Metaphor. We have already seen in article number 91 how visual metaphors let us represent an unknown system by mapping it onto another one the user already knows. Number 168 showed an example, the landscape metaphor, which uses the familiar structure of a topographic map to encode the distribution of documents or other elements of a repository, making them easier to search. No less interesting are the metaphors of the document Galaxies, by the Pacific Northwest Lab, and of Web Forager and Web Book, which we saw in number 154.
    Galaxies: Visualisation of more than half a million abstracts from the literature on cancer.
    Source: Image as can be seen in the website of the Pacific Northwest National Laboratory.


  • Conceptual techniques. These sit at the apex of the pyramid of visualisation design. All conceptual techniques aim, in one form or another, to discover the semantics underlying the data in order to then translate them into a visual metaphor or a visual encoding. There are many such techniques, too many to enumerate here, so we will keep to the most important ones:

    • Clustering. It tackles the problem of automatically classifying and/or grouping (clustering) the contents according to how similar the documents are. The idea is to group objects in a way that reveals the structure of the set while classifying the elements unambiguously. For example, in a bibliographic database, the documents that share the most keywords are grouped in the same cluster.

      Typically this is done using mathematical algorithms that represent documents as vectors whose dimensionality is equal to the size of the vocabulary used. Every document has a "point" associated with its vector in this multidimensional vector space. We can then define a distance between any two vectors, which will be shorter the more similar the documents are.

      One of the fundamental problems of these techniques is that visualising their results in a two- or three-dimensional space requires sophisticated dimensionality reduction techniques such as eigenvalue/eigenvector decomposition or principal components analysis (see, for example, the introduction on the Statsoft web page).

    • Conceptual Maps. We discussed these in number 141; they are used for structuring information and finding relations between the concepts of a given knowledge domain. An example of the generalisation of concept maps to the web is WebMap.
    • Latent Semantic Analysis. When analysing text, we run into the problems of synonymy, different words that mean the same thing, and polysemy, single words that can have different meanings depending on the context. This technique uses an algorithm called singular value decomposition that makes it possible to find the "latent" terms that are expressed through different words. With this technique it's possible to find texts in which none of the queried keywords appears explicitly, yet which are semantically relevant to the proposed search.
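
As a minimal sketch of the vector representation described under Clustering above, the following Python fragment turns a few texts into word-count vectors over a shared vocabulary and measures the distance between them. The three toy documents, the helper names `term_vector` and `cosine_distance`, the use of raw word counts as weights and the choice of cosine distance are all assumptions made for illustration; the article does not prescribe any particular weighting or metric.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_distance(u, v):
    """Return 1 - cosine similarity (0 = same direction, 1 = no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an empty document is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy documents, invented for this example only.
documents = {
    "doc_a": "web mining visualisation of web content",
    "doc_b": "visualisation of web content and structure",
    "doc_c": "face recognition in the human visual system",
}

# The vector dimensionality equals the size of the vocabulary used.
vocabulary = sorted({w for text in documents.values() for w in text.lower().split()})
vectors = {name: term_vector(text, vocabulary) for name, text in documents.items()}

d_ab = cosine_distance(vectors["doc_a"], vectors["doc_b"])
d_ac = cosine_distance(vectors["doc_a"], vectors["doc_c"])

print(f"vocabulary size: {len(vocabulary)}")
print(f"distance a-b: {d_ab:.3f}")
print(f"distance a-c: {d_ac:.3f}")
```

With these toy texts, doc_a and doc_b share several words and so end up closer to each other than doc_a is to doc_c, with which it shares none. A real system would hand such distances to a clustering algorithm and then to a dimensionality reduction step before drawing anything on screen.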

These three categories aren't disjoint. In many cases a conceptual technique serves as the basis for a visual metaphor that eventually uses a particular visual encoding to express itself. The number of techniques for finding semantics in the contents of the Web is increasing constantly.

Each year new metaphors and different examples of visual encoding are added to the collection. Many of those that were once considered promising novelties have been abandoned, although not forgotten. Some of them are still used in more or less restricted circles. But we haven't yet found the definitive application in this field, the one that will allow us to dive easily and intuitively into the vast ocean of the web.


* Geroimenko, V. and Chen, C. (eds.) Visualizing the Semantic Web: XML-based Internet and Information Visualization. 2nd edition, Springer, 2003

Links of this issue:

http://www.infovis.net/printMag.php?num=172&lang=2   Num 172 about Web Mining
http://www.infovis.net/printMag.php?num=173&lang=2   Num 173 about Web Structure Visualisation
http://www.infovis.net/printMag.php?num=174&lang=2   Num 174 about Logfile Analysis
http://www.cybergeography.org/atlas/routes.html   Atlas of Cyberspace page about traceroutes
http://www.infovis.net/printMag.php?num=91&lang=2   Num 91 about Visual Metaphors
http://www.infovis.net/printMag.php?num=168&lang=2   Num 168 about The Landscape Metaphor
http://www.infovis.net/printMag.php?num=154&lang=2   Num 154 about Web Forager
http://www.pnl.gov/infoviz/technologies.html   PNNL page about technologies (Galaxies)
http://www.statsoft.com/textbook/stfacan.html   Statsoft's web page about dimensionality reduction
http://www.infovis.net/printMag.php?num=141&lang=2   Num 141 about Conceptual Maps
http://ksi.cpsc.ucalgary.ca/articles/WWW/WWW4WM/   WebMap
© Copyright InfoVis.net 2000-2014