
Inf@Vis!

The digital magazine of InfoVis.net

Text Mining
by Juan C. Dürsteler [message nº 27]

Text mining is yet another emerging technology whose goal is to find knowledge in large collections of unstructured documents.

Approximately 80% of an organisation's information is stored in unstructured textual form: reports, e-mail, meeting minutes, etc.

In contrast with the Semantic Web, which we discussed last week, text mining operates on unstructured text databases, with the objective of detecting non-trivial patterns in them and extracting information about the knowledge they store.

The Semantic Web intends to build a whole metadata structure (information about the structure and meaning of the stored data) and to include it in the documents. In this way the documents can be navigated, identified and "understood" without human intervention. Text mining, on the other hand, intends to extract metadata from textual data that is not necessarily structured. In this sense it could be a helpful part of the Semantic Web.

Text mining systems are thus able to perform lexical analysis and, especially, to build automatically the classifications and categorisations that are codified in the form of a thesaurus. It is no coincidence that "thesaurus" comes from the ancient Greek thesaurós, treasure. The key point of a thesaurus is that each of its terms is used (at least in principle) to denote a concept, the basic semantic unit for conveying ideas.

Text mining systems can be helpful in categorising an organisation's existing information, in filtering and routing information such as e-mail, and in detecting duplicated, related or similar information.
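To give an idea of how the detection of duplicated or similar documents might work, here is a minimal sketch in Python. It assumes a very simple measure, the Jaccard similarity between the sets of words of two texts; the function names and the 0.5 threshold are illustrative choices, not taken from any of the commercial systems mentioned in this issue.

```python
import re

def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two texts: shared tokens / all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)

def similar_pairs(docs, threshold=0.5):
    """Return (i, j, score) for every pair of documents at or above the threshold."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = jaccard(docs[i], docs[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs

if __name__ == "__main__":
    docs = [
        "The quarterly sales report",
        "Quarterly sales report draft",
        "Meeting minutes",
    ]
    for i, j, score in similar_pairs(docs):
        print(f"documents {i} and {j} look related (similarity {score:.2f})")
```

Comparing every pair of documents is only practical for small collections; it is shown here because it makes the idea easy to follow.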

Some applications have moved from the laboratory into practice. For example, it appears that some companies are using these systems to identify the contents of the e-mail that their customers send them, in order to redirect it automatically to the appropriate departments. In other cases, when the system is able to identify a query that is already in the Frequently Asked Questions database, it automatically sends the answer, without human intervention.
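The routing of e-mail to departments can be pictured with a small sketch in Python. It is only an illustration under strong assumptions: the departments and their keyword lists are hypothetical, and a message is simply sent to the department whose keywords it mentions most often. Real systems would probably rely on richer linguistic analysis and on categories learned from data.

```python
import re

# Hypothetical department vocabularies, invented for this example.
DEPARTMENT_KEYWORDS = {
    "billing": {"invoice", "payment", "refund", "charge"},
    "support": {"error", "crash", "install", "password"},
    "sales": {"price", "quote", "order", "discount"},
}

def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def route(message, default="general"):
    """Return the department whose keywords overlap most with the message.

    Falls back to `default` when no keyword matches at all.
    """
    words = tokens(message)
    best, best_hits = default, 0
    for department, keywords in DEPARTMENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = department, hits
    return best

if __name__ == "__main__":
    print(route("Please refund my last payment"))
    print(route("The program shows an error after install"))
    print(route("Hello there"))
```

The fallback department matters: a message that matches nothing is better handed to a person than forced into the wrong queue.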

These techniques have possibly been applied longest in Business and Technology Intelligence, to follow the evolution of the other players in the market by digging into press releases, scientific journals and other textual databases. Another interesting application is Market Research. By gathering statistics about the use of certain concepts and/or topics on the Web, the demographics and demand curves associated with certain products can be estimated.
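As a rough illustration of what gathering statistics about the use of concepts could mean, the following Python sketch counts in how many documents each concept is mentioned. It assumes that a concept is a single lower-case word and that the documents have already been collected; the function name and the sample texts are invented for the example.

```python
import re
from collections import Counter

def concept_document_counts(documents, concepts):
    """Count, for each concept, how many documents mention it at least once."""
    counts = Counter({concept: 0 for concept in concepts})
    for document in documents:
        words = set(re.findall(r"[a-z]+", document.lower()))
        for concept in concepts:
            if concept in words:
                counts[concept] += 1
    return counts

if __name__ == "__main__":
    documents = [
        "Tablet sales rise as laptop demand falls",
        "New laptop models announced",
        "Tablet and laptop compared",
    ]
    counts = concept_document_counts(documents, ["laptop", "tablet", "phone"])
    for concept, n in counts.most_common():
        print(f"{concept}: mentioned in {n} of {len(documents)} documents")
```

Counts like these are only raw material; turning them into demand curves would require dates, sources and a good deal of interpretation.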

Nevertheless, I haven't been able to find reliable sources that allow one to evaluate to what extent the above applications are really efficient or even satisfactory. An interesting page is that of James Lawson. It contains information and links on the topic, along with a series of evaluations of commercial systems such as Data Junction's Cambio (with a free downloadable demo) or Semio Map, among others.

The Institute for Autonomous Intelligent Systems (AiS) also has interesting information and an active group on the topic.

Text mining is still in its infancy. It represents an attack from another angle on the common problem of finding relevant information and surviving infoxication. It's still too early to say where the solution lies. It will probably come from a multidisciplinary approach.

Links in this issue:

http://www.infovis.net/printMag.php?num=26&lang=2  
http://allen.comm.virginia.edu/jtl5t/index.htm  
http://www.datajunction.com/  
http://www.semio.com  
http://set.gmd.de/KD/textmining.html  
© Copyright InfoVis.net 2000-2014