Text Mining 





Data mining usually deals with numeric measures organized by dimensions: Sales for different product categories, Profits from different geographies, Marks obtained by students and so on, typically laid out in tabular form. Text, on the other hand, usually comes as a set of documents, comments, reviews, emails, news items etc. Although Text data may appear quite different from conventional numeric data, many data mining techniques apply to Text Mining too. In addition, Text Mining uses several techniques of its own, owing to the unique nature of Text.


Very often Text data is transformed into a spreadsheet model of data, where each row represents a unique document, review, comment etc. and each column represents a word. The cells may hold binary values (1 or 0) to indicate the presence or absence of a word in a particular document (or comment, email etc.). Alternatively, the cells may hold numbers indicating the frequency of a word in a particular document. Once we have structured the Text data in this format, we can apply several well-known techniques like SVM, Naïve Bayes, K-means Clustering etc. to it. The resulting table is typically huge and sparse, with thousands or millions of rows and columns, and most cells equal to zero.
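The spreadsheet model described above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the function name build_matrix and the toy corpus are invented for this example, and words are split on white space only.

```python
from collections import Counter

def build_matrix(documents, binary=False):
    """Return (vocabulary, rows): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if binary:
            # 1 if the word is present in the document, 0 otherwise
            rows.append([1 if counts[word] else 0 for word in vocabulary])
        else:
            # number of times the word occurs in the document
            rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical toy corpus, invented for illustration.
documents = [
    "flight delay delay refund",
    "refund issued today",
    "great flight",
]

vocabulary, frequency_rows = build_matrix(documents)
_, binary_rows = build_matrix(documents, binary=True)

print(vocabulary)
for row in frequency_rows:
    print(row)
```

Even in this tiny example most cells are zero. Real corpora are far sparser, which is why practical tools usually store such a table in a sparse format rather than as nested lists.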


The following objectives can be served by mining Text Data.

  • Document Classification: A document can be classified into a set of predefined categories based on its attributes, such as the presence of key words, phrases etc. Hence a set of documents can be identified as Financial, Technical or Human Capital documents and routed to the respective departments or folders. 
  • Information Retrieval: This is akin to a search engine, where all documents containing certain information (a word, phrase, group of words etc.) are retrieved. This can be purely a query based on a word or group of words. It can be made more intelligent by means of synonyms or an Ontology. 
  • Document Clustering: Classification is a form of supervised learning, where we have predefined classes to put the documents into. However, we may also want to group documents into their natural groupings, where the groups are not known beforehand. In this case we may use an unsupervised learning technique like Clustering. 
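As a rough illustration of the Information Retrieval objective, the sketch below builds a simple inverted index and answers word queries against it, optionally widening a query with synonyms. The function names, the document ids and the synonym table are all hypothetical; a real search engine does considerably more (ranking, phrase matching, stemming).

```python
from collections import defaultdict

def build_index(documents):
    """Map each lower-cased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query, synonyms=None):
    """Return ids of documents containing every query word (or a listed synonym)."""
    synonyms = synonyms or {}
    result = None
    for word in query.lower().split():
        candidates = {word} | set(synonyms.get(word, []))
        matches = set()
        for candidate in candidates:
            matches |= index.get(candidate, set())
        # every query word must be satisfied, so intersect across words
        result = matches if result is None else result & matches
    return result or set()

# Hypothetical documents, invented for illustration.
documents = {
    "d1": "Quarterly profit report",
    "d2": "Server outage report",
    "d3": "Annual earnings summary",
}
index = build_index(documents)

plain_hits = search(index, "profit")
synonym_hits = search(index, "profit", synonyms={"profit": ["earnings"]})
both_words = search(index, "report outage")
print(plain_hits, synonym_hits, both_words)
```

The synonym lookup is a crude stand-in for the "more intelligent" retrieval mentioned above: the plain query finds only d1, while the synonym-aware query also finds d3.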

In addition to the above, one can perform the following on Text data.

  • Sentiment Analysis: Sentiment Analysis can be construed as a Classification exercise, where we classify documents into “positive”, “negative” or “neutral” sentiments. But the diversity of languages, writing styles, use of sarcasm etc. warrants more intelligent analysis than a purely mechanical and automated classification exercise. 
  • Frequent Mentions (Word cloud): Once we have the spreadsheet model, the easiest analysis is to find the most frequent words overall or within different categories. One can then use them as “search clues” for information retrieval as discussed above.
  • Word Association: One can find the documents or words most associated (by proximity or co-occurrence) with a “word” of interest.
  • Word Network: Somewhat similar to word association, a word network analysis can be performed to see the key themes appearing in the Text data.
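Word association by co-occurrence can be sketched as follows. This is only one simple reading of "association", assumed here for illustration: it counts how many documents contain both the word of interest and each other word. The function name and the sample comments are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(documents, target):
    """Count, for each other word, the documents it shares with `target`."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())  # each word counted once per document
        if target in words:
            counts.update(words - {target})
    return counts

# Hypothetical customer comments, invented for illustration.
comments = [
    "flight delay refund",
    "delay at gate",
    "flight delay again",
    "great crew",
]

associations = cooccurrence_counts(comments, "delay")
print(associations.most_common(1))  # the single most associated word
```

The same pairwise counts, computed for every word rather than a single target, are the usual raw material for a word network.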

Having said this, Text Analytics is an active research area, and new or more accurate techniques may well emerge. In general, the entire Text Mining exercise can be depicted by the following process.


























Document Collection: Documents could be XML files, ASCII Text, emails, Survey Inputs or a mix of them. The first step is to collect them into a repository. 


Document Standardization: Before the analysis, the Text data from varied sources needs to be standardized into a common format that the analytics software or tool can read.


Tokenization: This refers to extracting the words from the stream of characters in the Text data. Delimiters such as "white space" and punctuation are used to separate the "tokens" and extract them from the Text. 
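A minimal tokenizer might look like the sketch below. It assumes that a token is any run of letters, digits or apostrophes, which is only one of many reasonable definitions; the sample sentence is invented.

```python
import re

def tokenize(text):
    """Split text into tokens: runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

sentence = "Tower, we're cleared to land; runway 27."
tokens = tokenize(sentence)
print(tokens)

# Splitting on white space alone leaves punctuation attached to the words.
print(sentence.split())
```

Comparing the two printed lists shows why white space alone is rarely a sufficient delimiter: "Tower," and "land;" would otherwise be treated as words distinct from "Tower" and "land".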


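Stemming: Stemming (or the related technique, Lemmatization) refers to transforming words to their root form for uniformity. As an example, "landed", "landing" etc. would be transformed to their root word "land". A stemmer trims suffixes by rule and may produce non-words, whereas lemmatization maps a word to its dictionary form. In certain cases, stemming may impair the analysis and can be skipped. 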


Stop-word removal: Words like "the", "is", "are" and many others carry little value in terms of providing context or insight from Text data. Hence they are removed from the Text. Standard stop-word lists are available for each language and can be used to remove such words from the Text data. In addition, punctuation marks and (not always) numbers can also be removed so that only relevant words remain. 


Lower-case conversion: "English" and "english" mean the same thing, and Text data can be replete with such pairs. Hence the entire corpus is converted to lower-case so that they are not treated as different words. 
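The three clean-up steps above (stemming, stop-word removal and lower-case conversion) can be combined into one small function. The sketch below is deliberately naive: the stop-word list is a tiny illustrative sample, and naive_stem is a hypothetical suffix-stripping rule, not the Porter stemmer or any library algorithm. Lower-casing is applied first here so that "The" matches the stop word "the".

```python
# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "to", "of", "in", "on"}

# Naive suffix rule, assumed for illustration only.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(tokens, stem=True):
    """Lower-case the tokens, drop stop words and numbers, optionally stem."""
    lowered = [token.lower() for token in tokens]
    kept = [t for t in lowered if t not in STOP_WORDS and not t.isdigit()]
    return [naive_stem(t) for t in kept] if stem else kept

raw = ["The", "plane", "landed", "and", "is", "landing", "on", "runway", "27"]
stemmed = normalize(raw)
unstemmed = normalize(raw, stem=False)
print(stemmed)
print(unstemmed)
```

The stem flag reflects the point made above that stemming is sometimes skipped: with it switched off, "landed" and "landing" stay distinct words.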


Parts-of-speech (POS) Tagging: A repository of words tagged with their POS can be used in a multitude of ways. As an example, the most frequent "proper nouns", "base verbs" and "adjectives", when looked at in combination, can indicate the overall context or theme of the Text data. 


Term-document Matrix: This is the spreadsheet model of the data, where terms (words) are in rows and documents in columns; the transposed layout is called a Document-Term Matrix. Several results can be derived from this matrix, Word-cloud and Word-network being the obvious ones. 
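The relationship between the two layouts, and the term totals that typically feed a word cloud, can be sketched as follows. The matrix values and the function names are invented for illustration.

```python
def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of lists."""
    return [list(column) for column in zip(*matrix)]

def term_totals(terms, term_document_matrix):
    """Total frequency per term, most frequent first (ties broken by name)."""
    totals = {term: sum(row) for term, row in zip(terms, term_document_matrix)}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical Document-Term Matrix: rows are documents, columns are terms.
terms = ["delay", "flight", "refund"]
document_term = [
    [2, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]

# Transposing gives the Term-Document Matrix: rows are terms.
term_document = transpose(document_term)
ranking = term_totals(terms, term_document)
print(term_document)
print(ranking)
```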



Two measures related to Text Mining need special mention, as they are unique to this area.

  • Term frequency–inverse document frequency (tf-idf): The importance of a word can be measured by its frequency in the overall text corpus. Hence term frequency (tf) is a measure that is often used, especially in the form of a Word Cloud. However, words that appear in many documents may be commonplace and unimportant. One way to down-weight a word that is present in too many documents is to use tf-idf, which is defined as follows.

             tf-idf = tf * log(N/df), where tf is the overall frequency of a word, N is the total number of documents and df is the number of documents in which the word appears


             As an example, if “delay” appears 20 times in total and in 4 out of 10 customer emails sent to Customer Care, then (using base-10 logarithms)

             tf-idf = 20 * log(10/4) ≈ 7.96

             If “delay” appears in all the emails, then N/df is 1, the log term becomes zero and "delay" would be considered unimportant.

  • Document Similarity: We discussed Euclidean Distance in our discussion of Clustering and KNN. There are several other similarity measures used in Data Mining. Text Mining in particular uses Cosine Similarity as a measure of document similarity. It is defined as the “dot” product of two vectors divided by the product of their lengths, as follows. 

        If V1 = {X1, X2, X3} and V2 = {Y1, Y2, Y3} then 

        cos(V1, V2) = (X1*Y1 + X2*Y2 + X3*Y3) / (sqrt(X1^2 + X2^2 + X3^2) * sqrt(Y1^2 + Y2^2 + Y3^2))
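Both measures are easy to check numerically. The sketch below assumes base-10 logarithms, which is what the "delay" example appears to use, and takes cosine similarity to be the dot product divided by the product of the two vector lengths. The function names and the sample vectors are invented for illustration.

```python
import math

def tf_idf(tf, n_documents, df):
    """tf * log10(N / df), with tf counted over the whole corpus as in the text."""
    return tf * math.log10(n_documents / df)

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero (empty) documents
    return dot / (norm1 * norm2)

# "delay": 20 occurrences in total, present in 4 of 10 emails.
delay_score = tf_idf(20, 10, 4)
# The same word present in all 10 emails: log10(1) is zero.
everywhere_score = tf_idf(20, 10, 10)
print(round(delay_score, 2), everywhere_score)

# Hypothetical word-count vectors for three documents.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
no_shared_words = cosine_similarity([1, 0, 0], [0, 1, 0])
print(same_direction, no_shared_words)
```

Note that the first pair of vectors scores (up to rounding) 1 even though one document is twice as long as the other. Cosine similarity compares the mix of words rather than document length, which is one reason it is often preferred to Euclidean Distance for text.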






The following pictures show a Term-Document Matrix (TDM) and a Word cloud derived from it. The TDM is based on 8 text files containing Cockpit Voice Recorder transcripts for different airlines. Only part of the TDM is shown; the actual file has 630 relevant words (columns) after removing stop-words etc. Note how stemming has produced "activ" and "advis" in the following picture. Sometimes stemming is skipped to retain the actual words. 













































As a final remark, Text Mining is a specialized area, and the results depend heavily on the quality of the Text. For example, a properly written formal document is expected to give much better results than a colloquial, informal comment on Social Media. Misspelt words may be hard to analyze with standardized processes. 


Writing styles are so diverse that complete accuracy in areas like Sentiment Analysis is extremely difficult to achieve. The overall tone and context of a sentence may be "negative", yet that may be lost in the step-by-step processing.