
Wednesday 12 April 2017

Zipf's law for text

I haven't posted for a while; I've been busy with work-related data science topics using R. However, I'm returning to text mining for a work-related topic, and I thought I would revisit some of the things I used to do.

One fascinating topic (and the subject of my Master's dissertation) is Zipf's law. It basically says that for a text corpus there is a simple relation between the rank of a word and its frequency of occurrence. The most common word is given rank 1, the second most common rank 2, and so on. Zipf's law says that if you multiply the rank of a word by the number of times it appears, you get (roughly) a constant. In concrete terms, this means that if the most common word appears 100 times, the second most common will appear about 50 times, the third most common about 33 times, and so on.
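As a quick illustration (my own sketch in Python, not part of the original process), here's how you might compute the rank-times-frequency products for a piece of text; on a large corpus the products come out roughly constant:

from collections import Counter
import re

def rank_frequency(text):
    # Count words and sort by descending frequency; rank 1 is the most common word
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Toy input; you need a real corpus for the law to emerge clearly
for row in rank_frequency("the cat sat on the mat and the dog sat on the log"):
    print(row)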

Of course, it's not precise, and this is where it gets interesting: you can see how different texts by different authors vary. It's also possible to calculate an expected probability to see how close a real text corpus is to the law. To remind myself how to do this, I made a process here that calculates the observed and expected probabilities for a document corpus.

Here's the picture showing log(rank) against log(observed probability).



It's a log-log plot because the formula relating rank to probability is of the form

rank = K/probability

and taking the log of both sides leads to

log(rank) = log(K) - log(probability)

which is a straight line with slope -1 when log(rank) is plotted against log(probability).

The graph shows the expected probability in red and the observed probability in blue. The blue points form a reasonably straight line, which shows there is something to the law.

The process works as follows...

The process requires the Text Mining Extension, so make sure you have it installed if you want to run the process. The process points to the RapidMiner Studio license agreements on the local disk, so change the location in the "Loop Files" operator in order to run it yourself.

The "Loop Files" operator reads all the documents it finds and then calls "Process Documents" to process them. Very light tokenizing and filtering is done inside this operator, and the resulting word list feeds into the rest of the process. The word list gives the words and the number of times each appears across the whole corpus. Some further processing turns this into an example set containing observed and expected probabilities. Of particular interest are the "Normalize" operator, which turns counts into probabilities, and the "Generate Attributes" operator, which calculates an expected probability using a macro containing the number of words found.
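The process itself is built in RapidMiner's visual environment, but for readers who prefer code, here is a rough Python equivalent of the same steps (a sketch only: the directory path is a placeholder, and the expected probability uses the standard pure-Zipf form with a harmonic-number normalisation, which I'm assuming matches what the macro computes):

from collections import Counter
from pathlib import Path
import re

corpus_dir = Path("licenses")  # placeholder; point this at your own documents

# "Loop Files" + "Process Documents": read each file and do very light tokenizing
counts = Counter()
for path in corpus_dir.glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    counts.update(re.findall(r"[a-z]+", text.lower()))

total = sum(counts.values())                 # total word occurrences
n = len(counts)                              # number of distinct words found
harmonic = sum(1.0 / r for r in range(1, n + 1))

# "Normalize" + "Generate Attributes": observed and expected probabilities
rows = []
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    observed = count / total
    expected = 1.0 / (rank * harmonic)       # pure Zipf: p(rank) = (1/rank) / H_n
    rows.append((rank, word, observed, expected))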

The plot can be recreated using the advanced plotting capabilities in RapidMiner Studio.
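If you'd rather plot outside RapidMiner, a few lines of matplotlib will draw the same picture (a sketch reusing the rows list from the snippet above, with probability plotted against rank in the conventional way):

import matplotlib.pyplot as plt

ranks = [rank for rank, _, _, _ in rows]
observed = [obs for _, _, obs, _ in rows]
expected = [exp for _, _, _, exp in rows]

plt.loglog(ranks, observed, "b.", label="observed")   # blue points
plt.loglog(ranks, expected, "r-", label="expected")   # red line
plt.xlabel("rank")
plt.ylabel("probability")
plt.legend()
plt.show()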

From here, more advanced things can be done, such as measuring differences between authors and texts, although care is needed to make sure the different texts are comparable, to avoid drawing slightly wrong conclusions. It's also possible to fit a different law to the distribution of words. One such law is the Zipf-Mandelbrot modification, which adds some additional parameters and which you can read about here (shameless plug).
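For the curious, the Zipf-Mandelbrot form generalises 1/rank to c/(rank + beta)^alpha, and fitting it is straightforward with scipy (again a sketch, using the textbook form of the law rather than anything taken from the paper linked above):

import numpy as np
from scipy.optimize import curve_fit

def zipf_mandelbrot(rank, c, beta, alpha):
    # Zipf-Mandelbrot: probability is proportional to 1 / (rank + beta)^alpha
    return c / (rank + beta) ** alpha

ranks_arr = np.asarray(ranks, dtype=float)
obs_arr = np.asarray(observed, dtype=float)
params, _ = curve_fit(zipf_mandelbrot, ranks_arr, obs_arr, p0=[0.1, 2.0, 1.0])
print("c=%.4f, beta=%.2f, alpha=%.2f" % tuple(params))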

In summary, I recreated the process in about half an hour. This shows how easy it can be to create powerful data mining processes using RapidMiner Studio without needing to write software.