

Friday, 5 April 2013

Finding text needles in document haystacks

I recently had to find how many times a sentence occurred within a large set of documents. Rather than use a search tool or write some software, I used RapidMiner.

Here is the bare-bones XML of the process, with pictures to help explain (the numbers are shown by clicking on the operator execution order within the RapidMiner GUI).

The basic elements are:
  1. A document is created to contain the text-to-look-for - the text needles.
  2. A word list is created from this document using the process documents operator.
  3. A document containing the text to search through is created - the document haystack.
  4. This document is processed and only the provided word list items are included in the resulting document vector. The operator is set to output term occurrences, so the end result is a count of the number of times the text-to-look-for appeared in the document.
There are some points to note.

The text-to-look-for is given as the parameters to the first create document operator (labelled 1 above), shown here.

The document to look in (labelled 3 above) contains a fragment of text copied from page 391 of the RapidMiner manual.

The first process documents operator (labelled 2) itself contains the following operators.

The tokenize operator simply treats any character other than an alphanumeric or a space as a token boundary. This has the effect of turning each of the provided phrases into a single valid token. The replace tokens operator then replaces every space with an underscore, to match what the n-gram generation operator will produce later.
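For readers who prefer code to operator diagrams, here is a minimal Python sketch of what this first sub-process does. It is not the RapidMiner process itself: the function name `build_word_list`, the regular expression, and the sample needle text are all assumptions made for illustration, and the sketch trims stray spaces around each phrase, which may not match how the operators behave.

```python
import re


def build_word_list(needle_text):
    """Turn a block of needle phrases into underscore-joined terms.

    Assumption: the pattern below approximates "anything but
    alphanumeric and space is a token boundary".
    """
    raw_tokens = re.split(r"[^A-Za-z0-9 ]+", needle_text)
    word_list = []
    for token in raw_tokens:
        # split() drops leading/trailing spaces (a simplification
        # made in this sketch) and separates the words of the phrase.
        words = token.split()
        if words:
            # Mirror the replace tokens step: spaces become underscores.
            word_list.append("_".join(words))
    return word_list


# Hypothetical needles, separated here by punctuation.
needles = "process documents, word list; tokenize"
print(build_word_list(needles))
```

Each multi-word phrase survives as one term, which is what lets it be matched later against the combined tokens.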

The final process documents operator (labelled 4) contains the following operators.

This operator also tokenizes the text, but because it uses the word list from the previous process documents operator, only the items in that list are kept in the final example set, once the generate n-gram operator has combined adjacent tokens into underscore-joined terms.
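The same idea can be sketched in Python: tokenize the haystack, join runs of adjacent tokens with underscores, and count only the terms that appear in the word list. This is an illustration under assumptions rather than a description of RapidMiner's internals: the names `generate_ngrams` and `count_needles` are hypothetical, tokens are assumed to be runs of alphanumeric characters, matching is case-sensitive, and the maximum n-gram length is inferred from the longest needle.

```python
import re
from collections import Counter


def generate_ngrams(tokens, max_length):
    """Return every run of 1..max_length adjacent tokens, joined by underscores."""
    terms = []
    for n in range(1, max_length + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


def count_needles(haystack, word_list):
    """Count how often each word-list term occurs in the haystack text."""
    # Assumption: a token is a run of alphanumeric characters.
    tokens = re.findall(r"[A-Za-z0-9]+", haystack)
    # The longest needle decides how many tokens need to be combined.
    max_length = max((term.count("_") + 1 for term in word_list), default=1)
    wanted = set(word_list)
    counts = Counter(
        term for term in generate_ngrams(tokens, max_length) if term in wanted
    )
    # Report a count for every needle, including those never found.
    return {term: counts[term] for term in word_list}


# Hypothetical haystack and word list.
haystack = "The word list is built first. The word list then filters the tokens."
print(count_needles(haystack, ["word_list", "tokens"]))
# prints {'word_list': 2, 'tokens': 1}
```

Terms that are not in the word list are generated but thrown away, so the output stays as small as the list of needles however large the haystack is.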

The end result is shown below.

The end result shows how many times the text appears in the document.

One advantage of this approach is that it seems to execute very quickly.
