Wednesday 29 October 2014

Using Groovy to extract the last part of a folder structure

Imagine you are using "Loop Files" to find files one by one and import them, perhaps using the "Read CSV" operator. The "Loop Files" operator provides macros such as file_path, file_name and so on, which let you attach meta data to each example set.

Suppose you have a folder structure like this:
c:\users\andrew\bigdata\lotsofdata\subregion\
where there are many different subregions and each one contains many files. It makes sense to label every file with its subregion. This can be done using the folder name, which is contained in the parent_path macro provided by the "Loop Files" operator. The full path contains a lot of redundant information that it would be sensible to get rid of. I suppose that would be possible using some heavy combination of macro and attribute manipulation operators, but I decided to write some Groovy to do it instead. The resulting script is simple.
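// Read the folder path that "Loop Files" stores in the parent_path macro
String filePath = operator.getProcess().getMacroHandler().getMacro("parent_path")
// Split on "\" and keep the last token, i.e. the innermost folder name
String lastPart = filePath.tokenize('\\').last()
// Publish the result as a new macro called subregion
operator.getProcess().getMacroHandler().addMacro("subregion", lastPart)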
It assumes a macro called parent_path which contains the folder name. The tokenize method splits this into tokens separated by "\" and the last one is returned using the last method. Because tokenize skips empty tokens, the trailing "\" in the example path does no harm. A macro called subregion is then created. This can be used like any other macro, for example as %{subregion}.
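The tokenize-and-last idea can be tried outside RapidMiner as plain Groovy. This is only a sketch: the helper name lastFolder is made up for illustration, and accepting "/" as well as "\" and returning an empty string for an empty path are assumptions that go beyond the original script.

```groovy
// Hypothetical helper (not part of RapidMiner): returns the innermost folder name.
String lastFolder(String path) {
    // tokenize treats every character of its argument as a delimiter and
    // skips empty tokens, so a trailing separator is harmless.
    List<String> tokens = (path ?: '').tokenize('\\/')
    // Assumption: fall back to an empty string when there is no folder name at all.
    return tokens ? tokens.last() : ''
}

println lastFolder('c:\\users\\andrew\\bigdata\\lotsofdata\\subregion\\')
```

The empty-path guard matters because calling last() on an empty list throws an exception, which would stop the process if parent_path were ever empty.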

Saturday 25 October 2014

Windowing and Processing Documents

The Text Processing extension contains an operator called "Window Document". It takes a document that has been split into tokens (typically words) and creates a collection of new documents from it. Each new document contains a fixed number of tokens, set by the "window length" parameter, and the distance the window advances through the document each time is set by the "step size" parameter. A meta data attribute called "window" is created for each new document; it records the position of that window within the original document.

So for example, this text

"The cat sat on the mat"

could be split into three windows each of size two if window length is set to two and step size is set to two.

window: 0 - "The cat"
window: 2 - "sat on"
window: 4 - "the mat"
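
To make the interplay of window length and step size concrete, here is a rough Groovy sketch of the same idea. It is not the operator's implementation: the function name windowTokens is made up, and dropping an incomplete window at the end of the document is an assumption based on each window holding a fixed number of tokens.

```groovy
// Hypothetical sketch of fixed-length windowing; not the "Window Document" source code.
Map<Integer, String> windowTokens(List<String> tokens, int windowLength, int stepSize) {
    if (windowLength < 1 || stepSize < 1) {
        throw new IllegalArgumentException('windowLength and stepSize must be at least 1')
    }
    // Keys are the index of the first token in each window, as in the listing above.
    Map<Integer, String> windows = [:]
    // Assumption: a trailing window with fewer than windowLength tokens is dropped.
    for (int start = 0; start + windowLength <= tokens.size(); start += stepSize) {
        windows[start] = tokens[start..<(start + windowLength)].join(' ')
    }
    return windows
}

def words = 'The cat sat on the mat'.tokenize(' ')
windowTokens(words, 2, 2).each { start, text -> println "window: ${start} - \"${text}\"" }
```

With a step size smaller than the window length the windows overlap, so a window length of two and a step size of one would give five windows for the same sentence.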

Here's a simple process that illustrates windowing and processing. It's worth noting that the "Process Documents" operator is able to take a collection of documents as input. Note that the process uses version 6.1 of RapidMiner Studio so some manual version number editing would be needed to run it in older versions. Note too that you must have the Text Processing extension installed.

The process illustrates a tiny pitfall for the unwary. If one of the tokens is "window" and the parameter "add meta information" is set to true for the "Process Documents" operator, the resulting example set contains an attribute with the name "window_0". This is because the meta data for the window creates a special attribute called "window" in the final example set, and this would clash with the attribute corresponding to the token. If "add meta information" is set to false, the attribute corresponding to the token is simply called "window". In other words, the name of an attribute in the example set changes depending on the setting of a parameter, which can cause problems for anything later in the process that refers to that attribute by name.

It's a very small point but I happened to stumble over it recently as I was preparing my contribution to an upcoming text mining book. Here's a teaser because it looks nice :). It compares three novels by Jane Austen and shows how the shape of the word frequencies varies for consecutive windows through the books.

The red line is for Mansfield Park, the blue is for Sense and Sensibility and the green is for Pride and Prejudice.


Monday 6 October 2014

RapidMiner Resources advanced videos

After a bit of work, I'm pleased to say I've completed the RapidMiner Resources advanced videos and they'll be available on the RapidMiner Resources site soon.

I maintain meta data about the videos and operators and, for fun, I've made a process using this data and a new operator I've discovered called "Transition Graph". This is a candidate for "operators that deserve to be better known" because it allows pretty graphs to be drawn.

The meta data I keep records the main operators each video uses as well as the overall running time of the video and the course which it is classified as. Here's a process that takes this data and allows different graphs to be drawn to show which operators are used in which video as well as which video uses which operator.

A brief note on the names - I've prepended "o" for operator and "v" for video to make things clear.
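
To show why the prefixes help, here is a small Groovy sketch of turning video and operator pairs into graph edges. Everything in it is illustrative: the rows are invented, the real CSV layout is not shown here, and the exact prefix format ("v " and "o " with a space) is an assumption.

```groovy
// Hypothetical rows; the real meta data file has its own columns and contents.
def rows = [
    [video: 'Macros',     operator: 'Generate Macro'],
    [video: 'Macros',     operator: 'Set Macro'],
    [video: 'Loop Files', operator: 'Loop Files'],
    [video: 'Loop Files', operator: 'Generate Macro']
]

// Prefix the names so a video and an operator with the same name stay separate nodes.
def edges = rows.collect { [source: 'v ' + it.video, target: 'o ' + it.operator] }

// Count how many videos point at each operator node.
def videosPerOperator = edges.countBy { it.target }

// Collect the distinct nodes of the graph.
def nodes = (edges*.source + edges*.target) as Set

println videosPerOperator
println nodes
```

Without the prefixes, the invented "Loop Files" video and the "Loop Files" operator would collapse into a single node and the graph would be misleading.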

Here is a graph showing that the "Generate Macro" operator is important in 5 videos.


Here is another graph that shows the most important operators used by the video called "Macros".

Here's another that shows which operators are covered by each course and what overlap there is.

The process reads a CSV file (here) to generate these graphics. Of course, as time goes on, I will add new videos so the data in the process is a snapshot as at early October 2014. Nonetheless, please feel free to download the process and data and play around with the results to see the videos I have created and the operators that are covered.

The next videos to do are about text mining...