Thursday 21 July 2011

Ignoring many attributes

The "set role" operator has a nice feature: the role of an attribute can be set to free text rather than only to one of the predefined roles.

If you set the role to something like "ignore", subsequent modelling operators will not process the attribute.

Here's an example that creates some fake data with four attributes and then sets three of them to be ignored. The clustering operator then performs more poorly because it only sees the one remaining attribute.

Note that the roles have to be different, so in this case they are "ignore01", "ignore02" and "ignore03". If you set them all to "ignore", an error occurs.

Note too that the "set additional roles" dialog is a bit fiddly, as it loses focus after each character is typed, but it does work.
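When many attributes need to be ignored, making up distinct role names by hand gets tedious. The following is a minimal sketch in plain Java (it does not use the RapidMiner API; the class name, method name and attribute names are all made up for illustration) that generates unique role names in the "ignore01", "ignore02" style, which could then be entered into the operator.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of RapidMiner: pairs each attribute
// name with a distinct free-text role such as "ignore01".
class IgnoreRoleNames {

    static Map<String, String> uniqueIgnoreRoles(List<String> attributeNames) {
        Map<String, String> roles = new LinkedHashMap<>();
        int counter = 1;
        for (String name : attributeNames) {
            // A zero-padded counter keeps every role name different.
            roles.put(name, String.format("ignore%02d", counter++));
        }
        return roles;
    }

    public static void main(String[] args) {
        // "att2" to "att4" are made-up attribute names for illustration.
        List<String> toIgnore = Arrays.asList("att2", "att3", "att4");
        uniqueIgnoreRoles(toIgnore)
                .forEach((attribute, role) -> System.out.println(attribute + " -> " + role));
    }
}
```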

Thursday 7 July 2011

Using regular expressions with the Replace (Dictionary) operator

The "Replace (Dictionary)" operator replaces occurrences of nominal values in one example set with replacement values looked up from a second, dictionary example set.

By default, it replaces any matching contiguous sequence of characters, regardless of its position within the nominal value.

For example, if an attribute in the main example set contains the value "network" and the dictionary example set contains the value pair "work", "banana", the result of the operation would be "netbanana".
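The same behaviour can be seen with ordinary Java string replacement. This is only an illustration of substring replacement, not the operator's actual implementation; the class and method names are invented for the example.

```java
// Plain Java illustration, not the operator's actual implementation.
class SubstringReplaceSketch {

    // Replaces every occurrence of 'from' in 'value', wherever it appears.
    static String replaceAnywhere(String value, String from, String to) {
        return value.replace(from, to);
    }

    public static void main(String[] args) {
        // "work" is found inside "network", so the result is "netbanana".
        System.out.println(replaceAnywhere("network", "work", "banana"));
    }
}
```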

This is fine, but if you want to limit replacement to whole words only, you can use the "use regular expressions" parameter of the replace operator. To make this work, you also have to change the text in the dictionary so that the nominal to be replaced has "\b" at the beginning and at the end. In regular expression speak, "\b" is a word boundary, so wrapping a word in it means match a whole word only.

In addition, if the word to be replaced contains reserved characters (from a regular expressions perspective) then "\Q" and "\E" have to be placed around the word.

One way to do this is to use a "generate attributes" operator and create a new attribute in the dictionary example set using the following expression.

"\\b\\Q"+word+"\\E\\b"

In this case, "word" is the attribute containing the word to be replaced. The "\" must be escaped with an additional "\" in order for it all to come out correctly.

The new attribute would then be used in the "from attribute" parameter of the "Replace (Dictionary)" operator. The "to attribute" parameter would be set to the attribute in the dictionary example set that holds the replacement value.
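To see what the generated pattern does, here is a minimal sketch using Java's own regular expression classes, which understand the same "\b", "\Q" and "\E" syntax. It is not RapidMiner code; the class and method names are assumptions made for the example, and the pattern is built the same way as the expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: builds the same kind of pattern as the
// "generate attributes" expression and applies it to a string.
class WholeWordReplaceSketch {

    // Word boundary, quoted literal word, word boundary.
    static String wholeWordPattern(String word) {
        return "\\b\\Q" + word + "\\E\\b";
    }

    static String replaceWholeWord(String text, String word, String replacement) {
        return Pattern.compile(wholeWordPattern(word))
                .matcher(text)
                // quoteReplacement stops "$" or "\" in the replacement being interpreted.
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // "work" inside "network" is not a whole word, so nothing changes.
        System.out.println(replaceWholeWord("network", "work", "banana"));
        // Here "work" stands alone and is replaced.
        System.out.println(replaceWholeWord("hard work pays", "work", "banana"));
        // The "." is quoted, so "a.b" matches literally and "axb" is left alone.
        System.out.println(replaceWholeWord("a.b axb", "a.b", "X"));
    }
}
```

Two caveats apply to this construction. "\b" only matches between a word character and a non-word character, so a word that starts or ends with punctuation may not match as expected. Also, a word that itself contains "\E" would end the quoting early; in plain Java, Pattern.quote handles that case.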

Wednesday 6 July 2011

Initial notes about installing RapidAnalytics in the Cloud and locally

Taking advantage of the Amazon EC2 year-long free trial, I installed RapidAnalytics in their Cloud on a micro instance running Ubuntu 10.10 with MySQL. Port 8080 needs to be opened in the instance firewall to allow incoming requests to the RapidAnalytics web page.

Installation of the Ubuntu desktop, a VNC server and Java was the time-consuming part. Note that installing Java does not work on a micro instance; there is a known error. The workaround is to run on a small instance and install Java in that environment. Having done that, the image can be saved and re-run on a micro instance.

The memory available in the micro instance is insufficient for the default JBoss settings. Reducing the memory setting to 512M allows everything to start, but after 30 minutes it does not run properly, with many timeout-like errors.

Sad conclusion: the micro instance is too small - this is a shame since it means free Cloud practice is not possible.

Changing to a small instance allows things to start, and after about 5 minutes RapidAnalytics is up and usable. This costs money - not a lot - but enough for my mean streak to kick in. A medium instance would presumably start more quickly - it would cost a bit more so I didn't try it.

The next step is the IP connectivity that lets a local RapidMiner client find and use the server, although money is likely to be required to assign an IP address that can be seen on the Internet.

Conclusion: RapidAnalytics will work in the Cloud but some cost-conscious people will choose to install and play on local machines.

Installation on a laptop running XP SP3 and SQL Server is also OK, as is installation on a 64-bit laptop running Windows 7 Enterprise with MySQL. Note that some SQL Server components steal port 8080, necessitating a change of the JBoss port to something like 8081 in the server.xml file contained in the folder rapidAnalytics\rapidanalytics\server\default\deploy\jbossweb.sar.
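Before editing server.xml it can help to confirm which ports are actually taken. The following is a rough sketch in plain Java, unrelated to RapidAnalytics or JBoss themselves, that simply tries to bind each port; the class and method names are invented for the example, and a failed bind is only assumed to mean that another program is already listening.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Diagnostic sketch: checks whether a local TCP port can be bound.
class PortCheckSketch {

    // Returns true if this process can bind the port, false if binding fails
    // (typically because another program is already listening on it).
    static boolean isPortFree(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 8080 is the default mentioned above, 8081 the suggested alternative.
        for (int port : new int[] {8080, 8081}) {
            System.out.println("Port " + port + (isPortFree(port) ? " is free" : " is in use"));
        }
    }
}
```

Run this while RapidAnalytics is stopped; otherwise its own listener will make the port show as in use.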

The next step will be to find a way to back up a RapidAnalytics installation and restore it on a different machine.