
Sunday 28 July 2013

Aggregating attributes with parentheses

I stumbled upon a feature of the Aggregate operator just now that took me far too long to understand; I should have known better. In the spirit of altruism, I hope the following post will save others a bit of time.

It's well known that if attribute names contain parentheses or certain other mathematical symbols, the "Generate Attributes" operator will have problems, because its expression parser will try to interpret a name like "prediction(label)" as a function call. Users can get frustrated by this, but it's easy to work around simply by renaming the attributes. Changing this in the RapidMiner product would be extremely disruptive to backward compatibility, so we have to live with it.
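As an illustration of the workaround, a few lines in an "Execute Script" operator will strip the offending characters. This is a minimal sketch assuming the RapidMiner Groovy scripting API, not a polished rename.

```groovy
import com.rapidminer.example.Attribute;
import com.rapidminer.example.ExampleSet;

// Replace parentheses in every attribute name with underscores,
// e.g. "prediction(label)" becomes "prediction_label_".
ExampleSet exampleSet = input[0];
for (Attribute attribute : exampleSet.getAttributes()) {
    attribute.setName(attribute.getName().replaceAll("[()]", "_"));
}
return exampleSet;
```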

I discovered that the "Aggregate" operator behaves similarly. The following illustrative process builds a model on the Iris data set and then applies it to the original data (purists will wince at the overfitting). The process then aggregates by the attributes "label" and "prediction(label)" and counts the number of examples for each combination. The process also aggregates using a renamed attribute without the parentheses. I have selected "count all combinations", so with three label values and three predicted values I am expecting to see 9 rows in the output.

The first output looks like this.

[Screenshot: the first aggregate output]
Notice how the "prediction(label)" attribute is missing.

The second output looks like this.

[Screenshot: the second aggregate output]
Now we see all 9 expected rows (and continue to wince at the overfitting).

Unfortunately, there is no warning message about the missing attribute in the first case, which probably explains why it took me a while to understand what the problem was. Arguably this is a bug, but I subscribe to the view that the only issues that matter are the ones you don't know about; we know about this one, so we can work around it.

As an aside, I have written a little Groovy script that bulk renames attributes to a standard form; crucially, it also outputs a second mapping example set which can be stored so the renaming can be reversed later. It's a bit rough and ready, and time prevents me from polishing it enough to feel good about posting it in full.
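To give the flavour anyway, here is a minimal sketch of the idea rather than the script itself. It renames the attributes to att_1, att_2 and so on, and builds a second example set recording the mapping; it assumes a version of "Execute Script" that can return multiple outputs as a list.

```groovy
import com.rapidminer.example.Attribute;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.example.table.AttributeFactory;
import com.rapidminer.example.table.DoubleArrayDataRow;
import com.rapidminer.example.table.MemoryExampleTable;
import com.rapidminer.tools.Ontology;

ExampleSet exampleSet = input[0];

// Attributes for the mapping example set that records old -> new names.
Attribute oldName = AttributeFactory.createAttribute("old_name", Ontology.NOMINAL);
Attribute newName = AttributeFactory.createAttribute("new_name", Ontology.NOMINAL);
MemoryExampleTable mappingTable = new MemoryExampleTable([oldName, newName]);

int counter = 0;
for (Attribute attribute : exampleSet.getAttributes()) {
    String safeName = "att_" + (++counter);
    double[] row = new double[2];
    row[0] = oldName.getMapping().mapString(attribute.getName());
    row[1] = newName.getMapping().mapString(safeName);
    mappingTable.addDataRow(new DoubleArrayDataRow(row));
    attribute.setName(safeName);
}

// Store the mapping example set so the renaming can be reversed later.
return [exampleSet, mappingTable.createExampleSet()];
```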

Sunday 21 July 2013

Scaling attribute values using weights

Here's a process that multiplies each value of an attribute within one example set by a constant taken from another example set. The constants are specific to each attribute and are weights derived from the data itself. In effect, the example set is being multiplied by a diagonal matrix of weights.

At a high level, the process works as follows.

  1. The Iris data set is used, with weights produced by "Weight by Information Gain".
  2. These weights are transformed into an example set and stored for later use inside a Loop operator.
  3. A subprocess is used to make sure everything happens in the right order (this technique is also used inside the Loop).
  4. A "Loop Attributes" operator iterates over all attributes and generates a new attribute by multiplying each existing value by a weight; the attribute name must be present in the weights example set.
  5. The weight for each attribute is found by a combination of filtering and macro extraction (a Groovy sketch of the equivalent calculation appears after this list).
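For those who prefer script to operator chains, the calculation the Loop performs amounts to the following Groovy sketch (not the process itself). It assumes an "Execute Script" operator whose second input is the AttributeWeights object produced by "Weight by Information Gain", rather than the example set of weights the process actually uses.

```groovy
import com.rapidminer.example.Attribute;
import com.rapidminer.example.AttributeWeights;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;

ExampleSet exampleSet = input[0];
AttributeWeights weights = input[1];

// Scale every regular attribute by its weight: in effect, multiplication
// by a diagonal matrix of weights.
for (Example example : exampleSet) {
    for (Attribute attribute : exampleSet.getAttributes()) {
        double weight = weights.getWeight(attribute.getName());
        example.setValue(attribute, example.getValue(attribute) * weight);
    }
}
return exampleSet;
```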

Monday 15 July 2013

De-normalizing

Here's a process to reverse the effects of normalizing. The key point is that the Normalize operator produces a preprocessing model that can be applied to an unseen example set; this is important when making the attribute ranges the same in training and test data.

The De-Normalize operator takes a normalization model as input and inverts it, so that applying the result to a normalized example set produces a de-normalized version.
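For a 0-1 range transformation, the arithmetic being reversed is x' = (x - min) / (max - min), so de-normalizing computes x = x' * (max - min) + min for each attribute. The following Groovy sketch of that reversal is purely illustrative; the per-attribute ranges are hypothetical and hard-coded, whereas the De-Normalize operator reads them from the model.

```groovy
import com.rapidminer.example.Attribute;
import com.rapidminer.example.Example;
import com.rapidminer.example.ExampleSet;

// Hypothetical [min, max] ranges captured before normalization.
def ranges = ["a1": [4.3d, 7.9d], "a2": [2.0d, 4.4d]];

ExampleSet exampleSet = input[0];
for (Example example : exampleSet) {
    for (Attribute attribute : exampleSet.getAttributes()) {
        def range = ranges[attribute.getName()];
        if (range != null) {
            // Reverse x' = (x - min) / (max - min) => x = x' * (max - min) + min
            example.setValue(attribute, example.getValue(attribute) * (range[1] - range[0]) + range[0]);
        }
    }
}
return exampleSet;
```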

In the process, the result is the Iris data set, which is identical to the original.