Saturday 6 May 2023

Reading more examples than your licence allows

Recently, I found a way to read more examples than your licence allows.

With the free version of RapidMiner Studio, example sets are limited to 10,000 rows. Using the Python or R scripting operators, it is of course possible to read more than this, but as soon as the example sets are returned to RapidMiner, the licence limit is imposed.

However, if the data is split into batches of at most 10,000 rows, it is possible to place these batches into a collection. Common processing can then be applied to each batch by using a Loop Collection operator.

Of course, if you append the collection entries and the result is greater than your licence limit, the limit will be imposed again.

The Python code looks a bit like this.

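import pandas

# RapidMiner's Execute Python operator calls rm_main and maps each returned item to an output port
def rm_main():
    df = pandas.read_csv('mybigdata.csv')
    batch1 = df.iloc[0:10000]      # rows 0 to 9,999
    batch2 = df.iloc[10000:20000]  # rows 10,000 to 19,999
    return batch1, batch2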

Make sure you connect two outputs from the Python operator to a Collect operator and you will have 20,000 rows in your collection consisting of 2 x 10,000 rows.
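
For files longer than 20,000 rows, the slicing can be generalised. The following is only a sketch: batch_bounds is a helper name I have made up for illustration, and it assumes the operator has as many connected output ports as there are batches, which I have not checked beyond the two-batch case above.

```python
def batch_bounds(n_rows, batch_size=10000):
    """Return (start, stop) row offsets that cover n_rows in steps of batch_size."""
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]


def rm_main():
    # Imported here so the helper above has no dependencies of its own.
    import pandas

    df = pandas.read_csv('mybigdata.csv')
    # One DataFrame per batch; each one needs its own connected output port (assumption).
    return tuple(df.iloc[start:stop] for start, stop in batch_bounds(len(df)))
```

The last batch is simply shorter when the row count is not a multiple of 10,000.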

I could have written the whole thing in Python of course.

Needless to say, RapidMiner might get upset with such breaches of their licensing, so you should not use this unless you are willing to accept the consequences.

Wednesday 8 June 2022

Fetching stock data using a parameterised Execute R operator

I'm currently delivering data science lectures at the University of Chichester and RapidMiner is part of what I use to teach. And very good it is too. I recently found myself helping my students to get some up-to-date stock market data. Rather than manually downloading this, I thought I would use RapidMiner with the tidyquant R package and do it automatically. The Finance and Economics Extension seems to be out of date, so it isn't an option.

My idea was to define a list of stock symbols such as "AAPL", "BTC-USD" and so on and run the Execute R operator in a loop with each symbol individually.

It turns out there isn't a way to parameterise the Execute R operator, so I had to invent one.

Basically, I use the Loop Parameters operator to set multiple values for a macro located inside it. This macro is used to create a one-row example set containing the value of the macro. This example set is then passed to the Execute R operator, where the R script reads it as a parameter that drives the rest of the script. It's clunky but it works.
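
The actual script is in R and lives in the linked process, so it is not reproduced here. The same idea can be sketched for the Execute Python operator instead, to match the Python example in the post above. Treat it as an illustration only: the column name symbol and the helper read_symbol are my own assumptions, and the fetching step is left as a comment rather than guessed.

```python
def read_symbol(params):
    """Return the value in the first row of the 'symbol' column (assumed column name)."""
    return str(params['symbol'].iloc[0])


def rm_main(params):
    # params is the one-row example set built from the macro, received as a DataFrame.
    import pandas

    symbol = read_symbol(params)
    # A real script would fetch the prices for `symbol` at this point.
    # This sketch only passes the symbol back so the loop wiring can be tested.
    return pandas.DataFrame({'symbol': [symbol]})
```

Each iteration of Loop Parameters then runs the script with a different one-row input.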

This approach could be adapted to allow R scripts to be run as part of a more complex modelling process. Relatively tough to do but feasible.


Here's a link to the process.

You'll need the R Scripting extension and you will also need to ensure that R is running on your machine with the data.table, tidyverse and tidyquant R packages all installed.

If RapidMiner enhances the Execute R operator to take parameters (which would be a good enhancement), then this workaround will no longer be needed.


Wednesday 25 May 2022

Parties at 10 Downing Street in 33 words

The Sue Gray report was published today. I made a word cloud of some of the more frequent words to try and summarise what it's about. 

This one uses 33 words and seems to do a reasonable job.