Wednesday 22 November 2017

Enriching flowfiles in Apache NiFi using Mongo

I am using NiFi a lot for scalable processing of large data flows. NiFi is powerful, but there are frustrations with what should be the simplest of activities. These foibles betray a slight lack of maturity in the product.

Anyway, to help the World, I will describe what I had to do to solve the common problem of enriching JSON flow files containing some form of id with a looked up text value.

For example, if there is a small fragment of JSON like so...

{
    "Id": 7
}

and we want to enrich it by looking up the value 7 and adding a new field Name with the value Me

{
    "Id": 7, "Name": "Me"
}
   
then we can do this in many ways using various lookup services. The approach I chose involved reading from a Mongo database containing the Id and Name value pairs, and I found most of what I needed here
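To illustrate the enrichment itself outside NiFi, here is a minimal Python sketch. The `lookup` dictionary is a stand-in for the Mongo collection of Id/Name pairs; this is not how NiFi implements it internally.

```python
import json

# Stand-in for the Mongo collection of Id/Name pairs
lookup = {"7": "Me"}

def enrich(flowfile_content, lookup):
    """Add a Name field to a JSON record by looking up its Id."""
    record = json.loads(flowfile_content)
    name = lookup.get(str(record["Id"]))
    if name is not None:
        record["Name"] = name
    return json.dumps(record)

print(enrich('{"Id": 7}', lookup))  # {"Id": 7, "Name": "Me"}
```

If the Id has no match, the record passes through unchanged rather than failing, which mirrors the behaviour you usually want in an enrichment step.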

The issue I found was that the type of the Id field in the Mongo database must be a string for everything to work correctly. Naively using an integer causes the matching to fail (not mismatch, which is another issue), and there seems to be no way to work around this in the NiFi processor's parameter settings; it always assumes a string when querying.
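The type mismatch is easy to see with plain Python dictionaries standing in for the Mongo collection: a string query never matches an integer key.

```python
# Keys stored as integers (as a naive import into Mongo might do)
as_integers = {7: "Me"}
# Keys stored as strings (what the lookup effectively queries against)
as_strings = {"7": "Me"}

query = "7"  # the lookup always queries with a string

print(as_integers.get(query))  # None - no match against the integer key
print(as_strings.get(query))   # Me   - matches once Ids are stored as strings
```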

So, it's a bit of a rough edge, although in NiFi's defence the feature I am using is quite new. Hopefully this mini blog entry will help.


Wednesday 8 November 2017

Evolutionary process that helped me to win $1000

As the winner of the recent RapidMiner farming on Mars competition, I thought I would post the evolutionary process that led me to the best solution.

I've tidied it up a bit; in fact there are four processes. The main one calls the others using "Execute Process", which keeps common processing in one location to avoid errors.

The four processes are:

  1. EvoExample.rmp
  2. ReadAllData.rmp
  3. FilterHourAndSelectAttributes.rmp
  4. ImportData.rmp

When saving these, ensure the names are as above and that they are all saved in the same repository. It's also important to point the processes at the locations of the training and test files. Download the training data from here and the test data from here. Unzip in the normal way and enter the locations into the "ReadAllData" process by changing the macros associated with the "Execute Process" operator that runs the "ImportData" process.

When all the dust has settled, run the "EvoExample" process and observe the log output that writes a row each time a test has been performed with the specific settings of hour and misclassification cost.

These two parameters are buried in the depths of the process; the evolutionary process chooses values for them and determines how they affect performance.

The process has a couple of interesting features. Firstly, the performance is extracted from a calculation that matches the scoring used in the competition; the "Extract Performance" operator is used to do this. Secondly, the process shows a way of varying the value of a macro inside the evolutionary operator. This is a bit of a hack: it uses the varying parameter of the evolutionary operator to control the example set made by a "Generate Data" operator, and then uses other operators to extract details of this dummy example set into a macro. Maybe there's an easier way; I couldn't find one.
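The underlying idea — an evolutionary loop proposing values for hour and misclassification cost and keeping whichever scores best — can be sketched in Python. The `score` function below is a made-up stand-in; in the real process the score comes from training and testing a model with those settings.

```python
import random

def score(hour, cost):
    # Hypothetical stand-in for the competition scoring; in the real
    # process this value comes from a train/test run with these settings.
    return -((hour - 14) ** 2) - ((cost - 2.5) ** 2)

random.seed(42)
best = (random.randint(0, 23), random.uniform(0.1, 10.0))
best_score = score(*best)

for generation in range(200):
    # Mutate the best settings found so far, clamped to their valid ranges
    hour = min(23, max(0, best[0] + random.choice([-1, 0, 1])))
    cost = min(10.0, max(0.1, best[1] + random.uniform(-0.5, 0.5)))
    candidate_score = score(hour, cost)
    if candidate_score > best_score:  # keep the better candidate
        best, best_score = (hour, cost), candidate_score

print(best, round(best_score, 3))
```

Each iteration corresponds to one logged row in the RapidMiner process: a candidate setting of the two parameters and the performance it achieved.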

The contest did reveal an issue with the way nominal values are read and assigned, which shows up when running in Windows and Linux environments. I hope this will lead to an enhancement to allow more control to be exerted. In the meantime, the process tries to get around this by sorting and by being explicit in assigning positive and negative labels. Despite this, there is a chance that different environments will yield different results. The interested reader is referred to this process.

Have fun!...

Monday 6 November 2017

R packages and Shiny

Despite this blog's title containing RapidMiner, most of what I have been doing recently involves R. I maintain a GitHub repository and at the last count there were more than 50 R packages stored there. Most are private but a few are public.

I can't reveal the private ones but there are a couple of play repositories that I have published as Shiny applications.

The first one is the POTUS Progress Pie - originally posted as an idea on the HalfBakery - a site I visit a lot - see the original idea here and the R Shiny application here.

The second one shows a genetic algorithm finding the maximum of a complex function and again uses Shiny. Here's the application. Move the "audio 1" through "audio 4" sliders to try to maximise the score. Brute force is usually not an option, so selecting the "Find the Best" option will show you the slider settings. Selecting the "Random Choice" button chooses a new function to maximise.
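The kind of search the app performs can be sketched as a tiny genetic algorithm in Python. The `objective` function below is an arbitrary stand-in for the app's complex function, and its four variables play the role of the four sliders.

```python
import math
import random

def objective(x):
    # Arbitrary stand-in for the app's complex function of four "sliders"
    return sum(math.sin(3 * v) + math.cos(5 * v) for v in x)

random.seed(1)
# Initial population: 30 random candidate slider settings in [0, 1]
population = [[random.uniform(0, 1) for _ in range(4)] for _ in range(30)]

for _ in range(100):
    # Keep the fitter half, then refill by mutating random survivors
    population.sort(key=objective, reverse=True)
    survivors = population[:15]
    children = []
    for _ in range(15):
        parent = random.choice(survivors)
        child = [min(1.0, max(0.0, v + random.gauss(0, 0.05))) for v in parent]
        children.append(child)
    population = survivors + children

best = max(population, key=objective)
print([round(v, 3) for v in best], round(objective(best), 3))
```

Selection plus small mutations is usually enough to beat brute force on a four-dimensional continuous search space like this.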

Coming up and in the spirit of mentioning RapidMiner, I will post the evolutionary process I used to help me win RapidMiner's recent contest...