
Sunday 30 August 2015

Using RapidMiner to read data from HBase

HBase is a database within the Hadoop ecosystem. Here's a very simple example RapidMiner process that connects to an HBase server and reads a value.

The process uses the RapidMiner Python operator and a package called 'happybase'.

As always when integrating systems together, there is a lot of leg-work to do to get things working. This starts with a running Hadoop cluster with HBase as well as some data. For this toy example, I created the world's simplest table called 'test' containing two rows. For example, from the HBase shell, the 'scan' command yields the following.

hbase(main):002:0> scan 'test'
ROW                   COLUMN+CELL                                               
 row1                 column=cf:a, timestamp=1440837877452, value=value1        
 row2                 column=cf:b, timestamp=1440837887539, value=value2        
2 row(s) in 0.0290 seconds
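
For reference, the same toy table could also be created programmatically with the 'happybase' package used later in this post rather than from the HBase shell. This is just an illustrative sketch (the connection details are the same assumptions as in the main script) and it needs the Thrift setup described next to be in place.

import happybase

# connect to the machine running the HBase Thrift server
connection = happybase.Connection('192.168.1.76')
# create a table called 'test' with a single column family 'cf'
connection.create_table('test', {'cf': dict()})
table = connection.table('test')
# insert the two rows shown in the scan output above
table.put('row1', {'cf:a': 'value1'})
table.put('row2', {'cf:b': 'value2'})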

To allow remote access, the HBase Thrift server must be started so that remote connections can reach HBase. This is typically done by running the following command within the HBase installation on the machine running HBase.

./bin/hbase thrift start

The final step is to ensure that remote requests to the Thrift port (9090 by default) are not blocked by the firewall on the HBase machine.

The RapidMiner process can now be run. The Python code within the RapidMiner process is shown below. Change the script to match the values in your environment.

import pandas as pd
import happybase

def rm_main():

    def dict_to_dataframe(d):
        # convert a dict of column -> value into a single-column data frame
        df = pd.DataFrame(d.items())
        df.set_index(0, inplace=True)
        return df

    # use the name or IP address of the machine where HBase is running
    connection = happybase.Connection('192.168.1.76')
    # use a table name in the database
    table = connection.table('test')
    # this scans the table and prints each row to the log
    for key, data in table.scan():
        print key, data
    # this selects the row with key 'row1'
    row1 = table.row('row1')
    return dict_to_dataframe(row1)


I'm by no means a Python expert so I don't expect this is the world's best example. Nonetheless, it shows the possibilities.

When run in my environment, the returned example set is as follows.


I've only scratched the surface of what could be done using the 'happybase' package but I hope this gives you some ideas about what you might be able to do.

Thursday 9 July 2015

Finding quartiles

Here's a process that finds the upper, middle and lower quartiles of a real-valued special attribute within an example set and discretizes all the real values into the corresponding bins. It assumes there is only one special attribute; additional special attributes would need to be de-selected as an extra step before being processed.

The process works as follows. After sorting the example set, it uses various macro extraction and manipulation operators to work out how many examples there are, determine the indexes corresponding to the quartile locations and, from there, the values of the attribute at those locations. These values are set as macros that are used in the "Discretize by User Specification" operator as boundaries between the quartile ranges in order to place each example into the correct bin.
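
Outside RapidMiner, the same sort-then-index logic is easy to see in a few lines of pandas. This is not the process itself, just a sketch of the idea; the column name 'value' and the toy data are made up for illustration.

import pandas as pd

# toy data standing in for the single special attribute (made-up values)
df = pd.DataFrame({'value': [5.0, 1.0, 3.5, 2.0, 9.0, 7.5, 4.0, 6.0]})

# sort the values and pick out the ones sitting at the quartile positions,
# mirroring the sort-then-extract-macro steps in the process
sorted_values = df['value'].sort_values().reset_index(drop=True)
n = len(sorted_values)
q1 = sorted_values[int(n * 0.25)]
q2 = sorted_values[int(n * 0.50)]
q3 = sorted_values[int(n * 0.75)]

# use the quartile values as bin boundaries, just as "Discretize by User
# Specification" does with the extracted macros
bins = [float('-inf'), q1, q2, q3, float('inf')]
labels = ['lower', 'second', 'third', 'upper']
df['quartile_bin'] = pd.cut(df['value'], bins=bins, labels=labels)
print(df)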

The main work happens in a subprocess, which makes the process easier to read and allows the operators to be moved to other processes more easily. The very useful operator "Rename by Generic Names" is used. This allows the macro manipulation operators to work without having to be concerned about the name of the special attribute, which again makes the operators more portable when used in other processes.

Monday 4 May 2015

Which UK political party is happiest? Update

There is a general election this coming Thursday in the UK and I thought it would be most interesting to compare the manifesto sentiment of 6 of the parties involved.

Firstly, I downloaded the manifestos, chopped them into sequential 50-word chunks and calculated an average sentiment for each chunk.

For each party, I also created a random manifesto by shuffling the original and again chopping it into 50-word chunks to calculate an average sentiment for each chunk. I repeated this with 50 different random manifestos for each party, for statistical reasons that will become clear later.

I then placed the sentiments into bins of width 0.04 to create the "histogram of happiness". By plotting the result we can see how the manifestos vary from random as illustrated with this plot for one of the parties.



The random points, shown in red, represent the average of the 50 random manifestos, with one standard deviation shown as the red bar, while the blue bars show the sentiment as measured with the intended word order. The variations are more than can be explained by random chance: for sentiments between -0.16 and -0.12 there are more 50-word chunks than expected, and between 0 and 0.04 there are fewer.

We can calculate a z score for each bin by subtracting the manifesto score from the random result and dividing by the standard deviation of the random result. For the graph above this results in the following graph.



Generally speaking, anything with an absolute z-score greater than 2 has roughly a 1 in 20 chance of happening, so the graph above shows that the variations are definitely not random. This is just as well since I'm sure the political parties want to persuade with something that is not just random.
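
For anyone wanting to try something similar, here is a rough Python sketch of the whole pipeline: chunking, shuffling, binning and z scores. It is not the process I actually used, and the sentiment_of function is a placeholder for whichever sentiment scorer you prefer; the chunk size, bin width and number of shuffles match the values described above.

import random
import numpy as np

CHUNK_SIZE = 50   # words per chunk
BIN_WIDTH = 0.04  # width of each sentiment bin
N_RANDOM = 50     # number of shuffled manifestos per party

def sentiment_of(words):
    # placeholder: plug in your preferred sentiment scorer here;
    # it should return the average sentiment of the chunk
    raise NotImplementedError

def chunk_sentiments(words):
    # chop the word list into sequential 50-word chunks and score each one
    return [sentiment_of(words[i:i + CHUNK_SIZE])
            for i in range(0, len(words) - CHUNK_SIZE + 1, CHUNK_SIZE)]

def histogram(sentiments, lo=-1.0, hi=1.0):
    # count chunks per sentiment bin of width 0.04
    bins = np.arange(lo, hi + BIN_WIDTH, BIN_WIDTH)
    counts, _ = np.histogram(sentiments, bins=bins)
    return counts

def z_scores(words):
    observed = histogram(chunk_sentiments(words))
    # build the random baseline by shuffling the manifesto 50 times
    random_counts = []
    for _ in range(N_RANDOM):
        shuffled = words[:]
        random.shuffle(shuffled)
        random_counts.append(histogram(chunk_sentiments(shuffled)))
    random_counts = np.array(random_counts)
    mean, std = random_counts.mean(axis=0), random_counts.std(axis=0)
    # same sign convention as above: random minus manifesto, so negative
    # z scores mark bins that occur more often than chance
    return (mean - observed) / np.where(std == 0, 1, std)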

It's quite tricky to compare the 6 parties in a neat way because the graph gets a bit messy. So I decided to focus only on the negative z scores. These represent chunks that happen more often than random and are likely to get noticed more. In other words, uttering something negative or positive is more noticeable than not uttering something.

With this in mind, I combined all the 6 parties to see how they compare.



This graphic is showing only those parts of the manifesto distribution which are more represented than random sampling by two standard deviations. Note that the x axis is not continuous and the smallest circle represents a z score of -2.04 (for the SNP).

What can we see from this? The Green Party has sections of chirpiness but offsets this with sections of negativity. The SNP is both positive and negative but to a lesser extent than the Greens. The Liberal and Labour parties are mostly negative while the Conservatives show slight positivity. By a process of elimination, UKIP has the most positive manifesto. The likelihood of finding a 50-word chunk in their manifesto with a sentiment between 0.24 and 0.28 is significantly greater than random. I declare them the happiest.

Is this going to predict the election? I doubt it but it's likely there are teams of policy wonks drafting these manifestos so it would be funny to make sentiment another thing for them to worry about.

Update: it turns out the Conservatives unexpectedly won. I refined the picture above to bring out the differences between positive and negative: green means more positive than random, red means more negative. It shows that the Conservative manifesto is resolutely the most middle of the road. Given that elections in Britain are fought on the middle ground I really should have predicted this.


Tuesday 24 February 2015

Finding those useless attributes and making sure they are really useless

The "Remove Useless Attributes" operator does what it says and removes attributes that are useless. The default for numbers is to remove those that have zero deviation. This is fair enough since it means these attributes are the same for all examples; there's nothing they are bringing to the party. For nominal values, the default is to remove an attribute where all its values are the same. Again, fair enough.

What happens if you remove some attributes and you want to know which ones? You might ask why and that's a good question. All I can say is that it turns out that there are situations where no one will believe you. The conversation goes like this.

"Where are those attributes that I lovingly made?"
"They don't add any value"
"What?! Noooo"

Anyway, you get the picture.

Here's a process that finds the useless attributes and outputs an example set so that you can confirm that they really should be allowed to leave.

It uses the "Data to Weights" operator on the example set after the useless attributes have been sent home. The "Select by Weights" operator is then applied to the original example set containing all the attributes but with the "Weight Relation" set to be less than 1.0 and crucially "deselect unknown" is unchecked. This has the nice effect that the returned example set contains the attributes that were marked as useless.

Thursday 8 January 2015

RapidMiner Server and Elasticsearch with Lucene

Elasticsearch, Logstash and Kibana: a set of most excellent open source tools that are very good at consolidating log files and other data into a central location (the Logstash part), storing and indexing them to make a scalable search platform (Elasticsearch and Lucene) and providing a neat Web front end (Kibana).

RapidMiner Server produces log files and, when running processes, errors can sometimes be hard to find, so I decided to import the server log files into Elasticsearch to see if the resulting Lucene search capability could speed me up a bit.

I am not able to share the exact technical details but it is relatively easy and involves using Logstash Forwarder on the RapidMiner Server host machine. Logstash Forwarder is set up to monitor the RapidMiner Server log file and will forward any new lines to the Logstash server.

Logstash is set up to receive all events and applies filtering to combine some log entries into single events. The multiline rule I used is that any line that does not begin with a valid timestamp should be included with the previous line (hints: the regular expression pattern is "^%{TIME}", "negate" is true and "what" is set to "previous"). This step alone has a tremendous benefit as it neatens up the log file so that each line now has a timestamp and sorting by time will not miss any straggler or blank lines.

Once in Elasticsearch, the events have become documents and the Kibana Web interface can be used to search them. One of the cool things is that the Lucene information retrieval library is built into Elasticsearch. This means it is possible to do queries like this.

"Marking successfully" ~ 3

This matches if the two words are within three words of each other.

It turns out that RapidMiner Server reports this if a process succeeds...

"Marking process as successfully completed"

and reports this if a process failed...

"Marking process as completed with exception"

This allows two queries to be defined to filter out everything except the log lines containing these error messages.
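
The same two searches can also be run from outside Kibana. Here is a small sketch using the official Python client for Elasticsearch; the host and the 'logstash-*' index pattern are assumptions (the latter is simply the default Logstash naming convention) and will differ in your environment.

from elasticsearch import Elasticsearch

# point this at the machine running Elasticsearch
es = Elasticsearch(['localhost:9200'])

def count_matches(query):
    # query_string exposes the full Lucene syntax, including proximity
    # searches such as '"Marking successfully"~3'
    body = {'query': {'query_string': {'query': query}}}
    result = es.search(index='logstash-*', body=body)
    return result['hits']['total']

succeeded = count_matches('"Marking process as successfully completed"')
failed = count_matches('"Marking process as completed with exception"')
print('succeeded:', succeeded, 'failed:', failed)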

In fact, here's a screenshot showing the results of me running some RapidMiner Server tests where some fail and some succeed.


The histogram view shows the count of the matched events aggregated by minute and the table view gives the log details.

I can safely say that no developer was harmed in the making of this dashboard.

All in all, it makes it a bit easier to spot when something has gone wrong but of course, I am only scratching the surface of what is possible. Elasticsearch has an impressive array of text indexing capabilities and it has a completely open JSON interface. I could imagine connecting the output of Elasticsearch to RapidMiner and making models to prescribe some corrective medicine when a problem is detected in RapidMiner Server. As time and motivation permit I will attempt this although it might become too cool to share for free ;)


Thursday 1 January 2015

English stop words

I wondered recently exactly which word list is used by the Filter Stopwords (English) operator.

I consulted the code for version 5.3 and made the following file. It contains 395 words in total, including some interesting ones like "wert" - the imperfect subjunctive of "were" found in Shakespearean English - and "summat" - Yorkshire dialect for "something". I'm not sure these would always be stop words.

I assume the operator hasn't changed in the latest version but if it has, the list can be used with the Filter Stopwords (Dictionary) operator to make a facsimile of version 5.3. It would also be possible to use the list as the basis for your own stop word filtering operator and publishing the list would make research more reproducible.