The comparison between randomForest and ranger

A couple of days ago I had the chance to speak at an internal data scientist meeting at the company I work for, Stream Intelligence. The meeting is usually held monthly, and the last one, in October, was the sixth. We used Skype for Business to connect the data scientists in Jakarta and London.

I gave a talk titled "Random forest in R: A case study of a telecommunication company". For those who are not familiar with random forests, Gopal Malakar has uploaded a video to YouTube in which he explains what a random forest is. First of all, check the video out!

Based on the video, the one important thing to remember about a random forest is that it is a collection of decision trees. Each decision tree is built from a random subset of the variables and observations in the training data.

Suppose we have trained a random forest model made of 100 decision trees, and we feed one test observation into it. If 60 of the trees output Y and 40 output N, then the output of the random forest model is Y, with a score (or probability) of 0.6.
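The voting step can be sketched in a few lines of base R. This is only an illustration of the arithmetic above: the vector votes is made up to match the 60/40 example, it does not come from a real model.

```r
# Hypothetical votes from 100 trees for one test observation
votes <- c(rep("Y", 60), rep("N", 40))

# Share of trees voting for each class
vote_share <- table(votes) / length(votes)

# The forest predicts the class with the largest share
prediction <- names(vote_share)[which.max(vote_share)]
score <- as.numeric(max(vote_share))

cat(prediction, score, "\n")  # prints: Y 0.6
```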

OK, let’s practice training a random forest for classification in R. A couple of weeks ago I learned from a DataCamp course that there are two random forest packages: 1) randomForest and 2) ranger. The course recommends ranger because it is a lot faster than the original randomForest.

To check this, I wrote a script using the Sonar dataset and the caret package for machine learning, with the methods ranger and rf, and tuneLength = 2. This argument sets how many candidate values caret tries for each tuning parameter, which here means mtry: the number of variables randomly sampled as candidates at each split of a tree. In a random forest, mtry is the main hyperparameter that we can tune.
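A minimal sketch of such a timing script could look like the one below. It is not my exact script: I am assuming here that Sonar is loaded from the mlbench package, that the caret, ranger, and randomForest packages are installed, and that 5-fold cross-validation is used for resampling. The seed and the object names are arbitrary.

```r
library(caret)
library(mlbench)  # assumed source of the Sonar dataset

data(Sonar)

# Assumed resampling setup: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Time the ranger-based random forest
set.seed(42)
time_ranger <- system.time(
  model_ranger <- train(Class ~ ., data = Sonar, method = "ranger",
                        tuneLength = 2, trControl = ctrl)
)

# Time the original randomForest implementation (caret method "rf")
set.seed(42)
time_rf <- system.time(
  model_rf <- train(Class ~ ., data = Sonar, method = "rf",
                    tuneLength = 2, trControl = ctrl)
)

print(time_ranger)
print(time_rf)
```

One thing to keep in mind when reading the numbers: ranger can use several threads, and user time adds up the CPU time of all of them, so it may also be worth comparing the elapsed time.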

Output of ranger training

Output of random forest training

So, training with ranger is 26.75 - 22.37 = 4.38 seconds faster than training with the original randomForest, a saving of about 16% of the randomForest time (4.38 / 26.75), assuming we compare user time.

However, when I changed the tuneLength parameter to 5, the original randomForest function turned out to be faster than ranger. Hmmm… it seems I have to post a question on Stack Overflow or ask the DataCamp experts.

Import from a Database in R

Importing database to R
When you use R for data analysis, you sometimes have to import data from a database (e.g. an SQL database) instead of just reading a CSV, Excel, or MAT file. Doing this takes several steps. The good thing is that a DataCamp course provides step-by-step guidance.

There are 5 steps if you have not installed the RMySQL package yet. Step 1 is installing the package, so you can skip it if the package is already installed. In step 2, you establish a connection to the database with the dbConnect() function. In step 3, you list the database tables using dbListTables(); in this example three tables are available: users, tweats, and comments. In step 4, you import a table and assign it to a data frame variable. Last but not least, in step 5, once you have finished importing data, the polite thing to do is to disconnect from the database using the dbDisconnect() function.

The R script for these steps is as follows:
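Here is a rough sketch of the five steps, assuming a MySQL server that you have credentials for. The host, database name, user, and the DB_PASSWORD environment variable are hypothetical placeholders, so replace them with your own values. Using dbReadTable() for step 4 is also my assumption, since any DBI function that returns a data frame would do.

```r
# Step 1: install the package (skip if already installed)
# install.packages("RMySQL")

library(DBI)

# Step 2: connect to the database (all connection details are placeholders)
con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "my_database",
                 host     = "db.example.com",
                 port     = 3306,
                 user     = "my_user",
                 password = Sys.getenv("DB_PASSWORD"))

# Step 3: list the tables, e.g. users, tweats, comments
tables <- dbListTables(con)
print(tables)

# Step 4: import one table into a data frame
users <- dbReadTable(con, "users")
str(users)

# Step 5: disconnect once the import is finished
dbDisconnect(con)
```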

Hope it helps! ;D
