K-Means and Hierarchical Clustering in Python

Tonight, 5 June 2020, I was assigned by IYKRA to deliver the “Clustering” online class training at the Data MBA Batch #3 program. I would like to share a summary of the class. Here is the agenda; FYI, it will be delivered in 2 hours.

  • Pre-Quiz (19:00 – 19:05)
  1. Theory of unsupervised learning: Clustering (19:05 – 19:20)
  2. K-Means Clustering (19:20 – 19:40)
  3. Hierarchical Clustering (19:40 – 20:00)
  4. Measurement Parameters for Clustering (20:00 – 20:15)
  5. Hands on and Exercise of K-Means Clustering and Hierarchical Clustering with Python (20:15 – 20:45)
  • Q & A (20:45 – 20:55)
  • Post-Quiz (20:55 – 21:00)

Note: The hands-on session (point 5) will be delivered in Python.


Teaching Again, by Invitation

It is already March 2018. Time flies. Jakarta is as hot as ever, so most of its residents have to keep themselves comfortable with air conditioning (AC). Myself included. For the record, as I write this post, the bedroom AC is leaking and has not been cleaned yet because the repairman is busy. This has been going on for almost a month, so we have to catch the drips with a bucket. lol. And in the early hours of this morning the story turned epic: the bucket was out of position… the floor got soaked and I had to mop in the middle of a sound sleep.

Back to the topic of work.

Since the eighth month of last year, I have been building my career at a new place, which means I have now been there for seven months. It has not been easy, especially the bureaucracy, which feels rigid and fussy with all its red tape, the HR matters that fall short of expectations, and some of the work equipment, which was unsatisfactory at first.

But I will stay, because there are many positives. My teammates want to move forward, want to improve things, and support one another. It seems God has answered my prayers, considering that in 2017 I changed jobs three times. Here is the quote that keeps me strong.

I believe God always shows the way to those who believe. Just as Moses was shown the way by God to free the Israelites from slavery in Egypt, so God has shown me the best way.

Asked to teach again.

Early this year I was asked to teach again, after a break of almost nine months. The topics are never far from my work: Data, R, SQL, Machine Learning, and the like. Thank you, IYKRA (Fajar and Zizah), for the invitation. It was a bit hard at first, but once underway it turned out fine. I myself have felt a tremendous positive impact, because I am “forced” to study, which is mandatory preparation before teaching.

Over the last two months I have taught four times. Quite a lot, right? And there will be more in April and May. Frankly, I am glad, since the proceeds can go toward replacing and installing the AC. 🙂 But beyond that, I am happy when other people benefit from what I share.

I believe I don’t have to wait until I am rich to share with others.

– BAK, 2018 –

First, I taught a workshop at the Future Force Fair, attended by 180 participants. The format was a workshop on R. You can get the materials here, free of charge (just right-click, then “Save Link As” or “Save Target As”).

Here are some photos of me teaching. Handsome, right?

Second, I taught ggplot2 and dplyr. The materials can be downloaded here.

Third, I taught SQL for data analysis. The materials can be downloaded here.

Fourth, I taught advanced SQL. The materials can be downloaded here.

There will be more interesting stories to tell as I teach and speak. So, stay tuned!


A comparison between randomForest and ranger


A couple of days ago I had a chance to speak at an internal data scientist meeting at the company I work for: Stream Intelligence. The meeting is usually held on a monthly basis, and the last one in October was the sixth. We used Skype for Business to connect the data scientists in Jakarta and in London.

I delivered a talk titled Random forest in R: A case study of a telecommunication company. For those who do not know random forest, Gopal Malakar has made a video, uploaded on YouTube, in which he elaborates the definition of random forest. First of all, check the video out!

Based on the video, one important thing to remember about a random forest is that it is a collection of trees: it is built from a number of decision trees, and each decision tree is formed from a random subset of the variables and observations of the training data.

Suppose we have trained a random forest model consisting of 100 decision trees, and we feed one test observation into it. If 60 trees output Y and 40 output N, the output of the random forest model is Y, with a score (probability) of 0.6.
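That majority vote is easy to mimic by hand. Here is a toy illustration in R, with vote counts mirroring the example above:

    # 100 simulated tree votes for a single test observation
    votes <- c(rep("Y", 60), rep("N", 40))
    table(votes)              # N: 40, Y: 60
    prop.table(table(votes))  # N: 0.4, Y: 0.6 -> the forest predicts "Y" with score 0.6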

OK, let’s practice how to train a random forest for classification in R. I only learned a couple of weeks ago, from a DataCamp course, that there are two random forest packages: 1) randomForest and 2) ranger. They recommend ranger, because it is a lot faster than the original randomForest.

To prove it, I created a script using the Sonar dataset and the caret package for machine learning, with the methods ranger / rf and tuneLength=2 (this argument sets how many candidate values of mtry are evaluated; mtry, the number of variables sampled when building each tree, is the hyperparameter we can tune in random forest). A sketch of the script is below.
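Here is a minimal sketch of that benchmark, assuming the caret, ranger, randomForest, and mlbench packages (the last one provides the Sonar data) are installed:

    # Benchmark: caret training with ranger vs. the original randomForest
    library(caret)
    library(mlbench)
    data(Sonar)  # 208 observations, 60 numeric predictors, binary Class

    set.seed(42)
    time_ranger <- system.time(
      train(Class ~ ., data = Sonar, method = "ranger", tuneLength = 2)
    )

    set.seed(42)
    time_rf <- system.time(
      train(Class ~ ., data = Sonar, method = "rf", tuneLength = 2)
    )

    time_ranger  # user/elapsed times for ranger
    time_rf      # user/elapsed times for randomForest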

Output of ranger training

Output of random forest training

So, random forest training with ranger is 26.75 − 22.37 = 4.38 seconds, or roughly 16%, faster than the original randomForest (assuming we compare user time).

However, when I changed the tuneLength parameter to 5, it turned out that the original randomForest function was faster than ranger. Hmmm… it seems I will have to post a question to Stack Overflow or the DataCamp experts.

Comparing The Machine Learning Methods with ROC Curve

Yesterday I continued a DataCamp online course named Introduction to Machine Learning. Frankly, this course is very useful for strengthening my understanding of machine learning! Plus, I am a big fan of R! The more you repeat the course, the better you understand it. Well, the topic was “comparing the methods”. It is part of Chapter 3, Classification, right at the end of the chapter. It says that the powerful tool for comparing machine learning methods, especially classifiers, is the ROC curve. FYI, off the record, this ROC curve analysis was also requested by one of the clients. 😉

What is ROC?

ROC stands for Receiver Operating Characteristic. In statistics, it is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. Electrical and radar engineers first developed the ROC curve during World War II for detecting enemy objects; it was soon used by psychologists to account for the perceptual detection of stimuli. Since then, ROC analysis has been applied in medicine, radiology, biometrics, machine learning, and data mining research. (Source: here)

A sample ROC curve is illustrated in Figure 1. The horizontal axis represents the false-positive rate (FPR), while the vertical axis represents the true-positive rate (TPR). The true-positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm, and can be calculated as (1 − specificity).

Figure 1. ROC Curve – Source: Wikipedia.
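Both rates come straight from the confusion matrix. Here is a quick illustration in R, with made-up counts:

    # Illustrative confusion-matrix counts for a binary classifier
    TP <- 40; FN <- 10  # positives: detected vs. missed
    FP <- 5;  TN <- 45  # negatives: falsely flagged vs. correctly rejected

    TPR <- TP / (TP + FN)  # sensitivity / recall       = 0.8
    FPR <- FP / (FP + TN)  # fall-out = 1 - specificity = 0.1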

How to create this curve in R?

You need:

  • A classifier that outputs probabilities
  • The ROCR package installed

Suppose you have a dataset called adult, which can be downloaded here from the UCI Machine Learning Repository. It is a medium-sized dataset about people’s income given a set of features such as education, race, sex, and so on. Each observation is labeled 1 or 0: 1 means the person has an annual income equal to or above $50,000, and 0 means an annual income below $50,000. This label information is stored in the income variable. The data is then split into train and test sets. After splitting, you can train a model with a method such as a decision tree (rpart), predict on the test data with the “predict” function and the argument type=”prob”, and aha… see the complete R code below.
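Here is a minimal sketch of those steps, assuming adult is already loaded as a data frame with a binary income factor (the 70/30 split and the full-formula model are illustrative choices):

    # Split the data, train a decision tree, and draw its ROC curve
    library(rpart)
    library(ROCR)

    set.seed(1)
    n         <- nrow(adult)
    train_idx <- sample(n, round(0.7 * n))  # 70/30 train/test split
    train     <- adult[train_idx, ]
    test      <- adult[-train_idx, ]

    # Fit a classification tree and get class probabilities on the test set
    tree  <- rpart(income ~ ., data = train, method = "class")
    probs <- predict(tree, test, type = "prob")[, 2]  # P(income == 1)

    # ROCR: build the prediction object, then the TPR-vs-FPR curve
    pred <- prediction(probs, test$income)
    perf <- performance(pred, "tpr", "fpr")
    plot(perf)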

The resulting plot is as follows:

Figure 2. ROC result of Decision Tree.

How to interpret the result of ROC?

Basically, the closer the curve is to the upper-left corner, the better the classifier. In other words, the area under the curve should be close to its maximum value, which is 1. We can compare the performance of two methods, Decision Tree (DT) and K-Nearest Neighbors (K-NN), based on their ROC curves, as seen in Figure 3. It shows that the DT method, represented by the red line, outperforms K-NN, represented by the green line.

Figure 3. ROC Result of Decision Tree and K-Nearest Neighbor.

Figure 3 can be drawn with the following R code.
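A minimal sketch of that plotting code, assuming probs_dt and probs_knn (illustrative names) hold the test-set probabilities produced by the two classifiers:

    # Overlay the ROC curves of the two classifiers
    library(ROCR)

    pred_dt  <- prediction(probs_dt,  test$income)
    pred_knn <- prediction(probs_knn, test$income)

    perf_dt  <- performance(pred_dt,  "tpr", "fpr")
    perf_knn <- performance(pred_knn, "tpr", "fpr")

    plot(perf_dt,  col = "red")                # Decision Tree
    plot(perf_knn, col = "green", add = TRUE)  # K-NN on the same axes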

The area under the curve (AUC) can also be calculated by running the command below. It shows that the AUC of DT is 5% greater than that of K-NN.
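A sketch of that calculation, reusing the ROCR prediction objects from the previous snippet:

    # AUC for each classifier; closer to 1 is better
    auc_dt  <- performance(pred_dt,  "auc")@y.values[[1]]
    auc_knn <- performance(pred_knn, "auc")@y.values[[1]]
    auc_dt
    auc_knn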


Summary

  • The ROC (Receiver Operating Characteristic) curve is a very powerful performance measure.
  • It is used for binary classification.
  • ROCR is a great R package for drawing ROC curves.
  • The closer the curve is to the upper-left corner, the better the classifier.
  • A good classifier has a large area under the curve.