Comparing Machine Learning Methods with the ROC Curve

Yesterday I continued a Datacamp online course named Introduction to Machine Learning. Frankly, this course is very useful for strengthening my understanding of machine learning! Plus, I am a big fan of R! The more you repeat the course, the better you understand it. The topic this time was "comparing the methods", which comes at the very end of chapter 3, Classification. It says that a powerful tool for comparing machine learning methods, especially classifiers, is the ROC curve. FYI, off the record, this ROC curve analysis was also requested by one of our clients. 😉

What is ROC?

ROC stands for Receiver Operating Characteristic. In statistics, it is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. Electrical and radar engineers first developed the ROC curve during World War II for detecting enemy objects, and psychologists soon adopted it to account for the perceptual detection of stimuli. Since then, ROC analysis has been used in medicine, radiology, biometrics, machine learning, and data mining research. (Source: here).

A sample ROC curve is shown in Figure 1. The horizontal axis represents the false-positive rate (FPR), while the vertical axis represents the true-positive rate (TPR). In machine learning, the true-positive rate is also known as sensitivity, recall, or probability of detection. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity).
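To make the two axes concrete, here is a minimal base R sketch that computes one (FPR, TPR) point for a given threshold. The labels, scores, and the `roc_point` helper are made up for illustration; they are not from the course or from any package.

```r
# Toy data (hypothetical): true labels and predicted probabilities of class 1
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# Hypothetical helper: one point of the ROC curve for a single threshold
roc_point <- function(labels, scores, threshold) {
  predicted <- as.integer(scores >= threshold)
  tp <- sum(predicted == 1 & labels == 1)  # true positives
  fp <- sum(predicted == 1 & labels == 0)  # false positives
  fn <- sum(predicted == 0 & labels == 1)  # false negatives
  tn <- sum(predicted == 0 & labels == 0)  # true negatives
  c(fpr = fp / (fp + tn), tpr = tp / (tp + fn))
}

# One point on the curve
roc_point(labels, scores, 0.5)

# Sweeping the threshold over all observed scores traces the whole curve
thresholds <- sort(unique(scores), decreasing = TRUE)
roc_points <- t(sapply(thresholds, function(th) roc_point(labels, scores, th)))
roc_points
```

Each row of `roc_points` is one threshold; plotting the `fpr` column against the `tpr` column gives the staircase shape of a ROC curve.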

Figure 1. ROC Curve – Source: Wikipedia.

How to create this curve in R?

You need:

  • A classifier that outputs probabilities
  • The ROCR package installed

Suppose that you have a data set called adult that can be downloaded here from the UCI Machine Learning Repository (UCIMLR). It is a medium-sized dataset about people's income given a set of features such as education, race, sex, and so on. Each observation is labeled 1 or 0: 1 means the observation has an annual income equal to or above $50,000, and 0 means an annual income lower than $50,000. This label is stored in the income variable. The data is then split into train and test sets. After splitting, you can train a model on the train set using a method such as a decision tree (rpart), predict the test set with the "predict" function and the argument type="prob", and aha… see the complete R code below.
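The steps above can be sketched roughly as follows. This is not the original listing: the data frame here is a small simulated stand-in for adult (the features `age` and `education_years` and the 70/30 split are my assumptions), so that the sketch runs without downloading anything. The rpart and ROCR calls are the standard ones.

```r
library(rpart)  # decision trees
library(ROCR)   # ROC utilities

set.seed(1)
n <- 1000

# Hypothetical stand-in for the adult data: two made-up features
# and a 0/1 income label generated from them
adult <- data.frame(
  age = round(runif(n, 18, 70)),
  education_years = sample(8:16, n, replace = TRUE)
)
p_high <- plogis(-8 + 0.05 * adult$age + 0.4 * adult$education_years)
adult$income <- factor(rbinom(n, 1, p_high), levels = c(0, 1))

# Split into train (70%) and test (30%); the ratio is an assumption
train_idx <- sample(n, size = 0.7 * n)
train <- adult[train_idx, ]
test <- adult[-train_idx, ]

# Train a decision tree and predict class probabilities on the test set
tree <- rpart(income ~ ., data = train, method = "class")
probs <- predict(tree, test, type = "prob")[, "1"]  # probability of income = 1

# Build the ROCR objects and draw the ROC curve
pred <- prediction(probs, test$income)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
```

With the real adult data, only the block that simulates `adult` would need to be replaced by reading the downloaded file.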

The resulting plot is as follows:

Figure 2. ROC result of Decision Tree

How to interpret the result of ROC?

Basically, the closer the curve is to the upper left corner, the better the classifier. In other words, the "area under the curve" should be close to its maximum value, which is 1. We can compare the performance of two methods, Decision Tree (DT) and K-Nearest Neighbor (K-NN), based on their ROC curves, as seen in Figure 3. It shows that the DT method, represented by the red line, outperforms K-NN, represented by the green line.

Figure 3. ROC Result of Decision Tree and K-Nearest Neighbor

The R code to draw Figure 3 is as follows:
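As a rough sketch of how two ROC curves can be overlaid with ROCR: the scores below are simulated stand-ins for the predicted probabilities of each model on the same test set (`probs_dt` and `probs_knn` are made-up names and values, not the results behind Figure 3).

```r
library(ROCR)

set.seed(42)
labels <- rbinom(200, 1, 0.5)  # made-up true labels

# Made-up scores standing in for each model's predicted probability of class 1
probs_dt <- plogis(rnorm(200, mean = ifelse(labels == 1, 1.5, -1.5)))
probs_knn <- plogis(rnorm(200, mean = ifelse(labels == 1, 0.5, -0.5)))

# One performance object per model, both on the same labels
perf_dt <- performance(prediction(probs_dt, labels), "tpr", "fpr")
perf_knn <- performance(prediction(probs_knn, labels), "tpr", "fpr")

# Draw the first curve, then add the second one to the same plot
plot(perf_dt, col = "red")
plot(perf_knn, col = "green", add = TRUE)
abline(0, 1, lty = 2)  # diagonal = random guessing
legend("bottomright", legend = c("Decision Tree", "K-NN"),
       col = c("red", "green"), lty = 1)
```

The important part is `add = TRUE` in the second `plot` call, which keeps both curves on the same axes.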

The area under the curve (AUC) can also be calculated by running the command below. It shows that the AUC of DT is 5% greater than that of K-NN.
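A minimal sketch of the AUC calculation with ROCR, using made-up toy labels and scores rather than the DT and K-NN results above. The second half cross-checks the value in base R with the rank-based formula, which I added only to show what the number means.

```r
library(ROCR)

# Toy data (hypothetical): true labels and predicted probabilities of class 1
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# AUC via ROCR: the value is stored in the y.values slot
pred <- prediction(scores, labels)
auc_rocr <- performance(pred, "auc")@y.values[[1]]

# Cross-check in base R: AUC is the share of (positive, negative) pairs
# in which the positive observation gets the higher score
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

auc_rocr
auc_rank
```

Running the same two ROCR lines once per model gives the two AUC values to compare.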

 

Summary

  • The ROC (Receiver Operating Characteristic) curve is a very powerful performance measure.
  • It is used for binary classification.
  • ROCR is a great R package for drawing ROC curves.
  • The closer the curve is to the upper left corner, the better the classifier.
  • A good classifier has a large area under the curve.

Online Video Editors

Wevideo

This year I have recorded several videos, mostly of my daughter. She was born in December last year, and all of the raw videos on my phone need to be gathered, edited, and turned into stories. Cute. Funny.

Problem

I ran into a problem: my phone is running out of storage! It is getting laggy. Slower. Unresponsive. Should I just put the videos on Dropbox? Could I still edit them? Well, my phone has only 32 GB of internal storage and cannot hold "big data". 😀 lol.

Then, today I found some online video editing software. It doesn't need to be fully customisable or anything like Adobe Premiere. And I don't want to burden my office laptop or my Mac with the storage space a video editor needs. The raw data for my data science work takes up enough already. 🙂

Solution

I started googling and found this link: http://filmora.wondershare.com/video-editor/free-online-video-editor.html. I think it will be useful in the future for documenting my daughter's growth on video. If you have a video editor to recommend, let me know. I will definitely try it. I need something lightweight that still gives high-quality results in terms of resolution and speed. Aha, a friend of mine (Danny Wirjadi, Twitter: @dannywirjadi), who is a video editor and works for a luxury videographer in Indonesia, can probably give me a recommendation.

Btw, I used Wevideo to edit the video of my daughter's first day below. Thanks Google, my laptop, and the internet. Awwwwh… So cute.


Odd vs Even: Which One Wins?

Illustration source: Detik.com

As you know, the residents of DKI Jakarta face a classic urban problem: traffic congestion. To tackle it, the government has made many efforts, such as adding to the busway fleet, restricting vehicles with the 3-in-1 rule in certain locations, and the newest regulation, officially in force since 30 August 2016: the odd-even rule (ganjil genap). With the odd-even rule in place, 3-in-1 no longer applies.

I will not analyze the effect of this rule on the level of congestion in Jakarta. Instead, I want to answer a question that crossed my mind this morning:

Which is better off, a vehicle with an odd plate number or an even one?

To answer this, let us assume that "better off" is measured by the number of days a vehicle may drive on the roads where the odd-even rule applies. We then have to count how many days odd-plated and even-plated vehicles can each pass within a given period.

The period assumed is 1 year starting from 30 August 2016, so the interval runs from 30 August 2016 to 29 August 2017. That interval contains 365 days. Subtracting the Saturdays and Sundays, 104 days in total, leaves 261 days. Subtracting the public holidays that do not fall on a Saturday or Sunday, 11 days in total, leaves 250 days. Matching these against the calendar, the number of odd dates outside Saturdays, Sundays, and holidays is 129 days, while the number of even dates is 121 days. So if you own a vehicle with an odd plate number, you come out 8 days ahead of those with an even plate number.

My question is answered.

Oh, and to work this out I got help from the following simple R program.

Note: I keep the list of holidays in a CSV file (daftarlibur.csv).

Date,Holiday
9/12/2016,1
10/2/2016,1
12/12/2016,1
12/24/2016,1
12/25/2016,1
12/26/2016,1
1/1/2017,1
1/28/2017,1
3/28/2017,1
4/14/2017,1
4/24/2017,1
5/1/2017,1
5/25/2017,1
6/1/2017,1
6/25/2017,1
6/26/2017,1
8/17/2017,1
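The count described above can be reproduced with a short base R sketch like the one below. It is a reconstruction under the assumptions stated in the post (weekends and listed holidays are excluded, and days are split by whether the date is odd or even), not the original script. The holiday list is copied inline so the sketch runs without the daftarlibur.csv file, and the variable names are mine.

```r
# Holiday list copied from daftarlibur.csv above, inlined so the sketch runs on its own
holiday_csv <- "Date,Holiday
9/12/2016,1
10/2/2016,1
12/12/2016,1
12/24/2016,1
12/25/2016,1
12/26/2016,1
1/1/2017,1
1/28/2017,1
3/28/2017,1
4/14/2017,1
4/24/2017,1
5/1/2017,1
5/25/2017,1
6/1/2017,1
6/25/2017,1
6/26/2017,1
8/17/2017,1"
holidays <- read.csv(text = holiday_csv, stringsAsFactors = FALSE)
holiday_dates <- as.Date(holidays$Date, format = "%m/%d/%Y")

# Every day in the one-year interval
days <- seq(as.Date("2016-08-30"), as.Date("2017-08-29"), by = "day")
lt <- as.POSIXlt(days)

is_weekend <- lt$wday %in% c(0, 6)   # 0 = Sunday, 6 = Saturday
is_holiday <- days %in% holiday_dates
eligible <- !is_weekend & !is_holiday  # days on which the rule matters

# Split the eligible days by the parity of the day of the month
odd_days <- sum(eligible & lt$mday %% 2 == 1)
even_days <- sum(eligible & lt$mday %% 2 == 0)

c(total = length(days), weekend = sum(is_weekend),
  weekday_holidays = sum(is_holiday & !is_weekend),
  odd = odd_days, even = even_days)
```

Using `as.POSIXlt(days)$wday` instead of `weekdays()` keeps the weekend check independent of the machine's locale.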

That said, this can actually be worked out quickly without any script. A year has 12 months: 7 months with 31 days, 4 months with 30 days, and 1 month with 28 or 29 days. So within one year, odd dates outnumber even dates by 7 to 8 days. Makes sense. Odd, you are the winner!
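The month-length argument can be checked with a few lines of base R. The `odd_surplus` helper is a made-up name for this sketch; it counts all calendar dates and ignores weekends and holidays.

```r
# Hypothetical helper: how many more odd dates than even dates a year has,
# given the number of days in February (28 or 29)
odd_surplus <- function(feb_days) {
  days_in_month <- c(31, feb_days, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  odd <- sum(ceiling(days_in_month / 2))  # dates 1, 3, 5, ...
  even <- sum(floor(days_in_month / 2))   # dates 2, 4, 6, ...
  odd - even
}

odd_surplus(28)  # ordinary year
odd_surplus(29)  # leap year
```

Every month with an odd number of days contributes exactly one extra odd date, which is where the 7 (ordinary year) or 8 (leap year) comes from.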
