Writing Challenge, Day 165
By: Bernardus Ari Kuncoro
A few months ago I got a question from a mentee, over WhatsApp.
The conversation went like this. A is the mentee; B is me.
A: Mas, I have a question.
A: About data normalization and data standardization.
A: When is data normalization used, Mas Ari?
A: Data standardization, too.
A: And when is data scaling used?
A: Thanks, Mas Ari.
B: OK, let me answer.
The terms standardization, normalization, and scaling are often used interchangeably outside academia. At their core, all three aim to produce new "flavors" of your variables so that they fit, and are useful to, the algorithm you will later apply in predictive analysis, whether supervised or unsupervised learning.
From the various sources I have read, scaling can be understood as the umbrella term covering standardization and normalization. Some sources point out that standardization differs from normalization, and you can tell from the root words, Mbak: standardization standardizes the mean and the variance, while normalization normalizes the range.
When is each used? Standardization fits when you plan to use an algorithm that assumes normally distributed data, for example linear regression, logistic regression, and linear discriminant analysis. Normalization fits algorithms that work with distances or feature magnitudes, such as K-Means, KNN, and PCA, as well as Gradient Descent-based optimization.
I hope that answers it.
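In formula form: standardization computes z = (x − mean) / std, while min-max normalization computes x' = (x − min) / (max − min). The short sketches further below show both transforms in code.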
Below are excerpts from sources you can read further. Underneath are links you can copy and paste.
1. Standardization
Standardization (or Z-score normalization) means centering the variable at zero and setting the variance to 1. The procedure involves subtracting the mean from each observation and then dividing by the standard deviation.
Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
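As a minimal sketch of this procedure (assuming scikit-learn is available; the salary figures are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented one-column dataset, e.g. salaries.
X = np.array([[50_000.0], [60_000.0], [75_000.0], [120_000.0]])

# Manual z-score: subtract the mean, then divide by the standard deviation.
z_manual = (X - X.mean()) / X.std()

# The same transform via scikit-learn's StandardScaler.
z_sklearn = StandardScaler().fit_transform(X)

print(z_manual.ravel())                  # centred near 0 with std 1
print(np.allclose(z_manual, z_sklearn))  # True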
2. Normalization
The data is scaled to a fixed range, usually 0 to 1.
Characteristic of normalization: in contrast to standardization, the cost of this bounded range is that we end up with smaller standard deviations, which can suppress the effect of outliers. At the same time, because the observed minimum and maximum define the range, a single extreme value compresses all the other values, so MinMaxScaler is sensitive to outliers.
Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
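A minimal sketch of min-max scaling and its sensitivity to outliers (scikit-learn assumed; the values, including the outlier, are invented):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented one-column data with a single extreme value.
X = np.array([[1.0], [2.0], [3.0], [100.0]])

# (x - min) / (max - min) maps every value into [0, 1].
x_norm = MinMaxScaler().fit_transform(X)

print(x_norm.ravel())
# [0.         0.01010101 0.02020202 1.        ]
# The outlier defines the range and squeezes the other values together,
# which is the sensitivity to outliers described above.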
Examples of Algorithms where Feature Scaling matters
1. K-Means uses the Euclidean distance measure, so feature scaling matters.
2. K-Nearest Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA) tries to capture the directions of maximum variance, so here too feature scaling is required.
4. Gradient Descent converges faster after feature scaling, since the theta updates proceed at a similar rate for every feature.
Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models are not affected by feature scaling. In short, any algorithm that is not distance-based is not affected by feature scaling.
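To make the note concrete, here is a small sketch (synthetic data; scikit-learn assumed) of why distance-based methods need scaling: before standardization the Euclidean distance is dominated by the large-scale feature, and afterwards both features contribute:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=(100, 1))       # small-scale feature
income = rng.uniform(1e6, 1e7, size=(100, 1))  # large-scale feature
X = np.hstack([age, income])

# Euclidean distances from the first point, raw vs standardized.
d_raw = np.linalg.norm(X - X[0], axis=1)
X_std = StandardScaler().fit_transform(X)
d_std = np.linalg.norm(X_std - X_std[0], axis=1)

# On raw data the nearest neighbours are picked almost entirely by income;
# after standardization age matters too, so the neighbour orderings differ.
print(np.argsort(d_raw)[1:6])
print(np.argsort(d_std)[1:6])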
Sources:
1. https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data
2. https://developers.google.com/machine-learning/clustering/prepare-data
3. https://www.geeksforgeeks.org/python-how-and-where-to-apply-feature-scaling/
This post was previously published here.
Kalideres, 14 September 2021