Essential data cleaning steps for machine learning algorithms

DATA CLEANING IN PYTHON AND R

Data cleaning is the most crucial part of preparing data for machine learning algorithms.

Data cleaning involves a series of steps that produce a more consistent dataset. Most machine learning algorithms cannot work with missing features, so we need to take care of the missing values in the data. Visit https://github.com/Jeremiah-Katumo/feature-engineering for code illustrations. Before we proceed, let's state the steps in data cleaning:

1. Filtering out zero and near-zero variance features.

Zero-variance variables are features that contain only a single unique value and provide no useful information to a model. Near-zero variance features likewise offer little information. They also cause problems during resampling, since there is a high probability that a given resample will contain only a single unique (or heavily dominant) value for that feature.
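A minimal Python sketch of this filtering step, using scikit-learn's VarianceThreshold; the threshold value and the sample data are illustrative assumptions rather than part of the original article:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 0.0, 3.2],
    [1.0, 0.0, 1.1],
    [1.0, 0.0, 4.7],
    [1.0, 0.1, 2.9],
])  # column 0 has zero variance, column 1 near-zero variance

selector = VarianceThreshold(threshold=0.01)  # drop features whose variance does not exceed 0.01
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances computed on X
print(selector.get_support())  # boolean mask of the retained features
print(X_reduced.shape)         # only the informative column remains
```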

2. Perform Imputation if required.

Imputation is the process of replacing missing values with estimated values that best fit the data. It should be carried out early in feature engineering, since it affects all downstream preprocessing. The most common methods of imputation include KNN imputation with a Euclidean distance measure and tree-based imputation.

KNN imputation identifies the observations with missing values and then finds the observations that are most similar to them based on the other features. The values from the nearest-neighbour observations are used to fill in the missing values: for quantitative features the imputed value is the neighbours' average, while for qualitative features it is their mode.
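A minimal Python sketch of KNN imputation using scikit-learn's KNNImputer, which measures similarity with a missing-value-aware Euclidean distance. The sample data and the n_neighbors value are illustrative assumptions; note that KNNImputer averages neighbours, so it covers the quantitative case only:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)   # average the 2 most similar observations
X_imputed = imputer.fit_transform(X)  # missing cells replaced by neighbour means
print(X_imputed)
```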

Tree-based imputation is typically done with bagged trees, which are more accurate than a single tree while remaining computationally cheaper than a random forest. Observations with missing values are identified, and each feature containing missing values is treated as a target to be predicted by the bagged trees from the remaining features.
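scikit-learn does not ship a bagged-tree imputer equivalent to R's caret::bagImpute, so the sketch below only approximates the idea: IterativeImputer treats each feature with missing values as a target and predicts it with an ensemble of bagged decision trees. The data and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (IterativeImputer is experimental)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 3.0],
    [7.0, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
imputer = IterativeImputer(estimator=bagged_trees, random_state=0)
X_imputed = imputer.fit_transform(X)  # each missing feature is predicted from the others
print(X_imputed)
```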

3. Normalize to resolve numeric feature skewness.

Normalization refers to rescaling features to a 0-to-1 range. In Python, scikit-learn's MinMaxScaler is used for this; its feature_range hyperparameter controls the target range. To resolve skewness in R, you can use the Box-Cox transformation when the feature values are strictly positive, or the Yeo-Johnson transformation when they are not.
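A minimal Python sketch of both ideas in this step: MinMaxScaler for 0-to-1 rescaling, and PowerTransformer as the scikit-learn counterpart of the Box-Cox and Yeo-Johnson transforms mentioned for R. The sample data is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

X = np.array([[1.0], [2.0], [2.5], [3.0], [50.0]])  # a right-skewed, strictly positive feature

# Rescale to the [0, 1] range.
minmax = MinMaxScaler(feature_range=(0, 1))
X_scaled = minmax.fit_transform(X)

# Reduce skewness: Box-Cox needs strictly positive values, Yeo-Johnson does not.
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)
X_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X)

print(X_scaled.ravel())
print(X_boxcox.ravel())
print(X_yeojohnson.ravel())
```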

4. Feature Scaling: Standardize (center and scale) numeric features.

Unlike min-max scaling, standardization does not bound values to a fixed range, which can be a problem for some algorithms.

Standardization in R entails centering and scaling the numeric variables so that they have zero mean and a standard deviation of 1, which provides a common, comparable unit of measure across all the variables. The feature columns then resemble a standard normal distribution (zero mean, unit variance), making it easier for the model to learn weights.

In Python you can use the StandardScaler from scikit-learn. The StandardScaler is fit once on the training data, and the learned parameters are then used to transform the test data or any new data points.
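A minimal sketch of that workflow, with an illustrative train/test split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
X_test = np.array([[2.5, 350.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean_ and scale_ from the training data only
X_test_std = scaler.transform(X_test)        # apply the same parameters to unseen data

print(scaler.mean_, scaler.scale_)
print(X_train_std)
print(X_test_std)
```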

5. Perform Dimension reduction (such as PCA) on numeric features.

An alternative to manually filtering out non-informative features is dimension reduction. With Principal Component Analysis (PCA), you can retain just the number of components required to explain, say, 95% of the variance and use those components as features in downstream modelling.
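A minimal Python sketch using scikit-learn's PCA: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance. The random data is an illustrative assumption, and features would normally be standardized first:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(100, 5))  # make half the columns redundant

pca = PCA(n_components=0.95)      # keep enough components to explain 95% of the variance
X_components = pca.fit_transform(X)

print(pca.n_components_)                     # number of components actually retained
print(pca.explained_variance_ratio_.sum())   # variance explained by those components
print(X_components.shape)                    # use these columns as model features downstream
```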

6. Encoding: One-hot or dummy encode categorical features.

Categorical columns/variables are transformed with one-hot encoding, which creates a binary column for each level of the variable; dummy encoding drops one of those levels to avoid redundancy. In Python this is done with scikit-learn's OneHotEncoder.
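A minimal sketch with scikit-learn's OneHotEncoder; drop="first" turns one-hot encoding into dummy encoding by dropping a reference level. The example column is illustrative, and sparse_output requires scikit-learn 1.2 or later (older versions use sparse=False):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["blue"], ["green"]])

onehot = OneHotEncoder(sparse_output=False)               # one binary column per level
dummy = OneHotEncoder(sparse_output=False, drop="first")  # dummy encoding: drop one level per feature

print(onehot.fit_transform(X))
print(onehot.categories_)       # the levels behind each binary column
print(dummy.fit_transform(X))
```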