Topic > Handling Missing Values: One of the Most Important Issues in Data Preprocessing

Handling Missing Values ​​(MV) is an important issue in data preprocessing in data mining. One reason is that data attributes can be aggregated from different sources. Cases may not exist in all data sources. The other reason is due to the omission of reporting. The simplest way to handle VMs is to discard cases that contain at least one VM. However, this is only practical when the data contain a limited number of cases with MV and when analysis of complete cases will not lead to severely biasing results for inference. For example, in our study, between 10% and 30% of students did not achieve high school GPA or SAT scores. It is impossible to simply discard these students, as most of them are international students or transfer students who make up a major subset of the population. Furthermore, it is impractical to discard these variables, as they have been shown to be important predictors for predicting student achievement. Therefore, it is important to apply an appropriate imputation strategy on the data. Say no to plagiarism. Get a tailor-made essay on "Why Violent Video Games Shouldn't Be Banned"? Get an Original Essay There are also several data mining methods. Unlike traditional explanatory models where the goal is to explore the relationship between an outcome variable and explanatory variables, the goal of the data mining model is to make predictions on a new data set. There is a target variable, which can be continuous or categorical. There are also predictors, called characteristics, that measure a set of characteristics of sample members. By applying different data mining models, you can create a prediction model based on current data. The model can be applied to new data, where a new set of feature values ​​is used to make predictions. Different data mining methods have different algorithms and will therefore result in different prediction performance. Based on Luengo, imputation methods can improve data mining methods for different categories, as there may be an interaction between imputation strategies and data mining methods. We would like to explore how this works on our data. In this chapter we will first introduce the imputation strategies applied in this thesis. Then, we will introduce the data mining methods applied to our data. Third, a commonly used oversampling method, SMOTE, will be introduced to address the problem of imbalanced data. Imbalanced data typically refers to a problem with classification problems where classes are not equally represented. For example, in our dataset, there are approximately 3000 students in total, of which 90% are labeled as passing students and the remaining 10% are labeled as failing students. Most machine learning methods do not work well on imbalanced data. Therefore, it is necessary to use techniques to address the problem of imbalanced data. SMOTE is one of them.