Three types of matching to consider when preprocessing data
1. Propensity Score between Treated and Untreated Samples
- Propensity Score Matching
- Makes observational data resemble experimental data (see the sketch after this item's notes)
#If all confounding variables are identified and included as input variables along with the treatment variable, is propensity score matching still necessary?
#Even if propensity score matching helps measure an unexaggerated effect of the treatment variable on the target variable, wouldn't it undermine predictive power? Say there is a model (treatment: male/female, target: promotion) that predicts whether a person will be promoted based on gender. Although gender itself doesn't cause promotion, it can still be a good predictor of promotion (not as a cause but as a correlate), since it is entangled with confounding variables such as social prejudice, education, confidence, and so on.
#Machine learning leans on correlation; statistics (causal inference) leans on causation ... something to ponder.
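A minimal propensity-score-matching sketch, assuming a pandas DataFrame with a binary treatment column and numeric confounder columns. The function name, arguments, and the greedy 1:1 nearest-neighbour scheme are illustrative choices, not something prescribed in these notes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_score_match(df, treatment_col, confounder_cols):
    """Greedy 1:1 nearest-neighbour matching on the estimated propensity score."""
    # Estimate P(treated | confounders) with a logistic regression.
    model = LogisticRegression(max_iter=1000)
    model.fit(df[confounder_cols], df[treatment_col])
    scores = pd.Series(model.predict_proba(df[confounder_cols])[:, 1],
                       index=df.index)

    treated = df.index[df[treatment_col] == 1]
    available = set(df.index[df[treatment_col] == 0])
    matched = []

    # For each treated unit, take the still-unmatched control unit with the
    # closest propensity score (greedy, without replacement).
    for t in treated:
        if not available:
            break
        best = min(available, key=lambda c: abs(scores[c] - scores[t]))
        matched.extend([t, best])
        available.remove(best)

    # The matched subset has similar propensity-score distributions in the
    # treated and control groups, mimicking randomized assignment.
    return df.loc[matched]
```

In practice one would also check covariate balance after matching and possibly impose a caliper (maximum allowed score distance), which this sketch omits.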
2. Ratio between Target Labels
- Data Balancing
- Each label is equally distributed (1:1, 1:1:1, 1:1:1:1, etc.)
- This is not the only solution to data imbalance (skewed data). I prefer adjusting the decision threshold rather than losing data through downsampling; both options are sketched below.
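A rough sketch contrasting the two options on synthetic data. The dataset, classifier, and the threshold value of 0.2 are placeholders; in practice the threshold would be tuned on a validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic, heavily skewed binary data (~95% class 0, ~5% class 1).
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# (a) Downsampling: shrink the majority class to the minority-class size.
maj_idx, min_idx = np.where(y == 0)[0], np.where(y == 1)[0]
maj_down = resample(maj_idx, replace=False, n_samples=len(min_idx), random_state=0)
idx = np.concatenate([maj_down, min_idx])
X_bal, y_bal = X[idx], y[idx]           # balanced 1:1, but data is thrown away

# (b) Threshold adjustment: train on all data, then lower the cutoff from the
#     default 0.5 so the minority class is not ignored at prediction time.
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]
threshold = 0.2                          # would be chosen by validation
pred = (proba >= threshold).astype(int)
```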
3. Distribution of Target Labels between Train and Test sets
- (Is there a specific name for this? Stratified sampling/splitting seems to be the closest term; see the sketch below.)
- Yet to explain.
#Should they match?
#Should they match even if the data is collected by experiment?
#Should the test set match the real-world distribution? (In credit scoring, defaults are a tiny fraction of real data, yet during model development the classes are often balanced to roughly equal numbers of defaults and non-defaults, though not always.)
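A minimal sketch of a label-stratified split using scikit-learn's train_test_split with the stratify argument, which keeps the label ratio roughly equal across the train and test sets. The synthetic 5% positive rate is only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 5% positives, e.g. "default" in a credit-scoring setting.
y = np.array([0] * 950 + [1] * 50)
X = np.random.default_rng(0).normal(size=(1000, 3))

# stratify=y makes both splits preserve the ~5% positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())  # both close to 0.05
```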