Wednesday, October 14, 2015

Poisson Distribution as a Special Case of Binomial Distribution

I wondered what the difference is between the Poisson and binomial distributions.

The difference turns out to be very slight.

In fact, the Poisson distribution is a special (limiting) case of the binomial distribution!

Poisson Distribution

  • n (the number of trials): huge
  • p (the probability of success): small
  • np (the expected number of successes): held fixed at a constant λ

Check out this article for a mathematical proof!
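The limit can also be checked numerically. The sketch below is only an illustration using the Python standard library; the value of λ, the range of k, and the helper names are my own choices, not taken from the linked article. It holds λ = np fixed while n grows and measures how far apart the two probability mass functions are.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_abs_gap(n, lam, k_max=20):
    """Largest |Binomial - Poisson| pmf difference over k = 0..k_max, with p = lam / n."""
    p = lam / n
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )

lam = 3.0  # illustrative value of lambda = n * p
gaps = {n: max_abs_gap(n, lam) for n in (30, 300, 3000, 30000)}
for n, gap in gaps.items():
    print(f"n={n:>6}  p={lam / n:.5f}  max gap={gap:.6f}")
```

The printed gap should shrink as n grows and p shrinks, which is the sense in which the binomial distribution approaches the Poisson distribution.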

QQ plot


  • QQ plot = Quantile-Quantile Plot; when the reference distribution is normal, it is also called a Normal Probability Plot
  • Used to check visually whether data follows a normal distribution
  • Caution: Quantiles are points, not intervals.

  1. Suppose there are n samples. Sort them and express each one as a z score.
  2. Calculate the (n+1)-quantiles of the standard normal distribution. (There are n such cut points, at probabilities i/(n+1). They can be calculated with the NORMSINV function in Excel, so they are also expressed as z scores.)
  3. Plot the i-th smallest sample z score (vertical axis) against the i-th quantile (horizontal axis). If the sample's z score is smaller than the quantile, the point falls below the straight line; if larger, above it. This is how to read a QQ plot.
  4. If the points at both ends bend above the straight line, it suggests right skewness. For more examples, refer to this great question and answer from CrossValidated. (http://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot)
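The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full plotting routine: `qq_points` is a hypothetical helper name, the data is made up, and `statistics.NormalDist().inv_cdf` stands in for Excel's NORMSINV. It assumes the sample z scores go on the vertical axis and the reference line is y = x.

```python
from statistics import NormalDist, mean, stdev

def qq_points(samples):
    """Return (theoretical quantile, sample z score) pairs for a normal QQ plot."""
    n = len(samples)
    m, s = mean(samples), stdev(samples)
    # Sorted z scores: the i-th entry is the i-th smallest sample.
    z_scores = sorted((x - m) / s for x in samples)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # The n cut points of the (n+1)-quantiles sit at probabilities i / (n + 1).
    theoretical = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(theoretical, z_scores))

data = [2.1, 2.4, 2.5, 2.9, 3.0, 3.3, 9.8]  # made-up sample with one large value
for q, z in qq_points(data):
    side = "above" if z > q else "on or below"
    print(f"quantile={q:+.3f}  sample z={z:+.3f}  ({side} the line y = x)")
```

Feeding the pairs to any scatter plot tool, with the line y = x drawn on top, gives the QQ plot itself.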



Saturday, October 10, 2015

Three matchings to be considered when preprocessing data


1. Propensity Score between Treated and Untreated Samples

- Propensity Score Matching
- Makes observational data resemble experimental data

#If all confounding variables are identified and included as input variables along with the treatment variable, is propensity score matching still necessary?
#Even if propensity score matching helps measure the unexaggerated effect of the treatment variable on the target variable, wouldn't it undermine predictive power? Say there is a model (treatment: male/female, target: promotion) that predicts whether a person will be promoted based on gender. Even if gender itself does not cause promotion, it can still be a good predictor of promotion (as a correlate, not a cause) because of confounding variables such as social prejudice, education, confidence, and so on.
#Machine learning focuses on correlation, statistics on causation... something to ponder.


2. Ratio between Target Labels

- Data Balancing
- Each label is equally represented (1:1, 1:1:1, 1:1:1:1, etc.)
- This is not the only solution to data imbalance (skewed data). I prefer adjusting the decision threshold so as not to lose data through downsampling.
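As a rough illustration of adjusting the threshold instead of downsampling, the sketch below keeps every sample and searches a few candidate thresholds. The labels and scores are made up, the helper names are hypothetical, and using F1 as the selection criterion is my assumption; any metric suited to the problem could take its place.

```python
def confusion_counts(y_true, scores, threshold):
    """Count (TP, FP, FN, TN) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_at(y_true, scores, threshold):
    """F1 score of the positive class at the given threshold."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at(y_true, scores, t))

# Made-up imbalanced data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.60]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

chosen = best_threshold(y_true, scores, candidates)
print("default 0.5 ->", confusion_counts(y_true, scores, 0.5))
print("chosen", chosen, "->", confusion_counts(y_true, scores, chosen))
```

In practice the threshold would be chosen on a validation set rather than on the data used for the final evaluation.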



3. Distribution of Target Labels between Train and Test sets

- (Is there a specific name for this?)
- Yet to be explained.

#Should they match?
#Should they match even if the data is collected by experiment?
#Should the test set match the real-world distribution? (In credit scoring, defaults are rare, yet during model development the data is often balanced to contain as many defaults as non-defaults, though not always.)
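One common name for keeping the label proportions the same in both sets is stratified sampling (a stratified split). The sketch below shows the idea with the standard library only; `stratified_split`, the 25% test fraction, and the labels are made up for illustration, and it does not answer whether the two sets should match in the first place.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with label proportions roughly preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    # Split each label group separately so both sets keep the same mix.
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

labels = ["default"] * 20 + ["ok"] * 180  # made-up data, 10% defaults
train_idx, test_idx = stratified_split(labels)
print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```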



OpenIntro Statistics Chapter #1

Statistics is the study of collecting and analyzing data.

Before data science, there already was statistics, the science of collecting and analyzing data.

So, what is the difference?

Maybe the difference lies in where it is applied.

Perhaps statistics had not been widely applied to the business world, and more specifically to the startup scene, where people like to coin fancy new words.

By the way, one thing that intrigued me in this chapter was the part on data collection.

I never knew that regression supports causal conclusions only when the data comes from a randomized experiment.

Since my first step into data science was through machine learning, or computer science, my statistical background was thin. I applied regression to any data, whether it was collected by experiment or by observation.

Therefore, to draw causal conclusions from regression on observational data, we need to make the data resemble data collected by experiment. Propensity score matching does this by making the variables that affect selection into treatment similarly distributed among the treated and untreated samples.
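To make the matching step concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching on propensity scores. It assumes the scores have already been estimated elsewhere (for example with a logistic regression of treatment on the confounders); the function name, the caliper of 0.05, and the scores are all made up for illustration, and real studies use more careful matching schemes.

```python
def match_on_propensity(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping a unit id to its estimated propensity score.
    Returns (treated_id, control_id) pairs whose scores differ by at most caliper.
    """
    available = dict(control)  # controls not yet used in a pair
    pairs = []
    # Process treated units in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is matched at most once
    return pairs

# Made-up propensity scores.
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
control = {"c1": 0.10, "c2": 0.32, "c3": 0.52, "c4": 0.60, "c5": 0.15}

print(match_on_propensity(treated, control))
```

Treated units with no control inside the caliper (here "t1") are left unmatched, so the matched sample only compares units that had a similar chance of being treated.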