2015년 7월 31일 금요일

[editing] Ensemble learning for trees





Problem: A single tree overfits the training data. (= high variance)
Solutions: Bagging, Boosting, Random Forests

General performance: Boosting > Random Forests > Bagging > Single Tree

1. Bagging (= bootstrap aggregation)

  • Draw bootstrap samples of equal size (sampling with replacement) -> fit a tree to each
  • Average their predictions! (or majority-vote for classification)


2. Random Forests

  • De-correlate! How? 
  • At each split point, pick sqrt(number of features) random features as the candidate set


3. Boosting (= stage-wise additive modeling)

  • Adaboost
  • Gradient Boosting Machine
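A quick sketch of all three on a toy dataset, assuming scikit-learn is installed (the estimator names and parameters below are sklearn's API, not from this note):

```python
# Compare bagging, random forest, and boosting on a made-up dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=100, random_state=0),
    # max_features="sqrt" is the de-correlation trick described above
    "random forest": RandomForestClassifier(n_estimators=100,
                                            max_features="sqrt", random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```

On real data the ranking can deviate from the rule of thumb above, which is why the cross-validated comparison is worth running.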



2015년 7월 23일 목요일

R project management

References:
- Cross Validated: http://stats.stackexchange.com/questions/2910/how-to-efficiently-manage-a-statistical-analysis-project

Factor Analysis & Principal Component Analysis

References:
- blog: http://ai-times.tistory.com/112

SSE, SEE

SSE (Sum of Squared Errors): sum((actual y - predicted y)^2)
SEE (Standard Error of the Estimate): sqrt(SSE / (n - k))     (where k is the number of estimated parameters. e.g. y = 1 + x => k = 2)

SST = SSE + SSR
R^2 = SSR/SST (the proportion of variance explained by the regression line)
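The formulas above, checked on a tiny made-up dataset (pure-Python sketch; the data points are arbitrary):

```python
import math

# Made-up data and a least-squares line y = b0 + b1*x (so k = 2 parameters).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares
see = math.sqrt(sse / (n - 2))                        # k = 2: intercept + slope
r2 = ssr / sst

print(f"SSE={sse:.4f}  SEE={see:.4f}  R^2={r2:.4f}")
```

For least squares with an intercept, SST = SSE + SSR holds exactly, which is what makes R^2 = SSR/SST a number between 0 and 1.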

Tricky, Tricky Credit Scoring System

1. A traditional CSS is not one multivariate logistic regression but many univariate logistic regressions.
We run a univariate logistic regression for each binned interval of each variable.
Visualized, the result looks like a piecewise 'line graph' with kinks at the bin boundaries.

2. What the logit in a CSS is interested in is not the expected target value but the slope (= the relationship with the target = how much score to give per unit increase in the variable).

3. Dummy variables are not used the way they are used in typical machine learning problems. Don't get fooled.

4.

2015년 7월 17일 금요일

t-test

A probability distribution used for hypothesis testing rather than for direct probability calculation.
Typically used when the sample size is small (fewer than about 30) and the population variance is unknown.
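For example, the one-sample t statistic can be computed by hand (pure-Python sketch; the sample values and null mean are made up):

```python
import math

# Made-up small sample (n < 30), testing H0: population mean == 5.0
sample = [4.8, 5.1, 5.4, 4.6, 5.3, 4.9, 5.2]
mu0 = 5.0

n = len(sample)
x_bar = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator).
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
# t statistic with n - 1 degrees of freedom.
t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f} with {n - 1} degrees of freedom")
```

Comparing this t to the t distribution with n - 1 degrees of freedom (rather than the normal distribution) is exactly what accounts for the small sample.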


References:

- blog: http://math7.tistory.com/55

p-value


  • p-value: the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true (not the probability that the null hypothesis is true)
  • If p-value < significance level, the null hypothesis is rejected.
  • p-values are not limited to the chi-square distribution; they apply to the normal distribution (and any other test statistic's distribution) as well
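For a test statistic on the normal distribution, a two-sided p-value can be computed with the standard library's error function (sketch; the z value is made up):

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

z = 1.96  # made-up test statistic
p = two_sided_p_from_z(z)
print(f"p = {p:.4f}")  # ~0.05 for |z| = 1.96
```

The same idea applies to any distribution, just with that distribution's tail function in place of erfc.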



References:

- blog: http://statistics-cahn.blogspot.kr/2007/10/p-value.html

2015년 7월 16일 목요일

F-score & AUROC

#fscore#fmeasure#f-score#auroc#roc#threshold

I've been using F-scores when validating binary classifiers on imbalanced data sets. Recently I found out that using AUROC can make life a bit easier: with AUROC, I don't have to find the optimal threshold for every new classifier. The only time I optimize the threshold is with the final classifier.
  • F-score: a score for a given threshold
  • AUROC: a score for varying threshold


For a classifier with a given AUROC, there could be many different F-scores as threshold varies.

So,

first, find the classifier with the largest AUROC,

and then, find the threshold that yields the largest F-score.
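Both metrics computed by hand on made-up labels and scores (pure-Python sketch) to show the difference: F1 needs a threshold, AUROC does not.

```python
# Made-up true labels and classifier scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.8]

def f1_at_threshold(y_true, scores, threshold):
    """F1 for hard predictions made at a single cutoff."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (the Mann-Whitney view of AUROC; ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print("AUROC:", auroc(y_true, scores))  # one number, no threshold
for th in (0.3, 0.5, 0.7):
    print(f"F1 @ {th}:", round(f1_at_threshold(y_true, scores, th), 3))
```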


References:

- Cross Validated: http://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves



Dealing with Imbalanced Data Set

1. Use the PR (precision-recall) curve instead of AUROC (area under the receiver operating characteristic curve).

- Stack Exchange: http://stats.stackexchange.com/questions/64047/effective-validity-of-auroc-as-performance-measure-what-about-very-high-auroc


2. Balance data by down sampling or up sampling

3. Optimize threshold (= cutoff)

4. ...and more to be explored
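A minimal sketch of option 2, upsampling the minority class with replacement (pure Python; the dataset and class sizes are made up):

```python
import random

random.seed(0)

# Made-up imbalanced dataset: (features, label) pairs, 95:5 split.
majority = [([i, i * 2], 0) for i in range(95)]
minority = [([i, i * 3], 1) for i in range(5)]

# Upsample: draw minority rows with replacement until classes balance.
upsampled_minority = random.choices(minority, k=len(majority))
balanced = majority + upsampled_minority
random.shuffle(balanced)

labels = [y for _, y in balanced]
print("class 0:", labels.count(0), "class 1:", labels.count(1))
```

Down sampling is the mirror image: keep all of the minority class and draw only len(minority) rows from the majority class.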

2015년 7월 14일 화요일

2015년 7월 12일 일요일

Discriminative model vs. Generative model


Discriminative model: P(y|x)

  • A conditional probability of y given x
  • Learns a direct mapping / decision boundary h(x) -> {0, 1}
  • As a rule of thumb, discriminative models reach lower asymptotic error, while generative models can get close with less data.
  • Examples: Logistic regression, SVM, Neural networks

Generative model: P(x, y)

  • A joint distribution of x and y
  • P(x|y) * P(y)
  • A discriminative model P(y|x) can be recovered via Bayes' Theorem: P(y|x) ∝ P(x|y) * P(y)
  • Possible to generate new data by sampling from P(x, y)
  • Examples: Naive Bayes
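A minimal generative-model sketch: a Bernoulli Naive Bayes that estimates P(y) and P(x|y) from made-up binary data, then scores P(y|x) ∝ P(x|y) * P(y):

```python
# Made-up binary features (e.g., word presence) with labels.
X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

classes = sorted(set(y))
n_features = len(X[0])

# Estimate P(y) and P(x_j = 1 | y), with Laplace smoothing on the latter.
prior = {c: y.count(c) / len(y) for c in classes}
cond = {
    c: [(sum(row[j] for row, label in zip(X, y) if label == c) + 1)
        / (y.count(c) + 2)
        for j in range(n_features)]
    for c in classes
}

def posterior(x):
    """P(y | x) via Bayes' Theorem: proportional to P(x | y) * P(y)."""
    scores = {}
    for c in classes:
        p = prior[c]
        for j, xj in enumerate(x):
            p *= cond[c][j] if xj == 1 else 1 - cond[c][j]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior([1, 0, 1]))  # should favor class 1
```

Because the model holds the full joint P(x, y), sampling new (x, y) pairs from prior and cond is also possible, which is exactly the "generate new data" point above.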



References:

- Stack Overflow: http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

- Lecture note by Andrew Ng: http://cs229.stanford.edu/notes/cs229-notes2.pdf

- Machine learning lecture by Andrew Ng: http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=06.1-NaiveBayes-GenerativeLearningAlgorithms&speed=100

- Quora: http://www.quora.com/What-are-the-differences-between-generative-and-discriminative-machine-learning

2015년 7월 1일 수요일

Setting up Monary

#Monary#MongoDB#NumPy#Pandas#PyMongo


I had a hard, hard time setting up a data science environment with MongoDB, Monary, Python, NumPy, and Pandas.

The problem occurred with Monary! (It converts MongoDB collections into NumPy arrays much faster than PyMongo does.)


First, install mongo-c-driver.
$ brew install git gcc automake autoconf libtool
$ git clone https://github.com/mongodb/mongo-c-driver.git
$ cd mongo-c-driver
$ ./autogen.sh
$ make
$ sudo make install 

Then, install monary (clone it, then the usual setup.py install; adjust to your environment).

$ hg clone https://bitbucket.org/djcbeach/monary
$ cd monary
$ python setup.py install


Hughhh, done!!