2015년 7월 31일 금요일

[editing] Ensemble learning for trees





Problem: A single tree overfits the training data. (= high variance)
Solutions: Bagging, Boosting, Random Forests

General performance: Boosting > Random Forests > Bagging > Single Tree

1. Bagging (= bootstrap aggregation)

  • Draw bootstrap samples of equal size (sampling with replacement) -> fit a tree to each
  • Average their predictions! (or majority-vote for classification)


2. Random Forests

  • De-correlate! How? 
  • At each split point, pick sqrt(number of features) random features as the candidate set


3. Boosting (= stage-wise additive modeling)

  • Adaboost
  • Gradient Boosting Machine
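A quick sketch of all three on a toy dataset, assuming scikit-learn is installed (the estimator names and parameters below are sklearn's API, not from this note):

```python
# Compare bagging, random forest, and boosting on a made-up dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=100, random_state=0),
    # max_features="sqrt" is the de-correlation trick described above
    "random forest": RandomForestClassifier(n_estimators=100,
                                            max_features="sqrt", random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```

On real data the ranking can deviate from the rule of thumb above, which is why the cross-validated comparison is worth running.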



2015년 7월 23일 목요일

R project management

References:
- Cross Validated: http://stats.stackexchange.com/questions/2910/how-to-efficiently-manage-a-statistical-analysis-project

Factor Analysis & Principal Component Analysis

References:
- blog: http://ai-times.tistory.com/112

SSE, SEE

SSE (Sum of Squared Errors): sum((actual y - predicted y)^2)
SEE (Standard Error of the Estimate): sqrt(SSE / (n - k))     (where k is the number of estimated parameters. e.g. y = 1 + x => k = 2)

SST = SSE + SSR
R^2 = SSR/SST (the proportion of variance explained by the regression line)
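The formulas above, checked on a tiny made-up dataset (pure-Python sketch; the data points are arbitrary):

```python
import math

# Made-up data and a least-squares line y = b0 + b1*x (so k = 2 parameters).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares
see = math.sqrt(sse / (n - 2))                        # k = 2: intercept + slope
r2 = ssr / sst

print(f"SSE={sse:.4f}  SEE={see:.4f}  R^2={r2:.4f}")
```

For least squares with an intercept, SST = SSE + SSR holds exactly, which is what makes R^2 = SSR/SST a number between 0 and 1.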

Tricky, Tricky Credit Scoring System

1. A traditional CSS is not one multivariate logistic regression but many univariate logistic regressions.
We run a univariate logistic regression for each binned interval of each variable.
Visualized, the result looks like a piecewise 'line graph' with kinks at the bin boundaries.

2. What the logit in a CSS is interested in is not the expected target value but the slope (= the relationship with the target = how much score to give per unit increase in the variable).

3. Dummy variables are not used the way they are used in typical machine learning problems. Don't get fooled.

4.

2015년 7월 17일 금요일

t-test

A probability distribution used for hypothesis testing rather than for direct probability calculation.
Typically used when the sample size is small (fewer than about 30) and the population variance is unknown.
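For example, the one-sample t statistic can be computed by hand (pure-Python sketch; the sample values and null mean are made up):

```python
import math

# Made-up small sample (n < 30), testing H0: population mean == 5.0
sample = [4.8, 5.1, 5.4, 4.6, 5.3, 4.9, 5.2]
mu0 = 5.0

n = len(sample)
x_bar = sum(sample) / n
# Sample standard deviation (n - 1 in the denominator).
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
# t statistic with n - 1 degrees of freedom.
t = (x_bar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f} with {n - 1} degrees of freedom")
```

Comparing this t to the t distribution with n - 1 degrees of freedom (rather than the normal distribution) is exactly what accounts for the small sample.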


References:

- blog: http://math7.tistory.com/55

p-value


  • p-value: the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true (not the probability that the null hypothesis is true)
  • If p-value < significance level, the null hypothesis is rejected.
  • p-values are not limited to the chi-square distribution; they apply to the normal distribution (and any other test statistic's distribution) as well
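For a test statistic on the normal distribution, a two-sided p-value can be computed with the standard library's error function (sketch; the z value is made up):

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic.

    For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

z = 1.96  # made-up test statistic
p = two_sided_p_from_z(z)
print(f"p = {p:.4f}")  # ~0.05 for |z| = 1.96
```

The same idea applies to any distribution, just with that distribution's tail function in place of erfc.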



References:

- blog: http://statistics-cahn.blogspot.kr/2007/10/p-value.html

2015년 7월 16일 목요일

F-score & AUROC

#fscore#fmeasure#f-score#auroc#roc#threshold

I've been using F-scores when validating binary classifiers on imbalanced data sets. Recently I found out that using AUROC can make life a bit easier: with AUROC, I don't have to find the optimal threshold for every new classifier. The only time I optimize the threshold is with the final classifier.
  • F-score: a score for a given threshold
  • AUROC: a score for varying threshold


For a classifier with a given AUROC, there could be many different F-scores as threshold varies.

So,

first, find the classifier with the largest AUROC,

and then, find the threshold that yields the largest F-score.
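Both metrics computed by hand on made-up labels and scores (pure-Python sketch) to show the difference: F1 needs a threshold, AUROC does not.

```python
# Made-up true labels and classifier scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.8]

def f1_at_threshold(y_true, scores, threshold):
    """F1 for hard predictions made at a single cutoff."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (the Mann-Whitney view of AUROC; ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print("AUROC:", auroc(y_true, scores))  # one number, no threshold
for th in (0.3, 0.5, 0.7):
    print(f"F1 @ {th}:", round(f1_at_threshold(y_true, scores, th), 3))
```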


References:

- Cross Validated: http://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves



Dealing with Imbalanced Data Set

1. Use the PR (precision-recall) curve instead of AUROC (area under the receiver operating characteristic curve).

- Stack Exchange: http://stats.stackexchange.com/questions/64047/effective-validity-of-auroc-as-performance-measure-what-about-very-high-auroc


2. Balance data by down sampling or up sampling

3. Optimize threshold (= cutoff)

4. ...and more to be explored
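A minimal sketch of option 2, upsampling the minority class with replacement (pure Python; the dataset and class sizes are made up):

```python
import random

random.seed(0)

# Made-up imbalanced dataset: (features, label) pairs, 95:5 split.
majority = [([i, i * 2], 0) for i in range(95)]
minority = [([i, i * 3], 1) for i in range(5)]

# Upsample: draw minority rows with replacement until classes balance.
upsampled_minority = random.choices(minority, k=len(majority))
balanced = majority + upsampled_minority
random.shuffle(balanced)

labels = [y for _, y in balanced]
print("class 0:", labels.count(0), "class 1:", labels.count(1))
```

Down sampling is the mirror image: keep all of the minority class and draw only len(minority) rows from the majority class.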

2015년 7월 14일 화요일

2015년 7월 12일 일요일

Discriminative model vs. Generative model


Discriminative model: P(y|x)

  • A conditional probability of y given x
  • Learns a direct mapping / decision boundary h(x) -> {0, 1}
  • As a rule of thumb, discriminative models reach lower asymptotic error, while generative models can get close with less data.
  • Examples: Logistic regression, SVM, Neural networks

Generative model: P(x, y)

  • A joint distribution of x and y
  • P(x|y) * P(y)
  • A discriminative model P(y|x) can be recovered via Bayes' Theorem: P(y|x) ∝ P(x|y) * P(y)
  • Possible to generate new data by sampling from P(x, y)
  • Examples: Naive Bayes
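A minimal generative-model sketch: a Bernoulli Naive Bayes that estimates P(y) and P(x|y) from made-up binary data, then scores P(y|x) ∝ P(x|y) * P(y):

```python
# Made-up binary features (e.g., word presence) with labels.
X = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

classes = sorted(set(y))
n_features = len(X[0])

# Estimate P(y) and P(x_j = 1 | y), with Laplace smoothing on the latter.
prior = {c: y.count(c) / len(y) for c in classes}
cond = {
    c: [(sum(row[j] for row, label in zip(X, y) if label == c) + 1)
        / (y.count(c) + 2)
        for j in range(n_features)]
    for c in classes
}

def posterior(x):
    """P(y | x) via Bayes' Theorem: proportional to P(x | y) * P(y)."""
    scores = {}
    for c in classes:
        p = prior[c]
        for j, xj in enumerate(x):
            p *= cond[c][j] if xj == 1 else 1 - cond[c][j]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior([1, 0, 1]))  # should favor class 1
```

Because the model holds the full joint P(x, y), sampling new (x, y) pairs from prior and cond is also possible, which is exactly the "generate new data" point above.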



References:

- Stack Overflow: http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

- Lecture note by Andrew Ng: http://cs229.stanford.edu/notes/cs229-notes2.pdf

- Machine learning lecture by Andrew Ng: http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=06.1-NaiveBayes-GenerativeLearningAlgorithms&speed=100

- Quora: http://www.quora.com/What-are-the-differences-between-generative-and-discriminative-machine-learning

2015년 7월 1일 수요일

Setting up Monary

#Monary#MongoDB#NumPy#Pandas#PyMongo


I had a hard, hard time setting up a data science environment with MongoDB, Monary, Python, NumPy, and Pandas.

The problem occurred with Monary! (It converts MongoDB collections into NumPy arrays much faster than PyMongo does.)


First, install mongo-c-driver.
$ brew install git gcc automake autoconf libtool
$ git clone https://github.com/mongodb/mongo-c-driver.git
$ cd mongo-c-driver
$ ./autogen.sh
$ make
$ sudo make install 

Then, install monary (clone it, then the usual setup.py install; adjust to your environment).

$ hg clone https://bitbucket.org/djcbeach/monary
$ cd monary
$ python setup.py install


Hughhh, done!!