Big data, the buzzword of the moment, attracts scientists and statisticians alike, and it promises big findings for many. The lure of 'quantity' is enough to tempt many into reporting correlations that make no sense outside the realm of chance. When analysing a huge amount of data, it is natural to test many hypotheses in search of correlations between variables. The whole game lies in discerning a genuine correlation from a chance occurrence. It therefore becomes important to formulate a hypothesis on one pool of samples and test its statistical significance on a different pool. Big data is riddled with correlations that arise entirely by chance, and a meaningful study that prefers 'quality' over 'quantity' should avoid these pitfalls.
Here is a funny example: suppose Ram and Sham are two individuals, out of 500, who were born on 7th May 1989. The coincidence is perfectly acceptable, since a sample this large can easily include two individuals with the same birthday. Now one begins to apply an 'infinity hypothesis model', in which countless hypotheses are tested: their birthplace? their parents' names? their hobbies? their drinking habits? and so on. After exhausting all of these, one discovers that both Ram and Sham opted for commerce in high school. Would it be valid to claim that "people born on the 7th of May are more likely to opt for commerce in high school"? No. The claim would fail if tested on other samples; only this limited data set confirms it, and it cannot be validated or generalised to the population.
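To see how easily such chance 'findings' appear, here is a small, hypothetical simulation (not from the example above): 500 people with a random group label and 100 purely random attributes, each attribute tested for association with the label. The library choices, the chi-square test and the numbers are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical simulation: 500 people, a random yes/no group label
# (say, "born on 7th May"), and 100 unrelated random attributes.
rng = np.random.default_rng(0)
n_people, n_attributes = 500, 100

group = rng.integers(0, 2, size=n_people)                        # pure noise
attributes = rng.integers(0, 2, size=(n_people, n_attributes))   # pure noise

false_positives = 0
for j in range(n_attributes):
    # 2x2 contingency table between the group label and attribute j
    table = np.array([
        [np.sum((group == 1) & (attributes[:, j] == 1)),
         np.sum((group == 1) & (attributes[:, j] == 0))],
        [np.sum((group == 0) & (attributes[:, j] == 1)),
         np.sum((group == 0) & (attributes[:, j] == 0))],
    ])
    _, p, _, _ = stats.chi2_contingency(table)
    if p < 0.05:
        false_positives += 1

# With 100 tests at the 5% level, roughly 5 "significant" correlations
# are expected even though every attribute is random noise.
print(f"{false_positives} of {n_attributes} hypotheses look significant by chance")
```

Testing enough hypotheses on noise guarantees that a few will look 'significant', which is exactly the trap in the Ram and Sham story.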
It is legitimate to look for patterns in data. Getting a hunch or a preliminary correlation from a given data set is normal, but its statistical significance should be tested on other samples. For example, if 100 samples are available for studying a certain correlation, dividing them into two parts of 50 each, one for hypothesis generation and the other for hypothesis testing, would be a fair strategy, as in the sketch below.
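A minimal sketch of that 50/50 strategy in Python; the data, variable names and the use of a Pearson correlation are assumptions for illustration only, not a prescribed method.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100 samples with some relationship plus noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.3 * x + rng.normal(size=100)

# Randomly split into an exploration half and a confirmation half.
indices = rng.permutation(100)
explore, confirm = indices[:50], indices[50:]

# Step 1: look for a pattern in the exploration half only.
r_explore, _ = stats.pearsonr(x[explore], y[explore])
print(f"exploration correlation: {r_explore:.2f}")

# Step 2: test that one specific hypothesis on the held-out half.
r_confirm, p_confirm = stats.pearsonr(x[confirm], y[confirm])
print(f"confirmation correlation: {r_confirm:.2f}, p-value: {p_confirm:.3f}")
```

The point of the split is that only one pre-specified hypothesis reaches the confirmation half, so its p-value is not inflated by the many comparisons made during exploration.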
Now what is this statistical test of significance?