Big Data's Recency Bias Problem (Part 1)

You may be familiar with the statistic that 90% of the world’s data was created in the last few years. It’s true. One of the first mentions of this particular formulation I can find dates back to May 2013, but the trend remains remarkably constant. Indeed, every two years for about the last three decades the amount of data in the world has increased by about 10 times – a rate that puts even Moore’s law of doubling processor power to shame.

全世界90%的數據都是最近幾年生成的,人們對這個結論可能已經耳熟能詳。儘管我能找到的這個說法的最早出處是在2013年5月,但是,這種趨勢卻始終未曾發生變化。事實上,過去30年間,每隔兩年,全球總數據量就會增長大約10倍——這讓計算機行業的摩爾定律相形見絀。
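The two figures in the opening paragraph agree with each other: if the total volume of data grows roughly tenfold every two years, then about 90% of everything that exists at any moment was created in the preceding two years. A minimal Python sketch of that arithmetic follows, assuming an idealised, constant growth factor; the function name `share_created_in_last_period` is illustrative and does not come from the article.

```python
def share_created_in_last_period(growth_factor: float) -> float:
    """Fraction of the current total that was added during the most recent period.

    Assumes the total is multiplied by `growth_factor` each period,
    so the previous total equals current_total / growth_factor.
    """
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    return 1 - 1 / growth_factor


# Tenfold growth every two years, as described in the article.
print(f"{share_created_in_last_period(10):.0%}")  # 90%
# For comparison, a plain doubling per period.
print(f"{share_created_in_last_period(2):.0%}")   # 50%
```

Under this simplified model the share depends only on the growth factor, which is why the "90% in the last few years" statement can stay true year after year.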

One of the problems with such a rate of information increase is that the present moment will always loom far larger than even the recent past. Imagine looking back over a photo album representing the first 18 years of your life, from birth to adulthood. Let’s say that you have two photos for your first two years. Assuming a rate of information increase matching that of the world’s data, you will have an impressive 2,000 photos representing the years six to eight; 200,000 for the years 10 to 12; and a staggering 200,000,000 for the years 16 to 18. That’s more than three photographs for every single second of those final two years.

信息爆炸所帶來的問題之一在於,即便和不久之前相比,當前的信息量規模都會大到不可思議的程度。假如有一本信息影集代表了你從嬰兒到成年的前18年人生,並且照片數量的增長速度和全球數據量保持一致,如果頭兩年你只有兩張照片,那麼從6歲到8歲的兩年間你就會有兩千張照片,從10歲到12歲有20萬張,從16歲到18歲則有驚人的2億張,這意味着在16-18歲期間你每秒鐘就會拍3張照片。
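The photo-album figures can be checked with a few lines of Python. This is only a sketch of the article's thought experiment: the starting count of two photos and the tenfold growth per two-year period are taken from the text, while the constant and function names are made up for illustration and leap days are ignored.

```python
GROWTH_PER_PERIOD = 10   # tenfold increase every two years (from the article)
PHOTOS_FIRST_PERIOD = 2  # two photos for ages 0 to 2 (from the article)
PERIOD_YEARS = 2


def photos_in_period(start_age: int) -> int:
    """Photos taken in the two-year span beginning at `start_age`."""
    periods_elapsed = start_age // PERIOD_YEARS
    return PHOTOS_FIRST_PERIOD * GROWTH_PER_PERIOD ** periods_elapsed


# One line per two-year span from birth to age 18.
for start_age in range(0, 18, PERIOD_YEARS):
    end_age = start_age + PERIOD_YEARS
    print(f"ages {start_age:>2}-{end_age:<2}: {photos_in_period(start_age):>11,}")

# Rate during the final period, ignoring leap days.
seconds_in_two_years = PERIOD_YEARS * 365 * 24 * 60 * 60
print(f"{photos_in_period(16) / seconds_in_two_years:.2f} photos per second")  # 3.17

# Share of the whole album that comes from the final two years.
total = sum(photos_in_period(age) for age in range(0, 18, PERIOD_YEARS))
print(f"{photos_in_period(16) / total:.0%} of all photos are from ages 16-18")  # 90%
```

The loop reproduces the article's 2,000, 200,000 and 200,000,000 figures, and the last line shows how heavily the final two years dominate the album.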

This isn’t a perfect analogy with global data, of course. For a start, much of the world’s data increase is due to more sources of information being created by more people, along with far larger and more detailed formats. But the point about proportionality stands. If you were to look back over a record like the one above, or try to analyse it, the more distant past would shrivel into meaningless insignificance. How could it not, with so many times less information available?

當然,這個比喻並不能完全對應全球數據的情況。首先,全球數據的增長在很大程度上是因爲更多的人創造了更多的信息源,以及更大、更詳細的數據格式。但有關比例的道理依然成立。如果你回顧或嘗試分析上文所述那樣的記錄,越久遠的過去就越會萎縮到無足輕重的地步。可用的信息少了這麼多倍,又怎麼可能不是這樣呢?

Here’s the problem with much of the big data currently being gathered and analysed. The moment you start looking backwards to seek the longer view, you have far too much of the recent stuff and far too little of the old. Short-sightedness is built into the structure, in the form of an overwhelming tendency to over-estimate short-term trends at the expense of history.

這就是目前大數據採集分析中存在的一項弊端。無論你在哪一個時間點開始回顧歷史,都會遇到同一個麻煩:近期數據的數量遠遠超過遠期歷史數據,由此,這個分析系統會過度重視短期趨勢而忽略長期趨勢,從而受到短視的困擾。

To understand why this matters, consider the findings from social science about ‘recency bias’, which describes the tendency to assume that future events will closely resemble recent experience. It’s a version of what is also known as the availability heuristic: the tendency to base your thinking disproportionately on whatever comes most easily to mind. It’s also a universal psychological attribute. If the last few years have seen exceptionally cold summers where you live, for example, you might be tempted to state that summers are getting colder – or that your local climate may be cooling. In fact, you shouldn’t read anything whatsoever into the data. You would need to take a far, far longer view to learn anything meaningful about climate trends. In the short term, you’d be best not speculating at all – but who among us can manage that?

爲了理解這個問題的重要性,需要考慮社會科學中有關“近因偏差”(recency bias,又稱近因效應)的研究發現。近因偏差是指:人們在判斷事物發展趨勢時,會認爲未來事件將會和近期體驗高度類似。這可以說是某種“可利用性法則”(availability heuristic)——不恰當地以最容易認知的信息來作爲思考的基礎。這還是一種普遍的心理學特徵。舉例來說,如果在你居住的地方,過去幾年的夏季氣溫都很低,那麼你可能會認爲夏季氣候正在變得更冷——或者說你當地的氣候正在變冷。但是,你不應該只根據少量數據分析長期趨勢。你需要有一個長遠視角,才能認識真正有意義的氣候趨勢。短時期內,最好不進行任何猜測。不過,我們之中又有誰能真正做到這點呢?