Big Data's Recency Bias Trouble (Part 2)

The same tends to be true of most complex phenomena in real life: stock markets, economies, the success or failure of companies, war and peace, relationships, the rise and fall of empires. Short-term analyses aren’t only invalid – they’re actively unhelpful and misleading. Just look at the legions of economists who lined up to pronounce events like the 2009 financial crisis unthinkable right up until it happened. The very notion that valid predictions could be made on that kind of scale was itself part of the problem.

現實生活中大部分複雜事物的現象正是如此:股票市場、經濟發展、企業的成功與失敗、戰爭與和平、國家關係、帝國的崛起和衰落等等。短期分析不僅不紮實、毫無益處,還會產生誤導。回頭看看,就在2009年全球金融危機襲來的時候,還有那麼多經濟學家信誓旦旦地宣稱這一事件不會發生。認爲根據那種短期時間尺度的數據就能做出紮實的預測,這種想法本身就有很大的問題。

It’s also worth remembering that novelty tends to be a dominant consideration when deciding what data to keep or delete. Out with the old and in with the new: that’s the digital trend in a world where search algorithms are intrinsically biased towards freshness, and where so-called link rot infests everything from Supreme Court decisions to entire social media services. A bias towards the present is structurally engrained in almost all the technology surrounding us, not least thanks to our habit of ditching most of our once-shiny machines after about five years.

我們還應當記住,在決定哪些數據該保存還是刪除的時候,新穎性往往會成爲主要的考慮因素。舊的淘汰,新的進來,在這個搜索算法本質上偏向於新鮮事物的數字世界中,這是一個明顯的趨勢。從最高法院的裁決,到所有社交媒體服務平臺,我們到處都可以看到已經失效的網址。我們身邊的幾乎所有技術都偏向於當前信息,人也一樣:大多數人已經習慣用個四五年就把原本光鮮亮麗的設備丟掉。

What to do? This isn’t just a question of being better at preserving old data – although this wouldn’t be a bad idea, given just how little is currently able to last decades rather than years. More importantly, it’s about determining what is worth preserving in the first place – and what it means to cull information meaningfully in the name of knowledge.

怎麼辦?這個問題已經不僅僅在於如何更好保存舊數據的範疇——儘管這並不是個壞主意,想想我們現在還有什麼東西能流行保留10年之久。更重要的是,這個問題關係到確定哪些東西值得優先保存,以及如何在知識的名義下,選擇哪些信息最有意義

What’s needed is something that I like to think of as “intelligent forgetting”: teaching our tools to become better at letting go of the immediate past in order to keep its larger continuities in view. It’s an act of curation akin to organising a photograph album – albeit with more maths. When are two million photographs less valuable than two thousand? When the larger sample covers less ground; when the questions that can be asked of it are less important; when the level of detail on offer instils not useful scepticism, but false confidence.

或許我們需要的是我所稱之爲的“智能性遺忘”:應該讓我們的工具更多地放棄最近的信息,從而在長遠視角上保持更高水平的連續性。這有點像是以數學方法重新整理一本影集。什麼時候兩百萬張照片的價值比兩千張照片更低?什麼時候較大的樣本量覆蓋範圍反而較小?哪些問題的重要性較低?哪個細節水平能提供有用的質疑證據,而不是虛假的信心?

Many data sets are irreducible and most precious when complete: gene sequences; demographic data; the raw, hard knowledge of geography and physics. The softer the science, however, the more likely scale is to correlate inversely with quality – and the more important time itself becomes as a filter. Either we choose carefully what endures, matters and meaningfully captures our receding past – or its imprint is silently supplanted by the present’s growing noise.

許多數據集是無法縮減的,只有在完整的情況下才最寶貴,比如,基因序列、人口統計學數據、地理和物理學的原始觀測數據等等。數據的科學性越弱,數據規模與數據的質量就越可能呈現負相關,此時時間本身就成爲更加重要的過濾工具。我們如果不仔細選擇過去保存下來的有價值、有意義的事物,它們就會被迅速膨脹的信息洪流悄無聲息地吞沒掉。

Time cuts several ways, for there is another crucial sense in which it remains a limiting factor: the availability of human time and attention. Corporations, individuals and governments alike have orders of magnitude more information available today than they did even a few years ago. Yet they don’t have any more available attention, board members, chief executives, elected officials or hours in the day. Better and better tools exist to help decision-makers ask meaningful questions of the information they possess – but you can only analyse what remains accessible. Mere accumulation is no kind of answer. In an era of bigger and bigger data, what you choose not to know matters just as much as what you do.

能否考察長期歷史遺留下來的數據取決於考察者是否有足夠的時間和注意力。今天的企業、個人和政府機構都能夠獲得比以往(甚至就在幾年前)大許多數量級的數據,但是董事會成員、首席執行官、政府官員等決策者卻沒有足夠時間和注意力來應對這些數據。今天的決策者們有越來越高效的工具幫助他們就所持有的數據提出問題——但你只應該分析有意義的數據。單純的數量累積不是一個好的對策。在一個數據量越來越大的時代,如何選擇主動放棄哪些事情,與選擇做什麼事情一樣重要。