Big Data Runs Into a Data-Cleaning Problem

新科技大數據遭遇數據淨化難題


Karim Keshavjee, a Toronto physician and digital health consultant, crunches mountains of data from 500 doctors to figure out how to improve patient treatment. But it’s a frustrating slog to get a computer to decipher all the misspellings, abbreviations, and notes written in unintelligible medical shorthand.

卡里姆·科夏瓦傑是多倫多的一名醫生和網絡健康顧問,他要從500名醫生那裏反饋的海量數據中總結出怎樣才能更好地治療病人。但是衆所周知,醫生的“書法”本來就堪比天書,要想讓電腦識別出其中的拼寫錯誤和縮寫更是難於登天。

For example, “smoking information is very hard to parse,” Keshavjee said. “If you read the records, you understand right away what the doctor meant. But good luck trying to make a computer understand. There’s ‘never smoked’ and ‘smoking = 0.’ How many cigarettes does a patient smoke? That’s impossible to figure out.”

比如科夏瓦傑指出:“患者的吸菸信息很難解析。如果你直接閱讀病歷,你馬上就能明白醫生是什麼意思。但是要想讓電腦去理解它,那就只能祝你好運了。病歷裏既有‘從不吸菸’,也有‘吸菸=0’。但是一個患者每天吸多少支菸?這幾乎是電腦不可能搞明白的問題。”
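The smoking example can be made concrete with a small sketch. It assumes free-text notes arrive as plain strings and maps a few phrasings onto one canonical label. The pattern lists and the function name `normalize_smoking` are hypothetical; only "never smoked" and "smoking = 0" come from the article, and a real vocabulary would have to be built from the records themselves.

```python
import re
from typing import Optional

# Hypothetical phrasings; only "never smoked" and "smoking = 0" appear in the article.
NEVER_PATTERNS = [
    r"\bnever\s+smoked\b",
    r"\bnon[-\s]?smoker\b",
    r"\bsmoking\s*=\s*0\b",
]
CURRENT_PATTERNS = [
    r"\bcurrent\s+smoker\b",
    r"\bsmoking\s*=\s*[1-9]\d*\b",
]


def normalize_smoking(note: str) -> Optional[str]:
    """Map a free-text note to 'never', 'current', or None if unrecognized."""
    text = note.lower()
    if any(re.search(p, text) for p in NEVER_PATTERNS):
        return "never"
    if any(re.search(p, text) for p in CURRENT_PATTERNS):
        return "current"
    # Anything the patterns do not cover is left for a human to review.
    return None


for note in ["Never smoked", "smoking = 0", "smoking=10", "see chart"]:
    print(repr(note), "->", normalize_smoking(note))
```

Even this toy version shows the limit Keshavjee describes: it can sort a handful of known spellings, but it cannot recover how many cigarettes a patient smokes unless the note says so in a form the patterns anticipate.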

The hype around slicing and dicing massive amounts of data, or big data, makes it sound so easy: Just plug a library’s worth of information into a computer and wait for valuable insights to pour out about how to speed up an auto assembly line, get online shoppers to buy more sneakers, or fight cancer. The reality is much more complicated. Data is inevitably “dirty” thanks to obsolete, inaccurate, and missing information. Cleaning it up is an increasingly important and overlooked job that can help prevent costly mistakes.

由於宣傳報道把大數據吹得神乎其神,因此很多人可能覺得大數據用起來特別簡單:只要把相當於一整個圖書館的信息插到電腦上,然後就可以坐在一邊,等着電腦給出精闢見解,告訴你如何提高自動生產線的生產效率,如何讓網購者在網上購買更多的運動鞋,或是如何治療癌症。但事實遠遠比想象複雜得多。由於信息會過時、不準確和缺失,因此數據不可避免地也有“不乾淨”的時候。如何把數據變“乾淨”是一個越來越重要但又經常被人忽略的工作,但它可以防止你犯下代價高昂的錯誤。

Although techniques are improving all the time, scrubbing data can only accomplish so much. Even when dealing with a relatively tidy set of information, getting useful results can be arduous and time-consuming.

雖然科技一直都在進步,但是人們在淨化數據上能想到的法子並不多。即便是處理一些相對較“乾淨”的數據,要想獲得有用的結果往往也是件費時費力的事情。

“I tell my clients that the world is messy and dirty,” said Josh Sullivan, a vice president at business consulting firm Booz Allen who handles data crunching for clients. “There are no clean data sets.”

博思艾倫諮詢公司(Booz Allen)副總裁約什·沙利文說:“我對我的客戶說,這是個混亂骯髒的世界,沒有完全乾淨的數據集。”

Data analysts start by looking for information that’s out of the norm. Because the volume of data is so huge, they typically hand the job over to software that automatically sifts through numbers and text to look for anything unusual that needs further review. Over time, computers can improve their accuracy in spotting what belongs and what doesn’t. They can also better understand what words and phrases mean by clustering similar examples together and then grading their interpretations for accuracy.

數據分析師一般喜歡先尋找非常態的信息。由於數據量太巨大,他們一般都會把篩選數據的工作交給軟件來完成,來尋找是否有些反常的東西需要進一步檢查。隨着時間的推移,電腦篩選數據的精確性也會提高。通過對類似案例進行分類,它們也會更好地瞭解一些詞語和句子的含義,然後提高篩選的精確性。
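As a rough illustration of software flagging out-of-the-norm values for further review, the sketch below uses a modified z-score based on the median absolute deviation. This is one common technique chosen for the example, not necessarily what the analysts quoted here use. The function name `flag_outliers`, the threshold of 3.5, and the sample readings are assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread at all: flag anything that differs from the median.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical systolic blood pressure readings; 1200 looks like a typing error.
readings = [118, 122, 125, 120, 119, 1200, 121]
print(flag_outliers(readings))  # -> [5]
```

The function only returns indices for review. Deciding whether a flagged value is an error or a genuine outlier is still left to a person.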

“The approach is easy and straightforward, but training your models can take weeks and weeks,” Sullivan said.

沙利文說:“這種方法簡單直接,但‘訓練’你的模型可以需要一週又一週的時間。”

A constellation of companies offers software and services for cleaning data. They range from technology giants like IBM and SAP to big data and analytics specialists like Cloudera and Talend Open Studio. A legion of start-ups, including Trifacta, Tamr, and Paxata, are also trying to get a toehold as data janitors.

有些公司也提供了用來淨化數據的軟件和服務,其中既包括像IBM和SAP一樣的科技巨頭,也包括Cloudera和Talend Open Studio等從事大數據和分析的專門機構。一大批創業公司也想爭當大數據的“清潔工”,其中有代表性的包括Trifacta、Tamr和Paxata等。

Healthcare, with all its dirty data, is one of the toughest industries for big data technology. Electronic health records make medical information increasingly easy to dump into computers, but there’s still a lot of room for improvement before researchers, pharmaceutical companies and hospital business analysts can slice and dice all the information they want.

由於“不乾淨”的數據太多,醫療業被認爲是大數據技術最難搞定的行業之一。雖然隨着電子病歷的普及,將醫療信息輸入電腦的難度已經變得越來越低,但是研究人員、製藥公司和醫療業分析人士要想把他們需要的數據盡情地拿來分析,在數據上要提高的地方還有很多。

Keshavjee, the doctor and CEO of InfoClin, a health data consulting firm, spends his days trying to tease out ways to improve patient treatment by sifting through tens of thousands of electronic medical records. Obstacles pop up all the time.

健康數據諮詢公司InfoClin的醫生兼CEO科夏瓦傑花了很多時間,希望從數以萬計的電子醫療病歷中篩選有用的數據,以提高對病人的診療水平。但他們在篩選的過程中卻不斷遇到阻礙。

Many doctors neglect to note a patient’s blood pressure in their medical records, something that no amount of data cleaning can fix. Simply determining what ails patients—based on what’s in their files—is surprisingly difficult for computers. Doctors may enter the proper code for diabetes without clearly indicating whether it’s the patient who has the disease or a family member. Or they may just enter “insulin” without mentioning the underlying diagnosis because, to them, it’s obvious.

很多醫生在病歷中沒有記錄病人的血壓,這個問題是無論哪種數據淨化方法都修復不了的。光憑藉現有病歷的信息去判斷病人得了什麼病對電腦來說就已經是一項極其困難的任務。醫生在輸入糖尿病編號的時候,可能忘了清楚地標註究竟是患者本人得了糖尿病,還是他的某個家人得了糖尿病。又或許他們光是輸入了“胰島素”三個字,而沒有提到患者得了什麼病,因爲這對他們來說是再明顯不過的事情。

Physicians also use a lot of idiosyncratic shorthand for medications, illnesses and basic patient details. Deciphering it takes a lot of head scratching for humans and is nearly impossible for a computer. For example, Keshavjee came across one doctor who used the abbreviation “gpa.” Only after coming across a variation, “gma,” did he finally solve the puzzle: they were shorthand for “grandpa” and “grandma.”

醫生用來診斷、開藥和填寫病人基本信息時會大量用到一套獨特的速記字體。即使讓人類來破解它也要大爲頭痛,而對於電腦基本上是不可能完成的任務。比如科夏瓦傑提到有個醫生在病歷中寫下“gpa”三個字母,讓他百思不得其解。好在他發現後面不遠處又寫着“gma”三字,他才恍然大悟——原來它們是爺爺(grandpa)和奶奶(grandma)的縮寫。
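Once a person has decoded a shorthand like this, the mapping can at least be recorded so software applies it consistently. The sketch below assumes a hand-maintained lookup table. The table `ABBREVIATIONS` and the function `expand_abbreviations` are hypothetical, and only "gpa" and "gma" come from the article.

```python
import re

# Hand-maintained lookup table; only these two entries come from the article.
ABBREVIATIONS = {"gpa": "grandpa", "gma": "grandma"}

# Match whole words only, so "gpa" inside a longer token is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(note: str) -> str:
    """Replace known shorthand tokens with their expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], note)


print(expand_abbreviations("diabetes: gpa, gma"))
```

A table like this is only as good as the human work behind it, and the same abbreviation may mean something else in another doctor's notes, so in practice the lookup would probably need to be scoped per author or per clinic.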

“It took a while to figure that one out,” he said.

科夏瓦傑說:“我花了好半天才明白它們到底是什麼意思。”

Ultimately, Keshavjee said one of the only ways to solve the problem of dirty data in medical records is “data discipline.” Doctors need to be trained to enter information correctly so that cleaning up after them is less of a chore. Incorporating something like Google’s helpful tool that suggests how to spell words as users type them would be a great addition for electronic medical records, he said. Computers can learn to pick out spelling errors, but minimizing the need is a step in the right direction.

科夏瓦傑認爲,解決數據“不乾淨”的終極方法之一是要給病歷制定一套“數據紀律”。要訓練醫生養成正確錄入信息的習慣,這樣事後淨化數據時纔不至於亂得一團糟。科夏瓦傑表示,谷歌有一個很有用的工具,可以在用戶進行輸入時告訴他們如何拼寫生僻字,這樣的工具完全可以添加到電子病歷工具中。電腦雖然可以挑出拼寫錯誤,但是讓醫生摒棄不良習慣纔是朝着正確的方向邁出了一步。
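A type-ahead spelling aid of the kind suggested here can be approximated with Python's standard `difflib` module. This is a stand-in for illustration, not how Google's tool works. The vocabulary list, the function name `suggest`, and the 0.8 similarity cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real system would load a curated medical term list.
VOCABULARY = ["insulin", "metformin", "hypertension", "diabetes"]


def suggest(term, vocabulary=VOCABULARY, n=3, cutoff=0.8):
    """Return up to n vocabulary entries that closely resemble the typed term."""
    return get_close_matches(term.lower(), vocabulary, n=n, cutoff=cutoff)


print(suggest("insullin"))
print(suggest("diabetis"))
print(suggest("xyzzy"))  # nothing similar enough, so the list is empty
```

Offering suggestions at entry time lets the doctor confirm the intended word, which is cheaper than having software guess at it later during cleanup.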

Another of Keshavjee’s suggestions is to create medical records with more standardized fields. A computer would then know where to look for specific information, reducing the chance of error. Of course, doing so is not as easy as it sounds because many patients suffer from multiple illnesses, he said. A standard form would have to be flexible enough to take such complications into account.

科夏瓦傑的另一個建議是,在電子病歷中設置更多標準化的域。這樣電腦就會知道到哪裏去找特定的信息,從而減少出錯率。當然,實際操作起來並沒有這麼簡單,因爲很多病人同時身患好幾種疾病。因此,一個標準的表格必須擁有足夠的靈活性,把這些複雜情況全部考慮進去。
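One way to picture standardized fields that still allow for multiple illnesses is a record type whose diagnosis field is a list, with each entry stating whom the diagnosis refers to. Everything in this sketch (class names, field names, the placeholder codes) is hypothetical and does not reflect any real electronic medical record schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Diagnosis:
    code: str                 # placeholder label, not a real coding system
    subject: str = "patient"  # "patient" or "family"


@dataclass
class Record:
    patient_id: str
    systolic_bp: Optional[int] = None  # None means it was never recorded
    diagnoses: List[Diagnosis] = field(default_factory=list)
    free_text: str = ""                # room for notes that fit no field

    def patient_codes(self) -> List[str]:
        """Codes that apply to the patient, excluding family history."""
        return [d.code for d in self.diagnoses if d.subject == "patient"]


record = Record(
    patient_id="p1",
    diagnoses=[
        Diagnosis("DIABETES"),
        Diagnosis("DIABETES", subject="family"),
        Diagnosis("HYPERTENSION"),
    ],
)
print(record.patient_codes())  # -> ['DIABETES', 'HYPERTENSION']
```

The explicit `subject` field addresses the ambiguity mentioned earlier, where a diabetes code may refer to the patient or to a relative. The `free_text` field reflects that some notes will never fit a fixed form.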

Still, doctors would need to be able to jot down more free-form electronic notes that could never fit in a small box. Nuance like why a patient fell, for example, and not just the injury suffered, is critical for research. But software is hit and miss in understanding free-form writing without context. Humans searching by keyword may do a better job, but they still inevitably miss many relevant records.

但是出於診療的需要,醫生有時需要在病歷上記下一些自由行文的東西,這些內容肯定不是一個小格子能裝得下的。比如一個患者爲什麼會摔倒,如果不是受傷導致的,那麼原因就非常重要。但是在沒有上下文的條件下,軟件對於自由行文的理解只能用撞大運來形容。篩選數據的時候,如果人們用關鍵詞搜索的話可能會做得更好些,但這樣也難免會漏掉很多有關的記錄。

Of course, in some cases, what appears to be dirty data really isn’t. Sullivan, from Booz Allen, gave the example of the time his team was analyzing demographic information about customers for a luxury hotel chain and came across data showing that teens from a wealthy Middle Eastern country were frequent guests.

當然,在有些案例中,有些看起來不乾淨的數並不是真的不乾淨。博思艾倫諮詢公司副總裁沙利文舉例說,有一次他的團隊爲一家豪華連鎖酒店分析顧客的人口統計數據,突然發現,數據顯示一個富有的中東國家的青少年羣體是這家酒店的常客。

“There was a whole group of 17-year-olds staying at the properties worldwide,” Sullivan said. “We thought, ‘That can’t be true.’”

沙利文回憶道:“有一大羣17歲的青少年在世界各地都住這家酒店,我們以爲:‘這肯定不是真的。’”

But after some digging, they found that the information was, in fact, correct. The hotel had legions of young customers that it didn’t even realize were there, and had never done anything to market to them. All guests under 22 were automatically logged as “low-income” in the company’s computers. Hotel executives had never considered the possibility of teens with deep pockets.

但做了一些挖掘工作後,他們發現這個信息其實是正確的。這家酒店有大量的青少年顧客,甚至連酒店自己也沒有意識到,而且酒店也沒有針對這部分顧客做過任何促銷和宣傳。所有22歲以下的顧客都被這家公司的電腦自動列入“低收入”羣體,酒店的高管們也從來沒有考慮過這些孩子的腰包有多鼓。

“I think it’s harder to build models if you don’t have outliers,” Sullivan said.

沙利文說:“我認爲如果沒有離羣值的話,構建模型會更難。”

Even when data is clearly dirty, it can sometimes be put to good use. Take the example, again, of Google’s spelling suggestion technology. It automatically recognizes misspelled words and offers alternative spellings. It’s only possible because Google has collected millions and perhaps billions of misspelled queries over the years. Instead of garbage, the dirty data is an opportunity.

即便有時數據明顯不乾淨,它有時依然能派上大用場。比如上文提到的谷歌(Google)的拼寫糾正技術。它可以自動識別拼寫錯誤的單詞,然後提供替代拼寫。這個工具之所以有這樣神奇的功用,是因爲谷歌在過去幾年中已經收集了幾億甚至幾十億個拼寫錯誤的詞條。因此不乾淨的數據也可以變廢爲寶。

Ultimately, humans, and not machines, draw conclusions from the data they crunch. Computers can sort through millions of documents, but they can’t interpret the findings. Cleaning data is just one step in a long trial-and-error process to get to that point. Big data, for all the hype about its ability to lift business profits and help humanity, is a big headache.

最終,從大數據中獲得結論的是人而不是機器。電腦雖然可以整理幾百萬份文件,但它並不能真的解讀它。數據淨化就是爲了方便人們從數據中獲取結論而反覆試錯的過程。儘管大數據已被奉爲能提高商業利潤、能造福全人類的神器,但它也是個很讓人頭痛的東西。

“The idea of failure is completely different in data science,” Sullivan said. “If they don’t fail 10 or 12 times a day to get to where they should be, they’re not doing it right.”

沙利文指出:“失敗的概念在數據科學中完全是另一回事。如果他們每天不失敗10次或12次來達到應有的水平,那就說明他們的做法不對。”