[CODATA-international] Cost of Data Wrangling

Margareta Hellström margareta.hellstrom at nateko.lu.se
Fri Dec 11 09:15:54 EST 2020

Dear Ernie,

In their 2018 report “Common Patterns in Revolutionary Infrastructures and Data” (https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf ), Peter Wittenburg and George Strawn write “Results from surveys and interviews indicate that current data management and processing mechanisms are highly inefficient. An RDA survey from 2013[16] stated that typically a data scientist is spending 75% of his time on “data wrangling”[17]. M. Brodie reported about an MIT study [3] indicating that data scientists spend 80% on data wrangling and a recent study from CrowdFlower[18] also came up with 79% of the time being spent on data wrangling in industry.”
[3] M. L. Brodie, Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery, keynote, Proc. of the XVII Int’l Conf. Data Analytics and Management in Data Intensive Domains (DAMDID’2015), Obninsk, Russia, October 13-16, 2015.
[16] RDA EU survey: http://hdl.handle.net/11304/6e1424cc-8927-11e4-ac7e-860aa0063d1f
[17] "Data Wrangling includes all preparatory steps necessary to finally start the analytics."
[18] Crowdflower: https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html

Hope this helps!

From: CODATA-international <codata-international-bounces at lists.codata.org> On Behalf Of Johnson, Jon
Sent: Friday, December 11, 2020 10:00
To: Ernie Boyko <boykern at yahoo.com>; CODATA International <codata-international at lists.codata.org>
Subject: Re: [CODATA-international] Cost of Data Wrangling

Hi Ernie

It’s a bit of an urban myth, I think (see https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/ ), but it aligns with the Pareto Principle, so we are all willing to go with it!

I suppose it is not that important whether it is 80% or 60%; it’s still a massive problem. The takeaway is that the figure highlights where most of the effort is being expended, and strongly suggests that this effort arises from poor data quality and a lack of metadata to manage it.

Jon Johnson
CLOSER, UCL Institute of Social Research

From: CODATA-international <codata-international-bounces at lists.codata.org> on behalf of Ernie Boyko <boykern at yahoo.com>
Reply to: Ernie Boyko <boykern at yahoo.com>
Date: Friday, 11 December 2020 at 07:24
To: CODATA International <codata-international at lists.codata.org>
Subject: [CODATA-international] Cost of Data Wrangling

Hi all
A study conducted for the EU (?) is often quoted as the source of a statement along the lines of:
- 80% of effort in data intensive research is used on data wrangling; conservative estimate of 10.2 Bn Euro.
Can anyone on this list point me to this study?
Many thanks in advance. I am trying to make the case for the benefits of developing a career stream for data wranglers/data stewards.
Cheers, Ernie

  “Data is the new oil.” — Clive Humby
“Data really powers everything that we do.” – Jeff Weiner
