[CODATA-international] Cost of Data Wrangling

Ian Bruno bruno at ccdc.cam.ac.uk
Fri Dec 11 08:59:29 EST 2020


The 10.2 Billion Euro figure almost certainly comes from this EU report:

Cost-benefit analysis for FAIR research data: Cost of not having FAIR Research Data. https://op.europa.eu/en/publication-detail/-/publication/d375368c-1a0a-11e9-8d04-01aa75ed71a1/language-en.

To quote: “we found that the annual cost of not having FAIR research data costs the European economy at least €10.2bn every year”.

Also from the report: “As a rule of thumb, in a data analysis project, data cleansing of poor quality data can take up to 80% of the total effort.” It is perhaps not clear from the report how the 80% figure was arrived at.

Ian Bruno, CCDC

From: CODATA-international <codata-international-bounces at lists.codata.org> On Behalf Of Johnson, Jon
Sent: 11 December 2020 09:00
To: Ernie Boyko <boykern at yahoo.com>; CODATA International <codata-international at lists.codata.org>
Subject: Re: [CODATA-international] Cost of Data Wrangling

Hi Ernie

It’s a bit of an urban myth, I think; see https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/. But it aligns with the Pareto Principle, so we are all willing to go with it!

I suppose it is not that important whether it is 80% or 60%; it’s still a massive problem. The takeaway is that it highlights where most of the effort is being expended, and strongly suggests that this arises from poor data quality and a lack of metadata to manage it.

Jon Johnson
CLOSER, UCL Institute of Social Research

From: CODATA-international <codata-international-bounces at lists.codata.org> on behalf of Ernie Boyko <boykern at yahoo.com>
Reply to: Ernie Boyko <boykern at yahoo.com>
Date: Friday, 11 December 2020 at 07:24
To: CODATA International <codata-international at lists.codata.org>
Subject: [CODATA-international] Cost of Data Wrangling

Hi all
A study conducted for the EU (?) is often quoted as the source of a statement along the lines of:

     *   80% of effort in data-intensive research is spent on data wrangling; conservative estimate of 10.2 Bn Euro.

Can anyone on this list point me to this study?
Many thanks in advance.  I am trying to make the case for the benefits of developing a career stream for data wranglers/data stewards.
Cheers, Ernie

“Data is the new oil.” – Clive Humby
“Data really powers everything that we do.” – Jeff Weiner

Dr Ian Bruno
Head of Strategic Partnerships

Phone: +44 1223 3-36013
Email: bruno at ccdc.cam.ac.uk
