[cdif-community] Definition of schema:Dataset
Donald Hobern
donald.hobern at adelaide.edu.au
Mon Oct 13 06:39:04 EDT 2025
Thanks, Stian.
I have been participating in Peter Sefton's Australian community calls, but (as we get further with developing our profile) I would be pleased to contribute to the process.
Best wishes,
Donald
Donald Hobern
Data Management Director, Australian Plant Phenomics Network
University of Adelaide - working from Canberra, ACT
P (04) 20511471 | plantphenomics.org.au<http://www.plantphenomics.org.au/> | subscribe to our news<https://www.plantphenomics.org.au/news/#news-from-our-blog>
APPN acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. We pay our respect to their Elders past and present and extend that respect to all Aboriginal and Torres Strait Islander peoples today.
The Australian Plant Phenomics Network (APPN) is supported by the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS<https://www.education.gov.au/national-collaborative-research-infrastructure-strategy-ncris>)
APPN National Head Office at the University of Adelaide<https://www.thewaite.org/> (UoA - CRICOS provider number 00123M). This email (and any attachment) is confidential and may also be privileged or otherwise exempt from disclosure. It is intended only for the addressee. If you are not the intended recipient, please delete it and do not send it on, copy it or disclose its contents. No assurance is given about the security of information sent electronically. Think green and read on the screen.
________________________________
From: Stian Soiland-Reyes <soiland-reyes at manchester.ac.uk>
Sent: Monday, 13 October 2025 9:10 PM
To: Donald Hobern <donald.hobern at adelaide.edu.au>; Peter Winstanley <peter.winstanley at semanticarts.com>; cdif-community at lists.codata.org <cdif-community at lists.codata.org>
Subject: Re: Definition of schema:Dataset
Thanks. RO-Crate 2 is a longer-term programme to revamp our approach. In particular, we want to change how we write the specifications themselves, making them modular and more formalised so that validators and editor forms can be generated from them. There is no current draft for RO-Crate 2, but we are moving to an RFC-like approach for individual elements of the specification, starting with a meta-RFC on how the specification itself should be structured.
We suggest RO-Crate 1.2 as the stable release everyone should use (it is also backwards compatible for 1.1-like use). If you are particularly interested in the profile/schema definition side (which I understand can overlap with CDIF activities), or have ideas for big changes, then feel free to join the RO-Crate community calls to help shape version 2!
--
Stian Soiland-Reyes, The University of Manchester
https://www.esciencelab.org.uk/
https://orcid.org/0000-0001-9842-9718
Please note that I may work flexibly – whilst it suits me to email now,
I do not expect a response or action outside of your own working hours.
________________________________________
From: Donald Hobern <donald.hobern at adelaide.edu.au>
Sent: 13 October 2025 10:56
To: Stian Soiland-Reyes; Peter Winstanley; cdif-community at lists.codata.org
Subject: Re: Definition of schema:Dataset
Thanks, Stian.
That's very helpful - is there a timeline for RO-Crate 2?
Donald
________________________________
From: Stian Soiland-Reyes <soiland-reyes at manchester.ac.uk>
Sent: Monday, 13 October 2025 6:58 PM
To: Peter Winstanley <peter.winstanley at semanticarts.com>; Donald Hobern <donald.hobern at adelaide.edu.au>; cdif-community at lists.codata.org <cdif-community at lists.codata.org>
Subject: Re: Definition of schema:Dataset
Hi, for RO-Crate 2 we plan to lift the requirement that the root be a Dataset; in particular, it may become a Data Catalogue that has many Datasets.
Our interpretation of Dataset is slightly broader than that of DCAT, which the Schema.org Dataset is based on. For instance, we allow "any file" (and in fact no file if you like, using only "mentions" or external URIs), while they assume a Dataset to be something akin to a proxy for a single CSV file (the distribution) and alternative formats of the same conceptual dataset. Thus variableMeasured in RO-Crate may seem awkward on the root Dataset, as you would not know which file it applies to, and we would rather describe the File (aka MediaObject) directly.
Yet Dataset is very applicable because we normally have a root folder. We can therefore have a "distribution" pointing to a Zip download, for instance, which is very useful when referencing other crates.
https://www.researchobject.org/ro-crate/specification/1.2/data-entities#directories-on-the-web-dataset-distributions
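As a rough illustration of the "distribution" idea, here is a minimal Python sketch that builds an RO-Crate-style metadata document in which the root Dataset points at a Zip download. The download URL, crate name and file name are hypothetical placeholders, and the layout reflects one reading of the RO-Crate 1.2 section linked above rather than an authoritative template.

```python
import json

# Hypothetical download location for the zipped crate (placeholder URL).
ZIP_URL = "https://example.org/crates/trial-42.zip"

graph = [
    {
        # Metadata descriptor pointing at the root data entity.
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2"},
        "about": {"@id": "./"},
    },
    {
        # Root Dataset: the folder itself, with a Zip "distribution".
        "@id": "./",
        "@type": "Dataset",
        "name": "Example trial crate",
        "hasPart": [{"@id": "measurements.csv"}],
        "distribution": {"@id": ZIP_URL},
    },
    {
        # The downloadable archive, described as a DataDownload.
        "@id": ZIP_URL,
        "@type": "DataDownload",
        "encodingFormat": "application/zip",
    },
    {
        # The file is described directly as its own entity.
        "@id": "measurements.csv",
        "@type": "File",
        "encodingFormat": "text/csv",
    },
]

crate = {"@context": "https://w3id.org/ro/crate/1.2/context", "@graph": graph}
print(json.dumps(crate, indent=2))
```

Another crate could then reference this one by the identifier of the root Dataset and follow "distribution" to retrieve the archive.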
But remember that DCAT is primarily about availability, so there a Data Catalog always lists each Dataset by reference. RO-Crate permits nested Datasets, so we can have one Dataset inside another, one per folder. In some cases the nested Dataset may be another RO-Crate. For that case, currently in 1.2, I would use both types on the upper root Dataset, e.g. in pseudo-JSON:
{
  "@id": "./",
  "@type": ["Dataset", "DataCatalog"]
}
This is similar to how we define a Profile Crate as both Dataset and Profile: https://www.researchobject.org/ro-crate/specification/1.2/profiles.html#profile-crate
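The dual-typed root can be sketched in Python as follows. This is only an illustration under stated assumptions: the folder names are hypothetical, the helper make_root is invented for the example, the type is spelled DataCatalog as in Schema.org, and linking the nested Datasets through the Schema.org "dataset" property in addition to hasPart is my assumption, not something the RO-Crate 1.2 specification is known to require.

```python
def make_root(nested_folders):
    """Return JSON-LD entities for a root that is both Dataset and DataCatalog.

    nested_folders: identifiers of nested folders, each ending in "/".
    """
    refs = [{"@id": folder} for folder in nested_folders]
    root = {
        "@id": "./",
        "@type": ["Dataset", "DataCatalog"],
        # RO-Crate containment: one nested Dataset per folder.
        "hasPart": refs,
        # Schema.org's DataCatalog-to-Dataset link (assumed usage here).
        "dataset": refs,
    }
    nested = [{"@id": folder, "@type": "Dataset"} for folder in nested_folders]
    return [root] + nested


# Hypothetical folder names, for illustration only.
entities = make_root(["trial-a/", "trial-b/"])
for entity in entities:
    print(entity["@id"], entity["@type"])
```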
We are also thinking of using Collection for a looser gathering of files that are not in a separate folder; this is already used by Workflow Run Crate:
https://www.researchobject.org/workflow-run-crate/profiles/process_run_crate/#representing-multi-file-objects
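A Collection used this way might look like the following Python sketch, where two files that sit side by side in the root folder are grouped as one multi-file object. The file names, the "#scan-001" identifier and the helper collection_members are hypothetical, and referencing the Collection from the root via "mentions" follows my understanding of the Process Run Crate page linked above.

```python
# All identifiers below are hypothetical placeholders.
graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        # The files still live directly in the root folder.
        "hasPart": [{"@id": "scan-001.hdr"}, {"@id": "scan-001.img"}],
        # The looser grouping is referenced rather than contained.
        "mentions": [{"@id": "#scan-001"}],
    },
    {
        "@id": "#scan-001",
        "@type": "Collection",
        "name": "Scan 001 (header plus image)",
        "hasPart": [{"@id": "scan-001.hdr"}, {"@id": "scan-001.img"}],
    },
    {"@id": "scan-001.hdr", "@type": "File"},
    {"@id": "scan-001.img", "@type": "File"},
]


def collection_members(entities, collection_id):
    """Return the identifiers of the files grouped by the given Collection."""
    for entity in entities:
        if entity["@id"] == collection_id and entity["@type"] == "Collection":
            return [part["@id"] for part in entity.get("hasPart", [])]
    return []


print(collection_members(graph, "#scan-001"))
```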
________________________________
From: cdif-community <cdif-community-bounces at lists.codata.org> on behalf of Peter Winstanley <peter.winstanley at semanticarts.com>
Sent: Monday, October 13, 2025 8:26:18 AM
To: Donald Hobern <donald.hobern at adelaide.edu.au>; cdif-community at lists.codata.org <cdif-community at lists.codata.org>
Subject: Re: [cdif-community] Definition of schema:Dataset
Hi Donald,
In some respects a dataset is an idea for some managed information under a single curator that first needs to be identified and then populated. So you'd have an identifier for a dataset and then provide some information about what it is (going to be) about, and that might be before you have any 'data' in it.
The cataloguing of datasets starts off with recording the things that you have which are not recorded, but once that is done you need your process to record the description of the data as it is being recorded. So we can start with dcat:Resource, and then update that as the resource comes into being (dcat:Dataset, dcat:Distribution, etc.).
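The progression described above could be sketched in Python along these lines: a record is first registered with only an identifier and a title, typed as dcat:Resource, and is later promoted to dcat:Dataset once distributions exist. The function names, identifier and URLs are hypothetical, and plain dictionaries stand in for whatever catalogue store is actually used.

```python
def register(identifier, title):
    """Create a catalogue record before any data exists."""
    return {"@id": identifier, "@type": "dcat:Resource", "dcterms:title": title}


def promote_to_dataset(record, distribution_urls):
    """Return an updated copy of the record once data has been deposited."""
    updated = dict(record)  # leave the original registration untouched
    updated["@type"] = "dcat:Dataset"
    updated["dcat:distribution"] = [
        {"@id": url, "@type": "dcat:Distribution"} for url in distribution_urls
    ]
    return updated


# Hypothetical identifier and download URL, for illustration only.
planned = register("https://example.org/id/dataset/42", "Planned field trial")
published = promote_to_dataset(planned, ["https://example.org/files/trial-42.csv"])
print(planned["@type"], "->", published["@type"])
```

The identifier stays the same across both stages, which is the point of identifying the dataset before populating it.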
Peter
________________________________
From: cdif-community <cdif-community-bounces at lists.codata.org> on behalf of Donald Hobern <donald.hobern at adelaide.edu.au>
Sent: 13 October 2025 08:16
To: cdif-community at lists.codata.org <cdif-community at lists.codata.org>
Subject: [cdif-community] Definition of schema:Dataset
I'd like to check that there is a consistent definition for what we label as a schema:Dataset. Schema.org defines it as "A body of structured information describing some topic(s) of interest" (https://schema.org/Dataset). In line with this, RO-Crate seems to use Dataset as a container for one or more Files via schema:hasPart (e.g. https://www.researchobject.org/ro-crate/specification/1.2/introduction.html).
Science on Schema.org doesn't provide a definition at https://github.com/ESIPFed/science-on-schema.org/blob/main/guides/Dataset.md, but my reading is that it expects a Dataset to be a file that contains PropertyValues.
Based on RO-Crate usage, I've been expecting to use schema:Dataset to describe the set of data, metadata and other files I expect to store together as the RO-Crate. I have also expected to use Dataset to delimit subsets of the RO-Crate that merit description as self-contained subunits. This would mean that a simple RO-Crate would be a Dataset and that it would have multiple Files as parts. A more complicated RO-Crate would be a Dataset that has multiple Files and Datasets as parts (with the nested Datasets themselves having Files as parts). Based on this interpretation, most Datasets have a one-to-one relationship with a Folder.
Does this align with the expectations of other groups interested in CDIF? In short, is a schema:Dataset 1) a collection of Files representing the results of a study or 2) a File that contains PropertyValues (using e.g. CSV, NetCDF, HDF5, ...)?
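Interpretation 1) can be made concrete with a small Python sketch that walks hasPart and lists the Files belonging directly to each Dataset, so that each Dataset corresponds to one folder. The example graph, file names and helper functions are hypothetical and only illustrate the nesting described above.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be absent, single or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def files_by_dataset(graph):
    """Map each Dataset identifier to the Files it directly has as parts."""
    by_id = {entity["@id"]: entity for entity in graph}
    result = {}
    for entity in graph:
        if "Dataset" not in as_list(entity.get("@type")):
            continue
        result[entity["@id"]] = [
            part["@id"]
            for part in as_list(entity.get("hasPart"))
            if "File" in as_list(by_id.get(part["@id"], {}).get("@type"))
        ]
    return result


# Hypothetical crate: a root folder with one file and one nested folder.
example = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "README.md"}, {"@id": "plot-1/"}]},
    {"@id": "README.md", "@type": "File"},
    {"@id": "plot-1/", "@type": "Dataset",
     "hasPart": {"@id": "plot-1/height.csv"}},
    {"@id": "plot-1/height.csv", "@type": "File"},
]

print(files_by_dataset(example))
```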
Thanks,
Donald