[cdif-community] Definition of schema:Dataset

Rolf Krahl rolf.krahl at helmholtz-berlin.de
Tue Oct 14 18:10:59 EDT 2025


Hi all,

I thought it might be useful for the discussion to add a note on how
we use the term “Dataset” at our facility, independently of the
definitions in schema.org or RO-Crate, and what it means to us.

At HZB, we use the ICAT [1] metadata catalogue as a central element in
our scientific data management.  Internally, ICAT is a relational
database.  Its database schema imposes some structure on how we
organize the data in terms of Investigations, Datasets and Datafiles.

According to our internal definitions:

The Investigation is a delimited endeavor to collect data.  It sets
the scientific context.  It describes /why/ the data has been
created.  Most data from our large-scale facility BESSY II is
collected during a beamtime that has been granted to the user as a
result of an application (the so-called “Proposal”).  In these cases,
the Investigation corresponds to the Proposal.  Most bibliographic
metadata and also the access permissions to the data are managed at
the level of the Investigation.

The Dataset sets the context of the creation of the data.  It
describes /how/ the data has been created.  For data created during an
experiment, a Dataset usually corresponds to an individual
measurement, experiment or simulation.  (It is up to the respective
instrument scientist to define what “a measurement” means in this
context.)  In general, the Dataset is the smallest piece of data that
is assigned an identifier and that can be externally referenced.
Physical metadata, such as the parameters of the measurement,
are managed at the level of the Dataset.  There is a many-to-one
relation from Dataset to Investigation, i.e. each Dataset belongs to
one and only one Investigation.

Depending on various factors, including control software and
measurement protocol, an experiment may result in one or more
(sometimes many) files.  So each Dataset may contain one or more
Datafiles.
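
As a rough illustration of the hierarchy described above, the
relations can be sketched in Python.  This is not the actual ICAT
schema or any ICAT client API: the class layout, the attribute names
and the example values (proposal name, parameter, file names) are all
hypothetical and only meant to show the many-to-one relations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Investigation:
    """Scientific context (the "why"), e.g. one Proposal."""
    name: str
    datasets: list[Dataset] = field(default_factory=list)

    def add_dataset(self, name: str,
                    parameters: dict[str, float] | None = None) -> Dataset:
        # A Dataset is always created inside exactly one Investigation.
        dataset = Dataset(name, self, dict(parameters or {}))
        self.datasets.append(dataset)
        return dataset


@dataclass
class Dataset:
    """Creation context (the "how"), e.g. one measurement."""
    name: str
    # Back-reference to the single owning Investigation (hypothetical
    # attribute name); excluded from repr/eq to avoid cycles.
    investigation: Investigation = field(repr=False, compare=False)
    # Physical metadata live at the Dataset level.
    parameters: dict[str, float] = field(default_factory=dict)
    datafiles: list[Datafile] = field(default_factory=list)

    def add_datafile(self, name: str) -> Datafile:
        datafile = Datafile(name, self)
        self.datafiles.append(datafile)
        return datafile


@dataclass
class Datafile:
    """A single file belonging to exactly one Dataset."""
    name: str
    dataset: Dataset = field(repr=False, compare=False)


# Hypothetical example: one Proposal, one measurement, two files.
proposal = Investigation("example-proposal")
scan = proposal.add_dataset("scan-001", {"temperature_K": 4.2})
scan.add_datafile("scan-001_raw.h5")
scan.add_datafile("scan-001_log.txt")
```

In this sketch, a Dataset cannot exist without its Investigation, and
a Datafile cannot exist without its Dataset, which mirrors the
"one and only one" relation stated above.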

HTH & best regards,
Rolf


[1]: https://icatproject.org/

On Monday, 13 October 2025, 17:16:44 AEST, Donald Hobern wrote:
> I'd like to check that there is a consistent definition for what we label as a schema:Dataset. Schema.org defines it as "A body of structured information describing some topic(s) of interest" (https://schema.org/Dataset). In line with this, RO-Crate seems to use Dataset as a container for one or more Files via schema:hasPart (e.g. https://www.researchobject.org/ro-crate/specification/1.2/introduction.html). Science on Schema.org doesn't provide a definition at https://github.com/ESIPFed/science-on-schema.org/blob/main/guides/Dataset.md, but my reading is that it expects a Dataset to be a file that contains PropertyValues.
> 
> Based on RO-Crate usage, I've been expecting to use schema:Dataset to describe the set of data, metadata and other files I expect to store together as the RO-Crate. I have also expected to use Dataset to delimit subsets of the RO-Crate that merit describing as self-contained subunits worth describing separately. This would mean that a simple RO-Crate would be a Dataset and that it would have multiple Files as parts. A more complicated RO-Crate would be a Dataset that has multiple Files and Datasets as parts (with the nested Datasets themselves having Files as parts). Based on this interpretation, most Datasets have a one-to-one relationship with a Folder.
> 
> Does this align with the expectations of other groups interested in CDIF? In short, is a schema:Dataset 1) a collection of Files representing the results of a study or 2) a File that contains PropertyValues (using e.g. CSV, NetCDF, HDF5, ...)?
> 
> Thanks,
> 
> Donald
> 
> 
> Donald Hobern
> Data Management Director, Australian Plant Phenomics Network
> University of Adelaide - working from Canberra, ACT
> 


-- 
Rolf Krahl <rolf.krahl at helmholtz-berlin.de>
Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
Albert-Einstein-Str. 15, 12489 Berlin
Tel.: +49 30 8062 12122