Wiki: Archiving Research Data

Topics:


Overview of iRODS
Accessing iRODS
Example Process
iRODS Commands
Simplified iRODS Wrappers
Adding Metadata

 

 

Overview of iRODS

iRODS is a remote virtual filesystem, separate from your normal directories in Linux or Windows. In iRODS, directories are called collections; they may contain further sub-collections (sub-directories) or data objects (files). In practice you can treat collections and data objects as directories and files. Files and folders in your local and mounted directories can be archived (copied) to the remote iRODS filesystem that the ZDV provides. Archiving is done by first creating a remote directory (collection) and then copying your local files and directories to that remote path in iRODS. Archived files can be copied back or downloaded years later. Archives can be private (long-term storage only), restricted to research groups, or public for download.

 

Accessing iRODS

Archiving to iRODS is currently possible from the command line on the ZDV Linux work PCs, the linux.zdv.uni-mainz.de web server, and the MOGON II supercomputer. MOGON II also has a specialized iRODS wiki. New and more user-friendly ways to access iRODS will be available in the future. Currently only users with a university account can use iRODS. If your work computer does not have a standard ZDV Linux installation, you can access the Linux servers remotely using SSH and use the iRODS iCommands available there to archive data. Personal computers with an iRODS iCommands client installed could also access the ZDV iRODS archive.

 

Example Process

All commands that interact with iRODS start with an 'i' and are similar to UNIX commands. An example workflow looks like this:

ipwd
shows the current remote iRODS directory (collection);

imkdir -p /zdv/home/MUSTERPERSON/project123/
creates a new remote iRODS path to copy data to;

icd /zdv/home/MUSTERPERSON/project123/
changes the current remote iRODS directory to the newly created one;

ipwd
shows and confirms that it is the current directory;

ils
shows the (empty) contents of the remote directory;

iput -K some.file --metadata="Title;Publications data;;ResourceType;DataSet Short Description;;Title;Scientific Data;" 
archives a local some.file to the current remote iRODS path by copying it.

iget /zdv/home/MUSTERPERSON/project123/some.file
downloads some.file from iRODS to the local directory.

iticket create read /zdv/home/MUSTERPERSON/project123/some.file
creates a ticket (e.g. ABCDEFG123456890) that can be used to download some.file from the web:
wget "https://irods-web.zdv.uni-mainz.de/irods-rest/rest/fileContents/zdv/home/MUSTERPERSON/project123/some.file?ticket=ABCDEFG123456890"
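
The steps above can also be scripted. The following is a minimal Python sketch, assuming the iCommands are installed and an iRODS session is already initialized. The function names archive_commands and run_workflow and the dry_run switch are hypothetical and not part of iRODS. The sketch passes the target collection to iput and ils explicitly instead of relying on icd.

```python
import shlex
import subprocess


def archive_commands(collection, local_file):
    """Return the iCommands for a simple archive run as argument lists."""
    return [
        ["imkdir", "-p", collection],            # create the remote collection
        ["iput", "-K", local_file, collection],  # upload and verify the checksum
        ["ils", collection],                     # list the collection to confirm
    ]


def run_workflow(collection, local_file, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in archive_commands(collection, local_file):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the workflow at the first failing command
            subprocess.run(argv, check=True)
```

Calling run_workflow("/zdv/home/MUSTERPERSON/project123/", "some.file") only prints the three commands; pass dry_run=False to actually run them.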

 

iRODS Commands

Here is a short summary of the most important iRODS commands and their most useful command line parameters. Note that they all start with an 'i':

| Command | Parameters | Description |
| --- | --- | --- |
| ipwd | | Print the current iRODS working directory (collection) |
| ils | -l, -L, -A, -r | List an iRODS directory (collection); -l: with details, -L: more details, -A: ACLs, -r: recursively |
| icd | <target path> | Change the iRODS directory (collection) to the target path |
| imkdir | -p <target path> | Create a new directory (collection); -p: create the full path including parents |
| iput | -K, -r, --metadata | Upload files/directories; -K: calculate and verify checksums, -r: recursive, --metadata: add descriptions of the data, e.g. "Title;My data;;Description;Research Data;;" |
| iget | -r, -f <target path> | Download the target iRODS data to the current local path; -r: recursive, -f: overwrite |
| iticket | create read <target path> | Create a ticket used for web downloads of files |
| imeta | ls -d/-C <file/dir> | List the metadata of an iRODS file or directory (collection) |
| imeta | set -d/-C <file/dir> Key "Value" | Add or update metadata of an iRODS file or directory (collection); e.g. Key = Title, "Value" = "My research data" |
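
The --metadata value for iput is a sequence of attribute;value;unit; triplets, as in the example "Title;My data;;Description;Research Data;;". A minimal Python sketch for assembling such a string is shown here. The helper name metadata_string is hypothetical, and rejecting semicolons inside a field is an assumption of this sketch (a semicolon would be indistinguishable from a separator).

```python
def metadata_string(triplets):
    """Build an iput --metadata value from (attribute, value[, unit]) tuples."""
    parts = []
    for item in triplets:
        attribute, value, *rest = item
        unit = rest[0] if rest else ""  # the unit is optional and usually empty
        if not attribute or not value:
            raise ValueError("attribute and value must not be empty")
        for field in (attribute, value, unit):
            if ";" in field:
                # assumption: no escaping is attempted, so ';' is refused
                raise ValueError(f"';' is not allowed in a field: {field!r}")
        parts.append(f"{attribute};{value};{unit};")
    return "".join(parts)


meta = metadata_string([("Title", "My data"), ("Description", "Research Data")])
print(meta)  # Title;My data;;Description;Research Data;;
```

The result can then be passed on the command line as --metadata="...".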

 

Simplified Wrapper Commands for iRODS

Here is a short summary of our iRODS wrapper commands, which simplify archiving research data. Note that the wrappers start with 'i_', unlike the plain iRODS commands, which start with 'i':

| Command | Parameters | Description |
| --- | --- | --- |
| source /usr/local/bin/i_init.sh | (on linux.zdv.uni-mainz.de) | Initializes iRODS and shows the current remote iRODS directory after every subsequent command. |
| source /project/zdvresearch/irods/i_init.sh | (on MOGON II) | Initializes iRODS and shows the current remote iRODS directory after every subsequent command. |
| i_exit | | Clears the current remote iRODS directory and cleans local settings. |
| i_archive | <local file/dir path> <local .json metadata file path> | Uploads the first argument to the current remote iRODS directory/collection and adds the metadata from the .json file (second argument) to all uploaded files. |
| i_metaupdate | <remote iRODS file/dir path> <local .json metadata file path> | Updates the metadata of the remote iRODS file or directory/collection (first argument) using the .json file (second argument). |
| i_publish | <remote iRODS file/dir path> | Creates an iRODS web download ticket and links for all files in the path (first argument). Anyone can use the links to download the files. |
| i_ticketget | <remote iRODS file/dir path> | Prints the iRODS ticket of a remote iRODS path, if available. |
| i_downlinkget | <iRODS ticket> | Uses the iRODS ticket and prints the public web download links. |

 

A simplified procedure to archive research data with the wrappers looks like this:

source /usr/local/bin/i_init.sh (on the linux.zdv.uni-mainz.de systems) or source /project/zdvresearch/irods/i_init.sh (on MOGON II)
makes the wrapper commands available and initializes iRODS;

imkdir -p /zdv/home/MUSTERPERSON/project123/
creates a new remote iRODS path to copy data to;

icd /zdv/home/MUSTERPERSON/project123/
changes the current remote iRODS directory to the newly created one;

i_archive /fullpath/some.file metadata.json
archives a local some.file to the current remote iRODS path by copying it and applying the metadata from the metadata.json file.

i_publish /zdv/home/MUSTERPERSON/project123/some.file
creates a ticket number (e.g. ABCDEFG123456890) and prints the generated web download link.

 

Adding Metadata

Metadata is accepted by iRODS as triplets of Attribute (e.g. Title), Value (e.g. "Research Data from X publication"), and Unit (almost always left empty).
The first two fields, Attribute and Value, are mandatory and must not be empty; the Unit is optional.
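
The triplet rule can be illustrated with a small Python sketch. The class name AVU and the helper imeta_set_command are hypothetical; the sketch only checks the mandatory fields and assembles the argument list of an imeta set call, it does not contact iRODS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVU:
    """One metadata triplet: Attribute and Value are mandatory, Unit is optional."""
    attribute: str
    value: str
    unit: str = ""

    def __post_init__(self):
        if not self.attribute or not self.value:
            raise ValueError("Attribute and Value are mandatory and must not be empty")


def imeta_set_command(target, avu, is_collection=False):
    """Return the argument list for `imeta set` on a data object or collection."""
    kind = "-C" if is_collection else "-d"
    argv = ["imeta", "set", kind, target, avu.attribute, avu.value]
    if avu.unit:
        argv.append(avu.unit)  # the unit is appended only when it is given
    return argv


print(imeta_set_command("some.file", AVU("Title", "My research data")))
```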

You could manually add metadata to every remote iRODS file and directory after uploading them by:
imeta set -d some.file Title "My research data from Publication X"

This would be very time-consuming, so the recommended alternative is to define one .json metadata file for all files and directories of a single archive tree.

i_archive some.file metadata.json
Can be used to upload your local files with Metadata all at once.

i_metaupdate /zdv/home/MUSTERPERSON/project123/some.file metadata.json
Can be used to update a single remote file/directory with Metadata.

A flat metadata.json file needs at least the following minimum set of metadata attributes:

Template:
{
 "Title":"",
 "ResourceType":"",
 "Project":"",
 "Keywords":""
}

Example:
{
 "Title":"My Scientific Data from XYZ publication",
 "ResourceType":"Tables, Texts, Images",
 "Project":"BMBF-12345, DFG-67890",
 "Keywords":"Thermodynamics, Simulation, HPC, MPI, XYZ, BMBF, DFG"
}
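
Before uploading, it can help to check a metadata file locally. The following Python sketch is an assumption-laden illustration: the function name check_metadata is hypothetical, and whether i_archive itself enforces the minimum set is not stated here. It verifies that the text is strict JSON (a trailing comma before the closing brace is rejected by strict parsers), that the structure is flat, and that the four minimum attributes are non-empty.

```python
import json

# Minimum attribute set named in this section
REQUIRED = ("Title", "ResourceType", "Project", "Keywords")


def check_metadata(text):
    """Return a list of problems found in metadata JSON text (empty if none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key in REQUIRED:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty attribute: {key}")
    for key, value in data.items():
        if isinstance(value, (dict, list)):
            problems.append(f"attribute is not flat: {key}")
    return problems


sample = '{"Title": "My data", "ResourceType": "Tables", "Project": "DFG-67890", "Keywords": "HPC"}'
print(check_metadata(sample))
```

To check a file, read it first, for example check_metadata(open("metadata.json", encoding="utf-8").read()).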

The following Attributes are set automatically upon uploading files/directories and are not needed within the .json file:

Creator, Publisher, Location, Date, ExpiryDate (Date + 10 years), protected (default: “false”).
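
As an illustration of the ExpiryDate rule (Date + 10 years), here is a minimal Python sketch. The function name expiry_date is hypothetical, and the handling of 29 February (falling back to 28 February) is an assumption; how the archive actually computes the value is not specified here.

```python
from datetime import date


def expiry_date(upload_date, years=10):
    """Return the date `years` years after upload_date."""
    try:
        return upload_date.replace(year=upload_date.year + years)
    except ValueError:
        # 29 February and the target year is not a leap year:
        # assumption of this sketch, fall back to 28 February
        return upload_date.replace(year=upload_date.year + years, day=28)


print(expiry_date(date(2024, 3, 15)))  # 2034-03-15
```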

The following Attributes are recommended within your .json Metadata file:

Template:
{
 "Title":"",
 "ResourceType":"",
 "Project":"",
 "Keywords":"",
 "Contributor":"",
 "Reference":"",
 "License":""
}

Example:
{
 "Title":"My Scientific Data from XYZ publication",
 "ResourceType":"Tables, Texts, Images",
 "Project":"BMBF-12345, DFG-67890",
 "Keywords":"Thermodynamics, Simulation, HPC, MPI, XYZ, BMBF, DFG",
 "Contributor":"Co-author1, Co-author2, Co-author3",
 "Reference":"",
 "License":"GPLX, CC0, CC-BY"
}

You can confirm the metadata was set by using:
imeta ls -d /zdv/home/MUSTERPERSON/project123/some.file for files
imeta ls -C /zdv/home/MUSTERPERSON/project123/ for directories.
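
To check the result in a script, the text printed by imeta ls can be parsed. The Python sketch below assumes the usual layout of that output: a header line, then "attribute:", "value:" and "units:" lines, with "----" between entries. Treat that layout as an assumption and compare it with the output of your installed version. The function name parse_imeta_ls is hypothetical.

```python
def parse_imeta_ls(output):
    """Parse text in the assumed `imeta ls` layout into a list of AVU dicts."""
    avus = []
    current = {}
    for line in output.splitlines():
        line = line.strip()
        if line == "----":
            # separator between two AVU entries
            if current:
                avus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key in ("attribute", "value", "units"):
            current[key] = value.strip()
        # any other line (e.g. the header) is ignored
    if current:
        avus.append(current)
    return avus


sample = """AVUs defined for dataObj /zdv/home/MUSTERPERSON/project123/some.file:
attribute: Title
value: My research data
units:
----
attribute: Project
value: BMBF-12345
units:
"""
for avu in parse_imeta_ls(sample):
    print(avu["attribute"], "=", avu["value"])
```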

Summary of imeta:

| Parameter | Description |
| --- | --- |
| add / set / rm / ls / cp | Command, see the next table for details (ls and cp do not require the AVU triplet) |
| -d <data object> or -C <directory/collection> | Which object or collection (file/path) should be queried or edited |
| Attribute Value [Unit] | AVU triplet, where the Unit is optional |

Command Description:

| Command | Description |
| --- | --- |
| add | Add an AV(U) triplet |
| set | Set a single value |
| rm | Remove an AV(U) triplet |
| ls | List existing metadata. If an Attribute is given, only the metadata of that attribute is listed |
| cp | Copy existing metadata. Needs a source and a target (e.g. imeta cp -d -C <source file> <target collection>) |

 

 

Publishing Research Data

For public access, a ticket must be created for downloading files.
With this ticket and the path to the remote iRODS location, anyone can download the content and its associated information.
Data that is already archived can be published using i_publish:

i_publish /zdv/home/MUSTERPERSON/project123/some.file
creates a ticket number (e.g. ABCDEFG123456890) and prints the generated web download link.

You can also manually perform the publishing process with iRODS commands instead:
iticket create read /zdv/home/MUSTERPERSON/project123/some.file
creates a ticket number (e.g. ABCDEFG123456890) used to download some.file from the web.
Then append the remote iRODS path and the ticket to the base URL so that the download command looks like this:
wget "https://irods-web.zdv.uni-mainz.de/irods-rest/rest/fileContents/zdv/home/MUSTERPERSON/project123/some.file?ticket=ABCDEFG123456890"
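
Assembling the download link by hand is error-prone, so here is a minimal Python sketch of the pattern shown above: base URL, then the remote iRODS path, then the ticket as a query parameter. The function name download_url is hypothetical, and percent-encoding special characters in the path is an assumption about what the web service expects.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the wget example in this section
BASE_URL = "https://irods-web.zdv.uni-mainz.de/irods-rest/rest/fileContents"


def download_url(remote_path, ticket):
    """Combine the base URL, an absolute iRODS path and a ticket into one link."""
    if not remote_path.startswith("/"):
        raise ValueError("remote_path must be an absolute iRODS path")
    # quote() keeps '/' and percent-encodes characters such as spaces
    return f"{BASE_URL}{quote(remote_path)}?{urlencode({'ticket': ticket})}"


print(download_url("/zdv/home/MUSTERPERSON/project123/some.file", "ABCDEFG123456890"))
```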

 

Data Policy and Licensing

The “Creator” is responsible for ensuring that any reuse of third-party data is legal (Urheberrechtsgesetz, the German Copyright Act) and that personal data is handled correctly (DSGVO, the GDPR).
This applies to all archived data, even if the “Creator” is no longer employed at the university or the data is not public.
This decision guide (in German) can help you decide whether your data can be published.

If your data can be published, different kinds of licenses exist for various cases:

Applying CC-BY licenses to software is not recommended: CC-recommendation and discussion. The same applies to datasets: publishing them under a CC license other than CC0 is questionable. For other dataset licenses, search the Open Definition Licenses Service.

Proprietary file formats should be avoided, since you don't know if the software to open them still exists in a few years. Try to stick to open standards.