3. Data files

3.1. Data Policy

The data policy of European XFEL is available at https://www.xfel.eu/users/experiment_support/policies/scientific_data_policy/index_eng.html

3.2. Data Files

On the offline cluster (Maxwell), European XFEL data is stored in /gpfs/exfel/exp. Each proposal has its own directory, so your data will be available at a path something like:

/gpfs/exfel/exp/SPB/201830/p900022/
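
As a rough sketch, the example path above can be read as /gpfs/exfel/exp/<instrument>/<cycle>/p<proposal number>. The helper below is hypothetical (it is not part of karabo_data) and assumes the proposal number is zero-padded to six digits, as in the examples in this chapter:

```python
from pathlib import PurePosixPath

# Root of the experiment data on Maxwell, as given in the text above.
EXP_ROOT = PurePosixPath("/gpfs/exfel/exp")


def proposal_dir(instrument, cycle, proposal):
    """Build a proposal directory path.

    Hypothetical helper: assumes the proposal number is zero-padded
    to six digits, which matches the examples shown in this chapter.
    """
    return EXP_ROOT / instrument / str(cycle) / f"p{int(proposal):06d}"


print(proposal_dir("SPB", 201830, 900022))
```

Building paths this way only saves typing; the directory listing on Maxwell remains the authority on where a proposal actually lives.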

3.3. Calibrated files

Calibrated files are created automatically after the data has been migrated to the Maxwell cluster, and appear in the proc folder of the proposal (see Storage).

3.4. Reading Data in Python

We provide a Python package karabo_data to read data from European XFEL.
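
A minimal sketch of opening a run with karabo_data might look like the following. The function name summarise_run is illustrative, and the run path in the comment is an assumption based on the example proposal described later in this chapter; see the karabo_data documentation for the full interface.

```python
def summarise_run(run_path):
    """Return the number of trains and the sorted source names of one run."""
    # Imported inside the function so this sketch can be loaded on machines
    # where karabo_data is not installed.
    from karabo_data import RunDirectory

    run = RunDirectory(run_path)  # opens all files of the run directory
    return {
        "n_trains": len(run.train_ids),
        "sources": sorted(run.all_sources),
    }


# Usage on Maxwell (path is an assumption, not verified here):
# summarise_run("/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0001")
```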

3.5. Combining detector data from multiple modules

The pixel detectors (AGIPD and LPD) record data in separate files for each of their 16 modules.

The karabo_data Python library can combine the detector modules into a single NumPy array (example).

Alternatively, the modules can be combined in a single view as an HDF5 virtual dataset, allowing the data to be processed by CrystFEL, for instance.

There are scripts on Maxwell to produce a combined view for a given run, in:

/gpfs/exfel/sw/software/hdf5-virtualise

Also see the README in that directory for details on how to use it.

3.6. Geometry files

Geometry files specify the location of the detector modules in real space. Please contact your instrument scientist regarding obtaining geometry files for the detector at each instrument.

One geometry file for the FXE LPD detector is available online.

One geometry file for the SPB/SFX AGIPD1M detector is available from http://cxidb.org/id-83.html (Experiment by Anton Barty).

karabo_data can read both file formats; see https://karabo-data.readthedocs.io/en/latest/apply_geometry.html for an example.

GeoAssembler can be used to create or adjust geometry files by visually moving detector quadrants around.

A more systematic provision of geometry files is in preparation.

3.7. Data Format

The data format that is used at the European XFEL is described in the European XFEL Data schema specification.

Experimental data are taken in the context of the following categories:

  • instruments: each instrument has its own label. For each instrument, there are multiple cycles:
  • cycle: a scheduling period (of the order of months) in which multiple user experiments take place. Within each cycle, there are multiple proposals:
  • proposals: (a.k.a. beamtimes) each user experiment proposal gets a number. Within a proposal, users may carry out runs, which can be logically associated with different experiments and samples:
  • runs: a new run starts when a user starts acquiring data, and lasts until that data acquisition is stopped. Within a run, there are trains, each labelled by a unique train ID:
  • train id: there are 10 trains per second. Each train can carry multiple pulses:
  • pulse id: up to 2700 pulses per train, counted individually. The counter starts from zero for every train.

(Multiple runs can be grouped together in an experiment, but this is not activated at the moment.)
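
The figures in the list above give a quick upper bound on event rates. The sketch below is plain arithmetic on those figures (10 trains per second, up to 2700 pulses per train). The function names are illustrative, and labelling trains with consecutive integers is an assumption made only for this example.

```python
TRAINS_PER_SECOND = 10        # from the list above
MAX_PULSES_PER_TRAIN = 2700   # from the list above


def max_pulses_per_second():
    """Upper bound on the number of pulses delivered per second."""
    return TRAINS_PER_SECOND * MAX_PULSES_PER_TRAIN


def trains_in(seconds):
    """Number of trains in a run lasting the given number of seconds."""
    return int(seconds * TRAINS_PER_SECOND)


def pulse_labels(first_train_id, n_trains, pulses_per_train):
    """Yield (train ID, pulse ID) pairs.

    The pulse counter restarts at zero for every train. Train IDs are
    assumed to be consecutive here purely for illustration.
    """
    for train_id in range(first_train_id, first_train_id + n_trains):
        for pulse_id in range(pulses_per_train):
            yield train_id, pulse_id


print(max_pulses_per_second())           # 27000
print(trains_in(60))                     # 600
print(list(pulse_labels(1000, 2, 3)))
```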

We distinguish different types of data:

  • fast data which is different for every pulse (carries train ID and pulse ID)
  • train data which is recorded at train speed (carries only train ID)
  • sequence data, not attached to every train (could be inter-train data). Sequence data can have multiple data entries within each train (and each carries a time stamp)

Data files are in HDF5 format and follow the structure described above, which is also shown in Fig. 3.1.

_images/h5.png

Fig. 3.1 Example of the data structure in an HDF5 file

3.7.1. HDF5 Chunking & Compression

Both raw and corrected data may be stored using the HDF5 chunked layout. Some parts of the corrected data are compressed using the gzip compression filter in HDF5. In particular, detector gain stage and mask datasets compress well, saving a lot of disk space.

You can examine compression and chunk sizes using the HDFView GUI tool, our h5glance command line tool, or h5ls -v:

$ h5glance /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5 \
  INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
      dtype: uint16
      shape: 16000 × 2 × 512 × 128
   maxshape: Unlimited × 2 × 512 × 128
     layout: Chunked
      chunk: 16 × 2 × 512 × 128
compression: None (options: None)
...

$ h5ls -v /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
Opened "/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5" with sec2 driver.
data                     Dataset {16000/Inf, 2/2, 512/512, 128/128}
    Location:  1:12333
    Links:     1
    Modified:  2017-11-20 04:57:44 CET
    Chunks:    {16, 2, 512, 128} 4194304 bytes
    Storage:   4194304000 logical bytes, 4194304000 allocated bytes, 100.00% utilization
    Type:      native unsigned short
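
The byte counts reported by h5ls above can be checked by hand: multiply the dimensions of the chunk (or of the whole dataset) by the size of one element, which is 2 bytes for uint16. A short sketch of that arithmetic, with an illustrative helper name:

```python
from math import prod

UINT16_BYTES = 2  # "native unsigned short" in the h5ls output


def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape."""
    return prod(shape) * itemsize


chunk_bytes = nbytes((16, 2, 512, 128), UINT16_BYTES)
dataset_bytes = nbytes((16000, 2, 512, 128), UINT16_BYTES)

print(chunk_bytes)                   # 4194304
print(dataset_bytes)                 # 4194304000
print(dataset_bytes // chunk_bytes)  # 1000 chunks
```

The same calculation gives the minimum cache size needed to hold one complete chunk of a compressed dataset.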

From January 2019, all compressed datasets are stored with a single frame per chunk, to minimise the impact on analysis code reading the data.

If you observe pathologically slow reading, check whether you are accessing a compressed dataset with a chunk size larger than one frame. HDF5 decompresses an entire chunk at once, and it may be redoing this for each frame you read. You can avoid this by setting a cache size large enough to hold one complete chunk. The necessary C code looks something like this:

hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
// Set a 32 MB cache size (calculate at least the size of one chunk)
H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT, 32 * 1024 * 1024, 1);
hid_t h5_dataset_id = H5Dopen(h5_file_id, ".../image/gain", dapl);

To benefit from chunk caching, you need to reuse the opened dataset ID for successive reads, instead of opening and closing it to read each frame.
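
To illustrate why the cache matters, here is a toy model in Python. It is not HDF5's actual cache implementation: zlib stands in for the gzip filter, frames are tiny byte strings, and all names (make_chunks, ChunkReader, count_decompressions) are hypothetical. It only counts how often a chunk has to be decompressed when frames are read one at a time.

```python
import zlib

FRAMES_PER_CHUNK = 16  # toy value, mirroring the 16-frame chunks shown above
FRAME_BYTES = 64       # toy frame size


def make_chunks(n_frames):
    """Compress toy frames into chunks of FRAMES_PER_CHUNK frames each."""
    chunks = []
    for start in range(0, n_frames, FRAMES_PER_CHUNK):
        stop = min(start + FRAMES_PER_CHUNK, n_frames)
        raw = b"".join(bytes([i % 256]) * FRAME_BYTES for i in range(start, stop))
        chunks.append(zlib.compress(raw))
    return chunks


class ChunkReader:
    """Reads single frames, optionally keeping the last chunk decompressed."""

    def __init__(self, chunks, use_cache):
        self.chunks = chunks
        self.use_cache = use_cache
        self.decompressions = 0
        self._cached = (None, None)  # (chunk index, decompressed bytes)

    def read_frame(self, i):
        chunk_idx, offset = divmod(i, FRAMES_PER_CHUNK)
        if self.use_cache and self._cached[0] == chunk_idx:
            data = self._cached[1]  # cache hit: no decompression needed
        else:
            data = zlib.decompress(self.chunks[chunk_idx])  # whole chunk at once
            self.decompressions += 1
            if self.use_cache:
                self._cached = (chunk_idx, data)
        return data[offset * FRAME_BYTES:(offset + 1) * FRAME_BYTES]


def count_decompressions(n_frames, use_cache):
    """Read every frame in order and report how many chunks were decompressed."""
    reader = ChunkReader(make_chunks(n_frames), use_cache)
    for i in range(n_frames):
        reader.read_frame(i)
    return reader.decompressions


print(count_decompressions(32, use_cache=False))  # 32: one per frame read
print(count_decompressions(32, use_cache=True))   # 2: one per chunk
```

In this model, a cache that holds one chunk reduces the work from one decompression per frame to one per chunk, which is the effect the cache setting above is meant to achieve.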

3.8. Sample Data

The file structure is specific to each experiment and run configuration, so it is not expected to be static or fully known in advance. However, (non-representative) test data sets are being prepared and made available to users to help them prepare for their experiments.

3.8.1. Example runs

We have prepared an environment to mimic the real experiment data cycle at the European XFEL. For this, we have a fake instrument called XMPL, which will contain numerous proposals and data runs giving an overview of the data to expect. This data is available on Maxwell:

/gpfs/exfel/exp/XMPL/201750/p700000

It follows the same structure as every experiment (see Storage for more details), and will be used to share examples of the file formats generated at the facility, from all instruments and detectors. These datasets are also linked to the metadata catalog (MDC), where information about the data (instrument, detector, sample, date, …) can be found. Each run dataset comprises raw data (in .../p700000/raw/run_id), calibrated data (in .../p700000/proc/run_id) and a set of sample scripts to read the data (in .../p700000/usr/run_id).
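
The raw, proc and usr locations for a run can be derived from the proposal directory. The helper below is a hypothetical sketch (not part of any European XFEL tool) and assumes run IDs are written as "r" followed by a four-digit, zero-padded run number, as in r0001:

```python
from pathlib import PurePosixPath


def run_dirs(proposal_path, run_number):
    """Return the raw, proc and usr directories of one run.

    Assumes run IDs are formatted as 'r' plus a four-digit,
    zero-padded run number (for example r0001).
    """
    run_id = f"r{int(run_number):04d}"
    base = PurePosixPath(proposal_path)
    return {kind: base / kind / run_id for kind in ("raw", "proc", "usr")}


for kind, path in run_dirs("/gpfs/exfel/exp/XMPL/201750/p700000", 1).items():
    print(kind, path)
```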

List of sample data sets:

run id  description                                                                 date        comments
r0001   instrument: SPB; detector: AGIPD; sample: Water                             2018-04-03  commissioning
r0002   instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid)                 2018-04-03  commissioning
r0003   instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid)                 2018-04-03  commissioning
r0004   instrument: SPB; detector: AGIPD; sample: Lysozyme (liquid)                 2018-04-03  commissioning
r0005   instrument: SPB; detector: AGIPD; sample: Lithium titanate                  2018-08-18  AGIPD calibration
r0006   instrument: SPB; detector: AGIPD; sample: Lithium titanate [1]              2017-11-20  commissioning
r0007   instrument: FXE; detector: LPD; sample: aqueous solution of [Fe(bpy)3]2+    2017-09-18  User Run
r0008   tunnel: SA1_XTD2; device: XGM; sample: n/a                                  2019-02-15  commissioning (XPD)
r0009   tunnel: SA3_XTD10; device: XGM; sample: n/a                                 2019-02-15  commissioning (XPD)
r0010   instrument: SPB; detector: AGIPD; run type: calibration - dark high gain    2019-08-10  commissioning
r0011   instrument: SPB; detector: AGIPD; run type: calibration - dark medium gain  2019-08-10  commissioning
r0012   instrument: SPB; detector: AGIPD; run type: calibration - dark low gain     2019-08-10  commissioning
r0013   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0014   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0015   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0016   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0017   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0018   instrument: SPB; detector: AGIPD; sample: Lysozyme                          2019-08-11  commissioning
r0019   instrument: SQS; device: digitizer; sample: Xenon                           2019-10-11  commissioning
r0020   instrument: SQS; device: digitizer; sample: Xenon                           2019-10-11  commissioning
r0021   instrument: SPB; detector: Jungfrau; sample: Lysozyme                       2019-05-05  IRDa commissioning
r0022   instrument: SPB; detector: Jungfrau; sample: Lysozyme                       2019-05-05  IRDa commissioning

Note

Mock data can be generated using the karabo_data package, e.g.:

>>> from karabo_data.tests.make_examples import make_agipd_example_file
>>> make_agipd_example_file('agipd_example.h5')

>>> from karabo_data.tests.make_examples import write_file, Motor, ADC, XGM
>>> write_file('test_file.h5', [
...     XGM('SPB_XTD1_XGM/XGM/MAIN'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_X'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Y'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Z'),
...     ADC('SA1_XTD2_MPC/ADC/1', nsample=0, channels=(
...         'channel_3.output/data',
...         'channel_4.output/data',
...         'channel_5.output/data'))
...     ], ntrains=500, chunksize=50)

[1] Lithium titanate, spinel; nanopowder, <200 nm particle size (BET), >99%; CAS Number 12031-95-7; Linear Formula Li4Ti5O12; https://www.sigmaaldrich.com/catalog/product/aldrich/702277

3.8.2. Other example files

A data file representing the large pixel detectors’ file structure can be found in:

/gpfs/exfel/data/scratch/example_data

The files available (as of 25 Aug 2017) contain further information and a Jupyter (Python) notebook demonstrating how to parse the HDF5 file:

219K Aug 24 17:47 DataformatExample.ipynb
220K Aug 21 12:42 DataformatExample.pdf
 39K Aug 21 12:43 data_orientation_HDF5.png
635M Aug 21 12:41 R0126-AGG01-S00002.h5
 62K Aug 21 12:56 README.pdf
2.3K Aug 21 12:54 README.rst

These files may be updated without further notice.

3.9. Download of experiment data

In the long run, the X-ray image data will have detector calibration applied by default, but no geometry will be applied. For the first experiments, the calibration may need to be triggered manually.

3.9.1. Files

Raw data files from the experiment are available to users (after migration to the offline cluster).

A tool to produce calibrated files from the raw data is in preparation.

3.9.2. HDF5 File Compression

The calibrated data, which is stored in HDF5 format, is compressed to save storage space on Maxwell. In case of problems accessing the compressed files, a temporary script is provided to uncompress the calibrated data. The script creates an uncompressed copy of all HDF5 files in a given directory; it is available on Maxwell at:

$ /gpfs/exfel/sw/software/uncompress-run
  usage: uncompress-run [-h] source_directory destination_directory

  positional arguments:
      source_directory      Directory containing compressed HDF5 files
      destination_directory
                            Directory to hold uncompressed files. Will be created

  optional arguments:
      -h, --help            show this help message and exit

Note

This script should be considered a temporary escape hatch for tools which have problems reading the compressed data. It is better to fix the problem in the tool reading the data than to rely on the script. The script is slow, and the uncompressed files it produces may be removed to free up disk space.

3.9.3. Metadata catalog

An alternative way to retrieve data is through the metadata catalog. This service is available via a web interface. Data can be retrieved by user query: run number, file name, data.

Recorded experiment data is downloaded as an HDF5 file in XFEL’s Data Format.

This service is not fully deployed for the first experiments, and will need expert support to use initially. (In particular: the metadata catalog can identify the location of the desired files, but some manual work is required to retrieve those files via the DataReader device.) The service is initially restricted to retrieving all files from one run.