3. Data files

3.1. Data Policy

The data policy of European XFEL is available at https://www.xfel.eu/users/experiment_support/policies/scientific_data_policy/index_eng.html

3.2. Data Files

On the offline cluster (Maxwell), European XFEL data is stored in /gpfs/exfel/exp. Each proposal has its own directory, so your data will be available at a path something like:


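The concrete path depends on the instrument, cycle, and proposal number. As a hedged sketch, the layout seen in the example paths later in this chapter (e.g. /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803) can be reproduced with a small helper. The `proposal_run_path` name and the zero-padding widths are illustrative assumptions, not an official API:

```python
from pathlib import Path

def proposal_run_path(instrument, cycle, proposal, kind, run):
    """Build a run directory path following the layout seen in the
    example paths in this chapter, e.g.
    /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803.
    The exact layout is facility-defined; this helper is illustrative only.
    `kind` is one of 'raw', 'proc' or 'usr'."""
    return (Path('/gpfs/exfel/exp') / instrument / cycle
            / f'p{proposal:06d}' / kind / f'r{run:04d}')

# Reconstructs the XMPL example path used later in this chapter:
print(proposal_run_path('XMPL', '201750', 700000, 'raw', 803))
```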
3.3. Calibrated files

Calibrated files are created automatically after data has been migrated to the Maxwell cluster, and appear in the proc folder for a given proposal (see Storage).

3.4. Reading Data in Python

We provide a Python package, karabo_data, to read data from European XFEL.
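As a minimal, hedged sketch, a run can be opened with karabo_data's RunDirectory and iterated train by train. The path is the XMPL example run used later in this chapter, and the import is guarded because karabo_data is only expected to be installed on Maxwell:

```python
# Sketch only: assumes karabo_data is installed and the XMPL example
# run exists at this path (as on Maxwell).
try:
    from karabo_data import RunDirectory
except ImportError:
    RunDirectory = None  # karabo_data not available in this environment

if RunDirectory is not None:
    run = RunDirectory('/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0009')
    run.info()  # print an overview of the sources and trains in the run
    # Iterate over trains; `data` maps source names to their recorded values
    for train_id, data in run.trains():
        print(train_id, sorted(data.keys()))
        break
```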

3.5. Combining detector data from multiple modules

The pixel detectors (AGIPD and LPD) record data in separate files for each of their 16 modules.

The karabo_data Python library can combine detector modules into a numpy array (example)
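A hedged sketch of that combination step, assuming karabo_data is installed and an AGIPD run is available at the XMPL example path used later in this chapter:

```python
# Sketch only: requires karabo_data and an AGIPD run on disk.
try:
    from karabo_data import RunDirectory, stack_detector_data
except ImportError:
    stack_detector_data = None  # karabo_data not available here

if stack_detector_data is not None:
    run = RunDirectory('/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803')
    # Take the first train in the run; `data` maps sources to their values.
    train_id, data = next(iter(run.trains()))
    # Combine the per-module image data into a single numpy array.
    stacked = stack_detector_data(data, 'image.data')
    print(stacked.shape)
```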

Alternatively, the modules can be combined in a single view as an HDF5 virtual dataset, allowing the data to be processed by CrystFEL, for instance.
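The idea of a virtual-dataset view can be illustrated with h5py. This toy example combines two small stand-in "module" files along a new leading axis; real detector files have far more structure, and the facility scripts below should be used for actual runs:

```python
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()

# Two small files stand in for per-module detector files.
for mod in range(2):
    with h5py.File(os.path.join(tmpdir, f'module{mod}.h5'), 'w') as f:
        f['data'] = np.full((4, 8), mod, dtype=np.uint16)

# A virtual layout stacks the module datasets along a new "module" axis.
layout = h5py.VirtualLayout(shape=(2, 4, 8), dtype=np.uint16)
for mod in range(2):
    layout[mod] = h5py.VirtualSource(
        os.path.join(tmpdir, f'module{mod}.h5'), 'data', shape=(4, 8))

with h5py.File(os.path.join(tmpdir, 'combined.h5'), 'w') as f:
    f.create_virtual_dataset('data', layout)

# The combined file now presents a single (2, 4, 8) dataset, without
# copying any of the underlying data.
with h5py.File(os.path.join(tmpdir, 'combined.h5'), 'r') as f:
    combined = f['data'][:]
```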

There are scripts on Maxwell to produce a combined view for a given run, in:


Also see the README in that directory for details on how to use it.

3.6. Data Format

The data format that is used at the European XFEL is described in the European XFEL Data schema specification.

Experimental data are taken in the context of the following categories:

  • instruments: each instrument has its own label. For each instrument, there are multiple cycles:
  • cycle: a scheduling period in which multiple user experiments take place (of the order of months). Within each cycle, there are multiple proposals:
  • proposals: (a.k.a. beamtimes) each user experiment proposal gets a number here. Within that proposal, users may carry out runs, which can be logically associated with different experiments and samples:
  • runs: when a user starts acquiring data, a new run starts, until that data acquisition is stopped. Within a run, there are trains, labelled by a unique train ID:
  • train ID: there are 10 trains per second. Each train can carry multiple pulses:
  • pulse ID: up to 2700 pulses per train, individually counted. The counter starts from zero for every train.

(Multiple runs can be grouped together in an experiment, but this is not activated at the moment.)

We distinguish different types of data:

  • fast data, which is different for every pulse (carries train ID and pulse ID)
  • train data, which is recorded at train speed (carries only train ID)
  • sequence data, which is not attached to every train (it could be inter-train data). Sequence data can have multiple data entries within each train (each carrying a time stamp)

Data files are in HDF5 format, following the structure described above, which is also displayed in Fig. 3.1.


Fig. 3.1 Example of the data structure in an HDF5 file

3.6.1. HDF5 Chunking & Compression

Both raw and corrected data may be stored using the HDF5 chunked layout. Some parts of the corrected data are compressed using the gzip compression filter in HDF5. In particular, detector gain stage and mask datasets compress well, saving a lot of disk space.

You can examine compression and chunk sizes using the GUI HDF View tool, our h5glance command line tool, or h5ls -v:

$ h5glance /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5 \
      dtype: uint16
      shape: 16000 × 2 × 512 × 128
   maxshape: Unlimited × 2 × 512 × 128
     layout: Chunked
      chunk: 16 × 2 × 512 × 128
compression: None (options: None)

$ h5ls -v /gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5/INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data
Opened "/gpfs/exfel/exp/XMPL/201750/p700000/raw/r0803/RAW-R0803-AGIPD00-S00000.h5" with sec2 driver.
data                     Dataset {16000/Inf, 2/2, 512/512, 128/128}
    Location:  1:12333
    Links:     1
    Modified:  2017-11-20 04:57:44 CET
    Chunks:    {16, 2, 512, 128} 4194304 bytes
    Storage:   4194304000 logical bytes, 4194304000 allocated bytes, 100.00% utilization
    Type:      native unsigned short

From January 2019, all compressed datasets are stored with a single frame per chunk, to minimise the impact on analysis code reading the data.
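To illustrate the scheme, h5py can create a gzip-compressed dataset with one frame per chunk. This is a toy example, not the facility's actual writing code, and the dataset name and frame shape are made up:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'chunking_example.h5')

# Toy example: 100 frames of 512 × 128 pixels, one frame per chunk,
# compressed with the gzip filter (as used for e.g. mask datasets).
frames = np.zeros((100, 512, 128), dtype=np.uint8)
with h5py.File(path, 'w') as f:
    ds = f.create_dataset('mask', data=frames,
                          chunks=(1, 512, 128), compression='gzip')
    chunks, comp = ds.chunks, ds.compression
```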

If you observe pathologically slow reading, check whether you are accessing a compressed dataset with a chunk size larger than one frame. HDF5 decompresses an entire chunk at once, and it may be redoing this for each frame you read. You can avoid this by setting a cache size large enough to hold one complete chunk. The necessary C code looks something like this:

hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
// Set a 32 MB cache size (calculate at least the size of one chunk)
H5Pset_chunk_cache(dapl, H5D_CHUNK_CACHE_NSLOTS_DEFAULT, 32 * 1024 * 1024, 1);
hid_t h5_dataset_id = H5Dopen(h5_file_id, ".../image/gain", dapl);

To benefit from chunk caching, you need to reuse the opened dataset ID for successive reads, instead of opening and closing it to read each frame.
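If you read the files through h5py rather than the C API, a comparable chunk cache size can be requested when opening the file via the rdcc_nbytes parameter. The 32 MB figure is the same illustrative value as above, and the file created here is a stand-in for a real detector file:

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small stand-in file with a compressed, chunked dataset.
path = os.path.join(tempfile.mkdtemp(), 'cache_example.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('image/gain',
                     data=np.zeros((10, 512, 128), dtype=np.uint8),
                     chunks=(10, 512, 128), compression='gzip')

# rdcc_nbytes sets the chunk cache size used for datasets in this file;
# make it at least as large as one chunk.
with h5py.File(path, 'r', rdcc_nbytes=32 * 1024 * 1024) as f:
    dset = f['image/gain']   # keep this handle open for repeated reads
    frame = dset[0]          # later reads from the same chunk hit the cache
    shape = frame.shape
```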

3.7. Sample Data

The file data structure is specific to each experiment and run configuration, so it cannot be expected to be static or fully known in advance. However, (non-representative) test data sets are being prepared and made available to users to help them prepare for their experiments.

3.7.1. Example runs

We have prepared an environment to mimic the real experiment data cycle at the European XFEL. For this, we have a fake instrument called XMPL, which contains numerous proposals and data runs giving an overview of the data to expect. This data is made available on Maxwell:


It follows the same structure as each experiment (see Storage for more details), and will be used to share examples of the file formats generated at the facility, from all instruments and detectors. These data sets are also linked to the Metadata Catalog (MDC), where information about the data (instrument, detector, sample, date, …) can be found. Each run's data set comprises raw data (in .../p700000/raw/run_id), calibrated data (in .../p700000/proc/run_id), and a set of sample scripts to read the data (in .../p700000/usr/run_id).

List of sample data sets:

run id  instrument/tunnel  detector/device  sample                            date        comments
r0017   SPB                AGIPD            Water                             2018-04-03  commissioning
r0034   SPB                AGIPD            Lysozyme (liquid)                 2018-04-03  commissioning
r0035   SPB                AGIPD            Lysozyme (liquid)                 2018-04-03  commissioning
r0036   SPB                AGIPD            Lysozyme (liquid)                 2018-04-03  commissioning
r0273   SPB                AGIPD            Lithium titanate                  2018-08-18
r0803   SPB                AGIPD            Lithium titanate [1]              2017-11-20  commissioning
r0216   FXE                LPD              aqueous solution of [Fe(bpy)3]2+  2017-09-18  User Run
r0009   SA1_XTD2           XGM              n/a                               2019-02-15  commissioning (XPD)
r0032   SA3_XTD10          XGM              n/a                               2019-02-15  commissioning (XPD)


Mock data can be generated using the karabo_data package, e.g.:

>>> from karabo_data.tests.make_examples import make_agipd_example_file
>>> make_agipd_example_file('agipd_example.h5')

>>> from karabo_data.tests.make_examples import write_file, Motor, ADC, XGM
>>> write_file('test_file.h5', [
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_X'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Y'),
...     Motor('SPB_DET_MOT/MOTOR/AGIPD_Z'),
...     ADC('SA1_XTD2_MPC/ADC/1', nsample=0, channels=(
...         'channel_3.output/data',
...         'channel_4.output/data',
...         'channel_5.output/data'))
...     ], ntrains=500, chunksize=50)

[1] Lithium titanate, spinel; nanopowder, <200 nm particle size (BET), >99%; CAS Number 12031-95-7; Linear Formula Li4Ti5O12; https://www.sigmaaldrich.com/catalog/product/aldrich/702277

3.7.2. Other example files

A data file representing the large pixel detectors’ file structure can be found in:


The files available (as of 25 Aug 2017) contain further information and a Jupyter (Python) notebook demonstrating how to parse the HDF5 file:

219K Aug 24 17:47 DataformatExample.ipynb
220K Aug 21 12:42 DataformatExample.pdf
 39K Aug 21 12:43 data_orientation_HDF5.png
635M Aug 21 12:41 R0126-AGG01-S00002.h5
 62K Aug 21 12:56 README.pdf
2.3K Aug 21 12:54 README.rst

These files may be updated without further notice.

3.8. Download of experiment data

In the long run, the X-ray image data will have detector calibration applied by default, but no geometry applied. For the first experiments, the calibration may need to be triggered manually.

3.8.1. Files

Raw data files from the experiment are available to users (after migration to the offline cluster).

A tool to produce calibrated files from the raw data is in preparation.

3.8.2. HDF5 File Compression

The calibrated data, which is stored in HDF5 format, is compressed to save valuable storage space on Maxwell. A temporary script to uncompress the calibrated data is provided in case issues occur when accessing the compressed files. The script creates an uncompressed copy of all HDF5 files in a given directory; it is accessible on Maxwell via:

$ /gpfs/exfel/sw/software/uncompress-run
  usage: uncompress-run [-h] source_directory destination_directory

  positional arguments:
      source_directory      Directory containing compressed HDF5 files
      destination_directory Directory to hold uncompressed files. Will be created

  optional arguments:
      -h, --help            show this help message and exit


This script should be considered a temporary escape hatch for tools which have problems reading the compressed data. Fixing the problems in the tool reading the data is preferable to relying on this script. It is slow, and the uncompressed copies it produces may be removed to free up disk space.

3.8.3. Metadata catalog

An alternative way to retrieve data is through the metadata catalogue. This service is available via a web interface. Data can be retrieved via a user query: run number, file name, date.

Recorded experiment data is downloaded as an HDF5 file in XFEL’s Data Format.

This service is not fully deployed for the first experiments, and will need expert support to use initially. (In particular: the metadata catalog can identify the location of the desired files, but some manual work is required to retrieve those files via the DataReader device.) The service is initially restricted to retrieving all files from one run.