5. Compute environment

5.1. User account

User accounts are created during the initial registration step in the UPEX portal. At this point the account can only be used for UPEX itself. If the user account is associated with an accepted and scheduled proposal, the account is upgraded 4 weeks before the first scheduled beamtime of that user. For the first early user period, the time between the upgrade of the account and the start of the experiment can be shorter. The upgraded account allows the user to access additional services such as the online safety training, the metadata catalog, and the computing infrastructure of the European XFEL.

By default upgraded user accounts are kept in this state for 1 year after the user’s last beamtime. An extension can be requested by the PI.

On-site guest WLAN (WiFi) is provided for all users. Users with an eduroam account provided by their home institute can connect directly. Users without an eduroam account must complete a registration procedure to obtain guest access for a limited time period: after connecting to the XFEL-Guest network (also when using a network patch cable) and opening a web browser, the user can register for use of the guest network. The registration is valid for 10 days and up to 5 devices.

5.1.1. Tools

At different stages of the proposal, users are granted access to different services:

Stage                               | Access provided
Proposal submission                 | Access to the User Portal (UPEX)
Approval of proposal and scheduling | Lightweight account
Preparation phase                   | Access to the metadata catalog and the beamtime store filesystem; LDAP account upgraded for members of all accepted proposals
Beam time                           | Access to catalogs and dedicated online and offline services
Data analysis                       | Access to catalogs and shared offline computing resources, initially limited to 1 year after the beamtime

5.2. Online cluster

During beam time, exclusive access to a dedicated online cluster (ONC) is available only to the experiment team members and instrument support staff.

European XFEL aims to keep the software provided on the ONC identical to that available on the offline cluster (which is the Maxwell cluster).

5.2.1. Online cluster nodes in SASE 1

Beamtime in SASE 1 is shared between the FXE and the SPB/SFX instruments, with alternating shifts: when the FXE shift stops, the SPB/SFX shift starts, and vice versa.

Within SASE 1, one node is reserved for the SPB/SFX experiments (sa1-onc-spb) and one node is reserved for the FXE experiments (sa1-onc-fxe). These can be used by the respective groups at any time during the experiment period (i.e. during shifts and between shifts).

Both the SPB/SFX and the FXE users have shared access to another 6 nodes: sa1-onc-01, sa1-onc-02, sa1-onc-03, sa1-onc-04, sa1-onc-05 and sa1-onc-06. The default expectation is that these nodes are used during the users' own shift and that usage stops at the end of the shift, so that the other experiment can use the machines during its shift.

Overview of available nodes and usage policy:

name                     | purpose
sa1-onc-spb              | reserved for SPB/SFX
sa1-onc-fxe              | reserved for FXE
sa1-onc-01 to sa1-onc-06 | shared between FXE and SPB/SFX; use only during shifts

These nodes do not have access to the Internet.

The node name prefix sa1-onc- stands for SAse1-ONlineCluster.

5.2.2. Online cluster nodes in SASE 2

Beamtime in SASE 2 is shared between the MID and the HED instruments, with alternating shifts: when the MID shift stops, the HED shift starts, and vice versa.

Within SASE 2, one node is reserved for the MID experiments (sa2-onc-mid) and one node is reserved for the HED experiments (sa2-onc-hed). These can be used by the respective groups at any time during the experiment period (i.e. during shifts and between shifts).

Both the MID and the HED users have shared access to another 6 nodes: sa2-onc-01, sa2-onc-02, sa2-onc-03, sa2-onc-04, sa2-onc-05 and sa2-onc-06. The default expectation is that these nodes are used during the users' own shift and that usage stops at the end of the shift, so that the other experiment can use the machines during its shift.

Overview of available nodes and usage policy:

name                     | purpose
sa2-onc-mid              | reserved for MID
sa2-onc-hed              | reserved for HED
sa2-onc-01 to sa2-onc-06 | shared between MID and HED; use only during shifts

These nodes do not have access to the Internet.

The node name prefix sa2-onc- stands for SAse2-ONlineCluster.

5.2.3. Online cluster nodes in SASE 3

Beamtime in SASE 3 is shared between the SQS and the SCS instruments, with alternating shifts: when the SQS shift stops, the SCS shift starts, and vice versa.

Within SASE 3, one node is reserved for the SCS experiments (sa3-onc-scs) and one node is reserved for the SQS experiments (sa3-onc-sqs). These can be used by the respective groups at any time during the experiment period (i.e. during and between shifts).

Users of both SASE 3 instruments have shared access to another 6 nodes: sa3-onc-01, sa3-onc-02, sa3-onc-03, sa3-onc-04, sa3-onc-05 and sa3-onc-06. The default expectation is that these nodes are used during the users' own shift and that usage stops at the end of the shift, so that the other experiment can use the machines during its shift.

Overview of available nodes and usage policy:

name                     | purpose
sa3-onc-scs              | reserved for SCS
sa3-onc-sqs              | reserved for SQS
sa3-onc-01 to sa3-onc-06 | shared between SCS and SQS; use only during shifts

These nodes do not have access to the Internet.

The node name prefix sa3-onc- stands for SAse3-ONlineCluster.

Note that the usage policy on shared nodes is not strictly enforced. Scientists from the different instruments should liaise to agree on any usage other than specified here.

5.2.4. Access to online cluster

The ONC can only be accessed from workstations (Linux Ubuntu 16.04) in the control hutch or from dedicated access workstations located at the XFEL headquarters building on level 1. Currently, such access workstations are available in the green area, close to room E1.062, at the level of the main reception.

From these access computers, one can ssh directly into the online cluster nodes and also to the Maxwell cluster (see Offline cluster). The X display is forwarded automatically in both cases.
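For example, from one of these access workstations one can open a shell on an online cluster node or on the Maxwell login host (node names as listed in the sections above; replace <username> with your own account name):

# log into one of the online cluster nodes, e.g. a shared SASE 1 node
ssh <username>@sa1-onc-01

# or log into the Maxwell (offline) cluster
ssh <username>@max-exfl.desy.de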

Direct Internet access from the online cluster is not possible.

5.2.5. Storage

The following storage resources are available on the Online user cluster:

  • raw: data stored by the DAQ (data cache); not accessible (access via a reader service is planned in the long run)
  • usr: beamtime store. Users can upload files, data or scripts to this folder to be used during the beamtime. The folder is mounted from the offline cluster and thus immediately synchronised with the corresponding folder there. Space here is limited (5 TB).
  • proc: can contain data processed by dedicated pipelines (e.g. calibrated data). Not used at the moment (May 2019).
  • scratch: folder where users can write temporary data, e.g. the output of customised calibration pipelines. This folder is intended for large amounts of processed data; if the processed data is small in volume, it is recommended to use usr instead.

Access to data storage is possible via the same path as on the Maxwell cluster:

/gpfs/exfel/exp/<instrument>/<instrument_cycle>/p<proposal_id>/(raw|usr|proc|scratch)
Folder  | Permission | Quota    | Retention
raw     | None       | No quota | Data migrated to offline storage and then removed
usr     | Read/write | 5 TB     | Immediately synced with the Maxwell cluster
proc    | Read       | No quota | Data removed after migration
scratch | Read/write | No quota | Data removed when space is needed

To simplify access to files, symbolic links are in place that recreate the file structure as visible on the online cluster.
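As an illustration, for a hypothetical proposal 2416 at the SPB/SFX instrument in cycle 201901 (instrument and cycle names here are placeholders, the actual values for your beamtime follow the same pattern), the folders would be:

/gpfs/exfel/exp/SPB/201901/p002416/raw
/gpfs/exfel/exp/SPB/201901/p002416/usr
/gpfs/exfel/exp/SPB/201901/p002416/proc
/gpfs/exfel/exp/SPB/201901/p002416/scratch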

5.2.6. Access to data on the online cluster

Currently, no access to data files is possible from the online cluster: the raw directory is not readable and the proc directory is not populated with files.

Online analysis tools running on the online cluster thus have to be fed the currently recorded data through the Karabo Bridge.
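A minimal sketch of reading live data with the karabo_bridge Python client is shown below; the endpoint address is a placeholder, and the actual host and port have to be obtained from the instrument staff:

# minimal Karabo Bridge client sketch (karabo_bridge Python package)
from karabo_bridge import Client

client = Client("tcp://sa1-onc-01:4545")   # placeholder endpoint, ask instrument staff
data, metadata = client.next()             # blocks until the next train of data arrives

# 'data' maps source names to dictionaries of recorded values
for source, content in data.items():
    print(source, sorted(content.keys()))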

File-based post-processing thus has to take place on the offline (= Maxwell) cluster after the files have been transferred at the end of a run. There is a delay of several minutes for this (depending on run length and the overall load of the data transfer system).

5.2.7. Home directory warning

The home directory /home/<username> of each user on the online cluster is not shared with the home directory /home/<username> on the offline (= Maxwell) cluster. Within the online cluster, the home directory is shared across all online cluster nodes; within the offline cluster, it is shared across all offline cluster nodes.

To share files between the online and the offline cluster, the /gpfs/exfel/exp/<instrument>/<instrument_cycle>/p<proposal_id>/usr directory should be used: files stored here show up on both the online and the offline cluster, and are accessible to the whole group of users of this proposal.
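For example, to make an analysis script available on both clusters (my_script.py is a placeholder file name; the path components are as described above):

# run on either cluster; the usr folder is the same on both
cp my_script.py /gpfs/exfel/exp/<instrument>/<instrument_cycle>/p<proposal_id>/usr/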

5.3. Offline cluster

The Maxwell cluster, located in the DESY data center, is available for data processing and analysis during and after the experiment. Users are welcome and encouraged to make themselves familiar with the Maxwell cluster and its environment well in advance of the beam time. In the context of European XFEL experiments, the Maxwell cluster is also referred to as the "offline" cluster. "Offline" here refers to the experiment and distinguishes it from the "online" cluster; it does not mean the cluster is offline from the Internet.

General information on the Maxwell cluster is available here: https://confluence.desy.de/display/IS/Maxwell .

5.3.1. User access

Upon acceptance of a proposal, the main proposer will be asked to fill out the "A-form" which, in addition to information on the final selection of samples to be brought to the experiment, also contains a list of all participants in the experiment. At the time of submission of the A-form, all participants must have an active account in UPEX. This is the prerequisite for getting access to the facility's computing and data resources. After submission of the A-form, additional participants can be granted access to the experiment's data by PI request.

Users have access to:

  • HPC cluster
  • beamtime store, data repository and scratch space
  • web based tools

5.3.2. FastX Login

The recommended way to access the Maxwell cluster is through FastX. It can be used with the FastX client (Windows, macOS, Linux) or in any web browser.

Entry points:

  • Detailed information on remote login to Maxwell
  • Detailed information on FastX service

5.3.3. SSH access

Access via ssh is possible through host max-exfl.desy.de.

If you are trying to access the cluster from eduroam (or, more generally, from outside the DESY/EuXFEL network), you need to ssh to bastion.desy.de first and then ssh to max-exfl from there.
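For example (standard OpenSSH; replace <username> with your own account name):

# two hops, as described above
ssh <username>@bastion.desy.de
ssh max-exfl.desy.de

# or in a single command using OpenSSH's ProxyJump option
ssh -J <username>@bastion.desy.de <username>@max-exfl.desy.de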

Note

It is not advised to ssh through bastion to access the Maxwell cluster from outside the DESY network. The recommended way is to use FastX login, see FastX Login.

5.3.4. Resources

The overall XFEL resources available in the Maxwell cluster for the first user experiment period include:

  • 100 CPU nodes (20 cores)
  • 7 nodes with GPU cards

Hardware detail is available here: https://confluence.desy.de/display/IS/Maxwell+Hardware.

Jobs are scheduled by a Slurm system. For more details, also see the Maxwell documentation.

See also various contributions to a training event from March 2018 at https://indico.desy.de/indico/event/19753 .
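For a quick look at the state of the upex partition and your own jobs, the standard Slurm commands can be used, for example:

# show the nodes and their state in the upex partition
sinfo -p upex

# list your own pending and running jobs
squeue -u $USER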

During beamtime

4 compute nodes are reserved for each beamtime. The status of the reservation can be seen using the command:

scontrol show res upex_002416

where the six-digit number in upex_002416 is the proposal id, padded with leading zeros.

The output of the command looks like:

[@max-exfl001]~/reservation% scontrol show res upex_002416
ReservationName=upex_002416 StartTime=2019-03-07T23:05:00 EndTime=2019-03-11T14:00:00 Duration=3-14:55:00
Nodes=max-exfl[034-035,057,166] NodeCnt=4 CoreCnt=156 Features=(null) PartitionName=upex Flags=IGNORE_JOBS
TRES=cpu=312
Users=appel,cerantol,cstrohm,grabiger,lwollenw,plueckth,prestont,rupert,uschmann,zastrauu,zuzkon Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

The output of this command tells you the time period for which the reservation is valid (from 6 hours before the start of the beamtime to 6 hours after its end), the number and names of the reserved compute nodes, the ReservationName, and the users who are allowed to submit jobs using this reservation.

In order to submit a job using this reservation one needs to add the reservation name as an option to the sbatch command in the job submission script. For example:

#SBATCH --reservation=upex_002416
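A complete minimal job script using this reservation could look like the following sketch; the job name and the analysis command are placeholders to be replaced with your own:

#!/bin/bash
#SBATCH --partition=upex
#SBATCH --reservation=upex_002416   # reservation name from 'scontrol show res'
#SBATCH --nodes=1
#SBATCH --time=02:00:00
#SBATCH --job-name=my-analysis      # placeholder job name

module load exfel exfel_anaconda3   # see the FAQ below for this module
python my_analysis.py               # placeholder analysis command

The script is then submitted with sbatch <script name>.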

5.3.5. Queue

Please use the upex queue on the Maxwell cluster. For example, to allocate a node for interactive use:

salloc -N 1 -p upex -t 01:00:00

See the Maxwell documentation on Running Jobs Interactively for more details.

5.3.6. Tools available

DESY provides a set of applications on the Maxwell cluster. The list is available here: https://confluence.desy.de/display/IS/Alphabetical+List+of+Packages.

5.3.7. Storage

Users will be given a single experiment folder per beam time (not per user) through which all data will be accessible, e.g.:

/gpfs/exfel/exp/<instrument>/<instrument_cycle>/p<proposal_id>/(raw|usr|proc|scratch)
Storage | Quota | Permission | Lifetime  | Comments
raw     | None  | Read       | 2 months  | Fast-access raw data
usr     | 5 TB  | Read/Write | 24 months | User data, results
proc    | None  | Read       | 6 months  | Processed data, e.g. calibrated
scratch | None  | Read/Write | 6 months  | Temporary data (lifetime not guaranteed)

5.3.8. Synchronisation

The data in the raw directories are moved from the online cluster (at the experiment) to the offline (Maxwell) cluster as follows:

  • when the run stops (user presses the button), the data is flagged as ready to be copied to the Maxwell cluster and is queued for a copy service (provided by DESY). The data is copied in the background without the user noticing.

  • Once the data is copied, the data is ‘switched’ and becomes available on the offline cluster.

    The precise time at which this switch happens after the user presses the button cannot be predicted: if the data has already been copied (in the background), it can be instantaneous; otherwise the copy process needs to finish first.

  • The actual copying process (before the switch) can take anything between minutes and hours, and will depend on (i) the size of the data and (ii) how busy the (DESY) copying queue is.
  • The usr folder is mounted from the Maxwell cluster and is thus always identical between the online and offline systems. However, it is not optimised for dealing with large files and can thus be slow for larger files. There is a quota of 5 TB.

5.4. Docker containers

Running Docker containers is supported as an experimental feature. To use it, users need to request permission (by sending an email requesting this):

  • for use of Docker containers on the online cluster, please request this by email to it-support@xfel.eu
  • for use of Docker containers on the offline cluster, please request this by email to maxwell.service@desy.de

You may also need Docker access for containers used by other tools. For example, the calibrate.py script that converts raw into calibrated files needs Docker access (instrument scientists can provide the relevant script).

5.5. Compute environment FAQ

Frequently asked questions

5.5.1. Is the Jupyter notebook available on the Maxwell cluster?

The easiest way to run Jupyter Notebooks on the Maxwell cluster is to use the JupyterHub portal.

Alternatively, you can use the Python Anaconda distribution. To get the Anaconda distribution (based on Python 3) and the Jupyter notebook as an executable in the PATH, the command is:

module load exfel exfel_anaconda3

For example (as of 15 January 2018):

[user@max-exfl001]~% which python
/usr/bin/python                                # this is the linux system python
[user@max-exfl001]~% module load exfel exfel_anaconda3
[user@max-exfl001]~% which python
/gpfs/exfel/sw/software/xfel_anaconda3/X.X/bin/python   # xfel anaconda's python
[user@max-exfl001]~% which ipython
/gpfs/exfel/sw/software/xfel_anaconda3/X.X/bin/ipython
[user@max-exfl001]~% which jupyter-notebook
/gpfs/exfel/sw/software/xfel_anaconda3/X.X/bin/jupyter-notebook
[user@max-exfl001]~% jupyter-notebook --version
5.7.4

The `module load exfel exfel_anaconda3` command will create a so-called Jupyter kernel that makes all exfel Python modules available in Jupyter. If you prefer, you can create your own Anaconda installation in your user account.
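A possible sketch of the latter, assuming you have installed your own Miniconda or Anaconda and it is on your PATH (the environment name myenv is arbitrary):

# create and activate a personal environment
conda create -n myenv python=3 numpy jupyter
conda activate myenv

# register the environment as an additional Jupyter kernel
python -m ipykernel install --user --name myenv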

5.5.2. Is there a JupyterHub instance available?

The Maxwell cluster hosts a JupyterHub instance, which can be accessed using the user's login credentials. Once logged in, perform the following steps:

  • Choose the Maxwell partition (shared node Jupyter / dedicated node ExFEL etc.) your group has access to. This will start the Jupyter notebook server on one node of the Maxwell cluster. The maximum time limit for a shared node on the JupyterHub partition is 7 days; a dedicated node on all other partitions has a maximum time limit of 8 hours.

  • Click on Spawn

  • Browse through your home (/home/<username>) directory to select or start a new Jupyter project.

  • Jupyter projects stored, for example, in:

    /gpfs/exfel/exp/<instrument>/<instrument_cycle>/p<proposal_id>/(usr|scratch)
    

are also accessible via symbolic links in the user's home directory.

Remember to disconnect by clicking on Control Panel and then Stop My Server to free the reserved node.

This is the easiest way to run Jupyter Notebooks on the Maxwell cluster (it does not require port forwarding or machine reservations). As mentioned above, loading the exfel_anaconda3 module (`module load exfel exfel_anaconda3`) for the first time will create a Jupyter kernel called `xfel` in your home directory. When using JupyterHub, you can activate the kernel, and hence all xfel Python modules, by selecting `xfel` in the `Change Kernel` submenu under the `Kernel` menu entry.

Some details are explained at https://confluence.desy.de/display/IS/JupyterHub+on+Maxwell .

5.5.3. Can I execute a Jupyter notebook on the Maxwell cluster and connect it to the webbrowser of my desktop/laptop?

Yes.

This is a neat way of running the analysis: no X-forwarding is needed, and you often get better rendering using your own browser on your computer than by X-forwarding the window of a browser running on the Maxwell nodes.

The recommended way of using Jupyter notebooks is through JupyterHub.

Note

The Max-jhub service is experimental and may experience downtime.

If you cannot use JupyterHub, or want more fine-grained control over the resources used in your notebook session, follow the steps below.

5.5.3.1. From inside the DESY network

  1. Request a node just for you to carry out the analysis.

    To do this, we need to log in to max-exfl.desy.de and then use the salloc command to request, for example, one node (-N 1) for 8 hours (-t 08:00:00) in the upex partition (-p upex):

    [MYLOCALUSERNAME@COMP]$ ssh USERNAME@max-exfl.desy.de
    [USERNAME@max-exfl001]$ salloc -N 1 -p upex -t 08:00:00
    

    The system will respond (if it can allocate a node for you) with something like:

    salloc: granted job allocation ....
    salloc: ...
    salloc: Nodes max-exfl072 are ready for job
    

    The node we can now use (for the next 8 hours) is max-exfl072.

    In the commands above replace USERNAME with your username on the Maxwell cluster (same as XFEL/DESY user name).

    See section Queue for the salloc command.

  2. Now we open a new terminal on our local machine COMP and ssh directly to that node, max-exfl072. We also forward port 8432 on node max-exfl072 (or whichever one is assigned) to port 8432 on our local machine COMP:

    [MYLOCALUSERNAME@COMP]$ ssh -L 8432:localhost:8432 max-exfl072
    
  3. Now we can start the Jupyter Notebook and need to tell it to use port 8432, and not to start a browser automatically:

    [USERNAME@max-exfl072]$ module load exfel exfel_anaconda3
    [USERNAME@max-exfl072]$ jupyter-notebook --port 8432 --no-browser
    
  4. Then open a browser window on your machine COMP and point it to port 8432: https://localhost:8432

Summary

  • request node

  • then forward port and start jupyter there:

    ssh -L 8432:localhost:8432 USERNAME@max-exfl072
    module load exfel exfel_anaconda3
    jupyter-notebook --port 8432 --no-browser
    
  • open browser on local machine (https://localhost:8432)

5.5.3.2. From outside the DESY network

If your machine LAPTOP is outside the XFEL/DESY network, you need to get into the DESY network via bastion.desy.de. In this case, we recommend that you first ssh to max-exfl (via bastion.desy.de) to create a node allocation (in our example below for 8 hours):

[MYLOCALUSERNAME@LAPTOP]$ ssh USERNAME@bastion.desy.de
[USERNAME@bastion01]$ ssh max-exfl
[USERNAME@max-exfl001]$ salloc -N 1 -p upex -t 08:00:00

Once this is done, we need to connect a port on your local machine to the port that the Jupyter notebook listens on on max-exfl072. We need to go via bastion, i.e.

[MYLOCALUSERNAME@LAPTOP]$ ssh -L 8432:localhost:8432 bastion.desy.de -t ssh -L 8432:localhost:8432 max-exfl072
[USERNAME@max-exfl072] module load exfel exfel_anaconda3
[USERNAME@max-exfl072] jupyter-notebook --port=8432 --no-browser

Then open a browser with URL https://localhost:8432 on local machine LAPTOP.

5.5.3.3. Technical comments

You can ignore the comments below unless you run into difficulties or want to get more background information.

  • If your laptop is connected to eduroam, you are outside the XFEL/DESY network and need to follow the instructions in From outside the DESY network.

  • We have used 8432 as the port in the examples above. There is no particular reason for this choice, other than the port not being used by any other well-known software (see the list on Wikipedia) and the port number being greater than 1024.

    You can choose other ports as you like. Using different ports also allows running multiple Jupyter Notebook servers on the same node (each listening on a particular port).

    By default (i.e. if we don’t use the --port switch when starting the Jupyter Notebook), port 8888 is used.