クラスタの利用法

Introduction to the PREWS cluster

* 総則 General rules

バックアップはユーザの責任とし、可用性・セキュリティは保証しません。
メンテナンスのために予告なくシステムを停止することがあります。
障害や攻撃によりデータが失われたり盗まれる危険性も完全には否定できません。
例えば、過去に京都大学のスパコンシステムで 77 TB のデータが消失したことがあります。
https://www.iimc.kyoto-u.ac.jp/ja/whatsnew/information/detail/211228056999.html
本クラスタでも、雷のため、システムが突然停止したことがあります。

Users have to back up their own data themselves.
We do not guarantee availability and security.
We might shut down the system for maintenance without prior announcement.
Your data might be lost and/or stolen due to system failures or attacks.
For example, 77 TB of user data was once lost from the supercomputer system at Kyoto University.
https://www.iimc.kyoto-u.ac.jp/ja/whatsnew/information/detail/211228056999.html
Our cluster also suffered an unexpected shutdown when a thunderstorm hit this area.

したがって、重要な研究データは必ず各自の責任でバックアップし、個人情報・機微情報・
企業と NDA を結んでいるようなデータは置かないでください。

Please back up your important data yourself.
Do not store personal, sensitive or confidential data (for example, data covered by
an NDA or subject to export control) on the cluster.

アカウントの有効期限満了後 2 週間でシステムにログインできなくし、
1 ヶ月でデータを削除します。

We will block your login two weeks after your account expires and
delete your files one month after expiration.

言うまでもないことですが、不正アクセスや他人の知的財産権を侵害するような
行為は厳禁です。また、暗号通貨マイニングや NFT など研究と関係ないことに
計算資源を使ってはいけません(電気代がかかっています)。不正行為は関係部局に
連絡し、処分の対象となります。

Never perform anything illegal, including (but not limited to) cracking and piracy.
Do not use our computing resources for anything unrelated to your approved research.
Do not perform cryptocurrency mining or NFT transactions.
Violations of these rules will be reported to the relevant personnel and are subject to
disciplinary action.

本クラスタはクライオ電顕のデータ解析(SPA, STA, MicroED)専用です。
広い意味での構造生物学・構造化学計算(例えば、構造予測、蛋白質設計、分子動力学、
量子化学)については、管理者にご相談ください。
計算資源の空き具合や科学的意義に基づいて、個別に判断します。

The PREWS cluster must be used exclusively for data processing of
cryo-electron microscopy (SPA, STA, MicroED).
For a broader scope of structural biology/chemistry applications
(e.g. structure prediction, protein design, molecular dynamics,
quantum chemistry), please contact the cluster administrator.
We might give permission on a case-by-case basis, based on the resource
availability and scientific merits.

アカウントを無断で他人に共有しないでください。同じラボの同僚であってもです。
不正行為があった場合、アカウントの持ち主の責任となります。

Do not let anyone else, even a colleague in your own lab, use your account without
explicit permission from the cluster administrators.
The account owner will be held responsible for any damages.

ジョブの実行様態やディスク消費量などは随時監視しており、非効率な利用については
注意することがあるので指導に従ってください。具体的には、scratch を使っていないとか、
粒子を適切にダウンサンプリングしていないとか、古い job を Clean していないとか、
ゴミ箱や core dump を何ヶ月も消していないなどの行為は注意します。

We monitor jobs and disk usage and might warn you about inefficient use.
Please follow the advice of the system administrator.
Examples of inefficient use include not using the scratch space, not down-sampling particles,
not cleaning old jobs, and leaving trash folders and core dumps undeleted for months.

不正利用を防ぐため、ネットワーク接続のログ(接続元と接続先の IP アドレス・ポート・日時など)
を保存しています。

To prevent abuse, we log network connections, including source and destination
IP addresses and ports, and connection timestamps.

利用状況の統計は利用者を匿名化した上で公開することがあります。

We might publish usage statistics after anonymizing the users.

* ログインノード Log-in node

ログインノード `prews-login` は、計算ジョブを投入したり、計算結果を確認するためのものです。
ログインノードに負荷をかける行為は禁止です。ログインノードはファイルサーバを兼ねているので
他のユーザにも迷惑がかかります。

The log-in node `prews-login` is exclusively for submitting jobs and checking your results.
Never put a heavy load on the log-in node. The log-in node also serves as a file server, so
a high load on it affects other users too.

例えば、`relion_display` によって処理結果を確認するのはかまいませんが、
`relion_refine` によって classification や refinement を実行するのは禁止します。
負荷をかけているプロセスは管理者権限によって通告なく強制終了させます。

For example, you may use `relion_display` to check your job results but
do not run `relion_refine` for classification and refinement on the log-in node.
Processes with high loads will be terminated by the administrator without notice.

ログインノード上で JupyterLab や CryoSPARC など、Web ベースのアプリケーションを
実行することは禁止します。複数ユーザが同じプログラムを使うとポートが衝突してトラブルの
原因になるのと、デフォルト設定のままではセキュリティ・リスクがあるためです。
計算ノード上で実行してローカルへ port-forward するのは禁止しませんが、アプリケーション側で
パスワードを設定するなど、他ユーザから操作できないような設定を施してください。
計算ノードへの SSH は、そのノード上で job を実行している人しかできませんが、
その他のポートへの接続は誰でもできてしまいます。
安全なやりかたが分からない人は Web アプリケーションは利用しないでください。

Do not run Web-based applications such as JupyterLab and CryoSPARC on the log-in node;
this avoids port conflicts between users and the security risks of default settings.
You may run these programs on allocated worker nodes and access them by port-forwarding
to your local computer, provided that you make sure others cannot use the Web interface (e.g. by setting a password).
SSH connection to a worker node is denied unless your job is running on the node,
but other ports can be accessed by all users.
If you don't know how to do so securely, please do not run Web-based applications.
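
As a rough sketch of the port-forwarding step described above, the helper below only builds the `ssh` command to run on your local computer. The function name `tunnel_command`, the user name `alice`, the port `8888` and the assumption that `prews-login` is the address you normally use to reach the log-in node are all placeholders, not part of the cluster configuration.

```shell
#!/bin/bash
# Build the ssh command that forwards a local port to a Web application running on a
# worker node, using the log-in node as the jump point.
# Run the resulting command on your LOCAL computer, not on the cluster.
# Assumptions: user name, log-in host address and port are placeholders; the application
# must listen on an interface reachable from the log-in node and must be password-protected.
tunnel_command() {
    local user=$1 login_host=$2 worker=$3 port=$4
    # -N: do not run a remote command; -L: forward local port to worker:port
    echo "ssh -N -L ${port}:${worker}:${port} ${user}@${login_host}"
}

tunnel_command alice prews-login prews-gpu01 8888
```

With these placeholder values the helper prints `ssh -N -L 8888:prews-gpu01:8888 alice@prews-login`. While that command is running, the application would be reachable at `http://localhost:8888` on your local computer.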

* 計算ノード Worker nodes

計算ノードは GPU ノード 30 台と CPU ノード 10 台の二種類があります。
それぞれのスペックは以下の通りです。

Worker nodes consist of 30 GPU nodes and 10 CPU nodes.

GPU ノード:

- AMD EPYC 7713 (64 物理コア、SMT 有効で 128 スレッド)
- NVIDIA A5000 GPU (24 GB) x 4
- 512 GB RAM
- SSD スクラッチ 2 TB

GPU nodes:

- AMD EPYC 7713 (64 physical cores, 128 SMT threads)
- NVIDIA A5000 GPU (24 GB) x 4
- 512 GB RAM
- SSD scratch 2 TB

CPU ノード:

- Intel Xeon Gold 6330 x 2 (合計 56 物理コア、SMT 有効で 112 スレッド)
- 256 GB RAM
- スクラッチエリアなし

CPU nodes:

- Intel Xeon Gold 6330 x 2 (56 physical cores, 112 SMT threads in total)
- 256 GB RAM
- No scratch area

ジョブを投入すると SSD スクラッチ領域が確保され、 `$SCRATCH_DIR` 環境変数にパスが
設定されるので、そこを利用してください。この領域はジョブが終了すると自動的に消去されます。

When you submit a job, a scratch space is allocated on the local SSD and the path
to it is stored in the `$SCRATCH_DIR` environment variable.
When the job is completed, this space is wiped out.
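
A minimal sketch of how a job script might stage input files into this space is shown below. The function name `stage_to_scratch`, the `input` sub-directory and the fallback to a temporary directory when `$SCRATCH_DIR` is not set are illustrative assumptions, not part of the cluster setup.

```shell
#!/bin/bash
# Copy input files to the per-job scratch space and print their new location.
# Assumption: $SCRATCH_DIR is set only inside a job with scratch space allocated.
# Outside a job (or on a CPU node, which has no scratch area) we fall back to a
# temporary directory so that the snippet still runs.
stage_to_scratch() {
    local scratch="${SCRATCH_DIR:-$(mktemp -d)}"
    local dest="$scratch/input"
    mkdir -p "$dest"
    cp -- "$@" "$dest"/
    echo "$dest"
}

# Example usage inside a job script (hypothetical file name):
#   work_dir=$(stage_to_scratch particles.star)
#   ls -l "$work_dir"
```

Because the scratch space is wiped when the job ends, any results written there have to be copied back to your home directory before the job script exits.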

* ディスク Disk

一人あたりのディスク容量は 10 TB に制限されています(TODO: 10 TB は仮の値。利用状況を見て変更します)。
この制限を超えると、ディスクへの書き込みが出来なくなり、プログラムがクラッシュしたり
データが失われる可能性があるので注意してください。

Your disk quota is 10 TB (TODO: 10 TB is a provisional value and may change depending on usage).
When it is exceeded, you can no longer write to the disk;
your programs may crash and unsaved data may be lost.

容量制限は、ユーザ単位ではなくディレクトリ単位で適用されます。
あなたのディレクトリに ACL などを設定し、他のユーザがファイルを設置した場合、その容量は
あなたのディスク利用量に加算されますので注意してください。

The disk quota is applied per directory, not per user.
If you set up an ACL such that other accounts (e.g. your collaborators) can write to your directory,
their files are added to YOUR usage, not their usage.

現在の使用量は次のコマンドで確認できます。

    disk_quota

次のような出力が得られた場合、10 TB の上限中のうち、66 GB を利用していることが分かります。

    DIRECTORY                          USED_BYTES              QUOTA_BYTES
    ----------------------------------------------------------------------
    /home/tnakane                  66,135,976,471       10,000,000,000,000

You can check your current disk usage by the following command.

    disk_quota

In the example output shown above, the account `tnakane` is using about 66 GB of its 10 TB quota.
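
To read the byte counts more easily, the output can be converted into a percentage. The sketch below assumes that the output always has the layout of the example (two header lines, then three columns with comma-separated digits); the function name `quota_percent` is made up for this illustration.

```shell
#!/bin/bash
# Convert `disk_quota`-style output (read on stdin) into a percentage of the quota used.
# Assumption: the layout matches the example in this document: two header lines, then
# DIRECTORY, USED_BYTES and QUOTA_BYTES columns with comma-separated digits.
quota_percent() {
    awk 'NR > 2 && NF == 3 {
             gsub(/,/, "", $2); gsub(/,/, "", $3)   # strip thousands separators
             printf "%s %.1f%%\n", $1, 100 * $2 / $3
         }'
}

# On the cluster:  disk_quota | quota_percent
# Here, with the sample output from this document:
quota_percent <<'EOF'
DIRECTORY                          USED_BYTES              QUOTA_BYTES
----------------------------------------------------------------------
/home/tnakane                  66,135,976,471       10,000,000,000,000
EOF
```

For the sample output this prints `/home/tnakane 0.7%`.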

正当な理由がある場合は、管理者の承認のもとで上限を増やすことがあります。
ディスクの無駄な使い方をしていないことが前提であり、利用状況を確認して審査します。
(例えば、古い job を clean していないとか、粒子をダウンサンプリングしていないと
いった無駄をなくすまでは申請を却下します)

If you have a valid reason, the administrator might agree to increase your quota,
after reviewing your disk usage. If you have wasted space (for example,
you did not clean old jobs or did not down-sample particles), the application
will be rejected until you sort out the mess.

* ジョブシステム Job system

ジョブシステムとして SLURM を利用しています。

We use SLURM as the job system.

現在、1 ユーザが利用できる資源の上限はシステム的には制限していません。
他の人の job が入らなくならないように、譲り合って利用してください。
(TODO: 今後、混雑状況を見て、上限を設定します)

The maximum number of jobs you can run is not limited by the system at the moment.
Please be considerate and leave room for others' jobs.
(TODO: In the future, we might enforce a limit after analysing the demand.)

** SLURM コマンドの例 Examples

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up 1-12:00:00     27  down* prews-gpu[04-30]
gpu*         up 1-12:00:00      3   idle prews-gpu[01-03]
cpu          up 1-12:00:00      7  down* prews-cpu[04-10]
cpu          up 1-12:00:00      3   idle prews-cpu[01-03]

ノードの状態を表示します。idle が待機状態です。drain や down はメンテナンス中あるいはシャットダウン中です。

The `sinfo` command displays the status of nodes.
`idle` means the node is free. `drain` and `down` mean the node is under maintenance or shut down.

需要が少ない時期は、節電のため、ノードの一部を停止しています。
混雑してきてノードが不足している場合は連絡ください。追加起動します。

We often shut down some nodes to save electricity when the demand is low.
If you find the queue is full, please don't hesitate to ask us to launch more nodes.
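
Before contacting us, it can help to count how many nodes are currently idle. The sketch below parses the default `sinfo` output shown above; the function name `idle_nodes` is illustrative, and the column order is assumed to match that example.

```shell
#!/bin/bash
# Sum the idle nodes per partition from default `sinfo` output read on stdin.
# Columns assumed: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # drop the "*" that marks the default partition
             n[$1] += $4
         }
         END { for (p in n) print p, n[p] }' | sort
}

# On the cluster:  sinfo | idle_nodes
# Here, with the sample output from this document:
idle_nodes <<'EOF'
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up 1-12:00:00     27  down* prews-gpu[04-30]
gpu*         up 1-12:00:00      3   idle prews-gpu[01-03]
cpu          up 1-12:00:00      7  down* prews-cpu[04-10]
cpu          up 1-12:00:00      3   idle prews-cpu[01-03]
EOF
```

For the sample output this prints `cpu 3` and `gpu 3`, one per line. A partition with no idle nodes is simply not listed.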

$ squeue

入っているジョブを一覧表示します。

`squeue` displays the list of submitted jobs.

$ scancel ID

指定した ID のジョブをキャンセル(実行中の場合は強制終了)します。
当然自分のジョブしかキャンセルできません。

`scancel` cancels the specified job. If the job is already running, it is terminated.
You can cancel only your own jobs, of course.

$ sbatch XXXX.sh

ジョブスクリプト XXXX.sh を投入します。
スクリプトの書き方は SLURM マニュアルを見てください。

`sbatch` submits a job script.
Please read the SLURM documentation to learn how to write a job script.
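
For orientation, a minimal job script might look like the sketch below. The partition names and the resource values mirror the `sinfo` output and the `srun` examples in this document; the job name, the output file pattern and the `job_summary` helper are placeholders to be replaced with your own settings and commands.

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder name
#SBATCH --partition=gpu           # "gpu" or "cpu", as listed by sinfo
#SBATCH --gres=gpu:1              # number of GPUs; omit on the cpu partition
#SBATCH --cpus-per-task=16
#SBATCH --mem=60G
#SBATCH --time=12:00:00           # must not exceed the partition TIMELIMIT (1-12:00:00)
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

# Outside SLURM the #SBATCH lines are plain comments, so this script also runs directly.
# SLURM_JOB_ID and SLURM_CPUS_PER_TASK are set by SLURM inside a job.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} host=$(uname -n) cpus=${SLURM_CPUS_PER_TASK:-unknown}"
}

job_summary
# Replace the line above with your actual commands.
```

Saved as, say, `XXXX.sh`, the script is submitted with `sbatch XXXX.sh`, and its standard output ends up in `slurm-<job ID>.out` in the directory from which it was submitted.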

$ srun --pty --mem 60G --cpus-per-task 16 --gres gpu:1 bash

16 コア、60 GB のメモリ、1 GPU を確保して interactive job を始めます。
(計算が終わったら確実にログアウトしてください)

The above command starts an interactive job, reserving 16 cores, 60 GB of memory
and 1 GPU. Please make sure you log out and release the allocation once your
job is done.

$ srun --pty --mem 60G --cpus-per-task 16 -p cpu bash

CPU ノードを要求します。
GPU を使わない計算は CPU ノードで行ってください。

This command requests a CPU node instead.
Please use CPU nodes for calculations that do not use GPUs.

$ srun --x11 --pty --mem 60G --cpus-per-task 16 --gres gpu:1 bash

上と同じですが X11 forwarding を有効にし、GUI アプリケーションも使えます。

This is the same as the first `srun` example above but enables X11 forwarding so that
you can use GUI applications.

====================================================================================

* アプリケーション Applications

システムには以下のアプリケーションが導入されています。
これ以外のソフトウェアが必要な場合、管理者に提案してください。
多くのユーザにとって有益であり、なおかつ、ライセンスや管理上の問題がない場合は管理者が導入します。
それ以外の場合は、各ユーザが自分のホームディレクトリにインストールしてください。
この場合、導入方法などについてのサポートは行いません。

The cluster has the following applications.
If you need other programs, please contact the administrator.
If your suggestion is useful to many users and does not have licensing or management issues,
we will install it globally. Otherwise, please install your own copy in your home directory.
We do not provide support for local installations.

** IMOD

`module load imod` で IMOD 4.11.25 を利用できます。

You can activate IMOD 4.11.25 by running `module load imod`.

** RELION と関連ツール RELION and related tools

RELION 5.0 が入っています。これを使うには、ログイン後
`module load relion` コマンドを実行してください。

これにより、以下の関連プログラムのパスも RELION のデフォルトとして設定されます。

RELION 5.0 beta is installed. To use it, run `module load relion` after logging in.
This also sets the paths to the following related programs as RELION defaults.

- CTFFIND 4.1.15
- Topaz
- Blush

RELION 公式チュートリアルのデータは `/home/data/RELION_Tutorial` に置いてあります。

RELION's official tutorial dataset is available at `/home/data/RELION_Tutorial`.

RELION の計算は、かならず計算ノードにジョブを投入して行ってください。
Job template としては、`/apps/packages/relion-5.0/relion-5.0-slurm-2gpu.sh` がデフォルトで
指定されているはずです。これは GPU ノードを半ノード(2 GPUs, 256 GB RAM, 32 cores/64 threads)確保
するので、ジョブの種類に応じて適宜 MPI process 数や thread 数を指定してください。

Please submit all non-trivial RELION jobs to worker nodes.
By default, `/apps/packages/relion-5.0/relion-5.0-slurm-2gpu.sh` is specified as
a job submission template. This script reserves half of a GPU node (2 GPUs, 256 GB RAM, 32 cores/64 threads);
set up the number of MPI processes and threads according to your job types.

Class2D/3D や Refine3D を GPU ノードで実行する場合は、ローカル SSD スクラッチエリア $SCRATCH_DIR を使ってください。

For Class2D/3D and Refine3D on a GPU node, please use the local SSD scratch space by
specifying `$SCRATCH_DIR`.

/apps/packages/relion-5.0/relion-5.0-slurm-4gpu.sh は GPU ノードを 1 ノードまるごと確保します。
/apps/packages/relion-5.0/relion-5.0-slurm-cpu.sh は CPU ノード(256 GB RAM, 56 cores/112 threads)を
1 ノードまるごと確保します。GPU 非対応ジョブ(Extract, Polish, CtfRefine など)は CPU ノードに投入してください。

/apps/packages/relion-5.0/relion-5.0-slurm-4gpu.sh reserves a whole GPU node.
/apps/packages/relion-5.0/relion-5.0-slurm-cpu.sh reserves a whole CPU node (256 GB RAM, 56 cores/112 threads).
Jobs without GPU acceleration, such as Extract, Polish and CtfRefine, must be submitted to a CPU node.

適切な MPI process 数や thread 数は、box size などデータセットの性質によって異なりますが、
まずは次のような値を推奨します。

Appropriate numbers of MPI processes and threads depend on your dataset (e.g. the box size).
As a starting point, we recommend the following values.

- Refine3D, Class3D, MultiBody: 1 + (number of GPUs x 2) MPI processes, 8 - 10 threads / process
- Class2D: 1 MPI process (for VDAM algorithm), 24 - 32 threads / process
- AutoPick with Topaz: (number of GPUs x 3) MPI processes
- CtfFind, Extract: 32 MPI processes
- CtfRefine, Polish: 8 MPI processes x 8 threads / process
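
The Refine3D / Class3D / MultiBody recommendation above can be turned into a small calculation. This is only a sketch of the arithmetic; the function names are made up for this example, and 10 threads per process is simply the upper end of the suggested range.

```shell
#!/bin/bash
# Starting-point MPI process count for Refine3D / Class3D / MultiBody,
# following the formula above: 1 + (number of GPUs x 2).
relion_mpi_procs() {
    local gpus=$1
    echo $(( 1 + gpus * 2 ))
}

# Total threads requested = MPI processes x threads per process.
relion_total_threads() {
    local gpus=$1 threads_per_proc=$2
    echo $(( $(relion_mpi_procs "$gpus") * threads_per_proc ))
}

relion_mpi_procs 2          # half GPU node (2 GPUs)
relion_total_threads 2 10   # compare with the 64 threads of a half node
relion_mpi_procs 4          # whole GPU node (4 GPUs)
relion_total_threads 4 10   # compare with the 128 threads of a whole node
```

This prints 5, 50, 9 and 90: with 10 threads per process, the recommended settings stay within the 64 threads of a half GPU node and the 128 threads of a whole GPU node.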

GPU の数を 2 から 4 に増やしても、計算速度は倍にはなりません。せいぜい 1.8 倍です。
そのため、粒子数の多いデータを処理するときは、GPU 数を増やすより、データセットを分割して処理したほうが効率的です。
また、複数ノードにまたがる計算や 8 GPU を使う計算は効率が悪いので推奨しません。

Doubling the number of GPUs from two to four does not double the computation speed.
The gain is at most 80 %.
Thus, when processing a huge dataset, it is more efficient to split the dataset and run multiple jobs
than running a single job using many GPUs.
We do not recommend running a job over multiple nodes or using more than 4 GPUs, because it is not efficient.

* Intel oneAPI Toolkit (and Intel Classic compiler)

Intel oneAPI Toolkit 2023.2 is available by sourcing the setup script in your shell: `source /apps/oneapi/setvars.sh`.

* CUDA SDK

Various versions of NVIDIA CUDA SDK are available in /usr/local.