Abstract of JOKLA Project

The JOREK code can already use up to several thousand CPU cores through hybrid MPI plus OpenMP parallelization. As an example, simulations of ELMs are produced taking into account the X-point geometry with both closed and open field lines. The complexity of the tokamak geometry and the fine mesh it requires lead to large computational requirements. The code consists mainly of numerical computations on 3D data. The toroidal dimension of the tokamak is treated in Fourier space, while the poloidal plane is decomposed into 2D Bezier patches. The numerical scheme applies a direct solver to a large sparse matrix as the preconditioner for the iterative solver. This direct solver (usually the PaStiX library) dominates computation time and memory consumption for large simulations. Other important operations are the assembly of the large sparse matrix and, to some extent, the multiplication of the large sparse matrix with vectors. In some places, collective communications synchronize a subset of the MPI processes.
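
To illustrate the sparse matrix-vector multiplication mentioned above, here is a minimal sketch in Python. It assumes the matrix is stored in compressed sparse row (CSR) form; the storage format actually used in JOREK is not stated here, and the function name csr_matvec and the small 3x3 matrix are purely illustrative. Each output row depends only on its own nonzeros, which is why this kind of loop is a natural candidate for thread-level parallelism.

```python
def csr_matvec(values, col_indices, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form (illustrative helper)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_indices[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_indices = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_matvec(values, col_indices, row_ptr, x)
print(y)
```

The access to x through col_indices is indirect, so the operation tends to be limited by memory bandwidth rather than by arithmetic throughput.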

The Intel Xeon Phi architecture and NVIDIA GPGPUs are steadily being adopted in clusters. The current-generation MIC coprocessor, Xeon Phi, provides a highly multi-threaded environment. Standard programming models such as MPI/OpenMP have started to be used on systems equipped with these coprocessors. This hardware offers both higher memory bandwidth and more compute resources than standard compute nodes. Porting a large scientific application to Intel KNC (the previous generation of Xeon Phi) with the aim of high performance proved to be a difficult task [1, 2]. We now plan to port the JOREK application to Intel KNL, which promises to be significantly more suitable, in particular because of its higher memory bandwidth.

Previous HLST projects (e.g., MICPORT in 2013, GOMIC in 2014) have demonstrated that porting a large scientific application to MIC is not a simple task. The aim of this project is twofold. First, a profiling and benchmark study will identify the bottlenecks of the JOREK application on KNL hardware and develop strategies to overcome them. Second, initial optimizations will be carried out to alleviate the most severe bottlenecks.

[1] http://www.esaim-proc.org/articles/proc/abs/2016/01/proc165313/proc165313.html
[2] http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/poster_files/post220s2-file3.pdf