Feature: Addition of other compilers, vectorization, OpenMP and hybrid parallelization
Summary:
- Sassena provides MPI for distributed-memory parallelization and threads for shared-memory parallelization. However, significant scalability is not achieved through thread parallelism alone, so it is useful to add another layer of shared-memory parallelism to enable hybrid parallelism within Sassena. This branch uses OpenMP for that purpose.
- Sassena does not take the memory architecture into account when it vectorizes.
- New build options are added so that the user can choose the compiler; the current Sassena always chooses the default MPI compiler.
Problem 1:
Detailed description: Current state of parallelism (n MPI processes, each with n threads):
- MPI Process 1
- Thread 1
- Thread 2 ...
- Thread n
- MPI Process 2
- Thread 1
- Thread 2 ...
- Thread n ...
- MPI Process n
- Thread 1
- Thread 2 ...
- Thread n
Problem with the current state: It is expected to be n*n (n^2) times faster. However, it is only n times faster.
Reason: If we use n MPI processes with 1 thread each, as shown below,
- MPI Process 1
- Thread 1
- MPI Process 2
- Thread 1 ...
- MPI Process n
- Thread 1
then it is n times faster.
However, if we use 1 MPI process and n threads, as shown below,
- MPI Process 1
- Thread 1
- Thread 2 ...
- Thread n
then it does not speed up at all, apart from special cases such as a single trajectory.
Conclusion: Only the MPI parallelization is effective.
Solution: This feature branch adds OpenMP as another layer of shared-memory threading to solve this problem:
Expected final state: (n MPI processes, each with 1 worker thread and n OpenMP threads)
- MPI Process 1
- OpenMP thread 1
- OpenMP thread 2 ...
- OpenMP thread n
- MPI Process 2
- OpenMP thread 1
- OpenMP thread 2 ...
- OpenMP thread n ...
- MPI Process n
- OpenMP thread 1
- OpenMP thread 2 ...
- OpenMP thread n
With this configuration, the calculation is expected to be n*n times faster. A minimal sketch of the hybrid setup is shown below.
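The following is a minimal sketch of such a hybrid MPI + OpenMP layout, assuming an MPI implementation with MPI_THREAD_FUNNELED support; the problem size, loop body, and variable names are placeholders and do not correspond to Sassena's actual code.

```cpp
// Hypothetical illustration of the hybrid MPI + OpenMP layout; not Sassena code.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    // Request funneled threading: only the main thread makes MPI calls,
    // while OpenMP threads do the work inside each MPI process.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each MPI process owns a slice of the work (e.g. a subset of q-vectors);
    // inside that slice, OpenMP threads split the per-frame/per-atom loop.
    const int local_work = 1000;                 // placeholder problem size
    std::vector<double> partial(local_work, 1.0);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < local_work; ++i) {
        local_sum += partial[i];                 // placeholder computation
    }

    // Combine the per-process results across MPI ranks.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d threads=%d sum=%f\n",
                    size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

Assuming a typical toolchain, such a program would be built with the MPI compiler wrapper plus the compiler's OpenMP flag (e.g. mpicxx -fopenmp, or -qopenmp with the Intel compilers) and run with n MPI ranks and OMP_NUM_THREADS=n.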
Implementation steps:
- Identified the hotspots using the Intel performance tools.
- For the "all" (coherent) type of calculation, the hotspot loop was parallelized with OpenMP (see the sketch after this list). For the "self" (incoherent) type of calculation, no good parallelization strategy was found.
- Added build options for the Intel compilers and compiler flags to make vectorization more efficient.
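As an illustration of the kind of loop parallelization applied in the coherent case, here is a sketch that computes a scattering amplitude A(q) = sum_j b_j * exp(i q.r_j) over atoms; the function, data layout, and names are hypothetical and do not reflect Sassena's internal API. The combined parallel for simd construct also hints at the vectorization aspect; with the Intel compilers this would typically be paired with optimization flags such as -O3 and -xHost.

```cpp
// Illustrative sketch only: parallelizing the coherent scattering amplitude
//   A(q) = sum_j b_j * exp(i * q . r_j)
// over atoms with OpenMP. Names and data layout are hypothetical.
#include <omp.h>
#include <complex>
#include <vector>
#include <cmath>

struct Vec3 { double x, y, z; };

std::complex<double> coherent_amplitude(const std::vector<Vec3>& positions,
                                        const std::vector<double>& b,   // scattering lengths
                                        const Vec3& q)
{
    double re = 0.0, im = 0.0;
    const std::size_t n = positions.size();

    // Outer parallelism over atoms; the reduction keeps the sum thread-safe.
    // The simd part of the construct encourages the compiler to vectorize the loop.
    #pragma omp parallel for simd reduction(+:re, im)
    for (std::size_t j = 0; j < n; ++j) {
        const double phase = q.x * positions[j].x
                           + q.y * positions[j].y
                           + q.z * positions[j].z;
        re += b[j] * std::cos(phase);
        im += b[j] * std::sin(phase);
    }
    return {re, im};
}
```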