This page will collect discussion on code parallelization issues.
* http://codebetter.com/blogs/patricksmacchia/archive/2008/12/01/lessons-learned-from-a-real-world-focus-on-performance.aspx
* http://herbsutter.wordpress.com/2008/11/02/effective-concurrency-understanding-parallel-performance/ Subsequent articles in the series will also be linked to from this wiki.
Note also: OpenMP 2.5, as implemented in currently available gcc compiler versions, requires the index variable of a parallelized loop to be a signed integer; loops indexed by iterators, or even by unsigned integer types, cannot be parallelized directly. http://iwomp.univ-reims.fr/cd/papers/TM06.pdf analyses approaches to work around this limitation, and http://wikis.sun.com/pages/viewpage.action?pageId=57508020 suggests that iterator loops will be supported in OpenMP 3.0.
## Remaining text initially grabbed from CloudySummitBrussels
We could run the grids and optimizations from a separate script or program. On the other hand, doing it in MPI should be quick and easy to implement.
Vectorization would require us to break up large, complicated loops into multiple simple loops. These could then be vectorized, parallelized, or both.
Issues with MPI/OpenMP
Shared vs. distributed memory: current implementations of OpenMP do not support distributed-memory clusters. Future versions may.
NUMA systems: OpenMP was never defined with NUMA systems in mind. Threads/CPUs share the address space, but physical allocation is typically determined by which CPU touches the memory first (the first-touch policy). Keeping this optimal as far as access speed is concerned is going to be a nightmare.
global structs
The first priority should be to remove redundancy from the code, e.g. by replacing the separate ionization solvers for the individual elements with a single loop over the elements.
The way we determine opacities, ionization, etc., gets in the way of parallelizing over the frequency grid, so the best approach still seems to be to parallelize things like iso-electronic sequences. We can profile test suite runs and try to identify routines that are worth parallelizing, but even for the big H2 models the profile is fairly flat; the matrix solvers account for only roughly 30%. This fraction could grow if other parts of the code started using basic BLAS routines to evaluate integrals etc.
It seems worthwhile to insert SuperLU into the code, since that gives parallelization with little effort. But this is likely to give us only a 10-20% speedup (more for big Fe II models).
- How to handle exceptions in parallel sections?