Check-pointing Long-running Python Applications

More and more scientific applications are written in Python, as Python provides higher productivity and portability than most other programming languages. A common challenge in all scientific applications is that they often run for a very long time: days, weeks, or even months. As the runtime increases, so does the risk of a power or hardware failure, and scientific applications thus often require the programmer to write a method for check-pointing, i.e. a function that saves a known state of the running process, which may be reloaded after a system error. From Python one may access the global heap space directly, including the name space, and it should thus be possible to write a generic checkpoint-restore function and integrate it into every long-running Python process. Such a system would be a huge benefit for scientists, as they would no longer need to write their own checkpoint-restore methods.
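To illustrate the idea, here is a minimal sketch of what such a generic function might look like. It is not part of the project description: it assumes the standard pickle module as the serialization mechanism, and the names save_checkpoint and restore_checkpoint are hypothetical. It only saves picklable variables from a name space and does not capture the call stack, open files, or the point of execution, which is exactly the gap the project would have to analyse.

```python
import os
import pickle
import tempfile
import types


def save_checkpoint(namespace, path):
    """Save the picklable variables of `namespace` (e.g. globals()) to `path`.

    Hypothetical helper: skips dunder names, imported modules, and any
    object that pickle cannot serialize. Returns the sorted saved names.
    """
    state = {}
    for name, value in namespace.items():
        if name.startswith("__") or isinstance(value, types.ModuleType):
            continue
        try:
            pickle.dumps(value)  # probe: can this object be serialized?
        except Exception:
            continue  # e.g. lambdas, open files, sockets
        state[name] = value

    # Write to a temporary file first, then rename it into place, so a
    # crash during the write never leaves a truncated checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)
    return sorted(state)


def restore_checkpoint(namespace, path):
    """Load a checkpoint from `path` and copy its variables into `namespace`.

    Returns the sorted restored names.
    """
    with open(path, "rb") as f:
        state = pickle.load(f)
    namespace.update(state)
    return sorted(state)
```

A long-running script could call save_checkpoint(globals(), "state.pkl") at the end of each outer iteration and call restore_checkpoint(globals(), "state.pkl") at start-up if the file exists. The program must then be written so that it can resume from the restored loop variables, since the sketch does not restore where in the code execution had reached.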

Area: Project Bachelor Masters

Tags: Bohrium:scientific-programming:checkpoint

Contact: Brian Vinter, vinter@nbi.ku.dk

Activities: Analysis, design and implementation