Memory allocation errors

asked by River (2025/06/12 22:18)

Hello,

I've been running some fairly large models on a compute cluster, and after a few crashes due to running out of memory, I tried implementing the suggested memory management command “ulimit -v <number of KB RAM>”. This stopped the program from crashing outright / stopped the scheduler from killing my task, but now I am getting more errors during runtime.

Specifically, one error caught my eye:

Outer loop   4, Number of Determinants:   1433409  26543058 last variance  7.412405896600516E+02
alloc failed WaveFunctionInitCopyBasis 02 Im
to do BlockLanczosGroundStateConserveBasisKrylovRecalculate
Cheap fix needs to be improved

as it specifically says it's a cheap fix and needs to be improved. Does anyone know if anything better has been developed?

Or is there a better way to limit the amount of RAM Quanty attempts to use? On my local machine it would fill the RAM and then periodically write to disk when it needed more space; I'm not sure why it fails to do this in a server setting.

See below the raw output for more error messages.

Code Output:

Lmod is automatically replacing "gcc/12.3" with "intel/2023.2.1".

Lmod Warning:
-------------------------------------------------------------------------------
The following dependent module(s) are not currently loaded: gcccore/.12.3
(required by: intel/2023.2.1)
-------------------------------------------------------------------------------




Due to MODULEPATH changes, the following have been reloaded:
  1) flexiblas/3.3.1     2) openmpi/4.1.5

=============================================================
====    written by Maurits W. Haverkort                  ====
====    with contributions from:                         ====
====    Yi Lu, Robert Green, Sebastian Macke             ====
====    Marius Retegan, Martin Brass, and Simon Heinze   ====
====    (C) 1995-2018   All rights reserved              ====
====    www.quanty.org                                   ====
====    Beta version, be critical and report errors!!!   ====
=============================================================
====    Version 0.6 Autumn 2018                          ====
====            compiled at: Nov 25 2018 at 23:37:47     ====
=============================================================
====    When used in scientific publications please cite ====
====    one of the following papers as appropriate with  ====
====    respect to the methods used in your publication: ====
====    Phys. Rev. B 85, 165113 (2012)                   ====
====    Phys. Rev. B 90, 085102 (2014)                   ====
====    Euro Phys. Lett. 108, 57004 (2014)               ====
====    J. of Phys.: Conf. Series 712, 012001 (2016)     ====
=============================================================
Program executed on: Thu Jun 12 12:29:01 2025
Running on host    : platocpu010
number of available processors              : 40
maximum number of threads in parallel region: 40
Smallest positive float  : 2.225074E-308 
Smallest deviation from 1: 2.220446E-16 

Start of BlockGroundState. Converge 8 states to an energy with relative variance smaller than  1.490116119384766E-06

Start of BlockOperatorPsiSerialRestricted
Outer loop   1, Number of Determinants:        45        45 last variance  2.190014106090412E+00
Start of BlockOperatorPsiSerialRestricted
Start of BlockGroundState. Converge 8 states to an energy with relative variance smaller than  1.490116119384766E-06

Start of BlockOperatorPsiSerial
Outer loop   1, Number of Determinants:        45      2021 last variance  5.754242953567713E+00
  Restart loop 1 with a Krylov basis of 108 and a full basis of 2021
Start of BlockOperatorPsiSerial
Outer loop   2, Number of Determinants:      2021     63239 last variance  1.090220143151499E+02
  Restart loop 1 with a Krylov basis of 108 and a full basis of 63239
Start of BlockOperatorPsiSerial
Outer loop   3, Number of Determinants:     63239   1433409 last variance  2.797107634518841E+02
  Restart loop 1 with a Krylov basis of 108 and a full basis of 1433409
Start of BlockOperatorPsiSerial
Outer loop   4, Number of Determinants:   1433409  26543058 last variance  7.412405896600516E+02
alloc failed WaveFunctionInitCopyBasis 02 Im
to do BlockLanczosGroundStateConserveBasisKrylovRecalculate
Cheap fix needs to be improved
  Restart loop 1 with a Krylov basis of 24 and a full basis of 26543058
alloc failed WaveFunctionInitCopyBasis 02 Im
  Restart loop 2 with a Krylov basis of 24 and a full basis of 26543058
alloc failed WaveFunctionInitCopyBasis 02 Im
Start of BlockOperatorPsiSerial
alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01OperatorPsi failed in BlockOperatorPsiSerial
Start of BlockOperatorPsiSerial
alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC
 01alloc failed RealWaveFunctionAddElement 11 Re
ComplexWaveFunctionAddElement failed in ComplexWaveFunctionAddElementOMPMiniFlush
ComplexWaveFunctionAddElementOMPMiniFlush failed in OperatorPsiMC

Answers

, 2025/06/13 09:15, 2025/06/13 09:16

Dear River,

We have several routines in Quanty that can calculate spectra and/or ground states. In several cases one can trade memory usage against speed. Whenever the code needs more memory, it calls the C function malloc or calloc and checks whether the call succeeds. If it does not, the code tries to continue with an algorithm that is less memory hungry.
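The allocate-and-check pattern described above can be sketched roughly as follows. This is a minimal illustration and not Quanty's actual code: the names `alloc_within_budget` and `sum_squares` are hypothetical, and the `budget_bytes` parameter only stands in for a memory limit so that the fallback can be triggered on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocation under a memory limit: refuse
 * requests larger than budget_bytes, otherwise defer to calloc (which
 * may itself fail and return NULL). */
static void *alloc_within_budget(size_t count, size_t size, size_t budget_bytes)
{
    if (size != 0 && count > budget_bytes / size)
        return NULL;
    return calloc(count, size);
}

/* Sum i*i for i in [0, n). The fast path wants a table of n doubles;
 * if that allocation fails, fall back to a path that needs no table.
 * *used_fallback is set to 1 when the low-memory path ran, else 0. */
double sum_squares(size_t n, size_t budget_bytes, int *used_fallback)
{
    assert(used_fallback != NULL);
    double total = 0.0;
    double *table = alloc_within_budget(n, sizeof *table, budget_bytes);

    if (table != NULL) {
        /* Fast path: precompute all terms, then reduce. */
        for (size_t i = 0; i < n; ++i)
            table[i] = (double)i * (double)i;
        for (size_t i = 0; i < n; ++i)
            total += table[i];
        free(table);
        *used_fallback = 0;
    } else {
        /* Low-memory path: recompute each term on the fly. */
        for (size_t i = 0; i < n; ++i)
            total += (double)i * (double)i;
        *used_fallback = 1;
    }
    return total;
}
```

Calling `sum_squares(1000, 16, &flag)` forces the low-memory path, and it returns the same total as the fast path; only the route taken differs.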

On modern machines malloc and calloc almost never fail unless you explicitly set a limit. This is what the command “ulimit -v <number of KB RAM>” does for you. Without a limit, the operating system assumes it will be able to provide the memory at the moment you actually use it. If the allocation succeeds but the physical memory is not there when it is needed, the code will crash (sometimes hard). On your local machine you are probably using your hard disk as additional memory (swap).
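To see the effect of such a limit from inside a program, here is a small sketch that assumes a Linux system where the address-space limit `RLIMIT_AS` is enforced; this is the limit that `ulimit -v` sets in bash. The function name `alloc_fails_under_limit` is hypothetical. It temporarily lowers the soft limit, attempts one allocation, and restores the previous limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Try malloc(request_bytes) while the soft address-space limit is
 * limit_bytes. Returns 1 if the allocation failed, 0 if it succeeded,
 * and -1 if the limit could not be read or set. The previous limit is
 * restored before returning. */
int alloc_fails_under_limit(size_t limit_bytes, size_t request_bytes)
{
    struct rlimit old_limit;
    if (getrlimit(RLIMIT_AS, &old_limit) != 0)
        return -1;

    struct rlimit new_limit = old_limit;
    new_limit.rlim_cur = (rlim_t)limit_bytes;
    /* The soft limit may not exceed the hard limit. */
    if (new_limit.rlim_cur > old_limit.rlim_max)
        new_limit.rlim_cur = old_limit.rlim_max;
    if (setrlimit(RLIMIT_AS, &new_limit) != 0)
        return -1;

    /* volatile keeps the compiler from optimising the allocation away. */
    void *volatile block = malloc(request_bytes);
    int failed = (block == NULL);
    free(block);

    /* Restoring the earlier soft limit is always permitted. */
    int rc = setrlimit(RLIMIT_AS, &old_limit);
    assert(rc == 0);
    (void)rc;
    return failed;
}
```

With a limit of 1 GiB and a request of 2 GiB, for example `alloc_fails_under_limit((size_t)1 << 30, (size_t)1 << 31)`, the allocation should fail immediately instead of being granted and crashing later. Whether the same request succeeds without a limit depends on the machine's overcommit settings.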

The error message that you see indicates that the slower routine we switch to when we run out of memory can be optimised. I also know (knew) how this can be done (you will probably find a hint on how to do this in the source code), but I have not found the time to do the optimisation. I have a list of optimisations I want to make, but am limited by time at the moment.

For now I see 4 options to continue:

  • 1) If you know how to program in C and have some time, feel free to contact me by mail and we can discuss your case in more detail and look where to make the improvements.
  • 2) If you do not want to do anything, just keep running the code. The warning does not make the results wrong.
  • 3) If you want to use swap memory, make sure to have a reasonable SSD in your computer and configure the machine such that it uses swap memory when running out of RAM. You can then increase the limit for the RAM in “ulimit -v <number of KB RAM>”.
  • 4) Reduce the size of the calculation itself. This requires some thinking and is model dependent.
    • 4a) I see you calculate the lowest 8 states. Can you run the calculation for a single state? If yes, can you generate states 2 to 8 by flipping spins? I.e., can you calculate the “ground” state as $\langle S_z \rangle = 4$ and generate the other states as $(S^-)^n |\psi_{S_z=4}\rangle$?
    • 4b) Can you turn off spin-orbit coupling, calculate the lowest 8 states, and then add spin-orbit coupling at a later stage perturbatively?
    • 4c) Do you really need all 8 states?
    • 4d) Are you able to add additional restrictions without changing the answer?

Best wishes, Maurits

, 2025/06/14 00:06

Hello Maurits,

Thank you for the reply!

I am not very familiar with C programming, so thank you for clarifying. As long as these errors do not make the results incorrect, I think my best option is to continue with the ulimit command. I have also implemented some Restrictions; the documentation on this site is not amazing for this, so I didn't realize how they worked earlier. With a few other optimizations I believe things are running more smoothly now.

Thanks again, River
