baseline/blr.cn.16nodes.012steps.lb03_91/out.dd3d.91.baseline:QQQ Size= 16 Seconds= 89.29 Joules= 263904 Watts= 2955.60 AvgWatts=184.72
baseline/blr.cn.16nodes.012steps.lb03_92/out.dd3d.92.baseline:QQQ Size= 16 Seconds= 89.20 Joules= 264432 Watts= 2964.36 AvgWatts=185.27
baseline/blr.cn.16nodes.012steps.lb03_93/out.dd3d.93.baseline:QQQ Size= 16 Seconds= 89.40 Joules= 263872 Watts= 2951.69 AvgWatts=184.48
baseline/blr.cn.16nodes.012steps.lb03_94/out.dd3d.94.baseline:QQQ Size= 16 Seconds= 88.89 Joules= 262464 Watts= 2952.81 AvgWatts=184.55
baseline/blr.cn.16nodes.012steps.lb03_95/out.dd3d.95.baseline:QQQ Size= 16 Seconds= 88.88 Joules= 262384 Watts= 2952.26 AvgWatts=184.52
baseline/blr.cn.16nodes.012steps.lb03_96/out.dd3d.96.baseline:QQQ Size= 16 Seconds= 89.85 Joules= 265280 Watts= 2952.43 AvgWatts=184.53
baseline/blr.cn.16nodes.012steps.lb03_97/out.dd3d.97.baseline:QQQ Size= 16 Seconds= 89.22 Joules= 263600 Watts= 2954.38 AvgWatts=184.65
baseline/blr.cn.16nodes.012steps.lb03_98/out.dd3d.98.baseline:QQQ Size= 16 Seconds= 89.46 Joules= 264256 Watts= 2953.75 AvgWatts=184.61
baseline/blr.cn.16nodes.012steps.lb03_99/out.dd3d.99.baseline:QQQ Size= 16 Seconds= 96.37 Joules= 282816 Watts= 2934.70 AvgWatts=183.42
Adagio
Monday, August 4, 2008
And it keeps getting worse
Pcontrol(3)
adagio/blr.cn.16nodes.012steps.lb03_70/out.dd3d.70.adagio:QQQ Size= 16 Seconds= 102.42 Joules= 254944 Watts= 2489.30 AvgWatts=155.58
adagio/blr.cn.16nodes.012steps.lb03_71/out.dd3d.71.adagio:QQQ Size= 16 Seconds= 94.61 Joules= 236688 Watts= 2501.76 AvgWatts=156.36
adagio/blr.cn.16nodes.012steps.lb03_72/out.dd3d.72.adagio:QQQ Size= 16 Seconds= 94.85 Joules= 237872 Watts= 2507.82 AvgWatts=156.74
adagio/blr.cn.16nodes.012steps.lb03_73/out.dd3d.73.adagio:QQQ Size= 16 Seconds= 94.36 Joules= 236688 Watts= 2508.46 AvgWatts=156.78
adagio/blr.cn.16nodes.012steps.lb03_74/out.dd3d.74.adagio:QQQ Size= 16 Seconds= 94.69 Joules= 237664 Watts= 2510.01 AvgWatts=156.88
adagio/blr.cn.16nodes.012steps.lb03_75/out.dd3d.75.adagio:QQQ Size= 16 Seconds= 94.12 Joules= 237472 Watts= 2523.04 AvgWatts=157.69
adagio/blr.cn.16nodes.012steps.lb03_76/out.dd3d.76.adagio:QQQ Size= 16 Seconds= 93.75 Joules= 234432 Watts= 2500.67 AvgWatts=156.29
adagio/blr.cn.16nodes.012steps.lb03_77/out.dd3d.77.adagio:QQQ Size= 16 Seconds= 94.43 Joules= 235616 Watts= 2495.20 AvgWatts=155.95
adagio/blr.cn.16nodes.012steps.lb03_78/out.dd3d.78.adagio:QQQ Size= 16 Seconds= 94.25 Joules= 236032 Watts= 2504.21 AvgWatts=156.51
adagio/blr.cn.16nodes.012steps.lb03_79/out.dd3d.79.adagio:QQQ Size= 16 Seconds= 93.73 Joules= 236864 Watts= 2527.17 AvgWatts=157.95
Well.... bother
Pcontrol(3,13)
adagio/blr.cn.16nodes.012steps.lb03_80/out.dd3d.80.adagio:QQQ Size= 16 Seconds= 112.15 Joules= 255952 Watts= 2282.15 AvgWatts=142.63
adagio/blr.cn.16nodes.012steps.lb03_81/out.dd3d.81.adagio:QQQ Size= 16 Seconds= 101.93 Joules= 234496 Watts= 2300.45 AvgWatts=143.78
adagio/blr.cn.16nodes.012steps.lb03_82/out.dd3d.82.adagio:QQQ Size= 16 Seconds= 106.44 Joules= 240480 Watts= 2259.25 AvgWatts=141.20
adagio/blr.cn.16nodes.012steps.lb03_83/out.dd3d.83.adagio:QQQ Size= 16 Seconds= 105.52 Joules= 238912 Watts= 2264.22 AvgWatts=141.51
adagio/blr.cn.16nodes.012steps.lb03_84/out.dd3d.84.adagio:QQQ Size= 16 Seconds= 102.76 Joules= 233904 Watts= 2276.17 AvgWatts=142.26
adagio/blr.cn.16nodes.012steps.lb03_85/out.dd3d.85.adagio:QQQ Size= 16 Seconds= 103.75 Joules= 236320 Watts= 2277.88 AvgWatts=142.37
adagio/blr.cn.16nodes.012steps.lb03_86/out.dd3d.86.adagio:QQQ Size= 16 Seconds= 104.29 Joules= 235696 Watts= 2259.91 AvgWatts=141.24
adagio/blr.cn.16nodes.012steps.lb03_87/out.dd3d.87.adagio:QQQ Size= 16 Seconds= 103.22 Joules= 233128 Watts= 2258.52 AvgWatts=141.16
adagio/blr.cn.16nodes.012steps.lb03_88/out.dd3d.88.adagio:QQQ Size= 16 Seconds= 102.16 Joules= 233008 Watts= 2280.74 AvgWatts=142.55
adagio/blr.cn.16nodes.012steps.lb03_89/out.dd3d.89.adagio:QQQ Size= 16 Seconds= 102.16 Joules= 232192 Watts= 2272.88 AvgWatts=142.06
Debugging runs -- logic bug
Ok, looks like a combination bug.
baseline/blr.cn.16nodes.012steps.lb03_02/out.dd3d.02.baseline:QQQ Size= 16 Seconds= 90.35 Joules= 266320 Watts= 2947.70 AvgWatts=184.23
baseline/blr.cn.16nodes.012steps.lb03_03/out.dd3d.03.baseline:QQQ Size= 16 Seconds= 98.57 Joules= 290304 Watts= 2945.23 AvgWatts=184.08
baseline/blr.cn.16nodes.012steps.lb03_13/out.dd3d.13.baseline:QQQ Size= 16 Seconds= 99.10 Joules= 291552 Watts= 2941.87 AvgWatts=183.87
baseline/blr.cn.16nodes.012steps.lb03_99/out.dd3d.99.baseline:QQQ Size= 16 Seconds= 96.37 Joules= 282816 Watts= 2934.70 AvgWatts=183.42
Pcontrol(3)
adagio/blr.cn.16nodes.012steps.lb03_03/out.dd3d.03.adagio:QQQ Size= 16 Seconds= 94.98 Joules= 238896 Watts= 2515.15 AvgWatts=157.20
Pcontrol(3) + Pcontrol(13)
adagio/blr.cn.16nodes.012steps.lb03_13/out.dd3d.13.adagio:QQQ Size= 16 Seconds= 94.00 Joules= 234864 Watts= 2498.49 AvgWatts=156.16
Pcontrol(3)+Pcontrol(2)
adagio/blr.cn.16nodes.012steps.lb03_02/out.dd3d.02.adagio:QQQ Size= 16 Seconds= 93.72 Joules= 235936 Watts= 2517.51 AvgWatts=157.34
Pcontrol(3)+Pcontrol(2)+Pcontrol(13)
adagio/blr.cn.16nodes.012steps.lb03_99/out.dd3d.99.adagio:QQQ Size= 16 Seconds= 103.69 Joules= 235888 Watts= 2275.04 AvgWatts=142.19
Runtime bug in paradis
paradis runs way too long under adagio. Can't replicate using synthetic benchmark.
Am now taking out the Pcontrols one by one -- the ones that should be put back in are marked with a ZZZ.
Opt13 crashing when shifting turned on.
Opt13 reliably crashed with paradis using the adagio and andante algorithms, but not with fermata or adagio with shifting disabled. DKL suggested turning off shifting on just that node (is that 14 + 15 or 14 + 28? Eh, kill all of them....).
Need to eventually write a synthetic benchmark to make sure this implementation is correct, but that's kinda pointless if the machines are going to stay up long enough to run the real benchmarks.
Debugging notes from last night: taking opt13 out of my hostfile and adding opt16 caused opt16 to fail.
A long-term solution might be to put a timer in the shifting code along with a 1-deep queue. When a shift is made, further shifts are blocked until the timer expires. If another request comes in, it goes in the queue. If the queue is already full, the newer request replaces the older request. When the timer expires, if there is a request in the queue, shift to that and empty the queue.
If a request comes in when the timer is not set (which should imply the queue is empty) then immediately shift and set the timer.
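A minimal sketch of the timer plus 1-deep queue idea, just to pin down the rules above. Everything here is hypothetical (struct shift_limiter, limiter_request(), limiter_expire(), do_shift()) and none of it is taken from the real shifting code. It also assumes the caller passes in the current time instead of using a real alarm, which keeps the logic easy to test.

```c
#include <stdbool.h>

/* Hypothetical per-node limiter state; not the real runtime's data structure. */
struct shift_limiter {
    bool   timer_set;     /* true while further shifts are blocked */
    double expires_at;    /* time at which the block ends */
    double holdoff;       /* minimum spacing between shifts, in seconds */
    bool   has_pending;   /* 1-deep queue: is a request waiting? */
    int    pending_freq;  /* the queued frequency index */
    int    current_freq;  /* frequency index most recently applied */
};

static void limiter_init(struct shift_limiter *s, double holdoff, int freq)
{
    s->timer_set    = false;
    s->expires_at   = 0.0;
    s->holdoff      = holdoff;
    s->has_pending  = false;
    s->pending_freq = freq;
    s->current_freq = freq;
}

/* Stand-in for the real shift call: records the value and arms the timer. */
static void do_shift(struct shift_limiter *s, int freq, double now)
{
    s->current_freq = freq;
    s->timer_set    = true;
    s->expires_at   = now + s->holdoff;
}

/* Timer expiry: if a request is queued, shift to it and empty the queue.
 * That shift arms the timer again, since every shift blocks further shifts. */
static void limiter_expire(struct shift_limiter *s, double now)
{
    s->timer_set = false;
    if (s->has_pending) {
        s->has_pending = false;
        do_shift(s, s->pending_freq, now);
    }
}

/* New shift request arriving at time `now`. */
static void limiter_request(struct shift_limiter *s, int freq, double now)
{
    /* Assumption: expiry is polled here in case no alarm has fired yet. */
    if (s->timer_set && now >= s->expires_at)
        limiter_expire(s, now);

    if (!s->timer_set) {
        do_shift(s, freq, now);       /* timer not set: shift immediately */
    } else {
        s->pending_freq = freq;       /* newer request replaces the older one */
        s->has_pending  = true;
    }
}
```

With a 1-second holdoff, requests for 3, 1, 2 at t=0.0, 0.2, 0.5 would shift to 3 right away, drop the 1, and shift to 2 when the timer expires at t=1.0.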
DKL thinks this is far too complicated.
Blacklisting nodes in the shift code doesn't work -- the scheduler gets confused.
Let's try forcing the scheduler to split_frequencies=0, first_freq=0.
Nope, now everybody wants to run as slow as possible. Let's back this off to 16 nodes.
Ok, nothing crashes, but it's slowing down way too much....
Sunday, August 3, 2008
PARADIS flow control
Tracking MPI_Allreduce() calls
03 Cellcharge
13 LocalSegForces
02 EulerBackwards
05 Loadcurve
The state machine looks like 03 13 (13 02)+ 05.
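The pattern can be written out as an explicit recognizer. This is only a sketch: the names (ts_state, ts_step(), ts_count_timesteps()) are made up for illustration, and it assumes the whole pattern repeats once per timestep, which the notes above do not actually say.

```c
#include <stddef.h>

/* States for the pattern 03 13 (13 02)+ 05. IDs are the call-site IDs
 * listed above, written in decimal (no leading zeros, to avoid octal). */
enum ts_state {
    TS_START,    /* expecting 03 */
    TS_SAW_03,   /* expecting the first 13 */
    TS_SAW_13,   /* expecting the 13 that opens the (13 02) group */
    TS_LOOP_13,  /* inside the group, expecting 02 */
    TS_LOOP_02,  /* group complete: another 13 repeats it, 05 ends it */
    TS_ERROR
};

/* Advance the state machine by one observed ID. */
static enum ts_state ts_step(enum ts_state s, int id)
{
    switch (s) {
    case TS_START:   return id == 3  ? TS_SAW_03  : TS_ERROR;
    case TS_SAW_03:  return id == 13 ? TS_SAW_13  : TS_ERROR;
    case TS_SAW_13:  return id == 13 ? TS_LOOP_13 : TS_ERROR;
    case TS_LOOP_13: return id == 2  ? TS_LOOP_02 : TS_ERROR;
    case TS_LOOP_02:
        if (id == 13) return TS_LOOP_13;
        if (id == 5)  return TS_START;   /* assumed: pattern restarts here */
        return TS_ERROR;
    default:
        return TS_ERROR;
    }
}

/* Returns the number of complete passes through the pattern, or -1 if the
 * trace does not match or stops partway through a pass. */
static int ts_count_timesteps(const int *ids, size_t n)
{
    enum ts_state s = TS_START;
    int complete = 0;

    for (size_t i = 0; i < n; i++) {
        enum ts_state next = ts_step(s, ids[i]);
        if (next == TS_ERROR)
            return -1;
        if (s == TS_LOOP_02 && next == TS_START)
            complete++;
        s = next;
    }
    return s == TS_START ? complete : -1;
}
```

Feeding it a recorded trace of IDs is a cheap way to check that a run really follows the pattern before trusting any per-state scheduling decisions.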
ECO schedule.notes.c removed
Nothing appeared to be using it. I dimly recall it was a branch that later got remerged, but I never got around to deleting the file.
NEVERMIND Fermata alarm killed too often
Killing an alarm should be pretty low overhead, but I think I'm doing it twice as often as I need to in runtime_post().
Defer to later -- does not cause incorrect behavior and the performance hit appears to be minimal.
killtimer() handles the bookkeeping. Leave it be.
ECO Cleanup pass in runtime.c
Removed redundant shift(0) in runtime_post().
Removed "countdown" code that would delay initialization for a specified number of timesteps. It's never been used.
ECO: _use.h
PMPI allows MPI calls to be trapped by a user-supplied library via the miracle of weak linking. For the SC Adagio paper, I thought that additional runtime overhead might be caused by trapping all of the MPI calls instead of only the ones we were interested in.
In the pmpi_modules directory, there's a _use.h file that defines what function calls will be trapped. This had previously been edited down to a small set of global synchronization calls (plus a few others).
We're now moving back to wanting to track all calls. This may add to execution time overhead. Or it might not.
To play with this, copy _use_all.h to _use_whatever_you_like.h, comment out the #defines of the functions you don't want to see, and replace #include "_use.h" in blr_pmpi.c with an #include for your new file.
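A toy version of the mechanism, for anyone who has not looked inside the header. This is a mock, not the real code: the USE_* macro names and the fake_* functions are invented here, and the actual names in _use_all.h and blr_pmpi.c may well differ. In the real library the wrapper would be named MPI_Allreduce and would forward to PMPI_Allreduce.

```c
/* Stand-in for a private copy of _use_all.h. Macro names are hypothetical.
 * Commenting out a #define stops the matching call from being trapped. */
#define USE_ALLREDUCE
/* #define USE_BARRIER */

static int trapped = 0;   /* number of calls that went through a wrapper */

/* Stand-ins for the PMPI_* entry points that a real MPI library provides. */
static int fake_PMPI_Allreduce(void) { return 0; }
static int fake_PMPI_Barrier(void)   { return 0; }

#ifdef USE_ALLREDUCE
/* Trapped: do the bookkeeping, then forward to the real call. */
static int fake_MPI_Allreduce(void) { trapped++; return fake_PMPI_Allreduce(); }
#else
/* Not trapped: the call goes straight through with no bookkeeping. */
static int fake_MPI_Allreduce(void) { return fake_PMPI_Allreduce(); }
#endif

#ifdef USE_BARRIER
static int fake_MPI_Barrier(void) { trapped++; return fake_PMPI_Barrier(); }
#else
static int fake_MPI_Barrier(void) { return fake_PMPI_Barrier(); }
#endif
```

With the defines as shown, one Allreduce and one Barrier leave the trapped counter at 1, since only the Allreduce wrapper is compiled in.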