Opt13 reliably crashed with paradis using the adagio and andante algorithms, but not with fermata or adagio with shifting disabled. DKL suggested turning off shifting on just that node (is that 14 + 15 or 14 + 28? Eh, kill all of them....).
Need to eventually write a synthetic benchmark to make sure this implementation is correct, but that's kinda pointless if the machines aren't going to stay up long enough to run the real benchmarks.
Debugging notes from last night: taking opt13 out of my hostfile and adding opt16 caused opt16 to fail.
A long-term solution might be to put a timer in the shifting code along with a 1-deep queue. When a shift is made, further shifts are blocked until the timer expires. If another request comes in, it goes in the queue. If the queue is already full, the newer request replaces the older request. When the timer expires, if there is a request in the queue, shift to that and empty the queue.
If a request comes in when the timer is not set (which should imply the queue is empty) then immediately shift and set the timer.
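For my own reference, here is a rough sketch of that timer-plus-queue idea. It is not the actual shifting code: the class name `ShiftThrottle`, the `hold_time` parameter, and the `do_shift` callback are all made up for illustration, and time is passed in explicitly so the logic can be checked without a real timer. I'm also assuming the timer restarts when a queued shift is finally made, since that counts as a shift too.

```python
class ShiftThrottle:
    """Hypothetical sketch: rate-limit frequency shifts with a 1-deep queue."""

    def __init__(self, hold_time, do_shift):
        self.hold_time = hold_time  # how long further shifts stay blocked
        self.do_shift = do_shift    # callback that performs the real shift
        self.deadline = None        # None means the timer is not set
        self.pending = None         # the 1-deep queue (None means empty)

    def request(self, freq, now):
        """Handle a shift request arriving at time `now`."""
        self.poll(now)  # first let an expired timer fire
        if self.deadline is None:
            # Timer not set (so the queue is empty): shift immediately.
            self._shift(freq, now)
        else:
            # Timer running: queue it, replacing any older request.
            self.pending = freq

    def poll(self, now):
        """Call periodically; handles timer expiry."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            if self.pending is not None:
                freq, self.pending = self.pending, None
                self._shift(freq, now)  # assumption: this restarts the timer

    def _shift(self, freq, now):
        self.do_shift(freq)
        self.deadline = now + self.hold_time


if __name__ == "__main__":
    made = []
    throttle = ShiftThrottle(hold_time=1.0, do_shift=made.append)
    throttle.request(3, now=0.0)  # timer not set: shifts right away
    throttle.request(1, now=0.2)  # blocked: goes in the queue
    throttle.request(2, now=0.5)  # blocked: replaces the queued request
    throttle.poll(now=1.0)        # timer expires: shifts to the queued value
    print(made)                   # [3, 2]
```

The request for 1 is never acted on, which is the point: only the most recent request survives the blocked window.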
DKL thinks this is far too complicated.
Blacklisting nodes in the shift code doesn't work -- the scheduler gets confused.
Let's try forcing the scheduler to split_frequencies=0, first_freq=0.
Nope, now everybody wants to run as slow as possible. Let's back this off to 16 nodes.
Ok, nothing crashes, but it's slowing down way too much....
Monday, August 4, 2008