-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
weird behavior with verbose=True #224
Comments
Thanks for the report. Could this be related to #197 ? |
This is pretty weird! Could you share the job submission script that you're using to run this example? I'm wondering if some kind of stderr/stdout buffer could be filling up (not that I've ever seen that before). Do you know how many iterations the 100 mock example gets through before it hangs? Does it always hang on the same iteration? Also, this isn't on the Flatiron cluster by any chance, is it? If it were, I could run the example myself. |
i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins. it's on the sirocco cluster at nyu - actually thanks for pointing out #197 because @zxzhai might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco. here's the job submission script: #PBS -N cf_min
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely
#PBS -o output/$PBS_JOBNAME.out
#PBS -j oe
cd $PBS_O_WORKDIR
python cf_minimal.py |
So if you put a Python print statement in the loop, it prints 13 times? Do
you get 13 chunks of verbose Corrfunc output too? The Corrfunc output
comes from the C level, so it would be good to check that print statements
from both the C and Python level are making it through.
I'm wondering if the wurlitzer package we use to capture output from the C
extension could be causing trouble somehow?
…On Tue, Jun 30, 2020 at 1:26 PM Kate Storey-Fisher ***@***.***> wrote:
i suspected something like a full buffer! yes this example actually hangs
at 13 iterations, independent of Nmocks or Nbins.
it's on the sirocco cluster at nyu - actually thanks for pointing out #197
<#197> because @zxzhai
<https://github.com/zxzhai> might be running on the same cluster?
zhongxu, would be great if you don't mind testing this on sirocco.
here's the job submission script:
#PBS -N cf_min#PBS -l walltime=40:00:00#PBS -l nodes=1:ppn=1#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely #PBS -o output/$PBS_JOBNAME.out#PBS -j oe
cd $PBS_O_WORKDIR
python cf_minimal.py
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#224 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABLA7SY5473BLAWRRND4UJLRZIN5XANCNFSM4OL6VKCQ>
.
--
Lehman Garrison <[email protected]>
Flatiron Research Fellow, Cosmology X Data Science Group
Center for Computational Astrophysics, Flatiron Institute
lgarrison.github.io
|
so the other strange/frustrating thing is that when it hangs, i have to cancel the job, and then the output file is empty (or if i print things about the nodes, it has that, but no printlines from the code itself). i figured out the 13 by appending to file each iteration. maybe i could also redirect the output to a file or something? when i run on interactive node, i see both python and c-level Corrfunc output. oh when i was building from source earlier i actually did have some trouble with wurlitzer's syspipes.... |
If it's convenient for you to run Corrfunc from source, could you try a few wurlitzer-related things? Specifically, in Lines 1044 to 1055 in 75b8ac9
Could you print the values of Could you also try adding |
could you clarify the syntax for adding a) on the interactive node, both for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)). |
Okay, it sounds like the TTY detection is working properly. The
interspersed output is likely just due to buffering and is not unusual.
Adding flush=True as an argument to print() would probably fix this, but
it's probably not related to the real problem.
For the exception, you can just do "raise" as the first line in the
"except" block, and Python will figure out to re-raise the exception that
was caught by the current block.
…On Tue, Jun 30, 2020 at 4:08 PM Kate Storey-Fisher ***@***.***> wrote:
could you clarify the syntax for adding raise? want to make sure i'm
doing it correctly.
a) on the interactive node, both sys.stdout.isatty() and
sys.stderr.isatty() are true.
b) when running Nmocks=10 and verbose=True, both are false.
c) when running Nmocks=20 (>13) and verbose=False, both are false.
d) when running Nmocks=20 and verbose=True, can't see the output!
for case b), verbose output from the C code is printed - but all before
the python output, even though it should be interspersed between the python
for the realizations (as happens properly in case a)).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#224 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABLA7S6VEMQE3EXUEJFEEYLRZJA5LANCNFSM4OL6VKCQ>
.
--
Lehman Garrison <[email protected]>
Flatiron Research Fellow, Cosmology X Data Science Group
Center for Computational Astrophysics, Flatiron Institute
lgarrison.github.io
|
ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines). and yes i've seen the interspersing before, probably not the issue here. |
Okay, not seeing a problem yet.
The following is a little paranoid, but easy to check. Could you pip
install psutil and place the following in the Wurlitzer context:
p = psutil.Process()
print(p.num_fds())
Aside from that, there are a few places in the code where the verbose flag
could, in theory, lead to an early exit, like this one:
https://github.com/manodeep/Corrfunc/blob/master/theory/xi/countpairs_xi_impl.c.src#L115-L129
Or if we had a bad printf formatting, that could lead to out-of-bounds
writes and undefined behavior.
There's also the progress bar stuff. I'm not familiar enough with that
part of the code to know whether it's a risk.
…On Tue, Jun 30, 2020 at 4:41 PM Kate Storey-Fisher ***@***.***> wrote:
ok thanks that's what i was doing, it's not changing anything when i add
the "raise" line. in the cases i can see the output, it's not reaching the
except block (can see from adding printlines).
and yes i've seen the interspersing before, probably not the issue here.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#224 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABLA7S7V3FFJRB6OO4UCYHTRZJEZHANCNFSM4OL6VKCQ>
.
--
Lehman Garrison <[email protected]>
Flatiron Research Fellow, Cosmology X Data Science Group
Center for Computational Astrophysics, Flatiron Institute
lgarrison.github.io
|
assuming you mean the wurlitzer context you quoted above. for the submitted job Nmocks=14 verbose=True, output of num_fds() is: n=0: 14 if i more mocks with verbose=False, it keeps increasing then jumps back to 16 at n=31, and so on (with slightly diff mins and maxs). i see this is the number of file descriptors opened, not sure what this means tho! |
Within the C code, everything is printed to stderr, except for the final result that might be written out from a command-line executable. That means nothing is printed out to stdout if you are using the python interface. I am surprised that num_fds is not constant. Doesn't that mean there is resource leak occurring somewhere? |
The progressbar code should be okay (just looked through the code), but it has string handling - so who knows! |
I'm not sure about the number of fds. I guess it is a leak, but this is
way under the default max number of fds, which I think is 1024. But maybe
this cluster has a lower limit? I will have to look up the query syntax...
…On Tue, Jun 30, 2020 at 6:33 PM Manodeep Sinha ***@***.***> wrote:
The progressbar code should be okay (just looked through the code), but it
has string handling - so who knows!
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#224 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABLA7S23FW33MQJAPO2J2CDRZJR4HANCNFSM4OL6VKCQ>
.
--
Lehman Garrison <[email protected]>
Flatiron Research Fellow, Cosmology X Data Science Group
Center for Computational Astrophysics, Flatiron Institute
lgarrison.github.io
|
thanks both! let me know if you have other ideas i can test. for now i can just turn off verbose, but it took me a while to diagnose, so maybe having this here will help someone else! |
General information
Issue description
I have a strange issue running Corrfunc on my cluster, that I can't reproduce on another cluster. I admit it's quite weird but can't figure it out.
When I submit a Corrfunc job to my cluster, it sometimes crashes partway and hangs indefinitely. Specifically, it hangs when verbose=True, and when I loop over a high enough number of mocks. The same code works fine on an interactive node.
To be clear:
verbose=True, Nmocks=100: hangs
verbose=False, Nmocks=100: works
verbose=True, Nmocks=10: works
This seems like some sort of memory issue, given the Nmocks trend. I also had a similar issue when messing with the code and building from source, and I found the issue to be some printlines I had added in C that I think were causing a memory leak.
Expected behavior
The script completes successfully.
Actual behavior
The job crashes hangs indefinitely.
What have you tried so far?
Running on an interactive node, which works; running on a different cluster's queue system, which works. Also building the code from source.
Minimal failing example
The text was updated successfully, but these errors were encountered: