Your experiment didn't work and you'd like to know why. You should contact David or Rithvik, since this codebase has many parts working together and is tricky to debug. If you're on your own, here are some useful tips, assuming you followed the setup and protocol instructions:
If your results are much worse than expected, you can inspect the metrics across time collected by Prometheus. Head to your benchmark results in /mnt/nfs/tmp, go to the specific run's folder (such as 001), then run:
prometheus --config.file=<(echo "") --storage.tsdb.path=prometheus_data --web.listen-address=0.0.0.0:9090
Then, on your local computer, run the following, where <ip> is the external IP of eval-primary, assuming you can SSH into eval-primary:
ssh -L 127.0.0.1:9090:localhost:9090 <ip>
Then navigate to localhost:9090 in your browser, click the Graph tab, and check metrics. Metrics can be logged in Dedalus like so, and in Scala like so. Feel free to add your own metric measurements; they are relatively lightweight.
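If you'd rather check a metric from the terminal than from the Graph tab, a minimal sketch follows. It assumes the SSH tunnel above is open and uses Prometheus's HTTP API instant-query endpoint (/api/v1/query). The function name prom_query_url, the PROM_BASE variable, and the metric name my_metric are hypothetical and not part of this codebase.

```shell
#!/usr/bin/env bash
# Build the URL for an instant query against Prometheus's HTTP API.
# PROM_BASE (hypothetical) defaults to the tunnelled address used above.
# The expression is not URL-encoded, so this only suits plain metric names.
prom_query_url() {
  printf '%s/api/v1/query?query=%s\n' "${PROM_BASE:-http://localhost:9090}" "$1"
}
```

For example, curl --silent "$(prom_query_url my_metric)" prints the JSON response for that metric, replacing my_metric with a metric you actually logged.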
Create a new VM instance using worker-image, and check that it has access to /mnt/nfs/tmp and can run java and prometheus from the command line. Then, check that eval-primary can SSH into itself and also run java and prometheus with the following, replacing <username> with the username on eval-primary:
ssh -i ~/.ssh/id_ed25519 <username>@localhost java
ssh -i ~/.ssh/id_ed25519 <username>@localhost prometheus
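The manual checks on the new VM can be bundled into a small preflight script. This is only a sketch: the helper names missing_cmds and dir_usable are hypothetical, and it assumes that "has access" means the directory is both readable and writable.

```shell
#!/usr/bin/env bash
# Print the name of every listed command that is not found on PATH.
missing_cmds() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Succeed only if the path is a directory that is readable and writable.
dir_usable() {
  [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}
```

On the VM, missing_cmds java prometheus should print nothing, and dir_usable /mnt/nfs/tmp || echo "cannot use /mnt/nfs/tmp" should stay silent. The SSH commands above are still worth running separately, since a command that resolves in an interactive shell may not be on the PATH of a non-interactive SSH session.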
This is usually caused by:
- Prometheus attempted to start on a port that conflicts with other processes, because the configuration you're trying to launch is too large, as seen here. Start Prometheus on a higher port number (like the 60000s).
- Zombie processes from previous runs. Find them with ps aux and kill them one by one, or if you can't, restart the VM.
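To narrow down the ps aux output, a filter along these lines may help. It is a sketch that assumes the leftover processes are the java and prometheus processes started by earlier runs; the function name leftover_pids is hypothetical.

```shell
#!/usr/bin/env bash
# Read `ps aux`-style lines on stdin and print the PID (second column) of
# each line mentioning java or prometheus. Lines mentioning awk are skipped
# so the filter does not report its own process.
leftover_pids() {
  awk '/java|prometheus/ && !/awk/ { print $2 }'
}
```

Run ps aux | leftover_pids to list candidate PIDs, check each one against the full ps aux line before killing it, then use kill with the PID. The match is on the whole line, so a user name or path containing "java" would also be reported.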
Try waiting for all messages to arrive (2f+1 for Paxos, for example), instead of a quorum. This should simplify garbage collection logic, and make Dedalus way more performant for some reason.
Alternatively, check if you're using a relation that is persisted with .persist in the body of any rule. Since those relations grow monotonically, evaluating those rules slows down over time, so avoid using persisted relations in rule bodies.
Check that all your types are correct: did you put 2 different types in the same attribute of a relation, or use the same relation with a different number of attributes? If so, type inference may have failed, so you should try manual annotation, like in this example where the type of nextSlot is manually annotated.