DS starts in read-only mode when another replica is unreachable #85
Comments
GitHub won't let me attach YML file, so here it is:

```yaml
version: "3.4"

x-shared-config: &shared-config
  extra_hosts:
    - "wrends-test1:10.0.0.31"
    - "wrends-test2:10.0.0.32"

services:
  wrends-test1:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test1
    environment:
      ADDITIONAL_SETUP_ARGS: "--sampleData 10"
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    volumes:
      - wrends-data:/opt/wrends/instance
    networks:
      wrenam:
        ipv4_address: 10.0.0.31
    <<: *shared-config

  wrends-test2:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test2
    environment:
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    networks:
      wrenam:
        ipv4_address: 10.0.0.32
    <<: *shared-config

volumes:
  wrends-data:

networks:
  wrenam:
    name: wrends-test
    ipam:
      config:
        - subnet: 10.0.0.0/24
```
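Once both containers are up, the replication topology can be checked with the dsreplication tool shipped with the server. This is only a sketch: it assumes replication has already been enabled between the two instances (that step is not part of the compose file), that the server binaries live under /opt/wrends/bin, and that the default 4444 administration port with an admin/password global administrator is in use.

```shell
# Sketch: verify the replication topology from the first instance.
# The /opt/wrends/bin path, the 4444 admin port, and the admin credentials are assumptions.
docker exec -it wrends-test1 /opt/wrends/bin/dsreplication status \
  --hostname localhost --port 4444 \
  --adminUID admin --adminPassword password \
  --trustAll --no-prompt
```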
What is actually happening under the hood:
To be honest I don't understand why the wait is there; see line 337 in b9af647 and wrends/opendj-server-legacy/src/main/java/org/opends/server/replication/server/MessageHandler.java, lines 566 to 567 in b9af647.
I mean, the only thing that happens from the point of view of the data server is that it waits. When I drop the requirement for the connection check, everything works as expected. But dropping something that is there obviously for some reason is a pretty big deal :/.
Digging through commit history... Based on 51ef33b it seems that the wait was always mandatory when creating a new replication domain. The commit message for this change (4d90aff) reads like it could be the title of this issue.
Still, none the wiser as to why the wait is there.
Ok, even though the timeout is very strangely implemented, the server is behaving as intended.
Summary
When a single server is unreachable in multi-master replication mode (so that the socket connection has to time out), the replication server becomes unable to accept any connection before the data server's own socket timeout expires. This happens during the first DS-RS handshake phase and is accompanied by the following error in the log:
Steps To Reproduce ("Repro Steps")
```shell
docker-compose up -d
docker-compose stop
docker-compose up -d wrends-test1
docker exec -it wrends-test1 ldapdelete -h localhost -p 1389
```
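The last step above is captured only partially. A complete invocation could look like the sketch below, where the bind DN and password come from the compose file and the target entry (one of the entries generated by --sampleData, under an assumed dc=example,dc=com base DN) is purely illustrative:

```shell
# Hypothetical complete form of the ldapdelete step.
# The target DN and the dc=example,dc=com base DN are assumptions for illustration.
docker exec -it wrends-test1 ldapdelete -h localhost -p 1389 \
  -D "cn=Directory Manager" -w password \
  "uid=user.0,ou=People,dc=example,dc=com"
```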
Expected Result (Behavior You Expected to See)
The server deletes the requested LDAP entry.
Actual Result (Behavior You Saw)
The following error is returned:
Additional Notes
I have spent several hours trying to debug this. The underlying issue is that the replication server's connection listener tries to contact all other replication servers when accepting a new connection. As the other server does not exist, this attempt times out with the same timeout value as the data server's own timeout for the connection handshake.
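One rough way to observe this from the outside (a sketch only; it assumes the default 8989 replication port and that the ss utility is present in the container image) is to watch the socket states on the surviving server while a data server is trying to connect: the outbound attempt to the missing peer sits in SYN-SENT until the timeout fires.

```shell
# Sketch: observe the stuck outbound replication connection attempt to the missing peer.
# The 8989 default replication port and the availability of `ss` are assumptions.
docker exec wrends-test1 ss -tan | grep 8989
```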
I am not sure why we need to wait for the replication domain to actually contact all servers. Simply increasing the handshake timeout wouldn't work either: with multiple servers in the domain, the listener presumably ends up waiting on each unreachable server, so no single fixed handshake timeout is guaranteed to be long enough.
Creating this issue to track the discussion about how to solve this problem.