Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get returns random values when lower than default OPT_POLL_TIMEOUT is set #531

Closed
ondrejcech opened this issue Jan 9, 2023 · 42 comments
Closed

Comments

@ondrejcech
Copy link

After upgrading to libmemcached to 3.1.5-6, I'm sometimes (!) getting random results when calling the get method. There's no error, but the returned value is from a different key. This does not happen on every request. I've narrowed this error down to OPT_POLL_TIMEOUT option. When set to lower than 500ms this starts to happen. It seems that if OPT_POLL_TIMEOUT happens, the library returns a seemingly random value without any error.

It does not happen when OPT_POLL_TIMEOUT is set to default 1000ms value or omitted.

Not sure if it's an issue of libmemcached or php-memcached though.

PHP version:

php -v
PHP 8.1.12 (cli) (built: Oct 25 2022 17:30:00) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.1.12, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.12, Copyright (c), by Zend Technologies

memcache* libraries:

rpm -qa | grep memcached
remi-libmemcached-awesome-1.1.3-1.el7.remi.x86_64
libmemcached-1.0.16-5.el7.x86_64
php81-php-pecl-memcached-3.2.0-1.el7.remi.x86_64
libmemcached-opt-libs-1.0.18-2.el7.remi.x86_64
memcached-1.6.16-1.el7.remi.x86_64
php80-php-pecl-memcached-3.1.5-4.el7.remi.x86_64

some memcache stats

echo 'stats' | nc localhost 11211
STAT version 1.6.16
STAT libevent 2.0.21-stable
END

I will be more than happy to provide more debug info.

@sodabrew
Copy link
Contributor

sodabrew commented Jan 9, 2023

Are the many thumbs up coming from people who are able to reproduce this problem? If so, please post your php-memcached and libmemcached and memcached versions

@ondrejcech
Copy link
Author

@sodabrew TBH, I think most of the thumbs-up comes from the same project. That volume just means how important the issue is for us.

@k0ka
Copy link

k0ka commented Jan 25, 2023

Hi,

Thanks God I'm not alone.
It looks like I have the same problem. Here is my debug info:

  1. I have quite an old container. The project runs properly there.
php 7.4.13:
php-pecl-memcached-3.1.5-1.el8.remi.7.4.x86_64
libmemcached-libs-1.0.18-15.el8.x86_64
  1. I tried to update it two times:
php 7.4.33 
php-pecl-memcached-3.2.0-1.el8.remi.7.4.x86_64
remi-libmemcached-awesome-1.1.3-1.el8.remi.x86_64

and

php 8.1.14
php-pecl-memcached-3.2.0-1.el8.remi.8.1.x86_64
remi-libmemcached-awesome-1.1.3-1.el8.remi.x86_64

both had that weird behavior.

  1. it was reproduced both in a localhost memcached and remote memcached cluster (several servers). I'm using the latest docker.io/library/memcached:1.6-alpine

  2. I narrowed the problem to php-pecl-memcached (or libmemcached-awesome). I just logged every get key and returned the value just after calling Memcached::get. Sometimes it returns a value for another key that was requested before.

  3. The value is always from the same request. So it was requested during the same page processing. We have a lot of pages so I'm pretty sure it is not a value from another request.

  4. (It looks like) it is only reproduced if persistent_id is set during Memcached construction and is not unique.

  5. The ordinal chance for it is less than 1%. However I've found these settings that make chance much bigger, and got this issue reproduced on my test server very reliably. Here are all my options:

                Memcached::OPT_BINARY_PROTOCOL => true,
                Memcached::OPT_TCP_NODELAY => true,
                
                Memcached::OPT_DISTRIBUTION => Memcached::DISTRIBUTION_CONSISTENT,
                Memcached::OPT_LIBKETAMA_COMPATIBLE => true,

                Memcached::OPT_SERIALIZER => Memcached::SERIALIZER_IGBINARY,
                Memcached::OPT_REMOVE_FAILED_SERVERS => true,

                Memcached::OPT_RETRY_TIMEOUT => 60, // seconds
                Memcached::OPT_CONNECT_TIMEOUT => 20, // milliseconds
                Memcached::OPT_POLL_TIMEOUT => 20, // milliseconds
                Memcached::OPT_RECV_TIMEOUT => 20*1000, // microseconds
                Memcached::OPT_SEND_TIMEOUT => 20*1000, // microseconds

                Memcached::OPT_SERVER_FAILURE_LIMIT => 10,

Commenting the first two options out makes is not reproducible on the test server, but it still happens on the prod.

  1. I dumped messages between php and memcached for a page that had problem on it (it is binary protocol). The php got same values for 2 keys: joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/main/__motto/sf_cache_key/40cd750bba9870f18aada2478b24840a and joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/people/__weekTop/sf_cache_key/40cd750bba9870f18aada2478b24840a. Here is how it looks like from tcp view (I snipped unrelated packets and response data):
13:44:30.249726 IP (tos 0x0, ttl 64, id 28370, offset 0, flags [DF], proto TCP (6), length 203)
    127.0.0.1.56144 > 127.0.0.1.11211: Flags [P.], cksum 0xfebf (incorrect -> 0xc43c),
    seq 4035:4186, ack 29271, win 2207, options [nop,nop,TS val 3561905189 ecr 3561905188], length 151
    0x0000:  4500 00cb 6ed2 4000 4006 cd58 7f00 0001  E...n.@[email protected]....
    0x0010:  7f00 0001 db50 2bcb 3be1 4abe c434 d160  .....P+.;.J..4.`
    0x0020:  8018 089f febf 0000 0101 080a d44e 5c25  .............N\%
    0x0030:  d44e 5c24 800d 007f 0000 0000 0000 007f  .N\$............
    0x0040:  003e 0000 0000 0000 0000 0000 6a6f 7972  .>..........joyr
    0x0050:  6561 6374 6f72 3a2f 6b6f 6b61 5f6a 6f79  eactor:/koka_joy
    0x0060:  7265 6163 746f 725f 6964 656e 7469 7479  reactor_identity
    0x0070:  776f 726b 735f 7275 2f72 752f 616c 6c2f  works_ru/ru/all/
    0x0080:  7366 5f63 6163 6865 5f70 6172 7469 616c  sf_cache_partial
    0x0090:  2f6d 6169 6e2f 5f5f 6d6f 7474 6f2f 7366  /main/__motto/sf
    0x00a0:  5f63 6163 6865 5f6b 6579 2f34 3063 6437  _cache_key/40cd7
    0x00b0:  3530 6262 6139 3837 3066 3138 6161 6461  50bba9870f18aada
    0x00c0:  3234 3738 6232 3438 3430 61              2478b24840a

13:44:30.249758 IP (tos 0x0, ttl 64, id 48713, offset 0, flags [DF], proto TCP (6), length 699)
    127.0.0.1.11211 > 127.0.0.1.56144: Flags [P.], cksum 0x00b0 (incorrect -> 0x8df0), 
    seq 29271:29918, ack 4186, win 169, options [nop,nop,TS val 3561905189 ecr 3561905189], length 647
    0x0000:  4500 02bb be49 4000 4006 7bf1 7f00 0001  E....I@.@.{.....
    0x0010:  7f00 0001 2bcb db50 c434 d160 3be1 4b55  ....+..P.4.`;.KU
    0x0020:  8018 00a9 00b0 0000 0101 080a d44e 5c25  .............N\%
    0x0030:  d44e 5c25 810d 007f 0400 0000 0000 026f  .N\%...........o
    0x0040:  003e 0000 0000 0000 0000 0045 0000 0000  .>.........E....
    0x0050:  6a6f 7972 6561 6374 6f72 3a2f 6b6f 6b61  joyreactor:/koka
    0x0060:  5f6a 6f79 7265 6163 746f 725f 6964 656e  _joyreactor_iden
    0x0070:  7469 7479 776f 726b 735f 7275 2f72 752f  tityworks_ru/ru/
    0x0080:  616c 6c2f 7366 5f63 6163 6865 5f70 6172  all/sf_cache_par
    0x0090:  7469 616c 2f6d 6169 6e2f 5f5f 6d6f 7474  tial/main/__mott
    0x00a0:  6f2f 7366 5f63 6163 6865 5f6b 6579 2f34  o/sf_cache_key/4
    0x00b0:  3063 6437 3530 6262 6139 3837 3066 3138  0cd750bba9870f18
    0x00c0:  6161 6461 3234 3738 6232 3438 3430 6161  aada2478b24840aa
    0x00d0:  3a32 3a7b 733a 373a 2263 6f6e 7465 6e74  :2:{s:7:"content
    0x00e0:  223b 733a 3530 3a22 d0a0 d0b5 d0bc d0b8  ";s:50:"........
    0x00f0:  d0bd d0b8 d181 d186 d0b5 d0bd d186 d0b8  ................
    0x0100:  d18f 20d0 b820 d180 d0b8 d0b3 d0b8 d0b4  ................
    0x0110:  d0bd d0be d181 d182 d18c 223b 733a 383a  ..........";s:8:
... SNIP ...

13:44:30.249763 IP (tos 0x0, ttl 64, id 28371, offset 0, flags [DF], proto TCP (6), length 76)
    127.0.0.1.56144 > 127.0.0.1.11211: Flags [P.], cksum 0xfe40 (incorrect -> 0x683e), 
    seq 4186:4210, ack 29918, win 2236, options [nop,nop,TS val 3561905189 ecr 3561905189], length 24
    0x0000:  4500 004c 6ed3 4000 4006 cdd6 7f00 0001  E..Ln.@.@.......
    0x0010:  7f00 0001 db50 2bcb 3be1 4b55 c434 d3e7  .....P+.;.KU.4..
    0x0020:  8018 08bc fe40 0000 0101 080a d44e 5c25  [email protected]\%
    0x0030:  d44e 5c25 800a 0000 0000 0000 0000 0000  .N\%............
    0x0040:  003f 0000 0000 0000 0000 0000            .?..........

13:44:30.249769 IP (tos 0x0, ttl 64, id 48714, offset 0, flags [DF], proto TCP (6), length 76)
    127.0.0.1.11211 > 127.0.0.1.56144: Flags [P.], cksum 0xfe40 (incorrect -> 0x6f39), 
    seq 29918:29942, ack 4210, win 169, options [nop,nop,TS val 3561905189 ecr 3561905189], length 24
    0x0000:  4500 004c be4a 4000 4006 7e5f 7f00 0001  E..L.J@.@.~_....
    0x0010:  7f00 0001 2bcb db50 c434 d3e7 3be1 4b6d  ....+..P.4..;.Km
    0x0020:  8018 00a9 fe40 0000 0101 080a d44e 5c25  [email protected]\%
    0x0030:  d44e 5c25 810a 0000 0000 0000 0000 0000  .N\%............
    0x0040:  003f 0000 0000 0000 0000 0000            .?..........

13:44:30.249923 IP (tos 0x0, ttl 64, id 28372, offset 0, flags [DF], proto TCP (6), length 203)
    127.0.0.1.56144 > 127.0.0.1.11211: Flags [P.], cksum 0xfebf (incorrect -> 0xc0ce), 
    seq 4210:4361, ack 29942, win 2236, options [nop,nop,TS val 3561905189 ecr 3561905189], length 151
    0x0000:  4500 00cb 6ed4 4000 4006 cd56 7f00 0001  E...n.@[email protected]....
    0x0010:  7f00 0001 db50 2bcb 3be1 4b6d c434 d3ff  .....P+.;.Km.4..
    0x0020:  8018 08bc febf 0000 0101 080a d44e 5c25  .............N\%
    0x0030:  d44e 5c25 800d 007f 0000 0000 0000 007f  .N\%............
    0x0040:  0040 0000 0000 0000 0000 0000 6a6f 7972  [email protected]
    0x0050:  6561 6374 6f72 3a2f 6b6f 6b61 5f6a 6f79  eactor:/koka_joy
    0x0060:  7265 6163 746f 725f 6964 656e 7469 7479  reactor_identity
    0x0070:  776f 726b 735f 7275 2f72 752f 616c 6c2f  works_ru/ru/all/
    0x0080:  7366 5f63 6163 6865 5f70 6172 7469 616c  sf_cache_partial
    0x0090:  2f6d 6169 6e2f 5f5f 6d6f 7474 6f2f 7366  /main/__motto/sf
    0x00a0:  5f63 6163 6865 5f6b 6579 2f34 3063 6437  _cache_key/40cd7
    0x00b0:  3530 6262 6139 3837 3066 3138 6161 6461  50bba9870f18aada
    0x00c0:  3234 3738 6232 3438 3430 61              2478b24840a

13:44:30.249954 IP (tos 0x0, ttl 64, id 48715, offset 0, flags [DF], proto TCP (6), length 699)
    127.0.0.1.11211 > 127.0.0.1.56144: Flags [P.], cksum 0x00b0 (incorrect -> 0x8a9e), 
    seq 29942:30589, ack 4361, win 171, options [nop,nop,TS val 3561905189 ecr 3561905189], length 647
    0x0000:  4500 02bb be4b 4000 4006 7bef 7f00 0001  E....K@.@.{.....
    0x0010:  7f00 0001 2bcb db50 c434 d3ff 3be1 4c04  ....+..P.4..;.L.
    0x0020:  8018 00ab 00b0 0000 0101 080a d44e 5c25  .............N\%
    0x0030:  d44e 5c25 810d 007f 0400 0000 0000 026f  .N\%...........o
    0x0040:  0040 0000 0000 0000 0000 0045 0000 0000  [email protected]....
    0x0050:  6a6f 7972 6561 6374 6f72 3a2f 6b6f 6b61  joyreactor:/koka
    0x0060:  5f6a 6f79 7265 6163 746f 725f 6964 656e  _joyreactor_iden
    0x0070:  7469 7479 776f 726b 735f 7275 2f72 752f  tityworks_ru/ru/
    0x0080:  616c 6c2f 7366 5f63 6163 6865 5f70 6172  all/sf_cache_par
    0x0090:  7469 616c 2f6d 6169 6e2f 5f5f 6d6f 7474  tial/main/__mott
    0x00a0:  6f2f 7366 5f63 6163 6865 5f6b 6579 2f34  o/sf_cache_key/4
    0x00b0:  3063 6437 3530 6262 6139 3837 3066 3138  0cd750bba9870f18
    0x00c0:  6161 6461 3234 3738 6232 3438 3430 6161  aada2478b24840aa
    0x00d0:  3a32 3a7b 733a 373a 2263 6f6e 7465 6e74  :2:{s:7:"content
    0x00e0:  223b 733a 3530 3a22 d0a0 d0b5 d0bc d0b8  ";s:50:"........
    0x00f0:  d0bd d0b8 d181 d186 d0b5 d0bd d186 d0b8  ................
    0x0100:  d18f 20d0 b820 d180 d0b8 d0b3 d0b8 d0b4  ................
    0x0110:  d0bd d0be d181 d182 d18c 223b 733a 383a  ..........";s:8:
... SNIP ...

13:44:30.249959 IP (tos 0x0, ttl 64, id 28373, offset 0, flags [DF], proto TCP (6), length 76)
    127.0.0.1.56144 > 127.0.0.1.11211: Flags [P.], cksum 0xfe40 (incorrect -> 0x64d0), 
    seq 4361:4385, ack 30589, win 2266, options [nop,nop,TS val 3561905189 ecr 3561905189], length 24
    0x0000:  4500 004c 6ed5 4000 4006 cdd4 7f00 0001  E..Ln.@.@.......
    0x0010:  7f00 0001 db50 2bcb 3be1 4c04 c434 d686  .....P+.;.L..4..
    0x0020:  8018 08da fe40 0000 0101 080a d44e 5c25  [email protected]\%
    0x0030:  d44e 5c25 800a 0000 0000 0000 0000 0000  .N\%............
    0x0040:  0041 0000 0000 0000 0000 0000            .A..........

13:44:30.249965 IP (tos 0x0, ttl 64, id 48716, offset 0, flags [DF], proto TCP (6), length 76)
    127.0.0.1.11211 > 127.0.0.1.56144: Flags [P.], cksum 0xfe40 (incorrect -> 0x6be7), 
    seq 30589:30613, ack 4385, win 171, options [nop,nop,TS val 3561905189 ecr 3561905189], length 24
    0x0000:  4500 004c be4c 4000 4006 7e5d 7f00 0001  E..L.L@.@.~]....
    0x0010:  7f00 0001 2bcb db50 c434 d686 3be1 4c1c  ....+..P.4..;.L.
    0x0020:  8018 00ab fe40 0000 0101 080a d44e 5c25  [email protected]\%
    0x0030:  d44e 5c25 810a 0000 0000 0000 0000 0000  .N\%............
    0x0040:  0041 0000 0000 0000 0000 0000            .A..........

13:44:30.250907 IP (tos 0x0, ttl 64, id 28374, offset 0, flags [DF], proto TCP (6), length 207)
    127.0.0.1.56144 > 127.0.0.1.11211: Flags [P.], cksum 0xfec3 (incorrect -> 0x0182), 
    seq 4385:4540, ack 30613, win 2266, options [nop,nop,TS val 3561905190 ecr 3561905189], length 155
    0x0000:  4500 00cf 6ed6 4000 4006 cd50 7f00 0001  E...n.@[email protected]....
    0x0010:  7f00 0001 db50 2bcb 3be1 4c1c c434 d69e  .....P+.;.L..4..
    0x0020:  8018 08da fec3 0000 0101 080a d44e 5c26  .............N\&
    0x0030:  d44e 5c25 800d 0083 0000 0000 0000 0083  .N\%............
    0x0040:  0042 0000 0000 0000 0000 0000 6a6f 7972  .B..........joyr
    0x0050:  6561 6374 6f72 3a2f 6b6f 6b61 5f6a 6f79  eactor:/koka_joy
    0x0060:  7265 6163 746f 725f 6964 656e 7469 7479  reactor_identity
    0x0070:  776f 726b 735f 7275 2f72 752f 616c 6c2f  works_ru/ru/all/
    0x0080:  7366 5f63 6163 6865 5f70 6172 7469 616c  sf_cache_partial
    0x0090:  2f70 656f 706c 652f 5f5f 7765 656b 546f  /people/__weekTo
    0x00a0:  702f 7366 5f63 6163 6865 5f6b 6579 2f34  p/sf_cache_key/4
    0x00b0:  3063 6437 3530 6262 6139 3837 3066 3138  0cd750bba9870f18
    0x00c0:  6161 6461 3234 3738 6232 3438 3430 61    aada2478b24840a

13:44:30.250939 IP (tos 0x0, ttl 64, id 48717, offset 0, flags [DF], proto TCP (6), length 2051)
    127.0.0.1.11211 > 127.0.0.1.56144: Flags [P.], cksum 0x05f8 (incorrect -> 0x8a94), 
    seq 30613:32612, ack 4540, win 173, options [nop,nop,TS val 3561905190 ecr 3561905190], length 1999
    0x0000:  4500 0803 be4d 4000 4006 76a5 7f00 0001  E....M@[email protected].....
    0x0010:  7f00 0001 2bcb db50 c434 d69e 3be1 4cb7  ....+..P.4..;.L.
    0x0020:  8018 00ad 05f8 0000 0101 080a d44e 5c26  .............N\&
    0x0030:  d44e 5c26 810d 0083 0400 0000 0000 07b7  .N\&............
    0x0040:  0042 0000 0000 0000 0000 0048 0000 0050  .B.........H...P
    0x0050:  6a6f 7972 6561 6374 6f72 3a2f 6b6f 6b61  joyreactor:/koka
    0x0060:  5f6a 6f79 7265 6163 746f 725f 6964 656e  _joyreactor_iden
    0x0070:  7469 7479 776f 726b 735f 7275 2f72 752f  tityworks_ru/ru/
    0x0080:  616c 6c2f 7366 5f63 6163 6865 5f70 6172  all/sf_cache_par
    0x0090:  7469 616c 2f70 656f 706c 652f 5f5f 7765  tial/people/__we
    0x00a0:  656b 546f 702f 7366 5f63 6163 6865 5f6b  ekTop/sf_cache_k
    0x00b0:  6579 2f34 3063 6437 3530 6262 6139 3837  ey/40cd750bba987
    0x00c0:  3066 3138 6161 6461 3234 3738 6232 3438  0f18aada2478b248
    0x00d0:  3430 61da 1a00 001f 613a 323a 7b73 3a37  40a.....a:2:{s:7
    0x00e0:  3a22 636f 6e74 656e 7422 3b73 3a36 3433  :"content";s:643
    0x00f0:  303a 223c 6469 7620 1863 6c61 7373 3d22  0:"<div..class="
    0x0100:  7369 6465 6261 725f 626c 6f63 6b22 3e3c  sidebar_block"><
    0x0110:  6832 e003 1910 6865 6164 6572 223e d0a2  h2....header">..
... SNIP ...

What is interesting there:

  • the first key is requested two times. But php only requests it once.
  • the second key is requested too, and the server returns the right value for it (with <div class)
  • however the php-pecl-memcached returned the first value for the second key.

I guess something inside libraries retried the first key request and considers its second response as the response for the second key.

Most probably that's a problem is libmemcached-awesome. In this case, could you help composing an issue for them?

@sodabrew
Copy link
Contributor

sodabrew commented Jan 25, 2023

This is awesome debugging info! Tag @m6w6, and maybe just open a tracking issue on awesomized/libmemcahed that links back here.

I'm thinking it may be useful to construct a C test harness that isolates to libmemcahed without PHP involved.

@m6w6
Copy link
Contributor

m6w6 commented Jan 25, 2023

Thanks for your effort!

Any chance for some code that would reproduce this issue?

It's unfortunate, that it has always been tried to upgrade both at the same time, so neither pecl-memcached nor libmemcached-awesome can be ruled out...

That's suspicious, at least:

(It looks like) it is only reproduced if persistent_id is set during Memcached construction and is not unique.

I'll have a look at POLL_TIMEOUT in the meantime, but TBH, it doesn't look too promising that I'll "just be able to see" something...

@k0ka
Copy link

k0ka commented Jan 26, 2023

I can confirm setting OPT_POLL_TIMEOUT to default value 1000 fixes the problem.

I guess it's some sort of "race condition" bug. I tried to run a simple script with a lot of gets, but the bug wasn't reproduced on it. On my test server it is randomly shown once per 5-10 requests.

@m6w6
Copy link
Contributor

m6w6 commented Feb 1, 2023

It could be, that I found a related commit. Hang on.

@k0ka
Copy link

k0ka commented Feb 1, 2023

I'm still debugging it. I already found that it works perfectly until timeout occurs.

it returns an error like this
(0x5585c02958b0) A TIMEOUT OCCURRED, time out, host: localhost:11211 -> /builddir/build/BUILD/libmemcached-6e713c39031ff312164cc3ff12f943f8e8e01885/src/libmemcached/io.cc:220

And after that it starts sending wrong responses quite often until process is restarted.
I'm going to continue debugging it soon using gdb and most probably will post here some results tomorrow.

@k0ka
Copy link

k0ka commented Feb 2, 2023

I guess there are two bad things:

  1. libmemcached-awesome sometimes throws A TIMEOUT OCCURRED even though it must not happen. 20ms poll timeout is more than enough for a localhost connection.
  2. php-pecl-memcached does something wrong when this timeout comes in.

I added logs to php to print everything that is returned from Memcached::get immediately after it. Here is the log of one failed page request:

    [0] => Array
        (
            [type] => get
            [key] => joyreactor:symfony.routing.data
            [value] => a:104:{s:52:"parse_/post/4849934_1321ec5d5af53bbf347fa27ea57d3d4b";...
            [error] => SUCCESS
        )

    [1] => Array
        (
            [type] => get
            [key] => joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6
            [value] => a:2:{s:7:"content";s:514:"<div class="sidebar_block"><h2 class="sideheader">...
            [error] => SUCCESS
        )

    [2] => Array
        (
            [type] => get
            [key] => joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6
            [value] => a:2:{s:7:"content";s:1742:"0|{"title":"\u041a\u0430\u043c\u0435\u043d\u043d\u043e\u0443\u0433\u043e...                  
            [error] => SUCCESS
        )

    [3] => Array
        (
            [type] => get
            [key] => joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__post_cached/sf_cache_key/4849934_noauth
            [value] => a:2:{s:7:"content";s:2512:"0|<div class="article post-normal"><div class="post_top">...
            [error] => SUCCESS
        )

We have 4 requests there. The 2nd and 3rd request requests the same key. The return value on the 2nd request is not right. It returned a value from another key (that was in the end of the previous request). The right value is returned in the 3rd response.

I set 2 breakpoints in the libmemcached-awesome and returned key that was queried there:
1)

br memcached_mget
printf "mget = %s\n", keys[0]
br memcached_fetch_result
printf "fetch = %s\n", result->item_key

Here is what was printed:

mget = joyreactor:routing:4:symfony.routing.data
fetch =
fetch = joyreactor:routing:4:symfony.routing.data▒o%
mget = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6
fetch =
fetch = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/people/__onlineUsers/sf_cache_key/40cd750bba9870f18aada2478b24840a
fetch = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6750bba9870f18aada2478b24840a
mget = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6
fetch =
fetch = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__postSeo/sf_cache_key/4849934_v6
mget = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__post_cached/sf_cache_key/4849934_noauth
fetch =
fetch = joyreactor:/koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/post/__post_cached/sf_cache_key/4849934_noauth

As you can see, after the 2nd mget it requests old key first (__onlineUsers) and then requests the right key (__postSeo) but ignores its result.

In both logs I skipped everything after the 4th request. There was 19 requests to memcached on the page.

@k0ka
Copy link

k0ka commented Feb 2, 2023

Let me know if you need any other internal information to debug it.

@m6w6
Copy link
Contributor

m6w6 commented Feb 7, 2023

This might be an even better candidate: awesomized/libmemcached@d7a0084

@m6w6
Copy link
Contributor

m6w6 commented Feb 18, 2023

I'd need someone to confirm/decline any offending commit to proceed; or/and (even better) a reliable reproduce case.

@m6w6
Copy link
Contributor

m6w6 commented Feb 20, 2023

Hey! Did anyone have a chance to test whether their issue was solved with reverting either commit?

@k0ka
Copy link

k0ka commented Feb 27, 2023

Hi,

Sorry for long response. I set the test environment. Reverted that commit (k0ka/libmemcached@6197650). But when I'm building rpm with this patch (from Remi's srpm) it hangs on test 31: memcached_noblock.

Should I try to revert only changes in response.cc?

@m6w6
Copy link
Contributor

m6w6 commented Feb 28, 2023

I remember fiddling around with this test when looking for offending commits. It's probably a bad test in itself.

IIRC, just make https://github.com/k0ka/libmemcached/blob/v1.x/test/tests/memcached/noblock.cpp#L13 read:

auto num = 10000;

@k0ka
Copy link

k0ka commented Feb 28, 2023

Yep, this fixes the problem with the test. However, I haven't run it yet.

I'm trying to reproduce the bug as a test. I ended up with this code:

#include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"

TEST_CASE("memcached_poll_timeout") {
  auto test = MemcachedCluster::network();
  auto memc = &test.memc;

  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RCV_TIMEOUT, 20*1000));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SND_TIMEOUT, 20*1000));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 60));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_CONNECT_TIMEOUT, 20));


  memcached_return_t rc;
  size_t len;

  auto num = 500;
  auto hit = 0u;
  auto rnd = random_ascii_string(128);

  REQUIRE_SUCCESS(memcached_flush(memc, 0));

  for (auto i = 0; i < num; ++i) {
    auto setKey = rnd + to_string(i);
    REQUIRE_SUCCESS(memcached_set(memc, setKey.c_str(), setKey.length(), setKey.c_str(),
                                  setKey.length(), 1, 0));
  }
  memcached_quit(memc);

  auto timeoutAchieved = false;
  for (auto i = 0; i < num; ++i) {
    for (auto j = 0; j < i; ++j) {
      auto getKey = rnd + to_string(j);
      auto len = getKey.length();
      char key[MEMCACHED_MAX_KEY];
      memcpy(key, getKey.c_str(), len + 1);
      const char *keys[1] = {key};

      REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));
      auto numResults = 0;
      while (auto rv = memcached_fetch(memc, key, nullptr, &len, nullptr, &rc)) {
        REQUIRE(!timeoutAchieved);

        REQUIRE_SUCCESS(rc);
        free(rv);

        numResults++;
        REQUIRE(numResults == 1);
      }

      if (MEMCACHED_END == rc || MEMCACHED_NOTFOUND == rc ){
        continue;
      }

      if (MEMCACHED_TIMEOUT == rc){
        timeoutAchieved = true;
        continue;
      }

      INFO(rc);
      REQUIRE(false);
    }
  }
}

However it just hangs after timeoutAchieved became true and memcached_fetch runs. Is that expected behavior?

m6w6 added a commit to awesomized/libmemcached that referenced this issue Mar 1, 2023
@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

That's interesting. It doesn't hang, though; it just takes a very long time on linux.
See https://github.com/awesomized/libmemcached/actions/runs/4301629917

@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

Thanks, for the test case, though. I'll try to debug it on Linux later.

@k0ka
Copy link

k0ka commented Mar 1, 2023

I also don't understand why it returns MEMCACHED_NOTFOUND sometimes.
The main idea of the test if just to set 500 keys and then query them several times.

@k0ka
Copy link

k0ka commented Mar 1, 2023

In your test runs timeoutAchieved is never set to true (if it is set the test will fail). On my test server it happens every run. Maybe you should decrease MEMCACHED_BEHAVIOR_POLL_TIMEOUT ?

I tested it with MEMCACHED_BEHAVIOR_POLL_TIMEOUT=1000 and 20. It never happens on 1000 and always happens on 20.

@k0ka
Copy link

k0ka commented Mar 1, 2023

So, I'm using rockylinux 8 (RHEL clone). Compile flags are taken from the remi's srpm:

/usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_PREFIX:PATH=/opt/libmemcached-awesome -DINCLUDE_INSTALL_DIR:PATH=/opt/libmemcached-awesome/include -DLIB_INSTALL_DIR:PATH=/opt/libmemcached-awesome/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/opt/libmemcached-awesome/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_TESTING:BOOL=ON -DBUILD_DOCS_MAN:BOOL=ON -DBUILD_DOCS_MANGZ:BOOL=OFF -DENABLE_SASL:BOOL=ON -DENABLE_DTRACE:BOOL=ON -DENABLE_OPENSSL_CRYPTO:BOOL=ON -DENABLE_HASH_HSIEH:BOOL=ON -DENABLE_HASH_FNV64:BOOL=ON -DENABLE_HASH_MURMUR:BOOL=ON -DENABLE_MEMASLAP:BOOL=ON -S .

@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

Thanks for the extra info. Can you also share some words about your hardware?

@k0ka
Copy link

k0ka commented Mar 1, 2023

[root@serv52 koka]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
BIOS Vendor ID:      Advanced Micro Devices, Inc.
CPU family:          23
Model:               113
Model name:          AMD Ryzen 5 3600 6-Core Processor
BIOS Model name:     AMD Ryzen 5 3600 6-Core Processor
Stepping:            0
CPU MHz:             3600.000
CPU max MHz:         4208.2031
CPU min MHz:         2200.0000
BogoMIPS:            7185.82
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
[root@serv52 koka]# uname -a
Linux serv52.identityworks.ru 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@k0ka
Copy link

k0ka commented Mar 1, 2023

I finally created a docker container with this patch. It looks like this bug is gone - I wasn't able to reproduce it with that commit reverted. However I still see quite a lot of "TIMEOUT OCCURRED" errors. About 5%. The memcached is on the same server.

    [14] => Array
        (
            [type] => get
            [key] => /koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/people/__onlineUsers/sf_cache_key/40cd750bba9870f18aada2478b24840a
            [value] => 
            [error] => (0x55fa3332e5d0) A TIMEOUT OCCURRED, time out,  host: localhost:11211 -> /root/rpmbuild/BUILD/libmemcached-6e713c39031ff312164cc3ff12f943f8e8e01885/src/libmemcached/io.cc:220
        )

    [15] => Array
        (
            [type] => get
            [key] => /koka_joyreactor_identityworks_ru/ru/all/sf_cache_partial/people/__onlineUsers/sf_cache_key/40cd750bba9870f18aada2478b24840a
            [value] => 
            [error] => SUCCESS
        )

m6w6 added a commit to awesomized/libmemcached that referenced this issue Mar 1, 2023
@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

Just realised, your test is doing 500! gets on 500 existing and 500! - 500 non-existent keys... (! means factorial)... yikes! 🤣

Anyway, I think, in normal circumstances, with this low of a timeout, there's still the previous response floating around, since, well, a timeout occurred while trying to fetch it, and the next request will happily fetch the previous response the next time.

I also think, these timeouts are really low, even computers sometimes need some time to think 🤔
If the server is going to fail, it will also fail with a timeout of 100ms.

Please have a try with awesomized/libmemcached@48dcc61a and see if it fixes your issues; it basically reverts 90% of awesomized/libmemcached@d7a0084

Thank you!

@k0ka
Copy link

k0ka commented Mar 1, 2023

it's not 500! iterations, it's only 500*250 iterations.

And 100ms is way too much for a bare metal server. The full page is generated in 200-300ms including several sql requests and 15-20 memcached requests. It took less than 1ms to process a memcached request - you may check my tcpdump log. It has timestamps.

@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

it's not 500! iterations, it's only 500*250 iterations.

Uhm, I may have been sidetracked, please enlighten me!

And 100ms is way too much for a bare metal server. The full page is generated in 200-300ms including several sql requests and 15-20 memcached requests. It took less than 1ms to process a memcached request - you may check my tcpdump log. It has timestamps.

I unterstand your sentiment, but i have been trying to understand Linux clock granularity for the past couple hours.

If a server fails, it doesn't matter
If in 20 or 200 ms.

Most probably I've been looking in the wrong corners...

@k0ka
Copy link

k0ka commented Mar 1, 2023

Please have a try with awesomized/libmemcached@48dcc61a and see if it fixes your issues; it basically reverts 90% of awesomized/libmemcached@d7a0084

I checked it, it works the same way - it fixes the most nasty bug with "wrong key return". However the timeout still occurs.

I'm pretty sure it's not right. I can get tcpdump of interaction with memcached server to see timings.

@k0ka
Copy link

k0ka commented Mar 1, 2023

it's not 500! iterations, it's only 500*250 iterations.

Uhm, I may have been sidetracked, please enlighten me!

The current test is:

1) create 500 keys.
2) for i from 0 to 500:
        for j from 0 to i:
            get j-th key

so the first iteration goes from 0 to 0 and asks for 1 keys, the 2nd - from 0 to 1 and asks for 2 keys, 3rd - from 0 to 2 and asks 3 keys, nth goes from 0 to n and asks for n keys.
1 + 2 + 3 + ... + n = (n * n-1)/2

(actually it goes from 0 to i-1, so it asks even less keys)

@m6w6
Copy link
Contributor

m6w6 commented Mar 1, 2023

Yep, but for num times.

@k0ka
Copy link

k0ka commented Mar 1, 2023

why num times? it executes memcached_mget with 1 key inside of two cycles. And fetches only 1 key.

Anyway, I narrowed my test to this:

#include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"

TEST_CASE("memcached_poll_timeout") {
  auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
  auto memc = &test.memc;

  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RCV_TIMEOUT, 20 * 1000));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SND_TIMEOUT, 20 * 1000));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 60));
  // REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_CONNECT_TIMEOUT, 20));

  memcached_return_t rc;

  auto num = 100'000;

  REQUIRE_SUCCESS(memcached_flush(memc, 0));

  const char* key = "testkey1";
  auto len = strlen(key);

  REQUIRE_SUCCESS(memcached_set(memc, key, len, key, len, 1, 0));
  memcached_quit(memc);

  const char *keys[1] = {key};

  for (auto i = 0; i < num; ++i) {
    REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));

    auto numResults = 0;
    while (auto rv = memcached_fetch(memc, nullptr, nullptr, nullptr, nullptr, &rc)) {
      REQUIRE_SUCCESS(rc);
      free(rv);

      numResults++;
      REQUIRE(numResults == 1);
    }

    if (MEMCACHED_END == rc) {
      continue;
    }

    INFO(rc << " on iteration " << i);
    REQUIRE(false);
  }
}

So it only sets one key testkey1 and mgets it 100'000 times.

It constantly fails for me on about 60-70 000 iteration. But it is the memcached server fail.

19:33:10.916117 IP (tos 0x0, ttl 64, id 45519, offset 0, flags [DF], proto TCP (6), length 66)
    127.0.0.1.56098 > 127.0.0.1.11211: Flags [P.], cksum 0xfe36 (incorrect -> 0xec93), seq 978181:978195, ack 2445451, win 128, options [nop,nop,TS val 2401858559 ecr 2401858559], length 14
        0x0000:  4500 0042 b1cf 4000 4006 8ae4 7f00 0001  E..B..@.@.......
        0x0010:  7f00 0001 db22 2bcb e142 3258 8ddf 27d6  ....."+..B2X..'.
        0x0020:  8018 0080 fe36 0000 0101 080a 8f29 73ff  .....6.......)s.
        0x0030:  8f29 73ff 6765 7420 7465 7374 6b65 7931  .)s.get.testkey1
        0x0040:  0d0a                                     ..
19:33:10.916121 IP (tos 0x0, ttl 64, id 56988, offset 0, flags [DF], proto TCP (6), length 87)
    127.0.0.1.11211 > 127.0.0.1.56098: Flags [P.], cksum 0xfe4b (incorrect -> 0x3301), seq 2445451:2445486, ack 978195, win 128, options [nop,nop,TS val 2401858559 ecr 2401858559], length 35
        0x0000:  4500 0057 de9c 4000 4006 5e02 7f00 0001  E..W..@.@.^.....
        0x0010:  7f00 0001 2bcb db22 8ddf 27d6 e142 3266  ....+.."..'..B2f
        0x0020:  8018 0080 fe4b 0000 0101 080a 8f29 73ff  .....K.......)s.
        0x0030:  8f29 73ff 5641 4c55 4520 7465 7374 6b65  .)s.VALUE.testke
        0x0040:  7931 2030 2038 0d0a 7465 7374 6b65 7931  y1.0.8..testkey1
        0x0050:  0d0a 454e 440d 0a                        ..END..
19:33:10.916129 IP (tos 0x0, ttl 64, id 45520, offset 0, flags [DF], proto TCP (6), length 66)
    127.0.0.1.56098 > 127.0.0.1.11211: Flags [P.], cksum 0xfe36 (incorrect -> 0xec62), seq 978195:978209, ack 2445486, win 128, options [nop,nop,TS val 2401858559 ecr 2401858559], length 14
        0x0000:  4500 0042 b1d0 4000 4006 8ae3 7f00 0001  E..B..@.@.......
        0x0010:  7f00 0001 db22 2bcb e142 3266 8ddf 27f9  ....."+..B2f..'.
        0x0020:  8018 0080 fe36 0000 0101 080a 8f29 73ff  .....6.......)s.
        0x0030:  8f29 73ff 6765 7420 7465 7374 6b65 7931  .)s.get.testkey1
        0x0040:  0d0a                                     ..
19:33:10.916149 IP (tos 0x0, ttl 64, id 56989, offset 0, flags [DF], proto TCP (6), length 57)
    127.0.0.1.11211 > 127.0.0.1.56098: Flags [P.], cksum 0xfe2d (incorrect -> 0x0e02), seq 2445486:2445491, ack 978209, win 128, options [nop,nop,TS val 2401858560 ecr 2401858559], length 5
        0x0000:  4500 0039 de9d 4000 4006 5e1f 7f00 0001  E..9..@.@.^.....
        0x0010:  7f00 0001 2bcb db22 8ddf 27f9 e142 3274  ....+.."..'..B2t
        0x0020:  8018 0080 fe2d 0000 0101 080a 8f29 7400  .....-.......)t.
        0x0030:  8f29 73ff 454e 440d 0a                   .)s.END..

I guess we might accept awesomized/libmemcached@48dcc61a as the fix for this problem.

@k0ka
Copy link

k0ka commented Mar 1, 2023

wait, I'm dumb. I set key to 1 sec lifetime. That's why it is expired. Making more testing...

@k0ka
Copy link

k0ka commented Mar 1, 2023

Or I'm a genius (accidentally)! It looks like problems starts when a key is expired. It only happens when BINARY_PROTOCOL is on. The problem is in the memcached server. It just answers very long for a key that doesn't exist.

#include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"

TEST_CASE("memcached_poll_timeout") {
  auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
  auto memc = &test.memc;

  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));

  memcached_return_t rc;

  REQUIRE_SUCCESS(memcached_flush(memc, 0));

  const char* key = "testkey1";
  auto len = strlen(key);

  const char *keys[1] = {key};

  REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));

  auto numResults = 0;
  while (auto rv = memcached_fetch(memc, nullptr, nullptr, nullptr, nullptr, &rc)) {
    REQUIRE_SUCCESS(rc);
    free(rv);

    numResults++;
    REQUIRE(numResults == 1);
  }

  if (MEMCACHED_END != rc && MEMCACHED_NOTFOUND != rc) {
    INFO(rc);
    REQUIRE(false);
  }
}

This test always fails with rc=31 (MEMCACHED_TIMEOUT) for me. Setting BINARY_PROTOCOL to 0 makes the test passes perfectly. If I set POLL_TIMEOUT to 100 the test will pass but fetch would take quite a lot of time (about 40ms for each request).

TL;DR:

  1. BINARY_PROTOCOL must be set to 0 in production (I heard about it somewhere already).
  2. querying non-existent key with BINARY_PROTOCOL=1 and small POLL_TIMEOUT is a good way to reliably get MEMCACHED_TIMEOUT error for tests to simulate a server that is not completely down.
  3. It looks like awesomized/libmemcached@48dcc61a fixes the problem.

@k0ka
Copy link

k0ka commented Mar 1, 2023

And here is a test that does it:

#include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"

TEST_CASE("memcached_poll_timeout") {
  auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
  auto memc = &test.memc;

  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 0));
  REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 1000));

  memcached_return_t rc;
  size_t len;

  auto num = 100;
  auto rnd = random_ascii_string(128);

  for (auto i = 0; i < num; ++i) {
    auto setKey = rnd + to_string(i);
    REQUIRE_SUCCESS(memcached_set(memc, setKey.c_str(), setKey.length(), setKey.c_str(),
                                  setKey.length(), 0, 0));
  }
  memcached_quit(memc);

  for (auto i = 0; i < num; ++i) {
    for (auto j = 0; j <= i; ++j) {
      auto getKey = rnd + to_string(j);
      auto len = getKey.length();
      char key[MEMCACHED_MAX_KEY];
      memcpy(key, getKey.c_str(), len + 1);
      const char *keys[1] = {key};

      if (0 == j){
        REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 0));
      }

      REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));

      auto numResults = 0;
      while (auto rv = memcached_fetch(memc, key, nullptr, &len, nullptr, &rc)) {
        REQUIRE_SUCCESS(rc);
        free(rv);

        numResults++;
        REQUIRE(numResults == 1);
      }

      if (0 == j){
        REQUIRE(MEMCACHED_TIMEOUT == rc);
        REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 1000));
        continue;
      }

      if (MEMCACHED_END == rc){
        continue;
      }

      INFO(rc);
      REQUIRE(false);
    }
  }
}

On my server it always fails on the current version of libmemcached, but always passes after awesomized/libmemcached@48dcc61a is applied.

Here is what it does:

  1. it creates keys from 0 to num-1.
  2. for i from 0 to num-1, it asks for keys 0, 1, 2, ... i, i in separate mget requests. Each request asks for only one key.
  3. the key 0 is always queried with POLL_TIMEOUT set to 0. The library must return MEMCACHED_TIMEOUT in this case.
  4. it queries other i-1 keys after timeout. And checks that only one result is returned.
  5. at some point it returns two results via memcached_fetch and assertion REQUIRE(numResults == 1); fails

Upd: made the test more reliable.

m6w6 added a commit to awesomized/libmemcached that referenced this issue Mar 2, 2023
@m6w6
Copy link
Contributor

m6w6 commented Mar 2, 2023

Hey, I pushed a much simpler test case; in case you didn't notice.

Unfortunately the fix for this problem is simple & hard (like in the branch) or complicated & considerate (I'll try to write something up until tomorrow).

If your observation of "memcached is choking on missing-key+binary-proto", we should file a decent issue on https://github.com/memcached/memcached, IMO. I'll also try to have a look at this.

Finally, I'm definitely not understanding why memcached_io_poll immediately returns when the timeout is set to zero? Maybe the socket actually IS readable without a timeout?

Unfortunately, the libmemcached legacy is a over-engineered mess; someone should think about making a new meta-protocol-only client library. Maybe @dormando's proxy code would be a good fit for a new client library?

@dormando
Copy link

dormando commented Mar 2, 2023

Hi. Sorry I've only skimmed this; that sounds like a nasty bug.

RE client replacement, this issue on php-memcached is probably the wrong place to discuss, so I created an issue topic over here: dormando/mcmc#2

TL:DR; I wrote a core of a client parser (it's messy because I didn't know what direction to go in at the time) which is in use for the proxy code + some other utilities. I'm trying to find time to expand this into a full client soon but am not sure what a useful replacement looks like.

I wrote up my goals and would invite anyone on this thread who cares (certainly @m6w6 as you seem to have taken ownership of libmemcached) to come help with at least feedback if not code.

@k0ka
Copy link

k0ka commented Mar 3, 2023

Finally, I'm definitely not understanding why memcached_io_poll immediately returns when the timeout is set to zero? Maybe the socket actually IS readable without a timeout?

My guess is: in most cases it will return timeout. As we asked something from the server and it took some time to process our request. So they decided to make poll_timeout=0 a convenient way to get TIMEOUT error. It's definitely usable in tests. Maybe it can be used in some weird real world examples.

m6w6 added a commit to awesomized/libmemcached that referenced this issue Mar 6, 2023
* Fix [gh #107](#107):
  macOS: deprecated sasl API (improve detection of `libsasl2`).
* Fix [gh #131](#131):
  Consider renaming tools (add `CLIENT_PREFIX` build option; default: `mem`)
* Fix [gh #132](#132):
  Add build of static library (add `BUILD_SHARED_LIBS` build option; default: `ON`).
* Fix [gh #134](#134):
  Update client option documentation.
* Fix [gh #136](#136):
  `libmemcachedutil` is underlinked (link against libmemcached).
* Fix [gh php-memcached#531](php-memcached-dev/php-memcached#531):
  `get` returns random values when lower than default `OPT_POLL_TIMEOUT` is set.
@WPprodigy
Copy link

WPprodigy commented Jul 17, 2023

Wanted to mention (for others that may come looking) that I've seen this occurring even without OPT_POLL_TIMEOUT configured. The only two possibly related options that were set are:

Memcached::OPT_CONNECT_TIMEOUT => 1000, (which is the default)
Memcached::OPT_TCP_NODELAY     => true,

Notably this issue only occurs when the memcached server nodes are experiencing some sort of stress, like kernel blocking under memory pressure and page eviction.

I've had a hard time replicating without OPT_POLL_TIMEOUT, but can do so consistently with that option here https://github.com/WPprodigy/memcached-inconsistency-debugging/ by sending pause/continue signals to the memcached servers . And can also confirm that updating libmemcached from 1.1.3 to 1.1.4 does resolve what I've replicated in that test environment. TBD still if it will iron out the issue in production.

Update: Can replicate without OPT_POLL_TIMEOUT(using only the two options listed above, or even no overrides at all) by sending SIGSTOP/SIGCONT signals rather than SIGHUP/SIGCONT. Even happens without persistent connections. I've updating the replication repo with the new signals. Also resolves itself in libmemcached 1.1.4, so I think we're all good here 🎉

@ondrejcech
Copy link
Author

Because we've moved to Redis, I can't tell if this issue is fixed by ibmemcached 1.1.4. Can someone else please confirm that so I can close this issue? cc @k0ka @m6w6

@k0ka
Copy link

k0ka commented Jul 20, 2023

yep, it was fixed

@ondrejcech
Copy link
Author

Thank you!

@sodabrew
Copy link
Contributor

Thank you all for the report, reproduction case, and follow-up that this was fixed upstream!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants