Caching issues can be one of the hardest things to track down in a website. This is especially the case when you have external caches involved that can't be entirely replicated on a local stack, like AWS's ElastiCache Redis in clustered mode!
We all know that cache invalidation is one of the two hardest things in software development. I think that can be shortened to just caches in general. It always seems like the most frustrating and time consuming bug reports are in and around some sort of cache. And this issue is no different.
We were noticing a large number of exceptions being thrown from our cache backend with the extremely helpful message 😱
Error processing response from Redis node!
We are using AWS's ElasticCache clustered backend (provisioned auto-magically with Skpr), connected to Drupal using the Redis module with a patch to implement the cluster client. On our 2 largest websites these exceptions were flooding the logs with no rhyme or reason. There was no way to reproduce them and they only happened on production. And the message above was literally all the information we had 😓
After a lengthy discussion with the community we tested out many different avenues. The first was to switch on persistent connections with $settings['redis.connection']['persistent'] but while that did solve the huge number of new connections, it didn't stop the exceptions coming.
Our next plan of attack was to try and whittle down the larger cache entries and see if that solved anything. Our first attempt was the entry that holds block plugin information. We trimmed this down a bit by removing plugins that we didn't need with a hook_block_alter implementation. Sadly this didn't stop the exceptions coming either 🤬
Finally, in a desperate attempt to get more information, I patched the cache backend to log the key that was being fetched if an exception occurred. You can see this in action in this patch on the cluster integration issue.
Huzzah! We were finally on to something. It turns out that every single time it was the same key 🙌
Redis cluster exception for key our-site-prod-default:default:token_info_sorted:en
So we were on the right path with trimming the block plugins, we just were trimming the wrong thing! A simple hook_token_info_alter and a deployment later and we are a week and a half with absolutely no errors processing those responses from the Redis nodes. The solution is quite specific to our site and architecture but in essence the fix was:
- Unset everything in $info['types'] that would never need to be seen. From what I can tell, the only place this is used is when browsing tokens. You can be very aggressive here since removing a token from this array doesn't result in it no longer being replaced.
- Unset everything in $info['tokens'] that would never be used. You can be as aggressive as you and your test coverage are comfortable with here. Tokens are generated for every field on every entity type and from what I can see there are duplicates like node-body and node:body so you can remove large amounts of data for tokens that will never be used (e.g scheduled_transition, media, or even block_content entity types).
Overall we were able to reduce the size of this cache entry by 52% just by deleting stuff that was never even used!
Although this bug/fix may seem specific to our site, it does seem that there is a common pattern with Redis not quite enjoying large volumes of data in a single cache entry. Drupal can quite easily bloat these info style cache entries quite a bit when your site has a large number of Entity types or Fields (or both!).
Thanks for these notes. I've been struggling with some redis caching issues that also seem to be related to large cache entries (e.g. my 1 Gb cache gets filled up in a few hours).
I used the rdb tool to analyse dumps from redis (off line), which looked cool and was interesting, but didn't really provide the aha moment I thought it would. It did help me trim out some cache entries and figure out where some of the space was going.
The redis module's report itself was a useful tool in pointing out the cache entries with lots of variations, but it seems like a more detailed redis cache analysis module would be a great addition.
One thought I had was that perhaps the recommendation of using the redis cache by default isn't the best for some sites, but I actually couldn't figure out how to exclude particular bins. I did end up disabling the internal static page cache, since I've got varnish replicating that function for the most part anyway, that saved a lot of space in the redis cache.
I'm really interested to see that you're running Drupal with AWS ElastiCache Redis cluster mode enabled.
I have been unable to do this, as keep coming up against the CROSSLOT error (https://aws.amazon.com/premiumsupport/knowledge-center/elasticache-cros…), as Drupal is generating multi-key redis commands which span multiple slots, therefore that error is received. Is the patch you mentioned actually allowing you to work around that?