Discussion:
[Sks-devel] Peering Issues - High IO ending with Eventloop.SigAlarm always occurs with 1 peer
Todd Fleisher
2018-10-08 20:54:46 UTC
Hi All,
I recently joined the pool and started having an issue after adding a second external peer to my membership file. The symptom is abnormally high IO load on the disk whenever my server tries to reconcile with the second peer (149.28.198.86), ending with the failure message "add_keys_merge failed: Eventloop.SigAlarm". It consistently tries to reconcile a large number of keys (100) when this happens.

I’ve read previous list threads about this message (e.g. https://lists.nongnu.org/archive/html/sks-devel/2018-06/msg00051.html and https://lists.nongnu.org/archive/html/sks-devel/2011-06/msg00077.html) which suggest the cause may be a large key that fails to process. I tried increasing client_max_body_size from 8m to 32m in NGINX, but the issue persisted. For now, I have removed the second peer from my membership file to avoid over-taxing my server for no apparent benefit. I have included an excerpt of my logs showing the behavior. Can someone please advise what might be causing this issue and what can be done to resolve it? Thanks in advance.
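
For reference, the NGINX change looked roughly like the sketch below. The hostname, addresses and ports are placeholders for my actual reverse-proxy setup; only the client_max_body_size bump from 8m to 32m is the change described above.

    # NGINX reverse proxy in front of SKS (sketch; names/ports are placeholders).
    # Assumes SKS's HKP interface is bound to localhost, e.g. in sksconf:
    #   hkp_address: 127.0.0.1
    #   hkp_port: 11372
    server {
        listen 11371;                           # public HKP port
        server_name keyserver.example.com;      # placeholder hostname

        location / {
            proxy_pass http://127.0.0.1:11372;  # SKS HKP interface
            client_max_body_size 32m;           # raised from 8m; the SigAlarm issue persisted
        }
    }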



-T
Todd Fleisher
2018-10-10 18:00:44 UTC
Hi All,
I wanted to follow up on this and add some new data points. I tried building some new SKS instances from a more recent dump (specifically 2018-10-07 from https://keyserver.mattrude.com/dump/) and found those instances were plagued by the same issue as soon as I began peering them with my existing instances. When I re-built the new instances from an older dump (2018-10-01, from the same source), the issues went away. This suggests that some problematic data introduced into the pool during the first week of October is causing the issues.
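
For anyone wanting to reproduce this, the rebuild procedure I used is essentially the standard recipe from the SKS README; the paths, the dump date and the -n/-cache values below are illustrative and should be tuned for your own setup:

    # Fetch a dump into the SKS base directory (the layout under /dump/ may differ).
    mkdir -p /var/lib/sks/dump && cd /var/lib/sks/dump
    wget -r -np -nH --cut-dirs=2 -A '*.pgp' https://keyserver.mattrude.com/dump/2018-10-01/

    # Build the key database and the prefix tree from the dump (stop sks first).
    cd /var/lib/sks
    sks build dump/*.pgp -n 10 -cache 100
    sks cleandb
    sks pbuild -cache 20 -ptree_cache 70

    # Make sure the files are owned by whatever user sks runs as (debian-sks here).
    chown -R debian-sks: /var/lib/sks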

I found an existing issue logged about this behavior @ https://bitbucket.org/skskeyserver/sks-keyserver/issues/61/key-addition-failed-blocks-web-interface

For now, I’m able to keep my instances stable by building them from the earlier 2018-10-01 dump and not adding the second peer to my membership file. I would like to better understand why this is happening and figure out how to go about fixing it, in part so I can begin peering with more servers to improve the mesh.
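
Concretely, my membership file currently looks something like the sketch below; the first peer's name is a placeholder, and the second peer is simply commented out until the underlying problem is understood:

    # SKS membership file: one "hostname recon_port" pair per line (path varies by install).
    keyserver.example.net 11370    # first external peer (placeholder name)
    #149.28.198.86 11370           # second peer, disabled while the SigAlarm issue persists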

-T
Paul Fawkesley
2018-10-20 06:38:10 UTC
Hi Todd, for what it's worth, I've been experiencing this too since March.

The hangs are so severe that my keyserver would fail to respond to requests. Rather than give users of the pool a poor experience, I removed myself from it.

Anecdotally, it appears other keyservers still in the pool are similarly affected: I see high rates of timeouts and failures when using the pool these days.

I installed Hockeypuck on another server and peered it with my SKS instance. It syncs successfully, but Hockeypuck *also* goes nuts periodically while syncing: its memory and CPU usage rockets, often pushing into gigabytes of swap, so that server is pretty unresponsive too.

I'm about to arrive at the OpenPGP Email Summit in Brussels; I'm sure this will come up as a topic, and I shall report back...

Paul