[Sks-devel] SKS intermittently stalls with 100% CPU & rate-limiting

Discussion:

Pete Stephenson

2018-06-17 00:18:55 UTC

Hi all,

My server, ams.sks.heypete.com, has been suffering from periods where
the amount of CPU used by the sks process goes to 100% for a few minutes
at a time. During this time, my Apache reverse proxy produces errors of
the following type (client IP address obfuscated for their privacy):

[Sun Jun 17 00:00:31.414596 2018] [proxy:error] [pid 4648:tid
139657505371904] [client CLIENT_IP:40327] AH00898: Error reading from
remote server returned by /pks/lookup

This happens across a range of client IP addresses, so it doesn't appear
to be a single malicious user. Rather, it seems that something is
causing the sks process to stall and connections to it time out.
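A quick tally can confirm that the errors really are spread across many clients. The following is a minimal Python sketch, assuming an Apache 2.4 error log with one entry per line in the format quoted above. The function name and the per-minute bucketing are illustrative choices, not part of Apache or SKS.

```python
import re
from collections import Counter

# Matches one unwrapped Apache 2.4 error-log entry reporting AH00898.
LINE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[proxy:error\] \[pid [^\]]+\] "
    r"\[client (?P<ip>[^\]]+):\d+\] AH00898:"
)


def tally_proxy_timeouts(lines):
    """Count AH00898 errors per client IP and per minute."""
    per_client = Counter()
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not an AH00898 proxy error
        per_client[match.group("ip")] += 1
        # "Sun Jun 17 00:00:31.414596 2018" -> "Jun 17 00:00 2018"
        _dow, month, day, clock, year = match.group("when").split()
        per_minute[f"{month} {day} {clock[:5]} {year}"] += 1
    return per_client, per_minute
```

It could be run as `tally_proxy_timeouts(open("/var/log/apache2/error.log"))`, where the path is an assumption based on Debian defaults. Many distinct clients each with a few errors, clustered in the same minutes, would fit a stalled backend better than a single abusive user.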

After a minute or two, CPU usage drops to the normal value of a few
percent up to 15%, with queries being promptly answered until the CPU
usage spikes again and things stall out.

The server is in close sync with its peers, with no particular issues on
the recon side.

Any ideas what might be causing this? I'm running 1.1.6 on Debian, and
things have generally been working well for several years. For good
measure, I recently deleted the key database and recreated it from a
fresh dump, but that had no effect.

Potentially related: several clients, evidently corporate mail servers
that query the SKS pool for every email they send or receive, are making
dozens of queries per second to my server. Is it reasonable to impose
rate limits on such clients (e.g. no more than X queries in Y seconds)?
If so, what would reasonable values be for X and Y?

Thank you.

Cheers!
-Pete

--
Pete Stephenson

Pete Stephenson

2018-06-17 00:46:00 UTC

[snip]

As a follow-up, I set a rate limit on ports 80, 443, and 11371 to 6
requests per 30 second window. The high-speed queries ceased almost
immediately and are being blocked by the firewall (they still continue
making their rapid queries, but SKS doesn't see more than 12 a minute).

More ordinary queries are seemingly not affected: only between one and
three IP addresses are being rate limited.
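For illustration, the "6 requests per 30 second window" policy can be expressed as a sliding-window counter. This is a minimal Python sketch of the idea at application level, not the firewall rule actually used here (the thread does not show it). The class name and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window` seconds."""

    def __init__(self, limit=6, window=30.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # client -> times of allowed requests

    def allow(self, client):
        now = self.clock()
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over budget: the caller would drop the request
        hits.append(now)
        return True
```

With the defaults, a client that never pauses gets at most 6 requests through in any 30 seconds, which matches the "no more than 12 a minute" figure above.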

However, this doesn't seem to resolve the problem: SKS still pegs the
CPU meter at 100% for a minute or so every minute or two, with all
queries hanging until it sorts out what's going on.

--
Pete Stephenson

Moritz Wirth

2018-06-17 00:47:39 UTC

Hi,

seems like that is the "problem":

https://bitbucket.org/skskeyserver/sks-keyserver/issues/60/denial-of-service-via-large-uid-packets
https://bitbucket.org/skskeyserver/sks-keyserver/issues/57/anyone-can-make-any-pgp-key-unimportable

Best regards,

Moritz

Pete Stephenson

2018-06-17 03:53:08 UTC

Thanks.

I then have three more questions:

1. If this issue is affecting my server to the point of it being booted
from the pool (since it's stalling near-continuously and can't respond
to queries), why are other servers not being similarly affected? There
are lots of servers still in the pool.

2. Is there some countermeasure one can use to protect their server? I
have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
clearly something is still annoying the server.

3. Any suggestions on how to deal with the unreasonably high-speed
queries from corporate mail systems? Ideally, they'd run their own
server locally to handle their huge amount of queries, but I have no
real way of communicating that with them. I'd love to slow down their
queries (tarpitting, maybe?) to minimize excess resource consumption
while still answering their queries as opposed to just cutting them off
once they hit a rate limit.
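The tarpitting idea in question 3 can be sketched as a delay that grows once a client exceeds its budget, instead of a hard cut-off. This is a hypothetical Python illustration only. The class name, the linear `step` penalty and the `max_delay` cap are assumptions, not an existing SKS, Apache or firewall feature.

```python
import time
from collections import defaultdict, deque


class Tarpit:
    """Return a per-request delay that grows as a client exceeds its budget."""

    def __init__(self, budget=6, window=30.0, step=0.5, max_delay=10.0,
                 clock=time.monotonic):
        self.budget = budget        # requests per window served without delay
        self.window = window        # window length in seconds
        self.step = step            # extra seconds per request over budget
        self.max_delay = max_delay  # upper bound on the delay
        self.clock = clock
        self._hits = defaultdict(deque)  # client -> times of recent requests

    def delay_for(self, client):
        now = self.clock()
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        excess = len(hits) - self.budget
        if excess <= 0:
            return 0.0
        return min(excess * self.step, self.max_delay)
```

A front end would sleep for `delay_for(client)` seconds before passing the request on, so heavy users are still answered but slowed down. Delayed connections stay open on the front end, so the cap needs to be chosen with its connection limits in mind.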

Cheers!
-Pete

_______________________________________________
Sks-devel mailing list
https://lists.nongnu.org/mailman/listinfo/sks-devel

--
Pete Stephenson

Paul M Furley

2018-06-17 07:59:19 UTC

Hi Pete,

I certainly should've been booted from the pool since my server has
filled up its disk and trashed its database (twice) so it was offline
all of yesterday.

I'm bringing it back up with the `set_flags DB_LOG_AUTOREMOVE` setting
this time which will hopefully save it.
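To see whether that setting is in place and how much space the Berkeley DB transaction logs take, a small check like the following could be used. It is a Python sketch under stated assumptions: the flag lives in a `DB_CONFIG` file inside the database directory, and the logs are the `log.*` files in that same directory. The function names are made up for this example.

```python
from pathlib import Path


def log_autoremove_enabled(db_dir):
    """True if DB_CONFIG in db_dir switches DB_LOG_AUTOREMOVE on."""
    config = Path(db_dir) / "DB_CONFIG"
    if not config.is_file():
        return False
    for raw in config.read_text().splitlines():
        if raw.startswith("#"):
            continue  # comment line
        words = raw.split()
        if words[:2] == ["set_flags", "DB_LOG_AUTOREMOVE"]:
            # An optional trailing "on"/"off" may follow; absent means on.
            return len(words) < 3 or words[2].lower() != "off"
    return False


def transaction_log_usage(db_dir):
    """Return (file count, total bytes) of the log.* files in db_dir."""
    logs = [p for p in Path(db_dir).glob("log.*") if p.is_file()]
    return len(logs), sum(p.stat().st_size for p in logs)
```

For example, `transaction_log_usage("/var/lib/sks/DB")` would report the log count and size, with the directory path being an assumption that depends on the installation.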

Post by Pete Stephenson
2. Is there some countermeasure one can use to protect their server? I
have LimitRequestBody set to 8000000 (8MB) to prevent blatant abuse, but
clearly something is still annoying the server.

It appears from Rob's previous email that our servers are failing to
synchronise a 22M key (because of settings like this) which is causing

Post by Pete Stephenson
The size is causing timeouts on some reverse proxies and the constant

retries is causing the .log files to be created and growing in the DB
directory.

Post by Pete Stephenson
3. Any suggestions on how to deal with the unreasonably high-speed
queries from corporate mail systems? Ideally, they'd run their own
server locally to handle their huge amount of queries, but I have no
real way of communicating that with them. I'd love to slow down their
queries (tarpitting, maybe?) to minimize excess resource consumption
while still answering their queries as opposed to just cutting them off
once they hit a rate limit.

Are you sure these users are the cause of your troubles? Or is it this
constant-retry loop caused by this large key?

I'd suggest contacting them before rate limiting them, ask them to point
at the pool or slow down their queries.

Paul

Pete Stephenson

2018-06-18 03:08:53 UTC

Post by Paul M Furley
Hi Pete,

I certainly should've been booted from the pool since my server has
filled up its disk and trashed its database (twice) so it was offline
all of yesterday.
I'm bringing it back up with the `set_flags DB_LOG_AUTOREMOVE` setting
this time which will hopefully save it.

Yeah, I added the same line. There are now just two log files rather than
dozens. It seems to work OK in controlling disk space usage, but it
doesn't seem to do anything about the spikes in CPU usage,
non-responsiveness, etc.

Post by Paul M Furley

It appears from Rob's previous email that our servers are failing to
synchronise a 22M key (because of settings like this) which is causing

The server had been running with no limits on the request body size for
several years without problems. I added that line in the hope of
keeping things from getting worse. I've since removed it, but that
doesn't seem to have had much of an effect.

Is there something I can do now to (a) resolve the problem with this key
(e.g. by adding it to the server locally, so it won't keep choking while
retrying) and (b) prevent such issues from occurring in the future?

Post by Paul M Furley

Are you sure these users are the cause of your troubles? Or is it this
constant-retry loop caused by this large key?

I don't know.

Regardless, I do think that the high-volume users are being a bit
unreasonable: SKS queries are relatively "heavy" compared to lightweight
queries like those to DNSbls, so making queries to the SKS pool for each
email sent or received seems excessive, but that may just be me.

Anyway, I've removed the rate limits since they didn't seem to have any
effect on the constant-retry loop or stalling.

Post by Paul M Furley
I'd suggest contacting them before rate limiting them, ask them to point
at the pool or slow down their queries.

I think they already were querying the pool, and just happened to get my
server as part of the rotation. I just sent them an email inquiring
about the number of queries and encouraging them to run their own server
and have it join the pool, but in general I don't have the time or
motivation to contact every potentially abusive user. I'm just curious
whether there are any recommended practices for throttling abusive users.

Cheers!
-Pete

--
Pete Stephenson

Pete Stephenson

2018-06-21 22:11:52 UTC

Post by Paul M Furley
Hi Pete,

It appears from Rob's previous email that our servers are failing to
synchronise a 22M key (because of settings like this) which is causing

It's been four days and my server is still stalling and connections time
out. The server is regularly being added to and removed from the pool.

I've removed the Apache LimitRequestBody directive in my relevant
reverse proxy configuration file, but what else can I do to stop this
continuous cycle such that my server is again a stable member of the pool?

Post by Paul M Furley

Are you sure these users are the cause of your troubles? Or is it this
constant-retry loop caused by this large key?
I'd suggest contacting them before rate limiting them, ask them to point
at the pool or slow down their queries.

It turns out that contacting them was the right thing to do: they've
implemented a caching proxy on their end to minimize the load to the
pool and are looking at running their own server going forward.
Excellent. Thanks for the suggestion.
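The thread does not say how their caching proxy works. As a rough illustration of the idea only, a client-side cache that remembers each lookup for a fixed time might look like the Python sketch below. The class name, the `fetch` callable and the one-hour default are all assumptions for the example.

```python
import time


class TTLCache:
    """Remember lookup results for `ttl` seconds to avoid repeat queries."""

    def __init__(self, fetch, ttl=3600.0, clock=time.monotonic):
        self.fetch = fetch  # callable: key id -> response body
        self.ttl = ttl
        self.clock = clock
        self._store = {}    # key id -> (expiry time, response body)

    def get(self, key_id):
        now = self.clock()
        cached = self._store.get(key_id)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh: no query to the keyserver
        body = self.fetch(key_id)
        self._store[key_id] = (now + self.ttl, body)
        return body
```

A mail system that looks up the same correspondents' keys for every message would then hit the pool once per key per hour instead of once per email.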

Cheers!
-Pete

--
Pete Stephenson

Moritz Wirth

2018-06-22 01:32:54 UTC

I am afraid there is not much you can do about this right now - the pool
itself is very unstable and crashes multiple times per day.

I found over 8 key hashes which cause an Eventloop - this happens every
2-3 minutes, sometimes with the same key, sometimes with other keys.

Best regards,

Moritz Wirth

2018-06-17 10:33:41 UTC

I have an idea about this, although I am not sure that this is still the
same problem.

The spider that checks the availability of the keyservers requests
/pks/lookup?op=get&search=0x16e0cf8d6b0b9508 - which contains the
problematic key (just look it up..).

I am not sure that this is the actual problem, but just imagine the
request for the key causes massive load - the request is not answered and
your keyserver is kicked out of the pool.
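An operator could test this theory by timing the same request against their own server. Below is a minimal Python sketch using only the standard library. The function names and the 30-second timeout are assumptions, and the timeout the pool's spider really uses is not stated in this thread.

```python
import http.client
import time
import urllib.request


def lookup_url(host, key_id, port=11371):
    """Build the lookup URL the pool spider is described as requesting."""
    return f"http://{host}:{port}/pks/lookup?op=get&search={key_id}"


def time_lookup(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return (seconds elapsed, ok); ok is False on network or HTTP errors."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # read the whole body, as a client would
        ok = True
    except (OSError, http.client.HTTPException):
        ok = False  # timeouts and URLError are OSError subclasses
    return time.monotonic() - start, ok
```

Calling `time_lookup(lookup_url("ams.sks.heypete.com", "0x16e0cf8d6b0b9508"))` while watching CPU usage would show whether this one request is slow enough to fail an availability check.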

Paul Fontela

2018-06-25 11:08:55 UTC

Hello everyone,
without wishing to poke at a sore spot...

I have spent almost 10 days investigating the problem described in
several threads on the [Sks-devel] list: SKS servers falling over under
abusive requests.

I have tried almost everything, from downloading a dump and restarting
the SKS server to reinstalling the system and everything else. The
result is always the same: it works well for a while, sometimes an hour,
sometimes a little more, and then the key server suddenly freezes,
reaching 80% RAM, which makes it unstable and inoperable.

Of the three servers that I have, only two are surviving, with
difficulty, this strange problem that has appeared "suddenly". I
wonder the following:

Is there any way to solve this problem?

Checking the Nginx and SKS logs, I have seen that some clients query
without pause for long periods.

Is it possible to block mercenaries who do not want to spend a few
dollars to set up their own key server?

What happens to those huge keys that clog servers?

Is it possible to limit or block scripted queries and allow queries
only through the web interface?

Given what I have seen, I'm going to stop one of the servers, the
smallest of them, which is hosted at the site that had been working best
until now. It is a small virtual machine with little RAM (1 GB), and it
is the server causing me the most problems. I think it is not worth
having a server running 24 hours a day if it only fulfills its mission
for 30 minutes a day and I have to watch over it to restart services
every time it hangs.

I will keep the other servers until I see that they start giving me
problems too; if this happens, I will have to make a difficult decision.

What I do not want is to have machines consuming electricity,
bandwidth and resources while not fulfilling their mission.

Greetings to all and a lot of encouragement.
Paul Fontela

--
Paul Fontela
keyserver.ispfontela.es 11370 # Paul Fontela <***@ispfontela.es> 0x31743FFC33E746C5
a.0.na.ispfontela.es 11370 # Paul Fontela Gmail <***@gmail.com> 0x3D7FCDA03AAD46F1

Gabor Kiss

2018-06-25 13:37:12 UTC

Post by Paul Fontela
I have tried almost everything, from downloading a dump and restarting
the SKS server to reinstalling the system and everything else. The
result is always the same: it works well for a while, sometimes an hour,
sometimes a little more, and then the key server suddenly freezes,
reaching 80% RAM, which makes it unstable and inoperable.

Eeerrr... A few years ago I had a similar problem.
See thread at http://lists.nongnu.org/archive/html/sks-devel/2015-03/msg00004.html

Regards

Gabor

Paul Fontela

2018-06-26 09:52:10 UTC

Hi Phil,

Thank you very much for your interest and your answer. The server
keyserver.ispfontela.es has no problems; in fact, it has been able to
synchronize almost 200,000 keys in less than 2 hours. That machine is
powerful, with a large processor and a lot of RAM. The one with a
serious problem is *a.0.na.ispfontela.es*, a virtual host that has only
1 GB of RAM. It had always worked well until a few days ago, when it
suddenly began to suffer what other colleagues describe: even with an
up-to-date database of more than 5,100,000 keys, it got stuck and
stopped. I then asked myself:
If nothing has been modified in the configuration of the server or in
the SKS service, what has happened?
That's when I started a battery of tests:
1 - Changes to the Nginx configuration.
2 - Rebuilding the key database from scratch with a new dump.
3 - Reinstalling the system (Ubuntu).
4 - Other modifications (adding swap, which the machine did not have).

The result was always the same: shortly after SKS started, RAM
consumption rose to 80% and never decreased.

Could some system update have affected it?

Today it is synchronizing with only 2 peers, starting from 26,000 keys,
until it reaches 5,100,000; with that I will know more or less what is
happening.

I have seen that some other servers that are also hosted on Amazon
datacenters are suffering from the same problem, could it be Amazon, I
do not know, I can not answer that yet.

I will continue investigating, and if in the end it does not improve, I
will shut down that server and leave only keyserver.ispfontela.es
running, which for the moment works well.

Post by Phil Pennock
That sounds like recon gone wild, normally a sign that you're peering
with someone who is very much behind on keys. The recon system only
works if your peers are "mostly up-to-date".
This is why we introduced the template for introducing yourself to the
community, in the Peering wiki page, showing how many keys you have
loaded. It cut down on people joining with 0 keys, expecting recon to
do all the work, and new peers complaining that their SKS was hanging.
Per <https://sks-keyservers.net/status/> the lower bound of keys to be
included is: 5105570
You have: 5109664
Using <http://keyserver.ispfontela.es:11371/pks/lookup?op=stats> as a
starting point, and skipping your in-house 11380 peers, opening all the
5109604                  keys.niif.hu
5065412                  keys.sbell.io
5107576                  sks.mbk-lab.ru
5109585                  pgp.neopost.com
5108773                  pgp.uni-mainz.de
5109639                  pgpkeys.urown.net
4825075                  pgp.key-server.io
<can't connect>          sks.funkymonkey.org
5084241                  keyserver.iseclib.ru
5109254                  keyserver.swabian.net
5109628                  sks-cmh.semperen.com
<sks down behind proxy>  keys-02.licoho.de
5109629                  keyserver.dobrev.eu
5109121                  sks.mirror.square-r00t.net
5109629                  keyserver.escomposlinux.org
5108778                  keyserver.lohn24-datenschutz.de
If your in-house peers are way behind, fix that.
Comment out all peers with fewer than 5_100_000 keys. Restart sks and
sks-recon.
The 284,000 key difference is pretty severe. Since that peer isn't
getting updates, they're probably hanging on peering and causing even
more problems for you.
Disable peering _at least_ with those three hosts.
Whenever SKS isn't performing right, the _first_ step after looking for
errors in logs should always be a Peering Hygiene Audit. Find the peers
who are sufficiently behind that their keeping the peering up is
anti-social and likely causing _you_ problems, comment out the peering
entries, restart (for a completely clean slate) and then reach out to
those peers to ask "Hey, what's up?".
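The first pass of such an audit can be mechanised. The Python sketch below applies the 5_100_000 cut-off to the key counts from the list above. The function name and the use of `None` for unreachable peers are illustrative assumptions, and collecting the counts from each peer's stats page is left out.

```python
def lagging_peers(peer_counts, threshold=5_100_000):
    """Return peers with a known key count below threshold, worst first."""
    return sorted(
        (host for host, count in peer_counts.items()
         if count is not None and count < threshold),
        key=lambda host: peer_counts[host],
    )


# Key counts copied from the list above; None marks peers with no count.
PEERS = {
    "keys.niif.hu": 5109604,
    "keys.sbell.io": 5065412,
    "sks.mbk-lab.ru": 5107576,
    "pgp.neopost.com": 5109585,
    "pgp.uni-mainz.de": 5108773,
    "pgpkeys.urown.net": 5109639,
    "pgp.key-server.io": 4825075,
    "sks.funkymonkey.org": None,
    "keyserver.iseclib.ru": 5084241,
    "keyserver.swabian.net": 5109254,
    "sks-cmh.semperen.com": 5109628,
    "keys-02.licoho.de": None,
    "keyserver.dobrev.eu": 5109629,
    "sks.mirror.square-r00t.net": 5109121,
    "keyserver.escomposlinux.org": 5109629,
    "keyserver.lohn24-datenschutz.de": 5108778,
}

print(lagging_peers(PEERS))
# ['pgp.key-server.io', 'keys.sbell.io', 'keyserver.iseclib.ru']
```

These are the three hosts referred to above. The worst of them is 5109664 - 4825075 = 284589 keys behind, which is the "284,000 key difference". Unreachable peers are not flagged by this check and need a separate look.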
Regards,
-Phil

--
Paul Fontela
keyserver.ispfontela.es 11370 # Paul Fontela <***@ispfontela.es> 0x31743FFC33E746C5
a.0.na.ispfontela.es 11370 # Paul Fontela Gmail <***@gmail.com> 0x3D7FCDA03AAD46F1

John Zaitseff

2018-06-26 10:16:48 UTC

Hi, everyone,

Post by Paul Fontela
If nothing has been modified in the configuration of the server or
in the SKS service, what has happened?

As others have commented at length, could this indeed be related to
malicious or problematic keys?

Post by Paul Fontela
I have seen that some other servers that are also hosted on Amazon
datacenters are suffering from the same problem, could it be
Amazon, I do not know, I can not answer that yet.

The problem is definitely more widespread than Amazon. I am seeing
the same issues on my physical server located in Sydney, Australia.

My server has plenty of memory and disk space, so that is not an
issue (/var/lib/sks/DB is currently 118GB), but one processor core
continually goes in and out of being 100% utilised by the
single-threaded "sks db" process.
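To record when the process pegs a core, rather than watching it by eye, CPU use can be computed from two samples of `/proc/<pid>/stat` on Linux. This is a minimal Python sketch. The function names are made up for the example, and the default of 100 clock ticks per second is an assumption that `os.sysconf("SC_CLK_TCK")` can confirm on a given machine.

```python
def cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/stat line."""
    # The command name is parenthesised and may contain spaces, so split
    # after the last ')'; the remaining fields start at field 3 (state).
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15
    return utime + stime


def cpu_percent(before, after, elapsed_seconds, ticks_per_second=100):
    """CPU use between two samples, as a percentage of one core."""
    used = (cpu_ticks(after) - cpu_ticks(before)) / ticks_per_second
    return 100.0 * used / elapsed_seconds
```

Sampling the stat file of the "sks db" process every few seconds and logging the result with a timestamp would give a record of the stalls that can be lined up against the SKS and proxy logs.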

I can confirm that I have not changed any major OS component nor the
SKS daemon itself--I'm running an up-to-date Debian installation,
uptime is currently 48 days, and the problems appeared the same time
everyone else's did, just a couple of weeks ago.

Happy to provide log files if anyone is debugging; I myself have not
spent much time on this, nor looked through the SKS source code.

By the way, I tried Phil Pennock's suggestion of removing peers that
were significantly behind mine in terms of number of keys, but that
made no difference to the situation.

Yours truly,

John Zaitseff

--
John Zaitseff                    ,--_|\    The ZAP Group
Phone:  +61 2 9643 7737         /      \   Sydney, Australia
E-mail: ***@zap.org.au          \_,--._*   http://www.zap.org.au/
                                      v