After some research into the inner workings of the phpBB software (which powers DarwinCentral and a huge number of other internet forums), I found something unexpected.
As you may recall, we've been having problems with high CPU demand for a while now: the forum auto-offlines itself when CPU hits a certain max level, and since search bots seemed to account for a big chunk of the load, we turned most of them (other than Google) off in the Admin Control Panel.
Still, things have been running at high CPU for a while, getting really bad a few days ago until I did some digging and found out what was really going on. It's counter-intuitive, and I wonder how many other phpBB admins aren't aware of this (and are doing the wrong thing).
Here's the bot page from the Admin Control Panel:
See that "Deactivate" option on each bot? Gee, you'd think that would disable that bot from visiting your site and using forum resources, right? What I found is that a) no it doesn't, and b) deactivating a bot entry actually puts *more* load on the forum.
Nerdy explanation: Every request to the forum is tracked in a "session", and sessions are managed in the phpbb3_sessions database table. If you're logged in as a member, all of your activity over a long period (several hours) is handled by a single long-lived session record.
But forum requests by non-logged-in users generate a fresh session for every single web "hit" -- if they look at 20 threads even over just a minute or two, that generates 20 session records in the forum database (each of which is assigned to user_id "1", which is the "anonymous" user record). Here are the most recent records in the sessions table (I've truncated the session_ip_address column to protect privacy):
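To make the difference concrete, here's a minimal sketch of the session behavior described above. It's in Python rather than phpBB's actual PHP, and the names (`ANONYMOUS_USER_ID`, `handle_hit`, the dict standing in for the phpbb3_sessions table) are illustrative, not phpBB's real code:

```python
import time
import uuid

ANONYMOUS_USER_ID = 1  # phpBB assigns anonymous traffic to user_id 1

sessions = {}  # stand-in for the phpbb3_sessions table: session_id -> row

def handle_hit(user_id=None, existing_session_id=None):
    """Return a session id for this hit, mimicking the behavior described above."""
    if user_id is not None and existing_session_id in sessions:
        # Logged-in member: reuse the single long-lived session record.
        sessions[existing_session_id]["last_seen"] = time.time()
        return existing_session_id
    # Anonymous visitor (or unrecognized bot): every hit creates a new row.
    sid = uuid.uuid4().hex
    sessions[sid] = {"user_id": user_id or ANONYMOUS_USER_ID,
                     "last_seen": time.time()}
    return sid

# A member browsing 20 threads keeps one session record...
member_sid = handle_hit(user_id=42)
for _ in range(19):
    handle_hit(user_id=42, existing_session_id=member_sid)

# ...while 20 anonymous hits create 20 separate rows.
for _ in range(20):
    handle_hit()

print(len(sessions))  # 21 rows: 1 member session + 20 anonymous sessions
```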
These session records (like all others) hang around the database for several hours until they expire of "old age". When there get to be too many of them, forum CPU goes up and response time goes down, because just about everything anyone does with the forum requires at least a search (and sometimes an insertion/cleanup) of the sessions table records. The more records that have to be waded through, the longer that operation takes, and if forum hits are coming in faster than sessions table operations can be completed, a traffic jam occurs and things just spiral out of control.
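The "expire of old age" part amounts to a periodic sweep of the table. A rough sketch of that cleanup, with an illustrative timeout value (phpBB's actual session length is a configurable setting):

```python
import time

SESSION_LENGTH = 3600  # seconds a session may sit idle before expiring (illustrative value)

# Stand-in for the phpbb3_sessions table: session_id -> last-activity timestamp.
sessions = {
    "old1": time.time() - 7200,   # idle for 2 hours -> expired
    "old2": time.time() - 5000,   # idle for ~83 minutes -> expired
    "fresh": time.time(),         # just active -> kept
}

def gc_sessions(now=None):
    """Drop session rows that have expired of 'old age'."""
    now = now or time.time()
    expired = [sid for sid, last_seen in sessions.items()
               if now - last_seen > SESSION_LENGTH]
    for sid in expired:
        del sessions[sid]
    return len(expired)

print(gc_sessions())  # 2 expired rows removed; only "fresh" remains
```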
Here's a post of mine from last year describing the same problem, back when bad "forum cookie" settings were causing other issues along with a growth in the sessions table (because every page hit even by members looked like a new session): viewtopic.php?p=1198741#p1198741
This also explains why the server would go back to low CPU usage for a while after I rebooted it, which I often did when CPU usage was getting out of control. It worked, but I didn't understand exactly why until now -- when the forum software reboots, it clears the sessions table and starts fresh. However, it wouldn't take long for the table to grow too large again (and CPU usage to rise), due to the following issue.
Long story short (yeah, I know, too late), what the "bots" page actually does is help the forum software recognize multiple hits by a particular bot, so that it can consolidate all of that bot's activity into a single re-usable session record (as if it were a pseudo-member of the forum), rather than generating a new session record for each of its many, many hits on the forum. It can do this because every hit on the forum includes a User-Agent header announcing what kind of browser is making the request (see the "session_browser" column in the above screenshot). Normally that header lets the server tailor the resulting page to what your browser can handle, but well-behaved bots use it to basically say "Hi, I'm the Google bot [or whatever]" (see for example the second row in the above screenshot: that's the "AhrefsBot" visiting, which also announces it's compatible with a Mozilla browser).
The bots table here contains a list of search strings to match against a visiting bot's User-Agent (e.g. "AhrefsBot"). A problem arises when new bots get unleashed on the world that aren't already in the pre-loaded bots table. Those don't get recognized as bots, so each and every one of their forum hits generates a new sessions table record, and we're back to the problem of tens of thousands of records in the sessions table slowing down the forum until the traffic jam occurs again -- not an overload of bandwidth, but an overload of forum database churning.
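The matching logic is essentially a substring search, something like this sketch (again in Python; the `BOT_TABLE` contents and `session_for` name are made up for illustration, not phpBB's actual schema):

```python
# Stand-in for phpBB's bot table: a substring to look for in the User-Agent
# header -> the pseudo-member session to fold that bot's traffic into.
BOT_TABLE = {
    "Googlebot": "session-googlebot",
    "AhrefsBot": "session-ahrefsbot",
}

def session_for(user_agent):
    """Match the User-Agent against the bot table; recognized bots share one
    re-usable session, everything else gets a fresh session record per hit."""
    for needle, session_id in BOT_TABLE.items():
        if needle in user_agent:
            return session_id
    return None  # unrecognized: caller creates a brand-new anonymous session row

# A well-behaved bot announces itself in its User-Agent string:
ua = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
print(session_for(ua))  # "session-ahrefsbot" -- every hit reuses one record

# A brand-new, unlisted bot matches nothing, so each hit spawns a new row:
print(session_for("ShinyNewBot/1.0"))  # None
```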
Once I figured this all out, I sifted through the sessions table to find all the unrecognized bot activity, and I created new "bot table" entries for them so that from now on their traffic will be folded into single sessions table records, which will keep the sessions table down to a manageable size. Check out the before-and-after in this graph of DarwinCentral CPU activity over the past two weeks, it's pretty obvious where I added the new "bot recognizers":
I also added a bot entry for the "pagespeed_mod" feature, which isn't technically a bot. It's a server-side add-on that caches recently viewed forum images as people view them, under the theory that other people will likely be viewing the same pages soon and the images can be served up from the in-memory copy rather than hitting the disk again. For some reason (I can think of several that would make sense) this add-on fetches the images by making its own "web request" via the forum itself rather than just reading them directly off the disk, so it generates session records too -- one session per image fetch, which is effectively the same as the "unrecognized bot" problem. And unlike bots, which don't visit all the time, the pagespeed_mod gets triggered (often multiple times) whenever someone views a page with image(s). So I made a "bot" entry for it, which keeps the sessions table smaller by about another 2,000 records. I'm surprised this isn't a standard thing for phpBB forums, since a lot of them use that add-on.
And now for the counter-intuitive thing. You'd think that using the "deactivate" option on a bot entry would block that bot, but it doesn't. All it does is stop the forum from recognizing the bot anymore, which means that when it does visit it generates thousands of session table records instead of just one, which means that deactivating a bot entry *increases* forum CPU load, not decreases it.
There is in theory a way to actually block a bot, but it's messy and doesn't always work, since it relies on the bot itself looking for and then honoring the "please leave us alone" request. This is done via the "robots.txt" file that can be placed on a website, which contains a list of bots you want to stop visiting your site. But setting it up is a pain, and it's not worth doing as long as the actual amount of bot traffic isn't abusive, which it doesn't seem to be. As long as we keep the size of the sessions table down, our CPU level should be just fine.
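For anyone curious what that looks like, a robots.txt asking one particular bot to stay away while leaving everyone else alone would be something like the following (the bot name is just an example; each crawler's documentation lists the User-agent token it honors, and again, compliance is entirely voluntary on the bot's part):

```
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
```

The file has to sit at the top level of the site (e.g. /robots.txt) for bots to find it.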