1

Topic: iRedApd bottleneck?

==== REQUIRED BASIC INFO OF YOUR IREDMAIL SERVER ====
- iRedMail version (check /etc/iredmail-release): 1.3.2
- Deployed with iRedMail Easy or the downloadable installer? installer
- Linux/BSD distribution name and version: Debian 10
- Store mail accounts in which backend (LDAP/MySQL/PGSQL):  MariaDB
- Web server (Apache or Nginx): nginx
- Manage mail accounts with iRedAdmin-Pro? yes
- [IMPORTANT] Related original log or error message is required if you're experiencing an issue.
====

Dear Zhang,

today we noticed an unexpected behaviour of our iRedMail server.

During a period of many incoming emails, sending mail was almost impossible and very slow.

Further analysis showed that iRedAPD was running at almost 100% CPU and that the process was the only limiting factor of the server. Please see below screenshot of TOP, CPU load and time of iRedAPD.
I/O of the server as all the other 7 cores were not busy/a limiting factor.

Is it possible that iRedAPD is the bottleneck of the server causing everything to slow down when many emails are processed?

I don't know if this is implemented but would it be possible to fork a new iRedAPD child on each incoming connection/email?

This would balance the load to all available CPU cores and the bottleneck would be gone.

What do you think?
Am I the only one having that kind of issue?

Take care and all the best!
Bernhard

Post's attachments

Screenshot at 2021-01-18 08-37-58.png
Screenshot at 2021-01-18 08-37-58.png 25.95 kb, file has never been downloaded. 

You don't have the permssions to download the attachments of this post.

2

Re: iRedApd bottleneck?

broth wrote:

I don't know if this is implemented but would it be possible to fork a new iRedAPD child on each incoming connection/email?

This is the default.

broth wrote:

Is it possible that iRedAPD is the bottleneck of the server causing everything to slow down when many emails are processed?

It's a known issue that iRedAPD may not be able to handle as many requests as possible when CPU resource is not enough (it's not same as iRedAPD takes 100% CPU). for example, if some program takes a lot CPU (e.g. 90-100%) resource, Postfix will report "timeout" error while connecting / waiting for iRedAPD to finish the requests. Another situation is, if server is a VPS and running on a shared-resource physical server, and some other "neighbour" VPS takes a lot (shared) CPU resource. I personally consider it as a Python issue, not iRedAPD itself.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

3

Re: iRedApd bottleneck?

Thanks for your reply!

The server is a dedicated virtual machine.

As previously mentioned, CPU or I/O was not a limiting factor.
There was plenty spare CPU resources available.

I can monitor iredapd and its child processes with following command:

watch pstree -lap $(pgrep -f iredapd)

So next time many emails are arriving and iRedAPD is at 100% CPU, I should see many childs, correct?

Is there anything else I can do in order to further track down possible root causes when the issue happens next?

4

Re: iRedApd bottleneck?

broth wrote:

So next time many emails are arriving and iRedAPD is at 100% CPU, I should see many childs, correct?

No. It doesn't spawn multiple system processes, it's multi-threading.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

5

Re: iRedApd bottleneck?

ZhangHuangbin wrote:

No. It doesn't spawn multiple system processes, it's multi-threading.

I checked for threads using

top -H -p <pid>

As you can see in attached screenshot, only one thread is shown while many emails were sent to that server.

Hopefully I am wrong but from what I can see iredapd is running with one single thread only.
This explains the bottleneck we experience on the server.

Post's attachments

Screenshot at 2021-01-28 08-26-43.png
Screenshot at 2021-01-28 08-26-43.png 91.88 kb, file has never been downloaded. 

You don't have the permssions to download the attachments of this post.

6

Re: iRedApd bottleneck?

broth wrote:

This explains the bottleneck we experience on the server.

iRedAPD is wrote with Python "asyncore" and "asynchat" modules:
https://docs.python.org/3/library/async … e-asyncore

I'm not sure whether the "top -H" can show the threads.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

7

Re: iRedApd bottleneck?

Dear Zhang,

I'd like to bring this topic up again because it causes still issues on my iredmail installation in times of heavy email traffic (e.g. hundreds of emails being received):

- iredapd consumes 100% CPU (1 core only)
- email reception is delayed
- email sending of customers is delayed (it takes a long time until email is actually accepted by the server, in the meantime thunderbird shows a progress bar..)

I've read the documentation of asyncore (which is deprecated BTW since python 3.6) and it clearly states:

"that lets you have nearly all the advantages of multi-threading, without actually using multiple threads."

I understand that I/O is asynchronous but actual iredapd policy processing is not.

This explains why there are no visible threads in "top -H" and why iredapd is never using more than one CPU core.

I am not yet seeking for a solution or fix but I hope that we both understand the root cause.

The successor ot asyncore has a well written documentation about blocking code:

https://docs.python.org/3/library/asyncio-dev.html

So as a baseline all I/O can be handled by that library and processing of the received commands/emails/messages will have to be done in a separate thread.

This will ensure that all server ressources are used in cases of high load.

Feedback welcome ^_^

Take care!
Cheers,
Bernhard

8

Re: iRedApd bottleneck?

BTW: I just finished some tests using xstress.

10 threads sending 1 email each to my iredmail server will cause iredapd to get to 100% within a very short time.

Can you reproduce the issue in your test environment?

I'll be happy to point my xstress to your server smile

9

Re: iRedApd bottleneck?

Hi Broth,

- Could you please share the details of how you produce the 100% CPU usage in a local testing machine?
- I understand the situation that asyncore/asynchat modules are deprecated since Python 3.6, just need time to migrate it to asyncio, or wait for PR contribution.
- Maybe you can try this "gevent" edition too: https://github.com/iredmail/iRedAPD/blo … -gevent.py

Note: above link is the latest revision of gevent module support, it may not work by simply replacing the "iredapd.py" file, some tweaks are required.

If i want to rewrite iRedAPD, i may rewrite it with Golang.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

10

Re: iRedApd bottleneck?

Dear Zhang,

>- Could you please share the details of how you produce the 100% CPU usage in a local testing machine?

I was performing the tests on my production machine which was recently installed from the scratch on Debian 10.

I downloaded and compiled xstress on a cloud vserver:

https://iweb.dl.sourceforge.net/project … 219.tar.gz

This software is just sending emails to the destination server.

I added the IP of the test machine to postfix "myhosts" to allow SMTP without authentication.

Then I started xtresss:

xstress -t 10 -m 1

This sends 10 messages each in a dedicated thread.

With only that few emails, you can watch iredapd to increase its load to 100% CPU


I was unable to reproduce the issue on a new clean install without mail data/accounts.

This leads me to believe the problem is somewhere in the policy processing of iredapd.

Can you help me to understand what steps are done in iredapd for each message?

Is there something like a profiler in python which helps to identify where the process gets delayed in processing?

To me this "feels" like the problem being some kind if internal list lookup, looping though a list of elements (graylisting maybe). The data source (mariadb) is unsuspicious.

Best regards,
Bernhard

11

Re: iRedApd bottleneck?

broth wrote:

To me this "feels" like the problem being some kind if internal list lookup, looping though a list of elements (graylisting maybe). The data source (mariadb) is unsuspicious.

Try these tunings to avoid DNS queries first.

- Add settings in iRedAPD config file /opt/iredapd/settings.py:

CHECK_SPF_IF_LOGIN_MISMATCH = False

- Disable greylisting plugin.

Let's see how it goes.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

12 (edited by broth 2021-05-07 15:04:06)

Re: iRedApd bottleneck?

Dear Zhang,

thanks for your hint! ^_^

Without any changes the email send test with xstress takes around 11 seconds for 10 messages.

ZhangHuangbin wrote:

- Disable greylisting plugin.

This made a huge difference!
iRedAPD load barely noticable on server.

The time it took to send 10 messages went down to 2 seconds!

ZhangHuangbin wrote:

CHECK_SPF_IF_LOGIN_MISMATCH = False

This setting made no difference.

Conslusion:

Continue searching for the cause in the graylisting plugin.

On my production server I enabled the plugin again to avoid a spam flood.

Best regards,
Bernhard

13

Re: iRedApd bottleneck?

ok, try one more setting in /opt/iredapd/settings.py:

GREYLISTING_BYPASS_SPF = False

It's used by greylisting plugin too.

Note: a fast (upstream) DNS server helps a lot for mail server.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

14

Re: iRedApd bottleneck?

Dear Zhang,

the proposed setting did no improvement in processing time.

I did some programming in the graylisting plugin and found that a slight improvement can be made by using sets instead of a lists for whitelists and _wls.

Anyhow, I think the main cause is the size of the data returned by SQL for the query to "greylisting_whitelists"

In my case there are 138.054 elements returned.

iredapd shows a process_time=1.0867s for each message.

I conclude that the main cause of the slow processing is the amount of elements returned by SQL to iRedAPD.

Is there anything we can do to improve the situation?
Reducing elements is obvious but will have functional drawbacks.

Can we reduce the count of elements returned by optimizing thre SQL query?
I guess MySQL/MariaDB is much faster in checking whitelist smile

Thanks and best regards,
Bernhard

15 (edited by broth 2021-05-09 19:00:11)

Re: iRedApd bottleneck?

I made following modification:

- let SQL return lowercase results
- using sets for whitelists and _wls

This improved the performance by some percent.


    whitelists = set()

    for tbl in ['greylisting_whitelist_domain_spf', 'greylisting_whitelists']:
        # query whitelists based on recipient
        sql = """SELECT LOWER(sender)
                   FROM %s
                  WHERE account IN %s""" % (tbl, sqlquote(recipients))

        logger.debug('[SQL] Query greylisting whitelists from `{}`: \n{}'.format(tbl, sql))
        qr = conn.execute(sql)
        records = qr.fetchall()

        _wls = {v[0] for v in records}

        # check whether sender (email/domain/ip) is explicitly whitelisted
        if client_address in _wls:
            logger.info('[%s] Client IP is explictly whitelisted for greylisting service.' % (client_address))
            return True

        _wl_senders = set(senders) & set(_wls)
        if _wl_senders:
            logger.info('[{}] Sender address is explictly whitelisted for greylisting service: {}'.format(
                client_address, ', '.join(_wl_senders)))
            return True

        whitelists.update(_wls)

Currently, as "greylisting_whitelists' has many records (CSA whitelist), this is the main cause for qr.fetchall() to spend so much time.

Maybe I have to disable the big whitelist because thinkable workarounds are too complicated to implement:

- let SQL return only matching records
- using SQL stored procedures (depending on SQL server might be difficult)


Am I the only iRedAPD user with many whitelisted servers?

At least the bottleneck has been identified ^_^

16

Re: iRedApd bottleneck?

broth wrote:

I made following modification:

Mind posting a patch/diff for easier review?

broth wrote:

Currently, as "greylisting_whitelists' has many records (CSA whitelist), this is the main cause for qr.fetchall() to spend so much time.

How many records do you have? I don't expect it takes too long.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

17

Re: iRedApd bottleneck?

Dear Zhang,

in "graylisting_whitelists" I have around 138.054 elements.
SQL dump of greylisting_whitelists table is attached.

This allows you to check the performance figure.

Here is your diff:

--- greylisting.py.org    2021-05-10 05:20:10.294292798 +0200
+++ greylisting.py    2021-05-10 05:20:40.261923972 +0200
@@ -45,11 +45,11 @@
     @ip_object -- object of IP address type (get by ipaddress.ip_address())
     """
 
-    whitelists = []
+    whitelists = set()
 
     for tbl in ['greylisting_whitelist_domain_spf', 'greylisting_whitelists']:
         # query whitelists based on recipient
-        sql = """SELECT sender
+        sql = """SELECT LOWER(sender)
                    FROM %s
                   WHERE account IN %s""" % (tbl, sqlquote(recipients))
 
@@ -57,7 +57,7 @@
         qr = conn.execute(sql)
         records = qr.fetchall()
 
-        _wls = [str(v[0]).lower() for v in records]
+        _wls = {v[0] for v in records}
 
         # check whether sender (email/domain/ip) is explicitly whitelisted
         if client_address in _wls:
@@ -70,7 +70,7 @@
                 client_address, ', '.join(_wl_senders)))
             return True
 
-        whitelists += _wls
+        whitelists.update(_wls)
 
     logger.debug('[%s] Client is not explictly whitelisted.' % (client_address))

From the log you can clearly see that it takes very long to get processed.

May 10 05:21:17 mail iredapd [1xx.6x.1x.1x] RCPT, xxxx@gmail.com -> xxxx@example.com, DUNNO [sasl_username=, sender=xxxx@gmail.com, client_name=xxx.xxxxxx.net, reverse_client_name=xxx.xxxxxx.net, helo=xxx.xxxxxx.net, encryption_protocol=, encryption_cipher=, server_port=25, process_time=1.1399s]

While iRedAPD is busy, processing of other messages is delayed (e.g. sending an email in thunderbird)


ZhangHuangbin wrote:

I don't expect it takes too long.

I totally agree, I wasn't expecting this either


Best regards,
Bernhard

Post's attachments

greylisting_whitelists.sql.gz 1.64 mb, file has never been downloaded. 

You don't have the permssions to download the attachments of this post.

18

Re: iRedApd bottleneck?

Check the SQL command mentioned in your patch:

     for tbl in ['greylisting_whitelist_domain_spf', 'greylisting_whitelists']:
         # query whitelists based on recipient
-        sql = """SELECT sender
+        sql = """SELECT LOWER(sender)
                    FROM %s
                   WHERE account IN %s""" % (tbl, sqlquote(recipients))

We have UNIQUE INDEX for `account, sender`, this query should use the index since `account` is the first column used in query. But i wonder whether it makes any difference if you create a new index for `account` in these 2 tables?

Since it doesn't use `LIMIT X`, it may go through whole table each time, this is bad.

Also, did you check how many sql records were returned by this query? How about add a logging line like this to log it: `logger.info("...")`.

----

Buy me a cup of coffee ($5) to support iRedMail:

buy me a cup of coffee

19

Re: iRedApd bottleneck?

ZhangHuangbin wrote:

Also, did you check how many sql records were returned by this query? How about add a logging line like this to log it: `logger.info("...")`.

Yes I did. _wls has a length of more than 130 thousand elements.

I don't think this is a problem with the database:

MariaDB [iredapd]> select lower(sender) from greylisting_whitelists where account in ('user@example.com', '@example.com', '@.', '@.example.com', '@.com');

..
..
..
138066 rows in set (0.095 sec)

As you see, execution time of the query is 95ms
(the lower(sender) costs around 18ms but saves looping through the list in python)

IMHO the man problem is the time required for python to fetch the complete result set into memory (fetchall call).

There are several options to solve the issue:

1. get rid of so many whitelist records
-> easy workaround without actually fixing the root cause

2. enhance iredapd so that whitelist checking is done in SQL instead of python
-> this will probably give the best performance at the cost of development and test effort

3. check my small modifications as seen in the patch and keep the basic structure, documentation/warning to make user aware about potential performance impacts when using many whitelist records
-> reasonable


Have you tried to import my SQL dump into your database?