1

Topic: 504 Bad Gateway

==== REQUIRED BASIC INFO OF YOUR IREDMAIL SERVER ====
- iRedMail version (check /etc/iredmail-release): 0.99
- Deployed with iRedMail Easy or the downloadable installer? downloadable installer
- Linux/BSD distribution name and version: CentOS 7.5
- Store mail accounts in which backend (LDAP/MySQL/PGSQL): LDAP
- Web server (Apache or Nginx): Nginx
- Manage mail accounts with iRedAdmin-Pro?  Yes - v4.0
- [IMPORTANT] Related original log or error message is required if you're experiencing an issue.
====

Our Help Desk gets a 504 Bad Gateway message every few days when logging into iRedAdmin-Pro. A reboot of the server always seems to fix the problem; if I don't reboot roughly every six days, the error recurs.

Today one of our users reported receiving a 504 Bad Gateway message when attempting to send a message; apparently it didn't appear until he clicked Send. This occurred shortly after a reboot prompted by the 504 error in iRedAdmin-Pro.

Has anyone seen this error?   Any suggestion on a fix?

Thanks,
Bob

2

Re: 504 Bad Gateway

I think I may have found the problem. SOGo was complaining it was running out of vmem and terminating processes, so I've adjusted sogo.conf accordingly. Hopefully this will solve the problem.
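For anyone hitting the same thing, here is a minimal sketch of the change, assuming the stock SOGo key name SxVMemLimit in /etc/sogo/sogo.conf (the value shown is illustrative, not a recommendation):

```conf
/* /etc/sogo/sogo.conf -- per-child virtual memory ceiling, in MB.
   A sogod child exceeding this limit is killed and respawned. */
SxVMemLimit = 1024;
```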

Bob

3

Re: 504 Bad Gateway

I had initially changed the vmem limit from 500 (my original setting) to 1024. Within an hour or so I began noticing the same memory error again, so I increased it to 4096. SOGo was running really well at that point. This morning our Help Desk called to say that iRedAdmin-Pro wasn't working. When I accessed it, "Internal Server Error" appeared. Checking the logs turned up a huge number of ClamAV errors related to memory, as well as kernel out-of-memory errors.

I'm running in a virtual environment, so adding RAM is simple. I'm going to double this server's RAM to 64GB now.

I'll post again with an update in case anyone finds this thread useful. :)

Bob

4

Re: 504 Bad Gateway

- How many SOGo child processes do you configure it to start? You can find the setting in /etc/sysconfig/sogo.
- How much RAM does this server have right now? 32GB?
- Do you have error messages like "no child available" in /var/log/sogo/sogo.log?
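If you're not sure which line it is, it's the PREFORK entry; a typical sketch (the number is just an example, not a recommendation):

```conf
# /etc/sysconfig/sogo -- number of sogod child processes to prefork
PREFORK=10
```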

----

Does my reply help a little? How about buying me a cup of coffee ($5) as an encouragement?

buy me a cup of coffee

5

Re: 504 Bad Gateway

ZhangHuangbin wrote:

- How many SOGo child processes do you configure it to start? You can find the setting in /etc/sysconfig/sogo.
- How much RAM does this server have right now? 32GB?
- Do you have error messages like "no child available" in /var/log/sogo/sogo.log?

The server currently has 64GB of RAM.

We currently have 500 child processes for SOGo, but 700 for a separate EAS instance.

I had received a lot of the "no child available" messages previously, which is why I increased the number of child processes.   It looks like over the past two days the "no child" messages have reappeared.   I might need to add to the number of child processes.

I backed off the vmem setting to 2048MB; the 4096MB setting seemed to be causing problems. Once I did, the SOGo interface began running very well.

We're still encountering problems with iRedAdmin-Pro displaying a 504 error. When these errors appear, the following shows up in the Nginx error log:
2020/05/07 11:32:44 [error] 1696#0: *2126550 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.181.234.6, server: _, request: "GET /iredadmin/dashboard?checknew HTTP/1.1", upstream: "uwsgi://127.0.0.1:7791", host: "stmail.luzerne.edu", referrer: "https://stmail.luzerne.edu/iredadmin"
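For context, that log line means Nginx gave up waiting on the iredadmin uWSGI backend after its uwsgi_read_timeout (60 seconds by default). A sketch of raising it while diagnosing, placed in the location block that contains the uwsgi_pass for 127.0.0.1:7791 (this only buys the backend more time; it doesn't fix whatever is wedging it):

```nginx
# Give the iredadmin uWSGI backend longer to answer before Nginx returns 504.
uwsgi_read_timeout 300s;
uwsgi_send_timeout 300s;
```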


I just tried restarting the iredadmin service and that appears to have corrected the problem.   I'll try restarting the service again next time this problem occurs.   It occurs daily.

Thanks,
Bob

6

Re: 504 Bad Gateway

Since my last message, iRedAdmin-Pro has become almost unusable. It's down more than it's up. Oddly, SOGo is performing better than ever!

During an iRedAdmin-Pro outage I see the same error in the Nginx error.log as mentioned in my last message. Stopping and starting the iredadmin service doesn't seem to help, but a reboot does.

I'm out of date on both iRedMail and iRedAdmin-Pro. I'm going to try upgrading them to see if the situation improves.

Bob


7

Re: 504 Bad Gateway

Did you ever check the CPU/RAM usage while iRedAdmin-Pro is down? How much CPU/RAM does SOGo use?


8

Re: 504 Bad Gateway

ZhangHuangbin wrote:

Did you ever check the CPU/RAM usage while iRedAdmin-Pro is down? How much CPU/RAM does SOGo use?

Yes. At all times, including during outages, memory usage is low. I don't recall ever seeing more than 20GB of RAM in use. The server is allocated 64GB.

CPU fluctuates a lot.   It seems to average about 40% utilization with some spikes to 80% and occasional spikes to 100%, but these are very brief.

We run in a VMware environment. I've already allocated 8 processors to this server, which concerns me: because of the way VMware ESXi schedules vCPUs, allocating too many can actually slow a VM down. A VM only gets processor time when a number of physical cores equal to its assigned vCPUs is available. All hosts in our cluster have 48 cores, so this VM is placed in a "RDY" (ready) state whenever 8 cores aren't free when it needs to do something.

I'm going to need to monitor the state of the host during slow/down periods to see if the VM is sitting in the RDY state.

Is it possible to somehow cluster two or more instances of iRedMail?

Thanks,
Bob

9

Re: 504 Bad Gateway

The problem, or a variation of it, just occurred again. This time our Help Desk complained they were receiving "Internal Server Error" when they tried to access iRedAdmin-Pro. I checked and received the same error. The Nginx error.log file showed nothing. I restarted the iredadmin service: no change. I restarted the Nginx service: no change.

I checked memory usage; a little under 11GB of the 64GB of RAM was in use at that time.

The CPU usage was interesting. I tinkered with this while it was down for about 10 minutes. CPU utilization was around 80% right up until the Help Desk reported the problem, at which point it dropped to about 40%. It looks like maybe a service crashed?

I restarted the server and everything is now running fine.

Bob

10

Re: 504 Bad Gateway

After the reboot, some users are complaining they can't send e-mail. I've checked, and I also can't send e-mail. SOGo displays the normal "sending..." popup followed by an empty window containing an exclamation point. A "gateway timeout" message appeared for a few moments in the error window.

Checking the log shows the following around the time the message was sent:
May 11 13:25:14 sogod [3275]: [ERROR] <0x0x55d8b7ed1060[WOHttpTransaction]> client disconnected during delivery of response for <WORequest[0x0x55d8b7e5b010]: method=POST uri=/SOGo/so/bdushok@student.luzerne.edu/Mail/0/folderDrafts/newDraft1589217770-1/send app=SOGo rqKey=so rqPath=bdushok@student.luzerne.edu/Mail/0/folderDrafts/newDraft1589217770-1/send> (len=84): the socket was shutdown


I am seeing four "No child available" messages in the logs, but none near the time of my inability to send an e-mail.

BTW, Postfix appears to be running fine on this server.

This server is under a much heavier load than normal. Everyone is now working from home and relying on the server more than they would in the office. I have a feeling the higher load is part of this problem; something needs to be tuned to handle the new load.

Thanks,
Bob

11

Re: 504 Bad Gateway

I just had a thought: maybe the problem is kernel-related. I checked the number of connections:

[root@stmail nginx]# netstat | wc -l
12595

I know this number isn't completely accurate, but it seems high.   
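One reason plain netstat overstates things is that it also lists Unix domain sockets. A sketch that counts only established TCP connections straight from /proc (field 4 of /proc/net/tcp is the socket state; hex 01 means ESTABLISHED):

```shell
# Count established TCP (v4 + v6) connections directly from /proc.
# FNR>1 skips each file's header line; $4 is the hex socket state.
awk 'FNR>1 && $4=="01"' /proc/net/tcp /proc/net/tcp6 | wc -l
```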

Next, I checked the kernel parameter for the socket listen backlog:

[root@stmail nginx]# sysctl -a | grep somaxconn
net.core.somaxconn = 2000

Seems sort of low. Next, I checked the network device backlog:

[root@stmail nginx]# sysctl -a | grep netdev_max_backlog
net.core.netdev_max_backlog = 1000

Again, seems low for the amount of server traffic.

Lastly, I checked the Nginx worker_connections setting. It was 2048; again, possibly low for this amount of traffic.

As a test I changed net.core.somaxconn and net.core.netdev_max_backlog to 8192 in sysctl.conf. I also changed worker_connections in Nginx (nginx.conf) to 4096.
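Worth noting: raising net.core.somaxconn only lifts the kernel's cap; on Linux, Nginx still requests a listen backlog of 511 unless the listen directive asks for more. A sketch of both pieces (values illustrative):

```conf
# /etc/sysctl.conf -- persists across reboots; apply now with `sysctl -p`
net.core.somaxconn = 8192
net.core.netdev_max_backlog = 8192

# nginx.conf -- the listen directive must request the larger backlog itself:
#   listen 443 ssl backlog=8192;
```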

I'll report back to let you know if this has helped.

Bob

12

Re: 504 Bad Gateway

There's a known issue: if the server is under extremely heavy load, Python applications may crash, including iRedAPD and iRedAdmin(-Pro). Postfix and Nginx keep working because they have internal mechanisms to handle this.

The real question here is why the heavy load lasts so long.

bdushok wrote:

As a test I changed net.core.somaxconn and net.core.netdev_max_backlog to 8192 in sysctl.conf. I also changed worker_connections in Nginx (nginx.conf) to 4096.

Try increasing the Nginx worker_connections setting to 8192.

- Also, do you have netdata installed? Try checking its charts; maybe it will give you some hints.
- What's the max number of SOGo child processes defined in /etc/sysconfig/sogo?
- What's the open-file limit for the sogo service? Do you have /etc/systemd/system/sogod.service.d/override.conf? If not, try creating it with the content below and restart the sogo service:

[Service]
LimitNOFILE=infinity

- Do a lot of users use the ActiveSync service?


13

Re: 504 Bad Gateway

Zhang,

I've changed the Nginx worker_connections to 8192.

I don't have netdata installed.

When you ask about child processes for SOGo, do you mean the PREFORK setting in /etc/sysconfig/sogo? If so, it's set to 500. I'm running a separate instance of SOGo for EAS (ActiveSync); that instance has a worker count of 650.

Within /etc/security/limits.conf I have:

nginx    soft  nofile  200000
nginx    hard  nofile  999999
sogo     soft  nofile  200000
sogo     hard  nofile  999999
sogoeas  soft  nofile  200000
sogoeas  hard  nofile  999999

I did not have an override.conf for sogod.service.d, but I did have one for sogod-eas.service.d (sogod-eas is the systemd unit I created for the SOGo instance used for EAS/ActiveSync).

This file included both LimitNOFILE=infinity and TasksMax=infinity.

I've added an override.conf file for sogod.service.d and added LimitNOFILE=infinity.

I've restarted.
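To confirm the override actually took effect, the live limit can be read from /proc; a quick sketch (pgrep -o picks the oldest matching process, i.e. the sogod parent; the fallback to the current shell's PID just keeps the command from failing if sogod isn't running):

```shell
# Show the effective open-file limit of the running sogod parent process.
pid=$(pgrep -o sogod || echo $$)
grep "Max open files" /proc/"$pid"/limits
```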

Thanks for your assistance. I'll report back with an update in a day or so, after this config has been in place for a while.

Bob

