1 (edited by broth 2020-08-12 01:12:02)

Topic: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

==== REQUIRED BASIC INFO OF YOUR IREDMAIL SERVER ====
- iRedMail version (check /etc/iredmail-release): 1.3.1
- Deployed with iRedMail Easy or the downloadable installer? installer
- Linux/BSD distribution name and version: Debian 10
- Store mail accounts in which backend (LDAP/MySQL/PGSQL): MariaDB
- Web server (Apache or Nginx): nginx
- Manage mail accounts with iRedAdmin-Pro? yes
- [IMPORTANT] Related original log or error message is required if you're experiencing an issue.
====

Hello all!

We are planning to use "Auto learn spam/ham with Dovecot imap_sieve plugin" as described on https://docs.iredmail.org/dovecot.imapsieve.html

While testing, I can see email files being put into the corresponding spam and ham folders for learning.
That's very nice!

However, I found that when an email is moved back and forth between folders, it ends up duplicated in both the spam and ham learning folders.

I don't know how SpamAssassin handles duplicates in SPAM and HAM learning.
Data gets duplicated and disk space is wasted (a possible DoS vector if customers misbehave).


Is this intended behaviour?
What if I enhance the imapsieve_copy script to use a unique identifier for each message instead of "${RANDOM}${RANDOM}"?
This would make it possible to check for existing messages and delete re-classified SPAM messages from HAM.
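
To illustrate what such an identifier could look like, here is a minimal sketch that hashes the Message-ID header. The helper name message_key and the fallback to the first 50 lines are assumptions made for this example, not part of the iRedMail script; it also assumes LF line endings and an unfolded Message-ID header.

```shell
#!/usr/bin/env bash
# message_key is a hypothetical helper: it derives a stable 16-character key
# from the Message-ID header of the message file given as $1.
message_key() {
    local id
    # Quit at the first empty line (end of headers); print the header value.
    id=$(sed -n -e '/^$/q' -e 's/^[Mm]essage-[Ii][Dd]:[[:space:]]*//p' "$1" | head -n 1)
    if [[ -z "${id}" ]]; then
        # Assumption: fall back to hashing the first 50 lines if no Message-ID exists.
        head -n 50 "$1" | md5sum | cut -c -16
    else
        printf '%s' "${id}" | md5sum | cut -c -16
    fi
}

# Two copies of one message (the second with an extra header) and a different message.
TMP_DIR=$(mktemp -d)
printf 'Message-ID: <1@example.com>\nSubject: a\n\nbody\n' > "${TMP_DIR}/a.eml"
printf 'X-Moved: yes\nMessage-ID: <1@example.com>\nSubject: a\n\nbody\n' > "${TMP_DIR}/b.eml"
printf 'Message-ID: <2@example.com>\nSubject: c\n\nbody\n' > "${TMP_DIR}/c.eml"

KEY_A=$(message_key "${TMP_DIR}/a.eml")
KEY_B=$(message_key "${TMP_DIR}/b.eml")
KEY_C=$(message_key "${TMP_DIR}/c.eml")
echo "a=${KEY_A} b=${KEY_B} c=${KEY_C}"
```

The first two keys come out identical because only the Message-ID is hashed, so headers added in transit do not change the identifier.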


Best regards,
Bernhard

BTW: Moving folders into Junk does not cause anything to happen. Is this desired?


2

Re: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

broth wrote:

I don't know how SpamAssassin handles duplicates in SPAM and HAM learning.

Duplicate messages are treated like a single message.

broth wrote:

Data gets duplicated and disk space is wasted (a possible DoS vector if customers misbehave).

Messages are removed after they have been scanned/learnt, and the cron job runs every 10 minutes (the default interval in our tutorial), so I don't think it's a big deal.

broth wrote:

Is this intended behaviour?

Yes.

broth wrote:

What if I enhance the imapsieve_copy script to use a unique identifier for each message instead of "${RANDOM}${RANDOM}"?
This would make it possible to check for existing messages and delete re-classified SPAM messages from HAM.

You're free to make such an improvement.

broth wrote:

BTW: Moving folders into Junk does not cause anything to happen. Is this desired?

Moving "folders" instead of messages to Junk?
Hmm, I didn't test this before, and the Dovecot documentation doesn't mention it either. I'm afraid what you see is what we can expect from Dovecot.

3

Re: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

Thanks for your quick feedback!

If SA treats duplicate messages as only one, I'm fine.

But is it treated as SPAM or HAM when it's e.g. 2 times in SPAM and 1 time in HAM?

ZhangHuangbin wrote:

Moving "folders" instead of messages to Junk?
Hmm, I didn't test this before, and the Dovecot documentation doesn't mention it either. I'm afraid what you see is what we can expect from Dovecot.

I just tried to test "classic" customer behaviour and mistakes.
When moving a folder, nothing happens, and that's great :)

4

Re: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

broth wrote:

But is it treated as SPAM or HAM when it's e.g. 2 times in SPAM and 1 time in HAM?

Good question, but I'm afraid I don't have an accurate answer for you. Here's my assumption, based on what I have learned from the SA website and from experience. I can't tell you whether it's correct right now, because I haven't run exact tests yet.

- Learned 2 times as spam: it counts as only one spam message, since it's a duplicate.
- If you then feed SA the same message as HAM, it overwrites the old data (SPAM) and the message becomes HAM.

FYI:

https://cwiki.apache.org/confluence/dis … adLearning

5

Re: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

Thanks for your link and update.

I don't like the idea of mails being duplicated in SPAM/HAM when they get moved.

In the end, we don't know whether the mail is learned as SPAM or HAM.

I implemented a fix for that problem on my server, and it has been working fine for some weeks:

/etc/dovecot/sieve/pipe/imapsieve_copy

#!/usr/bin/env bash
# Author: Zhang Huangbin <zhb@iredmail.org>
# Purpose: Read full email message from stdin, and save to a local file.
#          A copy of the same message in the opposite (spam/ham) directory
#          is removed, so only the latest classification is kept.

# Usage: bash imapsieve_copy <email> <spam|ham> <output_base_dir>
# Note: <output_base_dir> is ignored here; the base directory is hardcoded below.

USER="$1"

if [[ "$2" == "spam" ]]; then
    MSG_TYPE="spam"
    ALT_MSG_TYPE="ham"
else
    MSG_TYPE="ham"
    ALT_MSG_TYPE="spam"
fi

OUTPUT_BASE_DIR="/var/vmail/imapsieve_copy"
OUTPUT_DIR="${OUTPUT_BASE_DIR}/${MSG_TYPE}"
ALT_OUTPUT_DIR="${OUTPUT_BASE_DIR}/${ALT_MSG_TYPE}"

# Old naming scheme, replaced by the checksum-based name below:
#FILE="${OUTPUT_DIR}/${USER}-$(date +%Y%m%d%H%M%S)-${RANDOM}${RANDOM}.eml"

OWNER="vmail"
GROUP="vmail"

for dir in "${OUTPUT_BASE_DIR}" "${OUTPUT_DIR}"; do
    if [[ ! -d "${dir}" ]]; then
        mkdir -p "${dir}"
        chown "${OWNER}:${GROUP}" "${dir}"
        chmod 0700 "${dir}"
    fi
done

# Save the message from stdin to a temporary file first.
TEMP=$(mktemp) || exit 1
cat > "${TEMP}"

# Checksum of the first 50 lines (mostly headers) identifies the message.
CS=$(head -n 50 "${TEMP}" | md5sum | cut -c -16)
FILE="${USER}-$(date +%Y%m%d)-${CS}.eml"

# Log only if the move succeeded; rm -f stays silent when no opposite copy exists.
if mv "${TEMP}" "${OUTPUT_DIR}/${FILE}"; then
    rm -f "${ALT_OUTPUT_DIR}/${FILE}"
    logger -p local5.info -t imapsieve_copy "Copied one ${MSG_TYPE} email reported by ${USER}: ${FILE}"
fi

I know that's not pretty, but it works :)

Instead of using random numbers for the filename, it generates a checksum from the first 50 lines of the mail (mostly the header, which should be unique enough).
It places the email in the target folder and deletes it from the other one.

When you run "watch find /var/vmail/imapsieve_copy" while moving emails, you can see them jumping happily.
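
The effect can be reproduced without Dovecot. Below is a minimal sketch of the same idea in a throwaway directory. The names store_report and BASE_DIR are made up for this illustration, and the date is left out of the filename so the result does not depend on the day the message was moved.

```shell
#!/usr/bin/env bash
# Sketch only: store_report and BASE_DIR are hypothetical names for this demo.
BASE_DIR=$(mktemp -d)   # throwaway stand-in for /var/vmail/imapsieve_copy
mkdir -p "${BASE_DIR}/spam" "${BASE_DIR}/ham"

# store_report <user> <spam|ham> <message-file>
store_report() {
    local user="$1" kind="$2" msg="$3" alt cs file
    if [[ "${kind}" == "spam" ]]; then alt="ham"; else alt="spam"; fi
    # Same checksum as in the script above: first 50 lines, 16 hex characters.
    cs=$(head -n 50 "${msg}" | md5sum | cut -c -16)
    file="${user}-${cs}.eml"
    # Store the message under its new type, then drop the copy under the other type.
    cp "${msg}" "${BASE_DIR}/${kind}/${file}" && rm -f "${BASE_DIR}/${alt}/${file}"
}

MSG=$(mktemp)
printf 'From: a@example.com\nSubject: test\n\nbody\n' > "${MSG}"

# Simulate a customer moving the same mail spam -> ham -> spam.
store_report user@example.com spam "${MSG}"
store_report user@example.com ham  "${MSG}"
store_report user@example.com spam "${MSG}"

# Arithmetic expansion strips any padding that wc may add.
SPAM_COUNT=$(( $(find "${BASE_DIR}/spam" -type f | wc -l) ))
HAM_COUNT=$(( $(find "${BASE_DIR}/ham" -type f | wc -l) ))
echo "spam=${SPAM_COUNT} ham=${HAM_COUNT}"
```

After the three moves, only one copy is left, and it sits in the folder of the last classification.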

6

Re: SPAM Autolearn/dovecot: Spam-Ham-Spam move proposal

Additional note:

When we allow customers to move mails and possibly correct mistakes (mails moved to Junk by accident and/or junk declared as ham), a 10-minute interval might be too short for the cron job.

In my installation I let the cron job run only once a day, in the late evening:

- This trains SA with valid data
- Customers are usually no longer working
- It prepares the server for the nightly spam