KB Article #176821

Double count of weighted keywords in HTML email messages

Problem

A policy is implemented to catch emails that contain specific words from a wordlist. Each word has a weight of 5, and the policy triggers on any email that scores at least 10. A test email message containing two words from the wordlist is sent. However, Message tracking shows each of those words counted twice, for a total score of 20 instead of 10.

Where does the multiple count come from? Is there a way to be sure it only counts each instance of the word once?

Resolution

This happens when the email message is sent in HTML format, which by design contains two representations of the email content: one in HTML and one in plain text (for non-HTML email clients). Because both parts contain the text of the email, each word appears twice in the raw MIME structure of the message, and the Policy Engine sees and counts both occurrences.
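The double count can be reproduced outside the product. The following is a minimal sketch using Python's standard email library, not MailGate's actual scanning logic: the wordlist contents, the function names, and the whole-word matching rule are assumptions made for illustration.

```python
import re
from email.message import EmailMessage

# Hypothetical wordlist: each word carries a weight of 5, as in the scenario above.
WORDLIST = {"confidential": 5, "merger": 5}


def build_html_email(plain: str, html: str) -> EmailMessage:
    """Build a multipart/alternative message with a plain text and an HTML part."""
    msg = EmailMessage()
    msg["Subject"] = "Test"
    msg.set_content(plain)                     # text/plain part
    msg.add_alternative(html, subtype="html")  # text/html part
    return msg


def score_all_text_parts(msg: EmailMessage) -> int:
    """Sum keyword weights over every text part found in the MIME structure."""
    total = 0
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container itself
        text = part.get_content().lower()
        for word, weight in WORDLIST.items():
            total += weight * len(re.findall(rf"\b{re.escape(word)}\b", text))
    return total


msg = build_html_email(
    "This confidential note concerns the merger.",
    "<html><body><p>This confidential note concerns the merger.</p></body></html>",
)
print(score_all_text_parts(msg))  # 20: two words x weight 5 x two MIME parts
```

Each of the two words is written once by the sender, but the scan finds it once in the plain text part and once in the HTML part, which doubles the score.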

To configure MailGate to scan only the HTML part, open the policy that uses the wordlist and, under Advanced Options in the Keyword Match part of the policy, select the following option: "Do not scan plain text MIME part if an alternative HTML part is available".

Note: Ignoring the plain text part introduces a potential security risk since the Plain text and HTML parts can have different content.

In practice, this means that a specially crafted message whose plain text part differs from its HTML part may pass the policy undetected. However, the mail client will still display such a message using its HTML part, so the risk remains more theoretical than practical.
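The effect of scanning only the HTML part, and the gap it leaves, can be illustrated with a short sketch. This is a simplified model written with Python's standard email library, not the product's implementation: the wordlist, the helper names, and the sample messages are hypothetical, and the sketch matches against the raw HTML markup rather than rendered text.

```python
import re
from email.message import EmailMessage

# Hypothetical wordlist: each word carries a weight of 5.
WORDLIST = {"confidential": 5, "merger": 5}


def make_message(plain: str, html: str) -> EmailMessage:
    """Build a multipart/alternative message from the two given bodies."""
    msg = EmailMessage()
    msg.set_content(plain)
    msg.add_alternative(html, subtype="html")
    return msg


def score_text(text: str) -> int:
    """Sum the weights of all whole-word, case-insensitive wordlist matches."""
    text = text.lower()
    return sum(
        weight * len(re.findall(rf"\b{re.escape(word)}\b", text))
        for word, weight in WORDLIST.items()
    )


def score_part(msg: EmailMessage, preference: tuple) -> int:
    """Score the first body part that matches the given preference order."""
    body = msg.get_body(preferencelist=preference)
    return score_text(body.get_content()) if body is not None else 0


normal = make_message(
    "This confidential note concerns the merger.",
    "<p>This confidential note concerns the merger.</p>",
)
crafted = make_message(
    "This confidential note concerns the merger.",  # keywords only in plain text
    "<p>Lunch at noon?</p>",
)

# Prefer HTML and fall back to plain text, mirroring the option described above.
print(score_part(normal, ("html", "plain")))   # 10: each word counted once
print(score_part(crafted, ("html", "plain")))  # 0: keywords in plain text are missed
print(score_part(crafted, ("plain",)))         # 10: the plain text part still has them
```

The first result shows the expected score of 10 once the plain text part is skipped. The second and third show the trade-off described in the note: content that exists only in the plain text part no longer contributes to the score.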