Categorization — How Does it Work?

Article #: Product Castle
  Guardian All

Summary

What follows is an explanation of how the Smoothwall web filter categorizes content.

Description

Domains

The Smoothwall uses several elements of a web request to categorize content. Typically, we look at the requested domain, then any search terms that can be extracted from the URL (to understand what a user is searching for), and finally the actual content loaded on the page. The content that is analyzed is the HTML behind the page, not just what is displayed in the browser.

To categorize domains, a database of pre-categorized domains, sub-domains, and URLs is used. This database is updated daily. You can check the status of your blocklist by going to System > Maintenance > Licenses and looking at the Blocklist subscription section. From here you can perform an update or download a whole new blocklist.

If a domain exists in the database, then the website is categorized, and the policy table is applied. Domain categorization can happen in a few different ways, depending on the format in which the URL was added to the category.

The table below may be helpful when adding domains or URLs to a category. It shows the effect of blocking a particular format of URL. For instance, if example.com/path is blocked and a user attempts to access example.com, they will not be blocked. They will, however, be blocked on subdomain.example.com/path, which has the same path. Domain categorization is greedy: it will match as much as possible.

Rows: URL in the Smoothwall. Columns: URL being browsed to (and whether it is a match).

URL in the Smoothwall      | example.com | example.com/path | subdomain.example.com | subdomain.example.com/path
example.com                | Yes         | Yes              | Yes                   | Yes
example.com/path           | No          | Yes              | No                    | Yes
subdomain.example.com      | No          | No               | Yes                   | Yes
subdomain.example.com/path | No          | No               | No                    | Yes
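
The matching behavior in the table can be sketched in a few lines of Python. This is only an illustration of the rules shown above, not the Smoothwall's actual implementation. The function names are hypothetical, and the treatment of deeper paths (such as example.com/path/page) is an assumption that the table does not cover.

```python
from urllib.parse import urlsplit


def _split(url):
    """Return (host, path) for a URL written with or without a scheme."""
    parts = urlsplit(url if "//" in url else "//" + url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def entry_matches(entry, requested):
    """Return True if a category entry matches the URL being browsed to.

    Hypothetical helper that reproduces the table above:
    - the requested host must be the entry's host or a subdomain of it
    - if the entry has a path, the requested URL must carry that path
    """
    entry_host, entry_path = _split(entry)
    req_host, req_path = _split(requested)

    host_ok = req_host == entry_host or req_host.endswith("." + entry_host)
    if not host_ok:
        return False
    if not entry_path:
        return True  # a bare domain entry matches every path under it
    # Assumption: deeper paths under the entry's path also match.
    return req_path == entry_path or req_path.startswith(entry_path + "/")
```

For example, entry_matches("example.com/path", "subdomain.example.com/path") is True, while entry_matches("example.com/path", "example.com") is False, in line with the second row of the table.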

URL Patterns

The Smoothwall is able to categorize some sites by examining the construction of the URL. For instance, the URLs of many games sites contain the phrases "games" and "unblocked". When a URL matches such a pattern, the request is categorized and the policy table decides the outcome.
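
As a rough sketch of the idea, the following Python checks whether a URL contains every phrase in a set. The pattern table, the category name, and the function name are all hypothetical. The real pattern rules are not described in this article and are likely to be more sophisticated.

```python
# Hypothetical pattern table: a category matches when ALL phrases in any
# one of its phrase sets appear somewhere in the URL.
URL_PATTERNS = {
    "games": [("games", "unblocked")],
}


def categorize_by_url(url):
    """Return the categories whose phrase sets all appear in the URL."""
    text = url.lower()
    matched = []
    for category, phrase_sets in URL_PATTERNS.items():
        if any(all(phrase in text for phrase in phrases) for phrases in phrase_sets):
            matched.append(category)
    return matched
```

With this table, a URL such as http://unblocked-games.example.com/play would be placed in the hypothetical "games" category, and the policy table would then decide whether to allow or block it.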

Search Term Extraction

When a user runs a query on a search engine, the search terms are extracted from the URL, if present. We support over 80 different services, and are adding more constantly. If even a single search term matches a category, the request is categorized and the policy table decides the outcome.

On an HTTPS site, search term extraction requires an HTTPS decrypt and inspect policy to be enabled. HTTPS hides all but the domain name, so we cannot extract search terms without a decrypt and inspect policy.

To learn more about how Smoothwall handles search engines, see the article "How can Smoothwall make Search Engine browsing safer?".
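
To illustrate why the full URL is needed, here is a minimal Python sketch of search term extraction. The host search.example.com, the parameter name q, and the function name are assumptions made for the example. The list of supported services and their parameters is not given in this article.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping of search engine host -> query parameter that
# carries the search terms.
SEARCH_PARAMS = {
    "search.example.com": "q",
}


def extract_search_terms(url):
    """Return the lower-cased search terms found in the URL, if any."""
    parts = urlsplit(url)
    param = SEARCH_PARAMS.get((parts.hostname or "").lower())
    if param is None:
        return []  # not a known search service
    values = parse_qs(parts.query).get(param, [])
    return [term.lower() for value in values for term in value.split()]
```

The search terms live in the query string, which is encrypted on an HTTPS connection. Without decrypt and inspect, only the domain name is visible, so a function like this would have nothing to work with.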

Dynamic Content Filtering

Once the previous steps have been completed, the content is fetched and examined. This is known as dynamic content filtering, and it allows the Smoothwall to categorize never-before-seen content.

When examining a page’s content, the HTML behind the page is checked, not just what is displayed on the screen. This can sometimes cause a block page to appear when content that is not displayed triggers the web filter. This often happens with modern web pages, which load resources before they are needed to give a more responsive experience.

Once a page has enough content that it exceeds the threshold, it will be categorized, and if that category is blocked, a block page will be displayed. The reasons for the block page being displayed can be clearly seen, and should help you understand what has caused the block.
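
The threshold idea can be sketched as a simple weighted phrase count over the raw HTML. This is an assumed scoring model for illustration only: the phrases, weights, threshold value, and function name are hypothetical, and the article does not describe how the Smoothwall actually scores content.

```python
# Hypothetical phrase weights per category and a hypothetical threshold.
PHRASE_WEIGHTS = {
    "games": {"play now": 40, "high score": 30, "unblocked": 50},
}
THRESHOLD = 100


def categorize_content(html):
    """Return categories whose weighted phrase score exceeds the threshold.

    The raw HTML is scanned, so text that the browser never displays
    (for example, inside a hidden element) still counts towards the score.
    """
    text = html.lower()
    matched = []
    for category, weights in PHRASE_WEIGHTS.items():
        score = sum(weight * text.count(phrase) for phrase, weight in weights.items())
        if score > THRESHOLD:
            matched.append(category)
    return matched
```

Because the whole HTML source is scored, a page with a hidden block such as <div style="display:none">unblocked unblocked play now</div> would exceed this example threshold even though none of that text is visible to the user.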

HTTPS pages will require a decrypt and inspect policy to be enabled in order for the dynamic content filter to be effective. For more information about creating a decrypt and inspect policy, see https://help.smoothwall.net/Latest/Content/modules/guardian3/cgi-bin/guardian/https.htm

Overblocking vs Underblocking

No web filtering solution is perfect, and Smoothwall is no exception. Any web filter must lean one way or the other: either underblock, potentially allowing seriously dangerous content through, or overblock, which might block some safe content.

Given that we are a protector of vulnerable people, most of all children, we are extremely wary of letting dangerous content through the Guardian web filter, and as such we tend to overblock.

We appreciate that this can be frustrating for some system administrators. However, we strongly believe that it is better to block some safe websites than to allow seriously dangerous and inappropriate content to slip through.

We are actively working to improve our dynamic content analysis rules to reduce overblocking. However, if you encounter a domain or URL that is incorrectly categorized, please report it. We welcome your feedback.

Feedback

If you feel that we have incorrectly categorized a web page, and believe it should be changed, then please use the following form to let us know:

https://uk.smoothwall.com/provide-blocklist-feedback/

Attribution:

Last updated: 27th March 2017
Author: Will Laycock-Smith
Contributions by: