Categorization — How Does it Work?

Article #:

Product

Castle

 

Guardian

All

Summary

What follows is an explanation of how the Smoothwall web filter categorizes content.

Description

Domains

The Smoothwall uses several elements of a web request to categorize content. Typically, we look at the request for a domain, any search terms that can be extracted from the URL (to understand what a user is searching for) and finally the actual content loaded on the page. The content that is analyzed is the HTML behind the page, not just what is displayed in the browser.

To categorize domains, a database of pre-categorized domains, sub-domains, and URLs is used. This database is updated daily. To check the status of your blocklist, go to System > Maintenance > Licenses and look at the Blocklist subscription section. From here you can perform an update or download a whole new blocklist.

If a domain exists in the database, the website is categorized and the policy table is applied. How a domain is matched depends on the format in which the URL was added to the category.

The table below is a useful reference when adding domains or URLs to a category. It shows the effect of blocking a particular format of URL. For instance, if example.com/path is blocked and a user attempts to access example.com, they will not be blocked. They will, however, be blocked on subdomain.example.com/path, which has the same path. Domain categorization is greedy: it will match as much as possible.

 

Each row is a URL in the Smoothwall. Each column is a URL being browsed to, and the cell shows whether it is a match.

URL in the Smoothwall        example.com   example.com/path   subdomain.example.com   subdomain.example.com/path
example.com                  Yes           Yes                Yes                     Yes
example.com/path             No            Yes                No                      Yes
subdomain.example.com        No            No                 Yes                     Yes
subdomain.example.com/path   No            No                 No                      Yes
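The matching rules in the table can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the Smoothwall's actual implementation. The function names split_entry and entry_matches are hypothetical, and treating deeper paths beneath the listed path as matches is an assumption that the table does not cover.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_entry(url: str) -> Tuple[str, str]:
    """Split 'host/path' (scheme optional) into a lowercase host and a path."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the leading part as the host
    parts = urlsplit(url)
    return (parts.hostname or "").lower(), parts.path.rstrip("/")


def entry_matches(entry: str, browsed: str) -> bool:
    """Return True if a categorized entry covers the URL being browsed to."""
    entry_host, entry_path = split_entry(entry)
    host, path = split_entry(browsed)
    # The browsed host must be the entry host itself or one of its subdomains.
    host_ok = host == entry_host or host.endswith("." + entry_host)
    # An entry without a path covers every path. Otherwise the browsed path
    # must equal the entry path or (assumption) sit underneath it.
    path_ok = (
        entry_path == ""
        or path == entry_path
        or path.startswith(entry_path + "/")
    )
    return host_ok and path_ok
```

For example, entry_matches("example.com/path", "subdomain.example.com/path") is True, while entry_matches("example.com/path", "example.com") is False, in line with the second row of the table.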

URL Patterns

The Smoothwall is able to categorize some sites by examining the construction of the URL. For instance, many games sites may contain the phrases games and unblocked in their URL. When this happens, a request will be categorized and the policy table will decide the outcome.
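As a rough sketch of the idea, the Python below categorizes a URL when every phrase in a rule appears somewhere in it. The rule table, the category name, and the function name are all hypothetical. The real patterns used by the Smoothwall are not described in this article.

```python
from typing import List

# Hypothetical rules: a URL is assigned the category when every listed
# phrase appears somewhere in it.
URL_PATTERN_RULES = {
    "games": ("games", "unblocked"),
}


def categorize_by_url_pattern(url: str) -> List[str]:
    """Return the categories whose phrases all occur in the URL."""
    lowered = url.lower()
    return sorted(
        category
        for category, phrases in URL_PATTERN_RULES.items()
        if all(phrase in lowered for phrase in phrases)
    )
```

With these example rules, a URL such as https://unblocked-games.example.com/play would fall into the games category, while https://example.com/games would not, because only one of the two phrases is present.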

Search Term Extraction

If a search engine is being used, then once a query is executed, search terms are extracted from the URL if present. We support over 80 different services and are constantly adding more. A single matching search term is enough for the request to be categorized, after which the policy table decides the outcome.

On an HTTPS site, search term extraction requires an HTTPS decrypt and inspect policy to be enabled. HTTPS hides everything but the domain name, so search terms cannot be extracted without one.

To learn more about how Smoothwall handles search engines, see the article How can Smoothwall make Search Engine browsing safer?
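The following Python sketch shows what pulling search terms out of a URL can look like. The host names and query parameter names in SEARCH_PARAMS are made up for illustration, as is the function name. The list of services the Smoothwall actually supports is not given here.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit

# Hypothetical map of search service host -> query parameter holding the terms.
SEARCH_PARAMS = {
    "search.example.com": "q",
    "find.example.org": "query",
}


def extract_search_terms(url: str) -> List[str]:
    """Return the lowercase search terms found in the URL, if any."""
    parts = urlsplit(url)
    param = SEARCH_PARAMS.get((parts.hostname or "").lower())
    if param is None:
        return []  # not a known search service
    # parse_qs decodes '+' and percent-escapes for us.
    values = parse_qs(parts.query).get(param, [])
    return [term.lower() for value in values for term in value.split()]
```

Calling extract_search_terms("https://search.example.com/results?q=Unblocked+Games&page=2") yields the two terms unblocked and games. Each term could then be compared against the category lists. Note that the whole URL, including the query string, has to be visible for this to work.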

Dynamic Content Filtering

Once the previous steps have been completed, the content is fetched and examined. This is known as dynamic content filtering, and it allows the Smoothwall to categorize never-before-seen content.

When examining a page’s content, the HTML behind the page is checked, not just what is displayed on the screen. This can sometimes cause a block page, because content that is not displayed can still trigger the web filter. This often happens with modern web pages, which load resources before they are needed to give a more responsive experience.

Once a page has enough content that it exceeds the threshold, it will be categorized, and if that category is blocked, a block page will be displayed. The reasons for the block page being displayed can be clearly seen, and should help you understand what has caused the block.
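One way to picture a threshold of this kind is a weighted phrase score, sketched below in Python. This is an assumption made purely for illustration. The article does not say how the Smoothwall scores content, and the phrases, weights, threshold value, and function names are all invented.

```python
from typing import Dict, List

# Hypothetical phrase weights per category and a hypothetical threshold.
PHRASE_WEIGHTS = {
    "gambling": {"casino": 40, "poker": 30, "jackpot": 30},
}
THRESHOLD = 100


def score_page(html: str) -> Dict[str, int]:
    """Score the raw HTML, so markup that is never displayed still counts."""
    text = html.lower()
    return {
        category: sum(weight * text.count(phrase) for phrase, weight in phrases.items())
        for category, phrases in PHRASE_WEIGHTS.items()
    }


def categorize_page(html: str) -> List[str]:
    """Return the categories whose score exceeds the threshold."""
    return sorted(
        category
        for category, score in score_page(html).items()
        if score > THRESHOLD
    )
```

Because the raw HTML is scored, a page whose visible text is harmless can still be categorized if a hidden element, for example one styled with display:none, contains enough matching phrases.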

HTTPS pages will require a decrypt and inspect policy to be enabled in order for the dynamic content filter to be effective. For more information about creating a decrypt and inspect policy, see https://help.smoothwall.net/Latest/Content/modules/guardian3/cgi-bin/guardian/https.htm

Feedback

If you feel that we have incorrectly categorized a web page, and believe it should be changed, then please use the following form to let us know:

https://uk.smoothwall.com/provide-blocklist-feedback/

Attribution:

Last updated: 27th March 2017

Author: Will Laycock-Smith

Contributions by:

 

 

Copyright © 2000-2016 Smoothwall All rights reserved.