We all know automatic profanity filters on message boards and elsewhere on the Internet can be ridiculously and unimaginatively strict. But the problem is much worse: sometimes they see swearwords inside other words.
Imagine, too, that you got through school with a last name like Dick, Butts, Weiner, or Gaylord, and now you cannot even get past the first stage of an online registration process.
This was a problem faced by Ben Gash, who has kindly allowed us to use his LinkedIn post to show the problems he has faced when trying to promote an apprenticeship vacancy for his company, AACS Ltd, on the government's apprenticeship portal. In this blog, Sophie Read, our Head of Marketing, discusses this further.
This is known as the Scunthorpe Problem, after an incident in 1996 when AOL’s rather simple-minded dirty-word filter prevented residents of several English towns and counties (among them Scunthorpe, Penistone, Lightwater and Middlesex) from creating accounts with AOL, because it matched strings within the place names to “banned” words. Since it also checked the town names against the postal codes, users from these towns could not get around it by entering modified versions of the names: they were darned if they did, darned if they didn’t.
Websites routinely use tools to prevent users from making accounts with fake or obscene words, but overzealous filters and poorly written code often flag innocent phrases that either happen to contain obscene words within them or are legitimate uses of such words.
For more information, see Tom Scott’s YouTube video “Why Web Filters Don’t Work: Penistone and the Scunthorpe Problem”.
Other examples of the Scunthorpe problem include:
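To make the failure concrete, here is a minimal Python sketch of the kind of naive substring check that causes the problem. It is an illustration only, not the code of any real filter: the `BANNED` set and the `naive_filter` function are hypothetical names, and the two banned terms simply mirror the anecdotes in this post.

```python
# Hypothetical banned list for illustration; real filters use far longer lists.
BANNED = {"allah", "xxx"}


def naive_filter(text: str) -> list[str]:
    """Return every banned term that appears ANYWHERE in the text."""
    lowered = text.lower()
    # Plain substring test: this is what produces Scunthorpe-style false positives.
    return sorted(term for term in BANNED if term in lowered)


print(naive_filter("Linda Callahan"))   # ['allah']  (hidden inside a surname)
print(naive_filter("Super Bowl XXX"))   # ['xxx']    (a legitimate use)
print(naive_filter("Jane Smith"))       # []
```

The first result is an innocent word that happens to contain a banned string; the second is a banned string used legitimately. A simple list cannot tell either apart from genuine abuse.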
- In the months leading up to Super Bowl XXX, some web searches were being filtered because the Roman numeral for the game can also be used to identify porn.
- Jeff Gold attempted to register a domain name for his mushroom website in 1998. The name he wanted, shitakemushrooms.com, was blocked by an InterNIC dirty word filter.
- A Scottish man named Craig Cockburn reported he was unable to use his surname with Hotmail back in 2004. He was eventually able to register with the name ‘C0ckburn’.
- In 2006, a woman named Linda Callahan was prevented from creating an email address on Yahoo! because her name contained the word ‘allah’. Yahoo! later reversed the ban.
- Dr. Herman Libshitz ran into issues in 2008 when trying to create a Verizon email address due to a particular string of letters in his last name.
There are several ways to filter profanity while trying to avoid the Scunthorpe problem:
- One option is to use Python packages to find profanity. The profanity-filter package created by Roman Inflianskas (https://github.com/rominf/profanity-filter) works well on most words and did well to avoid flagging Scunthorpe-type words, but it still misses some individual words and phrases.
- Another option is profanity-check, a package created by Victor Zhou (https://github.com/vzhou842/profanity-check), which uses machine learning rather than an explicit list of words. It, too, sometimes fails to pick up certain phrases.
- There is also the option of using a custom list to stop “bad” words being used. However, the problem with using a list is that whatever is not on the list will be missed, so the list has to be extensive. The largest list Joseph Bell could find was a list of banned Google search terms consisting of 1400 individual words (https://github.com/jbell1991/profanity-filter-solving-scunthorpe-problem/blob/master/bad_single.csv) and over 200 phrases (https://github.com/jbell1991/profanity-filter-solving-scunthorpe-problem/blob/master/bad_phrases.csv).
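As a rough sketch of the custom-list option, the Python below matches listed words and phrases only when they appear as whole words, which avoids the substring false positives described earlier. It is not taken from any of the packages or repositories above: `build_filter`, `find_banned`, and the tiny example lists are hypothetical, and in practice the lists would be loaded from files such as the CSVs linked above.

```python
import re


def build_filter(words, phrases=()):
    """Compile one regex that matches the listed terms only as whole words."""
    terms = {t.lower() for t in (*words, *phrases) if t}
    if not terms:
        raise ValueError("the banned list is empty")
    # Longest terms first, so a phrase wins over a shorter word it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_banned(text, compiled):
    """Return the banned terms found in the text, lowercased, in order."""
    return [m.group(0).lower() for m in compiled.finditer(text)]


# Hypothetical example lists; a real list would need to be far more extensive.
FILTER = build_filter(words=["allah", "xxx"], phrases=["banned phrase"])

print(find_banned("Linda Callahan", FILTER))            # []
print(find_banned("Super Bowl XXX", FILTER))            # ['xxx']
print(find_banned("This is a Banned Phrase.", FILTER))  # ['banned phrase']
```

Whole-word matching lets names like Callahan through, but it still flags legitimate uses such as Super Bowl XXX and still misses anything not on the list, so a list on its own does not fully solve the problem.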