Block AI bots in an efficient way

Date: 2024-12-14
Tags: AI, administration, crawling, http, network, nftables, security
(Illustration: robot with spider legs)

With the hype around useless AI technologies, including ChatGPT and other inefficient AI models, a lot of malicious companies, thinking they have found El Dorado, try to profit from all the data they can obtain on the web, without respecting Netiquette conventions.

Suchir Balaji, a whistleblower who described OpenAI's illegal reuse of other creators' content, was found dead in his US apartment; see "Police rules out foul play in death of OpenAI whistleblower Suchir Balaji".

As a result, more and more web services see their traffic grow insanely, with most of it coming from these bots, which often saturate services like a DDoS. Search engine crawlers generally limit their requests to about one every few seconds to avoid overloading web servers, but recent AI bot crawlers scan as fast as possible. ClaudeBot is one of these famous bots. Until recently they at least respected the "User-Agent" field, which describes the user agent used, allowing web services to choose which user agents they want to serve. But as most services are now blocking these chat bots due to their disregard of crawling conventions, some newcomers to the Internet think that hiding or forging the User-Agent as a regular browser is a good solution to keep stealing data to train their models.

As a result, services start to block too many things, and web users start to complain that they cannot visit regular web sites: detected as AI bots, they are blocked. A fun side effect is that the bots train on worse and worse results after every new generation, degrading the quality of the AI databases, which in turn degrades the content, and so on.

Here are the methods I used to block, in an efficient way, a large part of the badly managed AI bots without blocking end users.

If you have any question or suggestion, you can contact me on the Fediverse at @popolon@mastodon.social

My blocking method is in 2 steps:

  1. Blocking, with the firewall, useless IP ranges that don't have any end users and are largely responsible for these bad crawls (I give nftables, iptables and ipset versions here).
  2. Blocking the other ones depending on their user agent. This already filters out a large part of them.

This could be expanded with a fail2ban filter or other equivalent anti-DoS/spy/brute-force scripts.
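For example, here is a minimal fail2ban sketch; the filter and jail names are my own choice, the regexp assumes the default nginx combined log format, and bots that never reach your access log (or that you silence with access_log off as shown later) will of course not be caught this way:

# /etc/fail2ban/filter.d/ai-bots.conf
[Definition]
failregex = ^<HOST> .*"[^"]*(GPTBot|ClaudeBot|Claude-Web|Bytespider|Amazonbot|PetalBot)[^"]*"$
ignoreregex =

# /etc/fail2ban/jail.d/ai-bots.conf
[ai-bots]
enabled  = true
port     = http,https
filter   = ai-bots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400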

blocking IP ranges by firewall

Most of these bots come from Amazon. There are no end users on Amazon servers, they are cheap for entry-level services but become more expensive than traditional hosters for any real business, and there is no way an Amazon-hosted virtual server needs to crawl my own site, so I block their IP ranges' access to the HTTP (80) and HTTPS (443) ports.

I made a shell script that creates nftables define rules for the Amazon IP ranges on GNU/Linux; it can be reused for other purposes by Amazon users.

What this bash script does: it downloads Amazon's published ip-ranges.json file and converts it into two nftables define files, one with the IPv4 prefixes and one with the IPv6 prefixes.

You need two tools to make it work: curl and jq.

Under Arch and derivatives (Manjaro, BredOS,...):

sudo pacman -S --needed curl jq

With Debian and derivatives (ARMbian, Ubuntu, Deepin, Pop!_OS...):

sudo apt install curl jq

You can download it here:

You must give it execution rights:

chmod +x ip-ranges_amazon_to_nftables.sh

Change the destination dir DIR= to one writable by your user.

Here is the content of the script:

#!/usr/bin/env bash
AMAZON_IP_RANGES=https://ip-ranges.amazonaws.com/ip-ranges.json
IPRANGES_json=/tmp/ip-ranges.json
#DIR=/etc/nftables.d    # uncomment this line and comment out the next one when running from cron
DIR=/tmp
IPV4=${DIR}/defines-amazon_ipv4.nft
IPV6=${DIR}/defines-amazon_ipv6.nft

# --- download file if not already here ---
if [ ! -e ${IPRANGES_json} ]
then
  curl -Ro ${IPRANGES_json} ${AMAZON_IP_RANGES}
fi

# --- create DIR if it does not already exist ---
if [ ! -e ${DIR} ]
then
  mkdir -p ${DIR}
fi

# ------ create IPv4 define ------
echo "define amazon_ipv4 = {" >${IPV4} ## sed 's/"//g'|
jq .prefixes[].ip_prefix ${IPRANGES_json} | sed 's/"//g'| while read IP
do
  echo "  ${IP}," >>${IPV4}
done
echo "}" >>${IPV4}

# ------ create IPv6 define file ------
echo "define amazon_ipv6 = {" >${IPV6} 
jq .ipv6_prefixes[].ipv6_prefix ${IPRANGES_json} | sed 's/"//g'| while read IP
do
  echo "  ${IP}," >>${IPV6}
done
echo "}" >>${IPV6}

Load your custom defines at the beginning of /etc/nftables.conf:

include "nftables.d/defines-*.nft"

Then, inside your chain input {} directive, for IPv4 as an example: we consider that $public_interface is your public network interface and $public_ip is your public IP. You can also simply filter on all interfaces, whatever the destination IP address (second example).

Example of definition of the public IP/interface:

define public_ip = 1.2.3.4
define public_interface = eth0

Filter only for $public_ip on $public_interface:

table inet filter {
  chain input {
    [...] # your other directives
    iif $public_interface tcp dport { 80,443 } ip saddr $amazon_ipv4 ip daddr $public_ip drop comment "HTTP/S amazon WAN drop"
    [...] # your other directives
  }
}

Filter on all interfaces with any destination IP address:

table inet filter {
  chain input {
    [...] # your other directives
    tcp dport { 80,443 } ip saddr $amazon_ipv4 drop comment "HTTP/S amazon WAN drop"
    [...] # your other directives
  }
}
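
If you prefer the cron variant mentioned in the script (with DIR=/etc/nftables.d uncommented), a crontab sketch could look like this; the path of the script is an assumption, adapt it to where you installed it:

# /etc/cron.d/amazon-ranges (sketch): refresh the Amazon defines every Monday at 04:00,
# remove the cached json first so a fresh copy is downloaded, then check the
# configuration and reload nftables only if it is valid
0 4 * * 1  root  rm -f /tmp/ip-ranges.json && /usr/local/sbin/ip-ranges_amazon_to_nftables.sh && nft -c -f /etc/nftables.conf && systemctl reload nftables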

usage of ipset with iptables

In the case of ipset, you need to install ipset besides iptables:

pacman -S ipset                     # Arch Linux bases
apt install ipset ipset-persistent  # Debian based

In your iptables rules, simply add the following (the IPv6 rule goes into your ip6tables rules, since an inet6 set can only be matched there):

-A INPUT -p tcp -m multiport --dports 80,443 -m set --match-set IPV4_AMAZON src -m comment --comment "block IPV4_AMAZON by IPset" -j DROP
-A INPUT -p tcp -m multiport --dports 80,443 -m set --match-set IPV6_AMAZON src -m comment --comment "block IPV6_AMAZON by IPset" -j DROP
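
The IPV4_AMAZON and IPV6_AMAZON sets referenced above still have to be created and filled. Here is a minimal sketch reusing the same ip-ranges.json; only the set names come from the rules above, the rest is an assumption to adapt to your setup:

#!/usr/bin/env bash
# download the Amazon ranges and fill the two ipsets (idempotent thanks to -exist)
curl -Ro /tmp/ip-ranges.json https://ip-ranges.amazonaws.com/ip-ranges.json
ipset create IPV4_AMAZON hash:net family inet  -exist
ipset create IPV6_AMAZON hash:net family inet6 -exist
jq -r '.prefixes[].ip_prefix' /tmp/ip-ranges.json | while read IP
do
  ipset add IPV4_AMAZON ${IP} -exist
done
jq -r '.ipv6_prefixes[].ipv6_prefix' /tmp/ip-ranges.json | while read IP
do
  ipset add IPV6_AMAZON ${IP} -exist
done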

blocking user agents with nginx or forks by nginx main authors

On nginx, or better, one of the forks by main authors of nginx: Angie (sources on their own Gitea instance) and freenginx (sources on their hg server and GitHub mirror).

Create a /etc/nginx/rules/antibot.conf file with the following rules (download antibot.conf here):


# Antibot for bad bots

if ($http_user_agent ~ (AhrefsBot|amazonbot|Amazonbot|anthropic-ai|Bytespider|ClaudeBot|Claude-Web|FacebookBot|GPTBot|ChatGPT-User|Googlebot|Google-Extended|GoogleOther|Omgili|PetalBot|SemrushBot|Twitterbot|webprosbot)) {
  access_log off;    # Don't saturate logs with useless bots
  error_log off;     # can't work in http directive
  log_not_found off; # can't work in http directive
  return 444;        # nginx special "drop connection" code
}

Then, in each of your server{} directives (not directly in the http{} directive), you need to add this include directive:

server {
  listen 443;
  http2 on;

  include rules/antibot.conf;

  [...] # your own rules
}

And then reload your nginx configuration:

/usr/sbin/nginx -t      # Test current nginx configuration
systemctl reload nginx  # reload nginx configuration
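
You can then check that the filter works with curl: a blocked user agent should get an empty reply (the connection is closed by the return 444), while a normal request still gets the page. example.org stands for your own site here:

curl -I -A "GPTBot" https://example.org/   # should fail with "Empty reply from server"
curl -I https://example.org/               # normal request, should answer 200 OK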