With the hype around useless AI technologies, including ChatGPT and other inefficient AI models, a lot of malicious companies, thinking they will find El Dorado, try to profit from all the data they can obtain on the web, without respecting Netiquette conventions.
Suchir Balaji, a whistleblower who described OpenAI's illegal reuse of other creators' content, was found dead in his US apartment; see "Police rules out foul play in death of OpenAI whistleblower Suchir Balaji".
As a result, more and more web services see their traffic grow insanely, with most of it coming from these bots, often saturating services like a DDoS. Search engine crawlers generally limit their requests to one every few seconds to avoid overloading web servers, but recent AI bot crawlers scan as fast as possible. ClaudeBot is one of these famous bots. Until recently they at least respected the "User-Agent" field, which describes the user agent used, allowing web services to choose which user agents they want to serve. But as most services are blocking these chat bots due to their disregard of crawling conventions, some newcomers to the Internet think that hiding or forging the User-Agent as a regular browser is a good solution to keep stealing data to train their models.
As a result, services start to block too many things, and web users start to complain that they cannot visit regular web sites: detected as AI bots, they are blocked. And a fun side effect is that bots train on worse and worse results with every new generation, degrading the quality of the AI databases, which degrades the content, and so on.
Here are the methods I use to block, in an efficient way, a large part of badly managed AI bots without blocking end users.
If you have any question or suggestion, you can contact me on the Fediverse at @popolon@mastodon.social
My blocking method is in 2 steps: blocking the worst data centers' IP ranges at the firewall level (nftables, or ipset with iptables), and blocking bad User-Agents at the web server level (nginx).
This could be expanded with a fail2ban filter or other equivalent anti-DoS/spy/brute-force scripts.
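As an illustration, here is a minimal sketch of such a fail2ban filter; the ai-bots name, the regex, and the jail values are my own assumptions, not something from my setup, so adapt the log path and ban time to yours. It bans IPs whose requests carry known AI bot User-Agents in a combined-format access log:

# /etc/fail2ban/filter.d/ai-bots.conf (hypothetical name)
[Definition]
# match a combined-format nginx/apache log line whose User-Agent contains a known bot name
failregex = ^<HOST> .* "[^"]*(?:GPTBot|ClaudeBot|Bytespider|Amazonbot)[^"]*"$
ignoreregex =

# /etc/fail2ban/jail.local excerpt
[ai-bots]
enabled  = true
port     = http,https
filter   = ai-bots
logpath  = /var/log/nginx/access.log
maxretry = 2
bantime  = 86400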
As most bots come from Amazon, as there are no end users on Amazon servers (they are cheap for entry-level services but become more expensive than traditional hosters for any real business), and as there is no way an Amazon-hosted virtual server needs to crawl my own site, I block their IP ranges from accessing the HTTP (80) and HTTPS (443) ports.
I made a shell script that creates nftables define rules for the Amazon IP ranges on GNU/Linux; it can also be used for other purposes by Amazon users.
What this bash script does:
- It downloads https://ip-ranges.amazonaws.com/ip-ranges.json to /tmp/ip-ranges.json, if it doesn't already exist.
- It parses the ip-ranges.json file with the jq command and creates the /tmp/defines-amazon_ipv4.nft and /tmp/defines-amazon_ipv6.nft files with these ranges; you can choose to use them for blocking or any other purpose.
The 2 last files have to be copied into /etc/nftables.d/. You can simply edit DIR and replace /tmp with /etc/nftables.d/ to have them created there automatically under a root cron job.
You need two tools to make it work: curl and jq.
Under Arch and derivatives (Manjaro, BredOS,...):
sudo pacman -S --needed curl jq
With Debian and derivatives (ARMbian, Ubuntu, Deepin, Pop!_OS...):
sudo apt install curl jq
You can download it here:
ip-ranges_amazon_to_nftables.sh
IPset is very limited in terms of hashes, and converts IP ranges to a whole list of single IP addresses, so it's super slow, takes a lot of RAM, and saturates the default buffers. Take the time to install nftables instead: it's very easy and cleaner than iptables, and there are good examples on Arch Linux in /usr/share/nftables/ and /usr/share/doc/nftables/examples/ (you can download and untar the Arch package on other distros). In most distros, the iptables package now contains the iptables-nft conversion tools and iptables is just a layer on top. On Arch Linux it is in the iptables-nft package, see pacman -Ql iptables-nft. On Debian there are iptables-nft, iptables-nft-save, iptables-nft-restore and the same with ip6tables-nft etc., see apt-cache show iptables.
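You can see this compatibility layer at work with the iptables-translate tool shipped alongside iptables-nft; it prints the nft equivalent of an iptables rule without applying anything (exact output may vary slightly between versions):

iptables-translate -A INPUT -p tcp --dport 80 -j DROP
# prints something like:
# nft add rule ip filter INPUT tcp dport 80 counter drop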
ip-ranges_amazon_to_ipset.sh
These scripts use the same principle to generate the rules. You then need to add the generated rules to your own ruleset.
You must give it execution rights:
chmod +x ip-ranges_amazon_to_nftables.sh
Change the destination dir DIR= to one writable by your user.
Here is the content of the script:
#!/usr/bin/env bash
AMAZON_IP_RANGES=https://ip-ranges.amazonaws.com/ip-ranges.json
IPRANGES_json=/tmp/ip-ranges.json
#DIR=/etc/nftables.d # uncomment this and comment following line for replacing by cron
DIR=/tmp
IPV4=${DIR}/defines-amazon_ipv4.nft
IPV6=${DIR}/defines-amazon_ipv6.nft
# --- download file if not already here ---
if [ ! -e ${IPRANGES_json} ]
then
curl -Ro ${IPRANGES_json} ${AMAZON_IP_RANGES}
fi
# --- create DIR if it doesn't already exist ---
if [ ! -e ${DIR} ]
then
mkdir -p ${DIR}
fi
# ------ create IPv4 define ------
echo "define amazon_ipv4 = {" >${IPV4} ## sed 's/"//g'|
jq .prefixes[].ip_prefix ${IPRANGES_json} | sed 's/"//g'| while read IP
do
echo " ${IP}," >>${IPV4}
done
echo "}" >>${IPV4}
# ------ create IPv6 define file ------
echo "define amazon_ipv6 = {" >${IPV6}
jq .ipv6_prefixes[].ipv6_prefix ${IPRANGES_json} | sed 's/"//g'| while read IP
do
echo " ${IP}," >>${IPV6}
done
echo "}" >>${IPV6}
Load your custom defines at the beginning of /etc/nftables.conf:
include "nftables.d/defines-*.nft"
Then inside your chain input {} directive, add the filtering rule. In the first example below, for IPv4, we consider $public_interface is your network interface and $public_ip is your public IP; you can also just filter on all interfaces, independently of any destination address (second example).
Example of a definition of the public IP/interface:
define public_ip = 1.2.3.4
define public_interface = eth0
Filter only for public_ip on public_interface:
table inet filter {
chain input {
[...] # your other directives
iif $public_interface tcp dport { 80,443 } ip saddr $amazon_ipv4 ip daddr $public_ip drop \
comment "HTTP/S amazon WAN drop"
[...] # your other directives
}
}
Filter on all interfaces with any destination IP address:
table inet filter {
chain input {
[...] # your other directives
tcp dport { 80,443 } ip saddr $amazon_ipv4 drop \
comment "HTTP/S amazon WAN drop"
[...] # your other directives
}
}
In the case of ipset, you need to install ipset besides iptables:
pacman -S ipset # Arch Linux bases
apt install ipset ipset-persistent # Debian based
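The iptables rules below assume the IPV4_AMAZON and IPV6_AMAZON sets already exist. Here is a minimal sketch of how they can be created and filled from the same downloaded JSON file; the actual ip-ranges_amazon_to_ipset.sh may differ, and I use hash:net here, which accepts whole CIDR prefixes:

# create the sets (hash:net accepts CIDR prefixes directly)
ipset create IPV4_AMAZON hash:net family inet
ipset create IPV6_AMAZON hash:net family inet6
# fill them from the downloaded ip-ranges.json
jq -r .prefixes[].ip_prefix /tmp/ip-ranges.json | while read IP
do
    ipset add IPV4_AMAZON ${IP}
done
jq -r .ipv6_prefixes[].ipv6_prefix /tmp/ip-ranges.json | while read IP
do
    ipset add IPV6_AMAZON ${IP}
done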
In your iptables rules, simply add:
-A INPUT -p tcp -m multiport --dports 80,443 -m set --match-set IPV4_AMAZON src -m comment --comment "block IPV4_AMAZON by IPset" -j DROP
And the equivalent in your ip6tables rules:
-A INPUT -p tcp -m multiport --dports 80,443 -m set --match-set IPV6_AMAZON src -m comment --comment "block IPV6_AMAZON by IPset" -j DROP
On nginx, or better, one of the forks by main authors of nginx: Angie (sources on their own Gitea instance) or freenginx (sources on their hg server and GitHub mirror).
Create a /etc/nginx/rules/antibot.conf file with the following rules (download antibot.conf here):
# Antibot for bad bots
if ($http_user_agent ~ (AhrefsBot|amazonbot|Amazonbot|anthropic-ai|Bytespider|ClaudeBot|Claude-Web|FacebookBot|GPTBot|ChatGPT-User|Googlebot|Google-Extended|GoogleOther|Omgili|PetalBot|SemrushBot|Twitterbot|webprosbot)) {
access_log off; # Don't saturate logs with useless bots
error_log off; # can't work in http directive
log_not_found off; # can't work in http directive
return 444; # nginx special "drop connection" code
}
Then in each of your server{} blocks (not directly in the http{} block) you need to add this directive:
server {
listen 443;
http2 on;
include rules/antibot.conf;
[...] # your own rules
}
And then reload your nginx configuration:
/usr/sbin/nginx -t # Test current nginx configuration
systemctl reload nginx # reload nginx configuration
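You can then check that the filter works by faking a bot User-Agent with curl (example.com stands for your own domain):

curl -I -A "GPTBot" https://example.com/ # should fail with "Empty reply from server": code 444 closes the connection
curl -I https://example.com/             # a normal request should still return the headers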