A couple of weeks ago, I started a small project to help a friend buy air tickets. I naively imagined that all I needed to do was open the airline’s website, enter the search details, and keep refreshing until a ticket showed up. It didn’t start out well.
Tech stack
- Python 3
- Selenium
- Chrome driver
- Chrome browser
I’m new to web crawling and scraping, but I have done some testing with Selenium before. I thought I could just figure out where to click, enter the flight information, search, and then keep refreshing.
The problems
Soon I realized that the airline website has a mechanism to block this kind of automation. After I finished coding the functionality above, I ran my code with the refresh interval set to 30 seconds. It went well for about 20 minutes, so I patted myself on the back, figured everything worked perfectly, and went off to do some housework.
An hour later, I came back and saw a Google reCAPTCHA screen. Okay, the airline website knew that I was using a bot to refresh for a new ticket. I added some random sleep time on top of my 30-second refresh interval. After another 10 minutes or so, my IP address was blocked. 😶
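For the curious, the "random sleep on top of the 30-second interval" idea looks roughly like the sketch below. Only the 30-second base comes from my setup. The 15-second jitter cap and the names next_delay and refresh_loop are made up for illustration, and refresh stands in for whatever function reloads the page and checks for a ticket. As I found out, this alone was not enough to avoid the block.

```python
import random
import time

BASE_INTERVAL_SECONDS = 30  # the refresh interval mentioned above


def next_delay(base=BASE_INTERVAL_SECONDS, max_jitter=15.0):
    """Return the base interval plus a random extra of 0 to max_jitter seconds.

    max_jitter is an assumed value for illustration only.
    """
    return base + random.uniform(0, max_jitter)


def refresh_loop(refresh, attempts, sleep=time.sleep):
    """Call refresh() up to `attempts` times, pausing a jittered delay after each miss.

    `refresh` is a hypothetical callable that reloads the results page and
    returns True once a ticket is found. `sleep` is injectable so the loop
    can be exercised without actually waiting.
    """
    for _ in range(attempts):
        if refresh():
            return True
        sleep(next_delay())
    return False
```

Passing a fake sleep function makes it easy to dry-run the loop without waiting half a minute between attempts.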
Solution
The first thing I could think of was using a VPN, so I bought a VPN service in order to continue developing the bot. That temporarily solved the blocked IP address issue.
Next, I had to solve the Google reCAPTCHA problem. There were two ways to go. One was to use some sort of AI library to break it. That route is rather difficult because Google reCAPTCHA is very sophisticated and hard to get past (otherwise, they’d be out of business already 😂). That left me with the second option, faking the user agent header, which is my only alternative for now. A user agent header is a piece of information your browser sends with every HTTP/HTTPS request. It tells the server which browser and operating system you’re using. The basic idea is to randomly generate a new identity for my browser so that Google reCAPTCHA thinks it’s a different device. I found a perfect library called fake-useragent that does exactly that for me, and it works!
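Here is a minimal sketch of the user agent trick, not the exact code from my project. To keep it dependency-free, it picks from a small hard-coded pool instead of calling fake-useragent. The strings in USER_AGENTS and the function names are illustrative assumptions. With fake-useragent installed, I would expect UserAgent().random to take the place of pick_user_agent(). Chrome accepts the chosen string through its --user-agent command-line switch, which Selenium passes along via ChromeOptions.

```python
import random

# Illustrative pool of user agent strings (assumed values, not from the project).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]


def pick_user_agent():
    """Pick one user agent string at random from the pool."""
    return random.choice(USER_AGENTS)


def chrome_user_agent_argument(user_agent):
    """Build the Chrome command-line switch that overrides the user agent."""
    return f"--user-agent={user_agent}"


def build_driver():
    """Start Chrome through Selenium with a randomly chosen user agent."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(chrome_user_agent_argument(pick_user_agent()))
    return webdriver.Chrome(options=options)
```

Each call to build_driver() starts a browser session that may report a different identity from the previous one.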
After that, the rest is just code to enter the flight information and search. Most of that code can be reused for UI testing and web crawling, so I decided to open-source part of it. That is how the PyWebTest project started.
https://github.com/lokarithm/PyWebTest
Cheers,
Lok
