asynchronous-web-scraping #1
Comments
Thank you for this blog, great work! You've also tried to help me on Reddit, which I really appreciate. As I'm new to web scraping in general and still trying to find the best method, my question is: even though asynchronous scraping works at lightning speed without putting too much pressure on the server (I expect), how can I avoid being identified as a bot and blacklisted from the site? Most of the documentation I've read so far suggests adding a time.sleep() (or even a random time.sleep()) to try to mimic a human. So how do you avoid being blocked by a website when scraping asynchronously? I'd appreciate an answer with actual code, e.g. the headers you use, please.
@al22xx you've just stumbled onto the biggest subject in this particular medium! The big problem of web scraping is scaling and bot detection avoidance. To summarize it quickly: some websites want to serve pages only to humans, not bots, so how can they tell the difference between the two? One way is that real users usually execute the JavaScript included in the page, while bots (e.g. Python with a plain HTTP client) don't. Another way to detect bots is to track their IPs. IPs vary in quality: datacenter IPs, residential IPs and mobile IPs, in that order. So your bot might need to use proxies to avoid being identified as one power-user (no person can visit 1,000 pages a minute, right?).

So to summarize, there are two main ways for targets to keep track of clients: their JavaScript execution and their connection IP. There are a lot of ways to deal with this, and the right approach depends entirely on your project. Recently I've been teaming up with the folks at https://scrapfly.io, which offers a middleware service for exactly that: ensuring your HTTP requests are undetected and reliable. There's a free plan - check it out! :)
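For illustration, here is a minimal sketch (not from the blog post or the reply above) of an async fetcher that sends browser-like headers and routes requests through a proxy, assuming aiohttp. The header values and the proxy URL are placeholders you'd replace with your own.

```python
# Minimal sketch only: the header values and proxy endpoint below are placeholders,
# not anything used by the blog author.
import asyncio

import aiohttp

# Browser-like headers; copying what your own browser actually sends works best.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# Hypothetical proxy URL - residential/mobile IPs are less likely to be flagged
# than datacenter IPs.
PROXY = "http://user:pass@proxy.example.com:8000"


async def fetch(url: str) -> str:
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        # aiohttp takes the proxy per request; drop the argument to connect directly.
        async with session.get(url, proxy=PROXY) as response:
            response.raise_for_status()
            return await response.text()


if __name__ == "__main__":
    print(asyncio.run(fetch("https://httpbin.org/headers")))
```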
Thank you for your response. I have to admit I wrote that question a while ago and had been meaning to ask you for some time. I know you mention a couple of methods here, throttling and the leaky bucket, but as a novice I was wondering why we come up with a super-fast scraping method only to then have to slow it down to avoid being detected. I will look into https://scrapfly.io - thank you for all your great work!
@al22xx regarding the throttling issue - it's much easier to scale something down than to scale it up. Think of it this way: it's easier to control how much you eat when you have an unlimited supply of food; if you only have a little bit of food, well then, you're just starving! Some websites are bot-friendly and we can really push it; some aren't and we must slow down for optimal performance. Also, this rate is often not set in stone: for example, a lot of websites have scaling anti-bot protection that is more aggressive during peak hours (daytime) and less aggressive during off hours (like night time), so if we're smart with our scraping strategy we can push our speeds quite a bit! As for medium.com - unfortunately it doesn't pay enough to relinquish styling and publishing control just yet, but thanks for the kind words and the suggestion! :)
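To make the throttling idea concrete, here is a minimal sketch (again, not from the blog post) that caps an async crawl with an asyncio.Semaphore plus a randomized pause, in the spirit of a leaky-bucket limiter. The concurrency limit and delay range are illustrative values to be tuned per target site.

```python
# Minimal sketch only: CONCURRENCY and the sleep range are illustrative values,
# tune them to whatever the target site tolerates.
import asyncio
import random

import aiohttp

CONCURRENCY = 5  # at most 5 requests in flight at any moment


async def throttled_fetch(
    session: aiohttp.ClientSession, semaphore: asyncio.Semaphore, url: str
) -> str:
    async with semaphore:
        # a small randomized pause spreads requests out without serializing the crawl
        await asyncio.sleep(random.uniform(0.5, 2.0))
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.text()


async def main(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(throttled_fetch(session, semaphore, url) for url in urls)
        )


if __name__ == "__main__":
    pages = asyncio.run(main(["https://httpbin.org/html"] * 10))
    print(f"fetched {len(pages)} pages")
```

The semaphore is the throttle (it limits how many requests run concurrently), while the randomized sleep plays the role of the bucket's steady drain, so the crawl stays fast but never bursts past the chosen rate.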
Scrapecrow - Asynchronous Web Scraping: Scaling For The Moon!
Educational blog about web-scraping, crawling and related data extraction subjects
https://scrapecrow.com/asynchronous-web-scraping.html