Web Scraping 101 — What to Keep in Mind Before Embarking on this Journey
Hi all, I am starting a series on web scraping: articles on getting started with your web scraping journey in a safe and efficient manner!
Before we begin, what is web scraping?
An indispensable part of data engineering is data collection. Data can be collected from numerous sources such as IoT devices and embedded systems, or can be bought from vendors. Data can also be collected from a webpage, and this is done via a technique known as web scraping or web harvesting.
Web scraping is extremely useful when it comes to gathering information on product reviews, market figures, press releases, or any other important updates and communications made predominantly via websites and available to the public. It's free, easy, and gives you readily available data for your analytical systems. In the current internet era, with numerous websites at our disposal, it is an invaluable skill to learn.
What to keep in mind before starting to web scrape?
There are certain rules one must follow to scrape information ethically and efficiently:
1. Follow robots.txt
Most websites host a robots.txt file (found at www.{websitedomain}.com/robots.txt). It tells bots which parts of the website they may and may not crawl. It is technically up to us whether our bot honors robots.txt, but ignoring it is unethical. A quick programmatic check is sketched below.
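Here is a minimal sketch of such a check using Python's standard library. The site URL and the bot's user-agent string are placeholders for illustration.

```python
# Check robots.txt before fetching a page, using the standard library.
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper-bot"  # hypothetical bot name; use your own

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

# Only proceed if the rules allow this user agent to fetch the path.
if rp.can_fetch(USER_AGENT, "https://www.example.com/products/"):
    print("Allowed to scrape this path")
else:
    print("robots.txt disallows this path; skip it")
```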
2. Be mindful of logins
Sometimes the information we are interested in (for example tweets or Instagram posts) sits behind a login, and we would need to create an account to access it. In general, when we encounter a login it's best to avoid scraping that website, since the information behind the login is meant to be private.
3. Do not cause disruptions
Some websites are very sensitive to traffic, and any spike beyond the normal could crash the website and disrupt its day-to-day operation. It is therefore essential to know what kind of website we are dealing with and throttle our bot's traffic accordingly, for example by pausing between requests, as in the sketch below.
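As a rough illustration, assuming the requests library and a fixed delay (both the delay value and the URLs are made up; tune them to the target site's capacity):

```python
# Polite, rate-limited fetching: identify the bot and pause between requests.
import time
import requests

HEADERS = {"User-Agent": "my-scraper-bot"}  # identify your bot honestly
DELAY_SECONDS = 2  # pause between requests so we don't flood the server

urls = [
    "https://www.example.com/page/1",
    "https://www.example.com/page/2",
]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly instead of retrying blindly
    print(url, len(response.text), "bytes")
    time.sleep(DELAY_SECONDS)  # throttle to avoid disrupting the site
```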
4. Be aware of terms and conditions and privacy statements
While robots.txt is usually enough to keep us from accessing unintended content, it is still good to read the webpage's terms and conditions to be sure we are not breaking any rules. If you are scraping within an organization, find out and follow the rules and regulations your organization has in place for ethical web scraping.
5. Tread carefully with CDNs
Some websites sit behind CDNs and bot-management services such as Akamai and Cloudflare that can block us at the slightest indication that we are a bot, or serve us a facade that hides the actual information. In this series we will learn how to identify whether a website uses a CDN; a simple header-based heuristic is sketched below. To be safe, always compare the raw content your scraper receives with what the UI actually displays.
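One rough heuristic is to inspect the response headers. Cloudflare commonly sets a Server: cloudflare header and a cf-ray header; Akamai signatures vary more by configuration, so treat this as a hint, not a guarantee:

```python
# Heuristic CDN detection via response headers (coverage is incomplete).
import requests

response = requests.get("https://www.example.com", timeout=10)
server = response.headers.get("Server", "").lower()

if "cloudflare" in server or "cf-ray" in response.headers:
    print("Likely behind Cloudflare")
elif "akamai" in server:
    print("Likely behind Akamai")
else:
    print("No obvious CDN signature in headers")
```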
6. Always prefer JSON over HTML, and HTML over JavaScript
When it comes to scraping, the information displayed on the UI may be available from an API as JSON, or it may be embedded in HTML tags. In some rare cases it is buried inside JavaScript (which is a nightmare!). If you can find a JSON source, always go for that: because the same JSON can back any UI, the underlying data is less likely to change when the website's layout is updated. The sketch below contrasts the two approaches.
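A minimal sketch of the two sources, assuming the requests and BeautifulSoup libraries. The API endpoint and the CSS class here are hypothetical; in practice you would discover the real JSON endpoint via your browser's network tab.

```python
# Contrast: JSON API (stable) vs. parsing rendered HTML (layout-fragile).
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Preferred: a JSON API endpoint, robust to UI redesigns.
api_response = requests.get("https://www.example.com/api/products", timeout=10)
products = api_response.json()  # parsed straight into Python objects

# Fallback: parse the rendered HTML, which breaks when the layout changes.
html_response = requests.get("https://www.example.com/products", timeout=10)
soup = BeautifulSoup(html_response.text, "html.parser")
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
```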
Conclusion
Hope these tips help as you prepare to embark on the journey of data collection using web scraping.
Stay tuned for more in this series!