Web Scraping 101 — What to Keep in Mind Before Embarking on this Journey

Samyukta Hariharan
3 min read · Aug 5, 2023


Hi all, I am starting a series on web scraping, with articles on getting your web scraping journey started in a safe and efficient manner!

Before we begin, what is web scraping?

An indispensable part of data engineering is data collection. Data can be collected from numerous sources such as IoT devices and embedded systems, or can be bought from vendors. Data can also be collected from a webpage, and this is done via a technique known as web scraping or web harvesting.

Web scraping is extremely useful for gathering product reviews, market figures, press releases, or any other updates and communications that are published predominantly via websites and available to the public. It's free, easy, and gives you readily available data for your analytical systems. In the current internet era, with numerous websites at our disposal, it is an invaluable skill to learn.

What to keep in mind before starting to web scrape?

There are certain rules one must follow in order to scrape information ethically and efficiently:

1. Follow robots.txt

Most websites serve a robots.txt file (found at www.{websitedomain}.com/robots.txt). It tells bots which parts of the website they should not crawl. Whether our bot honors robots.txt is technically up to us; ignoring it, however, is unethical.
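Python's standard library can parse these rules for us. The sketch below uses a made-up robots.txt and bot name; in practice you would point `RobotFileParser` at the live file with `set_url(...)` and `read()` instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask before fetching: may our bot scrape this path?
print(parser.can_fetch("MyScraperBot", "https://www.example.com/products"))      # True
print(parser.can_fetch("MyScraperBot", "https://www.example.com/private/data"))  # False
```

Calling `can_fetch` before every request is a simple way to guarantee your bot never strays into disallowed areas.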

2. Be mindful of logins

Sometimes the information we are interested in (for example, tweets or Instagram posts) sits behind a login, and we would need to create an account to access it. In general, when we encounter a login, it's best to avoid scraping that website, since the information behind the login is meant to be private.

3. Do not cause disruptions

Some websites are very sensitive to traffic, and any spike beyond the normal can crash the site and disrupt its day-to-day operation. It is therefore essential to know what kind of website we are dealing with and throttle the traffic our bot generates accordingly.
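The simplest throttle is a fixed pause between requests. This is a minimal sketch; the function name and the 2-second default are my assumptions, and you should tune the delay to the target site's capacity:

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # never hammer the server back-to-back
        results.append(fetch(url))
    return results
```

In a real scraper, `fetch` would wrap something like `requests.get`; randomizing the delay slightly also keeps your traffic pattern gentler and less bot-like.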

4. Be aware of terms and conditions and privacy statements

While robots.txt is usually enough to keep us safe from accessing un-intended content, it is still good to be aware of the terms and conditions of the webpage, to know that we are not breaking any rules. If one is performing web scraping in their organization, it is important to find out and follow the rules and regulations that their organization has in place for the practice of ethical web scraping.

5. Tread carefully with CDNs

Some websites deploy CDNs such as Akamai and Cloudflare that can block us at the slightest indication that we are a bot, or serve us a facade that hides the actual information. In this series we will learn how to identify whether a website uses a CDN. To be safe, always verify that the raw content you scrape matches what the UI displays.
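One early warning sign is the response headers. The heuristic below checks for header signatures these CDNs are known to set (`CF-RAY` and `Server: cloudflare` for Cloudflare, `Server: AkamaiGHost` or `X-Akamai-Transformed` for Akamai), but real deployments vary; treat this as a sketch, not a definitive check:

```python
def detect_cdn(headers):
    """Return a CDN name guessed from HTTP response headers, or None."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    server = h.get("server", "")
    if "cf-ray" in h or "cloudflare" in server:
        return "Cloudflare"
    if "akamai" in server or "x-akamai-transformed" in h:
        return "Akamai"
    return None
```

If this returns a CDN name, slow down, compare the scraped content against the browser UI, and expect bot-detection challenges.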

6. Always prefer JSON over HTML, and HTML over JavaScript

When it comes to scraping, the information displayed on the UI is either available from an API as JSON content or embedded in HTML tags. In some rare cases, it may only exist inside JavaScript code (which is a nightmare to extract!). If you can find a JSON source, always go for that: because JSON content can be rendered by any UI, the underlying data is less likely to change when the website's UI is updated.
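To see why, compare extracting the same (invented) product price from a JSON payload versus from HTML:

```python
import json
import re

# The same data exposed two ways; both payloads are made up for illustration.
api_response = '{"product": "Widget", "price": 9.99, "reviews": 124}'
html_response = '<div class="product"><span class="price">9.99</span></div>'

# JSON: one call, keyed by a stable field name.
price_from_json = json.loads(api_response)["price"]

# HTML: tied to tag and class names that change whenever the UI does.
# (A regex stands in here for a real parser such as BeautifulSoup.)
price_from_html = float(re.search(r'class="price">([\d.]+)<', html_response).group(1))

print(price_from_json, price_from_html)  # 9.99 9.99
```

Both paths work today, but a UI redesign breaks the HTML extraction while leaving the JSON field names untouched.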

Conclusion

Hope these tips help you prepare to embark on the journey of data collection using web scraping.

Stay tuned for more in this series!
