Webscraping 101 — How to find the right APIs using Chrome Developer Tools and Postman
Welcome to article 2 in the series Webscraping 101. Before we begin, here are some quick points to note:
Summary of steps involved in web scraping
→ Understand the use case
→ Identify the source from which the data of interest originates
→ Design the logic to scrape all the data on a schedule
→ If it is a JSON API, make the required calls to the API and collect the data using keys and indices (see the sketch after this list)
→ If it is the source HTML, make the required calls to the website URL and write appropriate XPaths to get the data of interest
→ Use your programming language of choice to convert the design logic to code
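To make the JSON-API step concrete, here is a minimal Python sketch; the endpoint and key names are hypothetical, purely for illustration:

```python
# Minimal sketch of reading a JSON API, assuming a hypothetical
# endpoint that returns {"deals": [{"title": ..., "price": ...}, ...]}.
import requests

response = requests.get("https://api.example.com/deals")  # hypothetical URL
data = response.json()

# Collect the data of interest using keys and indices.
first_deal = data["deals"][0]
print(first_deal["title"], first_deal["price"])
```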
Preferred Source of Data
Due to ease of readability and change management, an API returning data in JSON format is the preferred way to read data into the scraper. However, if the data is only present in HTML tags, we have no option but to proceed with a trickier scraper that may not survive every website change. In that case, we rely solely on our expertise in writing smart XPaths that can adapt to minor website changes.
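To illustrate the difference between a brittle XPath and a "smart" one, here is a short sketch using lxml; the markup, class names and paths are hypothetical:

```python
# Sketch contrasting a brittle XPath with a more resilient one,
# assuming hypothetical markup like:
#   <div class="deal-card"><span class="price">$59.95</span></div>
from lxml import html

page_source = "<html>...</html>"  # the raw HTML fetched from the site
tree = html.fromstring(page_source)

# Brittle: depends on the exact position of every ancestor tag and
# breaks as soon as any wrapper div is added, removed or reordered.
prices = tree.xpath("/html/body/div[2]/div[1]/div[3]/span/text()")

# More resilient: anchors on class names that rarely change with layout.
prices = tree.xpath('//div[contains(@class, "deal-card")]'
                    '//span[contains(@class, "price")]/text()')
```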
Let’s get started!
This article will cover the use of Chrome Developer Tools to find the APIs from which the data shown in the UI is sourced. I will also cover how you can call these APIs yourself and read the data into your preferred programming language.
You can try the steps below on any website you are interested in scraping. For the purpose of this article, I will demonstrate how to scrape the list of Limited Time Deals in Mens Athletic Shoes (https://www.ebay.com/b/Mens-Athletic-Shoes/15709/bn_57918?*_trkparms=*pageci%3A*%7Cparentrq%3A*iid%3A0). I have ensured this URL is not blocked by ebay.com/robots.txt.
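As a side note, if you want to check robots.txt programmatically rather than by eye, Python's built-in urllib.robotparser can do it; a quick sketch:

```python
# Check whether a URL is allowed by the site's robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.ebay.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

print(rp.can_fetch("*", "https://www.ebay.com/b/Mens-Athletic-Shoes/15709/bn_57918"))
```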
- Use of Chrome Developer Tools
Accessing Chrome Developer Tools is completely free. Load the website, click the 3 dots at the top right of your Chrome browser, and choose More tools → Developer tools (or simply press F12).
Once you access developer tools, you should see a dashboard on the right or bottom of your screen.
Note the various options on the top bar: Elements, Console, Sources, Network, Performance, Memory and so on. We are mainly interested in the Network tab during this part of the process.
In the Network tab, you will find a variety of URLs that load behind the website you want to scrape. If you don't see any URLs, refresh the page.
Once the URLs load, it is time to find the source of the content we want to scrape. Here, I am interested in the list of shoes in the "Limited Time Deals" section, so I am going to look for where "VERSABLAST", a term from the first item in the list, occurs.
Click the magnifying glass next to the filter icon, and a search bar will open up. Type "VERSABLAST" and the list of requests will shrink drastically to just a few options.
I found something like a key-value pair in the search results, so I clicked on it. Under the Response tab, there is a bunch of HTML tags, which is quite confusing! But in the midst of all that HTML, there appears to be a JSON object! This is my cue, and this is what I am interested in scraping. To confirm that this is the source, you can also search for any other item on the "Limited Time Deals" list; you should be able to find it here.
Once I locate the request that returns this response (which in this case is the website URL itself), I will copy the cURL command required to reproduce it: right-click the request, then Copy → Copy as cURL.
Under the Headers tab you can observe a GET request being made, with a bunch of request and response headers. The cURL command saves us from manually copying the request headers and the body of the request (in the case of a POST request). Now let us see how to leverage the copied cURL command in Postman.
- Use of Postman
Now let us replicate this API call in Postman. Head over to Postman and click the Import button on the top left-hand side.
Paste the cURL command in the Raw Text box and Continue. Click Import.
You will see that all the headers have been imported and the request type is GET. Click Send. The status should be 200 OK, which means the request was sent successfully and we have received the right response.
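The Python equivalent of this check is a one-liner on the response object; a minimal sketch with placeholder values:

```python
# Verify the request succeeded, mirroring the 200 OK check in Postman.
import requests

url = "https://www.example.com/"          # placeholder; use your copied URL
headers = {"User-Agent": "Mozilla/5.0"}   # placeholder headers

response = requests.get(url, headers=headers)
print(response.status_code)   # expect 200
response.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
```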
At this stage, here are a few things to note:
→ This tutorial covers only GET requests. POST requests also include a Body, which may have to be modified based on the information we want.
→ Sometimes not all the headers are required to make the request. You can try eliminating headers one by one until the request fails (no longer returns 200). At that point you know the minimal set of headers required for the request to work (see the sketch after these notes).
→ Getting a 200 response may not be so simple for more complex websites. If you don't receive a 200 for your website, there could be multiple reasons, including but not limited to: changing cookies, IP address location, a missing auth token, missing URL parameters, the wrong request type, and so on.
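Here is the header-elimination idea from the second note as a rough Python sketch; the URL and headers are placeholders:

```python
# Greedily drop headers one at a time, keeping only the ones the
# request actually needs to keep returning 200.
import requests

url = "https://www.example.com/"   # placeholder for the real endpoint
headers = {                        # placeholder set copied from the cURL command
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Referer": "https://www.example.com/",
}

required = dict(headers)
for name in list(headers):
    trial = {k: v for k, v in required.items() if k != name}
    if requests.get(url, headers=trial).status_code == 200:
        required = trial  # the request still works without this header
print("Minimal header set:", required)
```

Be gentle with this: it fires one request per header, so run it sparingly.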
Now, getting back.
Within the received response, copy the complete JSON part containing the data of interest. That’s right, the complete JSON, no matter how long it is.
Paste it into an online JSON formatter, such as jsonformatter.org.
Now search for VERSABLAST on the right-hand side. You may come across multiple occurrences of the term; choose the one that has the price of the shoes in the same area.
And voila! We are able to find all the information: the listing ID, the exact name of the shoes, the current price, the original struck-through price, and the % discount.
As we can see, the JSON and the UI match.
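Once the scraper is in place, this copy-and-format step can be automated. Here is a hedged sketch that pulls the embedded JSON out of the HTML response; the "listings" marker and the key names are hypothetical stand-ins for whatever you see in your own formatted JSON:

```python
# Extract the JSON embedded in the HTML response and read the fields
# we located in the formatter. The marker and keys are hypothetical.
import json
import re
import requests

url = "https://www.ebay.com/b/Mens-Athletic-Shoes/15709/bn_57918"
html_text = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# Naive regex grab; if the array nests further, json.loads will fail
# and you would locate and parse the whole enclosing object instead.
match = re.search(r'"listings"\s*:\s*(\[.*?\])', html_text, re.DOTALL)
if match:
    listings = json.loads(match.group(1))
    print(json.dumps(listings, indent=2))  # same effect as the online formatter
    for item in listings:
        print(item.get("listingId"), item.get("title"), item.get("price"))
```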
Now there is only one thing left to do before we move on to Python. Click the </> icon on the top right of the Postman window and select "Python - Requests" from the drop-down menu. You will find the Python code for making this GET request using the Requests module. While I will be using Scrapy instead of Requests for web scraping in the upcoming article, this piece of code will surely be useful for getting the headers, URL and payload without having to type it all out manually.
Copy it, head to a Jupyter notebook and paste it into one of the cells.
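For reference, the generated code typically has this shape (the headers below are illustrative placeholders, not the full set Postman exports):

```python
import requests

url = "https://www.ebay.com/b/Mens-Athletic-Shoes/15709/bn_57918"

payload = {}
headers = {
    "User-Agent": "Mozilla/5.0",  # Postman exports every header it imported
    "Accept": "text/html",
}

response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
```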
In the next article, we will cover how to use the Python module Scrapy to scrape the website using this information.
Hope this was helpful!