How to scrape data using Python?
Hi all. There was a time when I was as curious about web scraping as you are now.
So let me share how I learned web scraping with Python packages. To be precise, I used BeautifulSoup to scrape the data, and trust me, it is super easy to scrape data from the web. The one thing to keep in mind is that you should know exactly what you want to scrape.
To scrape the data, I imported these two packages:
- requests
- BeautifulSoup (if BeautifulSoup is not installed, run pip install beautifulsoup4 in your environment)
Note: The names of these modules are case sensitive.
import requests
from bs4 import BeautifulSoup
Once these are imported successfully, declare a variable that stores the URL you want to scrape. In my case, I scraped the redBus About page (https://www.redbus.com/info/aboutus). Once the URL is fixed, we request it using the get method of the requests module. Let me show you how:
URL = 'https://www.redbus.com/info/aboutus'
response = requests.get(URL)
The get method requests and retrieves the data from the specified URL. In short, requests.get(URL) sends a request to the redBus website and saves the server's response in an object, which we have named response here.
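A slightly more defensive version of this request is sketched below. The helper name fetch_page and the 10-second timeout are my own assumptions for illustration, not part of the original walkthrough; the timeout argument and raise_for_status() are standard features of requests.

```python
import requests


def fetch_page(url, timeout=10):
    """Fetch a page and return the response object.

    fetch_page and the 10-second default are illustrative choices.
    """
    # Without a timeout, a slow server can keep the script waiting indefinitely.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of parsing an error page.
    response.raise_for_status()
    return response
```

Calling fetch_page(URL) would return the same kind of response object as requests.get(URL), so the rest of the steps stay unchanged.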
To turn the fetched data into HTML that we can search, we parse it with html.parser, the HTML parser that ships with Python.
Let’s look at how it’s done.
Soup = BeautifulSoup(response.content, 'html.parser')
# If I print the soup using the prettify function,
print(Soup.prettify())
# the parsed HTML is printed as indented text, one tag per line.
As discussed, whenever we make a request to a URL, it returns a response object, which gives access to the headers, the content, and so on. Here, response.content returns the content as bytes, which BeautifulSoup then parses as HTML.
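To see the bytes-to-HTML step in isolation, here is a minimal sketch that skips the network entirely. The raw_bytes value is a hand-written stand-in for response.content, not the real redBus page.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: raw bytes as a server would send them.
raw_bytes = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# BeautifulSoup accepts bytes directly and decodes them while parsing.
soup = BeautifulSoup(raw_bytes, "html.parser")

title_text = soup.title.text      # text inside <title>
paragraph_text = soup.p.text      # text inside the first <p>

print(type(raw_bytes).__name__)   # bytes
print(title_text, "|", paragraph_text)
print(soup.prettify())            # indented view of the parsed tree
```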
Now we are just one step away from scraping the data. For this blog, I'm scraping the redBus About page mentioned above.
When you open that URL, you land on the About page. Next, open the developer tools. I'm using Chrome, where CTRL + SHIFT + I is enough to open them. Once you are there, place the cursor on the text or numerical data you want to extract to see which HTML element holds it. While scraping, I noted a few things I wanted to extract:
- About us Title
- Description of About us
- Management Team Title
- Names of the Management Team
- Description of Management Team
All the extracted HTML content is stored in the variable named Soup. From that Soup object, we want to pull out the data that is relevant to us. In our case, the first thing I want is the About us title, which sits in an element with class="Red XCN".
RedbusAbout = Soup.findAll(attrs={'class': 'Red XCN'})[0].text
In the above code snippet, I'm finding all the elements whose class is Red XCN and extracting the data I want from the Soup object. For instance, if I run this code:
Soup.findAll(attrs={'class': 'Red XCN'})
# the output will be:
# [<h3 class="Red XCN">About us</h3>,
#  <h3 class="Red XCN" id="BirdsEye">Management Team</h3>]
In our case, we only want the first element of the output, because it contains 'About us'. To extract it, we use indexing, where the first element sits at index 0.
# Here [0] is the 0th index, which extracts only the first element of the result.
Soup.findAll(attrs={'class': 'Red XCN'})[0]
# the output for this code is:
# <h3 class="Red XCN">About us</h3>
If you look closely, the above output is still HTML. To extract only the text from it, we use the .text attribute.
Soup.findAll(attrs={'class': 'Red XCN'})[0].text
# output: 'About us'
Now we save this text in the variable RedbusAbout. I believe the line below makes sense now.
RedbusAbout = Soup.findAll(attrs={'class': 'Red XCN'})[0].text
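The three steps so far (find the matching tags, pick one by index, read its text) can be tried without touching the network. In this sketch, sample_html is a hand-written copy of the two headings from the output shown earlier, not the live page. It uses find_all, the current spelling of the method; findAll is the older alias for the same thing.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the two headings from the redBus output.
sample_html = (
    '<h3 class="Red XCN">About us</h3>'
    '<h3 class="Red XCN" id="BirdsEye">Management Team</h3>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: every tag whose class attribute is "Red XCN".
headings = soup.find_all(attrs={"class": "Red XCN"})

# Step 2: pick one tag by index (still an HTML Tag object).
first_heading = headings[0]

# Step 3: keep only the text inside the tag.
about_title = first_heading.text
management_title = headings[1].text

print(first_heading)
print(about_title, "|", management_title)
```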
Similarly, we extracted all the other data we wanted:
AboutRedBus = Soup.findAll(attrs={'class': 'western'})[0].text
ManagementTitle = Soup.findAll(attrs={'class': 'Red XCN'})[1].text
Name1 = Soup.findAll(attrs={'class': 'Red TextBold XCN'})[0].text[0:14]
Name2 = Soup.findAll(attrs={'class': 'Red TextBold XCN'})[1].text[0:11]
Name1Desc = Soup.findAll(attrs={'class': 'western'})[1].text
Name2Desc = Soup.findAll(attrs={'class': 'western'})[2].text
lastline = Soup.findAll('div', attrs={'class': False, 'id': False})[3].text.replace('\n', '')
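One thing to watch with lookups like these: a fixed index such as [2] raises an IndexError as soon as the page has fewer matching tags than expected. A small guard is sketched below. The helper text_at is hypothetical (my own name, not a BeautifulSoup function), and sample_html is invented test data.

```python
from bs4 import BeautifulSoup


def text_at(soup, css_class, index):
    """Return the text of the index-th tag with the given class, or None.

    text_at is an illustrative helper, not part of BeautifulSoup.
    """
    matches = soup.find_all(attrs={"class": css_class})
    if not 0 <= index < len(matches):
        return None  # the page has fewer matches than we expected
    # get_text(strip=True) also trims surrounding whitespace.
    return matches[index].get_text(strip=True)


# Invented sample with two paragraphs of class "western".
sample_html = (
    '<p class="western"> First paragraph </p>'
    '<p class="western">Second paragraph</p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

first = text_at(soup, "western", 0)
second = text_at(soup, "western", 1)
missing = text_at(soup, "western", 5)   # no such tag, so None instead of an error
```

Returning None keeps the script running when the layout changes, at the cost of having to check for None afterwards.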
Now try to do it on your own. If you get stuck, follow my YouTube video, which demonstrates the same steps.
Now, if you look at the variable lastline, the element it targets has neither a class nor an id. In that case, set those attributes to False: 'class': False means the class attribute is not present, and 'id': False means the id attribute is not present. The 'div' argument restricts the search to div tags. In short, the whole line says: from Soup, find all the div tags that have no class and no id attribute. Once we found them, we took the one at index [3] and replaced every '\n' (newline character) with '' (an empty string), which removes the line breaks from the text.
lastline = Soup.findAll('div',attrs= {'class': False, 'id' : False})[3].text.replace('\n', '')
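Here is a small sketch of that filter on invented HTML, so you can see which div tags survive. The sample_html string and the is_plain_div function are my own illustration; the function version is only there to show the same condition spelled out explicitly.

```python
from bs4 import BeautifulSoup

# Invented sample: two divs carry a class or an id, two carry neither.
sample_html = (
    '<div class="intro">Intro</div>'
    '<div id="footer">Footer</div>'
    "<div>Line one\nLine two</div>"
    "<div>Plain</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# False means "this attribute must be absent", as in the lastline example.
plain_divs = soup.find_all("div", attrs={"class": False, "id": False})


def is_plain_div(tag):
    """Same condition written out: a div with no class and no id attribute."""
    return tag.name == "div" and not tag.has_attr("class") and not tag.has_attr("id")


plain_divs_by_function = soup.find_all(is_plain_div)

# Remove the newline characters from each remaining div's text.
cleaned = [div.text.replace("\n", "") for div in plain_divs]
print(cleaned)
```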