Web Scraping Using Python
What is Web Scraping?
Web scraping is a method to extract data from websites. Using web scraping, we can turn unstructured website data into structured data. There are different ways to scrape websites; here, we'll do web scraping with Python.
In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.
Requirement for Web Scraping
- Selenium
- Beautiful Soup
- Requests
- lxml
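Requests and lxml (listed above) are typically enough for static pages that don't need a browser. As a minimal sketch, here lxml parses a sample HTML string directly, so the example runs without any network access; the class names and products are made up for illustration:

```python
from lxml import html

# Sample HTML standing in for a fetched page (e.g. requests.get(url).text).
sample = """
<html><body>
  <div class="product"><span class="name">Phone A</span><span class="price">9999</span></div>
  <div class="product"><span class="name">Phone B</span><span class="price">12499</span></div>
</body></html>
"""

tree = html.fromstring(sample)
# XPath pulls the text of each name and price span.
names = tree.xpath('//div[@class="product"]/span[@class="name"]/text()')
prices = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
print(list(zip(names, prices)))
```

Selenium is only needed when the page builds its content with JavaScript, as Flipkart does.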
Installation
We have to install the following libraries and packages for web scraping. Run these commands in a terminal (in a notebook such as Google Colab, prefix each with `!`; the `apt` line installs the Chrome driver on Debian-based systems):
pip install selenium
apt install chromium-chromedriver
pip install beautifulsoup4
Import Libraries
We have to import the following libraries.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
Website URL
Pass the product search URL to the driver (created in the Webdriver section below) and parse the page source:
driver.get("https://www.flipkart.com/search?q=mi+mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_2_na_na_na&as-pos=1&as-type=RECENT&suggestionId=mi+mobiles%7CMobiles&requestId=1fc15655-826d-4cb0-8cd9-cd9660380f80&as-backfill=on")
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
Webdriver
To load the page content, Selenium needs a web driver for the browser we are using. The options below run Chrome headless (without a visible window):
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
Fetching data from the website
First, create empty lists to hold the data scraped from the website:
products = []  # List to store the names of the products
prices = []    # List to store the prices of the products
features = []  # List to store the ratings/features of the products
Now run a loop to fetch all the data and append it to the lists. To find the right elements, inspect the webpage, note the class name of the relevant div tag, and use that class name in the code.
for a in soup.findAll('a', href=True, attrs={'class': '_1fQZEK'}):
    name = a.find('div', attrs={'class': '_4rR01T'})
    price = a.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
    feature = a.find('div', attrs={'class': 'fMghEO'})
    products.append(name.text)
    prices.append(price.text)
    features.append(feature.text)
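One caveat: Flipkart's class names change periodically, and `find` returns `None` when a class is missing, so `name.text` would raise an `AttributeError` mid-scrape. A defensive sketch, run here against a small static snippet (which reuses the class names from the loop above; the products and the missing-price card are made up for illustration):

```python
from bs4 import BeautifulSoup

# Static snippet standing in for driver.page_source; the second card
# deliberately has no price div to exercise the None guard.
html_doc = """
<a class="_1fQZEK" href="#">
  <div class="_4rR01T">Mi 11X</div>
  <div class="_30jeq3 _1_WHN1">Rs. 29,999</div>
</a>
<a class="_1fQZEK" href="#">
  <div class="_4rR01T">Mi 10T</div>
</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")
rows = []
for a in soup.find_all("a", href=True, attrs={"class": "_1fQZEK"}):
    name = a.find("div", attrs={"class": "_4rR01T"})
    price = a.find("div", attrs={"class": "_30jeq3 _1_WHN1"})
    # Fall back to "N/A" instead of crashing when a tag is missing.
    rows.append((
        name.text if name else "N/A",
        price.text if price else "N/A",
    ))
print(rows)
```

This keeps the three lists the same length even when individual product cards are incomplete.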
Data Frame
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Feature': features})
print(df.head(10))
Convert the data into a CSV file
df.to_csv('products.csv', index=False, encoding='utf-8')
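To check that the export worked, you can read the file back with `pd.read_csv`. The sketch below uses a small sample DataFrame and a temporary path rather than real scraped data:

```python
import os
import tempfile

import pandas as pd

# Sample data standing in for the scraped lists.
df = pd.DataFrame({
    "Product Name": ["Mi 11X", "Mi 10T"],
    "Price": ["Rs. 29,999", "Rs. 19,999"],
    "Feature": ["8 GB RAM", "6 GB RAM"],
})

# Write to a temporary file, mirroring df.to_csv('products.csv', ...).
path = os.path.join(tempfile.mkdtemp(), "products.csv")
df.to_csv(path, index=False, encoding="utf-8")

# Reading it back confirms the columns and rows survived the round trip.
restored = pd.read_csv(path)
print(restored.shape)
```

`index=False` keeps the pandas row index out of the file, so the CSV columns match the DataFrame columns exactly.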