Web scraping from Wikipedia pages using Python

Learn the basics of web scraping from scratch and apply them to a real scenario.

Web scraping is the automated process of extracting information from web pages.

Introduction to Web scraping and Python

Steps in Web scraping

A brief list of Python libraries used for web scraping

Practical Implementation: Scraping Wikipedia

Step 1: How to use Python for web scraping?

pip install virtualenv
python -m pip install selenium
python -m pip install requests
python -m pip install urllib3
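
After running the install commands above, it can help to confirm that the libraries are actually importable in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative choices, not part of any of these packages.

```python
from importlib.util import find_spec

# Import names for the libraries installed above (illustrative list).
REQUIRED_MODULES = ["selenium", "requests", "urllib3"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as not installed, rerun the matching `python -m pip install` command inside the same environment.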

Step 2: Introduction to Requests library

URL: https://en.wikipedia.org/wiki/Main_Page
import requests
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
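page  # in a notebook this displays the Response object, e.g. <Response [200]>
page.status_code  # 200 means the request succeeded
page.content  # raw HTML of the page, as bytes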
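
The calls above can be wrapped in a small helper that fails loudly when the request does not succeed. This is a sketch under a few assumptions: the function name `fetch_page`, the 10-second timeout, and the `USER_AGENT` string are all hypothetical choices made for illustration. The optional `session` argument only exists so the function can be exercised without network access.

```python
import requests

# Hypothetical identification string; replace it with your own details.
USER_AGENT = "wiki-scraping-tutorial/0.1 (contact: you@example.com)"


def fetch_page(url, session=None, timeout=10):
    """Download a page and return its HTML as text, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_page("https://en.wikipedia.org/wiki/Main_Page")
# print(html[:200])
```

Setting a timeout keeps the script from hanging indefinitely if the server never answers, and checking the status before parsing avoids feeding an error page into the next step.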

Step 3: Introduction to Beautiful Soup for page parsing

python -m pip install beautifulsoup4

Code Walk-Through:

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
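
To see what the parser produces without downloading anything, the same calls can be tried on a small HTML string. This is a minimal sketch; `SAMPLE_HTML` is a made-up document used only for illustration, not real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Welcome</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute access such as soup.title returns the first tag with that name.
title_text = soup.title.get_text()
heading_text = soup.h1.get_text()

print(title_text)
print(heading_text)
print(soup.prettify())  # the parsed tree, re-indented for reading
```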

Step 4: Digging deep into Beautiful Soup further

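list(soup.children)  # top-level nodes of the parsed document
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')  # every <p> tag on the page
soup.find_all('p')[0].get_text()  # text of the first <p> tag
soup.find('p')  # only the first <p> tag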
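
The difference between `find_all` and `find` is easier to see on a tiny document. The sketch below uses a made-up HTML string (`SAMPLE_HTML` is an assumption for illustration) and also shows what `find` returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
SAMPLE_HTML = (
    "<html><body>"
    "<p>First paragraph.</p>"
    "<p>Second <b>paragraph</b>.</p>"
    "</body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns every matching tag in document order.
paragraphs = soup.find_all("p")
texts = [p.get_text() for p in paragraphs]

# find() returns only the first match, or None when nothing matches.
first = soup.find("p")
no_table = soup.find("table")

# children yields the direct child nodes; here that is the single <html> tag.
top_level_names = [child.name for child in soup.children]

print(texts)
print(first.get_text())
print(no_table)
print(top_level_names)
```

Because `find` can return `None`, it is worth checking the result before calling methods such as `get_text()` on it.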

Step 5: Exploring page structure with Chrome Dev tools and extracting information

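page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, 'html.parser')
# Element with id "mp-left", as identified in Chrome DevTools
container = soup.find(id="mp-left")
# Elements with class "mp-h2" inside that container
items = container.find_all(class_="mp-h2")
result = items[0]
print(result.prettify())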
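
Page markup changes over time, so a lookup by `id` or class can come back empty. The sketch below runs the same id-then-class lookup on a heavily simplified, hypothetical stand-in for the Main Page markup and returns an empty list instead of crashing when the container is missing. The function name `section_headings` and the sample headings are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the Main Page markup.
SAMPLE_HTML = """
<div id="mp-left">
  <h2 class="mp-h2">From today's featured article</h2>
  <p>Article summary goes here.</p>
  <h2 class="mp-h2">Did you know ...</h2>
</div>
"""


def section_headings(html, container_id="mp-left", heading_class="mp-h2"):
    """Return the text of each heading inside the given container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # The container was not found, for example after a page redesign.
        return []
    matches = container.find_all(class_=heading_class)
    return [tag.get_text(strip=True) for tag in matches]


headings = section_headings(SAMPLE_HTML)
print(headings)
```

Passing the HTML in as a string keeps the parsing logic separate from the download, so it can be tried on saved pages as well as live ones.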

Results

Conclusion and Digging deeper into Web scraping

Although web scraping has many legitimate uses, it can also be misused. Unethical practitioners can collect data just as easily and exploit it for their own ends, which creates a real risk for the companies and organizations whose data is scraped.

Uses of web scraping

I will also be publishing this article on GeeksforGeeks soon.

Thank You! ❤

Junior @NITP🌍 ❯ Intern @Dataly ❯ Innovations Lead @dscnitp ❯ Projects Head @hackslash-nitp ❯ 🙅OSH Mentor @anitab-org ❯ ASI @alexa-dev-hub ❯ Mentor @OpenMined