Web Scraping By Python




How to Set Up the Scraping Project

Our setup is pretty simple. Just create a folder and install Beautiful Soup, pandas, and requests. I am assuming that you have already installed Python 3.x. To create the folder and install the libraries, enter the commands given below.

    mkdir scraper
    pip install beautifulsoup4
    pip install requests
    pip install pandas

With Python tools like Beautiful Soup, you can scrape and parse data directly from web pages to use in your projects and applications.


Reaching as many potential clients as possible is very important for most startups, because it lets them generate better leads. One of the easiest ways to build a good client base is to collect as many business email addresses as possible and send them your service details regularly.

There are many scraping tools on the internet that provide this service for free, but they limit how much data you can extract. They also offer unlimited data extraction, but only on paid plans. Why pay them when you can build one with your own hands?


This article will demonstrate how easy it is to build a simple web crawler in Python. Although it is a very simple example, it will be a learning experience for beginners, especially those who are new to web scraping. This step-by-step tutorial will help you get email addresses without any limits.

Let’s start building our intelligent web scraper. I will divide the code into pieces and comment on what is going on in each, so that you can get a deeper insight into how the whole process works. I will also share the entire code at the end of the post so that you can analyze the whole process.

Step 1: Importing Modules

We will be using the following six modules for our project.

The details of the imported modules are given below:

  1. re for regular expression matching.
  2. requests for sending HTTP requests.
  3. urlsplit for dividing URLs into their component parts.
  4. deque, a list-like container that supports appending and popping at either end.
  5. BeautifulSoup for pulling data out of the HTML of different web pages.
  6. pandas for putting the emails into a DataFrame for further operations.
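
The six imports described above can be sketched as follows. This assumes the usual locations of these names: urlsplit lives in urllib.parse, deque in collections, and BeautifulSoup in the bs4 package installed by beautifulsoup4. The pd alias is only a common convention.

```python
import re                          # regular expression matching
from collections import deque      # double-ended queue of URLs to visit
from urllib.parse import urlsplit  # split a URL into its component parts

import pandas as pd                # tabulate and export the emails
import requests                    # send HTTP requests
from bs4 import BeautifulSoup      # parse the HTML of each page
```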

Step 2: Initializing Variables

In this step, we will initialize a deque to hold the unscraped URLs, a set for the URLs already scraped, and a set for the emails scraped successfully from the websites.


Duplicate elements are not allowed in a set, so they are all unique.
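
A minimal sketch of this initialization is shown below. The variable names and the example.com starting address are assumptions made for illustration; in the article's code the website address is entered by the user.

```python
from collections import deque

# Hypothetical starting point, standing in for the address the user enters.
start_url = "https://example.com/"

unscraped = deque([start_url])  # URLs waiting to be visited
scraped = set()                 # URLs already visited
emails = set()                  # email addresses found so far

emails.add("info@example.com")
emails.add("info@example.com")  # adding a duplicate leaves the set unchanged
print(len(emails))  # 1
```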

Step 3: Starting the Scraping Process

  1. The first step is to distinguish between the scraped and unscraped URLs. To do this, we move a URL from unscraped to scraped.
  2. The next step is to extract the different parts of the URL. For this purpose, we will use urlsplit.
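
The first of these steps, moving a URL from unscraped to scraped, might look like the sketch below. The two example.com URLs are placeholders, and taking URLs from the left end of the deque (oldest first) is an assumption about the visiting order.

```python
from collections import deque

unscraped = deque(["https://example.com/", "https://example.com/about"])
scraped = set()

url = unscraped.popleft()  # take the oldest URL off the queue
scraped.add(url)           # and record it as visited

print(url)              # https://example.com/
print(list(unscraped))  # ['https://example.com/about']
```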

urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).

I can’t show sample inputs and outputs for urlsplit() for confidentiality reasons, but when you run the code, it will ask you to enter a value (the website address). The output will display a SplitResult() with five attributes inside it.

This will allow us to get the base and path part for the website URL.
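
As an illustration, here is what urlsplit() gives for a made-up address on the reserved example.com domain, used as a stand-in for a real site. The names base_url and directory are hypothetical; they show one way the base and path parts could be derived.

```python
from urllib.parse import urlsplit

url = "https://example.com/blog/post.html?page=2#comments"
parts = urlsplit(url)
print(parts)
# SplitResult(scheme='https', netloc='example.com', path='/blog/post.html', query='page=2', fragment='comments')

# Base of the site: scheme plus network location.
base_url = f"{parts.scheme}://{parts.netloc}"

# Directory of the current page: the path up to and including its last "/".
directory = base_url + parts.path[: parts.path.rfind("/") + 1]

print(base_url)   # https://example.com
print(directory)  # https://example.com/blog/
```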

  3. Now it is time to send the HTTP GET request to the website.
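
A minimal sketch of the request step, assuming we simply skip any page that cannot be fetched. The helper name fetch_html and the 10-second timeout are illustrative choices, not taken from the article.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx and 5xx responses as failures
    except requests.exceptions.RequestException:
        # Covers malformed URLs, connection errors, timeouts, and HTTP errors.
        return None
    return response.text
```

Returning None instead of raising lets the crawl loop move on to the next URL in the queue when a page is unreachable.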
  4. To extract the email addresses, we will use a regular expression and then add the matches to the email set.

Regular expressions are of massive help when you want to extract the information of your own choice. If you are not comfortable with them, you can have a look at Python RegEx for more details.
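
The pattern below is one simple way to pick email addresses out of page text. It is a deliberately loose sketch and not necessarily the expression used in the article's code; it requires a dot in the domain and does not try to cover every address the email standards allow.

```python
import re

# Local part, "@", then a domain that ends in a dot and an alphabetic suffix.
EMAIL_RE = re.compile(
    r"[a-z0-9._+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}",
    re.IGNORECASE,
)

text = "Contact sales@example.com or Support@Example.org. Not an email: user@localhost"

emails = set()
# Lowercase each match so the same address in different cases is stored once.
emails.update(match.lower() for match in EMAIL_RE.findall(text))

print(sorted(emails))  # ['sales@example.com', 'support@example.org']
```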

  5. The next step is to find all the URLs linked from the website.

The <a href=""> tag indicates a hyperlink, so it can be used to find all the linked URLs in the document.

Then we will add each new URL to the unscraped queue if it is in neither the scraped set nor the unscraped queue.

When you try the code on your own, you will notice that not all of the links can be scraped, so we also need to exclude those.
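
The link-finding step can be sketched as below, using a small hand-written HTML snippet in place of a downloaded page. Two choices here are assumptions: relative links are resolved with urljoin from the standard library, and the links excluded are those whose scheme is not http or https (mailto: links, for example). The article's own code may resolve and filter links differently.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup

page_url = "https://example.com/blog/"
html = """
<a href="/about">About</a>
<a href="post.html">Post</a>
<a href="mailto:info@example.com">Mail</a>
<a href="https://example.com/blog/">This page</a>
<a>No href</a>
"""

scraped = {page_url}  # the current page has already been visited
unscraped = deque()

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a", href=True):  # only <a> tags that have an href
    link = urljoin(page_url, anchor["href"])  # resolve relative links
    if urlsplit(link).scheme not in ("http", "https"):
        continue  # assumption: skip links that cannot be fetched over HTTP
    if link not in scraped and link not in unscraped:
        unscraped.append(link)

print(list(unscraped))
# ['https://example.com/about', 'https://example.com/blog/post.html']
```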


Step 4: Exporting Emails to a CSV file


To analyze the results more easily, we will export the emails to a CSV file.
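
A minimal sketch of the export with pandas is shown below. The file name email.csv, the column name Email, and the two sample addresses are placeholders.

```python
import pandas as pd

# Placeholder data standing in for the set built during the crawl.
emails = {"sales@example.com", "support@example.org"}

# Sort the set so the rows come out in a stable order, then write one column.
df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv("email.csv", index=False)
```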

If you are using Google Colab, you can download the file to your local machine by calling files.download() from the google.colab module with the name of the CSV file.

As already explained, I can’t show the scraped email addresses due to confidentiality issues.

[Disclaimer! Some websites don’t allow web scraping, and they have very intelligent bots that can permanently block your IP, so scrape at your own risk.]

Complete Code
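
The listing below is a compact sketch that ties the four steps together; it is an illustration under stated assumptions and not the article's original listing. The function names, the email pattern, the max_pages cap, the use of urljoin, and the http/https filter are all choices made for this sketch. The fetch parameter exists so that the crawl logic can be tried on canned pages without touching the network.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlsplit

import pandas as pd
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(
    r"[a-z0-9._+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}",
    re.IGNORECASE,
)


def fetch_html(url):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException:
        return None
    return response.text


def crawl(start_url, max_pages=50, fetch=fetch_html):
    """Crawl outward from start_url and return the set of emails found."""
    unscraped = deque([start_url])  # URLs waiting to be visited
    scraped = set()                 # URLs already visited
    emails = set()

    while unscraped and len(scraped) < max_pages:
        # Step 3.1: move a URL from unscraped to scraped.
        url = unscraped.popleft()
        scraped.add(url)

        # Step 3.3: request the page; skip it if it cannot be fetched.
        html = fetch(url)
        if html is None:
            continue

        # Step 3.4: collect the email addresses on the page.
        emails.update(match.lower() for match in EMAIL_RE.findall(html))

        # Step 3.5: queue every linked URL that has not been seen yet.
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlsplit(link).scheme not in ("http", "https"):
                continue
            if link not in scraped and link not in unscraped:
                unscraped.append(link)

    return emails


def save_emails(emails, path="email.csv"):
    """Step 4: write the emails to a one-column CSV file."""
    pd.DataFrame(sorted(emails), columns=["Email"]).to_csv(path, index=False)
```

To use it, call crawl() with the address of the site you want to start from and pass the result to save_emails(). Because this sketch follows links to other sites as well, max_pages keeps a run bounded; raise it if you want a longer crawl.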

Wrapping Up

In this article, we have explored one more wonder of web scraping through a practical example of scraping email addresses. We took the most intelligent approach by building our own web crawler with Python and its easy yet powerful library, BeautifulSoup. Web scraping can be of massive help if done properly with your requirements in mind. Although the code we have written for scraping email addresses is very simple, it is totally free of cost, and you don’t need to rely on other services for it. I tried my level best to simplify the code as much as possible and also left room for customization so that you can optimize it for your own requirements.


If you are looking for proxy services to use during your scraping projects, don’t forget to look at ProxyScrape residential and premium proxies.


That was all for this article. See you in the next ones!




