Let’s use Python web scraping to download MP3s

Everyone loves to listen to music. Sometimes you want to download a favorite song or a whole favorite album as MP3 files. Downloading hundreds of MP3 files manually is… tiresome.

One day while I was downloading my favorite songs from an MP3 website, a question popped up in my mind: Why am I downloading all these files manually? That’s when I started thinking of writing an automation script for this.

It sounded like fun work to me, as I was working on some web scraping projects at the time. The idea was to take an artist name as input, scrape the site's HTML source for that artist's MP3 links, and download the files onto my HDD into a separate folder named after the artist. Let’s dive in and understand each step.

First Things First

We need one extra Python library, BeautifulSoup, which we must install before we move further. The following command installs it.

pip install beautifulsoup4

Now we are all set. Let’s dig in.

Import the Required Libraries into Your Python Program

From the library we just installed, we import the BeautifulSoup class into our program. We use it to parse the HTML content of the site URL.

from bs4 import BeautifulSoup

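Then we need to import another library to handle the HTTP requests. For that, we use the requests library. It is a third-party package rather than part of Python itself, so install it with pip install requests if you do not have it yet.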

import requests

We have now imported the libraries we need to scrape the site and download the MP3 files. We need two more modules from the Python standard library: os, to work with the file system when we write our MP3 files, and sys, to read the command-line input.

import os
import sys

Now we have all the libraries and modules imported into our program.

We’ll Set a Few Variables

We need the user to input the artist’s name in order to download the MP3 files for that artist, so we need a string variable that accepts command-line input. The sys module provides sys.argv, a list of strings holding the command-line arguments. We need to provide the correct index to read the user input. Index 0 is the script name, so the artist name is at index 1. Reading sys.argv[1] raises an IndexError when no argument is given, so we fall back to an empty string in that case.

artist_name = sys.argv[1] if len(sys.argv) > 1 else ''
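
To make the indexing concrete, here is a minimal sketch of how the argument list is laid out. The helper name read_artist_name and the script name downloader.py are made up for illustration, and plain lists stand in for sys.argv so the snippet runs without any command-line input.

```python
def read_artist_name(argv):
    """Return the first command-line argument, or None if it is missing or blank."""
    # argv[0] is always the script name, so the first real argument is argv[1].
    if len(argv) < 2:
        return None
    name = argv[1].strip()
    return name or None


# Simulated argument lists; in the real script you would pass sys.argv instead.
with_name = read_artist_name(['downloader.py', 'amaradeva'])
without_name = read_artist_name(['downloader.py'])

print(with_name)     # amaradeva
print(without_name)  # None
```

Returning None for a missing or blank name lets the caller test for it before building any URL.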

Now we need four more string variables: one for the directory name, one for the URL with the artist name, one for the file extension we are going to download, and one for the file name.

dirname = artist_name
url = 'https://www.sinhalasongs.lk/sinhala-songs-download/' + artist_name + '/'
# The URL now looks like this: https://www.sinhalasongs.lk/sinhala-songs-download/amaradeva/
ext = 'mp3'
filename = ''
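
Plain string concatenation works for a simple name like amaradeva, but a name with a space or a slash would produce a broken URL. The sketch below shows one way to guard against that with the standard library. The helper build_artist_url is my own illustrative name, and whether the site actually accepts percent-encoded names is an assumption I have not verified.

```python
from urllib.parse import quote

BASE_URL = 'https://www.sinhalasongs.lk/sinhala-songs-download/'


def build_artist_url(artist_name):
    """Return the listing URL for an artist, percent-encoding unsafe characters."""
    # safe='' makes quote() encode '/' too, so the name stays a single path segment.
    segment = quote(artist_name.strip(), safe='')
    return BASE_URL + segment + '/'


simple_url = build_artist_url('amaradeva')
encoded_url = build_artist_url('a b/c')

print(simple_url)   # https://www.sinhalasongs.lk/sinhala-songs-download/amaradeva/
print(encoded_url)  # https://www.sinhalasongs.lk/sinhala-songs-download/a%20b%2Fc/
```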

Check Validity of the User Input

Using a simple if block, we can check whether the user has entered the artist name. If our artist_name variable is not empty, the user has provided the artist name. If the user has not provided any input, the program notifies the user with the message “An artist name must be provided.” and terminates.

if artist_name:
   # We will write our scraping code here.
   pass
else:
   print('An artist name must be provided.')
   sys.exit(1)
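
If you would rather not check the input by hand, the standard library's argparse module can do the validation for you. This is an alternative sketch, not what the script above uses. The function name parse_args and the help texts are my own choices.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argparse exits with a usage message if the name is missing."""
    parser = argparse.ArgumentParser(description='Download MP3 files for one artist.')
    # A positional argument is required by default.
    parser.add_argument('artist_name', help='artist name as it appears in the site URL')
    return parser.parse_args(argv)


# Passing a list simulates the command line; with argv=None argparse reads sys.argv[1:].
args = parse_args(['amaradeva'])
print(args.artist_name)  # amaradeva
```

With this approach, running the script without an argument prints a usage message and exits with a non-zero status, so the if/else block is no longer needed.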

Read and Parse the HTML

Using the requests.get() method, we fetch the URL, and then we take the response's .text so that we can give it as input to our HTML parser (BeautifulSoup).

page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
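
To see what the parser hands back without touching the network, here is a small sketch that parses an inline HTML string instead of a downloaded page. The HTML and the example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the text returned by requests.get(url).text.
SAMPLE_PAGE = """
<html>
  <head><title>Artist list</title></head>
  <body>
    <a href="https://example.com/artist/amaradeva">Amaradeva</a>
    <a name="top">Anchor without an href</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_PAGE, 'html.parser')

title = soup.title.string                  # text inside the <title> tag
anchors = soup.find_all('a')               # every <a> tag, with or without an href
hrefs = [a.get('href') for a in anchors]   # .get() returns None when href is missing

print(title)  # Artist list
print(hrefs)  # ['https://example.com/artist/amaradeva', None]
```

Note the None in the result: an anchor is not guaranteed to have an href, which matters as soon as we start calling string methods on it.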

Find all the Anchor Tags and Loop

Now we have our HTML code. What we need to do is find all the <a> tags on the page. This is where we find the href attribute, which holds the URL of the artist’s home page. Once we collect all the <a> tags, we loop over them with a condition that finds the URLs containing the artist’s name. Some anchors have no href at all, so we skip those. When we find an anchor whose href contains the artist name, we fetch that URL and repeat the Read and Parse steps on it. At that point we have navigated to the artist’s home page, where all the MP3s for that artist are listed.

for node in soup.find_all('a'):
   href = node.get('href')
   # Skip anchors without an href; follow only links that contain the artist name.
   if href and artist_name in href:
      url2 = href + '/'
      page2 = requests.get(url2).text
      soup2 = BeautifulSoup(page2, 'html.parser')
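
The filtering condition can also be pulled out into a small function that works on plain href strings, which makes it easy to try out without a live page. This is a sketch under my own assumptions: the helper name artist_links is hypothetical, the example.com URLs are made up, and resolving relative links with urljoin is an addition the original loop does not do.

```python
from urllib.parse import urljoin


def artist_links(hrefs, artist_name, base_url):
    """Return absolute, de-duplicated links whose href contains artist_name."""
    seen = set()
    links = []
    for href in hrefs:
        # Anchors without an href attribute yield None; skip them and non-matching links.
        if not href or artist_name not in href:
            continue
        # urljoin keeps absolute URLs as they are and resolves relative ones against base_url.
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample_hrefs = [
    'https://example.com/artist/amaradeva',
    None,
    '/songs/amaradeva-hits',
    '/about',
    'https://example.com/artist/amaradeva',
]
found = artist_links(sample_hrefs, 'amaradeva', 'https://example.com/list/')
print(found)
```

De-duplicating here avoids fetching the same artist page twice when a site links to it from several places.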

Loop all the Anchor Tags in Artist Home Page

Now that we have the HTML source of the artist’s home page, we need to find the <a> tags that contain a link to an MP3 file. Once we find one, we create the folder named after the artist (if it does not exist yet) and download the MP3 file into it. Here I have used a try block to catch any request or file errors while downloading and ignore them, moving on to the next URL. We do not want our program to stop because of one broken link to an MP3 file.

for node2 in soup2.find_all('a'):
   href2 = node2.get('href')
   # Skip anchors without an href and links that are not MP3 files.
   if not href2 or not href2.endswith(ext):
      continue
   try:
      # Create the artist folder if it does not exist yet.
      os.makedirs(dirname, exist_ok=True)
      # The file name is the last segment of the URL.
      filename = href2.split('/')[-1]
      r = requests.get(href2, allow_redirects=True)
      r.raise_for_status()
      # Write the MP3 into the artist folder, relative to the current directory.
      with open(os.path.join(dirname, filename), 'wb') as f:
         f.write(r.content)
   except (requests.RequestException, OSError):
      # Ignore a broken link or a failed write and move on to the next file.
      continue
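
The two file-handling steps, picking a file name out of the link and writing the bytes into the artist folder, can be tried on their own without downloading anything. The sketch below uses only the standard library. The helper names filename_from_url and save_bytes, the example.com URL, and the fake payload are all made up for illustration.

```python
import os
import posixpath
import tempfile
from urllib.parse import unquote, urlsplit


def filename_from_url(file_url):
    """Return the last path segment of a URL, with percent-escapes decoded."""
    path = urlsplit(file_url).path            # drops the query string and fragment
    # Decode first, then take the basename, so an encoded '/' cannot smuggle in a directory.
    return posixpath.basename(unquote(path))


def save_bytes(directory, filename, data):
    """Write data to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, filename)
    with open(target, 'wb') as f:
        f.write(data)
    return target


name = filename_from_url('https://example.com/files/My%20Song.mp3?download=1')
print(name)  # My Song.mp3

# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(os.path.join(tmp, 'amaradeva'), name, b'fake mp3 bytes')
    saved_ok = os.path.isfile(saved)
print(saved_ok)  # True
```

Building the path with os.path.join instead of changing the working directory means the script works from any folder, not just one fixed location.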

Other Sites

While trying to download MP3s from another website, I realized that its HTML source was structured differently, so the links had to be handled differently. Since I had already parsed one site, I knew what to look for in the source, and I only had to make a few tweaks to the program to make it work.
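
I won't reproduce those tweaks here, but one way to keep per-site differences in one place is a small table of rules. Everything in this sketch is hypothetical: the SiteRule class, the site keys, the example.com and example.org URLs, and the link patterns are invented to show the idea and do not describe any real site.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SiteRule:
    """Hypothetical per-site settings; none of these values come from a real site."""
    listing_url: str                     # template with an {artist} placeholder
    is_mp3_link: Callable[[str], bool]   # decides whether an href points at an MP3


SITE_RULES = {
    # A site that links straight to files ending in .mp3.
    'direct-links': SiteRule(
        listing_url='https://example.com/artists/{artist}/',
        is_mp3_link=lambda href: href.lower().endswith('.mp3'),
    ),
    # A site that serves files through a download endpoint with a query string.
    'query-links': SiteRule(
        listing_url='https://example.org/music?artist={artist}',
        is_mp3_link=lambda href: 'file=' in href and '.mp3' in href.lower(),
    ),
}


def mp3_links(site, hrefs):
    """Keep only the hrefs that the chosen site's rule recognises as MP3 links."""
    rule = SITE_RULES[site]
    return [h for h in hrefs if h and rule.is_mp3_link(h)]


direct = mp3_links('direct-links', ['https://example.com/a.MP3', None, '/about'])
query = mp3_links('query-links', ['https://example.org/get?file=a.mp3',
                                  'https://example.org/a.mp3'])
print(direct)
print(query)
```

Adding support for another site then means adding one entry to the table rather than editing the download loop.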

And there it was! My very own MP3 downloading web scraping tool. Why waste hours downloading files manually when you can copy-paste a link and let Python do its magic?

Here’s what my overall code looked like:

from bs4 import BeautifulSoup
import requests
import os
import sys


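# Fall back to an empty string when no artist name is given.
artist_name = sys.argv[1] if len(sys.argv) > 1 else ''
dirname = artist_name

url = 'https://www.sinhalasongs.lk/sinhala-songs-download/' + artist_name + '/'
ext = 'mp3'
filename = ''

if artist_name:
   page = requests.get(url).text
   soup = BeautifulSoup(page, 'html.parser')
   for node in soup.find_all('a'):
      href = node.get('href')
      # Skip anchors without an href; follow only links that contain the artist name.
      if href and artist_name in href:
         url2 = href + '/'
         page2 = requests.get(url2).text
         soup2 = BeautifulSoup(page2, 'html.parser')
         for node2 in soup2.find_all('a'):
            href2 = node2.get('href')
            # Skip anchors without an href and links that are not MP3 files.
            if not href2 or not href2.endswith(ext):
               continue
            try:
               # Create the artist folder if it does not exist yet.
               os.makedirs(dirname, exist_ok=True)
               # The file name is the last segment of the URL.
               filename = href2.split('/')[-1]
               r = requests.get(href2, allow_redirects=True)
               r.raise_for_status()
               # Write the MP3 into the artist folder, relative to the current directory.
               with open(os.path.join(dirname, filename), 'wb') as f:
                  f.write(r.content)
            except (requests.RequestException, OSError):
               # Ignore a broken link or a failed write and move on to the next file.
               continue
else:
   print('An artist name must be provided.')
   sys.exit(1)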

Want to try it? Feel free to fork, clone, and star it on my GitHub. Have ideas to improve it? Create a pull request!

GitHub Link: https://github.com/cry4rock/Mp3FileDownloader