Deep Learning Music with Python

Music is a primary expression of our humanity. But like anything in the natural and man-made world, it’s made up of patterns that obey a set of laws. In this tutorial we will see how to set up a Deep Learning network in Python. Using a case study of recognising a composer’s ‘musical signature’ from a few bars of music, you will learn some tips and tricks for designing your own effective machine learning data sets. Let’s get started!


Important: The code in this tutorial is licensed under the GNU 3.0 open source license and you are free to modify and redistribute the code, given that you give others you share the code with the same right, and cite my name (use citation format below). You are not free to redistribute or modify the tutorial itself in any way. By reading on you agree to these terms. If you disagree, please navigate away from this page.

Troubleshooting: I assume intermediate knowledge of Python for this tutorial. If you don’t have this, please try a few more basic tutorials first or follow an entry-level course on Coursera or similar. This also means you know how to interpret errors. Don’t immediately panic and flood my inbox please :)! Read the thing, google for a solution, and only then ask for help. Part of learning to program is learning to debug on your own as well. If you really can’t figure it out, feel free to let me know.

Citation format
van Gent, P. (2017). Deep Learning Music. A tech blog about fun things with Python and embedded electronics. Retrieved from: http://www.paulvangent.com/2017/12/07/deep-learning-music/

Find the GitHub here

Like the work here? Making content takes time and effort. If you like the work here, feel free to buy me a beer or support the blog.


Introduction
I’ve been playing the piano for about 15 years at the time of writing this, and have been composing as a hobby for a few of those years. One of the things that always struck me with (piano) sheet music is that I can open a book, see a piece I’ve never seen before, and make a quite accurate educated guess as to who composed it. This is something my brain has learned to do by looking at many different examples of sheet music over the years. Each composer has a unique signature based not only on personal style, but also on the time period in which he/she lived. While my performance on this recognition task is above chance level, it’s nowhere near perfect. I wondered if a Deep Network could do better.

In this tutorial I will go over the process of building a deep neural network and evaluating its ability to recognise a composer’s ‘musical signature’ based on a few measures of music. The focus lies more on developing an effective data set than on building the deep network, since this is the step tutorials often gloss over by using a pre-fab set. While that is easier and less time consuming, it also means you’re stuck repeating other people’s work. You might have a cool idea but not know how to collect and structure the data. The process is of course different for every data set, but to illustrate how it works we will go through it with an example dataset here. Along the way this might give you ideas on how to create your own data set, and how to extract the right information from the source data. We will use a famous deep neural architecture (ResNet50) in Python, using Google’s TensorFlow and the Keras framework.

What you will need:

  • Python 3 (TensorFlow does not work on 2.7);
  • Google’s TensorFlow;
  • The Keras framework;
  • A good GPU with lots of vram (early work here was done on a GTX760 with 2GB vram but that turned out to be too tight, I switched to a GTX1070 with 8GB vram).
    • Note that a GPU is completely optional, but it will speed up training times dramatically. It can make the difference between training for a night, or training for several weeks.

Installing Packages
The first thing to do is make sure your environment is set up. Install Python; I use the Anaconda Distribution of Python, since it comes with a lot of useful packages preinstalled. Some people I work(ed) with also swear by the ‘Jupyter Notebook’ you can use with this distribution. While I’m more of a Visual Studio kind of guy, I can definitely appreciate what the Jupyter Notebook does.

TensorFlow is a Deep Learning framework developed by Google. You need to install a few dependencies along with TensorFlow, and the Nvidia CUDA toolkit if you want to use your GPU for training. Installation instructions can be found here.

Because TensorFlow can have a bit of a steep learning curve at first, we use Keras on top of TensorFlow to make it more accessible. I like how it helps develop and iterate model variants really fast. Installation instructions here.


The Imperfect Finish?
Before delving into the data set and the model, I want to get something out of the way. “Help! I did so and so, and my model only reached 75% accuracy” or “Why am I not getting 100% accuracy?” are types of questions I get very often. Should you always strive for the maximum obtainable accuracy? Yes. Should you always keep going until you reach it? Absolutely not. Is the maximum performance always 100%? No sir. So what, then, is enough? Where is the finish line?

Let’s talk about Bayes Error (also called the irreducible error). It’s an important concept for any Deep Learning project, because it can help you determine where the finish line for your project lies. Think of the Bayes Error as the absolute limit of performance given a set of data and a task: if there’s not enough information in the data to perform the task perfectly, nothing we can ever invent or build will do it reliably. So how do we determine the value of the Bayes Error? We can’t.

At least, we can’t find it theoretically. Determining the absolute Bayes Error is essentially impossible: we would have to design a perfect system and watch it run into a performance ceiling below 100%. Because the system is perfect, nothing can do better, so that ceiling would be our value for Bayes Error. And even then we would still need to prove that we really found the Bayes Error. After all, how do we prove our system is perfect if it doesn’t score 100% correct?

This might be a little too theoretical for practical purposes, so let’s boil it down to brass tacks. Bayes Error in Deep Learning is often taken as at least the performance of a trained human expert or group of trained experts on a task. This is because often the best known performance on a task comes from humans (so far..). This performance superiority is especially clear on tasks involving visual data. Biological vision systems have evolved to be really really good since they are a primary survival tool. Let’s illustrate my point with a few examples. Look at this image:

[Image: a tiger camouflaged in the undergrowth] (source)

Did you see it? Good, you get to live another day. Your brain probably spotted the tiger so fast you didn’t even have time to think about it. If the tiger had moved, you would likely have spotted it even faster. Computer Vision has traditionally struggled with these types of ‘overlapping objects with lots of patterns and occlusion’ problems. Now look at the following image:


[Image: two mountain lions hiding in a bush] (source)

Here even my vision system is struggling to spot the cats. Armed with the knowledge that there are two mountain lions hiding in the bush near the center, I still cannot see them. So what would the Bayes Error be if the dataset contained a lot of these pictures? Humans don’t do well here, but could a very advanced deep net do better and detect them? Would it make sense to spend a month or more building a big data set, and then another month trying different architectures and fine-tuning the models? I would say: don’t bother.

And here’s the main point of this section: getting 100% accuracy is rare, and often not the goal of professional Deep Learning teams at all. A better question is: “when would I consider the problem solved sufficiently?”, or “where is the finish line?”. In other words, what level of performance is enough? This is especially important with Deep Learning, where building a model and training it through many epochs can take days, weeks, months, or more depending on available hardware, data set size, and problem complexity. Thinking about this before diving in will help you decide when to stop optimizing and call it quits. In practice this could save you lots of time.

Anyway, you’re here to learn about Deep Learning! So, let’s get to the meat of this tutorial.


Generating the Data Set – Step 1
The first thing we need to do is create our dataset and decide where the data will come from. I will be creating a set from piano sheet music of seven famous composers: Bach, Beethoven, Brahms, Chopin, Grieg, Liszt, and Mozart. I have already extracted the pages from the PDF files as .jpg images. You can download the set here (heads up: 1.5GB) and follow along if you like (made possible by the Petrucci Music Library), you can make your own music dataset (there is a lot of PDF sheet music available online), or create your own set with any other image data type.

The second decision is what our individual sample size is going to be. It’s better not to train the network on very large images, since this would push our video-ram (vram) requirements far beyond feasible levels, and would also increase the classification time of the network. Reducing the resolution of a full page of sheet music is not a good solution either, since it quickly becomes hard to read, so let’s split all the sheet music into images of individual rows. This also has the bonus of giving us a larger data set than if we had used full pages.

Can we do this automatically? Look at the page of music: what do you see that can be used? The first thing I noticed was that there is often horizontal white-space between the rows of music. We can exploit that: if we take the range (maximum – minimum) of each row of pixels, we might be able to cut the page. Using:

~

from scipy import ndimage
import matplotlib.pyplot as plt

def open_image(path):
    image = ndimage.imread(path, mode='L') #open in grayscale mode 'L'
    return image
    
def line_contrast(page_image):
    line_contr = []
    for line in page_image: #determine range per line
        line_contr.append(max(line) - min(line))
    return line_contr
    
if __name__ == '__main__':
    #open image and determine pixel ranges
    image = open_image('Beethoven68.jpg')
    line_contr = line_contrast(image)
    #plot the whole thing
    plt.plot(line_contr)
    plt.title('Row-wise pixel value range')
    plt.xlabel('Horizontal row #')
    plt.ylabel('Pixel value range on row')
    plt.show()

With this sample score page and the code above we get the following:

Well, that seems to work quite well: the six rows of music show up very clearly as peaks in the resulting signal, and the white-space as dips. We can then use a simple rule-based function to detect blocks with a high variation and of a minimum vertical size. A quick and dirty way is:

~

def find_rows(line_contr):
    detected_rows = []
    row_start = 0
    row_end = 0
    detect_state = 0 #0 if the previous line was not part of a row, 1 if it was
    cur_row = 0
    for contrast in line_contr:
        if contrast < 50 and detect_state == 0:
            pass #still in white-space, keep scanning
        elif contrast >= 50 and detect_state == 0: #start of a row detected
            row_start = cur_row
            detect_state = 1
        elif contrast < 50 and detect_state == 1: #end of row, evaluate its height
            row_end = cur_row
            rowheight = row_end - row_start
            if abs(rowheight) >= 150: #only keep blocks of sufficient vertical size
                detected_rows.append((row_start, row_end))
            detect_state = 0
        elif contrast >= 50 and detect_state == 1:
            pass #still inside a row, keep scanning
        else:
            print("unknown situation, help!, detection state: " + str(detect_state))
        cur_row += 1
    return detected_rows

 

Adding this to the code, then running it like so:

~

from scipy import misc #import an extra module to save the image later

image = open_image('Beethoven68.jpg') #open the file
line_contr = line_contrast(image) #get the contrast ranges
detected_rows = find_rows(line_contr) #find the rows

for row in detected_rows: 
    #mark detected beginning and end of row of music by setting those rows of pixels to black
    image[row[0]] = 0
    image[row[1]] = 0

misc.imsave('output.jpg', image) #save the image

 

Leads to:

As you can see it’s marking the rows quite well, so the last thing we need to do is write a function to slice out the rows, resize them, and save them to separate files. For training our deep learning network, all images in the training set need to have the same dimensions, so we add that to the code. Let’s also add a check for whether we are indeed slicing a single line of music, or detecting multiple lines as one (we mark these as errors for now), by limiting the maximum height we allow:

~

import os
import numpy as np #used by slice_rows below
from glob import glob

def checkpath(filepath):
    if not os.path.exists(filepath):
        os.makedirs(filepath)

def save_rows(sliced_rows, composer):
    path = 'Dataset/%s/' %composer
    checkpath(path)
    for row in sliced_rows:
        file_number = len(glob(path + '*')) #number the files sequentially
        misc.imsave(path + str(file_number) + '.jpg', row)

def slice_rows(page_image, detected_rows, composer):
    sliced_rows = []
    max_height= 350
    max_width = 2000
    for x,y in detected_rows:
        im_sliced = np.copy(page_image[x:y])
        new_im = np.empty((max_height, max_width))
        new_im.fill(255)
        if im_sliced.shape[0] <= max_height:
            new_im[0:im_sliced.shape[0], 0:im_sliced.shape[1]] = im_sliced
            sliced_rows.append(new_im)
        elif max_height < im_sliced.shape[0] < 1.25 * max_height:
            im_sliced = im_sliced[0:max_height, 0:im_sliced.shape[1]]
            new_im[0:im_sliced.shape[0], 0:im_sliced.shape[1]] = im_sliced
            sliced_rows.append(new_im)
        else:
            print("Skipping block of height: %s px" %im_sliced.shape[0])
            checkpath('Dataset/%s/Errors/' %composer)
            file_number = len(glob('Dataset/%s/Errors/*' %composer))
            #save to error dir for manual inspection
            misc.imsave('Dataset/%s/Errors/%s_%s.jpg' %(composer, file_number, composer), im_sliced)
    return sliced_rows

With that functioning, I went ahead and downloaded most of the piano works of the selected composers. With a few more helper functions added, I ran the code blocks above over the entire collection of pages. Download the full Python file here. Alternatively, if you’re just here for the Deep Learning fun, the processed dataset is here (medium images) and here (small images).
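
Just to give an idea of what such a driver could look like, here is a minimal sketch that ties the functions above together. Note that the folder layout (‘Pages/<composer>’) and the assumption that every page is already a .jpg are mine for illustration; the full Python file linked above handles a few more edge cases.

~

from glob import glob

composers = ['Bach', 'Beethoven', 'Brahms', 'Chopin', 'Grieg', 'Liszt', 'Mozart']

if __name__ == '__main__':
    for composer in composers:
        #'Pages/<composer>/*.jpg' is an assumed folder layout for the extracted page images
        for page in glob('Pages/%s/*.jpg' %composer):
            image = open_image(page) #open page in grayscale
            line_contr = line_contrast(image) #row-wise pixel value ranges
            detected_rows = find_rows(line_contr) #(start, end) positions of music rows
            sliced_rows = slice_rows(image, detected_rows, composer) #cut and pad the rows
            save_rows(sliced_rows, composer) #writes to Dataset/<composer>/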


Going Deep
Time to get to the fun part and train the network. First we need to load the data into memory, so let’s write a little snippet that does just that:

~

def load_dataset(datapath):
    folders = glob('%s/*' %datapath)
    composers = [x.split('\\')[-1] for x in folders] #populate list with composer names from folders
    X_train = []
    Y_train = []

    for folder in folders: #go over all data folders
        files = glob('%s\\*.jpg' %folder) #in each data folder, detect all image data
        print('working on composer: %s' %(folder.split('\\')[-1])) #let us know what's going on
        for f in files:
            im = ndimage.imread(f, mode='L') #open image in grayscale mode to save memory (mode 'L')
            im = im/255 #normalise image data
            im = im.reshape(im.shape[0], im.shape[1], 1) #reshape to have explicit 1-channel color data (Keras and TensorFlow expect this)
            X_train.append(im) #put training image into data array
            Y_train.append(composers.index(folder.split('\\')[-1])) #put correct label into label array

    return np.asarray(X_train), np.asarray(Y_train)

 

Now on to the deep net. Deep Nets can be tricky to design and fine-tune. In a later tutorial I might go more in-depth on designing your own, but for now, instead of reinventing the wheel, let’s take an existing network architecture called ResNet50 (1), implemented in Keras (2), and work with that. Download the file here and put it in the same folder as your program.

Now import it, compile the model, set the parameters, and run it:

~

import numpy as np
from glob import glob
from scipy import ndimage
from keras import callbacks
from keras.optimizers import Adam, Adamax, SGD, RMSprop

import ResNet50

def convert_to_one_hot(Y, C):
    #function to return a one-hot encoding
    Y = np.eye(C)[Y.reshape(-1)].T
    return Y

if __name__ == '__main__':
    print('setting model')
    model = ResNet50.ResNet50(input_shape = (70, 400, 1), classes = 7)

    print('compiling model...')
    epochs = 100
    learning_rate = 0.001
    lr_decay = 0.001/100
    optimizer_instance = SGD(lr=learning_rate, decay=lr_decay)
    model.compile(optimizer=optimizer_instance, loss='categorical_crossentropy', metrics=['acc'])

    print('loading dataset......')
    datapath = 'Dataset_Train_Medium/'
    datapath_val = 'Dataset_Dev_Medium/'
    X_train, Y_train = load_dataset(datapath)
    X_test, Y_test = load_dataset(datapath_val)

    print('Applying one-hot-encoding')
    Y_train = convert_to_one_hot(Y_train, 7).T
    Y_test = convert_to_one_hot(Y_test, 7).T

    print('setting up callbacks...')
    nancheck = callbacks.TerminateOnNaN()
    filepath = 'Models/weights-{epoch:02d}-accuracy-{acc:.2f}.hdf5'
    saver = callbacks.ModelCheckpoint(filepath, monitor='acc', verbose=1, save_best_only=False, mode='max', period=1)
    logger = callbacks.CSVLogger('Models/trainingresults.log')
    callbacklist = [nancheck, saver, logger]

    print('starting model fitting')
    model.fit(X_train, Y_train, validation_data = (X_test, Y_test), epochs=epochs, batch_size=72, callbacks=callbacklist)

 

A few remarks about the code. In line 16, we call the function from the Keras implementation of the ResNet50 and pass it the required parameters (image size and number of classes).

In lines 18-23 several things happen. We define the number of epochs (full passes through the dataset) used to fit the network. We also define the learning rate and its decay function (starting at 0.001, decaying to 0). This technique of starting with a higher learning rate and then decaying over time is important when training for long periods of time. For those familiar with gradient descent optimisation techniques: a higher learning rate in the beginning helps the model take bigger steps towards the optimum. Decaying the learning rate as the model comes closer to the optimum helps take smaller steps and prevents constantly ‘over-shooting’ it. If you’re not familiar with this, no worries. After defining this we choose an optimiser, in this case Stochastic Gradient Descent. I’ve also tested the model with Adam, Adamax and RMSprop (root mean square propagation). The Adam optimiser default implementation was numerically unstable for this problem (after many epochs the loss would spike, the accuracy would drop to chance level, and it would not recover). I did not bother to tune the hyperparameters to fix this problem, since SGD, Adamax, and RMSprop worked with little tuning. SGD converged quickest and reached the best validation accuracy, so this is implemented here. We then compile the model, and use categorical cross-entropy as loss function. Finally we tell the model that we are interested in the accuracy ‘acc’, so that it will display it along with the training loss.
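
To get a feel for what that decay setting actually does: as far as I know, Keras’ SGD applies time-based decay per parameter update, roughly lr_t = lr / (1 + decay * t). The quick sketch below plots the resulting schedule; the 1000 mini-batch updates per epoch is just an assumption for the plot.

~

import numpy as np
import matplotlib.pyplot as plt

#sketch of the time-based decay schedule, assuming Keras' SGD applies
#lr_t = lr / (1 + decay * t), with t the number of parameter updates so far
lr0 = 0.001
decay = 0.001 / 100
updates = np.arange(100 * 1000) #assume roughly 1000 mini-batch updates per epoch

plt.plot(updates, lr0 / (1 + decay * updates))
plt.xlabel('parameter update #')
plt.ylabel('effective learning rate')
plt.show()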

In lines 25-33 we load the training and validation data sets, and set the label objects to have one-hot-encoding, which is a requirement when using categorical cross-entropy.
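
To make the one-hot encoding a bit more concrete, this is what the convert_to_one_hot trick produces for a tiny made-up label array (four samples, seven composer classes):

~

import numpy as np

labels = np.array([0, 2, 1, 6]) #made-up class indices for four samples
one_hot = np.eye(7)[labels.reshape(-1)] #same trick as convert_to_one_hot, minus the transposes

print(one_hot)
#[[1. 0. 0. 0. 0. 0. 0.]
# [0. 0. 1. 0. 0. 0. 0.]
# [0. 1. 0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 0. 0. 1.]]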

In lines 35-40 we set up what are called ‘callbacks‘ in Keras. Callbacks are executed after each training epoch and can be used for anything you like; often they output model statistics or parameters that can help spot problems in training early on. Here I set up three callbacks: a “terminate on NaN” callback that will stop training if the loss goes to NaN (often indicative of an error), a “Model Checkpoint” callback that will save our model weights every epoch so that we can choose the best performing one (sometimes a model will overfit in later epochs, making an earlier version the better choice), and a “CSV logger” callback that will save our training and validation losses and accuracies to a file for later inspection.
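
To show how flexible this mechanism is, here is a minimal custom callback sketch (not used in the training run above) that simply prints whatever Keras logged after each epoch:

~

from keras import callbacks

class EpochPrinter(callbacks.Callback):
    #illustrative custom callback: print the logged metrics after every epoch
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print('epoch %s: %s' %(epoch, ', '.join('%s=%.4f' %(k, v) for k, v in logs.items())))

#to use it, you would simply append an instance to the callback list:
#callbacklist.append(EpochPrinter())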

Finally, on line 43 we fit the model and feed it the dev set to keep testing model performance after each epoch. If you get memory errors and related crashes, reduce the mini-batch size. You may also need to tweak the learning rate and/or the optimizer if the loss doesn’t start converging after several iterations.

I let it run for a few hours:

~

epoch,acc,loss,val_acc,val_loss
0,0.26596873394,8.12754559003,0.160439559424,3.34268599762
1,0.458725331228,2.49067383588,0.263736266143,2.07939970939
2,0.56277852403,1.618642787,0.268131864112,1.91241733635
3,0.620216454,1.39953651252,0.389010995299,1.6868816774
4,0.665346255892,1.1554503056,0.397802200029,1.76883114249

.......

95,0.999646318004,0.00257732888695,1.0,0.00568562985991
96,0.999929263601,0.00182615904544,1.0,0.00205823136976
97,0.999575581604,0.00276638044431,1.0,0.00220210704379
98,1.0,0.00186036216549,1.0,0.00256703818082
99,0.999858527201,0.00240185359843,1.0,0.00254284693251

In the end it reached 100% validation accuracy, wow!


Wrapping Up
My computer can now beat me at recognising composers from a few bars of music, cheeky little bastard. Actually, the very high performance sent me on a quest to find where I fucked up, because I had not expected it to work this well. This is why I split the data into a training, development, and test set. This is considered good practice, and now you see why: it acts as a double check. Even on the test set the performance was 99.5%. This achievement deserves a name and a logo for the algorithm! I’ll dub him: Bytehoven.
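
Coming back to that split for a moment: if you want to make a similar train/dev/test split of the sliced images yourself, something like the sketch below will do. The 80/10/10 ratio and the output folder names are my choices here, not necessarily the ones used for the downloadable sets.

~

import os, random, shutil
from glob import glob

def split_composer(composer, source='Dataset', seed=42):
    #shuffle the sliced images of one composer and copy them into train/dev/test folders
    files = glob('%s/%s/*.jpg' %(source, composer))
    random.Random(seed).shuffle(files)
    n = len(files)
    splits = {'Dataset_Train': files[:int(0.8 * n)],
              'Dataset_Dev': files[int(0.8 * n):int(0.9 * n)],
              'Dataset_Test': files[int(0.9 * n):]}
    for target, subset in splits.items():
        path = '%s/%s/' %(target, composer)
        if not os.path.exists(path):
            os.makedirs(path)
        for f in subset:
            shutil.copy(f, path)

if __name__ == '__main__':
    for composer in ['Bach', 'Beethoven', 'Brahms', 'Chopin', 'Grieg', 'Liszt', 'Mozart']:
        split_composer(composer)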

Before wrapping up, I thought it would be cool to explore how Bytehoven perceives the world we present to him. As with all deep networks trained on visual data, earlier layers in the network should self-organise to recognise simple shapes or textures, with more complex constructs being represented deeper in the network. I came across the keras-vis repository, a tool to visualise what’s happening in your network. Let’s poke about a bit in Bytehoven’s brains.

First, let’s take a look at how Bytehoven perceives the world, by using keras-vis to visualise the activations of the last network layer. This gives us the deep network’s activations across its entire visual field. Using a random snippet from Beethoven and one from Chopin, we get:

[Activation visualisations for the Beethoven and Chopin snippets]

I thought that was pretty cool! You can clearly see how the network encodes the images in terms of neuron activations throughout its visual field.
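
If you want to poke around yourself, below is roughly how such a visualisation can be produced with keras-vis. Treat it as a sketch: the checkpoint filename and input image are placeholders, and the exact function signatures may differ between keras-vis versions, so check its documentation.

~

import matplotlib.pyplot as plt
from scipy import ndimage
from keras.models import load_model
from vis.visualization import visualize_saliency

#placeholder checkpoint name; pick one of the .hdf5 files saved during training
model = load_model('Models/weights-XX-accuracy-X.XX.hdf5')

#placeholder input image, assumed to already have the dimensions the network was trained on
im = ndimage.imread('some_row_of_music.jpg', mode='L') / 255
im = im.reshape(im.shape[0], im.shape[1], 1)

#visualise which parts of the input the final (classification) layer responds to
heatmap = visualize_saliency(model, layer_idx=-1, filter_indices=None, seed_input=im)

plt.imshow(heatmap, cmap='jet')
plt.show()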

As a last detail, take a look at the filters throughout the network as visualised below. Each row represents three random filters from a particular network layer. Those at the top sit early in the network; going down the image you see filters that occur deeper in the network, until at the end you have the activation for a piece from Beethoven. You can clearly see how the early filters detect simple shapes and patterns, and how, moving deeper into the network, these simple shapes are built up into more complicated patterns. Note that this is just a subset of the filters; there are about fifteen thousand in the ResNet50.

I hope you liked this tutorial. Making content takes time and effort. If you like the work here, feel free to buy me a beer or support the blog.

Next steps for Bytehoven are expanding the set of composers the algorithm knows about, and getting it to recognise orchestral scores as well. Keep an eye on the GitHub!


References

  1. https://github.com/BVLC/caffe/wiki/Model-Zoo#resnets-deep-residual-networks-from-msra-at-imagenet-and-coco-2015
  2. Adapted from https://github.com/fchollet/keras/blob/master/keras/applications/resnet50.py
