Deep Learning Music with Python

Music is a primary expression of our humanity. But like anything in the natural and man-made world, it’s made up of patterns that obey a set of laws. In this tutorial we will see how to set up a Deep Learning network in Python. Using a case study of recognising a composer’s ‘musical signature’ from a few bars of music, you will learn some tips and tricks for designing your own effective machine learning data sets. Let’s get started!

Important: The code in this tutorial is licensed under the GNU 3.0 open source license and you are free to modify and redistribute the code, provided that you extend the same rights to anyone you share the code with, and cite my name (use the citation format below). You are not free to redistribute or modify the tutorial itself in any way. By reading on you agree to these terms. If you disagree, please navigate away from this page.
Troubleshooting: I assume intermediate knowledge of Python for this tutorial. If you don’t have this, please try a few more basic tutorials first or follow an entry-level course on Coursera or similar. This also means you know how to interpret errors. Don’t immediately panic and flood my inbox please :)! Read the thing, google for a solution, and only then ask for help. Part of learning to program is learning to debug on your own as well. If you really can’t figure it out, feel free to let me know.
Citation format
van Gent, P. (2017). Deep Learning Music. A tech blog about fun things with Python and embedded electronics. Retrieved from:

Find the GitHub here
Like the work here? Making content takes time and effort. If you like the work here, feel free to buy me a beer or support the blog.

I’ve been playing the piano for about 15 years at the time of writing this, and have been composing as a hobby for a few of those years. One of the things that always struck me with (piano) sheet music is that I can open a book, see a piece I’ve never seen before, and make a quite accurate educated guess as to who composed it. This is something my brain has learned to do by looking at many different examples of sheet music over the years. Each composer has a unique signature based not only on personal style, but also the time period in which he/she lived. I would classify my performance on this recognition task as ‘slightly above chance level’. It’s not anywhere near perfect. I wondered if a Deep Network could do better.
In this tutorial I will go over the process of building a deep neural network and evaluate its ability to recognise a composer’s ‘musical signature’ based on a few measures of music. The focus lies more on developing an effective data set than on building the deep network, since this is the step tutorials often gloss over by using a pre-fab set. While easier and less time consuming, this also means you’re stuck repeating other people’s work. You might have a cool idea but don’t know how to collect and structure the data. For each set the process is of course different, but to illustrate how this works we will go over this process and create an example dataset. Along the way this might give you ideas on how to create your own data set, and how to extract the right information from the source data. We will use a famous deep neural architecture (ResNet50) in Python using Google’s TensorFlow and the Keras framework.
What you will need:

  • Python 3 (TensorFlow does not work on 2.7);
  • Google’s TensorFlow;
  • The Keras framework;
  • A good GPU with lots of vram (early work here was done on a GTX760 with 2GB vram but that turned out to be too tight, I switched to a GTX1070 with 8GB vram).
    • Note that a GPU is completely optional, but it will speed up training times dramatically. It can make the difference between training for a night, or training for several weeks.

Installing Packages
The first thing to do is make sure your environment is set up. Install Python; I use the Anaconda Distribution of Python, since it comes with a lot of useful packages preinstalled. Some people I work(ed) with also swear by the ‘Jupyter Notebook‘ you can use with this distribution. While I’m more of a Visual Studio kind-of-guy, I can definitely appreciate what the Jupyter Notebook does.
TensorFlow is a Deep Learning framework developed by Google. You need to install a few dependencies along with TensorFlow, and the Nvidia CUDA toolkit if you want to use your GPU for training. Installation instructions can be found here.
Because TensorFlow can have a bit of a steep learning curve at first, we use Keras on top of TensorFlow to make it more accessible. I like how it helps develop and iterate model variants really fast. Installation instructions here.

The Imperfect Finish?
Before delving into the data set and the model, I want to get something out of the way. “Help! I did so and so, and my model only reached 75% accuracy” or “Why am I not getting 100% accuracy?” are the types of questions I get very often. Should you always strive for the maximum obtainable accuracy? Yes. Should you always keep going until you reach it? Absolutely not. Is the maximum performance always 100%? No sir. So what, then, is enough? Where is the finish line?
Let’s talk about Bayes Error (also named the irreducible error). It’s an important concept for solving any Deep Learning project, because it can help you determine where the finish line for your project is. Think of the Bayes Error as being the absolute limit of performance given a set of data and a task: if there’s not enough variance in the data to perform a task perfectly, nothing we can ever invent or build reliably will. So, we need to determine the value of Bayes Error for our project.
Except we can’t.
At least, we can’t find it theoretically. Determining the absolute Bayes Error is quite impossible until we crash into it. Let’s do a little thought experiment, let’s assume we somehow manage to design a perfect system. We train the system on a dataset, and find that it runs into a non-maximum performance limit, meaning it doesn’t predict or recognize everything correctly. Since in our thought experiment the system is perfect, by definition nothing can do better, ever. There we have it, the value for Bayes Error for this data set is equal to the performance of our perfect system. Of course once that happens, we would still need to prove we really found the value of Bayes Error. Think about it, how do we prove the system in our thought experiment is perfect if it doesn’t score 100% correct?
It turns out designing a perfect system might not be the best way to go. Surprise! In Deep Learning, Bayes Error is often taken as at least the performance of a trained human expert or group of trained experts on a task. It’s about wondering “what is the best known performance for task X that we know of?”, and assuming Bayes Error is at least that. Often the best known performance on a task comes from humans (so far..), so often this is taken as a gold standard for a deep learning project. This helps set goals and boundaries and can help you decide when to call it quits and be satisfied with the results.
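A toy simulation makes the idea of irreducible error concrete: if 10% of the labels in a dataset are pure noise, even an oracle that always predicts the true underlying class tops out at roughly 90% accuracy; that remaining 10% is the Bayes Error of the task. (This is just an illustration, not part of the tutorial's pipeline.)

```python
import numpy as np

rng = np.random.RandomState(42)
n = 100000
true_labels = rng.randint(0, 2, n)      #the underlying ground truth classes
noise = rng.rand(n) < 0.10              #10% of the observed labels get corrupted
observed = np.where(noise, 1 - true_labels, true_labels)

#an oracle that always predicts the true class still misses the corrupted 10%
oracle_accuracy = np.mean(true_labels == observed)
print('oracle accuracy: %.3f' % oracle_accuracy)
```

No amount of model tuning can push past that limit, because the missing information simply isn't in the data.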
The human performance superiority is still especially clear on tasks involving visual data. Biological vision systems have evolved to be really really good since they are a primary survival tool. Let’s illustrate with an example. Look at this image:


Did you see it? Good, you get to live another day. Your brain probably spotted the tiger so fast you didn’t even have time to think about it. If the tiger had moved, you would likely have spotted it even faster. Computer Vision has traditionally struggled (and still does) with these types of ‘overlapping objects with lots of different patterns and occlusion’ problems. Now look at the following image:


Here even my vision system is struggling to spot the cats. Armed with the knowledge that there are two mountain lions hiding in the bush near the center, I still cannot see them. So what would Bayes Error be if the dataset contained a lot of these pictures? Humans don’t do well here, but could a very advanced deep net do better and detect them? Would it make sense to spend a month or more building a big data set of similar images, and then another month trying different architectures and fine-tuning the models? I would say don’t bother.
Of course this example is a bit hyperbolic, but it does show the main point of this section: getting 100% accuracy is rare. Often it’s not the goal of professional Deep Learning teams at all, so don’t make it your goal. A better question is: “when would I consider the problem solved sufficiently?”, or “where is the finish line?”. In other words, what level of performance is enough? This is especially important with larger convolutional or recurrent neural nets, where building a model and training it through many epochs can take days, weeks, months or more, even with powerful hardware. Thinking about this before diving in will help you decide when to stop optimizing and call it quits. In practice this could save you lots of time and prevent you from getting demotivated.
Enfin, you’re here to learn about Deep Learning! So, let’s get to the meat of this tutorial.

Generating the Data Set – Step 1
The first thing we need to do is create our dataset and decide where the data will come from. I will be creating a set from piano sheet music of seven famous composers: Bach, Beethoven, Brahms, Chopin, Grieg, Liszt, and Mozart. I have already extracted the pages from the PDF files as .jpg images. You can download the set here (heads up: 1.5GB) and follow along if you like (made possible by the Petrucci Music Library), you can make your own music dataset (there is a lot of PDF sheet music available online), or create your own set with any other image data type.
The second decision is what our individual sample size is going to be. It’s better to not train the network on very large images, since this will increase our video-ram (vram) requirements far beyond feasible levels, and will also increase classification time of the network. Reducing the resolution of a full page of sheet music is not the best solution since the information in music notation is already quite dense. This means if we shrink it too much we throw away a lot of information: if pixels that define whether a note lies on or between the lines start to blend we’re in trouble. Let’s be conservative and split all the sheet music into images of individual rows. This also has the bonus of making a larger data set than if we had just full pages since we get five samples from one page of music.
Can we do this automatically? Here you will start to think like a data scientist: doing manual work will result in high quality data, but can we automate and do things much faster? It doesn’t matter if it’s not perfect: real world data will likely be imperfect too (yes, perfect datasets are better, but manual work required to create them is often impractical or impossible). Look at the page of music, what do you see that can be used? The first thing I noticed was that often, there is horizontal white-space between the rows of music. We can exploit that: if we take the range (maximum – minimum) of each row of pixels, we might be able to cut the page. Using:

from scipy import ndimage
import matplotlib.pyplot as plt

def open_image(path):
    image = ndimage.imread(path, mode='L') #open in grayscale mode 'L'
    return image

def line_contrast(page_image):
    line_contr = []
    for line in page_image: #determine range per line
        line_contr.append(max(line) - min(line))
    return line_contr

if __name__ == '__main__':
    #open image and determine pixel ranges
    image = open_image('Beethoven68.jpg')
    line_contr = line_contrast(image)
    #plot the whole thing
    plt.plot(line_contr)
    plt.title('Row-wise pixel value range')
    plt.xlabel('Horizontal row #')
    plt.ylabel('Pixel value range on row')
    plt.show()

With this sample score page and the code above we get the following:

Well that seems to work quite well, the six bars are represented very clearly in the peaks of the resulting signal, and the white-space as dips. We can then use a simple rule-based function to detect blocks with a high variation and of a minimum vertical size. A quick and dirty way is:
def find_rows(line_contr):
    detected_rows = []
    row_start = 0
    row_end = 0
    detect_state = 0 #0 if previous line was not part of a row
    cur_row = 0
    for contrast in line_contr:
        if contrast < 50 and detect_state == 0:
            row_start = cur_row #still in white-space, keep moving the start marker
        elif contrast >= 50 and detect_state == 0:
            row_start = cur_row #a row begins here
            detect_state = 1
        elif contrast < 50 and detect_state == 1: #if end of row, evaluate AOI height
            row_end = cur_row
            rowheight = row_end - row_start
            if rowheight >= 150:
                detected_rows.append((row_start, row_end))
            detect_state = 0
        else: #contrast >= 50 and detect_state == 1: still inside a row, nothing to do
            pass
        cur_row += 1
    return detected_rows
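Before running this on thousands of pages, a quick sanity check doesn't hurt. Here is a condensed, standalone copy of the same detection logic, fed with a synthetic contrast signal (the threshold of 50 and minimum row height of 150 match the values above):

```python
def detect_rows(line_contr, threshold=50, min_height=150):
    #condensed version of the find_rows state machine, for standalone testing
    detected, row_start, in_row = [], 0, False
    for cur_row, contrast in enumerate(line_contr):
        if contrast >= threshold and not in_row:   #a row begins
            row_start, in_row = cur_row, True
        elif contrast < threshold and in_row:      #row ends, check its height
            if cur_row - row_start >= min_height:
                detected.append((row_start, cur_row))
            in_row = False
    return detected

#synthetic page: 200 lines of white-space, 200 lines of 'music', 200 of white-space
fake_contrast = [10] * 200 + [100] * 200 + [10] * 200
print(detect_rows(fake_contrast))  #[(200, 400)]
```

One block of high contrast in, one detected row out, spanning exactly the high-contrast lines.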

Adding this to the code, then running it like so:

from scipy import misc #import an extra module to save the image later

image = open_image('Beethoven68.jpg') #open the file
line_contr = line_contrast(image) #get the contrast ranges
detected_rows = find_rows(line_contr) #find the rows

for row in detected_rows:
    #mark detected beginning and end of row of music by setting those rows of pixels to black
    image[row[0]] = 0
    image[row[1]] = 0

misc.imsave('output.jpg', image) #save the image

Leads to:

As you can see it’s marking the rows quite well. The last thing we need to do is write a function to slice out the rows, resize them, and save them to separate files. For training our deep learning network, all images in the training set need to have the same dimensions (how to handle variable input will be dealt with in a later, more in-depth post). We add that to the code. Let’s also add a test that checks whether we are indeed slicing a single line of music, or whether we detected multiple lines as one (we mark these as errors for now), by limiting the maximum height we allow:

import os
import numpy as np
from glob import glob

def checkpath(filepath):
    if not os.path.exists(filepath):
        os.makedirs(filepath) #create the folder if it doesn't exist yet

def save_rows(sliced_rows, composer):
    path = 'Dataset/%s/' %composer
    checkpath(path)
    for row in sliced_rows:
        file_number = len(glob(path + '*'))
        misc.imsave(str(path) + '/' + str(file_number) + '.jpg', row)

def slice_rows(page_image, detected_rows, composer):
    sliced_rows = []
    max_height = 350
    max_width = 2000
    for x,y in detected_rows:
        im_sliced = np.copy(page_image[x:y])
        new_im = np.full((max_height, max_width), 255) #white canvas to paste the slice onto
        if im_sliced.shape[0] <= max_height:
            new_im[0:im_sliced.shape[0], 0:im_sliced.shape[1]] = im_sliced
            sliced_rows.append(new_im)
        elif max_height < im_sliced.shape[0] < 1.25 * max_height: #slightly too tall: crop to fit
            im_sliced = im_sliced[0:max_height, 0:im_sliced.shape[1]]
            new_im[0:im_sliced.shape[0], 0:im_sliced.shape[1]] = im_sliced
            sliced_rows.append(new_im)
        else: #far too tall: probably multiple rows detected as one, flag as error
            print("Skipping block of height: %s px" %im_sliced.shape[0])
            checkpath('Dataset/%s/Errors/' %composer)
            file_number = len(glob('Dataset/%s/Errors/*' %composer))
            #save to error dir for manual inspection
            misc.imsave('Dataset/%s/Errors/%s_%s.jpg' %(composer, file_number, composer), im_sliced)
    return sliced_rows

With that functioning, I went ahead and downloaded most of the piano works of the selected composers. With a few more helper functions included, I ran the code blocks so far on the entire dataset. Download the full Python file here. Alternatively, if you’re just here for the Deep Learning fun, the processed dataset is here (medium images, >=6GB VRAM recommended) and here (small images, >=1.5GB VRAM recommended).

Going Deep
Time to get to the fun part and train the network. First we need to load the data into memory, so let’s write a little snippet for that:

def load_dataset(datapath):
    folders = glob('%s/*' %datapath)
    composers = [x.split('\\')[-1] for x in folders] #populate list with composer names from folders
    X_train = []
    Y_train = []
    for folder in folders: #go over all data folders
        files = glob('%s\\*.jpg' %folder) #in each data folder, detect all image data
        print('working on composer: %s' %(folder.split('\\')[-1])) #let us know what's going on
        for f in files:
            im = ndimage.imread(f, mode='L') #open image in grayscale mode to save memory (mode 'L')
            im = im/255 #normalise image data
            im = im.reshape(im.shape[0], im.shape[1], 1) #reshape to have explicit 1-channel color data (Keras and TensorFlow expect this)
            X_train.append(im) #put training image into data array
            Y_train.append(composers.index(folder.split('\\')[-1])) #put correct label into label array
    return np.asarray(X_train), np.asarray(Y_train)
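Before pointing the loader at gigabytes of images, the normalise-and-reshape steps are easy to sanity-check on a dummy image (random pixel values, purely for illustration):

```python
import numpy as np

#dummy grayscale 'row of music': 70x400 pixels with values 0-255
im = np.random.randint(0, 256, (70, 400)).astype(float)
im = im / 255                                  #normalise pixel values to the 0-1 range
im = im.reshape(im.shape[0], im.shape[1], 1)   #add an explicit single colour channel
print(im.shape)  #(70, 400, 1), the shape Keras and TensorFlow expect
```

If the shape or value range is off here, the model will complain (or silently train badly) later, so this is cheap insurance.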

Now on to the deep net. Deep Nets can be tricky to design and fine-tune. In a later tutorial I might go more in-depth on designing your own, but for now instead of reinventing the wheel let’s take an existing network architecture called a ResNet50(1), implemented in Keras(2) and work with that. Download the file here and put it in the same folder as your program.
Now import it, compile the model, set the parameters, and run it:

import numpy as np
from glob import glob
from scipy import ndimage
from keras import callbacks
from keras.optimizers import Adam, Adamax, SGD, RMSprop
import ResNet50

def convert_to_one_hot(Y, C):
    #function to return a one-hot encoding
    Y = np.eye(C)[Y.reshape(-1)].T
    return Y
if __name__ == '__main__':
    print('setting model')
    model = ResNet50.ResNet50(input_shape = (70, 400, 1), classes = 7)
    print('compiling model...')
    epochs = 100
    learning_rate = 0.001
    lr_decay = 0.001/100
    optimizer_instance = SGD(lr=learning_rate, decay=lr_decay)
    model.compile(optimizer=optimizer_instance, loss='categorical_crossentropy', metrics=['acc'])
    print('loading dataset......')
    datapath = 'Dataset_Train_Medium/'
    datapath_val = 'Dataset_Dev_Medium/'
    X_train, Y_train = load_dataset(datapath)
    X_test, Y_test = load_dataset(datapath_val)
    print('Applying one-hot-encoding')
    Y_train = convert_to_one_hot(Y_train, 7).T
    Y_test = convert_to_one_hot(Y_test, 7).T
    print('setting up callbacks...')
    nancheck = callbacks.TerminateOnNaN()
    filepath = 'Models/weights-{epoch:02d}-accuracy-{acc:.2f}.hdf5'
    saver = callbacks.ModelCheckpoint(filepath, monitor='acc', verbose=1, save_best_only=False, mode='max', period=1)
    logger = callbacks.CSVLogger('Models/trainingresults.log')
    callbacklist = [nancheck, saver, logger]
    print('starting model fitting')
    model.fit(X_train, Y_train, validation_data = (X_test, Y_test), epochs=epochs, batch_size=72, callbacks=callbacklist)

A few remarks about the code. First we call the ResNet50 function from the Keras implementation and pass it the required parameters (the input image size and the number of classes).
Next, several things happen when setting up and compiling the model. We define the number of epochs (full passes through the dataset) used to fit the network. We also define the learning rate and its decay function (starting at 0.001, decaying towards 0). This technique of starting with a higher learning rate and then decaying it over time is important when training for long periods of time. For those familiar with gradient descent optimisation techniques: a higher learning rate in the beginning helps the model take bigger steps towards the optimum, while decaying the learning rate as the model comes closer to the optimum helps it take smaller steps and prevents it from constantly ‘over-shooting’. If you’re not familiar with this, no worries. After defining this we choose an optimiser, in this case Stochastic Gradient Descent. I’ve also tested the model with Adam, Adamax and RMSprop (root mean square propagation). The default implementation of the Adam optimiser was numerically unstable for this problem (after many epochs the loss would spike, the accuracy would drop to chance level, and it would not recover). I did not bother to tune the hyperparameters to fix this, since SGD, Adamax, and RMSprop worked with little tuning. SGD converged quickest (surprisingly!) and reached the best validation accuracy, so it is implemented here. We then compile the model, using categorical cross-entropy as the loss function. Finally we tell the model that we are interested in the accuracy metric ‘acc’, so that it will display it along with the training loss.
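For reference, Keras applies this decay per batch update rather than per epoch, using (to the best of my knowledge) a time-based rule of the form lr_t = lr0 / (1 + decay * t). A standalone sketch of the resulting schedule with the settings used above:

```python
def decayed_lr(lr0, decay, iteration):
    #time-based learning rate decay, as applied by the Keras SGD optimizer
    return lr0 * (1.0 / (1.0 + decay * iteration))

lr0 = 0.001
decay = 0.001 / 100
for it in [0, 1000, 10000, 100000]:
    print('update %6d: lr = %.6f' % (it, decayed_lr(lr0, decay, it)))
```

With these settings the learning rate starts at 0.001 and is halved after 100,000 batch updates.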
Then we load the training and validation data sets, and convert the label arrays to a one-hot encoding, which is a requirement when using categorical cross-entropy.
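To see what the one-hot encoding does, here's convert_to_one_hot applied to a tiny label vector (the helper is repeated so the snippet runs standalone):

```python
import numpy as np

def convert_to_one_hot(Y, C):
    #index into an identity matrix: each label picks out its own unit vector
    return np.eye(C)[Y.reshape(-1)].T

labels = np.array([0, 2, 1])  #three samples with composer indices 0, 2 and 1
one_hot = convert_to_one_hot(labels, 3).T
#each row now holds a single 1 at the position of the label:
#[1,0,0], [0,0,1] and [0,1,0]
print(one_hot)
```

The extra .T matches what the main script does, giving one row per sample and one column per class.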
Next we set up what are called ‘callbacks‘ in Keras. Callbacks are executed after each training epoch and can be used to call any function you like; often this is used to output model statistics or parameters that can help spot problems in training early on. Here I set up three callbacks: a “terminate on NaN” callback that will stop training if the loss goes to NaN (often indicative of numerical overflow due to exploding gradients, for those familiar with the blood&guts of neural nets), a “Model Checkpoint” callback that will save our model weights every epoch so that we can choose the best performing one (sometimes a model will overfit in later epochs, making an earlier version the more optimal choice), and a “CSV logger” callback that will save our training and validation losses and accuracies to a file for later revision.
Finally we fit the model, feeding it the dev set to keep testing model performance along the way. If you get memory errors and related crashes, reduce the minibatch size. You may need to tweak the learning rate and/or the optimizer if the loss doesn’t start converging after several iterations.
I let it run for a few hours:


In the end it reached 100% validation accuracy, wow!

Wrapping Up
My computer can now beat me easily in recognising composers from a few bars of music, the cheeky little thing! Actually the very high performance sent me on a quest to find where I fucked up, because remember: 100% is rare. I split the data into a training, development and test set. This is considered good practise, and I hope you see why: a set used to evaluate performance in the fitting phase, even if independent of the fitting, can still influence which model you select. Having another, completely ‘foreign’ set gives you a double-check opportunity! Even on the test set the performance was 99.5%. I think this achievement deserves a name and a logo for the algorithm! I’ll dub him: Bytehoven. Other contenders were 8Bithoven, M0110zart and TchAIkovsky.

It is noteworthy (heh..) that full compositions were put in either the training set, the development set, or the test set. I did this because usually elements of the music repeat, as certain themes or rhythmic patterns recur throughout a given piece. Can you see why I split it this way? If I were to randomly sample xx% into each set, this might unfairly bias the classifier to higher accuracy, because it would recognise similarities within a piece rather than between pieces (which is the composer’s style we’re interested in).
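A split along composition boundaries can be sketched like this (the file naming scheme here is hypothetical, so adapt the piece_id extraction to however your own files are named):

```python
import random

def split_by_piece(files, train_frac=0.8, dev_frac=0.1, seed=42):
    #group files by the piece they belong to, then assign WHOLE pieces to a
    #split, so that no composition leaks between train, dev and test
    pieces = {}
    for f in files:
        piece_id = f.rsplit('_', 1)[0]  #assumes names like 'Chopin_Op9_3.jpg'
        pieces.setdefault(piece_id, []).append(f)
    ids = sorted(pieces)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    train = [f for p in ids[:n_train] for f in pieces[p]]
    dev = [f for p in ids[n_train:n_train + n_dev] for f in pieces[p]]
    test = [f for p in ids[n_train + n_dev:] for f in pieces[p]]
    return train, dev, test
```

Applied per composer folder, each composer then contributes whole pieces, never fragments of the same piece, to every split.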

Bytehoven’s Brains
Before wrapping up, I thought it would be cool to explore how Bytehoven perceives the world we present to him. As with all deep networks trained on visual data, earlier layers in the network should self-organise to recognise simple shapes or textures, with more complex constructs being represented deeper in the network. I came across the keras-vis repository, a tool to visualise what’s happening in your network. Let’s poke about a bit in Bytehoven’s brains.
First let’s take a look at how Bytehoven perceives the world, by using Keras-Vis to visualise the activations of the last network layer. This gives us the deep network activations across its entire visual field. Using a random snippet from Beethoven and one from Chopin, we get:

I thought that was pretty cool! You can clearly see how the network encodes the notation in terms of neuron activations throughout its visual field.
As a last detail, take a look at the filters throughout the network as visualised below. Each row represents three random filters from a particular network layer. Those at the top are early in the network; going down the image you see filters that occur deeper in the network, until at the end you have the activation for a piece from Beethoven. You can clearly see how the early filters detect simple shapes and patterns, and how, moving deeper into the network, these simple shapes are built into the more complicated patterns the network uses to decide who the composer is. Not surprisingly, circular textures (notes are rounded) and repeating rhythmic textures (rhythm is a defining feature of music) are encoded clearly in the filters! Note that this is just a subset of filters; there are about fifteen thousand individual filters in the ResNet50 that work together to form a prediction.

I hope you liked this tutorial. Making content takes time and effort. If you like the work here, feel free to buy me a beer or support the blog.
Next steps for Bytehoven are increasing the composers the algorithm knows about, and getting it to recognise orchestral scores as well. Keep an eye on the Github!


  2. Adapted from
