A (Not So) Little Thing Called Shutil



     I'm currently wrapping up my capstone Data Science project for Flatiron and I've chosen to work on an image classification problem with 2 classes using Convolutional Neural Networks. As is well known in the machine learning world, CNNs require very large datasets; the more images the better! So I've spent the first 3 1/2 weeks of the final phase meticulously researching and gathering images from various sources.

     I've downloaded, manually saved, and scraped images from numerous websites to finally amass a dataset containing over 30,000 images. When saving manually, I made sure to save each image to its corresponding class folder; other times I downloaded pre-labeled datasets already containing hundreds of images to merge with my existing data later. Then came the task of moving the contents of each subfolder to its correct directory, a tedious task that leaves no room for error when working with a classification problem.

    Enter Shutil!

    Shutil is a small yet extremely powerful module in Python's standard library that simplifies and automates copying and moving files and directories from one location to another. It really is a priceless tool; what would have taken hours to sort through by hand, Shutil accomplished in 5 minutes! I will walk through how I used this module in my project below.

    One of the pre-labeled datasets I downloaded contained 2 items: a .csv file of metadata, which I loaded as a dataframe, and a folder containing about 10,000 images. The dataframe contained a number of columns; however, only 2 were important for the task at hand: 'image_id' and 'diag' (for diagnosis). The 'diag' column consisted of 7 unique values, each of which belonged to one of the 2 classes I was working with. Each image was assigned a unique image_id and one of the diagnosis values.

    The goal was to access the image_id of each image, identify its diagnosis, locate it in the images folder, and finally move it to the correct class folder. Once the dataset was complete, I'd split it into train and test sets and save the sets as folders.

    Before proceeding I imported the packages I needed for reading my data frame and moving the files:
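    As a rough sketch, the imports for this kind of workflow might look like the following. The tiny in-memory table is a made-up stand-in for the real metadata .csv (the image ids and the 'mel' code are assumptions for illustration), so the snippet runs on its own:

```python
import io                      # only needed for the in-memory stand-in below
import os                      # building paths and creating folders
from shutil import copyfile    # copying one file to a new location

import pandas as pd            # reading the metadata .csv into a dataframe

# Hypothetical stand-in for the metadata file; with real data this would be
# pd.read_csv() pointed at the downloaded .csv instead.
sample_csv = io.StringIO(
    "image_id,diag\n"
    "img_0001,nv\n"
    "img_0002,mel\n"
    "img_0003,nv\n"
)
df = pd.read_csv(sample_csv)
print(df.head())
```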




    To begin, I displayed the 'diag' column to understand its values:
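    One way to inspect a column like this, sketched on a small made-up dataframe (the rows and the 'mel' code are assumptions, not the real data), is with unique() and value_counts():

```python
import pandas as pd

# Hypothetical miniature version of the metadata dataframe.
df = pd.DataFrame({
    "image_id": ["img_0001", "img_0002", "img_0003"],
    "diag": ["nv", "mel", "nv"],
})

# The distinct diagnosis codes, and how many images carry each one.
unique_values = df["diag"].unique()
counts = df["diag"].value_counts()

print(unique_values)
print(counts)
```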



    Then I researched the meaning of each of the 7 unique values in the diagnosis column so I could determine where to move the files. Since I was working on a skin cancer classification problem to determine whether a lesion was benign or malignant, it was important to understand what the types of lesions were, what their abbreviations were (the values of the diagnosis column), and whether or not each was cancerous. For example, one of the values was 'nv', meaning nevus. I researched these lesions and discovered that they are not cancerous, so images with an 'nv' diagnosis would be moved to the benign folder.

    Next, using list comprehensions, I saved the image names and diagnoses in variables:
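    A minimal sketch of that step, again on a made-up dataframe (the variable names image_names and diagnoses are my own, not necessarily the ones used in the project):

```python
import pandas as pd

# Hypothetical miniature version of the metadata dataframe.
df = pd.DataFrame({
    "image_id": ["img_0001", "img_0002", "img_0003"],
    "diag": ["nv", "mel", "nv"],
})

# One list per column; position i in both lists refers to the same image.
image_names = [image_id for image_id in df["image_id"]]
diagnoses = [diag for diag in df["diag"]]

print(image_names)
print(diagnoses)
```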




    I then created my folders using os.mkdir:



    And then defined variables for each of my folders:
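    These two steps could look something like the sketch below. The folder names and the temporary base directory are assumptions chosen so the example can run anywhere; in a real project the base would be the project folder:

```python
import os
import tempfile

# Stand-in for the project directory (an assumption for this sketch).
base_dir = tempfile.mkdtemp()

# One variable per class folder, built with os.path.join().
benign_dir = os.path.join(base_dir, "benign")
malignant_dir = os.path.join(base_dir, "malignant")

# os.mkdir() raises FileExistsError if the folder is already there,
# so check first.
for folder in (benign_dir, malignant_dir):
    if not os.path.isdir(folder):
        os.mkdir(folder)

print(sorted(os.listdir(base_dir)))
```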

    And now on to the first example of my use of shutil in all its beauty!

    I defined a function that built each image's full file name by combining its image_id with the file type (all of my images were converted to the .jpg format; more on that below). My function then found the image's diagnosis in my dataframe's column and, using the copyfile() function from the shutil module, copied the image from the main folder to its correct class folder based on that diagnosis. Because copyfile() needs the full path of both the source and the destination file, I used os.path.join() to combine each folder with the file name.

    So the syntax would be as follows:

        copyfile(os.path.join(folder_to_copy_from, image), os.path.join(folder_to_copy_to, image))

    Here is my for loop below:
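    The function and loop described above can be sketched roughly as follows. This is not the original code: the function name sort_images, the class_folders mapping, the 'mel' code, and the placeholder files are all assumptions made so the example is self-contained.

```python
import os
import tempfile
from shutil import copyfile

# Temporary stand-ins for the images folder and the two class folders.
base_dir = tempfile.mkdtemp()
images_dir = os.path.join(base_dir, "images")
benign_dir = os.path.join(base_dir, "benign")
malignant_dir = os.path.join(base_dir, "malignant")
for folder in (images_dir, benign_dir, malignant_dir):
    os.mkdir(folder)

# Made-up ids and diagnoses, plus placeholder files to copy.
image_names = ["img_0001", "img_0002", "img_0003"]
diagnoses = ["nv", "mel", "nv"]
for name in image_names:
    with open(os.path.join(images_dir, name + ".jpg"), "wb") as f:
        f.write(b"placeholder")

# Diagnosis code -> class folder. Only 'nv' comes from the post; each
# remaining code would need to be researched and added here.
class_folders = {"nv": benign_dir, "mel": malignant_dir}


def sort_images(image_names, diagnoses, source_dir, class_folders):
    """Copy each image into the folder that matches its diagnosis."""
    for image_id, diag in zip(image_names, diagnoses):
        image = image_id + ".jpg"           # full file name
        target_dir = class_folders[diag]    # KeyError on an unmapped code
        copyfile(os.path.join(source_dir, image),
                 os.path.join(target_dir, image))


sort_images(image_names, diagnoses, images_dir, class_folders)
```

    Looking the folder up in a dictionary means an unexpected diagnosis code stops the loop with an error instead of quietly landing in the wrong class.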


    And voilà! My files have been successfully copied to their correct folders. So simple and convenient for sorting batches of images! To be sure, I checked my 2 directories:
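    One simple way to run such a check is to count the files in each folder. The helper name count_files and the temporary folder below are assumptions for this sketch:

```python
import os
import tempfile


def count_files(folder):
    """Return how many regular files sit directly inside folder."""
    return len([
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    ])


# Demo on a temporary folder holding two placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("img_0001.jpg", "img_0003.jpg"):
    with open(os.path.join(demo_dir, name), "wb") as f:
        f.write(b"placeholder")

file_count = count_files(demo_dir)
print(file_count)
```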


Success!

    After sourcing more data and completing my dataset, I was ready to split my directories into training and test sets. I did so using the shutil module's copyfile() function once more, this time inside a function that first checked that each file wasn't empty and then defined the training and test set lengths. Each set was then shuffled before being copied to its respective folder:
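    A split function along these lines might look like the sketch below. It is an approximation rather than the original code: the name split_data, its parameters, and the optional seed are assumptions, and it shuffles the full file list once before cutting it into the two sets.

```python
import os
import random
import tempfile
from shutil import copyfile


def split_data(source_dir, train_dir, test_dir, split_size, seed=None):
    """Copy a random train/test split of the non-empty files in source_dir."""
    # Keep only regular files that actually contain data.
    files = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            files.append(name)

    # Shuffle, then cut the list at the training length.
    shuffled = random.Random(seed).sample(files, len(files))
    train_length = int(len(shuffled) * split_size)
    train_set = shuffled[:train_length]
    test_set = shuffled[train_length:]

    for name in train_set:
        copyfile(os.path.join(source_dir, name), os.path.join(train_dir, name))
    for name in test_set:
        copyfile(os.path.join(source_dir, name), os.path.join(test_dir, name))
    return train_set, test_set


# Demo: ten placeholder files, the first of which is left empty.
base_dir = tempfile.mkdtemp()
source_dir = os.path.join(base_dir, "benign")
train_dir = os.path.join(base_dir, "train")
test_dir = os.path.join(base_dir, "test")
for folder in (source_dir, train_dir, test_dir):
    os.mkdir(folder)
for i in range(10):
    with open(os.path.join(source_dir, f"img_{i}.jpg"), "wb") as f:
        if i > 0:
            f.write(b"placeholder")

train_set, test_set = split_data(source_dir, train_dir, test_dir, 0.8, seed=42)
print(len(train_set), len(test_set))
```

    Because the files are copied rather than moved, the source folder stays intact and the split can be rerun with a different ratio.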



    And then I called my function:


    And checked for success:


And there we have it! A quick and easy way of copying images, labeled or not, in batches from one folder to the next!

A quick note on the image format: When copying files in one batch, I found the simplest method was to convert all of the files to one format first, before looping over them, to avoid potentially being thrown an error. Thankfully, I found this snippet of code on Stack Overflow that converts from one format to another:


Since most of my files were in the .jpg format with only a couple hundred having the .jpeg type, I decided to convert them all to the majority type. To customize to your data, set the inputPath to the folder where your original files are located, then simply replace the ".jpeg" on the inputFiles variable with the format of your files and the ".jpg" on the outputFile variable with your desired format. I'd recommend setting your output path to a different folder from the original one to avoid duplicates, and then copying back after deleting the .jpegs.
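For the specific .jpeg to .jpg case, here is a standard-library sketch. It is not the Stack Overflow snippet: it only reuses the inputPath, inputFiles, and outputFile names mentioned above, and the outputPath name, folders, and placeholder files are assumptions. Since .jpeg and .jpg are two extensions for the same format, copying each file under the new extension is enough here; converting between genuinely different formats would need an imaging library, which this sketch does not cover.

```python
import glob
import os
import tempfile
from shutil import copyfile

# Temporary stand-ins for the original folder and a separate output folder.
base_dir = tempfile.mkdtemp()
inputPath = os.path.join(base_dir, "original")
outputPath = os.path.join(base_dir, "converted")
os.mkdir(inputPath)
os.mkdir(outputPath)
for name in ("a.jpeg", "b.jpeg", "c.jpg"):
    with open(os.path.join(inputPath, name), "wb") as f:
        f.write(b"placeholder")

# Every file in the original folder that ends in .jpeg.
inputFiles = glob.glob(os.path.join(inputPath, "*.jpeg"))

for inputFile in inputFiles:
    # Same base name, new extension, written to the separate output folder.
    stem = os.path.splitext(os.path.basename(inputFile))[0]
    outputFile = os.path.join(outputPath, stem + ".jpg")
    copyfile(inputFile, outputFile)

print(sorted(os.listdir(outputPath)))
```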

Thank you very much and I hope you enjoyed this post!
