Moving Files From One Folder to Another Using Python

 




    I spent almost four weeks researching and gathering images from various sources for my capstone project at Flatiron. Since I chose to work on a two-class image classification problem using Convolutional Neural Networks, I had to gather as many images as possible. As is well known in the machine learning world, these networks require very large datasets: the more images, the better!

    I manually downloaded images and scoured numerous websites to finally put together a dataset containing over 28,000 images. Doing so was no easy feat. When saving manually, I had to be sure to save each image to its corresponding class folder; other times I downloaded pre-labeled datasets that already contained hundreds or thousands of images, to merge later with my existing data. After downloading came the issue of moving the contents of each subfolder to its correct directory, a tedious task that leaves no room for error in a classification problem. As I was already strapped for time (neural networks can take hours, days, or sometimes weeks to train, depending on the size of the dataset), I was forced to research ways to handle all of the images efficiently.

    Then I discovered shutil!

    shutil is a small yet extremely powerful module in the Python standard library that simplifies and automates copying and moving files and directories from one location to another. I've come to realize it really is a priceless tool: what would have taken me hours to sort through by hand, shutil accomplished in 5 minutes! And because I am so grateful for this tool, I want to share what I've learned by walking through how I used it in my project.

    One of the pre-labeled datasets I downloaded contained 2 items: a .csv file of metadata, which I read into a dataframe, and a folder containing about 10,000 images. The dataframe contained a number of columns; however, only 2 were important for the task at hand: 'image_id' and 'diag' (for diagnosis). The 'diag' column consisted of 7 unique values, each of which belonged to one of the 2 classes I was working with. Each image was assigned a unique image_id and one of the diagnosis values.

    My goal was to access the image_id of each image, identify its diagnosis, locate it in the images folder, and finally move it to the correct class folder. Once the dataset was complete, I'd then split it into train and test sets and save the sets as folders.

   Before proceeding I imported the packages I needed for reading my data frame and moving the files:
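A minimal sketch of what that setup might look like. It assumes pandas was the library used for the data frame (the post doesn't name it), and `load_metadata` and the `csv_path` argument are hypothetical names introduced here for illustration:

```python
import os        # build file paths and create folders
import shutil    # copy files between folders

import pandas as pd  # assumption: the metadata .csv is read with pandas


def load_metadata(csv_path):
    """Read the metadata .csv and keep only the two columns used in this post."""
    df = pd.read_csv(csv_path)
    return df[["image_id", "diag"]]
```

Calling `load_metadata` with the path to the downloaded .csv would return a two-column data frame to work from.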




 And then I proceeded.

To begin, I first displayed the diagnosis ('diag') column to understand its values:
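One way to inspect the column, again assuming a pandas data frame. The tiny frame below is made up so the snippet runs on its own, and 'xx' is a placeholder code rather than a value from the real dataset:

```python
import pandas as pd


def summarize_diagnoses(df):
    """Return how many images carry each diagnosis code, most common first."""
    return df["diag"].value_counts()


# Made-up rows for illustration only; "xx" is a placeholder diagnosis code.
example = pd.DataFrame({"image_id": ["a", "b", "c"], "diag": ["nv", "nv", "xx"]})
print(example["diag"].unique())      # the distinct diagnosis codes
print(summarize_diagnoses(example))  # number of images per code
```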



    Then I researched the meaning of each of the 7 unique values in the diagnosis column so I could determine where to move the files. Since I was working on a skin cancer classification problem to determine whether a lesion was benign or malignant, it was important to understand what the types of lesions were, what their abbreviations were (the values of the diagnosis column), and whether or not each was cancerous. For example, one of the values was 'nv', which meant nevus. I researched these lesions and discovered that they are not cancerous, so images with an 'nv' diagnosis would be moved to the benign folder.

    Next, using list comprehensions, I defined variables for the image names and diagnoses:
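A sketch of how those two lists could be built. `build_lists` is a hypothetical helper; it accepts any iterable, so pandas columns such as `df["image_id"]` and `df["diag"]` would work as well as plain lists:

```python
def build_lists(image_ids, diags, extension=".jpg"):
    """Turn the two metadata columns into parallel lists of file names and diagnoses."""
    image_names = [f"{image_id}{extension}" for image_id in image_ids]
    diagnoses = [str(diag) for diag in diags]
    return image_names, diagnoses
```

Because the two lists stay in the same order, the file name at a given position always belongs to the diagnosis at that position.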




    I then created my folders using os.mkdir:
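A minimal sketch of that step. The folder names 'benign' and 'malignant' follow the two classes described in this post, while `make_class_folders` and `base_dir` are hypothetical names:

```python
import os


def make_class_folders(base_dir, class_names=("benign", "malignant")):
    """Create one subfolder per class under base_dir and return their paths."""
    if not os.path.isdir(base_dir):
        os.mkdir(base_dir)            # the parent of base_dir must already exist
    paths = []
    for name in class_names:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):   # os.mkdir raises FileExistsError otherwise
            os.mkdir(path)
        paths.append(path)
    return paths
```

The existence checks make the function safe to run more than once, which is handy when re-running a notebook.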



    And then defined variables for each of my folders:
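For illustration, the folder variables might look like the following. Every name and location here is a placeholder to be swapped for your own:

```python
import os

# All names below are placeholders; substitute your own locations.
base_dir = "data"
images_folder = os.path.join(base_dir, "images")        # where the downloaded images live
benign_folder = os.path.join(base_dir, "benign")        # destination for benign lesions
malignant_folder = os.path.join(base_dir, "malignant")  # destination for malignant lesions
```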

    And now I will show you the first example of my use of shutil in all its beauty!

    I defined a function that accessed each image by combining the image_id with the file extension (all of my images were converted to the .jpg format; more on that below) to get the full file name. My function then found the diagnosis in my dataframe's column and, using the copyfile() function from the shutil module, copied the image from the main folder to its correct class folder based on its diagnosis. Because copyfile() needs the full path to each file, I used os.path.join() to combine each folder with the file name.

    So the syntax would be as follows:

    

    copyfile(os.path.join(folder_to_copy_from, image), os.path.join(folder_to_copy_to, image))

    Here is my for loop below:
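A sketch of such a loop, written as a function. The names `copy_by_diagnosis`, `class_dirs` and `diag_to_class` are hypothetical. Only the 'nv' to benign mapping comes from this post, so the remaining codes have to be filled in from your own research; codes missing from the mapping are skipped rather than guessed:

```python
import os
from shutil import copyfile


def copy_by_diagnosis(image_ids, diagnoses, source_dir, class_dirs, diag_to_class):
    """Copy each image into the folder of the class its diagnosis maps to.

    class_dirs maps a class name to its folder; diag_to_class maps a
    diagnosis code to a class name. Returns the number of files copied.
    """
    copied = 0
    for image_id, diag in zip(image_ids, diagnoses):
        image = f"{image_id}.jpg"       # all images share the .jpg extension
        label = diag_to_class.get(diag)
        if label is None:               # unknown code: skip rather than guess
            continue
        src = os.path.join(source_dir, image)
        if not os.path.isfile(src):     # listed in the metadata but not on disk
            continue
        copyfile(src, os.path.join(class_dirs[label], image))
        copied += 1
    return copied
```

A call would pass something like `{"nv": "benign"}` extended with the other six codes as `diag_to_class`, and a dict of the two class folders as `class_dirs`.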


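    And just like that, my files were in their correct folders! (Strictly speaking they were copied rather than moved: copyfile() leaves the originals in place, so the source folder still holds every image.) To be sure, I checked my 2 directories: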
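One simple way to run that check is to count the files in each folder and compare the totals against the metadata. `count_files` is a hypothetical helper:

```python
import os


def count_files(folder):
    """Count the regular files directly inside folder (subfolders are ignored)."""
    return sum(
        1 for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
```

Calling it once on the benign folder and once on the malignant folder should give two counts that add up to the number of images you expected to copy.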


Success!  So simple and incredibly convenient for moving batches of images!

    After sourcing more data and completing my dataset, I was ready to split my directories into training and test sets. I did so using the shutil module's copyfile() function once more, this time in a function that first checked that each file wasn't empty and then defined the training and test set lengths. The files were shuffled before being copied to their respective folders. Here is my function below:
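A sketch of a split function along those lines. The name `split_data`, its parameters, and the 0.8 default are assumptions, since the post doesn't state the ratio it used. This version shuffles the file list once and then cuts it at the training length, which keeps the two sets from overlapping:

```python
import os
import random
from shutil import copyfile


def split_data(source_dir, train_dir, test_dir, split_size=0.8, seed=None):
    """Copy a shuffled split_size fraction of source_dir to train_dir, the rest to test_dir."""
    # Keep only non-empty regular files; a zero-byte image cannot be loaded later.
    files = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            files.append(name)
        else:
            print(f"{name} is empty or not a file, skipping.")

    # Shuffle once, then cut the list at the training length.
    random.Random(seed).shuffle(files)
    train_len = int(len(files) * split_size)
    train_set, test_set = files[:train_len], files[train_len:]

    for name in train_set:
        copyfile(os.path.join(source_dir, name), os.path.join(train_dir, name))
    for name in test_set:
        copyfile(os.path.join(source_dir, name), os.path.join(test_dir, name))
    return len(train_set), len(test_set)
```

It would be called once per class, for example on the benign folder with a train/benign and a test/benign destination (placeholder paths that must already exist). Passing a `seed` makes the split reproducible.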



    And then I called my function to action:


    A quick check to make sure it performed as it should:
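A possible check, sketched with a hypothetical `check_split` helper: it reports the size of each set, the fraction that went to training, and any file names that ended up in both folders (there should be none):

```python
import os


def check_split(train_dir, test_dir):
    """Summarize a train/test split: counts, training fraction, and overlapping names."""
    train_files = set(os.listdir(train_dir))
    test_files = set(os.listdir(test_dir))
    total = len(train_files) + len(test_files)
    return {
        "train": len(train_files),
        "test": len(test_files),
        "train_fraction": len(train_files) / total if total else 0.0,
        "overlap": sorted(train_files & test_files),  # should be empty
    }
```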


And there we have it! A quick and easy way of moving images, labeled or not, in batches from one folder to the next, with far less room for error than sorting them by hand!


A quick note on the image format: when copying files in one batch, I found the simplest method was to convert all of the files to one format before looping over them, to avoid potentially being thrown an error. Thankfully, I found this snippet of code on Stack Overflow that converts from one format to another:


Since most of my files were in the .jpg format, with only a couple hundred having the .jpeg extension, I decided to convert them all to the majority type. To customize this for your data, set inputPath to the folder where your original files are located, then replace ".jpeg" in the inputFiles variable with the format of your files and ".jpg" in the outputFile variable with your desired format. I'd recommend setting your output path to a different folder from the original one to avoid duplicates, and then copying the files back after deleting the .jpegs.
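The steps described above can be sketched as follows. This is not the Stack Overflow snippet itself; the variable names inputPath, inputFiles and outputFile follow the description, while `convert_extension`, `outputPath`, `old_ext` and `new_ext` are hypothetical. Since .jpeg and .jpg are two extensions for the same JPEG format, the sketch simply copies each file under the new extension. Converting between genuinely different formats (say .png to .jpg) would need an image library to re-encode the file, which this sketch does not do:

```python
import glob
import os
from shutil import copyfile


def convert_extension(inputPath, outputPath, old_ext=".jpeg", new_ext=".jpg"):
    """Copy every old_ext file in inputPath to outputPath under new_ext instead."""
    os.makedirs(outputPath, exist_ok=True)
    inputFiles = glob.glob(os.path.join(inputPath, "*" + old_ext))
    written = []
    for inputFile in inputFiles:
        stem = os.path.splitext(os.path.basename(inputFile))[0]
        outputFile = os.path.join(outputPath, stem + new_ext)
        copyfile(inputFile, outputFile)  # same bytes, new extension
        written.append(outputFile)
    return sorted(written)
```

Writing to a separate output folder means the originals stay untouched until you have confirmed the copies look right.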

Thank you very much and I hope you enjoyed this post!

To see the repository for my Skin Cancer Classification project, click here.
