A (Not So) Little Thing Called Shutil
I'm currently wrapping up my capstone Data Science project for Flatiron and I've chosen to work on an image classification problem with 2 classes using Convolutional Neural Networks. As is well known in the machine learning world, CNNs require very large datasets; the more images the better! So I've spent the first 3 1/2 weeks of the final phase meticulously researching and gathering images from various sources.
I've downloaded, manually saved, and scraped numerous websites to finally amass a dataset containing over 30,000 images. When saving manually, I made sure to save each image to its corresponding class folder; other times I downloaded pre-labeled datasets already containing hundreds of images to merge with my existing data later. Then came the task of moving the contents of each subfolder to its correct directory, a tedious task that offers no room for error when working with a classification problem.
Enter Shutil!
Shutil is a small yet extremely powerful module in Python's standard library that simplifies and automates copying and moving files and directories from one location to another. It really is a priceless tool; sorting that would have taken hours by hand, Shutil finished in 5 minutes! I will walk through how I used this module in my project below.
One of the pre-labeled datasets I downloaded contained 2 items: a .csv file of metadata, which I read into a dataframe, and a folder containing about 10,000 images. The dataframe contained a number of columns; however, only 2 were important for the task at hand: 'image_id' and 'diag' (for diagnosis). The 'diag' column consisted of 7 unique values, each belonging to one of the 2 classes I was working with. Each image was assigned a unique image_id and one of the diagnosis values.
The goal was to access the image_id of each image, identify its diagnosis, locate it in the images folder and finally move it to the correct class folder. Once the dataset was complete I'd then split it into train and test sets and save each set as a folder.
Before proceeding I imported the packages I needed for reading my data frame and moving the files:
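A minimal sketch of what that setup might look like, assuming pandas is the dataframe library. The helper load_metadata and its csv_path argument are illustrative names, not part of the original project:

```python
import os                     # building paths and creating folders
from shutil import copyfile   # copying files between folders

import pandas as pd           # reading the metadata .csv into a dataframe


def load_metadata(csv_path):
    """Read the metadata file and keep only the two columns used for sorting.

    csv_path is a placeholder for wherever the .csv was saved.
    """
    df = pd.read_csv(csv_path)
    return df[["image_id", "diag"]]
```

Trimming the dataframe to 'image_id' and 'diag' up front keeps the later steps focused on the only two columns the sorting needs.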
Then I researched the meaning of each of the 7 unique values in the diagnosis column so I could determine where to move the files. Since I was working on a skin cancer classification problem to determine whether a lesion was benign or malignant, it was important to understand what the types of lesions were, what their abbreviations were (the values of the diagnosis column), and whether or not each one was cancerous. For example, one of the values was 'nv', meaning nevus. I researched these lesions and discovered that they are not cancerous, so images with a 'nv' diagnosis would be moved to the benign folder.
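One way to record that research is a dictionary from diagnosis code to class, sketched here on a toy dataframe. Only 'nv' mapping to benign comes from the description above; the code 'xx', the image ids, and the 'label' column are made-up placeholders for illustration:

```python
import pandas as pd

# Toy stand-in for the real metadata. 'xx' is NOT a real diagnosis code;
# it only marks where the other researched codes would go.
df = pd.DataFrame({
    "image_id": ["img_001", "img_002", "img_003"],
    "diag": ["nv", "xx", "nv"],
})

# List the distinct codes that need to be researched
print(df["diag"].unique())

# Code -> class, filled in from the research on each lesion type
diag_to_class = {
    "nv": "benign",      # nevus: not cancerous
    "xx": "malignant",   # hypothetical placeholder entry
}

# Attach the class to every row
df["label"] = df["diag"].map(diag_to_class)

# Codes missing from the dictionary come back as NaN, so this catches gaps
unmapped = df.loc[df["label"].isna(), "diag"].unique()
```

Checking that unmapped is empty before touching any files is a cheap guard against an image ending up in no folder at all.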
I then created my folders using os.mkdir:
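A sketch of that step, assuming one folder per class named benign and malignant inside a root folder. The function name make_class_folders and the root argument are illustrative:

```python
import os


def make_class_folders(root, class_names=("benign", "malignant")):
    """Create root (if needed) and one subfolder per class.

    os.mkdir creates a single level only, so the parent of root
    must already exist. Returns the list of class folder paths.
    """
    if not os.path.isdir(root):
        os.mkdir(root)
    paths = []
    for name in class_names:
        path = os.path.join(root, name)
        if not os.path.isdir(path):   # os.mkdir raises if the folder exists
            os.mkdir(path)
        paths.append(path)
    return paths
```

The isdir checks make the function safe to run again after a partial run, since os.mkdir on its own raises FileExistsError for a folder that is already there.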
I defined a function that accessed each image by combining the image_id with the file extension (all of my images were converted to the .jpg format; more on that below) to get the full file name. My function then found the diagnosis in my dataframe's column and, using the copyfile() function from the shutil module, copied the image from the main folder to its correct class folder based on its diagnosis. Because copyfile() needs the full path to both the source and the destination file, I used os.path.join() to combine each folder with the file name.
So the syntax would be as follows:
copyfile(os.path.join(folder_to_copy_from, image), os.path.join(folder_to_copy_to, image))
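To see that syntax run end to end, here is a small sketch that uses throwaway temporary folders and a dummy file. The file name example.jpg and its contents are placeholders:

```python
import os
import tempfile
from shutil import copyfile

# Throwaway folders so the example can run anywhere
folder_to_copy_from = tempfile.mkdtemp()
folder_to_copy_to = tempfile.mkdtemp()
image = "example.jpg"  # placeholder file name

# Create a dummy file to stand in for an image
with open(os.path.join(folder_to_copy_from, image), "wb") as f:
    f.write(b"not a real image")

# The destination must be a complete file path, not just a folder
copyfile(os.path.join(folder_to_copy_from, image),
         os.path.join(folder_to_copy_to, image))

# copyfile copies rather than moves, so the file now sits in both folders
print(os.listdir(folder_to_copy_from), os.listdir(folder_to_copy_to))
```

Because the original stays in place, a mistake in the sorting can be undone by deleting the class folders and starting over. If a true move is wanted instead, shutil.move() takes the same two paths.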
Here is the function, built around a for loop, followed by the call that put it to work:
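What follows is a sketch of how such a function could look, not the exact code from the project. The name sort_images, its parameters, and the paths in the example call are all assumptions; the only pieces taken from the description are the 'image_id' and 'diag' columns, the .jpg extension, copyfile() and os.path.join():

```python
import os
from shutil import copyfile


def sort_images(metadata, source_dir, dest_root, diag_to_class, ext=".jpg"):
    """Copy each image into the class folder that matches its diagnosis.

    metadata      -- anything with 'image_id' and 'diag' columns
                     (a pandas DataFrame works)
    source_dir    -- folder holding all the images
    dest_root     -- folder that already contains one subfolder per class
    diag_to_class -- dict mapping each diagnosis code to a class folder name
    Returns the number of files copied.
    """
    copied = 0
    for image_id, diag in zip(metadata["image_id"], metadata["diag"]):
        image = f"{image_id}{ext}"          # full file name
        class_name = diag_to_class[diag]    # KeyError flags an unmapped code
        copyfile(os.path.join(source_dir, image),
                 os.path.join(dest_root, class_name, image))
        copied += 1
    return copied


# Example call (folder names are placeholders):
# sort_images(df, "images", "sorted", diag_to_class)
```

The function deliberately fails loudly: an unmapped diagnosis raises KeyError and a missing image raises FileNotFoundError, which suits a task with no room for mislabeled files. Comparing the returned count with the number of rows in the dataframe is a quick final check.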