Moving Files From One Folder to Another Using Python
I've spent almost four weeks meticulously researching and gathering images from various sources for my capstone project at Flatiron. Since I chose to work on a two-class image classification problem using Convolutional Neural Networks, I had to gather as many images as possible. As is well known in the machine learning world, these networks require very large datasets: the more images the better!
So I scoured numerous websites and manually downloaded images to finally put together a dataset containing over 28,000 images. Doing so was no easy feat. Sometimes I saved images by hand and had to be sure each one went into its corresponding class folder; other times I downloaded pre-labeled datasets that already contained hundreds or thousands of images, to merge later with my existing data. After downloading came the issue of moving the contents of each subfolder to its correct directory, a tedious task that leaves no room whatsoever for error when working on a classification problem. I was already strapped for time (neural networks can take hours, days, or sometimes weeks to train, depending on the size of the dataset), so I was forced to research ways to handle all of the images efficiently.
Then I discovered shutil!
The shutil module is a small yet extremely powerful part of Python's standard library that simplifies and automates copying and moving files and directories from one location to another. I've come to realize it really is a priceless tool: what would have taken me hours to sort through by hand, shutil accomplished in 5 minutes! And because I am so grateful for this tool, I want to share what I've learned by walking through how I used it in my project below.
One of the pre-labeled datasets I downloaded contained 2 items: a .csv file of metadata, which I loaded as a dataframe, and a folder containing about 10,000 images. The dataframe contained a number of columns; however, only 2 were important for the task at hand: 'image_id' and 'diag' (for diagnosis). The 'diag' column consisted of 7 unique values, each of which belonged to one of the 2 classes I was working with. Each image was assigned a unique image_id and one of the diagnosis values.
My goal was to access the image_id of each image, identify its diagnosis, locate it in the images folder, and finally move it to the correct class folder. Once the dataset was complete, I'd split it into train and test sets and save the sets as folders.
Before proceeding I imported the packages I needed for reading my data frame and moving the files:
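A minimal sketch of such an import cell, assuming pandas is what reads the data frame. The file name metadata.csv and the helper load_metadata are placeholders for illustration; only the column names 'image_id' and 'diag' come from the dataset described above.

```python
import os
from shutil import copyfile

import pandas as pd

# Hypothetical file name; substitute the path to the real metadata file.
METADATA_CSV = "metadata.csv"


def load_metadata(csv_path=METADATA_CSV):
    """Read the metadata file, keeping only the two columns used in this post."""
    return pd.read_csv(csv_path, usecols=["image_id", "diag"])
```

Loading only the two relevant columns keeps the data frame small and makes it obvious which fields the rest of the code depends on.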
Then I researched the meaning of each of the 7 unique values in the diagnosis column so I could determine where to move the files. Since I was working on a skin cancer classification problem, determining whether a lesion was benign or malignant, it was important to understand what the types of lesions were, what their abbreviations were (the values of the diagnosis column), and whether or not each type was cancerous. For example, one of the values was 'nv', which meant nevus. I researched these lesions and discovered that they are not cancerous, so images with an 'nv' diagnosis would be moved to the benign folder.
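One way to record the outcome of that research is a dictionary from diagnosis code to class. The sketch below contains only the 'nv' entry, since that is the only code spelled out here; the remaining six would be filled in the same way. The names DIAG_TO_CLASS and class_for are hypothetical.

```python
# Only the 'nv' entry is taken from this post. The other six diagnosis
# codes must be added after researching each one.
DIAG_TO_CLASS = {
    "nv": "benign",  # nevus: not cancerous
}


def class_for(diag):
    """Return the class name for a diagnosis code.

    Raises KeyError for a code that has not been mapped yet, so an
    unresearched diagnosis is never silently filed under the wrong class.
    """
    try:
        return DIAG_TO_CLASS[diag]
    except KeyError:
        raise KeyError(f"no class assigned to diagnosis code {diag!r}") from None
```

Failing loudly on an unmapped code suits a task where a single misfiled image means a mislabeled training example.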
I then created my folders using os.mkdir:
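A sketch of that step, assuming one folder per class. The folder name 'malignant' and the helper make_class_folders are assumptions; the post only names the benign folder explicitly.

```python
import os

# Assumed folder names, one per class.
CLASS_FOLDERS = ["benign", "malignant"]


def make_class_folders(parent="."):
    """Create one folder per class under `parent` and return their paths."""
    paths = []
    for name in CLASS_FOLDERS:
        path = os.path.join(parent, name)
        # os.mkdir raises FileExistsError if the folder already exists,
        # so check first to keep the cell safe to re-run.
        if not os.path.isdir(path):
            os.mkdir(path)
        paths.append(path)
    return paths
```

Note that os.mkdir creates a single directory level only; the parent folder has to exist already.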
I defined a function that built each image's full file name by combining the image_id with the file extension (all of my images were converted to the .jpg format; more on that below). The function then looked up the diagnosis in my dataframe's 'diag' column and, using the copyfile() function from the shutil module, copied the image from the main folder to its correct class folder based on that diagnosis. Because copyfile() needs the full path of both the source file and the destination file, I used os.path.join() to combine each folder name with the image's file name.
So the syntax would be as follows:
copyfile(os.path.join(folder_to_copy_from, image), os.path.join(folder_to_copy_to, image))
Here is my for loop below:
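A minimal sketch of a function built around such a loop, following the steps described above. The function name, its parameters, and the diag_to_class dictionary (diagnosis code to class folder name) are hypothetical, not the exact code from my project.

```python
import os
from shutil import copyfile


def copy_images_by_diagnosis(df, source_folder, dest_root, diag_to_class,
                             extension=".jpg"):
    """Copy every image listed in `df` into the folder for its class.

    df            : a pandas DataFrame (or any mapping of column name to
                    values) with 'image_id' and 'diag' columns
    source_folder : folder that currently holds all of the images
    dest_root     : folder that contains the class folders
    diag_to_class : dict mapping each diagnosis code to a class folder name
    Returns the number of files copied.
    """
    copied = 0
    for image_id, diag in zip(df["image_id"], df["diag"]):
        image = f"{image_id}{extension}"  # full file name, e.g. 'abc.jpg'
        class_folder = os.path.join(dest_root, diag_to_class[diag])
        # copyfile needs the complete source and destination file paths.
        # It raises FileNotFoundError if the source image is missing.
        copyfile(os.path.join(source_folder, image),
                 os.path.join(class_folder, image))
        copied += 1
    return copied
```

Because copyfile() copies rather than moves, the original images folder is left untouched, which makes it safe to re-run the function if a mapping turns out to be wrong.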
And then I called my function to action:
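The call itself is a single line that passes the data frame and the folder names to the function. Since a misfiled image means a mislabeled example, a quick count of each class folder afterwards is a cheap sanity check. The helpers below are hypothetical and assume the images end in .jpg.

```python
import os


def count_images(folder, extension=".jpg"):
    """Count the files in `folder` whose names end with `extension` (case-insensitive)."""
    return sum(
        1
        for name in os.listdir(folder)
        if name.lower().endswith(extension.lower())
    )


def summarize_class_folders(folders):
    """Return a {folder: image count} dictionary for a quick visual check."""
    return {folder: count_images(folder) for folder in folders}
```

For example, summarize_class_folders(["benign", "malignant"]) returns the number of images in each folder. If the class folders started out empty and every image_id is unique, the counts should add up to the number of rows in the data frame.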