✅ Motivation

As a data scientist, you might be tempted to jump into modeling as soon as you can. I mean, that’s the fun part, right?

But trust me, if you skip straight to modeling without taking the time to really understand the problem and analyze the data, you’re setting yourself up for failure.

I’ve been there.

You might feel like a superstar, but you’ll end up with a model that doesn’t work 🤦‍♂️.

But how do we even begin inspecting large datasets of images effectively and efficiently? And can we really do it on a local computer quickly, for free?

Sounds too good to be true, eh?

It’s not, with 👇

⚡ fastdup

fastdup is a tool that lets us gain insights from a large image/video collection. You can manage, clean, and curate your images at scale on your local machine with a single CPU. It’s incredibly easy to use and highly efficient.

At first, I was skeptical. How could a single tool handle my data cleaning and curation needs on a single CPU machine, especially if the dataset is huge? But I was curious, so I decided to give it a try.

And I have to say, I was pleasantly surprised.

fastdup lets me clean my visual data with ease, freeing up valuable resources and time.

Here are some superpowers you get with fastdup. It lets you identify:

fastdup superpowers. Source: fastdup GitHub.


In short, fastdup is 👇

  • Unsupervised: fits any visual dataset.
  • Scalable: handles 400M images on a single machine.
  • Efficient: works on CPU (even on Google Colab with only 2 CPU cores!).
  • Low Cost: can process 12M images on a $1 cloud machine budget.

The best part? fastdup is free.

It’s easy to get started and use. The authors of fastdup even used it to uncover over 1.2M duplicates and 104K train/validation data leaks in the ImageNet-21K dataset here.

tip

⚡ By the end of this post, you will learn how to:

  • Install fastdup and run it on your local machine.
  • Find duplicates and anomalies in your dataset.
  • Identify wrong/confusing labels in your dataset.
  • Uncover data leaks in your dataset.

📝 NOTE: All the code used in this post is in my GitHub repo. Alternatively, you can run this example in Colab.

If that looks interesting, let’s dive in.

📖 Installation

To start, run:

pip install fastdup

Feel free to use the latest version available. I’m running fastdup==0.189 for this post.
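Since the function names and parameters in this post are the ones I used with fastdup==0.189, it can help to check which version you actually have before running the snippets. Here is a minimal sketch using only the Python standard library; the helper name installed_version is my own and not part of fastdup.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed fastdup version, or None if fastdup is missing.
print(installed_version("fastdup"))
```

If the printed version differs from mine, the calls in this post may need adjusting.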

🖼 Dataset

I will be using an openly available image classification dataset from Intel. The dataset contains 25,000 images (150 x 150 pixels) of natural scenes from around the world in 6 categories:

  1. buildings
  2. forest
  3. glacier
  4. mountain
  5. sea
  6. street

Samples from dataset.

tip

I encourage you to run this example on a dataset of your choice. You can find some inspiration here.

🏋️‍♀️ fastdup in Action: Discovering Data Issues

Next, download the data locally and organize it in a folder structure. Here’s the structure I have on my computer.

scene_classification
├── data
│   ├── train_set
│   │   ├── buildings
│   │   │   ├── image1.jpg
│   │   │   ├── image2.jpg
│   │   │   ├── ...
│   │   ├── mountain
│   │   │   ├── image10.jpg
│   │   │   ├── image11.jpg
│   │   │   ├── ...
│   ├── valid_set
│   │   ├── buildings
│   │   │   ├── image100.jpg
│   │   │   ├── ...
│   └── test_set
└── report
    ├── train
    ├── train_valid
    └── valid

note

Description of folders:

  • data/ – Folder to store all datasets.
  • report/ – Directory to save the output generated by fastdup.

📝 NOTE: For simplicity, I’ve also included the datasets in my GitHub repo.
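Before pointing fastdup at a folder, I find it useful to confirm the layout is what I think it is. The sketch below counts images per class folder inside one split. It assumes the one-folder-per-class layout shown above; the helper name and the list of image extensions are my own choices, so adjust them for your data.

```python
from pathlib import Path

# Extensions treated as images here; an assumption, adjust for your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_folder_name: image_count} for one split such as train_set."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

For example, count_images_per_class("scene_classification/data/train_set") should return one entry per class folder. A class with zero images or an unexpected folder name is worth fixing before going further.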

To start checking through the images, create a Jupyter notebook and run:

import fastdup
fastdup.run(input_dir='scene_classification/data/train_set/', 
            work_dir="scene_classification/report/train/")

note

Parameters for the run method:

  • input_dir – Path to the folder containing images. In this post, we are checking the training dataset.
  • work_dir – Optional. Path to save the outputs from the run. If not specified, the output will be saved to the current directory.

📝 NOTE: More info on other parameters here.

fastdup will run through all images in the folder to check for issues. How long it takes depends on how powerful your CPU is. On my machine, with an Intel Core™ i9-11900, it takes under 1 minute to check through the approx. 25,000 images in the folder 🤯.

Once complete, you’ll find a bunch of output files in the work_dir folder. We can now visualize them accordingly.
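If you are curious about what was written, a quick way to peek is to list the files in work_dir. This is a small standard-library sketch; list_outputs is a helper name I made up, and the exact set of files you see depends on your fastdup version.

```python
from pathlib import Path


def list_outputs(work_dir):
    """Return (file name, size in bytes) for every file directly inside work_dir, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(work_dir).iterdir()
        if p.is_file()
    )
```

Calling list_outputs("scene_classification/report/train/") should include the similarity.csv and outliers.csv files used in the rest of this post.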

The upcoming sections show how you can visualize duplicates, anomalies, confusing labels and data leakage. Read on.

🧑‍🤝‍🧑 Duplicates

First, let’s see if there are duplicates in the train_set. Let’s load the file and visualize them with:

from IPython.display import HTML
fastdup.create_duplicates_gallery(similarity_file='scene_classification/report/train/similarity.csv',
                                  save_path='scene_classification/report/train/',
                                  num_images=5)

HTML('scene_classification/report/train/similarity.html')

note

Parameters for create_duplicates_gallery method:

  • similarity_file – A .csv file with the computed similarities generated by the run method.
  • save_path – Path to save the visualization. Defaults to './'.
  • num_images – The max number of images to display. Defaults to 50. For brevity, I’ve set it to 5.

📝 NOTE: More info on other parameters here.

You’d see something like the following 👇

Fastdup Tool - Similarity Report

    Distance  From                                                    To
0   1.0       scene_classification/data/train_set/buildings/9769.jpg  scene_classification/data/train_set/street/7293.jpg
1   1.0       scene_classification/data/train_set/sea/19255.jpg       scene_classification/data/train_set/sea/7872.jpg
2   1.0       scene_classification/data/train_set/street/3492.jpg     scene_classification/data/train_set/street/373.jpg
3   1.0       scene_classification/data/train_set/glacier/16044.jpg   scene_classification/data/train_set/mountain/17775.jpg
4   1.0       scene_classification/data/train_set/glacier/16827.jpg   scene_classification/data/train_set/mountain/15770.jpg

info

We can already spot a few issues in the train_set:

  • On row 1, note that 19255.jpg and 7872.jpg are duplicates of the same class. We know this by the Distance value of 1.0. You can also see that they are exactly the same side-by-side. The same with row 2.

  • On row 0, images 9769.jpg and 7293.jpg are exact copies but they exist in both the buildings and street folders! The same can be seen on row 3 and row 4. These are duplicate images but labeled as different classes and will end up confusing your model!

For brevity, I’ve only shown 5 rows. If you run the code with a larger num_images, you’ll find more!
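Scanning the gallery by eye works for a handful of rows, but you can also split the duplicate pairs into "same label" and "different label" groups in code. The sketch below is my own helper code, not part of fastdup. It assumes similarity.csv has from, to, and distance columns (matching the report columns above) and that the label is the name of the folder containing each image.

```python
import csv
from pathlib import Path


def label_of(path):
    """Class label = name of the folder that directly contains the image."""
    return Path(path).parent.name


def split_duplicates(rows, threshold=1.0):
    """Split duplicate pairs into same-label and cross-label lists.

    rows: iterable of dicts with 'from', 'to' and 'distance' keys
    (assumed column names of similarity.csv).
    """
    same_label, cross_label = [], []
    for row in rows:
        if float(row["distance"]) < threshold:
            continue  # not similar enough to count as a duplicate
        pair = (row["from"], row["to"])
        if label_of(row["from"]) == label_of(row["to"]):
            same_label.append(pair)
        else:
            cross_label.append(pair)
    return same_label, cross_label


def read_similarity(csv_path):
    """Load similarity.csv into a list of dicts, one per row."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))
```

With rows = read_similarity("scene_classification/report/train/similarity.csv"), the cross-label list from split_duplicates(rows) is the one to review first, since those pairs feed the model contradictory labels.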

Duplicate images do not provide value to your model. They take up hard drive space and increase your training time. Eliminating them can improve your model’s performance and reduces cloud billing costs for training and storage.

Plus, you save valuable time (and sleepless nights 🤷‍♂️) to train and troubleshoot your models down the pipeline.

You can choose to remove the images by hand (e.g., going through them one by one and hitting the delete key on your keyboard). There are cases where you might want to do so. But fastdup also provides a convenient method to remove them programmatically.

warning

The following code will delete all duplicate images from your folder. I recommend setting dry_run=True to see which files will be deleted.

📝 NOTE: Check out the fastdup documentation to learn more about the parameters you can tweak.

top_components = fastdup.find_top_components(work_dir="scene_classification_clean/report/")
fastdup.delete_components(top_components, dry_run=False)

In fastdup, a component is a cluster of similar images. The snippet above removes duplicate copies of the same image (from the top components), so only one copy of each image remains in your dataset.
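To make the idea of a component more concrete, here is a rough sketch of how similar pairs can be grouped into clusters with a union-find structure, and how you could pick one image to keep per cluster. This is only an illustration of the concept under my own assumptions. It is not fastdup's implementation, and both function names are hypothetical.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group items into clusters: two items share a cluster if a chain of pairs links them."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for item in parent:
        clusters[find(item)].append(item)
    return [sorted(members) for members in clusters.values()]


def files_to_drop(components):
    """Keep the first file of each cluster and return the rest (nothing is deleted here)."""
    return [f for members in components for f in members[1:]]
```

Because files_to_drop only returns a list, it behaves like a dry run: you can inspect the candidates before deleting anything.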

That’s how easy it is to find duplicate images and remove them from your dataset! Let’s see if we can find more issues.

🦄 Anomalies

Similar to duplicates, it’s easy to visualize anomalies in your dataset:

fastdup.create_outliers_gallery(outliers_file='scene_classification/report/train/outliers.csv',
                                save_path='scene_classification/report/train/', 
                                num_images=5)
HTML('scene_classification/report/train/outliers.html')

You’d see something like the following 👇

Fastdup Tool - Outliers Report

      Distance  Path
1400  0.565134  scene_classification/data/train_set/glacier/12723.jpg
1398  0.569811  scene_classification/data/train_set/sea/12479.jpg
1394  0.576101  scene_classification/data/train_set/sea/18268.jpg
1392  0.581651  scene_classification/data/train_set/forest/5610.jpg
1387  0.586302  scene_classification/data/train_set/glacier/3510.jpg

info

What do we find here?

  • Image 12723.jpg in the top row is labeled as glacier, but it doesn’t look like one to me.
  • Image 5610.jpg doesn’t look like a forest.

📝 NOTE: Run the code snippet and increase the num_images parameter to see more anomalies. Also, repeat this with valid_set and see if there are more.

The other images above don’t look too convincing to me either. I’ll leave it to you to judge whether the rest belong to the classes they are labeled with. Now let’s see how we can programmatically remove them.

warning

The following code will delete all outliers from your folder. I recommend setting dry_run=True to see which files will be deleted.

📝 NOTE: Check out the fastdup documentation to learn more about the function parameters.

fastdup.delete_or_retag_stats_outliers(stats_file="scene_classification_clean/report/outliers.csv", 
                                       metric='distance', filename_col='from', 
                                       lower_threshold=0.6, dry_run=False)

The above command removes all images with a distance value of 0.6 or below.

The value you pick for lower_threshold will depend on the dataset. In this example, I noticed that as the distance goes higher than 0.6, the images look less like outliers.

This isn’t a foolproof solution, but it should remove the bulk of anomalies present in your dataset.
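If you want to see exactly which files a given threshold would catch before deleting anything, you can apply the same rule to the CSV rows yourself. The sketch below is my own helper, not a fastdup function. It assumes each row of outliers.csv has from and distance fields, which is what the filename_col and metric arguments above point to.

```python
def select_outliers(rows, lower_threshold=0.6):
    """Return the unique 'from' paths whose distance is at or below lower_threshold."""
    selected = []
    seen = set()
    for row in rows:
        path = row["from"]
        if float(row["distance"]) <= lower_threshold and path not in seen:
            seen.add(path)  # an image can appear in several rows; list it once
            selected.append(path)
    return selected
```

Trying a few thresholds and eyeballing the resulting lists is a cheap way to settle on a value for your own dataset.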

💆 Wrong or Confusing Labels

One of my favorite capabilities of fastdup is finding wrong or confusing labels. Similar to previous sections, we can simply run:

df = fastdup.create_similarity_gallery(similarity_file="scene_classification/report/train/similarity.csv", 
                                  save_path="scene_classification/report/train/", 
                                  get_label_func=lambda x: x.split('/')[-2], 
                                  num_images=5, max_width=180, slice='label_score', 
                                  descending=False)
HTML('./scene_classification/report/train/topk_similarity.html')

note

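If the dataset is labeled, you can tell fastdup how to derive the label of each image by passing a function as get_label_func. Here, the label is the name of the folder that contains the image.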

📝 NOTE: Check out the fastdup documentation for the parameter descriptions.
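To see what the label function does on its own, here is the same lambda written as a named function, plus a variant that also copes with Windows-style backslashes. The second function is my own addition for illustration, not something fastdup requires.

```python
from pathlib import PurePosixPath


def get_label(path):
    """Same logic as the lambda above: the second-to-last part of a '/'-separated path."""
    return path.split('/')[-2]


def get_label_portable(path):
    """Variant that first normalizes backslashes, so Windows-style paths also work."""
    return PurePosixPath(path.replace("\\", "/")).parent.name
```

For a path such as scene_classification/data/train_set/glacier/10838.jpg, both functions return glacier.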

You’d see something like 👇

Fastdup Tool - Similarity Image Report, label_score

      score  info_from                        info_to (distance, to, label)
628   0.0    label: forest, from: 3279.jpg    1  0.926998  6088.jpg   mountain
                                              0  0.914204  15835.jpg  mountain
1007  0.0    label: glacier, from: 10838.jpg  1  0.914854  2625.jpg   mountain
                                              0  0.901347  3446.jpg   mountain
1016  0.0    label: glacier, from: 10994.jpg  1  0.937605  3437.jpg   mountain
                                              0  0.926274  7922.jpg   mountain
1048  0.0    label: glacier, from: 11469.jpg  1  0.936914  12696.jpg  mountain
                                              0  0.932199  1151.jpg   mountain
1052  0.0    label: glacier, from: 11514.jpg  1  0.925416  7922.jpg   mountain
                                              0  0.914090  14703.jpg  mountain
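One way to read the score column in the report above: every image listed has a score of 0.0, and none of its nearest neighbors shares its label. The sketch below computes that kind of agreement as a plain fraction. Treat it as my interpretation only. The exact formula and scale fastdup uses for label_score are not covered in this post, and label_agreement is a hypothetical helper.

```python
def label_agreement(source_label, neighbor_labels):
    """Fraction of neighbors (0.0 to 1.0) whose label equals source_label.

    Returns None when there are no neighbors to compare against.
    """
    if not neighbor_labels:
        return None
    matches = sum(1 for label in neighbor_labels if label == source_label)
    return matches / len(neighbor_labels)
```

For the first row of the report, a forest image whose two neighbors are both labeled mountain, label_agreement("forest", ["mountain", "mountain"]) gives 0.0. Images like that are the ones worth a second look.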