Let's suppose we have a forklift workshop, and we want an object recognition model that can recognize the technicians and the forklifts in it. We could spend hours taking different photos, or searching for them on the internet, and then labeling them afterwards. The problem with this approach is that it is costly, and we need a lot of photos to achieve good results. So, what we will do in this post is train a model on a few real-world images augmented with KIADAM synthetic images. The results are available in the following Colab Notebook.
Dataset preparation
To prepare the datasets, we will first take some photos of our real-world forklift workshop. This is a sample image from that dataset:
These real-world images were taken specifically to build this dataset, and there are about 30 of them. To also test the recognition of people, we added about 500 person images from the COCO dataset. This dataset will give us an idea of how well our model learns to recognize forklifts and people.
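The post does not say how the person images were collected, so as a hypothetical illustration, here is one convenient way to pull a similar subset with the FiftyOne dataset zoo. The tool choice, sample count, and paths below are all assumptions, not the method actually used:

```python
# Sketch: one way to pull ~500 COCO images that contain people.
# FiftyOne and every path here are assumptions for illustration;
# the post does not describe how its person images were gathered.
import fiftyone as fo
import fiftyone.zoo as foz

people = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    label_types=["detections"],
    classes=["person"],   # keep only samples containing at least one person
    max_samples=500,
)

# Export in YOLO format so it can be merged with the forklift photos
people.export(
    export_dir="datasets/coco_person",
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="ground_truth",
)
```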
Testing dataset
For the testing dataset, we used about 50 real images of people and about 20 real images with forklifts, to check that the model was training well. This testing dataset is used to evaluate the models trained on both the base and the augmented datasets, and it contains only images that the model has not seen during training. An example of a forklift image used in testing is the following:
Dataset with added generated images
The second dataset we will use to test our hypothesis is a copy of the first one with generated images added. We took about 12 pictures containing roughly 90 forklifts in total, cropped them to obtain individual forklift images, and used those crops as the objects for generating the dataset.
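The cropping step can be done directly from the bounding-box labels. A minimal sketch, assuming YOLO-format label files (`class cx cy w h`, normalized to [0, 1]); the paths and the forklift class id are placeholders:

```python
# Sketch: crop individual forklift instances out of labeled photos.
# Assumes YOLO-format .txt labels; the class id is a placeholder.
from pathlib import Path
from PIL import Image

FORKLIFT_CLASS = 1  # assumed class id

def crop_objects(image_path: Path, label_path: Path, out_dir: Path) -> None:
    img = Image.open(image_path)
    W, H = img.size
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, line in enumerate(label_path.read_text().splitlines()):
        cls, cx, cy, w, h = line.split()
        if int(cls) != FORKLIFT_CLASS:
            continue
        cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
        # Convert normalized center/size to pixel corner coordinates
        left, top = (cx - w / 2) * W, (cy - h / 2) * H
        right, bottom = (cx + w / 2) * W, (cy + h / 2) * H
        crop = img.crop((int(left), int(top), int(right), int(bottom)))
        crop.save(out_dir / f"{image_path.stem}_forklift_{i}.png")
```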
With this setup and some augmentations, we generated 500 images, which we added to the previous dataset to create the new one. The transformations used to create the images are the following (a sketch of the compositing step follows the list):
Labeled objects:
- Skew: 100% of affected images, ranging from -10 to 10 degrees
- Rotate: 100% of affected images, ranging from -40 to 40 degrees
- Cutout: 20% of affected images, with each cutout covering 10% to 50% of the object, 1 to 3 cutouts per object, filled with black or Gaussian noise
- Color Distortion: 50% of affected images, ranging from 1 to 11 units
- Blur: 10% of affected images, with strength ranging from 3 to 5
- Noise: 100% of affected images

Background:
- Color Distortion: 50% of affected images, ranging from 10 to 20 degrees
- Gamma Correction: 50% of affected images, with gamma values ranging from 0.5 to 2

Composite Images:
- Blur: 17% of affected images, with strength ranging from 3 to 5
- Noise: 20% of affected images
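KIADAM's internal pipeline is not shown in this post, but the core compositing idea can be illustrated with a minimal sketch: paste a cropped forklift onto a workshop background at a random rotation and position, record the resulting bounding box, and occasionally blur the composite. The paths, the blur radius, and the assumption that crops were saved with transparency are all illustrative, not KIADAM's actual implementation:

```python
# Sketch: composite a cropped forklift onto a background and record
# its bounding box. Ranges mirror the list above; this is an
# illustration only, not KIADAM's actual pipeline.
import random
from pathlib import Path
from PIL import Image, ImageFilter

def make_composite(crop_path: Path, background_path: Path):
    bg = Image.open(background_path).convert("RGB")
    obj = Image.open(crop_path).convert("RGBA")  # alpha assumed present
    # Rotate: -40 to 40 degrees; expand=True keeps the whole object visible
    obj = obj.rotate(random.uniform(-40, 40), expand=True)
    # Random placement, assuming the crop is smaller than the background
    x = random.randint(0, bg.width - obj.width)
    y = random.randint(0, bg.height - obj.height)
    bg.paste(obj, (x, y), obj)  # alpha channel acts as the paste mask
    # Composite-level blur on a fraction of images (17% above)
    if random.random() < 0.17:
        bg = bg.filter(ImageFilter.GaussianBlur(radius=random.uniform(1, 2)))
    bbox = (x, y, x + obj.width, y + obj.height)  # label for the pasted object
    return bg, bbox
```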
An example of a generated image is the following:
After training a model on our new generated dataset and evaluating it on the previously described testing dataset, these are the results:
| Model | YOLO11s |
| --- | --- |
| Precision | 0.631 |
| Recall | 0.52 |
| mAP@50 | 0.553 |
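For reference, the training and evaluation step can be reproduced with a short Ultralytics sketch. The dataset config, class names, and hyperparameters below are assumptions; the exact settings are in the Colab Notebook:

```python
# Sketch: train YOLO11s on the augmented dataset and evaluate on the
# held-out test images. Paths, class names, and hyperparameters are
# assumptions, not the exact settings used in the notebook.
from pathlib import Path
from ultralytics import YOLO

Path("data.yaml").write_text("""\
path: datasets/forklift_workshop  # hypothetical root
train: images/train               # base images + 500 generated composites
val: images/test                  # ~20 forklift + ~50 person held-out images
names:
  0: person
  1: forklift
""")

model = YOLO("yolo11s.pt")  # pretrained YOLO11s weights
model.train(data="data.yaml", epochs=100, imgsz=640)

metrics = model.val(data="data.yaml")
print(f"precision: {metrics.box.mp:.3f}")   # mean precision over classes
print(f"recall:    {metrics.box.mr:.3f}")   # mean recall over classes
print(f"mAP@50:    {metrics.box.map50:.3f}")
```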
Conclusion
We started with a real-world dataset containing a few forklift images and some images of people, and we augmented it by adding synthetic forklift images generated with our KIADAM tool. Using this method, we obtained good results on the testing dataset, showing that adding generated images is an effective way to train an object recognition model with little real data.