A simple approach to predict future frames in video (PUBG) data using Python
Introduction :
It is impossible to predict the future! (Unless you have a Time Stone :-).) But predicting the immediate future is not very hard for us humans. We do it in real life quite often: while playing a game or watching a movie, one can easily predict the next immediate move. This capability of our brain helps us plan actions and take decisions in advance (for example, catching a ball in the air or dodging a stone coming at your face).
We cannot dodge a bullet, though, as we are not fast enough. What if we could give this capability of our brain to machines and build an Ultron? :-)
The question is: can machines really do it?
The answer is "yes", and this little experiment proves it. It is also an active area of research: given a short video, can you generate the video that follows? Here is a glimpse of the data generated by this experiment:
The following video is completely artificial and was generated by a deep-learning model.
How can I do it on my PC ?
You can do it on your own PC in just a few hours, and a GPU is not required.
If you are new to Computer Vision and Deep Learning, I would suggest getting a basic understanding of the following topics before reading further:
Now let’s break the full thing into five small parts. We will go through each part one by one —
- Data Generation
- Data Preparation
- Model
- Results
- Conclusion
Data Generation :
Generating the data was super easy. With the help of the screen recorder on my mobile device, I was able to capture a 15-minute-long video (which proved to be enough data for our little experiment). In this video, I put my PUBG character in sprinting mode and let it run continuously for ~15 minutes in random directions.
Once I had this video, all I needed to do was cut it into multiple frames at regular time intervals. This video-to-frames conversion was really easy and quick using the OpenCV toolkit. The frames were about ~25 ms apart, which makes it 40 fps (frames per second). The difference between two consecutive frames was small but fairly visible.
Here is a simple Python snippet which converts a video to frames. (*The frame rate may vary on different machines, as it depends upon the processing speed of your system.)
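A minimal sketch of such a conversion, using OpenCV's VideoCapture (the file names and output directory below are placeholders):

```python
import os
import cv2

def video_to_frames(video_path, output_dir):
    """Read a video and save every frame as a numbered .jpg image."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:          # no more frames to read
            break
        cv2.imwrite(os.path.join(output_dir, f"frame_{count:05d}.jpg"), frame)
        count += 1
    cap.release()
    return count

# Example usage (placeholder paths):
# video_to_frames("pubg_run.mp4", "frames/")
```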
Data Preparation :
The previous exercise gives us around 30k frames. Out of these, the first 29k frames are kept as training data and the remaining 1k frames as validation data.
These frames were really big (800 * 1200). To make things simple and fast for the model, each frame is resized to 240 * 160 before being passed to the model for training.
Here is a simple Python function for data preprocessing.
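A minimal sketch of this preprocessing step; the exact crop coordinates below are assumptions, not necessarily the ones used in the experiment:

```python
import cv2

def preprocess_frame(frame):
    """Crop a raw frame to the relevant region and resize it to 240 x 160."""
    # Assumed crop region (rows, cols) for an 800 x 1200 frame, keeping a 3:2 aspect ratio.
    cropped = frame[100:700, 150:1050]
    resized = cv2.resize(cropped, (240, 160))   # cv2.resize takes (width, height)
    return resized.astype("float32") / 255.0    # scale pixel values to [0, 1]
```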
This is how a processed image looks after:
1. Cropping the image to keep only the relevant part.
2. Resizing the image to a smaller size (240 * 160), as the original image was really big (the aspect ratio is kept the same while resizing).
Model :
1. Training Data :
Preprocessed frames from the previous exercise are arranged in pairs such that each pair (frame_x, frame_y) contains two consecutive frames: if frame_x appears at the n'th position in the video, then frame_y comes from the (n+1)'th position. In this manner, the 29k frames kept as training data can make 29k - 1 such training pairs. Similarly, the remaining 1k validation frames can make 1k - 1 validation pairs.
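As a minimal sketch, assuming the preprocessed frames are loaded into a single NumPy array in video order, the pairing amounts to shifting the sequence by one:

```python
import numpy as np

def make_pairs(frames):
    """Arrange consecutive frames into (frame_x, frame_y) training pairs.

    frames: array of shape (num_frames, 160, 240, 3), in video order.
    Returns inputs X (frames 0..n-2) and targets Y (frames 1..n-1).
    """
    X = frames[:-1]   # frame at position n
    Y = frames[1:]    # frame at position n + 1
    return X, Y

# e.g. 29k training frames -> 29k - 1 pairs
# X_train, Y_train = make_pairs(train_frames)
```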
2. Model Architecture :
A simple encoder-decoder architecture is used in this experiment. The encoder part of the model is a three-layered 2D CNN (convolutional neural network) with MaxPooling layers, and the decoder part is a 2D CNN with UpSampling layers (or, alternatively, transposed convolutions). Kernel sizes of the convolution layers are chosen so that the decoded image has the same size as the input.
Here is the architecture of the model, implemented using the Keras API from TensorFlow.
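A sketch of a comparable three-block encoder-decoder in Keras; the filter counts and kernel sizes below are assumptions, not necessarily the exact ones used in the experiment:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(160, 240, 3)):
    """Encoder-decoder CNN: 3 Conv2D + MaxPooling blocks, then 3 Conv2D + UpSampling blocks."""
    model = models.Sequential([
        # Encoder
        layers.Conv2D(32, (3, 3), activation="relu", padding="same", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        # Decoder
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.UpSampling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.UpSampling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.UpSampling2D((2, 2)),
        # Output layer: same height/width as the input, 3 colour channels
        layers.Conv2D(3, (3, 3), activation="sigmoid", padding="same"),
    ])
    return model
```

With "same" padding and matching pool/upsample factors, a 160 * 240 input comes back out at 160 * 240, as required.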
3. Model Training :
For each training pair, the model takes frame_x as input and learns to predict frame_y (the next immediate frame). Gradient descent is used to train the model, with mean squared error as the loss function.
--- Model Parameters ---
batch_size : 32
epochs : 10
optimizer : 'adam'
learning_rate : 0.001
loss : 'mse'
After training for just 10 epochs, the learned weights are saved for inference.
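With the parameters listed above, training is a standard Keras compile/fit call. A sketch (the weight file name, and the X_train/Y_train arrays from the earlier pairing sketch, are assumptions):

```python
from tensorflow.keras.optimizers import Adam

model = build_model()
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")

model.fit(
    X_train, Y_train,            # pairs from the earlier sketch
    batch_size=32,
    epochs=10,
    validation_data=(X_val, Y_val),
)

model.save_weights("next_frame_model.h5")   # placeholder file name
```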
Training time : approximately 8 hours (on CPU)
Hardware : MacBook Pro, 16 GB RAM, 2.6 GHz Intel Core i7
Training and Validation loss graph
Results:
To check the performance of the model on unseen data, one random frame is picked from the validation set. Using this single frame, we can generate any number of future frames by feeding each predicted frame back into the model as input.
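A sketch of this feedback loop (the helper name and variable names are illustrative):

```python
import numpy as np

def generate_future_frames(model, seed_frame, num_frames=15):
    """Generate future frames by repeatedly feeding predictions back into the model.

    seed_frame: a single preprocessed frame of shape (160, 240, 3).
    """
    generated = []
    current = seed_frame[np.newaxis, ...]    # add batch dimension -> (1, 160, 240, 3)
    for _ in range(num_frames):
        next_frame = model.predict(current)  # predict the next frame
        generated.append(next_frame[0])
        current = next_frame                 # feed the prediction back in
    return np.array(generated)
```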
The model was able to generate the first 10–12 future frames with decent accuracy from a single input frame. The output gets really noisy after the 15th frame, as the prediction error adds up with each new prediction. One important thing to notice is that the static parts of the frames (my PUBG control buttons) remain intact; the model learns what is static and what is changing, and these parts don't get blurry.
Below are two example image sequences generated by the model, along with the ground-truth images on top.
Conclusion :
It was quite interesting to see that the model was easily able to generate so many future images using just a single image. It was easy because this experiment was highly controlled: the character in the PUBG video was performing only one activity (always running), so the model only needed to learn his movements and how the background changes with time.
Predicting real-life scenarios is not that easy, and it is an active area of research. It is hard because there are endless possibilities in real-world scenes, and multiple objects can change the environment simultaneously. To model such scenarios, we need a better and more powerful model architecture, as well as a large amount of good-quality real-world data.
Thanks to Raghav Bali and Dipanjan (DJ) Sarkar for the review.
Please do share your comments/feedback with me.