Python Predicts PUBG Mobile

July 19, 2020

A simple approach to predicting future frames in video (PUBG) data using Python

My PUBG character 😀

Introduction:

It is impossible to predict the future! (Unless you have a Time Stone :-)) But predicting the immediate future is not very hard for us humans. We do it in real life quite often: while playing a game or watching a movie, one can easily predict the immediate next move. This capability of our brain helps us plan actions and make decisions in advance (for example, catching a ball in the air, or dodging a stone coming at your face, etc.).

We cannot dodge a bullet, though, as we are not fast enough. What if we could give this capability of our brain to the dumb machines and build an Ultron? :-)

The question is: can machines really do it?

The answer is “yes”, and this little experiment proves it. It is also an active area of research: given a short video, can you generate the future video? Here is a glimpse of the data generated by this experiment:

The following video is completely artificial and was generated by a deep-learning model.

Artificially Generated Video

How can I do it on my PC?

You can do it on your own PC in just a few hours, and a GPU is not required.

If you are new to Computer Vision and Deep Learning, I would suggest getting a basic understanding of the following topics before reading further:

  1. OpenCV
  2. Convolutional Neural Networks
  3. Autoencoders

Now let’s break the whole thing into five small parts and go through them one by one:

  1. Data Generation
  2. Data Preparation
  3. Model
  4. Results
  5. Conclusion

Data Generation:

Generating the data was super easy. With the help of the screen recorder on my mobile device, I captured a 15-minute-long video (which proved to be enough data for our little experiment). In this video, I put my PUBG character in sprint mode and let it run continuously for ~15 minutes in random directions.

Once I had this video, all I needed to do was cut it into multiple frames at regular time intervals. This video-to-frames conversion was easy and quick using the OpenCV toolkit. The frames were about ~25 ms apart, which makes it 40 fps (frames per second). The difference between two consecutive frames was small but fairly visible.

Here is a simple Python script that converts a video to frames. (*The frame rate may vary on different machines, as it depends on the processing speed of your system.)
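
A minimal sketch of such a conversion with OpenCV (file and directory names are illustrative):

import cv2
import os

def video_to_frames(video_path, out_dir):
    """Read a video and save every frame as a JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()  # ok becomes False once the video ends
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, "frame_%05d.jpg" % count), frame)
        count += 1
    cap.release()
    return count

# Example usage:
# video_to_frames("pubg_run.mp4", "frames")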

Video to Frame conversion

Data Preparation:

The previous exercise gives us around 30k frames. Of these, the first 29k frames are kept as training data and the remaining 1k frames as validation data.

These frames were really big: 800 × 1200. To make things simple and fast for the model, each frame is resized to 240 × 160 before being passed to the model for training.

Here is a simple Python function for data preprocessing.
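
A minimal sketch of such a preprocessing function; the crop box below is illustrative and should be adjusted to your own recording:

import cv2

def preprocess_frame(frame, size=(240, 160)):
    """Crop a frame to the relevant region and shrink it for training."""
    cropped = frame[0:800, 0:1200]           # keep only the relevant part (illustrative box)
    resized = cv2.resize(cropped, size)      # size is (width, height) in OpenCV
    return resized.astype("float32") / 255   # normalize pixels to [0, 1]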

Preprocessing function

This is what a processed image looks like after:

  1. Cropping the image to keep only the relevant part of the image.
  2. Resizing the image to a smaller size (240 × 160), as the original image was really big. (While resizing, the aspect ratio is kept the same.)

Frames after cropping and resizing
Image preprocessing

Model:

  1. Training Data:

Preprocessed frames from the previous exercise are arranged in pairs such that each pair (frame_x, frame_y) contains two consecutive frames: if frame_x appears at the n-th position in the video, then frame_y comes from the (n+1)-th position.

In this manner, the 29k frames we kept as training data can make 29k - 1 such training pairs. Similarly, the remaining 1k validation frames can make 1k - 1 validation pairs.
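
A minimal sketch of this pairing step, assuming the preprocessed frames are already stacked in order in a NumPy array (names are illustrative):

import numpy as np

def make_pairs(frames):
    """Turn an ordered frame array of shape (n, h, w, c) into (input, target) pairs."""
    return frames[:-1], frames[1:]  # pair i is (frame_i, frame_i+1), so n frames give n - 1 pairs

# train_x, train_y = make_pairs(train_frames)
# val_x, val_y = make_pairs(val_frames)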

2. Model Architecture:

Encoder-Decoder Architecture

A simple encoder-decoder model architecture is used in this experiment. The encoder part of the model is a three-layered 2D CNN (convolutional neural network) with MaxPooling layers, and the decoder part is a 2D CNN with UpSampling layers (a transposed-convolution-style decoder). The kernel sizes of the convolution layers are chosen in such a way that we get a same-sized image after decoding.

Here is the architecture of the model, implemented using the Keras API from TensorFlow.

Model Architecture
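
A minimal sketch of such an encoder-decoder in Keras; the filter counts and kernel sizes below are illustrative, not necessarily the exact ones from the original experiment:

from tensorflow.keras import layers, models

def build_model(input_shape=(160, 240, 3)):
    """Three-block Conv2D/MaxPooling encoder mirrored by an UpSampling decoder."""
    inp = layers.Input(shape=input_shape)

    # Encoder: halve the spatial size three times
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    x = layers.Conv2D(128, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)

    # Decoder: upsample back to the original 160 x 240 resolution
    x = layers.Conv2D(128, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    out = layers.Conv2D(3, (3, 3), activation="sigmoid", padding="same")(x)

    return models.Model(inp, out)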

3. Model Training:

Now, for each training pair, the model takes frame_x as input and learns to predict frame_y (the next immediate frame).

The gradient descent algorithm (via the Adam optimizer) is used to train the model, with mean squared error as the loss function.

--- Model Parameters ---
batch_size : 32
epochs : 10
optimizer : 'adam'
learning_rate : 0.001
loss : 'mse'

After training for just 10 epochs, the learned weights are saved for inference.
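
A minimal sketch of the training step with these parameters, reusing build_model and the pairing helper from the sketches above (the weights file name is illustrative):

from tensorflow.keras.optimizers import Adam

model = build_model()
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")

# train_x/train_y and val_x/val_y come from the pairing step above
model.fit(
    train_x, train_y,
    validation_data=(val_x, val_y),
    batch_size=32,
    epochs=10,
)

model.save_weights("pubg_next_frame.weights.h5")  # save learned weights for inference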

Training time : approximately 8 hours (on CPU)
Hardware : MacBook Pro, 16GB, 2.6 GHz Intel Core i7

Training and validation loss graph:

Loss Chart

Results:

To check the performance of the model on unseen data, one random frame is picked from the validation set. Using this single frame, we can generate any number of future frames by feeding each predicted frame back into the model as input.
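
A minimal sketch of this feedback loop (function and variable names are illustrative):

import numpy as np

def generate_future_frames(model, seed_frame, n_frames=15):
    """Autoregressively roll the model forward from a single seed frame."""
    generated = []
    current = seed_frame[np.newaxis, ...]  # add a batch dimension
    for _ in range(n_frames):
        current = model.predict(current, verbose=0)  # each prediction becomes the next input
        generated.append(current[0])
    return np.array(generated)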

The model was able to generate the first 10-12 future frames with decent accuracy from a single input frame. The output gets really noisy after the 15th frame, as the prediction error adds up with each new prediction. One important thing to notice is that the static parts of the frames (my PUBG control buttons) stay intact: the model learns what is static and what is changing, and these parts do not get blurry.

Below are two example image sequences generated by the model, with the ground-truth images shown on top.

Image sequences generated by model

Conclusion:

It was quite interesting to see that the model was easily able to generate so many future images using just a single image. It was easy because this experiment was highly controlled: the character in that PUBG video was performing only one activity (always running), so the model only needed to learn his movements and how the background changes with time.

Predicting real-life scenarios is not that easy, and it is an active area of research. It is hard because there are endless possibilities in real-world scenarios: multiple objects can change the environment simultaneously. To model such scenarios we need a better and more powerful model architecture, as well as a large amount of good-quality real-world data.

Thanks to Raghav Bali and Dipanjan (DJ) Sarkar for the review.

Please do share your comments and feedback with me.