A Neural Image Caption Generator - Part 1
I’ll be working through this seminal paper, which explores how to put together CNNs and RNNs (specifically LSTMs) to get a nice image caption generator. First, I will work through the concepts in the paper, followed by an implementation (if resources allow). I have bookmarked a reference implementation at nikhilmaram/Show_and_Tell.
Context: this is a paper from 2015, and all the authors are from Google.
The problem of image captioning combines both vision and language understanding. The standard mechanism to assess quality on this class of problems is BLEU-1 (Bilingual Evaluation Understudy, at the unigram level), where the goal is to get the machine to produce descriptions as close to human-written ones as possible. Before the paper, the state-of-the-art performance was a BLEU-1 score of 25, whereas the paper yields 59 - a big jump, considering human performance is at 69. The authors apply the technique to the Flickr30k, SBU, and COCO datasets.
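As a quick aside, BLEU-1 is essentially clipped unigram precision with a brevity penalty. Below is a minimal single-reference sketch in Python (my own illustration; the paper’s numbers come from the standard multi-reference BLEU tooling):

```python
from collections import Counter
import math

def bleu_1(candidate, reference):
    """Clipped unigram precision times a brevity penalty (single-reference BLEU-1)."""
    if not candidate:
        return 0.0
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    # A candidate word only gets credit up to the number of times it appears in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return brevity * precision

candidate = "a dog runs across the grass".split()
reference = "a dog is running across the green grass".split()
print(round(bleu_1(candidate, reference) * 100))  # ~60, on the same 0-100 scale as the scores above
```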
I will probably try to apply the techniques to Flickr8k, but we’ll look into this in detail once we actually get there.
Image captioning is a comparatively harder task than, say, plain image classification, because it involves both object recognition and relevant linguistic expression.
The innovation in the model is “integration”. Older approaches were about stitching together separate subsystems, whereas the present paper produces a single integrated model that does both jobs. In simple terms, the goal is to take in an image and return a sequence of words from a dictionary that describes the image with the highest likelihood of being a good match:
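Concretely, the paper frames this as directly maximizing the probability of the correct caption $S$ given the image $I$ over the training set; the notation below follows the paper’s formulation:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,\,S)} \log p(S \mid I; \theta)$$

and since a caption is a word sequence $S_0, \ldots, S_N$ of unbounded length, the chain rule unrolls the log probability one word at a time:

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})$$

Each conditional $p(S_t \mid I, S_0, \ldots, S_{t-1})$ is exactly what the RNN part is asked to model.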
The paper pinpoints the original insight: in language translation, the older assumption was that you master the details first, and then, based on that mastery, you produce an integrated solution by putting the pieces together. But RNNs proved that you can sort of “skip the details” and just train an encoder → embedding → decoder sequence end to end to get a translation:
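Something like the following minimal PyTorch sketch (my own illustration of the encoder → embedding → decoder idea, not anything from the paper): the encoder LSTM compresses the source sentence into a fixed-size state, and the decoder LSTM unrolls the target sentence from that state. All names and sizes are arbitrary.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMBED, HIDDEN = 1000, 1000, 256, 512  # toy sizes for illustration

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, EMBED)
        self.encoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, EMBED)
        self.decoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.to_vocab = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_embed(src_ids))       # whole source sentence -> one (h, c) state
        out, _ = self.decoder(self.tgt_embed(tgt_ids), state)  # unroll the target from that state
        return self.to_vocab(out)                              # per-step scores over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, SRC_VOCAB, (2, 7))  # a fake batch of 2 source sentences, 7 tokens each
tgt = torch.randint(0, TGT_VOCAB, (2, 9))  # and their 9-token targets
print(model(src, tgt).shape)               # torch.Size([2, 9, 1000])
```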
Since the input here is an image, and CNNs have proven to represent images in a rich way, the encoder part is a CNN trained for image classification. We then take this and plug it into an RNN decoder, which is “skilled at” language, to generate the final image caption. The authors call it NIC - Neural Image Caption. It’s a catchy name!
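To make that concrete, here is a minimal PyTorch sketch of the architecture (my own illustration, not the paper’s code or the reference implementation above). I’m using a pretrained ResNet-50 as a stand-in for the CNN encoder; its feature vector is projected into the word-embedding space and fed to the LSTM as if it were the first word, and every LSTM output is mapped back to scores over the vocabulary. All sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torchvision import models

VOCAB, EMBED, HIDDEN = 10000, 512, 512  # sizes picked arbitrarily for this sketch

class NIC(nn.Module):
    def __init__(self):
        super().__init__()
        cnn = models.resnet50(weights="IMAGENET1K_V1")         # any pretrained classifier can be the encoder
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # drop the classification head
        self.img_proj = nn.Linear(2048, EMBED)   # project image features into the word-embedding space
        self.word_embed = nn.Embedding(VOCAB, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.to_vocab = nn.Linear(HIDDEN, VOCAB)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)            # (B, 2048) image features
        img_token = self.img_proj(feats).unsqueeze(1)  # the image acts as the "zeroth word"
        words = self.word_embed(captions)              # (B, T, EMBED)
        seq = torch.cat([img_token, words], dim=1)     # prepend the image embedding to the caption
        out, _ = self.lstm(seq)                        # (B, T + 1, HIDDEN)
        return self.to_vocab(out)                      # per-step scores over the vocabulary

model = NIC()
images = torch.randn(2, 3, 224, 224)         # a fake batch of 2 images
captions = torch.randint(0, VOCAB, (2, 12))  # and 2 captions of 12 word indices each
print(model(images, captions).shape)         # torch.Size([2, 13, 10000])
```

At inference time there are no ground-truth words to feed in, so the caption is generated one word at a time - greedily or with beam search, which the paper prefers - by feeding each predicted word back in as the next input.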
Once these components are plugged in, each with some “pre-training”, the authors do further “end-to-end training with gradient descent” - that is, they repurpose these pieces to work together so that they produce image captions competently. Reading this, it almost sounds like organizing people to get some large goal done: take expert 1, take expert 2, put them together, set a goal, give feedback, refine, refine, keep an eye on the goal, and produce the result.
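Here is a sketch of what one such end-to-end training step could look like, reusing the NIC class, model, and imports from the sketch above (the data loader is hypothetical, and I’m assuming padding index 0 and plain SGD, which is roughly the spirit of the paper’s fixed-learning-rate gradient descent):

```python
criterion = nn.CrossEntropyLoss(ignore_index=0)            # assumes index 0 is the padding token
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain gradient descent over the whole pipeline

for images, captions in loader:               # hypothetical loader yielding (B, 3, 224, 224) and (B, T) batches
    logits = model(images, captions[:, :-1])  # the image slot predicts word 0, word t predicts word t + 1
    loss = criterion(logits.reshape(-1, VOCAB), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```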