A Fun Application of Deep Learning: Telling a Story from a Single Image
We are starting a series of articles explaining some of the more popular deep learning models and their architectures. We'll start with one called Neural Storyteller, originally written as a research paper by Jamie Ryan Kiros. We've taken it and modified it to be significantly more powerful in a few key ways.
If you haven't seen it already, Neural Storyteller is an artificial intelligence model that, when given an image, generates a story based on the features it finds in that image. Here are two sample stories the model produced:
I had a train over the bridge and at the same time , it hurt so much . But it did n't take long for me to understand what he was saying . In fact , it was as if he were walking away from her , looking out over the bridge and into the river below . I had no idea how many times I tried to convince her to be my anchor , but that made me feel better . His face was a mask of pure steel , the bridge of his mouth . His entire body shuddered with relief . Sometimes , I d have to follow the bridge that led up on a ship and once more
I looked to the river at the exact same time , as it had begun to fade . In fact , there were so many times that I wondered what was going on in my mind . In the distance , she pulled out of the lake and into the forest . She had no idea what to do with him . In fact , it was more than likely that he could walk away from her . My mother 's body language told me she was beautiful , but most of the time , I drew a deep breath . She likely wanted to hang onto the cliffs and leave it in a hellish lake
It's a fun toy, and yet it lets you imagine the future and see where all of these artificial intelligence models are heading. You can test out a live online version of the model here.
Architecture
Building a neural network model to accomplish a goal increasingly means building larger and more sophisticated pipelines, often mixing and matching several different algorithms together.
Neural Storyteller consists of four main parts:
- Skip-thought vectors
- Image-sentence embeddings
- Style shifting
- Conditional neural language models
Skip-Thought Vectors
Skip-thought vectors are a way to encode text in an unsupervised manner (inferring a function from unlabeled data). The system works by exploiting the continuity of text: for any given sentence, it tries to reconstruct the surrounding sentences. For Neural Storyteller, romance novels are converted into skip-thought vectors.
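If you want to experiment, the reference implementation released alongside the skip-thoughts paper exposes a small encoding API. A minimal sketch, assuming that repository's `load_model`/`Encoder` interface (check its README for the exact names and setup before relying on this):

```python
import skipthoughts

# Load the pretrained skip-thought model (paths to the trained
# weights are configured inside the library).
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)

# Each sentence is mapped to a single fixed-length vector; sentences
# with similar meanings end up close together in this vector space.
passages = [
    "She walked along the river at dusk.",
    "He never thought he would see her again.",
]
vectors = encoder.encode(passages)  # one row per sentence
```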
Image-Sentence Embeddings
Another, separate model, a visual-semantic embedding model, is built so that, given an image, it can produce a sentence describing that image. The dataset used to train it is called MSCOCO. Many models already do this, such as NeuralTalk.
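To make the idea concrete, here is a hedged numpy sketch of ranking candidate captions against an image in a joint embedding space. The projection matrices `W_img` and `W_txt`, the function names, and the dimensions are hypothetical stand-ins for the learned model, not the actual implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_captions(image_feat, caption_vecs, W_img, W_txt, k=5):
    """Project an image and candidate captions into a shared space,
    then return the indices of the k best-matching captions."""
    img_emb = W_img @ image_feat
    scores = np.array([cosine(img_emb, W_txt @ c) for c in caption_vecs])
    return np.argsort(scores)[::-1][:k]

# Toy usage with random stand-ins for the learned weights and features:
rng = np.random.default_rng(0)
W_img = rng.normal(size=(300, 4096))   # image features -> joint space
W_txt = rng.normal(size=(300, 4800))   # sentence vectors -> joint space
top = rank_captions(rng.normal(size=4096),
                    rng.normal(size=(100, 4800)), W_img, W_txt)
```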
With these two models in place, they can be connected to get the result we are looking for. Another program is written that is essentially this function:
F(x) = x - c + b
x represents the image caption (in practice, the mean of the skip-thought vectors of the top captions retrieved for the image), c represents the "caption style", and b represents the "book style". In other words: keep the "thought" of the caption, but replace the image-caption style with that of a story.
c, the caption style, is the mean of the skip-thought vectors of the MSCOCO training captions.
b is the mean of the skip-thought vectors for romance novel passages of length > 100.
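Numerically, this is plain vector arithmetic in the skip-thought space. A minimal numpy sketch, where the three inputs are hypothetical arrays of skip-thought vectors (one row per caption or passage):

```python
import numpy as np

def shift_style(retrieved_caption_vecs, coco_caption_vecs, romance_vecs):
    """Apply F(x) = x - c + b in skip-thought space."""
    x = retrieved_caption_vecs.mean(axis=0)  # "thought" of this image's captions
    c = coco_caption_vecs.mean(axis=0)       # caption style to subtract
    b = romance_vecs.mean(axis=0)            # book style to add
    return x - c + b
```

The shifted vector is then handed to the fourth component from the list above, a conditional neural language model trained on the romance passages, which decodes it back into story text.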
Style Shifting
The above function is the 'style-shifting' operation that allows the model to translate standard image captions into the style of stories from novels. Style shifting was inspired by "A Neural Algorithm of Artistic Style", though the technical details are completely different. You can play with an example neural style model here.
Data
There are two main sources of data used in this model.
MSCOCO is a dataset from Microsoft containing around 300,000 images, each paired with 5 captions. MSCOCO is the only supervised data being used, meaning it is the only data where humans had to explicitly write out captions for each image.
The other source of data is called BookCorpus. It contains around 11,000 books from various genres. The model was trained on a subset of BookCorpus, specifically 11 million passages from romance novels, but BookCorpus also contains adventure, sci-fi, and other genres.
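As a rough illustration of the preprocessing involved, here is a hedged sketch that keeps only long passages from a hypothetical file of romance-novel text (one passage per line; the file name, format, and word-based length measure are assumptions, not the project's actual pipeline):

```python
# Hypothetical preprocessing: keep only long romance passages,
# mirroring the "length > 100" filter used for the book-style vector b.
def long_passages(path, min_words=100):
    with open(path, encoding="utf-8") as f:
        for line in f:
            passage = line.strip()
            if len(passage.split()) > min_words:
                yield passage

romance = list(long_passages("romance_novels.txt"))
```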
Future Revisions
You've seen how the technology works. There are many future versions of this that could be built, involving both simple modifications and larger architectural changes. Just by changing the data used, you can come up with new and interesting outputs.
Imagine taking this same algorithm and mapping it to legal documents. You could get the model to output whether or not what is occurring in the image is legal.
What if you mapped it to Wikipedia? Imagine modifying the model so that it could output educational information about an image. It could serve as an educational tool: for each image you give it, the model could tell you different facts about the image's core features.
How about religious versions, where instead of feeding in romance novels, you feed in Christian texts? You would get a model that could tell you what the teachings of Christianity say about what is happening in an image. It could be used as a guide for people learning about a new religion. You could build these kinds of models for any major religion that has enough text.
We are working with several clients now that are applying these algorithms in new and interesting ways.
As you can see, neural-storyteller is an innovative architecture, and many new derivative models and works will be created based on it. Try out the model to see what is possible. Please contact us if you would like to collaborate on a project.
This is the first in a series of articles where we will review popular deep learning algorithms and explain what they do and how they work.