A chatbot to talk about the real stuff [Ep. 2]

I wanted to complete the previous post, where I reproduced in TensorFlow one of the generative chatbots proposed in Wizard of Wikipedia: Knowledge-Powered Conversational Agents. Funnily enough, they linked my project in their official repository (check the Note: an unofficial …), which could not make me prouder! To be honest, I asked if they could, but they did! I also mentioned in the previous post that I didn’t know whether they optimized on perplexity (PPL) or masked perplexity. They confirmed by email that they optimized on PPL, which is good news for me: my results are better when I compare PPL with PPL, but not when I compare my masked PPL with their PPL.

There are three things I wanted to show:

  1. my results on the complete table 4 of the paper
  2. my results when the network doesn’t use the knowledge base at all, which are remarkably good
  3. a few samples of a dialogue
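
| Model | Predicted Knowledge: Seen PPL | Predicted Knowledge: Seen F1 | Predicted Knowledge: Unseen PPL | Predicted Knowledge: Unseen F1 | Gold Knowledge: Seen PPL | Gold Knowledge: Seen F1 | Gold Knowledge: Unseen PPL | Gold Knowledge: Unseen F1 |
|---|---|---|---|---|---|---|---|---|
| E2E (original) | 63.5 | 16.9 | 97.3 | 14.4 | 23.1 | 35.5 | 32.8 | 32.2 |
| E2E (ours) | 12.4 | 30.6 | 18.7 | 25.6 | 11.7 | 31.6 | 17.5 | 26.3 |
| E2E-nk (ours) | - | - | - | - | 13.0 | - | - | - |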

PPL stands for perplexity, and the lower the better, while for F1, the higher the better. Our naive reimplementation happens to outperform the original on PPL in all the data splits used in the original work, while F1 seems, in our experience, to be more sensitive to the seed chosen for training. Very interestingly, when the knowledge to use is predicted by the network (Predicted Knowledge) rather than provided by the task (Gold Knowledge), our implementation closes most of the gap between the two settings: on the seen split our PPL goes from 11.7 with gold knowledge to 12.4 with predicted knowledge, while the original goes from 23.1 to 63.5. I think the major difference with the original lies in our knowledge dropout, so this suggests that it allows the network to better predict which piece of information is necessary to reply appropriately to the querier.
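To make the difference between PPL and masked PPL concrete, here is a minimal sketch in plain Python. It assumes that "masked" means leaving padding positions out of the average, which is my reading and not something the original authors confirmed. The function name `perplexity` and the toy numbers are made up for illustration; this is not the code used to produce the table.

```python
import math

def perplexity(token_nlls, mask=None):
    """Exponential of the mean per-token negative log-likelihood (natural log).

    token_nlls: one negative log-likelihood per target position.
    mask: optional 0/1 flags; positions flagged 0 (assumed here to be
          padding) are left out of the average.
    """
    if mask is None:
        mask = [1] * len(token_nlls)
    if len(mask) != len(token_nlls):
        raise ValueError("mask and token_nlls must have the same length")
    kept = [nll for nll, m in zip(token_nlls, mask) if m]
    if not kept:
        raise ValueError("no tokens left after masking")
    return math.exp(sum(kept) / len(kept))

# Toy values: three real tokens followed by two padding positions that the
# model predicts almost perfectly (NLL close to zero).
nlls = [2.0, 3.0, 1.0, 0.0, 0.0]
pad_mask = [1, 1, 1, 0, 0]

ppl_all = perplexity(nlls)               # average over all five positions
ppl_masked = perplexity(nlls, pad_mask)  # average over the three real tokens
print(ppl_all, ppl_masked)
```

Under that assumption, easy-to-predict padding pulls the unmasked PPL down, so the two numbers are not comparable with each other, which is why it mattered to know which one the original work reported.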

We were curious to know whether the network was really using the knowledge base, so we implemented a GPT variant of the original implementation, with analogous hyperparameters, that has no encoder (E2E-nk in the table). The PPL achieved is surprisingly low, which indicates that PPL doesn’t clearly capture to what extent the knowledge base is being used. For that reason, it would be appropriate to use one of the metrics the original article proposed, Wiki F1, to compare the generated dialogue directly with the knowledge base. Extending Wiki F1 to a Wiki BLEU that considers longer n-grams could probably be a clever idea too.
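As a rough illustration of that idea, the sketch below computes an n-gram F1 between a reply and a piece of knowledge text. With `n=1` it is a unigram overlap in the spirit of Wiki F1, and with larger `n` it moves toward what a Wiki BLEU could look at. It is a simplification under my own assumptions: whitespace tokenization, no stop-word filtering, and none of the brevity penalty or n-gram averaging of real BLEU. The helper names and the knowledge sentence are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_f1(reply, reference, n=1):
    """F1 of the clipped n-gram overlap between reply and reference."""
    reply_counts = Counter(ngrams(reply.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((reply_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(reply_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Made-up knowledge sentence, and a reply with punctuation stripped.
knowledge = "gardening is the practice of growing and cultivating plants"
reply = "i love to garden and cultivating plants"

f1_unigram = ngram_f1(reply, knowledge, n=1)
f1_bigram = ngram_f1(reply, knowledge, n=2)
print(f1_unigram, f1_bigram)
```

A model that ignores the knowledge base, like E2E-nk, would be expected to score lower on a metric of this kind even when its PPL is low, although that remains to be measured.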

And an example dialogue. Bear in mind that the replies the network receives were collected in human interactions, so they don’t react to what the network generates, and the network replies are conditioned on the human dialogue, not on its own previous replies. We used beam search with a beam size of 3, as in the original article.

Topic: Gardening

human: I like Gardening, even when I’ve only been doing it for a short time.

model: I love gardening, it is a relaxing activity

human: That sounds great. I’ve always thought that I would love living in a farm, but I’ve always lived in the city. What do you mostly plant?

model: I love to garden and cultivating plants.

human: Great, I love the idea of growing my own vegetables and fruits! Do you have animals in the farm?

model: I’m not sure, but I do know that there are over a hundred different types of plants.

human: Wow, it sounds amazing, the Micro - pigs are so cute! are they trainable to be well behaved?

model: Yes, I do. I like to grow a lot of different types of plants.

Well, we see some issues. Even if the perplexities are good, the language sounds a bit generic, and the model does not seem to react properly to the history of the conversation, even considering the limitations mentioned above. I want to code a proper interaction with the model, to give a better impression of how it would feel to talk directly with it. Probably for an episode 3!

Written on October 13, 2021