EmpTransfo: How to create a chatbot that understands emotion?
Understanding emotions is challenging! (From the movie Her)


Understanding emotions and using them in a chatbot setting is a daunting task

We know that human conversation involves understanding emotions and responding to them accordingly, and this is why many conventional chatbots fail to sustain a coherent and meaningful conversation with users. Just imagine you are happy and want to tell a friend that you won a math contest, but instead of congratulating you they talk about math contests in general! I'm sure you would be disappointed and call your friend a robot for having no emotions!

For all NLP folks it's obvious that we are living in the Transformer era, where state-of-the-art language models are all based on the Transformer architecture. For the first time, it gave us the ability to harness the full power of deep neural networks for language, a power that used to belong mostly to image processing.

Here we show that we can use the Transformer architecture, augmented with multi-task learning, to train the network on the task of predicting emotions. We also take advantage of other contextual information in the dataset, which will be introduced shortly.

Architecture

If the conversation consists of a sequence of utterances:

\(U = \{u_1, u_2, \dots, u_n\}\)

then, concatenating the utterances into a single sequence of tokens, the language modeling head predicts each token from the ones before it:

\(P(t_i|t_1, \dots, t_{i-1}) = \text{softmax}(h \cdot W_1)\)

where \(h\) is the last hidden layer of the Transformer and \(W_1\) is the token embedding matrix learned in training. We can then define the language modeling loss based on cross-entropy as:

\(\mathcal{L}_1(U_T) = -\sum_{i=1}^{N} \log P(t_i|t_1, \dots, t_{i-1})\)
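Concretely, \(\mathcal{L}_1\) is just token-level cross-entropy over the vocabulary. A minimal NumPy sketch (all variable names and sizes here are illustrative, not taken from the EmpTransfo code):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_loss(h, W1, targets):
    """L1 = -sum_i log P(t_i | t_1 .. t_{i-1}).

    h:       (N, d) last hidden states, one row per position;
    W1:      (d, V) token embedding matrix used as the output projection;
    targets: (N,) gold next-token ids.
    """
    probs = softmax(h @ W1)  # (N, V) next-token distributions
    return -np.log(probs[np.arange(len(targets)), targets]).sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))      # 5 positions, hidden size 8
W1 = rng.normal(size=(8, 20))    # vocabulary of 20 tokens
loss = lm_loss(h, W1, np.array([3, 1, 7, 0, 19]))
```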

A second head is trained to distinguish the correct next utterance \(a\) from a set of distractors:

\(P_u(a|u_1, u_2, \dots, u_{T-1}) = \text{softmax}(h_{l-1} \cdot W_2)\)

with the corresponding next-utterance prediction loss:

\(\mathcal{L}_2(U_{1:T}) = -\log P_u(a|u_1, u_2, \dots, u_{T-1})\)

But why should we stop there? We can add more tasks for the Transformer to learn. If we have the emotion of each utterance as a sequence:

\(\{e_1, e_2, \dots, e_{T-1}, e_{next}\}\)

the model can be trained to distinguish the correct next emotion from a set of distractors. The reason for adding this head is to make the model learn not only grammar and language structure but also the appropriate emotion for any given history of utterances.

\(P_e(e|e_1, e_2, \dots, e_{T-1}) = \text{softmax}(h_{l-1} \cdot W_3)\)

Here \(e\) is 1 if it's the correct next emotion and 0 otherwise, and the loss function is:

\(\mathcal{L}_3(U_{1:T-1}) = -\log P_e(e|e_1, e_2, \dots, e_{T-1})\)

Finally, the total loss is a weighted sum of the three losses:

\(\mathcal{L}_{total} = c_1\mathcal{L}_{1} + c_2\mathcal{L}_{2} + c_3\mathcal{L}_{3}\)
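Putting the heads together, the weighted total loss can be sketched as below. All names, sizes, and the equal weights \(c_1 = c_2 = c_3 = 1\) are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_nll(h, W, target):
    # -log softmax(h @ W)[target]: negative log-likelihood of one
    # classification head (next utterance or next emotion).
    return -np.log(softmax(h @ W)[target])

rng = np.random.default_rng(1)
d, n_candidates, n_emotions = 8, 4, 7
h_last = rng.normal(size=d)              # shared final hidden state (illustrative)
W2 = rng.normal(size=(d, n_candidates))  # next-utterance head
W3 = rng.normal(size=(d, n_emotions))    # next-emotion head

L1 = 3.1                      # placeholder for the language modeling loss
L2 = head_nll(h_last, W2, 0)  # correct next utterance at index 0
L3 = head_nll(h_last, W3, 2)  # correct next emotion at index 2

c1 = c2 = c3 = 1.0            # equal weights: an assumption
L_total = c1 * L1 + c2 * L2 + c3 * L3
```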

https://gist.github.com/roholazandie/ab76156f704e3ba73982ef58b1e12e1d

Input representation


https://gist.github.com/roholazandie/7eef14009e67c8040dabeabb4a562218
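As a rough illustration of the idea, each input position can be represented as the sum of a token embedding, a positional embedding, and a label (e.g. emotion) embedding. The names and dimensions below are made up for the sketch and do not come from the EmpTransfo code:

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab_size, max_len, n_labels = 16, 100, 32, 10  # illustrative sizes

tok_emb = rng.normal(size=(vocab_size, d))  # token embeddings
pos_emb = rng.normal(size=(max_len, d))     # positional embeddings
lab_emb = rng.normal(size=(n_labels, d))    # emotion/label embeddings

token_ids = np.array([5, 42, 7, 99])
label_ids = np.array([1, 1, 3, 3])          # e.g. emotion of each utterance
positions = np.arange(len(token_ids))

# Each input position is the elementwise sum of its embeddings;
# the resulting matrix feeds the first Transformer layer.
inputs = tok_emb[token_ids] + pos_emb[positions] + lab_emb[label_ids]
```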

Training

https://gist.github.com/roholazandie/7ae6092168eafe5f4cb3712e10b89450

Results


In order to evaluate next-utterance emotion prediction, we calculate precision and recall from the confusion matrix over the evaluation dataset. The figure below shows the resulting confusion matrix, with Precision = 81.35, Recall = 72.37, and F1 = 76.59.
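Precision, recall, and F1 can be read off a confusion matrix directly. A minimal sketch using macro averaging (the exact averaging scheme behind the numbers above may differ), on a toy 2-class matrix:

```python
import numpy as np

def prf_from_confusion(cm):
    """Macro-averaged precision, recall, and F1 from a confusion matrix
    with true labels on rows and predicted labels on columns."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # per predicted class
    recall = tp / cm.sum(axis=1)      # per true class
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

# Toy 2-class confusion matrix, just to exercise the function.
cm = np.array([[8, 2],
               [1, 9]])
p, r, f = prf_from_confusion(cm)
```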

And finally, let’s see some real conversations with the chatbot and how it performs better in emotional contexts:

As you can see, EmpTransfo is more empathetic and responds with answers that take the user’s emotions into account. And finally, a longer conversation with the bot:


