17 Apr 2018

SOTA Neural Question Generation

Neural Question Generation for Reading Comprehension

The paper (“Learning to Ask: Neural Question Generation for Reading Comprehension”, Du et al., ACL 2017) can be found here

Goal of the paper

Automatic question generation for sentences from passages in reading comprehension

Key features:
* trainable end-to-end via seq2seq learning
* does not rely on any hand crafted features
* also able to incorporate the paragraph level information

It was a state-of-the-art model as of March 2018 (correct me if you know any other paper with better scores)

Note: There are of course a lot of places where these kinds of models can be used. I skip the applications for brevity.

Concrete definition of the task

Input: Sentence x of tokens [x_1, x_2, …, x_m]
Output: Question y of tokens [y_1, y_2, …, y_k]

y = argmax_y P(y | x)

i.e. Given an input sentence x, we try to generate a natural question y of arbitrary length k, such that it has the maximum conditional likelihood of being generated.


                    ----------                      ------------
--sent--[+para]---> | Encoder| ---[Latent Info]-->  | Decoder  | ----Generated Question---->
                    ----------                      ------------


We factorize the conditional in the above equation as follows:

P(y | x) = ∏_{t=1}^{|y|} P(y_t | x, y_<t)

where the probability of each y_t is predicted based on all words generated previously and the input sequence x.

Each of the above terms can be computed using the following equation, where h_t is the decoder state at time step t and c_t is the attention-based encoding of x at decoding time step t (explained below). W_s and W_t need to be learned:

P(y_t | x, y_<t) = softmax(W_s tanh(W_t [h_t; c_t]))
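
As a toy illustration, this output layer can be sketched in NumPy. The shapes and variable names here are my own assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a score vector
    e = np.exp(v - v.max())
    return e / e.sum()

def output_distribution(h_t, c_t, W_t, W_s):
    # P(y_t | x, y_<t) = softmax(W_s tanh(W_t [h_t; c_t]))
    # [h_t; c_t] is the concatenation of decoder state and attention context;
    # W_s projects the hidden layer to vocabulary size.
    hidden = np.tanh(W_t @ np.concatenate([h_t, c_t]))
    return softmax(W_s @ hidden)
```

The result is a proper distribution over the vocabulary at each decoding step.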
The decoder LSTM is as usual given as follows:

h_t = LSTM(y_{t-1}, h_{t-1})

(the inputs being the previously generated word y_{t-1} and the previous hidden state h_{t-1})
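
For concreteness, one decoder LSTM step can be sketched in NumPy. The gate layout and weight names below are generic textbook choices, not the paper's actual code (I use `cell` for the LSTM cell state to avoid clashing with the attention context c_t):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_prev, h_prev, cell_prev, W, U, b):
    # One decoder step. y_prev is the embedding of the previously generated
    # word; W, U, b stack the input, forget, output, and candidate gates.
    d = h_prev.shape[0]
    z = W @ y_prev + U @ h_prev + b
    i = sigmoid(z[:d])          # input gate
    f = sigmoid(z[d:2*d])       # forget gate
    o = sigmoid(z[2*d:3*d])     # output gate
    g = np.tanh(z[3*d:])        # candidate cell state
    cell = f * cell_prev + i * g
    h = o * np.tanh(cell)
    return h, cell
```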

Hidden state initialization

**Note:** Initialization of the decoder’s hidden state is the key thing that differentiates the two models that the paper proposes, based on:

a) Solely the sentence
b) The sentence and the surrounding paragraph for context

For the sentence-based model, the decoder’s hidden state is initialized with a representation of the sentence (produced by the sentence encoder), while for the paragraph-based model it is initialized with the representations of both the sentence and the paragraph (produced by the sentence and paragraph encoders respectively). See the two figures below for the two models.

Sentence based:

Sentence + Paragraph based:


Attention-based sentence encoder

We use a bi-directional LSTM to encode the input sentence as follows:

b_t-> = LSTM(x_t, b_{t-1}->)    (forward pass)
b_t<- = LSTM(x_t, b_{t+1}<-)    (backward pass)
The attention-based encoding of x at decoding step t (c_t, as mentioned above in the decoder) is given by a weighted average of the b_i (i = 1, …, |x|):

c_t = Σ_{i=1}^{|x|} a_{i,t} b_i

where b_i = [b_i->; b_i<-], i.e. the concatenation of the forward and backward hidden states at position i.

The attention weights are given by a bilinear scoring function and softmax normalization:

a_{i,t} = exp(h_t^T W_b b_i) / Σ_j exp(h_t^T W_b b_j)
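
The attention weights and the context vector c_t can be sketched together in NumPy (a toy sketch with assumed shapes; rows of B are the annotations b_i):

```python
import numpy as np

def attention(h_t, B, W_b):
    # B: (m, 2d) matrix whose rows are b_i = [b_i->; b_i<-]
    # W_b: (d_h, 2d) bilinear weight matrix; h_t: decoder state of size d_h
    scores = B @ (W_b.T @ h_t)     # score_i = h_t^T W_b b_i
    e = np.exp(scores - scores.max())
    a = e / e.sum()                # attention weights a_{i,t}
    c_t = a @ B                    # context: weighted average of the b_i
    return a, c_t
```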
The final sentence encoding used to initialize the decoder is the concatenation of the last hidden states of the forward and backward passes.


Paragraph encoder

  • Truncate the paragraph to a length threshold L to avoid extremely long paragraphs; call the truncated paragraph z
  • Encode z with a simple bi-directional LSTM
  • Use the concatenation of the last hidden states of the forward and backward passes as the paragraph encoder’s output


Given a training corpus of sentence–question pairs, our objective is to minimize the negative log-likelihood of the training data w.r.t. all parameters. Straightforward, right? :P
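
As a minimal sketch of the objective for a single training pair, assuming the per-step output distributions have already been computed:

```python
import numpy as np

def neg_log_likelihood(step_distributions, gold_token_ids):
    # Sum of -log P(y_t = gold token) over the gold question's tokens.
    # The training loss is this quantity summed over the whole corpus.
    return -sum(np.log(dist[t])
                for dist, t in zip(step_distributions, gold_token_ids))
```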


Whenever exact inference takes too much time, we use an approximate technique: beam search. This is exactly what the authors of the paper do, since scoring all possible output sequences y and selecting the one with the highest conditional likelihood is not feasible.
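
A minimal beam search over a generic next-token scorer might look like this (a toy sketch; the paper's beam size and implementation details differ):

```python
import numpy as np

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    # step_fn(prefix) returns a log-probability vector over the vocabulary.
    beams = [([bos], 0.0)]                    # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            logp = step_fn(seq)
            for tok in np.argsort(logp)[-beam_size:]:   # top-k extensions
                candidates.append((seq + [int(tok)], score + float(logp[tok])))
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        beams = candidates[:beam_size]        # keep the best beam_size prefixes
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```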

How to handle UNK tokens produced during decoding?

A simple replacement strategy is to replace each UNK with the token in the input sentence that has the highest attention score (attention weight) in the context c_t of that decoding step.
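
That replacement strategy is easy to sketch (the token names here are purely illustrative):

```python
def replace_unk(generated, attention_per_step, source_tokens, unk="<UNK>"):
    # Wherever the decoder emitted <UNK>, copy the source token with the
    # highest attention weight at that decoding step.
    out = []
    for step, token in enumerate(generated):
        if token == unk:
            weights = attention_per_step[step]
            best = max(range(len(weights)), key=lambda i: weights[i])
            out.append(source_tokens[best])
        else:
            out.append(token)
    return out
```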


SQuAD - Stanford Question Answering Dataset

This has the contexts, questions, and answers, along with the character offset where the answer starts in the context text. (This offset can be used to recover the sentence containing the answer, which is what the sentence encoder model consumes.)
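
A naive way to recover the answer's sentence from that character offset (a real pipeline would use a proper sentence tokenizer; this toy split is just for illustration):

```python
def answer_sentence(context, answer_start):
    # Split naively on '. ' and return the sentence whose character span
    # contains the answer's start offset.
    start = 0
    for sent in context.split(". "):
        end = start + len(sent) + 2   # +2 for the '. ' removed by split
        if answer_start < end:
            return sent
        start = end
    return sent
```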


Evaluating seq2seq models is a nightmare to me. People have used all kinds of approximate ideas and devised scoring metrics, and yet none of them intuitively qualifies as good enough.

It remains a challenge how to automatically score a seq2seq task, when the target questions (or translations) need not exactly match the gold references and can still be correct.

The authors use BLEU-n, ROUGE, etc.

Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.

Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
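
The unigram versions of these two quantities can be sketched as follows (real BLEU and ROUGE add higher-order n-grams, brevity penalties, and more):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # The n=1 ingredient of BLEU: fraction of candidate tokens that also
    # appear in the reference, clipped by the reference counts.
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(len(candidate), 1)

def unigram_recall(candidate, reference):
    # The n=1 ingredient of ROUGE: fraction of reference tokens that the
    # candidate recovers.
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / max(len(reference), 1)
```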


My views on the paper

The reported model performs worse when the paragraph/context around the sentence is included. Unfortunately, I think this is an artifact of an ill-suited scoring metric. In the dataset, answers are expected to be a phrase within a single sentence, which is why the gold questions are framed around a single sentence. Adding paragraph information forces the model to frame a question from a larger input, yet the scoring metric still expects the model to generate the question most similar to the gold one, even though the extra context might actually be helping the model generate a more natural and more difficult question. This makes me feel that automatic evaluation is not the right approach for these kinds of tasks.

Again, human evaluation is subjective and so is not a very good scoring metric either, but it at least captures in some way whether the generated questions are more natural, more difficult, etc. So it might be a more appropriate way to judge these kinds of models (though it can be costly).
