Generative based chatbot


Arpan Sil 

Kounteyo Roy Chowdhury

A short introduction to Chatbots:

Conversational agents or dialogue systems, popularly known as chatbots, are rapidly gaining popularity and importance across business domains and customer service portals. The primary motivation behind building a chatbot is to answer the basic queries, requests and questions asked by consumers, so that management and employees can focus on more important work. With ongoing research and development in Natural Language Processing (NLP) and Deep Learning, chatbots are gaining the potential to solve more than just the elementary problems they were designed to solve in the first place. Let us look at a few examples of where chatbots are used in real life:

  • Food ordering and delivery platforms like Zomato, Swiggy, etc., where customers can convey their complaints and grievances and the chatbot handles them at the first level
  • Customer care portals for mobile network operators like Jio, Vodafone, etc.
  • Websites with chatbots that resolve basic queries posed by a visitor

And so on….

Chatbots generally try to process the following to respond to the questions asked:

  1. Purpose of the user (What is the user inquisitive about?)
  2. Did the user say anything generic or specific?
  3. What questions can be asked to understand the user requirement more?
  4. What can be the most relevant reply to the question asked by the user?

The efficiency of the reply or answer given by the chatbot to the user will strongly depend on how accurately the chatbot finds answers to the above listed questions.

Now, we will explore the two types of chatbots used:

  1. Generative based chatbots: These generate new responses using machine learning and deep learning techniques trained on a large amount of historic data and previous conversations. Although the process does not rely on pre-defined responses, the responses given may be irrelevant or grammatically incorrect in many cases.
  2. Retrieval based chatbots: In this case, there is a collection of pre-defined responses, and some technique (ranging from simple rule based pattern matching to something as complex as ensemble learning) is used to choose the best response for a given question. No new response is generated.
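
To make the retrieval based idea concrete, here is a minimal sketch in plain Python. It assumes a tiny, hypothetical table of pre-defined question and reply pairs (`RESPONSES`) and picks the reply whose stored question shares the most words with the user's message. Real retrieval based bots use far better matching than word overlap; this only illustrates that nothing new is generated.

```python
# Hypothetical pre-defined responses; a real bot would hold many more.
RESPONSES = {
    "what are your opening hours": "We are open from 9 am to 9 pm.",
    "how can i track my order": "You can track your order from the Orders page.",
    "i want to cancel my order": "Sure, please share your order number.",
}

FALLBACK = "Sorry, I did not understand that."


def retrieve_response(user_message):
    """Return the stored reply whose question shares the most words with the message."""
    user_words = set(user_message.lower().split())
    best_reply, best_overlap = FALLBACK, 0
    for question, reply in RESPONSES.items():
        overlap = len(user_words & set(question.split()))
        if overlap > best_overlap:
            best_reply, best_overlap = reply, overlap
    return best_reply


print(retrieve_response("Please cancel my order"))
```

A generative chatbot, by contrast, has no such table to look up and has to produce the reply word by word.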

It is important to understand the concept of open and closed domains in relation to chatbots.

  • Open domain (Generalist bots): This is an open conversation in which the user can take the topic anywhere, entirely at their own discretion. The space of possible inputs and outputs is ideally infinite in this case. For example: broad financial management advice, such as ways of reducing income tax.
  • Closed domain (Specialist bots): Here, the space of possible inputs and outputs is somewhat limited. The system is mostly built to achieve a very specific goal or set of goals. For example: a food ordering service with a specified restaurant menu, where the number of options can be very large, but finite.
Figure: Chatbots with respect to the domains and how easy or difficult it is to build

In this article, we will focus on building a simple generative based chatbot using Python and TensorFlow.


The first step in any data science or machine learning problem is thorough data pre-processing or cleaning to make the data ready for analysis (also called data wrangling). In many machine learning projects, this step is often said to consume the bulk of the total time. For text-mining and NLP problems, pre-processing is even more important and difficult. Once the data is ready for analysis, several models can be tried out so that the best one can be identified and implemented.

Data Cleaning/ Pre-processing

Data cleaning, in this case, essentially involves the following steps:

  • The entire text is first converted into lowercase, for homogeneity and ease of analysis. For example, ‘Help’ and ‘help’ convey the same meaning, but when the text is converted into vectors they would be stored as 2 different vectors and treated separately, increasing the pressure on computation and storage. Converting the entire text into lowercase (chosen by convention; uppercase would work equally well but is rarely used) reduces this pressure considerably.
  • The next step is tokenization, which is splitting a given string or text into individual words or tokens. For example: A={’the sun rises in the east’} is split into A= {‘the’, ‘sun’, ‘rises’, ‘in’, ‘the’, ‘east’}. The essence of any machine learning problem is breaking the problem into its smallest meaningful parts for ease of understanding and micro-analysis, and tokenization does exactly that for text. Whole sentences are difficult for a computer to analyse and interpret, but once they are split into words or tokens and converted into vectors they become quite manageable.
  • The third step is adding the tokens generated in the previous step to the vocabulary, which forms a repository of words for comparison and analysis in the subsequent steps.
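
Before looking at the full listing, the three steps above can be sketched in a few lines of plain Python. The helper name `build_vocabulary` and the sample sentences are only illustrative assumptions; the article's own code relies on the Keras tokenizer instead.

```python
def build_vocabulary(sentences):
    """Lowercase and split each sentence, then give every new token the next free index."""
    vocabulary = {}   # token -> integer index, starting at 1 (0 is kept free for padding)
    tokenized = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        tokenized.append(tokens)
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return tokenized, vocabulary


tokenized, vocabulary = build_vocabulary(["The sun rises in the east", "Help the sun"])
print(tokenized[0])   # ['the', 'sun', 'rises', 'in', 'the', 'east']
print(vocabulary)     # {'the': 1, 'sun': 2, 'rises': 3, 'in': 4, 'east': 5, 'help': 6}
```

Note that ‘The’ and ‘the’ end up sharing a single vocabulary entry because of the lowercasing step.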
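import os

import yaml
from tensorflow.keras import preprocessing  # legacy Keras text pre-processing API

dir_path = 'chatbot_nlp/data'
files_list = os.listdir(dir_path)

questions = []
answers = []

# Each YAML file holds a list of conversations. The first entry of a
# conversation is treated as the question and the rest as the reply.
for filename in files_list:
    with open(os.path.join(dir_path, filename), 'rb') as stream:
        docs = yaml.safe_load(stream)
    for con in docs['conversations']:
        if len(con) > 2:
            # Several replies: join them into a single answer string
            questions.append(con[0])
            answers.append(' '.join(str(rep) for rep in con[1:]))
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])

# Drop pairs whose answer was not parsed as a string (YAML can yield other
# types), keeping questions and answers aligned, then add start/end tags.
pairs = [(q, a) for q, a in zip(questions, answers) if isinstance(a, str)]
questions = [q for q, _ in pairs]
answers = ['<START> ' + a + ' <END>' for _, a in pairs]

# The default Tokenizer filters strip '<' and '>' and lowercase the text,
# so the tags are stored in the vocabulary as 'start' and 'end'.
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)
VOCAB_SIZE = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
print('VOCAB SIZE : {}'.format(VOCAB_SIZE))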

Preparing the environment for implementing the neural network model:

The pre-processing is now complete, but we still need to arrange the data so that, when fed into the neural network, it gives us the desired output. Recurrent neural networks using LSTM, and why and how they are used, are discussed in the subsequent sections.

Now, let’s think from the perspective of the chatbot. It is supposed to take in questions from the user and generate the most relevant reply to each question. For giving a relevant reply, LSTMs provide a huge improvement over conventional RNNs.

For this, we prepare the following arrays:

1) Encoder input data:
  • First, the input questions are converted to tokens.
  • The maximum length of the tokenized questions is then found.
  • Next, the tokenized questions are padded to the maximum length (as found in the previous step) using post-padding and stored in a NumPy array, saved as the encoder input data.
Figure: A simplified figure to understand the flow of the process in encoder and decoder
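
The padding step for the encoder input can be pictured with a small pure-Python sketch. The helper `pad_post` and the token ids below are made up for illustration; the article's code uses the Keras `pad_sequences` utility with `padding='post'` for the same purpose.

```python
def pad_post(sequences, pad_value=0):
    """Append pad_value to the end of each sequence until all match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, max_len


# Three tokenized questions of different lengths (hypothetical token ids)
padded, max_len = pad_post([[4, 7], [2, 9, 5, 1], [3]])
print(max_len)  # 4
print(padded)   # [[4, 7, 0, 0], [2, 9, 5, 1], [3, 0, 0, 0]]
```

Post-padding places the zeros after the real tokens, so every row has the same length and can be stacked into one array.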
2) Decoder input data:

Here, the expected answers to the questions from the previous step are converted to tokens, padded to their maximum length using post-padding and stored in an array called the decoder input data.

3) Decoder output data:

The question here is one of context. The answer provided by the chatbot should give the customer relevant information as far as possible, and for this the order in which things were said has to be taken into account. Let us take an example.

A consumer is ordering food from a restaurant that uses a generative chatbot. Suppose he orders 4 items one after another and then, perhaps 2 minutes later, decides to cancel the 2nd item. For this, the customer types in “Cancel my 2nd order”. Reading this, the chatbot must be able to work out what the 2nd order was, confirm it with the customer and then proceed with the cancellation. This kind of contextual understanding depends on the sequence of earlier inputs, which is historic information for the machine by then; a recurrent network carries such information forward in its internal state.

The same sense of order matters inside a single reply: each word depends on the words before it (for example, ‘am’ should come after ‘I’ and not after ‘He’). To train for this, we use 2 matrices for the decoder. The decoder input holds the answer tokens, and the decoder output holds the same tokens shifted one time step ahead, so that at every step the decoder is given the correct previous token and learns to predict the next one. This practice is called ‘teacher forcing’.

The essential steps involved here are:

  • Tokenizing the answers and dropping the first token (the start tag) of each, so that the target at every time step is the token that follows the current decoder input.
  • One-hot encoding of the padded answers. One-hot encoding is a standard technique in machine learning in which each token is represented by a vector as long as the vocabulary, containing 1 at the position of that token and 0 everywhere else. The targets are encoded this way because the categorical cross-entropy loss compares them with the probability distribution over the vocabulary that the model outputs.
  • Finally, the output is stored in an array named the decoder output data.
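
The relation between the decoder input and the decoder output is easiest to see on one short answer. The sketch below is a pure-Python illustration with a hypothetical six-word vocabulary (`VOCAB`); it is not the article's code, which does the same thing with Keras utilities on the whole dataset.

```python
def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical vocabulary; index 0 is reserved for padding
VOCAB = {"<pad>": 0, "start": 1, "i": 2, "am": 3, "fine": 4, "end": 5}

answer = ["start", "i", "am", "fine", "end"]
token_ids = [VOCAB[word] for word in answer]

decoder_input = token_ids                           # what the decoder is fed
decoder_target = token_ids[1:] + [VOCAB["<pad>"]]   # same tokens, one step ahead
decoder_output = [one_hot(t, len(VOCAB)) for t in decoder_target]

print(decoder_input)      # [1, 2, 3, 4, 5]
print(decoder_target)     # [2, 3, 4, 5, 0]
print(decoder_output[0])  # [0, 0, 1, 0, 0, 0]
```

At the step where the decoder is fed ‘i’, the target is ‘am’; this pairing of each token with its successor is what teacher forcing amounts to in the data.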
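import re

import numpy as np
from gensim.models import Word2Vec

def tokenize(sentences):
    """Lowercase each sentence, replace non-letters with spaces and split into tokens."""
    tokens_list = []
    vocabulary = []
    for sentence in sentences:
        sentence = sentence.lower()
        sentence = re.sub('[^a-zA-Z]', ' ', sentence)
        tokens = sentence.split()
        vocabulary += tokens
        tokens_list.append(tokens)
    return tokens_list, vocabulary

# Train 100-dimensional Word2Vec vectors on the corpus (gensim 4.x argument
# names; min_count=1 keeps rare words so that they also receive a vector).
tokens_list, vocabulary = tokenize(questions + answers)
w2v_model = Word2Vec(tokens_list, vector_size=100, min_count=1)

# Row i of the matrix holds the vector of the word with tokenizer index i;
# row 0 (padding) and words unknown to Word2Vec stay as zeros.
# Note: the model defined later trains its own 200-dimensional embeddings
# and does not load this matrix.
embedding_matrix = np.zeros((VOCAB_SIZE, 100))
for word, index in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[index] = w2v_model.wv[word]
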
import numpy as np
from tensorflow.keras import preprocessing, utils

# Encoder input: tokenized questions, post-padded to the longest question
tokenized_questions = tokenizer.texts_to_sequences(questions)
maxlen_questions = max(len(x) for x in tokenized_questions)
padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')
encoder_input_data = np.array(padded_questions)
print(encoder_input_data.shape, maxlen_questions)

# Decoder input: tokenized answers (including the start tag), post-padded
tokenized_answers = tokenizer.texts_to_sequences(answers)
maxlen_answers = max(len(x) for x in tokenized_answers)
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
decoder_input_data = np.array(padded_answers)
print(decoder_input_data.shape, maxlen_answers)

# Decoder output: the same answers shifted one step to the left by dropping
# the start tag, so the target at each step is the next token (teacher forcing)
tokenized_answers = tokenizer.texts_to_sequences(answers)
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
one_hot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)
decoder_output_data = np.array(one_hot_answers)
print(decoder_output_data.shape)

Now that we have formed the 3 arrays, it’s time to move on to building and training the model.

The LSTM RNN Model:

Now, we move on to use a Long Short-Term Memory (LSTM) recurrent neural network (RNN), which is built around internal gates.

Figure: A small example of a seq2seq model for response generation in a chatbot

We use an LSTM RNN instead of a conventional RNN because it handles long-term dependencies better. Each unit or cell within this layer maintains an internal cell state (abbreviated as ‘c’) and outputs a hidden state (abbreviated as ‘h’). There are gates, each of which outputs a value between 0 and 1, where 0 means discard the information completely and 1 means keep the information completely. What new information is stored in the cell state is decided jointly by a sigmoid layer (the input gate) and a tanh layer (the candidate values). A detailed description of how an LSTM works can be found in the references listed at the end of this article.
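
To show how the gates combine, here is a minimal sketch of one time step of an LSTM with a single unit and a scalar input, written in plain Python. The function `lstm_step`, the weight layout and the numbers are illustrative assumptions following the standard LSTM equations; Keras implements the same idea with matrices and learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One time step of a single-unit LSTM.

    `weights` maps a gate name to a tuple (w_x, w_h, bias).
    """
    def pre_activation(gate):
        w_x, w_h, bias = weights[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much of the candidate to let in
    g = math.tanh(pre_activation("candidate"))  # candidate value for the cell state
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose

    c = f * c_prev + i * g   # new internal cell state
    h = o * math.tanh(c)     # new hidden state (the output of the cell)
    return h, c


# Arbitrary example weights, chosen only for illustration
example_weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.4, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.6, 0.1, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5, weights=example_weights)
print(h, c)
```

Because the forget gate multiplies the previous cell state by a value between 0 and 1, the cell can carry information across many steps when that value stays close to 1.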

The basic configuration of the RNN used is as follows:

  1. There are 2 input layers, one for the encoder input data and another for the decoder input data.
  2. Embedding layer: Word embedding is a class of approaches for representing words and documents as dense vectors. Here, it is used to convert token indices into fixed-size dense vectors.
  3. LSTM layer: Provides access to the Long Short-Term Memory cells.
  4. Adam is used as the optimizer and categorical cross-entropy as the loss function, as in the code below.
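
The loss mentioned in the last item can be worked through by hand. The sketch below, in plain Python, applies a softmax to raw scores and computes the categorical cross-entropy against a one-hot target; the function names and the four-word vocabulary size are assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, predicted_probs):
    """Negative log-probability assigned to the correct token."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs) if t > 0)


# A model that is undecided between 4 tokens assigns 0.25 to each of them
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = categorical_cross_entropy([0, 1, 0, 0], probs)
print(probs)  # [0.25, 0.25, 0.25, 0.25]
print(loss)   # log(4), about 1.386
```

The loss shrinks towards 0 as the probability given to the correct token approaches 1, which is what training pushes the decoder towards at every time step.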
import tensorflow as tf

# Encoder: embeds the question tokens and keeps only the final LSTM states
encoder_inputs = tf.keras.layers.Input(shape=(None,))
encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(200, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder: starts from the encoder states and predicts a token at every step
decoder_inputs = tf.keras.layers.Input(shape=(None,))
decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200, return_state=True, return_sequences=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation=tf.keras.activations.softmax)
output = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

# Train on (question, answer) pairs with the shifted answers as targets
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=100)

Interacting with our chatbot:

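The basic LSTM sequence-to-sequence model is trained to predict the decoder output data given the encoder input data and the decoder input data. The encoder input data passes through the Embedding layer (the encoder embedding). The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors (h and c, the encoder states described above). These states are set as the initial states of the decoder’s LSTM cell. The decoder input data comes in through its own Embedding layer, and the embedded sequence goes into the LSTM cell (which holds the encoder states) to produce the output sequence. To talk to the chatbot, the trained layers are reused in separate encoder and decoder models, and the reply is generated one word at a time until the end tag is produced or the maximum answer length is reached.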
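
The word-by-word generation loop can be understood without any neural network. In the sketch below, a hypothetical lookup table (`NEXT_TOKEN_SCORES`) stands in for the trained decoder: given the previous token it returns scores for the next one, and the loop always picks the highest-scoring token (greedy decoding). A real decoder would also take the LSTM states into account; this only shows the control flow.

```python
# Hypothetical stand-in for the trained decoder: previous token -> next-token scores
NEXT_TOKEN_SCORES = {
    "start": {"i": 0.90, "am": 0.05, "fine": 0.03, "end": 0.02},
    "i":     {"i": 0.02, "am": 0.90, "fine": 0.05, "end": 0.03},
    "am":    {"i": 0.02, "am": 0.03, "fine": 0.90, "end": 0.05},
    "fine":  {"i": 0.02, "am": 0.03, "fine": 0.05, "end": 0.90},
}


def greedy_decode(max_len=10):
    """Generate a reply token by token, always taking the most likely next token."""
    token, reply = "start", []
    while len(reply) < max_len:
        scores = NEXT_TOKEN_SCORES[token]
        token = max(scores, key=scores.get)  # equivalent of argmax over the vocabulary
        if token == "end":
            break
        reply.append(token)
    return " ".join(reply)


print(greedy_decode())  # i am fine
```

The two stopping conditions, reaching the end tag or hitting a maximum length, are the same ones used when decoding with the trained model.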

def make_inference_models():
    # Encoder model: maps a question to the LSTM states [h, c]
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

    # Decoder model: takes the previous token and states, returns the
    # next-token probabilities together with the updated states
    decoder_state_input_h = tf.keras.layers.Input(shape=(200,))
    decoder_state_input_c = tf.keras.layers.Input(shape=(200,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    return encoder_model, decoder_model

def str_to_tokens(sentence: str):
    # Convert a sentence to padded token ids; words outside the vocabulary are skipped
    words = sentence.lower().split()
    tokens_list = [tokenizer.word_index[word] for word in words if word in tokenizer.word_index]
    return preprocessing.sequence.pad_sequences([tokens_list], maxlen=maxlen_questions, padding='post')

enc_model, dec_model = make_inference_models()

for _ in range(10):
    # Encode the question into the initial decoder states
    states_values = enc_model.predict(str_to_tokens(input('Enter question : ')))
    # Decoding starts from the 'start' tag
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        # Greedy choice: take the most probable next token
        sampled_word_index = int(np.argmax(dec_outputs[0, -1, :]))
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        # Stop on the end tag, on the padding index, or when the reply is long enough
        if sampled_word is None or sampled_word == 'end':
            stop_condition = True
        else:
            decoded_translation += ' {}'.format(sampled_word)
            if len(decoded_translation.split()) > maxlen_answers:
                stop_condition = True
        # Feed the chosen token and the updated states back into the decoder
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]

    print(decoded_translation)

Interaction with the created chatbot:

In our next articles, we will discuss:

  1. Retrieval based chatbots with Python (chatbots which are trained with a collection of pre-defined responses).
  2. Creation of a GUI (Graphical User Interface), like that of Google Assistant, which can be integrated with any website or app.

For detailed code and saved model you can visit the GitHub repository.


Understanding LSTMs

Difference between Return sequences & Return states for LSTMs

Recurrent Layers (LSTM)

Introduction to seq2seq model


Chatbots explanation

Kounteyo Roy Chowdhury

Msc in applied statistics.
Data Scientist specializing in AI-NLP

Arpan Sil

Msc in Applied statistics
A technical data Analyst



Mathematica-city is an online education forum for science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different statistical software. Feel free to reach out to us for any kind of discussion on any of the related topics.
