Generative based chatbot

Arpan Sil 

Kounteyo Roy Chowdhury

A short introduction to Chatbots:

Conversational agents or dialogue systems, popularly known as chatbots, are rapidly gaining popularity and importance across various business domains and customer service portals. The primary motivation behind building a chatbot is to answer the basic queries, requests and questions posed by the consumer, so that management and employees can focus on more important tasks. Nowadays, with increasing research and development in the fields of Natural Language Processing (NLP) and Deep Learning, chatbots are gaining the potential to solve much more than the elementary problems they were designed for in the first place. Let us look at a few examples of where chatbots are used in real life:

  • Food delivery platforms like Zomato, Swiggy, etc., where the customer can convey complaints and grievances, and the chatbot can handle them at the first step
  • Customer care portals for mobile connections and communications like Jio, Vodafone, etc.
  • Websites having chatbots to resolve primary queries posed by a visitor

And so on….

Chatbots generally try to process the following to respond to the questions asked:

  1. Purpose of the user (What is the user inquisitive about?)
  2. Did the user say anything generic or specific?
  3. What questions can be asked to understand the user requirement more?
  4. What can be the most relevant reply to the question asked by the user?

The efficiency of the reply given by the chatbot will strongly depend on how accurately it finds answers to the questions listed above.

Now, we will explore the two types of chatbots used:

  1. Generative based chatbots: These generate a new response using machine learning and deep learning techniques trained on large volumes of historic data and previous conversations. Although the process does not rely on pre-defined responses, the generated replies may be irrelevant or grammatically incorrect in many cases.
  2. Retrieval based chatbots: In this case, there is a collection of pre-defined responses, and using some technique (ranging from simple rule-based pattern matching to complex ensemble learning), the bot chooses the best response for a given question. It does not generate any new response.

It is important to understand the open and closed domain concept here in relation to the chatbot.

  • Open domain (Generalist bots): This is basically an open conversation where the user can take the content or topic of conversation anywhere, completely at the user's discretion. The space of possible inputs and outputs is ideally infinite or unlimited in this case. For example: financial management broadly, or ways of reducing income tax.
  • Closed domain (Specialist bots): Here, the scope or space of possible inputs and outputs is somewhat limited. The system is mostly made to achieve a very specific goal or set of goals. For example: a food ordering service with a specified menu in a restaurant, where the number of options can be very large, but finite.
Figure: Chatbots with respect to the domains and how easy or difficult they are to build

In this article, we will focus on building a simple generative based chatbot using Python and TensorFlow.


The first step in any data science or machine learning problem is thorough data pre-processing or cleaning to make the data ready for analysis (also called data wrangling). In most machine learning projects, this consumes around 80-90% of the time. For text-mining and NLP problems, the pre-processing is even more important and difficult. Once the data is ready for analysis, numerous models can be tried to find the best one and implement it.

Data Cleaning/ Pre-processing

Data cleaning in this case, essentially involves the following steps:

  • The entire text is first converted into lowercase, for homogeneity and ease of analysis. For example, 'Help' and 'help' convey the same meaning, but when the text is converted into vectors they would be stored as 2 different vectors and treated separately, increasing the load on computation and storage. Converting the entire text into lowercase (chosen by convention; uppercase would work equally well but is traditionally not used) reduces this load considerably.
  • The next step involves tokenization, which is splitting a given string or text into individual words or tokens. For example, A = {'the sun rises in the east'} is split into A = {'the', 'sun', 'rises', 'in', 'the', 'east'}. The essence of any machine learning problem is breaking it into the smallest meaningful parts for ease of understanding and micro-analysis, and tokenization performs that task for text-mining quite comprehensively. Sentences as a whole are difficult for a computer to analyse and interpret, but when split into words or tokens and converted into vectors, they become quite manageable.
  • The third step involves adding the tokens generated in the previous step to the vocabulary, which forms a repository of words for comparison and analysis in the subsequent steps.
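The snippet below implements these steps: it reads the question-answer pairs from the YAML corpus files under chatbot_nlp/data, tags every answer with <START> and <END> markers (needed later by the decoder), and fits a Keras Tokenizer on all the text to build the vocabulary: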
import os
import yaml
import numpy as np
import tensorflow as tf
from tensorflow.keras import preprocessing, utils

dir_path = 'chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)

questions = list()
answers = list()

for filepath in files_list:
    stream = open(dir_path + os.sep + filepath, 'rb')
    docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len(con) > 2:
            # multi-turn entry: the first line is the question and
            # the remaining lines are joined into a single answer
            questions.append(con[0])
            replies = con[1:]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append(ans)
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])

# keep only string answers; drop the corresponding question otherwise
answers_with_tags = list()
for i in range(len(answers)):
    if type(answers[i]) == str:
        answers_with_tags.append(answers[i])
    else:
        questions.pop(i)

answers = list()
for i in range(len(answers_with_tags)):
    answers.append('<START> ' + answers_with_tags[i] + ' <END>')

tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)
VOCAB_SIZE = len(tokenizer.word_index) + 1
print('VOCAB SIZE : {}'.format(VOCAB_SIZE))

Preparing the environment for implementing the neural network model:

The pre-processing is now complete, but we need to prepare the data in such a way that, when fed into the neural network, it gives us the desired output. Recurrent neural networks using LSTMs, and why and how they are used, are discussed in the subsequent sections.

Now, let's think from the perspective of the chatbot. It is supposed to take in questions from the user and generate a reply with maximum relevance. For giving a relevant reply, LSTMs provide a huge improvement over conventional RNNs.

For this, we prepare the following arrays:

1) Encoder input data:
  • First, the input questions are converted to tokens.
  • The maximum length of the tokenized questions is then found.
  • Next, the tokenized questions are padded to this maximum length using post-padding (a toy example follows this list), and stored in a NumPy array, saved as the encoder input data.
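As a toy illustration (the token ids below are made up), post-padding appends zeros until every sequence reaches the required length:

from tensorflow.keras import preprocessing

seqs = [[4, 7], [3, 1, 9]]  # hypothetical token-id sequences
print(preprocessing.sequence.pad_sequences(seqs, maxlen=3, padding='post'))
# [[4 7 0]
#  [3 1 9]]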
Figure: A simplified figure to understand the flow of the process in encoder and decoder
2) Decoder input data:

The essential steps followed here are: first, the target answers to the questions from the previous step are converted to tokens, then padded to the maximum answer length using post-padding, and finally stored in an array called the decoder input data.

3) Decoder output data:

The question here is one of context. The answer provided by the chatbot should satisfy the basic objective of giving the customer relevant information as far as possible. For this, taking the time factor into account is imperative. Let us take an example.

A consumer is ordering food from a restaurant, and the restaurant is using a generative chatbot. Suppose he orders 4 items sequentially and then decides to cancel the 2nd item he ordered, maybe 2 minutes back. For this, the customer types in "Cancel my 2nd order". Reading this, the chatbot must be able to understand what the 2nd order was, confirm it with the customer and then proceed towards cancellation. To make this contextual connection, the time factor comes into play, as the order has become historic information for the machine by then. To implement this, the decoder output data is used.

We use 2 matrices for the decoder so that the time factor is taken into account for the current target token (for example, 'am' should come after 'I' and not after 'He'). Training the decoder on the true previous token, rather than on its own prediction, is the practice sometimes called 'teacher forcing'.
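As a minimal sketch (the answer below is purely illustrative), the decoder target is simply the tagged answer shifted one position ahead of the decoder input:

# Hypothetical tagged answer, as produced by the pre-processing above
answer_tokens = ['start', 'hello', 'there', 'end']
decoder_input = answer_tokens         # the true previous tokens fed to the decoder
decoder_target = answer_tokens[1:]    # what it must predict: one token ahead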

The essential steps involved here are:

  • Tokenizing the answers and dropping the leading <START> tag, so that each target token sits one time step ahead of the corresponding decoder input.
  • One-hot encoding of the padded answers. One-hot encoding is a well-known technique in machine learning, where the data is coded as rows of essentially binary numbers, 1 or 0. Since computation inside a machine happens only on binary numbers (even text, video or audio is, at the most basic level, converted into combinations of 0 and 1), one-hot encoding is a natural representation here as well; a toy example follows this list.
  • Finally, the output is stored in an array named the decoder output data.
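As a toy illustration (assuming a hypothetical vocabulary of just 5 tokens), utils.to_categorical turns each token id into a binary row vector:

from tensorflow.keras import utils
import numpy as np

seq = np.array([[2, 4, 1]])  # hypothetical padded token ids
print(utils.to_categorical(seq, num_classes=5))
# [[[0. 0. 1. 0. 0.]
#   [0. 0. 0. 0. 1.]
#   [0. 1. 0. 0. 0.]]]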
from gensim.models import Word2Vec
import re

# vocabulary in the same order as the Keras tokenizer's word index
vocab = [word for word in tokenizer.word_index]

def tokenize(sentences):
    tokens_list = []
    vocabulary = []
    for sentence in sentences:
        sentence = sentence.lower()
        sentence = re.sub('[^a-zA-Z]', ' ', sentence)
        tokens = sentence.split()
        vocabulary += tokens
        tokens_list.append(tokens)
    return tokens_list, vocabulary

# Optional: train Word2Vec embeddings on the corpus. The seq2seq model
# below learns its own Embedding layer instead, so this matrix is not
# strictly required for what follows.
p = tokenize(questions + answers)
model = Word2Vec(p[0], vector_size=100, min_count=1)  # use size=100 on gensim < 4
embedding_matrix = np.zeros((VOCAB_SIZE, 100))
for word, i in tokenizer.word_index.items():
    if word in model.wv:
        embedding_matrix[i] = model.wv[word]

tokenized_questions = tokenizer.texts_to_sequences(questions)
maxlen_questions = max([len(x) for x in tokenized_questions])
padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')
encoder_input_data = np.array( padded_questions )
print(encoder_input_data.shape, maxlen_questions)

tokenized_answers = tokenizer.texts_to_sequences(answers)
maxlen_answers = max([len(x) for x in tokenized_answers])
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
decoder_input_data = np.array(padded_answers)
print(decoder_input_data.shape, maxlen_answers)

tokenized_answers = tokenizer.texts_to_sequences(answers)
for i in range(len(tokenized_answers)):
    # drop the leading <START> token so each target is one step ahead
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
one_hot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)
decoder_output_data = np.array(one_hot_answers)

Now that we have formed the 3 arrays for analysis, it's time to move on to building and training the model.

The LSTM RNN Model:

Now, we move on to use a Long Short-Term Memory (LSTM) recurrent neural network (RNN) comprising internal gates.

Figure: A small example of a seq2seq model for response generation in a chatbot

We use an LSTM RNN instead of a conventional RNN to handle long-term dependencies. In LSTMs, each unit or cell within the layer maintains an internal cell state (abbreviated as 'c') and outputs a hidden state (abbreviated as 'h'). There are gates which output a value between 0 and 1, where 0 means completely get rid of the information and 1 means take the information in completely. The information to store in the cell state is decided jointly by a sigmoid layer and a tanh layer. A detailed description of how an LSTM works can be found in this link.
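For reference, the standard LSTM cell updates are (with \sigma the logistic sigmoid and \odot element-wise multiplication):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate values)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(c_t)                        (hidden state)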

The basic configuration of the RNN used is as follows:

  1. There are 2 input layers: one for the encoder input data and another for the decoder input data.
  2. Embedding layer: word embedding is a class of approaches for representing words and documents using dense vector representations. Here, it is used for converting token ids to fixed-size dense vectors.
  3. LSTM layer: provides access to Long Short-Term Memory cells.
  4. Adam is used as the optimizer and categorical cross-entropy as the loss function here (see the compile step below).
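In the code below, the encoder LSTM returns its final hidden and cell states (h and c), and these encoder states are used to initialise the decoder LSTM; both sides use 200-dimensional embeddings and 200 LSTM units: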
encoder_inputs = tf.keras.layers.Input(shape=( None , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=100)
model.save('model.h5')  # illustrative filename for the saved model

Interacting with our chatbot:

The basic LSTM sequence-to-sequence model is trained to predict the decoder output data given the encoder input and decoder input data. The encoder input data goes into an Embedding layer (the encoder embedding). The output of the Embedding layer goes to the LSTM cell, which produces the 2 state vectors (h and c, the encoder states described above). These states are set in the LSTM cell of the decoder. The decoder input data comes in through its own Embedding layer, and this embedding goes into the decoder LSTM cell (initialised with the encoder states) to produce the output sequences.
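For inference, we build two separate models from the already-trained layers: an encoder model that maps a question to its final states, and a decoder model that, given those states and the previously generated token, predicts the next token: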

def make_inference_models():
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    decoder_state_input_h = tf.keras.layers.Input(shape=( 200 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 200 ,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    return encoder_model , decoder_model

def str_to_tokens( sentence : str ):
    # convert a raw question into a padded sequence of token ids;
    # every word must already be in the tokenizer's vocabulary,
    # otherwise this lookup raises a KeyError
    words = sentence.lower().split()
    tokens_list = list()
    for word in words:
        tokens_list.append( tokenizer.word_index[ word ] )
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')
enc_model , dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
    # seed the decoder with the <START> token (lower-cased to 'start' by the tokenizer)
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        # greedy decoding: pick the highest-probability token at each step
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        sampled_word = None
        for word , index in tokenizer.word_index.items() :
            if sampled_word_index == index :
                decoded_translation += ' {}'.format( word )
                sampled_word = word
        # stop at the <END> token or when the reply exceeds the maximum answer length
        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
        # feed the sampled token and the updated states back in for the next step
        empty_target_seq = np.zeros( ( 1 , 1 ) )
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ]

    print( decoded_translation )

Figure: Interaction with the created chatbot

In our next articles, we will discuss:

  1. Retrieval based chatbots with Python (chatbots which are trained with a collection of pre-defined responses).
  2. Creation of a GUI (Graphical User Interface), like Google Assistant, which can be integrated with any website or app.

For the detailed code and the saved model, you can visit the GitHub repository.


References:

  • Understanding LSTMs
  • Difference between Return sequences & Return states for LSTMs
  • Recurrent Layers (LSTM)
  • Introduction to seq2seq models
  • Chatbots explanation

Kounteyo Roy Chowdhury

MSc in Applied Statistics.
Data Scientist specializing in AI-NLP

Arpan Sil

MSc in Applied Statistics.
Technical Data Analyst


Mathematica-city is an online education forum for science students run by Kounteyo, Shreyansh and Souvik. We aim to provide articles related to Actuarial Science, Data Science, Statistics, Mathematics and their applications using different statistical software. We also provide freelancing services on the aforementioned topics. Feel free to reach out to us for any kind of discussion on any of the related topics.
