Elena's AI Blog

TensorFlow: Romancing with TensorFlow and NLP

11 Jul 2022 (updated: 01 Jun 2026) / 39 minutes to read

Elena Daehnhardt


Jasper AI-generated art, January 2023


TL;DR:
  • Use Tokenizer for text preprocessing, one-hot encoding for categorical text, Embedding layers for word vectors. Build LSTM/RNN models for sequence generation—essential for NLP tasks.

Previous: Part 18 — Decision Tree versus Random Forest, and Hyperparameter Optimisation

Next: Part 20 — Cross-Validation Techniques

What Is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to preprocess, analyse, and generate human language in textual or voice form. NLP powers many automated tools: text translation, spell checking, search autocompletion, abstract generation, voice text messaging, messenger bots, chatbots, question-answering systems, and virtual assistants such as Amazon Alexa. These tools can also, to a certain extent, “understand” the meaning, intent, and sentiment of a text, or find named entities such as personal names or cities. I like this short definition of NLP from Wikipedia:

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, mainly how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights in the papers and categorize and organize the documents themselves.

NLP can also be used for natural language generation. For instance, a poem generator created this sonnet for me:

Ode to the Ocean

My square Ocean, you inspire me to write.
How I love the way you sings, talk and walk,
Invading my mind day and through the night,
Always dreaming about the good crosstalk.

Let me compare you to a direct sky?
You are more honest, dishonest and nice.
True drought dries the fond picnics of July,
And summertime has the attractive gneiss.

How do I love you? Let me count the ways.
I love your loving eyes, smile and kindness.
Thinking of your longing smile fills my days.
My love for you is the river blindness.

Now I must away with a perfect heart,
Remember my straight words whilst we're apart.

Not bad, is it? We don’t know precisely which algorithm is at work here. Still, we can do so much with NLP in just a few lines of code! In this article, we focus on the fundamentals that help us create our own poem generator: we can out-write the most famous authors in a couple of minutes! Will these poems be beautiful? Let’s go ahead and check!

What are we going to do? How will we create our poem/sonnet generator? We can implement the poem generator in different ways, but to begin with, we can follow this strategy:

  1. Firstly, we need some existing text created by human writers. This text will be used to “teach” our program how to write poems. I suggest extracting text from The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren. You can choose another text; just change the web source to your liking. We will preprocess the extracted text by removing unnecessary parts, such as the beginning of the file and some concluding parts.
  2. Secondly, we tokenize text, thus converting it to numerical format, which is needed for neural networks.
  3. Next, we build the poem-generating model with the help of the TensorFlow library and Keras.
  4. Finally, we generate our love poem!

How did I come up with this solution? As you have perhaps read in my previous posts, I follow the Udemy course “TensorFlow Developer Certificate in 2022: Zero to Mastery.” On the way, I go deeper into each topic and do a write-up in my blog posts, each of them containing a complete deep-learning application that uses the knowledge I get from the course and other resources I find online (listed in the Reference section). In this post, I will focus on chapter 08, “Natural Language Processing with TensorFlow.” I want to create a simple poem-generating model and also describe things I have learned on the way. Hopefully, it will be helpful for some of you, and it helps me keep track of my learning process and do a bit more than merely digesting the ready material.

Text Vectorisation Concepts for NLP: One-Hot, Integer, and Word Embeddings

Let’s start with basic NLP concepts relevant to our poem generation task. The most important concept is converting text into numbers, since neural networks work with numeric data. Converting text into numbers is called vectorisation, which can be done in different ways. I will show an example of representing the phrase “the sun shines” using several methods of text vectorisation. For simplicity, let’s assume that we have five words [“the”, “sun”, “shines”, “it”, “rains”] in our vocabulary.

One-Hot Encoding in NLP

One-hot encoding is a text vectorisation method that represents each word as a binary vector the length of the vocabulary (the number of unique words), with a single “1” marking the word’s position and “0” everywhere else. One-hot encoding is relatively inefficient because each vector contains many “0” values and is therefore very sparse 6.

         the  sun  shines  it  rains
the       1    0     0     0     0
sun       0    1     0     0     0
shines    0    0     1     0     0
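
As a rough illustration of the table above, the one-hot vectors for our toy vocabulary can be built in plain Python. The helper name one_hot is my own and is not part of Keras; this is only a sketch of the idea.

```python
# Toy vocabulary from the example above
vocabulary = ["the", "sun", "shines", "it", "rains"]

def one_hot(word, vocabulary):
    """Return a binary vector with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Encode the phrase "the sun shines" word by word
encoded = [one_hot(word, vocabulary) for word in "the sun shines".split()]
for word, vector in zip("the sun shines".split(), encoded):
    print(word, vector)
```

Every vector is as long as the vocabulary, which is why this representation becomes wasteful once the vocabulary grows to thousands of words.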

Integer Encoding of Text

Integer encoding is a text vectorisation method that assigns each word an arbitrary integer. Integer encoding is efficient in terms of memory storage but makes meaningful models hard to build, because the integers do not relate to word meanings 6.

the  sun  shines
 1    2     3
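
The same phrase can be integer-encoded with a simple dictionary. This is a minimal sketch; the name word_to_index is illustrative, and starting the numbering at 1 is an assumption that leaves 0 free for padding.

```python
vocabulary = ["the", "sun", "shines", "it", "rains"]

# Assign each word an integer, starting at 1 (0 is often kept for padding)
word_to_index = {word: index for index, word in enumerate(vocabulary, start=1)}

encoded = [word_to_index[word] for word in "the sun shines".split()]
print(encoded)
```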

Word Embeddings with the Keras Embedding Layer

Word embeddings are a text vectorisation technique that represents words as dense, learned vectors so that semantically similar words sit close together in a continuous vector space. Word embeddings are learned with backpropagation, and after a model is trained they capture similarities between words 6. In Keras, we have the Embedding layer, which is configured with the vocabulary length (number of words) and the number of embedding dimensions. The Embedding layer can be used as part of a deep neural network or trained separately on particular training data; the words are positioned in a continuous vector space based on their surroundings (the other words used together with them).

The Keras Embedding layer takes the “input_dim” parameter, which defines the size of the text vocabulary (number of unique words), and “output_dim”, which determines the size of the vector space and is chosen experimentally according to the application. The “input_length” is the fixed length of input sentences or the maximum number of words in input documents. Thus, all inputs should be of the same size. This is why we will pad the sequences, as I will describe later in the code section. We will use the tf.keras.utils.pad_sequences function for padding sequences.

from tensorflow.keras.layers import Embedding

# Embed a 5-word vocabulary (usually thousands of words in practice) into 3 dimensions.
embedding_layer = Embedding(input_dim=5, output_dim=3, input_length=4)

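For each input document, the Embedding layer outputs a 2D array with one embedding vector per word; for a batch of documents, the output is a 3D tensor of shape (batch_size, input_length, output_dim).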
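
Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below mimics that lookup with random (untrained) weights to show the resulting shapes. The names embedding_table and word_ids are illustrative, and a real Keras layer would learn the table during training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# One row per vocabulary entry (input_dim=5), three values per row (output_dim=3)
embedding_table = rng.normal(size=(5, 3))

# A batch with one document of four word indices (input_length=4)
word_ids = np.array([[0, 1, 2, 0]])

# Indexing the table with the ids performs the embedding lookup
embedded = embedding_table[word_ids]
print(embedded.shape)  # (batch_size, input_length, output_dim)
```

The first and last positions hold the same index, so they receive the same vector: the lookup depends only on the word index, not on the position.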

It is essential to mention that we don’t always need to learn the word embeddings “from scratch.” We can also use pre-trained embeddings, whose explanation is a bit outside the scope of this post (I plan to investigate them in detail in one of my next posts). You can read more on GloVe and on using pre-trained word embeddings in the Keras documentation. You might also find the Word2Vec tutorial at tensorflow.org interesting.

Tokenizing Text with the Keras Tokenizer

Tokenization is the NLP preprocessing step that converts raw text into integer sequences, one integer per word, so the text can be fed into a Keras Embedding layer. We will use word embeddings, which are helpful because we (ideally) want to capture the meanings of words for writing poems. Before feeding our text into the word embedding, we must vectorise the text because the Keras Embedding layer requires the input data to be represented by integers, one defined for each word. This process is called tokenization, and Keras has a ready solution for it. We will use the Keras Tokenizer, which conveniently prepares our text by lowercasing it and removing the punctuation marks. As a result, the Tokenizer builds the word index. For instance, with the small “weatherly” text corpus [“The sun shines”, “it rains”], consisting of two text documents describing different weather conditions, we build the tokenizer:

from tensorflow.keras.preprocessing.text import Tokenizer

text = ["The sun shines","it rains"] 

tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)

# Total number of words
vocabulary_length = len(tokenizer.word_index)
print(f"The vocabulary length is {vocabulary_length} words.")
The vocabulary length is 5 words.

We use “texts_to_sequences” with the fitted tokenizer to tokenize a text string. Notice from the tokenization examples that the tokenizer ignores the unknown words (“cats”, “and”, “dogs”) in the last sentence. To encode them, we must refit the tokenizer on a text corpus containing these words.

# Tokenization examples
sentences = [["The sun"], ["The sun shines"], ["It rains"], ["It rains cats and dogs"]]
for sentence in sentences:
  print(f"{sentence}: {tokenizer.texts_to_sequences(sentence)}")
['The sun']: [[1, 2]]
['The sun shines']: [[1, 2, 3]]
['It rains']: [[4, 5]]
['It rains cats and dogs']: [[4, 5]]
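
To make this behaviour concrete, here is a simplified pure-Python imitation of what the tokenizer does: lowercase the text, drop punctuation, number the words by frequency, and skip words it has never seen. It is not the Keras implementation (the function names are mine, and the real Tokenizer has more options, such as oov_token), only a sketch.

```python
import string
from collections import Counter

def split_words(sentence):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans({char: " " for char in string.punctuation})
    return sentence.lower().translate(table).split()

def fit_word_index(texts):
    """Index words from 1, most frequent first (ties keep first-seen order)."""
    counts = Counter(word for sentence in texts for word in split_words(sentence))
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: index for index, word in enumerate(ranked, start=1)}

def to_sequence(sentence, word_index):
    """Map known words to integers and silently drop unknown ones."""
    return [word_index[word] for word in split_words(sentence) if word in word_index]

word_index = fit_word_index(["The sun shines", "it rains"])
print(word_index)
print(to_sequence("It rains cats and dogs", word_index))
```

The last call returns only the indices of “it” and “rains”, mirroring the output above.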

Sequence Modeling with RNNs and LSTMs

Sequence modeling is a machine learning approach that predicts the next element in an ordered series, such as the next word in a sentence, based on the elements that precede it. Text is a sequence of words and punctuation marks joined together. We can predict which words can come next based on the previous words. For instance, we can expect that after the word “good,” the next word will be “afternoon”. We thus rely on the word combinations and their order in human speech. Since we are creating a poem generator, we want to build text sequences automatically using sequence prediction techniques. Sequence modeling algorithms are implemented in TensorFlow. These algorithms are used not only in text generation but also in other tasks, such as working with video frames or sound sequences.

We will use Recurrent Neural Networks (RNNs), particularly the Bidirectional layer in Keras (tf.keras.layers.Bidirectional), and Long Short-Term Memory (LSTM) networks (see tf.keras.layers.LSTM), for our text sequence prediction task. A Recurrent Neural Network (RNN) is a neural network architecture for sequential data in which a series of networks are joined together, passing and preserving the outputs of previous networks as inputs to the next ones; RNNs use “hidden states” to store the memory of the computations. Long Short-Term Memory (LSTM) is an RNN architecture that adds “forget” gates to retain long-range dependencies and mitigate the vanishing gradient problem 10. On LSTM, see also the paper “Long Short-Term Memory” by Hochreiter and Schmidhuber.
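
The idea of a hidden state that is passed from step to step can be sketched in a few lines of NumPy. This is a plain (simple) recurrent step with random, untrained weights, not an LSTM; the names W_input, W_hidden, and rnn_step are illustrative assumptions. An LSTM would add gates on top of this basic recurrence.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
input_size, hidden_size = 3, 4

# Randomly initialised weights; a trained network would learn these
W_input = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hidden = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
bias = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(W_input @ x_t + W_hidden @ h_prev + bias)

# A sequence of five input vectors (for example, five word embeddings)
sequence = rng.normal(size=(5, input_size))

hidden_state = np.zeros(hidden_size)
for x_t in sequence:
    hidden_state = rnn_step(x_t, hidden_state)

print(hidden_state)
```

The final hidden_state depends on every element of the sequence, which is what lets the network use earlier words when predicting the next one.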

Python Code: Building a Poem Generator with TensorFlow and Keras

This is the most exciting part. We will code our poem generator with the help of TensorFlow and Keras, using poems extracted from The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren. You can get all the code from this post at my GitHub repository “deep_learning_notebooks”.

Extracting a Poetry Corpus from Project Gutenberg

I have created a simple function using requests to get a text corpus with poems from the www.gutenberg.org database. We need to check manually where the poem text blocks start and end, for which we have the “start_phrase” and “end_phrase” input variables. Notice that we strip the text and break it into lines at each carriage return.

Why do we need to know the average number of words in a line? When we generate our poem, we need to define how long our poetry lines should be.

import requests

def get_corpus(url, get_part=True, start_phrase="", end_phrase=""):
    """
    Extracts text from a file located at the provided web address.
    :param url: Link to the text file
    :param get_part: when True, the positions of start_phrase and end_phrase are located
    :param start_phrase: phrase marking the beginning of the poems
    :param end_phrase: phrase marking the end of the poems
    :return: a list of stripped text lines and the average number of words in a line.
    """
    try:
        text = requests.get(url).text
    except requests.RequestException:
        print("Can not load the document at: " + str(url))
        # Return an empty corpus so that the caller can still unpack two values
        return [], 0

    if get_part:
        start = text.find(start_phrase)  # skip header
        end = text.rfind(end_phrase)  # skip extra text at the end
        # Note: start and end are located but not applied as a slice here,
        # which is why the header still appears in the output below

    text = text.strip()

    # Split text on carriage returns
    text = text.split('\r')

    # Strip off new lines and empty spaces from the text
    text = [t.strip() for t in text]

    average_number_of_words_in_line = round(sum([len(s.split()) for s in text]) / len(text))
    return text, average_number_of_words_in_line

# Getting and preprocessing a text corpus
text, average_words_number = get_corpus(url="https://www.gutenberg.org/cache/epub/45470/pg45470.txt", get_part=True, start_phrase="THE SHINING HOURS",
                    end_phrase="End of the Project Gutenberg EBook" )
text[:10]
['\ufeffThe Project Gutenberg EBook of The Love Poems, by Émile Verhaeren',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever. You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org/license',
 '',
 '',
 'Title: The Love Poems',
 "(From Les Heures claires, Les Heures d'après-midi, Les Heures du Soir)"]

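The printed lines above still begin with the Project Gutenberg header. If you want to keep only the part between start_phrase and end_phrase, the positions returned by find and rfind can be used to slice the string. The helper trim_between below is hypothetical (it is not part of the code above) and is shown as a sketch on a made-up sample string.

```python
def trim_between(text, start_phrase, end_phrase):
    """Return the part of text from start_phrase up to (not including) end_phrase."""
    start = text.find(start_phrase)   # first occurrence, -1 if missing
    end = text.rfind(end_phrase)      # last occurrence, -1 if missing
    if start == -1:
        start = 0
    if end == -1 or end < start:
        end = len(text)
    return text[start:end].strip()

sample = "HEADER\nTHE SHINING HOURS\nA poem line\nEnd of the Project Gutenberg EBook"
trimmed = trim_between(sample, "THE SHINING HOURS", "End of the Project Gutenberg EBook")
print(trimmed)
```

Falling back to the whole text when a phrase is missing is a design choice of this sketch; raising an error instead would make a wrong phrase easier to spot.
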
Creating a Keras Tokenizer from the Corpus

As we already know, once we have the text corpus, we must convert it into numbers. In Keras, we can use the “Tokenizer” class to tokenize text.

from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(text):
    """
    Returns tokenizer and total words number based on the extracted text.
    :param text: a text corpus, extracted and preprocessed with get_corpus()
    :return: tokenizer, total words number
    """
    # Please note that I have removed symbols [!.,;:] from the default filters value
    # This helps to preserve punctuation to a certain extent
    tokenizer = Tokenizer(filters='"#$%&()*+-/<=>?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(text)

    # Total number of words, plus one because index 0 is reserved for padding
    vocabulary_length = len(tokenizer.word_index) + 1
    return tokenizer, vocabulary_length

# Tokenizing the extracted text
tokenizer, vocabulary_length =  create_tokenizer(text)
print(vocabulary_length)
3714
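
To see what the custom filters value changes, the sketch below imitates the filtering step in plain Python: every filtered character is replaced with a space before splitting. The function split_with_filters is my own simplification, not the Keras code; default_filters is the default value of the Tokenizer filters argument.

```python
def split_with_filters(sentence, filters):
    """Lowercase, replace every filtered character with a space, then split."""
    table = str.maketrans({char: " " for char in filters})
    return sentence.lower().translate(table).split()

default_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
custom_filters = '"#$%&()*+-/<=>?@[\\]^_`{|}~\t\n'

line = "Love, my love; stay!"
print(split_with_filters(line, default_filters))
print(split_with_filters(line, custom_filters))
```

With the custom filters, punctuation stays attached to the words, so “love,” and “love;” become separate vocabulary entries. That preserves some punctuation in the generated poems at the cost of a larger vocabulary.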

Padding Sequences with pad_sequences

We saw that our token vectors had different lengths when we tokenized our text. However, the deep-learning model in Keras needs input vectors of the same length. This is why we pad sequences as follows. I have created pack_sequences, which uses the Keras pad_sequences function for the actual padding of the sentences in our text corpus.

# Import required libraries
import numpy as np
import tensorflow.keras.utils as kerasutils
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pack_sequences(text, tokenizer, total_words_number):
  """
  Based on the corpus of documents and tokenizer, create padded sequences for the further prediction task
  :param text: list of text lines (the corpus)
  :param tokenizer: tokenizer fitted to the corpus
  :param total_words_number: number of unique words in the corpus (plus one for the padding index)
  :return: maximum length of sequences, predictors, and labels
  """
  # create input sequences using list of tokens
  input_sequences = []
  for line in text:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)

  # pad sequences
  max_sequence_len = max([len(x) for x in input_sequences])
  input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

  # create predictors and labels
  predictors, labels = input_sequences[:, :-1], input_sequences[:, -1]

  labels = kerasutils.to_categorical(labels, num_classes=total_words_number)
  return max_sequence_len, predictors, labels


# Pad text sequences
sequence_length, predictors, labels = pack_sequences(text, tokenizer, vocabulary_length)
print(sequence_length)
15
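
The steps inside pack_sequences can be followed on a tiny example without Keras. The sketch below builds the same kind of n-gram prefixes, pre-pads them with zeros, and splits them into predictors and labels; make_training_pairs is an illustrative name, and the one-hot conversion of the labels is left out.

```python
def make_training_pairs(token_lists):
    """Build pre-padded n-gram prefixes and split them into predictors and labels."""
    prefixes = []
    for tokens in token_lists:
        # [1, 2, 3] yields [1, 2] and [1, 2, 3]
        for end in range(2, len(tokens) + 1):
            prefixes.append(tokens[:end])

    max_len = max(len(prefix) for prefix in prefixes)
    # Pre-padding: fill with zeros on the left up to the longest prefix
    padded = [[0] * (max_len - len(prefix)) + prefix for prefix in prefixes]

    predictors = [row[:-1] for row in padded]  # all tokens but the last
    labels = [row[-1] for row in padded]       # the token to predict
    return max_len, predictors, labels

max_len, predictors_demo, labels_demo = make_training_pairs([[1, 2, 3], [4, 5]])
print(max_len, predictors_demo, labels_demo)
```

Each line of the corpus therefore contributes several training examples, one for every prefix of at least two tokens.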

Building the Sequential Model with Bidirectional LSTM

Next, we create a Sequential Keras model for generating love poems. Notice that we use the Embedding layer from Keras and Bidirectional LSTM networks.

# Import model layers and Adam optimiser
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


def create_model(vocabulary_length, sequence_length):
  model = Sequential()
  model.add(
        Embedding(input_dim=vocabulary_length, output_dim=100, input_length=sequence_length - 1))
  model.add(Bidirectional(LSTM(150, return_sequences=False))) 
  model.add(Dense(vocabulary_length, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
  return model

# Create the poem-generating model
poems = create_model(vocabulary_length, sequence_length)

# Print the model summary
print(poems.summary())
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 14, 100)           371400    
                                                                 
 bidirectional (Bidirectiona  (None, 300)              301200    
 l)                                                              
                                                                 
 dense (Dense)               (None, 3714)              1117914   
                                                                 
=================================================================
Total params: 1,790,514
Trainable params: 1,790,514
Non-trainable params: 0
_________________________________________________________________
None
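
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the standard LSTM layout of four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights, and a bias.

```python
vocabulary_size = 3714   # from the tokenizer above
embedding_dim = 100      # output_dim of the Embedding layer
lstm_units = 150         # units in each LSTM direction

# Embedding: one vector of embedding_dim values per vocabulary entry
embedding_params = vocabulary_size * embedding_dim

# LSTM: four weight sets, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# Bidirectional wraps two LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params

# Dense: weights from the 2 * lstm_units inputs plus one bias per output
dense_params = (2 * lstm_units + 1) * vocabulary_size

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

Most of the parameters sit in the final Dense layer, because it needs one output per vocabulary word.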

Next, we fit the compiled model. I have created an early stopping callback to reduce the training time, so the model may need fewer than 50 epochs.


import tensorflow as tf

# Create an early stopping callback that stops training once the monitored loss stops improving
def create_early_stopping_callback(monitor="loss", patience=2, restore_best_weights=False):
  early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor=monitor,
                                                             patience=patience,
                                                             verbose=1,
                                                             restore_best_weights=restore_best_weights)

  return early_stopping_callback

# Fit the compiled model
history = poems.fit(predictors, 
                    labels, 
                    epochs=50, 
                    callbacks=[create_early_stopping_callback()],
                    verbose=1)
Epoch 11/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4513 - accuracy: 0.8919
Epoch 12/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4370 - accuracy: 0.8940
Epoch 13/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4253 - accuracy: 0.8947
Epoch 14/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4116 - accuracy: 0.8967
Epoch 15/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4013 - accuracy: 0.8960
Epoch 16/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3951 - accuracy: 0.8974
Epoch 17/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3909 - accuracy: 0.8980
Epoch 18/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3830 - accuracy: 0.8988
Epoch 19/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3780 - accuracy: 0.8993
Epoch 20/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3645 - accuracy: 0.9008
Epoch 21/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3649 - accuracy: 0.9010
Epoch 22/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3593 - accuracy: 0.9008
Epoch 23/57
451/451 [==============================] - 30s 66ms/step - loss: 0.3581 - accuracy: 0.9007
Epoch 24/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3553 - accuracy: 0.9010
Epoch 25/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3461 - accuracy: 0.9019
Epoch 26/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3438 - accuracy: 0.9029
Epoch 27/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3444 - accuracy: 0.9026
Epoch 28/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3410 - accuracy: 0.9015
Epoch 29/57
451/451 [==============================] - 30s 68ms/step - loss: 0.3370 - accuracy: 0.9026
Epoch 30/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3348 - accuracy: 0.9036
Epoch 31/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3292 - accuracy: 0.9038
Epoch 32/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3321 - accuracy: 0.9036
Epoch 33/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3292 - accuracy: 0.9031
Epoch 34/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3265 - accuracy: 0.9024
Epoch 35/57
451/451 [==============================] - 31s 69ms/step - loss: 0.3225 - accuracy: 0.9040
Epoch 36/57
451/451 [==============================] - 30s 68ms/step - loss: 0.3210 - accuracy: 0.9043
Epoch 37/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3198 - accuracy: 0.9041
Epoch 38/57
451/451 [==============================] - 31s 69ms/step - loss: 0.3201 - accuracy: 0.9041
Epoch 39/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3242 - accuracy: 0.9031
Epoch 39: early stopping
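
Why did the training stop at epoch 39? With patience=2, training ends after two consecutive epochs without an improvement of the monitored loss. The sketch below replays that rule on the last five loss values from the log. It is a simplified model of the patience logic (the function name is mine), and the Keras callback has more options, such as min_delta.

```python
def early_stopping_epoch(losses, patience=2):
    """Return the 1-based position at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for position, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return position
    return None

# Losses of epochs 35 to 39 from the log above
last_losses = [0.3225, 0.3210, 0.3198, 0.3201, 0.3242]
stop_position = early_stopping_epoch(last_losses, patience=2)
print(stop_position)
```

The best loss is reached at the third value (epoch 37); the next two epochs do not improve on it, so the rule triggers at the fifth value, which is epoch 39.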

Inspecting Model Predictions Layer by Layer

Right now, we see the fitted model as a black box and have no idea what the data looks like on each layer, right? Fortunately, we can check the layers attribute of a model.

# Model layers
poems.layers
[<keras.layers.embeddings.Embedding at 0x7fdea56d8a90>,
 <keras.layers.wrappers.Bidirectional at 0x7fdea560b410>,
 <keras.layers.core.dense.Dense at 0x7fdeab22c850>]

And we can see how the model prediction works on each layer with this code. The data on each layer is represented by float numbers. Note that we preprocess the seed text so that text strings can be fed into the model.

from tensorflow.keras.models import Model

# Text for predictions
seed_text = "Call me"

def preprocess(seed_text):
  # Convert the text to token indices and pre-pad to the model input length
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=sequence_length - 1, padding='pre')
  return token_list


print("Model outputs on each layer")

for i in range(0, len(poems.layers)):
    model = Model(poems.layers[0].input, poems.layers[i].output)
    output = model.predict(preprocess(seed_text))
    print(f"======= {i}: {poems.layers[i]} ========")
    print(f"{output}")
Model outputs on each layer
======= 0: <keras.layers.embeddings.Embedding object at 0x7fdea56d8a90> ========
[[[-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  ...
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [ 0.01843664  0.02065328  0.03408707 ... -0.01587552 -0.02026105
   -0.03027008]
  [-0.0420275  -0.3784417   0.02906846 ... -0.10219543  0.16914204
    0.08471879]]]
======= 1: <keras.layers.wrappers.Bidirectional object at 0x7fdea560b410> ========
[[ 1.67686522e-01  3.02018255e-01 -2.93912590e-01  1.64449215e-03
   1.40806567e-03  5.34395039e-01 -5.08705258e-01 -7.53865242e-01
  -3.80721778e-01 -7.58378804e-01 -1.64750755e-01 -7.17333496e-01
  -9.89738345e-01  4.30853665e-02 -5.30713797e-01  5.23770034e-01
   9.85164642e-01  8.51590157e-01  9.70998406e-03 -2.37380475e-01
   9.38392699e-01 -2.39028364e-01 -9.16242719e-01 -9.21792686e-01
   7.33635843e-01  2.30621308e-01  5.96461952e-01 -9.17051673e-01
   4.89480674e-01 -5.30641973e-02 -7.75405169e-01 -7.62042046e-01
   4.63971496e-03  9.10258114e-01  5.64844608e-01 -7.50938952e-02
   1.10193595e-01 -4.62228358e-02  9.75572228e-01  5.35259187e-01
   5.10092378e-02 -4.58493531e-02 -1.88940186e-02 -2.20926225e-01
  -5.70184290e-01  7.91557312e-01  9.86257732e-01 -9.96294260e-01
   2.72399634e-01  6.37552261e-01 -3.62190276e-01 -2.24583745e-02
  -9.04657785e-03 -5.49311399e-01 -7.59658933e-01 -6.54185176e-01
   7.72388577e-01  1.16069829e-02  9.47195709e-01 -5.54253638e-01
   2.20582038e-01 -9.54650819e-01  2.39327431e-01  2.00341687e-01
   9.60499272e-02  6.15832925e-01  1.87492281e-01  9.00895238e-01
  -6.41395688e-01  7.33810246e-01 -9.99938607e-01  8.40495110e-01
  -5.22413254e-02 -4.91351753e-01 -6.62701607e-01 -1.55056305e-02
   1.54367298e-01 -1.52263790e-01  2.44105130e-01  7.19992101e-01
   5.31756759e-01 -8.88277531e-01 -2.73087919e-02  6.41297936e-01
   7.52824605e-01  4.38874960e-01  2.48972699e-01 -1.22749002e-03
   1.42141759e-01 -7.29399204e-01 -4.50825542e-01  1.07500324e-04
   7.79183805e-01 -7.54467368e-01  3.15285325e-01  5.95727742e-01
   5.02261162e-01 -9.08636808e-01  7.24367082e-01  7.56709099e-01
   1.74799770e-01  7.59554565e-01 -4.18312430e-01  1.46734178e-01
   5.55629790e-01  6.32315129e-02  6.81187034e-01 -4.69827652e-02
   9.97444391e-01  8.27595413e-01  5.44130802e-01  3.49239111e-02
   7.38222957e-01  7.53437042e-01 -5.86992919e-01  5.72768524e-02
  -7.34889805e-02 -7.20157087e-01  9.30093825e-01  9.55041409e-01
  -6.99992836e-01  8.60040247e-01  7.26323009e-01 -7.30807841e-01
  -8.93093228e-01  9.73617852e-01 -5.33384562e-01 -3.21368098e-01
  -3.93123746e-01  6.90287948e-02  3.36278915e-01  4.27034825e-01
  -9.57970023e-01 -2.38073617e-01 -4.31691557e-02 -5.65511882e-01
   8.56223226e-01 -8.75256002e-01  7.03857481e-01 -1.98097751e-02
  -6.73841417e-01 -9.82385278e-01  8.26719627e-02  2.90077686e-01
   7.10362434e-01  8.36537302e-01  9.96259749e-01  6.98546367e-03
  -6.56979918e-01  6.16679788e-01  7.31738191e-03 -3.62134457e-01
   8.91300291e-02  3.79151672e-01  1.50031667e-06  3.50421309e-01
  -1.40906975e-03 -5.25088571e-02 -3.34957476e-05  9.98893917e-01
   5.28899848e-01  9.89395678e-01 -1.29087281e-03 -6.55309930e-02
   1.96016908e-01 -1.16327172e-03 -4.14484888e-01 -9.99033928e-01
  -1.97359500e-03  6.75709784e-01  9.72128782e-06  9.36290264e-01
  -7.56961823e-01 -5.42829512e-04  9.17187095e-01  1.36587454e-03
  -9.94153798e-01  9.98658001e-01  7.32723236e-01 -3.47359121e-01
  -7.54145861e-01  6.15631461e-01  9.96937990e-01  7.14225054e-01
  -9.24452603e-01 -4.95396316e-01 -6.13545299e-01 -2.47467797e-05
  -9.77903903e-01  1.20051473e-01  7.62190878e-01 -9.36906457e-01
   5.77772385e-04  6.71585798e-01  3.25368298e-03  1.94000095e-01
   1.76189214e-01 -7.35071719e-01 -4.89228606e-01 -9.72593665e-01
   6.09238148e-01  2.99195573e-02  5.77019854e-03  3.39788087e-02
  -9.80277836e-01 -9.19586062e-01  1.62273951e-04 -1.92941315e-02
  -5.71717741e-03 -9.63379562e-01  9.76729810e-01 -2.38395496e-05
  -2.92603690e-02  6.55407924e-03 -9.07341833e-04 -1.66761765e-05
   1.38965889e-03  7.14213729e-01  1.79404858e-02 -2.41063803e-01
  -7.32580316e-04  5.58820784e-01 -4.36292320e-01  3.05520778e-04
   9.99281704e-01  1.64881028e-04 -9.33250666e-01  1.49732083e-02
   1.45692721e-01 -9.19333160e-01  8.78533483e-01  9.96746719e-01
   9.18693900e-01  1.22652799e-01 -1.65246171e-03 -9.92490530e-01
   1.25048589e-03  2.05047339e-01 -1.13654444e-02  9.25401032e-01
   4.14350331e-02 -7.52726614e-01 -4.56443131e-01  1.78641975e-02
  -1.60465036e-02  7.59996831e-01  9.70341563e-01  9.85438824e-01
   8.51573646e-01  5.91744669e-02  3.75899136e-01 -9.58901286e-01
   9.97678161e-01 -6.45919591e-02 -1.59563959e-01 -1.59562573e-01
   4.76286933e-03 -9.15662050e-01 -3.35523684e-04 -1.93180114e-01
   2.28081077e-01  6.28476322e-01 -8.47836375e-01  1.22441597e-05
   1.66022167e-01 -1.66538788e-03  8.67520630e-01  7.55602062e-01
   1.37379006e-01 -9.66367781e-01 -2.42418453e-01  2.08438367e-01
   5.53086340e-01 -1.33329080e-02 -9.54729497e-01  9.41386342e-01
   2.59216845e-01  3.48699152e-01 -3.15106094e-01  1.21884560e-03
   3.62821847e-01  9.96118784e-01  2.49868989e-01  1.00890247e-04
   9.75508749e-01  4.27330807e-02 -2.03665641e-05 -1.68836862e-01
   5.54664584e-04  9.82056737e-01  4.10370231e-01  9.47619259e-01
  -9.98544633e-01  9.82442915e-01 -1.90521474e-03 -7.00094461e-05
  -9.78231370e-01  1.60007745e-01  9.27375138e-01  2.38520373e-02]]
======= 2: <keras.layers.core.dense.Dense object at 0x7fdeab22c850> ========
[[1.3833241e-10 5.6871340e-05 5.4320190e-03 ... 6.6077948e-14
  1.3805723e-10 5.0975579e-09]]

Next, we can see the next-word prediction for our poem. Again, we need to preprocess the seed text into the numerical format, use our model to predict the next word, and find the word corresponding to the index calculated in the prediction step. The word following the seed_text = “Call me” is “to”.

def next_word(predicted):
  # Index of the most probable word for the single input sequence
  predicted_index = int(np.argmax(predicted, axis=-1)[0])
  # index_word maps indices back to words; index 0 (padding) has no word
  return tokenizer.index_word.get(predicted_index, "")

def predict_next_word(seed_text):
  predicted = poems.predict(preprocess(seed_text))
  return next_word(predicted)

seed_text = "Call me"
following_word = predict_next_word(seed_text)
print(following_word)

to
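
The lookup from a prediction to a word can be illustrated without a trained model. In this sketch, the probabilities list is a made-up stand-in for the softmax output of the network, and index_word is a toy mapping of the kind a fitted tokenizer provides.

```python
# Toy index-to-word mapping; index 0 is reserved for padding
index_word = {1: "the", 2: "sun", 3: "shines", 4: "it", 5: "rains"}

# Hypothetical softmax output for one input: one probability per index 0..5
probabilities = [0.01, 0.10, 0.15, 0.60, 0.04, 0.10]

# Greedy choice: the index with the highest probability
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted_word = index_word.get(predicted_index, "")
print(predicted_index, predicted_word)
```

Always taking the most probable word is called greedy decoding. It is simple and deterministic, so the same seed text always produces the same continuation.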

With the function predict_next_word(seed_text), we can easily generate the poem text sequences.

for i in range(0, 3):
  following_word = predict_next_word(seed_text)
  seed_text = seed_text + " " + following_word
  print(seed_text)
Call me to
Call me to breathe
Call me to breathe in

Generating Poem Text from a Seed Title

Ideally, we don’t need to write our poems ourselves. We just come up with titles, and the program will generate our poem automatically! For this, we initialize the “seed_text” variable with the poem title. The “next_words” variable defines how many words are generated in each line; in the code below, it also sets the number of lines per paragraph. The “paragraphs” variable specifies the number of paragraphs in the generated poem.

def write_poem(model, tokenizer, max_sequence_length, seed_text="The Moon and Sun", next_words=6, paragraphs=3):
    """
    Uses a fitted text-generating Keras Sequential model to write a poem.
    :param model: Keras Sequential model, fitted to a text corpus
    :param tokenizer: Tokenizer fitted to the same corpus
    :param max_sequence_length: Maximum length of text sequences
    :param seed_text: a text string to start poem generation
    :param next_words: Number of words in a line (also the number of lines in a paragraph)
    :param paragraphs: Number of paragraphs in the generated poem
    :return: text of the generated poem
    """
    poem = seed_text.capitalize() + "\n\n"
    for _ in range(paragraphs):
        paragraph = ""
        for _ in range(next_words):
            sentence = ""
            for _ in range(next_words):
                token_list = tokenizer.texts_to_sequences([seed_text])[0]
                token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
                predicted = model.predict(token_list)
                # Greedy decoding: take the index of the most probable word
                predicted_index = int(np.argmax(predicted, axis=-1)[0])
                # Index 0 is reserved for padding and maps to an empty string
                output_word = tokenizer.index_word.get(predicted_index, "")
                seed_text += " " + output_word
                sentence += " " + output_word
            paragraph += sentence.strip().capitalize() + "\n"
            # Start the next line from the last generated word
            seed_text = output_word
        # Start the next paragraph from the last generated line
        seed_text = sentence
        poem += paragraph + "\n"

    print(poem)
    return poem

We tokenize each seed text into a sequence of integers and pad it on the left to the input length the model expects. The padded token list is used to predict the next word, which is appended to the seed before the next prediction. The first word of each line is capitalized, and each line ends with a newline.
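The loop described above (tokenize, pad, predict, map the index back to a word) can be tried without a trained network. The sketch below is only an illustration: word_index is a made-up five-word vocabulary, toy_predict is a hypothetical stand-in for model.predict, and pad_pre imitates padding='pre' in plain Python. None of these are part of the poem model in this post.

```python
# Made-up vocabulary; indices start at 1 because 0 is reserved for padding.
word_index = {"call": 1, "me": 2, "to": 3, "breathe": 4, "in": 5}
index_word = {index: word for word, index in word_index.items()}


def pad_pre(tokens, maxlen):
    """Left-pad with zeros; longer inputs keep only their last maxlen tokens."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens


def toy_predict(padded):
    """Hypothetical model: one probability per vocabulary index (0 = padding)."""
    last = padded[-1]
    probs = [0.0] * (len(word_index) + 1)
    probs[0] = 0.1
    # Put most of the probability on the index after the last token, wrapping around.
    probs[last % len(word_index) + 1] = 0.9
    return probs


def generate(seed_text, n_words, maxlen=4):
    """Greedy decoding: append the most probable word, then predict again."""
    for _ in range(n_words):
        tokens = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        probs = toy_predict(pad_pre(tokens, maxlen))
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        seed_text += " " + index_word.get(predicted, "")
    return seed_text


print(generate("Call me", 3))  # Call me to breathe in
```

Because the most probable word is always chosen, the same seed always produces the same continuation with this toy predictor.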

End-to-End Poetry Generation Workflow

Finally, we join all the steps of text preprocessing, vectorisation and model training to write a poem with our code! Isn’t it excellent for folks that cannot write poetry? Will this approach lead to exciting results? Let’s see.

# Getting and preprocessing a text corpus
text, average_words_number = get_corpus(url="https://www.gutenberg.org/cache/epub/38572/pg38572.txt",
    get_part=True, start_phrase="LOVE SONNETS OF AN",
    end_phrase="_Now in Press_" )

# Tokenizing the extracted text
tokenizer, vocabulary_length = create_tokenizer(text)

# Pad text sequences
sequence_length, predictors, labels = pack_sequences(text, tokenizer, vocabulary_length)

# Create and compile the poem-generating model
poems = create_model(vocabulary_length, sequence_length)

# Print the model summary (summary() prints by itself and returns None)
poems.summary()

# Fit compiled model
history = poems.fit(predictors, labels, epochs=150, verbose=1)

# Generate poetry
write_poem(poems, tokenizer, 15, seed_text="Shine in the darkness", next_words=5, paragraphs=3)
Shine in the darkness

At the fall of evening,
I part your hair, and
I make towards you, happy
And serene, they believe eagerly;
Its offering, my joy and

The fervour of my flesh.
Oh! how everything, except that
Lives in the fine ruddy
Being seems to dwell in
The summer wind, this page

And that so so open
Forth in the general terms
Of this agreement, you may
My two hands against your
Eyes were then so pure

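We can see text strings such as “general terms” and “of this agreement” in our poem. These phrases read like licence boilerplate rather than poetry, which suggests that some non-poem text stayed in the training corpus. It shows that better text cleaning and removing unrelated content are essential.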
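One possible way to tighten the cleaning is to cut the corpus at marker phrases and then drop any remaining lines that mention licence-like words. The sketch below is a minimal illustration under that assumption: extract_between and drop_lines_containing are hypothetical helper names, and the raw string is made-up sample text, not the actual Gutenberg file.

```python
def extract_between(text, start_phrase, end_phrase):
    """Return the text from start_phrase up to end_phrase, or the whole text if a marker is missing."""
    start = text.find(start_phrase)
    end = text.find(end_phrase, start + 1) if start != -1 else -1
    if start == -1 or end == -1:
        return text
    return text[start:end]


def drop_lines_containing(text, unwanted_phrases):
    """Remove every line that mentions one of the unwanted phrases (case-insensitive)."""
    kept = [
        line for line in text.splitlines()
        if not any(phrase.lower() in line.lower() for phrase in unwanted_phrases)
    ]
    return "\n".join(kept)


# Made-up sample that imitates a poem wrapped in unrelated text.
raw = (
    "HEADER TEXT\n"
    "LOVE SONNETS\n"
    "At the fall of evening\n"
    "set forth in the general terms of this agreement\n"
    "I part your hair\n"
    "_Now in Press_\n"
    "FOOTER TEXT\n"
)

body = extract_between(raw, "LOVE SONNETS", "_Now in Press_")
clean = drop_lines_containing(body, ["agreement", "license"])
print(clean)
```

The list of unwanted phrases is a judgment call. It is worth printing a sample of the removed lines to check that no verses are lost.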

Conclusion: NLP Poem Generation with Keras

We have created a simple poem generation model with the Keras Sequential API. The generated poem is gibberish, with no plot or idea, but we have practised text preprocessing and learned general NLP concepts along the way. A Keras poem generator is a sequence model that converts a text corpus into padded integer sequences, learns word embeddings, passes them through a Bidirectional LSTM, and predicts the next word iteratively from a seed title. The next step is to improve the generator with better text cleaning and a larger corpus so it can write more coherent poems.

NLP and TensorFlow Poem Generation FAQ

What is the difference between one-hot encoding and word embeddings in NLP?

One-hot encoding represents each word as a sparse binary vector the length of the vocabulary, so it captures no relationship between words. Word embeddings represent each word as a dense, learned vector in a continuous space where semantically similar words sit close together. For poem generation, use a Keras Embedding layer rather than one-hot vectors.
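To make the contrast concrete, here is a small plain-Python sketch. The three-word vocabulary and the two-dimensional embedding values are hand-picked for illustration only; a real Embedding layer learns its vectors during training.

```python
import math

vocabulary = ["moon", "sun", "ocean"]


def one_hot(word):
    """Sparse vector with a single 1 at the word's position in the vocabulary."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]


# Hand-picked dense vectors (an assumption for this example, not learned values).
embedding = {
    "moon": [0.9, 0.1],
    "sun": [0.8, 0.3],
    "ocean": [-0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Two different one-hot vectors never overlap, so their similarity is zero.
print(cosine(one_hot("moon"), one_hot("sun")))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(cosine(embedding["moon"], embedding["sun"]) > cosine(embedding["moon"], embedding["ocean"]))  # True
```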

Why does the Keras Tokenizer ignore some words in texts_to_sequences?

The Tokenizer only knows the words it saw during fit_on_texts. Any out-of-vocabulary (OOV) word is silently dropped from the output sequence. To include new words such as cats and dogs, re-fit the tokenizer on a corpus that contains them, or pass an oov_token when creating the Tokenizer so that unknown words map to a placeholder index instead of disappearing.
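The effect is easy to reproduce with a tiny stand-in. TinyTokenizer below is a hypothetical class written for this example; it only mimics the dropping and placeholder behaviour described above and is not the Keras implementation (for instance, it numbers words in order of first appearance rather than by frequency).

```python
class TinyTokenizer:
    """Hypothetical stand-in that mimics OOV handling; not the Keras class."""

    def __init__(self, oov_token=None):
        self.oov_token = oov_token
        self.word_index = {}
        if oov_token is not None:
            # Reserve the first index for unknown words.
            self.word_index[oov_token] = 1

    def fit_on_texts(self, texts):
        for text in texts:
            for word in text.lower().split():
                if word not in self.word_index:
                    self.word_index[word] = len(self.word_index) + 1

    def texts_to_sequences(self, texts):
        sequences = []
        for text in texts:
            sequence = []
            for word in text.lower().split():
                if word in self.word_index:
                    sequence.append(self.word_index[word])
                elif self.oov_token is not None:
                    sequence.append(self.word_index[self.oov_token])
                # Without an oov_token the unknown word is silently skipped.
            sequences.append(sequence)
        return sequences


corpus = ["I love the ocean"]

plain = TinyTokenizer()
plain.fit_on_texts(corpus)
print(plain.texts_to_sequences(["I love cats and dogs"]))     # [[1, 2]]

with_oov = TinyTokenizer(oov_token="<OOV>")
with_oov.fit_on_texts(corpus)
print(with_oov.texts_to_sequences(["I love cats and dogs"]))  # [[2, 3, 1, 1, 1]]
```

With the placeholder, the sequence keeps its original length, so the model at least sees that three words followed "love".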

Why do you pad sequences before training a Keras text model?

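A Keras model needs input vectors of the same length, but tokenized sentences differ in length. tf.keras.utils.pad_sequences (or tensorflow.keras.preprocessing.sequence.pad_sequences) pads each sequence with zeros up to the length given in its maxlen argument (max_sequence_len in this post), using padding='pre' to add the zeros at the front. With the zeros at the front, the most recent words always sit at the end of the input.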
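Padding at the front is especially convenient when each training example is a growing prefix of a line and the last token is the label. The sketch below assumes that common n-gram prefix setup; pad_pre and make_training_pairs are hypothetical helpers written in plain Python, and the token values are made up.

```python
def pad_pre(sequence, maxlen):
    """Left-pad with zeros; longer inputs keep only their last maxlen tokens."""
    sequence = list(sequence)[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence


def make_training_pairs(token_lines):
    """Turn tokenized lines into (predictors, labels) for next-word prediction."""
    # Every prefix with at least two tokens becomes one training sequence.
    prefixes = [line[:end] for line in token_lines for end in range(2, len(line) + 1)]
    max_len = max(len(prefix) for prefix in prefixes)
    padded = [pad_pre(prefix, max_len) for prefix in prefixes]
    # Thanks to pre-padding, the label is always the final element of each row.
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels


predictors, labels = make_training_pairs([[4, 7, 2], [5, 1]])
print(predictors)  # [[0, 4], [4, 7], [0, 5]]
print(labels)      # [7, 2, 1]
```

Each predictor row has length max_len - 1, which matches the maxlen=max_sequence_length - 1 used when padding the seed text at generation time.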

Why use a Bidirectional LSTM for poem generation?

An LSTM is a recurrent architecture whose gates (including a forget gate) help it retain long-range dependencies and reduce the vanishing gradient problem. Wrapping it in tf.keras.layers.Bidirectional lets the model read each input sequence in both directions, which can improve next-word prediction on text sequences.

Did you like this post? Please let me know if you have any comments or suggestions.

References

1. Natural Language Processing

2. A poem generator

3. The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren.

4. TensorFlow Developer Certificate in 2022: Zero to Mastery

5. 08. Natural Language Processing with TensorFlow

6. Word embeddings

7. tf.keras.utils.pad_sequences

8. Glove. Using pre-trained word embeddings

9. Word2Vec

10. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.


About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2022) 'TensorFlow: Romancing with TensorFlow and NLP', daehnhardt.com, 11 July 2022. Available at: https://daehnhardt.com/blog/2022/07/11/python-natural-language-processing-tensorflow-one-hot-encodings-tokenizer-sequence-modeling-word-embeddings/