Elena's AI Blog

TensorFlow: Romancing with TensorFlow and NLP

11 Jul 2022 / 37 minutes to read

Elena Daehnhardt


Jasper AI-generated art, January 2023


Introduction

Today we have many automated tools that help us translate text, check spelling, autocomplete search queries, generate abstracts, transcribe voice messages, and power messenger bots, chatbots, question-answering systems, and virtual assistants such as Amazon Alexa, among other tools. All of these and much more are realised with AI techniques, specifically Natural Language Processing (NLP). NLP tools are employed to preprocess and analyse human language in textual or voice media and, to a certain extent, to “understand” its meaning, intent, and sentiment, or to find named entities such as personal names or cities. I like this short definition of NLP from Wikipedia:

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, mainly how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

NLP can also be used for natural language generation. For instance, here is a sonnet that a poem generator created for me:

Ode to the Ocean

My square Ocean, you inspire me to write.
How I love the way you sings, talk and walk,
Invading my mind day and through the night,
Always dreaming about the good crosstalk.

Let me compare you to a direct sky?
You are more honest, dishonest and nice.
True drought dries the fond picnics of July,
And summertime has the attractive gneiss.

How do I love you? Let me count the ways.
I love your loving eyes, smile and kindness.
Thinking of your longing smile fills my days.
My love for you is the river blindness.

Now I must away with a perfect heart,
Remember my straight words whilst we're apart.

Not bad, is it? We don’t know precisely how the algorithm works here. But we can do so much more with NLP in just a few lines of code! In this article, however, we focus on the fundamentals that help us create our own poem generator: we can out-write the most famous authors in a couple of minutes! Will these poems be beautiful? Let’s go ahead and check!

What are we going to do? How will we create our poem/sonnet generator? We can implement the poem generator in different ways, but to begin with, we can follow this strategy:

  1. Firstly, we will need to have some existing text created by human writers. This text will be used to “teach” our program how to write poems. I suggest extracting text from The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren. You can have your own choice; just change the web source to your liking. We will preprocess the extracted text by removing unnecessary parts, such as the beginning of the file and some concluding parts.
  2. Secondly, we tokenize text, thus converting it to numerical format, which is needed for neural networks.
  3. Next, we build the poem-generating model with the help of the TensorFlow library and Keras.
  4. Finally, we generate our love poem!

How did I come up with this solution? As you have perhaps read in my previous posts, I follow the Udemy course “TensorFlow Developer Certificate in 2022: Zero to Mastery.” Along the way, I go deeper into each topic and write it up in my blog posts, each of them containing a complete deep-learning application built with the knowledge I get from the course and other resources I find online (listed in the References section). In this post, I will focus on the chapter “08. Natural Language Processing with TensorFlow.” I want to create a simple poem-generating model and also describe the things I have learned on the way. Hopefully, it will be helpful for some of you, and it helps me keep track of my learning process and do a bit more than merely digest the ready material.

Basic Concepts

Let’s start with basic NLP concepts relevant to our poem generation task. The most important step is to convert text into numbers, since neural networks work with numeric data. This is called vectorisation, and it can be done in different ways. I will show an example of representing the phrase “the sun shines” using several methods of text vectorisation. For simplicity, let’s assume that we have only five words [“the”, “sun”, “shines”, “it”, “rains”] in our vocabulary.

One-hot Encodings

One-hot encoding creates numerical vectors whose length equals the size of the text vocabulary, that is, the number of unique words. The presence of each word is encoded by “1”, otherwise “0”. This approach is relatively inefficient because it includes too many “0”s and the resulting vectors are very sparse [6].

         the   sun   shines   it   rains
the       1     0      0       0     0
sun       0     1      0       0     0
shines    0     0      1       0     0
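
As a small illustration, here is a minimal sketch of producing such one-hot vectors with TensorFlow's tf.one_hot, assuming the five-word vocabulary above with indices assigned in order:

import tensorflow as tf

# Toy vocabulary from the example above; indices are assigned in order
vocabulary = ["the", "sun", "shines", "it", "rains"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# One-hot encode the phrase "the sun shines"
indices = [word_to_index[word] for word in ["the", "sun", "shines"]]
print(tf.one_hot(indices, depth=len(vocabulary)).numpy())
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]]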

Integer Encodings

Here we encode each word with an arbitrary integer. This kind of encoding is quite efficient in terms of memory storage; however, it is challenging to build meaningful models with it since the integers do not relate to the word meanings [6].

the   sun   shines
 1     2      3
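
For illustration, a minimal sketch of integer encoding with a hand-made index (the Keras Tokenizer, introduced below, builds such an index for us):

# Map each word to an arbitrary integer index
word_to_index = {"the": 1, "sun": 2, "shines": 3, "it": 4, "rains": 5}

encoded = [word_to_index[word] for word in "the sun shines".split()]
print(encoded)  # [1, 2, 3]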

Word Embeddings

Word embeddings are an approach to representing words as dense vectors. The word embeddings are learned with backpropagation. After a model is trained, the word embeddings also capture similarities between words [6]. In Keras, we have the Embedding layer, which is built with a specified vocabulary length (number of words) and a number of embedding dimensions. The Embedding layer can be used as a part of a deep neural network or separately, fitted to particular training data and learning from the provided words, which are positioned in a continuous vector space based on the words and their surroundings (the other words they are used together with).

The Keras Embedding layer takes an “input_dim” parameter defining the size of the text vocabulary (the number of unique words), and an “output_dim” parameter determining the size of the vector space, chosen experimentally according to the application. The “input_length” is the fixed length of input sentences, or the maximum number of words in the input documents. Thus, all inputs should be of the same size. This is why we will pad the sequences, as I will describe later in the code section. We will use the Keras pad_sequences function (see tf.keras.utils.pad_sequences) for padding sequences.

from tensorflow.keras.layers import Embedding

# Embed a 5-word vocabulary (usually thousands of words in practice) into 3 dimensions.
embedding_layer = Embedding(input_dim=5, output_dim=3, input_length=4)

For a single input document, the output of the Embedding layer is a 2D array with one embedding vector per word; for a batch of documents, the output gains an additional batch dimension.
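
As a quick check of these shapes, we can pass a batch containing one padded sequence of four token indices through the embedding_layer defined above (a minimal sketch; the float values are random before training):

import numpy as np

# A batch with one padded sequence of four token indices (input_length=4)
sample_batch = np.array([[1, 2, 3, 0]])

embeddings = embedding_layer(sample_batch)
print(embeddings.shape)  # (1, 4, 3): batch size, input_length, output_dim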

It is essential to mention that we don’t always need to learn the word embeddings “from scratch.” We can also use already pre-trained embeddings, whose explanation is a bit outside the scope of this post (I plan to investigate them in detail in one of my next posts). You can read more about GloVe in “Using pre-trained word embeddings” in the Keras documentation. You might also find the Word2Vec tutorial at tensorflow.org interesting.
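
For completeness, a rough sketch of the usual Keras pattern for plugging in pre-trained vectors; here embedding_matrix, embedding_dim and vocabulary_length are assumed to be prepared beforehand (for example, from the GloVe files), and the layer is frozen so the vectors are not updated during training:

from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

# Assumed to be prepared in advance: embedding_matrix of shape
# (vocabulary_length, embedding_dim), filled with pre-trained (e.g. GloVe) vectors
pretrained_embedding = Embedding(
    input_dim=vocabulary_length,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False)  # keep the pre-trained vectors frozen during training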

Tokenizer

Thus, we will use word embeddings, which is quite helpful because we (ideally) want to capture the meanings of words for writing poems. Before feeding our text into the word embedding, we must vectorise the text, because the Keras Embedding layer requires the input data to be represented by integers defined for each word. This process is called tokenization. Keras has a ready solution: we will use the Keras Tokenizer, which conveniently prepares our text, for example by lowercasing it and removing punctuation marks. As a result, the Tokenizer builds the text indexes. For instance, with the small “weatherly” text corpus [“The sun shines”, “it rains”], consisting of two text documents describing different weather conditions, we build the tokenizer:

from tensorflow.keras.preprocessing.text import Tokenizer

text = ["The sun shines","it rains"] 

tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)

# Total number of words
vocabulary_length = len(tokenizer.word_index)
print (f"The vocabulary length is {vocabulary_length} words.")
The vocabulary length is 5 words.

We use “texts_to_sequences” with the fitted tokenizer to tokenize a text string. You can notice from the tokenization examples below that the tokenizer ignores unknown words (“cats”, “and”, “dogs”) in the last sentence. We would have to refit the tokenizer on a new text corpus containing these words to include them.

# Tokenization examples
sentences = [["The sun"], ["The sun shines"], ["It rains"], ["It rains cats and dogs"]]
for sentence in sentences:
  print(f"{sentence}: {tokenizer.texts_to_sequences(sentence)}")
['The sun']: [[1, 2]]
['The sun shines']: [[1, 2, 3]]
['It rains']: [[4, 5]]
['It rains cats and dogs']: [[4, 5]]
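
If we would rather keep placeholders for unknown words than drop them, the Tokenizer accepts an oov_token; a small sketch:

from tensorflow.keras.preprocessing.text import Tokenizer

# A tokenizer with an out-of-vocabulary token keeps placeholders for unknown words
oov_tokenizer = Tokenizer(oov_token="<OOV>")
oov_tokenizer.fit_on_texts(["The sun shines", "it rains"])

print(oov_tokenizer.texts_to_sequences(["It rains cats and dogs"]))
# [[5, 6, 1, 1, 1]] -- "cats", "and" and "dogs" all map to the <OOV> index 1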

Sequence Modeling

Text is a sequence of words and punctuation marks joined together. We can predict which word comes next based on the previous words. For instance, we can expect that after the word “good,” the next word will be “afternoon”. We thus retain the word combinations and their order found in human speech. Since we are creating a poem generator, we want to build text sequences automatically using sequence prediction techniques. Luckily, we have sequence modeling algorithms realised in TensorFlow. They can be used not only for text generation but also for other tasks such as working with video frames or sound sequences.
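
To make this concrete, here is a tiny sketch of turning a sentence into (previous words → next word) training pairs; the pack_sequences function later in this post does essentially this with token indices:

# Build (context, next word) pairs from a sentence -- the training signal for next-word prediction
sentence = "the sun shines after the rain".split()

for i in range(1, len(sentence)):
    print(sentence[:i], "->", sentence[i])
# ['the'] -> sun
# ['the', 'sun'] -> shines
# ['the', 'sun', 'shines'] -> after
# ...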

We will use Recurrent Neural Networks (RNNs), in particular the Bidirectional layer in Keras (tf.keras.layers.Bidirectional) and Long Short-Term Memory (LSTM) networks (see tf.keras.layers.LSTM), for our text sequence prediction task. In an RNN, we have a series of network cells joined together, passing and preserving the outputs of previous steps as inputs to the next steps. RNNs use “hidden states” for storing the memory of the computations. LSTMs are similar to RNNs; additionally, LSTMs have “forget” gates that help in dealing with the vanishing gradient problem [10]. On LSTMs, see also the paper by Hochreiter and Schmidhuber, “Long Short-Term Memory”.
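
A minimal sketch of what the Bidirectional wrapper does to shapes: for a batch of embedded sequences, the forward and backward LSTM outputs are concatenated, doubling the output dimension (the input values here are random, just to show the shapes):

import tensorflow as tf

# A random batch of 2 sequences, each 14 steps long with 100 features (e.g. embeddings)
inputs = tf.random.normal((2, 14, 100))

bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150))
print(bi_lstm(inputs).shape)  # (2, 300): forward and backward outputs concatenated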

Get the Code

This is the most exciting part. We will code our poem generator with the help of TensorFlow and Keras, using poems extracted from The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren. You can get all the code from this post at my GitHub repository “deep_learning_notebooks”.

Getting Poetry Corpus

I have created a simple function using requests to get a text corpus with poems from the www.gutenberg.org database. We need to check manually where the poem text block starts and ends, which is what the “start_phrase” and “end_phrase” input variables are for. Notice that we strip the text and break it into lines at carriage returns.

Why do we need to know the average number of words? When we generate our poem, we need to define how long our poetry lines should be.

import requests

def get_corpus(url, get_part=True, start_phrase="", end_phrase=""):
    """
    Extracts text from a file located at the provided web address.
    :param url: Link to the text file
    :param get_part: when True, we get only text located between start_phrase and end_phrase strings
    :param start_phrase:
    :param end_phrase:
    :return: a list of stripped text lines (without carriage returns) and the average number of words in a line.
    """
    try:
        text = requests.get(url).text
    except requests.RequestException:
        print("Cannot load the document at: " + str(url))
        return False

    if get_part:
        start = text.find(start_phrase)  # skip the file header
        end = text.rfind(end_phrase)  # skip extra text at the end
        if start != -1 and end != -1:
            text = text[start:end]

    text = text.strip()

    # Split the text on carriage returns
    text = text.split('\r')

    # Strip off new lines and empty spaces from the text
    text = [t.strip() for t in text]

    average_number_of_words_in_line = round(sum([len(s.split()) for s in text]) / len(text))
    return text, average_number_of_words_in_line

# Getting and preprocessing a text corpus
text, average_words_number = get_corpus(url="https://www.gutenberg.org/cache/epub/45470/pg45470.txt", get_part=True, start_phrase="THE SHINING HOURS",
                    end_phrase="End of the Project Gutenberg EBook" )
text[:10]
['\ufeffThe Project Gutenberg EBook of The Love Poems, by Émile Verhaeren',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever. You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org/license',
 '',
 '',
 'Title: The Love Poems',
 "(From Les Heures claires, Les Heures d'après-midi, Les Heures du Soir)"]

Creating Tokenizer

As we already know, once we have the text corpus, we must convert it into numbers. In Keras, we can use the Tokenizer class, which helps to tokenize text.

from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(text):
    """
    Returns tokenizer and total words number based on the extracted text.
    :param text: a text corpus, extracted and preprocessed with get_corpus()
    :return: tokenizer, total words number
    """
    # Please note that I have removed the symbols [!.,;:] from the default filter value
    # This helps to preserve punctuation to a certain extent
    tokenizer = Tokenizer(filters='"#$%&()*+-/<=>?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(text)

    # Total number of words
    vocabulary_length = len(tokenizer.word_index) + 1
    return tokenizer, vocabulary_length

# Tokenizing the extracted text
tokenizer, vocabulary_length =  create_tokenizer(text)
print(vocabulary_length)
3714

Padding sentences

We saw that our token vectors were of different lengths when we tokenized our text. However, the deep-learning model in Keras needs input vectors of the same length. This is why we pad the sequences as follows. I have created the pack_sequences function, which uses the Keras pad_sequences function for the actual padding of the sentences in our text corpus.

# Import required libraries
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow.keras.utils as kerasutils
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pack_sequences(text, tokenizer, total_words_number):
  """
  Based on the corpus of documents and tokenizer, create padded sequences for the further prediction task
  :param text: Text strings (the corpus of documents)
  :param tokenizer: tokenizer
  :param total_words_number: unique number of words in the corpus
  :return: maximum length of sequences, predictors, and labels
  """
  # create input sequences using list of tokens
  input_sequences = []
  for line in text:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)

  # pad sequences
  max_sequence_len = max([len(x) for x in input_sequences])
  input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

  # create predictors and labels
  predictors, labels = input_sequences[:, :-1], input_sequences[:, -1]

  labels = kerasutils.to_categorical(labels, num_classes=total_words_number)
  return max_sequence_len, predictors, labels


# Pad text sequences
sequence_length, predictors, labels = pack_sequences(text, tokenizer, vocabulary_length)
print(sequence_length)
15
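
As a quick illustration of what the 'pre' padding does, here is a minimal sketch with two hypothetical token sequences of different lengths:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two token sequences of different lengths, padded with zeros on the left
print(pad_sequences([[1, 2], [1, 2, 3, 4]], maxlen=4, padding='pre'))
# [[0 0 1 2]
#  [1 2 3 4]]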

Creating the model

Next, we create a Sequential Keras model for generating love poems. Notice that we use the Embedding layer from Keras and Bidirectional LSTM networks.

# Import model layers and Adam optimiser
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


def create_model(vocabulary_length, sequence_length):
  model = Sequential()
  model.add(
        Embedding(input_dim=vocabulary_length, output_dim=100, input_length=sequence_length - 1))
  model.add(Bidirectional(LSTM(150, return_sequences=False))) 
  model.add(Dense(vocabulary_length, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
  return model

# Create the poem-generating model
poems = create_model(vocabulary_length, sequence_length)

# Print the model summary
print(poems.summary())
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 14, 100)           371400    
                                                                 
 bidirectional (Bidirectiona  (None, 300)              301200    
 l)                                                              
                                                                 
 dense (Dense)               (None, 3714)              1117914   
                                                                 
=================================================================
Total params: 1,790,514
Trainable params: 1,790,514
Non-trainable params: 0
_________________________________________________________________
None

Next, we fit the compiled model. I have created an early stopping callback to reduce model training time, should the model require fewer than 50 epochs.


# Create an early stopping callback that monitors the training loss
import tensorflow as tf

def create_early_stopping_callback(monitor="loss", patience=2, restore_best_weights=False):
  early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor=monitor,
                                                             patience=patience,
                                                             verbose=1,
                                                             restore_best_weights=restore_best_weights)

  return early_stopping_callback

# Fit the compiled model
history = poems.fit(predictors, 
                    labels, 
                    epochs=50, 
                    callbacks=[create_early_stopping_callback()],
                    verbose=1)
Epoch 11/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4513 - accuracy: 0.8919
Epoch 12/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4370 - accuracy: 0.8940
Epoch 13/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4253 - accuracy: 0.8947
Epoch 14/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4116 - accuracy: 0.8967
Epoch 15/57
451/451 [==============================] - 30s 67ms/step - loss: 0.4013 - accuracy: 0.8960
Epoch 16/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3951 - accuracy: 0.8974
Epoch 17/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3909 - accuracy: 0.8980
Epoch 18/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3830 - accuracy: 0.8988
Epoch 19/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3780 - accuracy: 0.8993
Epoch 20/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3645 - accuracy: 0.9008
Epoch 21/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3649 - accuracy: 0.9010
Epoch 22/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3593 - accuracy: 0.9008
Epoch 23/57
451/451 [==============================] - 30s 66ms/step - loss: 0.3581 - accuracy: 0.9007
Epoch 24/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3553 - accuracy: 0.9010
Epoch 25/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3461 - accuracy: 0.9019
Epoch 26/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3438 - accuracy: 0.9029
Epoch 27/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3444 - accuracy: 0.9026
Epoch 28/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3410 - accuracy: 0.9015
Epoch 29/57
451/451 [==============================] - 30s 68ms/step - loss: 0.3370 - accuracy: 0.9026
Epoch 30/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3348 - accuracy: 0.9036
Epoch 31/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3292 - accuracy: 0.9038
Epoch 32/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3321 - accuracy: 0.9036
Epoch 33/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3292 - accuracy: 0.9031
Epoch 34/57
451/451 [==============================] - 30s 67ms/step - loss: 0.3265 - accuracy: 0.9024
Epoch 35/57
451/451 [==============================] - 31s 69ms/step - loss: 0.3225 - accuracy: 0.9040
Epoch 36/57
451/451 [==============================] - 30s 68ms/step - loss: 0.3210 - accuracy: 0.9043
Epoch 37/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3198 - accuracy: 0.9041
Epoch 38/57
451/451 [==============================] - 31s 69ms/step - loss: 0.3201 - accuracy: 0.9041
Epoch 39/57
451/451 [==============================] - 31s 68ms/step - loss: 0.3242 - accuracy: 0.9031
Epoch 39: early stopping

Model Prediction on Layers

Right now, we see the fitted model as a black box and have no idea about the data predictions at each layer, right? Fortunately, we can check the layers attribute of the model.

# Model layers
poems.layers
[<keras.layers.embeddings.Embedding at 0x7fdea56d8a90>,
 <keras.layers.wrappers.Bidirectional at 0x7fdea560b410>,
 <keras.layers.core.dense.Dense at 0x7fdeab22c850>]

And we can see how the model prediction works at each layer with this code. The data at each layer is represented by floating-point numbers. Note that we preprocess the seed text before feeding text strings into the model.

from keras.models import Model

# Text for predictions
seed_text = "Call me"

def preprocess(seed_text):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=sequence_length - 1, padding='pre')  # the model expects sequence_length - 1 tokens
  return token_list


print("Model outputs on each layer")

for i in range(0, len(poems.layers)):
    model = Model(poems.layers[0].input, poems.layers[i].output)
    output = model.predict(preprocess(seed_text))
    print(f"======= {i}: {poems.layers[i]} ========")
    print(f"{output}")
Model outputs on each layer
======= 0: <keras.layers.embeddings.Embedding object at 0x7fdea56d8a90> ========
[[[-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  ...
  [-0.14440411  0.07779645  0.09079583 ...  0.13821019 -0.02658074
    0.09287226]
  [ 0.01843664  0.02065328  0.03408707 ... -0.01587552 -0.02026105
   -0.03027008]
  [-0.0420275  -0.3784417   0.02906846 ... -0.10219543  0.16914204
    0.08471879]]]
======= 1: <keras.layers.wrappers.Bidirectional object at 0x7fdea560b410> ========
[[ 1.67686522e-01  3.02018255e-01 -2.93912590e-01  1.64449215e-03
   1.40806567e-03  5.34395039e-01 -5.08705258e-01 -7.53865242e-01
  -3.80721778e-01 -7.58378804e-01 -1.64750755e-01 -7.17333496e-01
  -9.89738345e-01  4.30853665e-02 -5.30713797e-01  5.23770034e-01
   9.85164642e-01  8.51590157e-01  9.70998406e-03 -2.37380475e-01
   9.38392699e-01 -2.39028364e-01 -9.16242719e-01 -9.21792686e-01
   7.33635843e-01  2.30621308e-01  5.96461952e-01 -9.17051673e-01
   4.89480674e-01 -5.30641973e-02 -7.75405169e-01 -7.62042046e-01
   4.63971496e-03  9.10258114e-01  5.64844608e-01 -7.50938952e-02
   1.10193595e-01 -4.62228358e-02  9.75572228e-01  5.35259187e-01
   5.10092378e-02 -4.58493531e-02 -1.88940186e-02 -2.20926225e-01
  -5.70184290e-01  7.91557312e-01  9.86257732e-01 -9.96294260e-01
   2.72399634e-01  6.37552261e-01 -3.62190276e-01 -2.24583745e-02
  -9.04657785e-03 -5.49311399e-01 -7.59658933e-01 -6.54185176e-01
   7.72388577e-01  1.16069829e-02  9.47195709e-01 -5.54253638e-01
   2.20582038e-01 -9.54650819e-01  2.39327431e-01  2.00341687e-01
   9.60499272e-02  6.15832925e-01  1.87492281e-01  9.00895238e-01
  -6.41395688e-01  7.33810246e-01 -9.99938607e-01  8.40495110e-01
  -5.22413254e-02 -4.91351753e-01 -6.62701607e-01 -1.55056305e-02
   1.54367298e-01 -1.52263790e-01  2.44105130e-01  7.19992101e-01
   5.31756759e-01 -8.88277531e-01 -2.73087919e-02  6.41297936e-01
   7.52824605e-01  4.38874960e-01  2.48972699e-01 -1.22749002e-03
   1.42141759e-01 -7.29399204e-01 -4.50825542e-01  1.07500324e-04
   7.79183805e-01 -7.54467368e-01  3.15285325e-01  5.95727742e-01
   5.02261162e-01 -9.08636808e-01  7.24367082e-01  7.56709099e-01
   1.74799770e-01  7.59554565e-01 -4.18312430e-01  1.46734178e-01
   5.55629790e-01  6.32315129e-02  6.81187034e-01 -4.69827652e-02
   9.97444391e-01  8.27595413e-01  5.44130802e-01  3.49239111e-02
   7.38222957e-01  7.53437042e-01 -5.86992919e-01  5.72768524e-02
  -7.34889805e-02 -7.20157087e-01  9.30093825e-01  9.55041409e-01
  -6.99992836e-01  8.60040247e-01  7.26323009e-01 -7.30807841e-01
  -8.93093228e-01  9.73617852e-01 -5.33384562e-01 -3.21368098e-01
  -3.93123746e-01  6.90287948e-02  3.36278915e-01  4.27034825e-01
  -9.57970023e-01 -2.38073617e-01 -4.31691557e-02 -5.65511882e-01
   8.56223226e-01 -8.75256002e-01  7.03857481e-01 -1.98097751e-02
  -6.73841417e-01 -9.82385278e-01  8.26719627e-02  2.90077686e-01
   7.10362434e-01  8.36537302e-01  9.96259749e-01  6.98546367e-03
  -6.56979918e-01  6.16679788e-01  7.31738191e-03 -3.62134457e-01
   8.91300291e-02  3.79151672e-01  1.50031667e-06  3.50421309e-01
  -1.40906975e-03 -5.25088571e-02 -3.34957476e-05  9.98893917e-01
   5.28899848e-01  9.89395678e-01 -1.29087281e-03 -6.55309930e-02
   1.96016908e-01 -1.16327172e-03 -4.14484888e-01 -9.99033928e-01
  -1.97359500e-03  6.75709784e-01  9.72128782e-06  9.36290264e-01
  -7.56961823e-01 -5.42829512e-04  9.17187095e-01  1.36587454e-03
  -9.94153798e-01  9.98658001e-01  7.32723236e-01 -3.47359121e-01
  -7.54145861e-01  6.15631461e-01  9.96937990e-01  7.14225054e-01
  -9.24452603e-01 -4.95396316e-01 -6.13545299e-01 -2.47467797e-05
  -9.77903903e-01  1.20051473e-01  7.62190878e-01 -9.36906457e-01
   5.77772385e-04  6.71585798e-01  3.25368298e-03  1.94000095e-01
   1.76189214e-01 -7.35071719e-01 -4.89228606e-01 -9.72593665e-01
   6.09238148e-01  2.99195573e-02  5.77019854e-03  3.39788087e-02
  -9.80277836e-01 -9.19586062e-01  1.62273951e-04 -1.92941315e-02
  -5.71717741e-03 -9.63379562e-01  9.76729810e-01 -2.38395496e-05
  -2.92603690e-02  6.55407924e-03 -9.07341833e-04 -1.66761765e-05
   1.38965889e-03  7.14213729e-01  1.79404858e-02 -2.41063803e-01
  -7.32580316e-04  5.58820784e-01 -4.36292320e-01  3.05520778e-04
   9.99281704e-01  1.64881028e-04 -9.33250666e-01  1.49732083e-02
   1.45692721e-01 -9.19333160e-01  8.78533483e-01  9.96746719e-01
   9.18693900e-01  1.22652799e-01 -1.65246171e-03 -9.92490530e-01
   1.25048589e-03  2.05047339e-01 -1.13654444e-02  9.25401032e-01
   4.14350331e-02 -7.52726614e-01 -4.56443131e-01  1.78641975e-02
  -1.60465036e-02  7.59996831e-01  9.70341563e-01  9.85438824e-01
   8.51573646e-01  5.91744669e-02  3.75899136e-01 -9.58901286e-01
   9.97678161e-01 -6.45919591e-02 -1.59563959e-01 -1.59562573e-01
   4.76286933e-03 -9.15662050e-01 -3.35523684e-04 -1.93180114e-01
   2.28081077e-01  6.28476322e-01 -8.47836375e-01  1.22441597e-05
   1.66022167e-01 -1.66538788e-03  8.67520630e-01  7.55602062e-01
   1.37379006e-01 -9.66367781e-01 -2.42418453e-01  2.08438367e-01
   5.53086340e-01 -1.33329080e-02 -9.54729497e-01  9.41386342e-01
   2.59216845e-01  3.48699152e-01 -3.15106094e-01  1.21884560e-03
   3.62821847e-01  9.96118784e-01  2.49868989e-01  1.00890247e-04
   9.75508749e-01  4.27330807e-02 -2.03665641e-05 -1.68836862e-01
   5.54664584e-04  9.82056737e-01  4.10370231e-01  9.47619259e-01
  -9.98544633e-01  9.82442915e-01 -1.90521474e-03 -7.00094461e-05
  -9.78231370e-01  1.60007745e-01  9.27375138e-01  2.38520373e-02]]
======= 2: <keras.layers.core.dense.Dense object at 0x7fdeab22c850> ========
[[1.3833241e-10 5.6871340e-05 5.4320190e-03 ... 6.6077948e-14
  1.3805723e-10 5.0975579e-09]]

Next, we can see the next-word prediction for our poem. Again, we need to preprocess the seed text into numerical format, then use our model to predict the next word, and finally find the word corresponding to the index calculated in the prediction step. The word following seed_text = “Call me” is “to”.

def next_word(predicted):
  predicted = np.argmax(predicted, axis=-1)
  for word, index in tokenizer.word_index.items():
    if index == predicted:
       return word
  return ""

def predict_next_word(seed_text):
  predicted = poems.predict(preprocess(seed_text))
  return next_word(predicted)

seed_text = "Call me"
next = predict_next_word(seed_text)
print(next)

to

With the function predict_next_word(seed_text), we can easily generate poem text sequences.

for i in range(0, 3):
  predicted_word = predict_next_word(seed_text)
  seed_text = seed_text + " " + predicted_word
  print(seed_text)
Call me to
Call me to breathe
Call me to breathe in

How Do We Generate Our Poems?

Ideally, we don’t need to write our poems ourselves. We just come up with a title, and the program will generate the poem automatically! For this, we initialize the “seed_text” variable with the poem title. We also define how many words should be generated in each line. The “paragraphs” variable specifies the number of paragraphs in the generated poem.

def write_poem(model, tokenizer, max_sequence_length, seed_text="The Moon and Sun", next_words=6, paragraphs=3):
    """
    Uses fitted text generating Keras Sequential model to write a poem.
    :param model: Keras sequential model, fitted to a text corpus
    :param tokenizer: Tokenizer
    :param max_sequence_length: Maximum length of text sequences
    :param seed_text: a text string to start poem generation
    :param next_words: Number of words per line (also the number of lines in a paragraph)
    :param paragraphs: Number of paragraphs in the generated poem
    :return: text of the generated poem
    """
    poem = seed_text.capitalize() + "\n\n"
    while paragraphs > 0:
        paragraph = ""
        for word_number in range(next_words):
            sentence = "\n"
            for _ in range(next_words):
                token_list = tokenizer.texts_to_sequences([seed_text])[0]
                token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
                predicted = model.predict(token_list)
                predicted = np.argmax(predicted, axis=-1)
                output_word = ""
                for word, index in tokenizer.word_index.items():
                    if index == predicted:
                        output_word = word
                        break
                seed_text += " " + output_word
                sentence += " " + output_word
            paragraph += sentence.strip().capitalize() + "\n"
            seed_text = output_word
        seed_text = sentence
        poem += paragraph + "\n"
        paragraphs -= 1

    print(poem)
    return poem

We tokenize each seed text into a sequence, which is further padded. The resulting list of tokens is used for predicting the next word. Each first word in a sentence is capitalized, and each sentence is concluded with a new line.

Workflow of the Poetry Writing

Finally, we join all the steps of text preprocessing, vectorisation and model training to write a poem with our code! Isn’t it excellent for folks that cannot write poetry? Will this approach lead to exciting results? Let’s see.

# Getting and preprocessing a text corpus
text, average_words_number = get_corpus(url="https://www.gutenberg.org/cache/epub/38572/pg38572.txt",
    get_part=True, start_phrase="LOVE SONNETS OF AN",
    end_phrase="_Now in Press_" )

# Tokenizing the extracted text
tokenizer, vocabulary_length =  create_tokenizer(text)

# Pad text sequences
sequence_length, predictors, labels = pack_sequences(text, tokenizer, vocabulary_length)

# Create the poem-generating model
poems = create_model(vocabulary_length, sequence_length)

# Print the model summary
print(poems.summary())

# Fit compiled model
history = poems.fit(predictors, labels, epochs=150, verbose=1)

# Generate poetry
write_poem(poems, tokenizer, sequence_length, seed_text="Shine in the darkness", next_words=5, paragraphs=3)
Shine in the darkness

At the fall of evening,
I part your hair, and
I make towards you, happy
And serene, they believe eagerly;
Its offering, my joy and

The fervour of my flesh.
Oh! how everything, except that
Lives in the fine ruddy
Being seems to dwell in
The summer wind, this page

And that so so open
Forth in the general terms
Of this agreement, you may
My two hands against your
Eyes were then so pure

We can see text strings such as “general terms” and “of this agreement” in our poem. It shows that doing better text cleaning and removing unrelated content is essential.
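
One simple improvement would be to cut the raw file at the standard Project Gutenberg marker lines before any further processing; a sketch, assuming the usual "*** START OF" and "*** END OF" markers are present in the file (the helper name is hypothetical):

def strip_gutenberg_boilerplate(raw_text):
    """Keep only the text between the standard Project Gutenberg marker lines."""
    start = raw_text.find("*** START OF")
    end = raw_text.rfind("*** END OF")
    if start == -1 or end == -1:
        return raw_text  # markers not found, return the text unchanged
    return raw_text[start:end]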

Conclusion

We have created a simple poem generation model with the Keras Sequential API. I think the poem is gibberish; there is no plot or idea, right? However, we have exercised text preprocessing and learned general NLP concepts along the way. What’s next? Of course, we will improve the poem generator so we can write many beautiful poems with one click!

Did you like this post? Please let me know if you have any comments or suggestions.

References

1. Natural Language Processing

2. A poem generator

3. The Project Gutenberg EBook of The Love Poems, by Émile Verhaeren.

4. TensorFlow Developer Certificate in 2022: Zero to Mastery

5. 08. Natural Language Processing with TensorFlow

6. Word embeddings

7. tf.keras.utils.pad_sequences

8. GloVe: Using pre-trained word embeddings

9. Word2Vec

10. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.


About Elena

Elena, a PhD in Computer Science, simplifies AI concepts and helps you use machine learning.

Citation
Elena Daehnhardt. (2022) 'TensorFlow: Romancing with TensorFlow and NLP', daehnhardt.com, 11 July 2022. Available at: https://daehnhardt.com/blog/2022/07/11/python-natural-language-processing-tensorflow-one-hot-encodings-tokenizer-sequence-modeling-word-embeddings/