In Which a Project was Undertaken, Part 4

The Meat of the Matter

Got your multipass? (Wikipedia link, IMDB link)

In the previous blogs, I talked about my motivation for studying deep learning (link), my choice of a machine (link to part 1, link to part 2), my search for appropriate online courses and my review of the courses I took (link to part 1, link to part 2, link to part 3), and finally my choice of an initial project and preparations for the project (link to part 1, link to part 2, link to part 3).

And finally it's time to dig into the results of the project, a simple dialogue generator for Star Trek: The Next Generation characters. The neural network and input data have all been set up, and the fun can begin!

Run 1

For the first few runs, I had not yet trimmed the inputs as described in the previous blog; it was only later, when I decided to try to improve the run time, that I switched. The model summary of the neural network was as follows:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 178, 128)          792192    
_________________________________________________________________
bidirectional (Bidirectional (None, 178, 256)          263168    
_________________________________________________________________
dropout (Dropout)            (None, 178, 256)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
dense (Dense)                (None, 1024)              263168    
_________________________________________________________________
dense_1 (Dense)              (None, 6189)              6343725   
=================================================================
Total params: 8,056,493
Trainable params: 8,056,493
Non-trainable params: 0
_________________________________________________________________
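
For reference, the layer listing above can be turned back into code. Here is a rough reconstruction; the dropout rate is not visible in the summary (the value below is a placeholder), and the L2 regularization mentioned in Run 2 is omitted because its placement isn't shown either.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

# rough reconstruction of the Run 1 model from the summary above
model = Sequential()
model.add(Embedding(total_words, 128, input_length=max_sequence_len-1))  # 6189 words x 128 dimensions
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dropout(0.2))                     # placeholder rate, not shown in the summary
model.add(Bidirectional(LSTM(128)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])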

Training was relatively bearable at about 2 minutes per epoch, but accuracy started plateauing at 47%.

Train on 80531 samples
Epoch 1/100
80531/80531 [==============================] - 121s 2ms/sample - loss: 6.5320 - acc: 0.0492
...
Epoch 100/100
80531/80531 [==============================] - 117s 1ms/sample - loss: 2.5600 - acc: 0.4742
The output was already somewhat coherent, though. A 50-word output stream gives:

Picard: I don't know I don't know I don't know what you have but that is not appropriate she's attacked down I want you to consider that he was alive but I would like to talk to a little threatened the shut down time but stopping us in our own realities

Run 2

I was wondering if the LSTM portion was simply too small to store all the relationships between words, so I added two LSTM layers: one the same size as the first, and another of double the size in front of that. Also, at this point I was more concerned with accuracy than with overfitting, and I was not working with a dev set either, so I removed the dropout layer and the L2 regularization.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

# four stacked bidirectional LSTM layers, with no dropout or L2 regularization this time
model = Sequential()
model.add(Embedding(total_words, 128, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

This increased the number of parameters to 9,271,469 and significantly slowed down training. Accuracy did increase to 71%, however.

Epoch 134/300
80531/80531 [==============================] - 196s 2ms/sample - loss: 1.1290 - acc: 0.7102

Run 3

I continued to increase the neural network size incrementally, so as to keep the training time reasonable and avoid adding more capacity than needed. This time I doubled the size of the first two LSTM layers, which turned out to drastically increase the number of parameters (to 14,654,253) and the training time. It also broke the network: accuracy hit a plateau at 41%, then started crashing. Either the network was overfitting, or it moved into a part of the parameter space that gives poor results and got stuck there.

Epoch 65/200
80531/80531 [==============================] - 361s 4ms/sample - loss: 2.4229 - acc: 0.4116
...
Epoch 74/200
80531/80531 [==============================] - 336s 4ms/sample - loss: 3.0998 - acc: 0.2916

Run 4

Since large LSTMs were taking too long to train, I tried switching the LSTMs to GRUs, hoping that would lessen the architecture's complexity and reduce training time. There was some speedup, but it turned out that the GRUs in this arrangement could not be trained properly; the accuracy never rose above 11%.
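
The model definition for this run isn't shown here, but the change is just a drop-in swap of the recurrent cell. A minimal sketch, with the layer widths simply carried over from Run 2 for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, GRU, Dense

# GRU cells swapped in for the LSTMs; widths assumed unchanged from Run 2
model = Sequential()
model.add(Embedding(total_words, 128, input_length=max_sequence_len-1))
model.add(Bidirectional(GRU(256, return_sequences=True)))
model.add(Bidirectional(GRU(128, return_sequences=True)))
model.add(Bidirectional(GRU(128, return_sequences=True)))
model.add(Bidirectional(GRU(64)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])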

Epoch 88/200
80531/80531 [==============================] - 265s 3ms/sample - loss: 5.3787 - acc: 0.1152

Picard: I am the ship is the ship I have is the ship I have is the ship I have is the ship


Run 5

Back to using LSTMs, but instead of increasing the size of the layers as in Run 3, I increased the number of layers again, which had been helpful previously. However, it appears that I had passed the point of diminishing returns, and the extra layer did not affect accuracy significantly.

model = Sequential()
model.add(Embedding(total_words, 128, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(64, return_sequences=True)))   # the extra layer added for this run
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Epoch 161/200
80531/80531 [==============================] - 232s 3ms/sample - loss: 1.1463 - acc: 0.7109

Run 6

I needed a new approach besides fiddling with the number and width of the layers. This is when I started wondering whether the input was too sparse for good learning: there were too many short lines of dialogue padded out with empty tokens to accommodate the long lines, and too many words used only once. I trimmed the input data as described earlier and returned to the architecture used in Run 2. The number of parameters and the training time dropped significantly after trimming the data, and accuracy made a small gain to 76%.

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 78, 128)           446336    
_________________________________________________________________
bidirectional_8 (Bidirection (None, 78, 512)           788480    
_________________________________________________________________
bidirectional_9 (Bidirection (None, 78, 256)           656384    
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 78, 256)           394240    
_________________________________________________________________
bidirectional_11 (Bidirectio (None, 128)               164352    
_________________________________________________________________
dense_4 (Dense)              (None, 1024)              132096    
_________________________________________________________________
dense_5 (Dense)              (None, 3487)              3574175   
=================================================================
Total params: 6,156,063
Trainable params: 6,156,063
Non-trainable params: 0
_________________________________________________________________


Epoch 170/200
75398/75398 [==============================] - 103s 1ms/sample - loss: 0.9113 - acc: 0.7625

Picard: I don't understand this is not the way it's supposed to happen therefore that would be happy to hear them okay okay how long it takes we will do please report to the array again and it's getting worse your attention and how long have you been here

Riker: I don't think we have a choice it would take weeks to reach the main deflector and return to the transmission and returning to the ship in the nebula and the sensors six weeks why are you all on one

Worf: I do not know but doctor crusher cannot that riker I understand that you were trying to remove with these em sometimes life we have no idea why to these eruptions you ever been alone and see if the Bajorans


Run 7

The neural network was giving fairly good results at this level of accuracy. However, I wanted to see if I could eke out some more improvements, so I threw everything including the kitchen sink at the neural network.

First, for the dense layer with the ReLU activation function, I switched out the default Glorot uniform initialization, which samples from a uniform distribution, for He normal initialization, which samples from a truncated normal distribution centered on 0. He initialization works better for layers with ReLU activations, helping to contain vanishing and exploding gradient issues. See here for a moderately in-depth explanation: link.
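
In Keras this is a one-argument change on the dense layer; something along these lines, with the rest of the model unchanged:

# He normal initialization for the ReLU layer, replacing the Glorot uniform default
model.add(Dense(1024, activation='relu', kernel_initializer='he_normal'))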

I also mini-batched the training with a batch size of 256, so that the neural network updates after each small batch of inputs instead of all at once after it has seen every input, which should speed up learning. Mini-batching can be considered a midpoint between stochastic gradient descent, which updates the weights of the network after every training example, and batch gradient descent, which updates them only after every example has been sent through the network. This site goes over the difference in a bit more detail: link.
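
In Keras the batch size is just an argument to fit(); roughly as follows, where xs and ys stand in for the padded input sequences and one-hot labels prepared earlier:

# update the weights after every 256 training examples
model.fit(xs, ys, epochs=200, batch_size=256)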

Finally, I switched out the Adam optimizer for the Nesterov-accelerated Adam optimizer, which improves the momentum calculation by looking ahead to the future step instead of the current step. The link I shared earlier also has an explanation of Nesterov Adam (Medium link).
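
In the compile call, that amounts to swapping out the optimizer, for example:

from tensorflow.keras.optimizers import Nadam

# Nesterov-accelerated Adam in place of plain Adam
model.compile(loss='categorical_crossentropy', optimizer=Nadam(), metrics=['acc'])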

The results were great! Training time for each epoch dropped by more than half, and accuracy increased to 84%.

Epoch 200/200
75398/75398 [==============================] - 39s 515us/sample - loss: 0.5893 - acc: 0.8439

I also tried adding learning rate decay, but it did not make much of a difference.

Epoch 100/100
75316/75316 [==============================] - 34s 451us/sample - loss: 0.5610 - acc: 0.8567
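
For reference, one way to wire in decay (not necessarily how I configured it; the constants below are placeholders) is a per-epoch schedule applied through a callback:

from tensorflow.keras.callbacks import LearningRateScheduler

# placeholder schedule: start at 1e-3 and shrink the learning rate by about 4% each epoch
def lr_decay(epoch):
    return 1e-3 * (0.96 ** epoch)

model.fit(xs, ys, epochs=100, batch_size=256, callbacks=[LearningRateScheduler(lr_decay)])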

Run 8+

I could basically have stopped here, but I was curious about how much each of the changes I made contributed to the increase in training speed and accuracy. Switching the optimizer from Nesterov Adam back to plain Adam slowed down learning somewhat, but made no difference to the final accuracy.

Epoch 200/200
75398/75398 [==============================] - 37s 486us/sample - loss: 0.5752 - acc: 0.8521

Reverting the He normal initialization back to Glorot uniform lowered the maximum accuracy a little, but otherwise had little effect.

Epoch 200/200
75398/75398 [==============================] - 36s 482us/sample - loss: 0.7004 - acc: 0.8185

Since it looked like mini-batching was the cause of most of the improvement, I explored optimizing the mini-batch size. I reduced the number of epochs to 20 so that testing various batch sizes would not take too long. The results are as follows:

[Chart: training results for various mini-batch sizes]

I ran the tests multiple times, as it appears that the random elements of the training cause the results to be somewhat inconsistent, although it does look like I lucked out and hit upon approximately the best batch size in terms of accuracy gain per second.
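
The sweep itself is just a short loop; a hypothetical sketch, where build_model stands in for the model-construction code above and xs and ys for the training data:

# retrain a fresh model for 20 epochs at each candidate batch size and compare final accuracy
for batch_size in [64, 128, 256, 512, 1024]:
    model = build_model()
    history = model.fit(xs, ys, epochs=20, batch_size=batch_size, verbose=2)
    print(batch_size, history.history['acc'][-1])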

Next, I tried searching for the best learning rate, but the randomness of the results was once again hampering my efforts. I went on a bit of a detour to try to make the training deterministic. Surprisingly, this turned out to be rather difficult. I found various blogs and discussions about where the sources of randomness lie, and set the following parameters in my code. I even went as far as resetting all the random seeds before each change in learning rate, but no dice (unintentional pun). Please let me know if there is anything I've missed.

import os, random as rn
import numpy as np
import tensorflow as tf

rn.seed(12345)
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(42)
tf.random.set_seed(1234)
# limit TensorFlow to single-threaded ops, another potential source of nondeterminism
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)

Finally, just for fun, I threw all seven seasons' scripts at the neural network and turned regularization and dropout back on, to see what would happen. The network choked on these changes, sputtering as it tried to learn the significantly larger corpus, and catastrophically failed after 600 epochs. Likely the changes made the previously tuned hyperparameters sub-optimal again, and the network wandered off into a bad part of the parameter space.


Talking Heads

This guy/lady has a tendency to talk their enemies to death... (Wikipedia link, IMDB link)

Taking the output from the best-performing run, I chopped the output stream into sentences and arranged them into something somewhat coherent... (Yes, it's cheating a little, but it was surprisingly easy to arrange it into something that makes some sort of sense.)

Picard: And Worf...
Worf: Why?
Picard: It is a long range scan of the region of a biological inquiry.
Picard: If I could just get access to the Enterprise computer, just for a few moments...
Riker: I don't think we could have gotten it.
Crusher: I think I can use the tricorder to set up a multiphase pulse that should weaken the field enough to let us through.
Data: Cloak out the plasma relays on the Pegasus.
Worf: She's right.
Data: Perhaps you could.
Picard: It would seem an alternative.
Picard: It should be forced to cut the signal from the transport levels, otherwise they're in danger.
Crusher: If you wish.
LaForge: More power levels is hard.
Troi: Accessing the container itself until the event.
Riker: Power systems that commander worf's quantum signature pose.
Riker: They went to warp.

Picard: So what will Baran do once he's the second artefact?
Data: Right now I think that's what happened twelve years ago.
Data: After we left the ship.
Riker: I'm going through those times.
Riker: No memory emergent thing, for instance.
LaForge: I don't know, Data, all I can think is that maybe Lieutenant Kwan felt that there was something wrong in his life.
Data: I am not certain I do.
Troi: I don't think so, she's believed she's human all her life.
LaForge: That way she's in command.
Worf: I do not believe that you would have to analyse it.
Worf: Just good data in watching him come to give us a hand.
Riker: Any word?
Picard: I don't know.
Worf: Get some rest.

The End of the Beginning

All of this has happened before, and all of this will happen again. (Wikipedia link, IMDB link)

This concludes the current project, with some slightly hokey and mildly awkward dialogue generated, although with some massaging (adding stage directions, etc.) I think a story could be constructed. It was a good experience to try my hand at optimizing the neural network, its parameters and hyperparameters, and to get a more intuitive sense of what works and what doesn't than can be obtained from the Coursera course assignments.

One problem with achieving high accuracy when learning from the transcripts is that the neural network sometimes regenerated exact lines from the transcript word for word, which is not the goal. A balance is needed between high accuracy, so the output follows the rules of grammar, and some randomness, so that new dialogue is generated. Mixing the lines of dialogue gave the generated text some of that flavor, but it is hampered by two issues. First, the removal of single-use words caused grammatical anomalies that the neural network had a hard time recovering from. Second, the limited vocabulary of the input also limits the possible output.
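
One standard way to inject that randomness at generation time, which I have not tried here, is to sample the next word from the softmax output using a temperature instead of always picking the most likely word. A minimal sketch:

import numpy as np

def sample_with_temperature(probs, temperature=0.8):
    # flatten (temperature > 1) or sharpen (temperature < 1) the predicted distribution, then sample
    logits = np.log(probs + 1e-9) / temperature
    weights = np.exp(logits - np.max(logits))
    return np.random.choice(len(probs), p=weights / weights.sum())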

This suggests a clear follow-on project: start with a model that has already been trained on a much larger corpus of text, such as pretrained word2vec embeddings, and then use transfer learning to add the flavor of Star Trek dialogue to the model. This would allow more natural speech patterns to arise from the network, as well as the use of vocabulary not originally in the scripts.
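
As a rough sketch of what that could look like, the Embedding layer from the models above could be initialized with pretrained vectors and frozen, while the rest of the network is fine-tuned on the scripts. The embedding_matrix here is a hypothetical array built by looking up each word of the tokenizer's vocabulary in the pretrained vectors:

from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

# embedding_matrix: hypothetical (total_words, 300) array of pretrained word vectors
model.add(Embedding(total_words, 300,
                    embeddings_initializer=Constant(embedding_matrix),
                    input_length=max_sequence_len-1,
                    trainable=False))   # freeze the pretrained embeddings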

However, that is an adventure for another day...

He awaits his new adventure. (Wikipedia link, IMDB link)
