In Which a Project was Undertaken, Part 3
Now that my system is set up to run neural network training locally (described in In Which a Project was Undertaken, Part 2: link), I started experimenting with some trial runs to train the network to generate dialogue.
Massaging the Data
![]()
Trust the data (Data and Lore, androids from Star Trek: The Next Generation; Wikipedia link, IMDB link)
I had collected Season 7 of the classic Star Trek: The Next Generation transcripts from this site (link), representing 26 hours of dialogue which can be used as input for training, but it turns out that using this data directly posed several issues. To prepare the data for ingestion by the neural network, it needs to be preprocessed, trimming off all the fat and indigestible bits.
First I removed all the stage directions such as [Bridge], [OC], (Picard sits down), since they are not needed for dialogue generation. Luckily all such directions and descriptions were contained in parentheses or square brackets, so they were easy to filter out.
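As an illustration, the filtering can be done with a couple of regular expressions. Here is a minimal sketch (the patterns and names are illustrative, not my script verbatim):
______________________________________________________
import re

def strip_directions(line):
    # Remove stage directions in [brackets] or (parentheses), then tidy up whitespace
    line = re.sub(r"\[[^\]]*\]", "", line)   # e.g. [Bridge], [OC]
    line = re.sub(r"\([^)]*\)", "", line)    # e.g. (Picard sits down)
    return " ".join(line.split())

print(strip_directions("PICARD: (sits down) Make it so. [Bridge]"))
# PICARD: Make it so.
______________________________________________________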
Next, the length of the lines needs to be considered. To feed the neural network a predictable length of data for every input stream, each line of dialogue from the transcript has to be padded to the length of the longest line with empty padding data. This means that short lines would consist almost entirely of padding and carry very little information.
Processing the transcript showed that the longest line was a monologue 179 words long. This means that a one-word line of dialogue would need 177 words of padding (the speaker's name takes the remaining word). To keep the input size of the neural network manageable, extremely long lines will be removed from the corpus. Looking at the distribution curve of dialogue lengths, it's clear that 80 words is safely into the long tail, so lines longer than 80 words will be removed (rather than truncated, so that the neural network always sees complete sentences).
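For reference, the length distribution can be checked with a quick helper along these lines (a sketch; it takes the cleaned dialogue lines from the previous step):
______________________________________________________
def length_report(lines, cutoffs=(40, 60, 80, 100)):
    # Word counts per line, and the fraction of lines at or below each candidate cutoff
    lengths = [len(line.split()) for line in lines]
    print("longest line:", max(lengths), "words")
    for cutoff in cutoffs:
        kept = sum(1 for n in lengths if n <= cutoff)
        print(f"<= {cutoff:3d} words: {kept / len(lengths):.1%} of lines")
______________________________________________________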
Also, very short lines of dialogue (such as "Thanks", "Goodbye", "I don't know") are not very helpful for learning the peculiarities of someone's speech, so lines of 4 words or fewer will be removed as well.
The resulting corpus still contains 90.6% of the original sentence count, so the majority of the data is retained.
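Both thresholds can be applied in one pass; a sketch of how that might look (the helper name and the retention printout are illustrative):
______________________________________________________
def filter_by_length(lines, min_words=5, max_words=80):
    # Keep only lines with at least 5 and at most 80 words
    kept = [line for line in lines if min_words <= len(line.split()) <= max_words]
    print(f"kept {len(kept)} of {len(lines)} lines ({len(kept) / len(lines):.1%})")
    return kept
______________________________________________________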
Then, I decided to generate dialogue only for the major characters. Minor characters often don't have enough lines to establish a linguistic pattern, so including their lines would only add noise to the data. Looking at the distribution of lines per character, it's clear that only the top 7 characters' lines need to be kept.
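A sketch of the speaker filter, assuming each cleaned line starts with the speaker's name followed by a colon (the format used in the transcripts):
______________________________________________________
from collections import Counter

def keep_major_characters(lines, top_n=7):
    # The speaker's name is assumed to be everything before the first colon
    speakers = [line.split(":", 1)[0].strip() for line in lines]
    major = {name for name, _ in Counter(speakers).most_common(top_n)}
    return [line for line in lines if line.split(":", 1)[0].strip() in major]
______________________________________________________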
Finally, there are 6180 unique words in the remaining corpus. A large output vocabulary slows down learning, especially for infrequently used words. Counting how often each word is used, it turns out that only 3486 (56.4%) are used more than once. Removing the single-use words therefore significantly reduces the output size. This will likely introduce some grammar anomalies in the output due to the missing words, but I believe the output will still be usable with the remaining vocabulary.
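One way to prune the vocabulary is to count word frequencies over the whole corpus and drop anything seen only once (a sketch; whether the dropped words are deleted outright or replaced with an out-of-vocabulary token is a separate choice — here they are simply deleted):
______________________________________________________
from collections import Counter

def prune_rare_words(lines, min_count=2):
    # Count how often each word appears across the whole corpus
    counts = Counter(word for line in lines for word in line.lower().split())
    vocab = {word for word, count in counts.items() if count >= min_count}
    print(f"vocabulary reduced from {len(counts)} to {len(vocab)} words")
    # Drop the rare words from every line
    return [" ".join(word for word in line.lower().split() if word in vocab)
            for line in lines]
______________________________________________________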
Starting the Network
![]()
The Net is Vast and Infinite (Wikipedia link, IMDB link)
For the initial neural network, I started with a simple one that was shown to work in the TensorFlow in Practice Specialization (reviewed in In Which Instructions were Sought, Part 3: link). In the course, the network was used to ingest Shakespeare's sonnets and output a text stream similar in style. Starting with this neural network offers several advantages:
- Chosen by the instructor as a functional network
- Shown to work for a similar learning task
- I have some experience with tuning it
______________________________________________________
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.regularizers import l2

model = Sequential()
model.add(Embedding(total_words, 128, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dropout(0.1))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(1024, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
______________________________________________________
A not-so-short description of the network is as follows:

The embedding layer maps each of the distinct words in the vocabulary onto a dense vector representation (an embedding), which allows both a more compact representation and the ability for the network to learn associations between words. There is a description of word embeddings on this site: Towards Data Science link. The input length of the layer is set to the length of the longest line of text minus one, since the last word of each input sequence is held out as the word to predict.
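For completeness, the inputs that feed this layer can be built with the Keras Tokenizer, roughly as it is done in the course: each line is turned into a series of n-gram prefixes, pre-padded to a fixed length, with the last word of each prefix held out as the label. A sketch (assuming lines is the filtered corpus from the earlier steps):
______________________________________________________
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
total_words = len(tokenizer.word_index) + 1

# For each line, every prefix of length >= 2 becomes a training sequence
sequences = []
for seq in tokenizer.texts_to_sequences(lines):
    for i in range(2, len(seq) + 1):
        sequences.append(seq[:i])

max_sequence_len = max(len(s) for s in sequences)
sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))

xs, labels = sequences[:, :-1], sequences[:, -1]      # inputs and the word to predict
ys = to_categorical(labels, num_classes=total_words)  # one-hot labels for categorical cross-entropy
______________________________________________________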
The Long Short-Term Memory (LSTM) layer is a neural network architecture that can work with sequential data and remember state from earlier in the sequence, which allows it to learn better representations by associating each word with the context that came before it. Here is a site with an explanation of how LSTMs work: link. For our use case, we already have all the inputs ahead of time, which means we can make the LSTM bidirectional and let earlier data in the sequence also be informed by later data. The output at every timestep is passed on to the next layer (return_sequences=True) since we are adding additional learning layers after this one.
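A quick way to see what the bidirectional wrapper and return_sequences do is to inspect the output shapes on a dummy batch (illustrative only):
______________________________________________________
import tensorflow as tf
from tensorflow.keras.layers import Bidirectional, LSTM

x = tf.random.normal((1, 79, 128))                      # (batch, timesteps, embedding size)
per_step = Bidirectional(LSTM(128, return_sequences=True))(x)
final_only = Bidirectional(LSTM(128))(x)
print(per_step.shape)    # (1, 79, 256): one output per timestep, forward and backward concatenated
print(final_only.shape)  # (1, 256): only the final output
______________________________________________________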
The dropout layer prevents overfitting (learning to fit the training data too well, to the exclusion of a more general model) by randomly turning off (or dropping out) input units in the layer, forcing the next layer to not rely on any specific unit in the previous layer. This site has some diagrams illustrating the concept: link.
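The mechanism itself is simple enough to write out with NumPy (a back-of-the-envelope illustration of inverted dropout, not what Keras does internally verbatim):
______________________________________________________
import numpy as np

def dropout(x, rate=0.1):
    # Randomly zero out a fraction of the inputs, scaling the rest so the expected sum is unchanged
    mask = np.random.rand(*x.shape) >= rate
    return x * mask / (1.0 - rate)

print(dropout(np.ones(10)))  # roughly one in ten values dropped to zero
______________________________________________________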
The next LSTM layer continues learning sequential relationships and, as the final sequential layer, outputs only its final timestep (no return_sequences).
The dense layer is a basic densely connected neural network layer, with every input connected to every unit. Its activation is set to ReLU (Rectified Linear Unit: output 0 for any input below 0, otherwise pass the input through unchanged) since it is fast and simple, and avoids issues such as vanishing gradients that affect more complicated activation functions. Here is a description of ReLU: link. An L2 regularizer is also added, which penalizes large weights and causes them to decay towards zero, thereby simplifying the model and reducing overfitting. Here is a site with an explanation of L2 regularization: link.
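Both pieces are simple enough to write out directly (illustrative only):
______________________________________________________
import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive ones
    return np.maximum(0, x)

def l2_penalty(weights, lam=0.01):
    # L2 regularization adds lam * sum(w^2) to the loss, nudging weights towards zero
    return lam * np.sum(np.square(weights))

print(relu(np.array([-2.0, 0.5, 3.0])))        # [0.  0.5 3. ]
print(l2_penalty(np.array([0.5, -1.0, 2.0])))  # 0.01 * (0.25 + 1 + 4) = 0.0525
______________________________________________________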
The final dense layer converts the learned information into an output word. A softmax activation is used so that a probability distribution over the possible output words is generated.
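Once trained, that probability distribution gets turned back into a word; the simplest approach is greedy decoding, i.e. taking the most probable index. A sketch (assuming the tokenizer and max_sequence_len from the input-preparation sketch above):
______________________________________________________
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_next_word(model, tokenizer, seed_text, max_sequence_len):
    # Encode the seed text the same way as the training data, pre-padded to the input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    probabilities = model.predict(token_list, verbose=0)[0]  # softmax over the whole vocabulary
    predicted_index = int(np.argmax(probabilities))          # greedy: pick the most probable word
    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            return word
    return ""
______________________________________________________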
The model is then compiled using a categorical cross-entropy loss function due to the multi-class classification nature of the task (the output needs to be categorized into one of the possible output words). An Adam (adaptive moment estimation) optimizer (a simple explanation is here: link) is used so that the gradient descent step size is dynamically adjusted based on moving averages of the gradient and its square, as opposed to using a constant step size, which allows the network to converge faster.
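For the curious, the update rule per parameter looks roughly like this, with the usual default hyperparameters (a sketch of the idea rather than TensorFlow's exact implementation):
______________________________________________________
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (first moment) and of its square (second moment)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the averages for the first few steps, then take a per-parameter adaptive step
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
______________________________________________________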
To Be Continued
![]()
Tune in next time, same bat-time, same bat-channel! (Wikipedia link, IMDB link)
Now that the input data and the neural network structure have been introduced, the next post can finally get to the meat of the project: the actual training and results! Please continue following my adventures in my next entry, In Which a Project was Undertaken, Part 4 (link). Thank you!