Building fully recurrent networks in LENS

This tutorial follows on from the previous tutorial on simple recurrent networks (SRNs). Here we will build and train a fully recurrent, continuous network, that is, a network in which a) activation can flow in both directions between connected units, b) it takes time for unit activations to change, and c) the model can learn to produce sequences of outputs rather than a single, static output. If you like, you can download the network build file, the network example file, and the procedures for writing data before beginning.

To start, let’s build the network shown in this figure, which learns to map from localist representations of words to temporally distributed representations of their “phonology”. In this implementation, we want to give the network a word and have it output, in the correct order, the correct sequence of sounds. To do this, we will have a feed-forward connection from the word units into a hidden layer, but full recurrence within the hidden layer, between the hidden and output layers, and within the output layer itself. We will also need training patterns distributed in time, and we will use a continuous network. To keep things simple, we will use a restricted set of letters. The network is laid out so that there are three possible letters in the first position, three possible vowels, and three possible letters in the third position. There are 20 English 3-letter words that can be formed from the letters indicated in the figure, excluding proper names, and we will train the network on all of these.

As always, you will typically build the network by typing the series of commands into a text file and sourcing the file in LENS. But for now, we can do it step-by-step. The first step is to add the network:

addNet words -i 10 -t 5 CONTINUOUS

This is the same old command familiar from feed-forward and simple recurrent networks, with a couple of additional arguments. In SRNs, the -i flag indicated the maximum number of intervals or steps appearing in the sequences the model would be processing. In continuous networks, units do not update their activations instantaneously; updating occurs over time. It is worth reading the manual pages on time to understand how time is handled in LENS. They will tell you that, in continuous networks, an interval is an abstract unit of time. Under default settings and in most applications, one interval is the time needed for a unit to fully update its activation to match its current input. So if a unit’s current state is 0.1, but its inputs indicate that its activation should be 0.9, it will take one interval of time for the unit to move from its current state of 0.1 to its input-specified state of 0.9.

The -i flag above indicates that the network will be run for a maximum of 10 intervals on any given example. This is the same parameter used in simple recurrent networks; in SRNs, however, all unit activations are updated “instantaneously” with each interval. That is, it does not take any time for activation to “build up” and pass forward in such networks. Instead, all units are updated fully once in each interval, so for such networks this is the only time-related parameter that needs to be set. Our model is “continuous”, as indicated by the CONTINUOUS type at the end of the command. This means that the build-up of activation on a given unit takes some amount of time, and that units can pass on their states to other units while their own activations are still building up. This in turn means that a given unit’s activation will eventually be influenced not only by its initial inputs, but also by feedback from elsewhere in the network as the unit passes its activations forward and receives signals back. We want to simulate this continuous settling process on a digital computer, so we need to specify some other parameters to control how time is handled.

The -t flag sets the number of ticks per interval. This parameter indicates the number of activation updates the network will perform in a single interval of time. In this model, the network will update unit activations 5 times in each interval: 5 ticks per interval. Since the model will run for at most 10 intervals, it will run for at most 50 ticks, or equivalently, it will update unit activations at most 50 times on a single example. You can think of the ticks-per-interval as setting the resolution at which time, a continuous process, is simulated by the digital processes running in the network. The more ticks per interval, the closer the simulation gets to approximating a truly continuous process, but the longer the simulation takes to run. It is usual to set the ticks-per-interval to something like 5, or to the lowest value at which the network seems to be doing a reasonable job.
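As an aside, you can check these settings after the network is built. If I remember the field names correctly, they live on the network object as timeIntervals and ticksPerInterval; treat these names as assumptions and verify against the getObj output in your own version of LENS:

getObj timeIntervals     ;# should report 10 for this network
getObj ticksPerInterval  ;# should report 5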

Other details about time that you might want to know about

There are a few other details about how time is handled that may or may not be relevant to your simulations. You can skip to the next section if you are not interested, but I will describe them briefly here for those who need to know.

The first such parameter is dt, which specifies how much a unit’s activation is updated in a single tick, or equivalently, the speed with which a unit updates its activation. By default, a unit’s dt is set to 1/ticksPerInterval. This means that, given an unchanging input, it will take the unit one full interval for its activation to reach the state indicated by its input. If, as in the current case, ticksPerInterval is set to 5, dt is set to 0.2 (that is, 1/5), and a unit with an unchanging input will take 5 ticks (that is, 5 updates) before its activation reaches the state indicated by the input. If we doubled dt (to 0.4), the unit would reach the appropriate activation in half the number of ticks: in this case, 2.5 ticks (though since there is no such thing as “half a tick”, it would in reality take 3 ticks in the current example).

dt can be set for the entire network, for groups of units, or for individual units. You can have different dts for different sets of units. If your project involves having some units that respond very quickly to changes in the input, and others that respond more sluggishly, you could implement this by having groups of units with different dts. If your project involves dynamically changing the speed with which units respond throughout the entire network, perhaps depending upon the network’s previous outputs, you could implement this using a Tcl script that changes the dt for the entire network based on the network’s outputs.
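As a sketch of the fast-and-sluggish idea using this tutorial’s groups (the dt values here are arbitrary, chosen only to illustrate, and I am assuming dt is exposed as a field on groups, in line with the description above):

setObj hidden.dt 0.4     ;# fast units: reach their input-specified state in 2.5 ticks (half an interval)
setObj spelling.dt 0.1   ;# sluggish units: take 10 ticks (two intervals)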

The second parameter you may or may not care about is whether time is integrated when activations are updated or when net inputs are updated. The preceding two paragraphs describe what happens when time is integrated at the level of activations, the default. That is, a unit’s net input is calculated instantaneously, by taking the weighted sum of its inputs and setting the unit’s net input directly to the resulting value. Changing the magnitude of the net input does not take time: with each tick, the net input to the unit is calculated and set accordingly. Instead, the time parameters influence how quickly activations build up in response to a change in net input.

It is possible to reverse these relationships, so that activations are set instantaneously but changes to net inputs take time. In this case, when the net input to a unit changes, instead of directly setting the unit’s net input to the weighted sum of the sending activations, you move the net input a fraction of the distance from its current value toward the new value. ticksPerInterval and dt work in the same way; they just exert their influence on the net inputs instead of the activations. The manual discusses some of the differences between integrating at the level of activations versus net inputs. The main difference is that, when integrating over net inputs (and for a given dt), unit activations change faster if the unit is in its midrange than if the unit is very strongly activated or de-activated. When integrating over activations, the rate at which unit activations change is the same no matter what the unit’s current state. To change this option from the default (i.e., to have net inputs integrated over time instead of activations), specify IN_INTEGR when you create the unit group.
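For instance, to build this tutorial’s hidden group with net-input integration rather than the default, the group type simply goes at the end of the addGroup command:

addGroup hidden 6 IN_INTEGR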

Back to business

Okay, we understand the addNet command: our network uses continuous units; it will cycle for at most 10 intervals on any example; and will update unit activations 5 times with each interval. But as of now, it has no units (except for the lonely bias unit, connected to nothing). So, let’s add some groups:

addGroup word 22 INPUT
addGroup hidden 6
addGroup spelling 9 OUTPUT SUM_SQUARED

This should be familiar from feed-forward networks. Now, let’s connect the groups together as indicated in the Figure:

connectGroups word hidden
connectGroups hidden spelling -p FULL -bidirectional
connectGroups hidden hidden
connectGroups spelling spelling

In these commands, remember that the -p flag allows you to specify what type of projection you would like to create. FULL is the default, so we don’t really need to specify it; it means every unit in the first group is connected to every unit in the second. Other types are described in the manual and the previous tutorial. The -bidirectional flag indicates that connections should be formed in both directions: hidden sends connections to spelling, and spelling sends connections back to hidden. This is what makes the model recurrent! Finally, you can connect the hidden units to one another, and the spelling units to one another. You don’t need to specify -bidirectional in this case because connections within a layer are necessarily bidirectional.

You can plot the model using the autoPlot function:

autoPlot 3

…looks nice.

Temporal events in the environment

Okay, the network exists and is looking fine. We need to create a world to train it with. For this model, we need a set of patterns, each specifying an input (a localist word representation) and a sequence of targets (the individual letters that spell the word, in the correct order). For a word like “CAT,” how do we tell the model that we want the C to come out first, and not to activate the “A” until the “C” is done?

The fully recurrent environment is specified in a manner similar to the simple recurrent environment: each example consists of a sequence of input-target pairs, with each pair occurring at a different point in time. What differs is how the network decides when it is time to stop processing one step in the sequence and move on to the next one, and this difference requires a couple of extra parameters.

The environment for the CAT pattern might look like this:

defI: 0
defT: 0
min: 0.5
max: 3
;

name: cat 3
I: (word 0) 1 T: 1 0 0 0 0 0 0 0 0
I: (word 0) 1 T: 0 0 0 1 0 0 0 0 0
I: (word 0) 1 T: 0 0 0 0 0 0 1 0 0
;

The first 4 lines (terminated by the semicolon) are the file header, which as usual sets defaults for all the patterns in the file. The first 2 lines set default input and target values. The next 2 lines specify defaults for time-management that we will discuss in a moment. Then the example patterns begin; I have shown a single example for the word “CAT.” It is specified exactly as one would do for a simple recurrent network. Each line of the example shows an input and a target value; the 3 lines together describe a sequence of input-output pairs. In this case, the input stays the same: the first unit in the word layer is activated on all 3 steps, and all other input units are set to their default value (0). What changes are the targets: the first step has “C” as its target, the next step has “A,” and the third step has “T.”

How is this different from a simple recurrent network? The answer has to do with how the network “decides” to move from one step of the sequence to the next. In a simple recurrent network, all unit activations are updated instantaneously, and each input-target pairing corresponds to one step in time. That is, the first pattern is presented, all units updated, error calculated; then the next pattern is presented, activations updated, error calculated; and so on. But in a continuous fully recurrent network, it takes time for activations to build up on units. The network must cycle for some period of time before activation can build up enough in the output layer to produce a response. The environment shouldn’t move on to the next step of the sequence until the network has had enough time to produce a response. But if a pattern has not yet been learned, the network could cycle forever without getting the response correct. How does the model “decide” whether it should move from the current time-step to the next one?

In LENS, there are two ways of controlling how long the network cycles on a given step of an example. First, on each tick the network checks the activations of all the output units against a performance criterion that you can specify (trainGroupCrit). For instance, you might want the network to cycle until all of its output units are within 0.1 of their target values, and then move on to the next step. To accomplish this, you can set the parameter trainGroupCrit to 0.1:

setObj trainGroupCrit 0.1

Now when the model is training, it will check on every tick whether the output activations meet this criterion; when they do, the environment will move on to the next step in the example sequence. There is an analogous criterion you can set for testing the model (testGroupCrit).
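The testing criterion is set the same way:

setObj testGroupCrit 0.1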

What if the output activations never match the criterion for moving on? This is what those two additional default parameters in the file header above are for. The min parameter specifies the minimum amount of time (in intervals) the network must spend on every step of an example, while max specifies the maximum amount of time it may spend on each step. In the example above, the network must cycle for half an interval (2.5 ticks, rounded up to 3 ticks since there is no such thing as half a tick) on each step; but it will not cycle for more than 3 intervals (15 ticks) on any step of the example, even if the performance criterion has not been reached. As with other parameters, these values can be set separately for each example; if they are not specified for an individual example, it inherits the default values given in the file header.
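For instance, to give a single example its own timing limits, I believe you can place min and max in that example’s header, mirroring the file header (a sketch; the values here are arbitrary):

name: cat 3
min: 1
max: 4
I: (word 0) 1 T: 1 0 0 0 0 0 0 0 0
I: (word 0) 1 T: 0 0 0 1 0 0 0 0 0
I: (word 0) 1 T: 0 0 0 0 0 0 1 0 0
;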

So, in summary: the environment will move on to the next step if i) it has cycled for at least min intervals and the output units match the group criterion (trainGroupCrit if training, testGroupCrit if testing), or ii) it has cycled for max intervals. If you want very exact control over when the example steps are applied to the network, you can set the parameters accordingly. For instance, if you want the model to run for exactly 3 intervals on each step, set trainGroupCrit and testGroupCrit to 0 (the model can never reach this criterion, since error will never be exactly zero on the output units), and set both min and max to 3.
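Concretely, that recipe is two commands in LENS:

setObj trainGroupCrit 0
setObj testGroupCrit 0

…plus these two lines in the example file header:

min: 3
max: 3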

If you want the full environment with all 20 words already constructed, you can download it here, place it in your working directory, and rename it read.ex. With the network constructed as described above, you can load the examples like this:

loadExamples read.ex -exmode PER

Remember that -exmode indicates how the patterns will be sampled from the environment. By default, patterns are presented in the order they appear in the file, which is usually rather artificial. The PER above indicates that the examples will be selected in PERMUTED mode: all patterns appear once per epoch, but in a new random order each epoch.
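For the record, other modes I know of include ORDERED (the in-order default just mentioned) and RANDOMIZED (examples drawn at random with replacement); I am working from memory here, so check the loadExamples page of the manual. As alternatives to the PER line above:

loadExamples read.ex -exmode ORDERED     ;# present examples in file order
loadExamples read.ex -exmode RANDOMIZED  ;# sample at random with replacement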

If you now click on the “New graph” button and then click “OK” in the dialog that pops up, you will get a graph showing the error as the model trains. Set the simulator to run 10000 epochs and click the Train button. You should see the error diminish on the graph. Once it seems to have hit a minimum, you can stop training (click on Stop Training) and play around with the model. Click on an example in the unit viewer, and then on the “play” button in the upper-right of the unit viewer. The model should turn on the 3 letters of the word in the correct order.
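If you prefer typing to clicking, I believe the Train button has a command-line equivalent: the train command takes the number of weight updates to perform, so something like the following should reproduce the above (treat this as an assumption and see the manual’s entry for train):

train 10000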

Also, have a look at the links in the Link Viewer. Is there any pattern to the weights leaving the localist word input units? Do words with similar spellings seem to have similar sets of outgoing weights? Are there individual weights that correspond to particular letters or words?

Looking at data

The default functions for writing data in LENS are a bit confusing and not very useful unless you know a text-processing language like Perl or Tcl. I have written two Tcl procedures that write unit activations in a slightly friendlier format. To use these procedures, place the file my_procs.tcl in your working directory. Then, in LENS, type:

source my_procs.tcl

This will add two new commands to LENS. The first writes the activations of specified units in the network after the network has cycled for the maximum number of ticks permitted on a given example. Here is the syntax:

testFinalActs fname {group1 group2 …}

Here fname is the output file name, and group1, group2 and so on are the names of the layers in the network whose activations you want to save. If you only want one layer, you don’t need the curly braces; for more than one layer, you do. This command will run the network on the test example set (there must be a test set loaded), and will write a text file in the working directory with one line for each example in the test environment. The first number in each line indicates the example number; the next number indicates the tick number; and the remaining numbers give the activations of the units in each group specified in the command, in the order specified by the command. In the syntax above, the command will write the activation of unit 0 in group1, then unit 1 in group1, and so on until all group1 units are written, then unit 0 in group2, and so on. When all unit activations have been written, a new line begins giving the activations for the next example.
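For instance, to save the final hidden and spelling activations together (hidden units first, then spelling units, one line per test example; the file name is just illustrative):

testFinalActs final_acts.txt {hidden spelling}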

You can try this now by typing:

testFinalActs stupid_output.txt spelling

…which will write the activations of units in the spelling layer at the end of settling for each example. It takes a moment to run through all the examples, but if you now look in the working directory, there is a file called “stupid_output.txt” that you can open with your favorite text editor or import into your favorite data analysis program.

If you do open it up, it will not look very interesting. You should see that, for each example, only the last letter of the word is activated and all other outputs are basically off. This is because testFinalActs only records the activations of units at the final step of the settling process. This is useful for models that are settling to some final, static pattern: you can quickly record the activations once the network has settled. In the current model, we want to see how the activations are changing over time, so we need more than just the final pattern. This is what the second new command is for:

testAllActs fname {group1 group2 …}

Here the syntax is the same, and the format of the output is the same: the first number of each line shows the example number, the second indicates the tick number, and the remaining figures give the activations of the units in the groups specified. Instead of writing one line per example, however, this command writes a line for every tick. So if an example takes 50 ticks to run, the command will write 50 lines for that example, each line giving the activations after a single tick. The command runs through all items in the test example set. You can try it like this:

testAllActs stupid_output2.txt spelling

This takes longer to run partly because it is writing a lot more data, and partly because I was unable to use a LENS shortcut that came in handy in the other command. Maybe you can write a faster command. But after a moment, the network has run through all the examples and created the output file.

Open it up in your favorite data analysis package; why not use Excel for the time being? The output file is whitespace-delimited, so in Excel choose “delimited” when asked about the text file type, then check “spaces” as one of the possible delimiters. When the file opens, the data will appear in 11 columns: the example number, the tick number, and the activations of the 9 units in the spelling layer. Select the activations of all 9 units, for all the ticks in example 0, and then click on the “Charts” button in Excel. Choose a line graph, then “Next” through to the end. You should now have a graph showing the activations of the 9 units over time for the full example. Hopefully, what you see is the activation for C building and then declining, followed by the activation for A, followed by the activation for T.

Once you have your data in text files you can generally read them into whatever package you like to do whatever kinds of analyses you like. As an exercise, why not try this: write the activations of the hidden units at all points in time; pull out the pattern at 15 ticks for each example (roughly the point where the first letter is maximally activated in the output); read these into SPSS, and do a hierarchical cluster analysis on the patterns. Is there any rhyme or reason to the way the network groups the 20 words in its internal representations at this point? What happens if you do the same thing for hidden unit patterns at 30 ticks?  What about 45 ticks?
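If you would rather do the middle step without leaving LENS, here is a minimal Tcl sketch that writes the hidden activations and then pulls out the tick-15 lines; the file names are just illustrative, but plain Tcl like this should run in the LENS console:

testAllActs hidden_acts.txt hidden

# keep only the lines whose second field (the tick number) is 15
set in  [open hidden_acts.txt r]
set out [open hidden_tick15.txt w]
while {[gets $in line] >= 0} {
    if {[lindex $line 1] == 15} {
        puts $out $line
    }
}
close $in
close $out

You can then read hidden_tick15.txt into SPSS (or anything else) for the cluster analysis.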