Looking at model data with R

This quick tutorial provides some guidance for looking at neural network model data written from LENS using statistical tools available in the free software package R, available here. The tutorial explains how to read data into R, how the data are formatted, and provides some suggestions for looking at model behavior using barplots and various kinds of multidimensional scaling. A short cheat-sheet noting some basic R functions and what they do is available here.

Reading in data

First you will need some neural network activation data saved to a text file, in the format written out by the testAllActs or testFinalActs procedures defined in my_procs.tcl. Then invoke R and use the drop-down menu to change the working directory to the directory where your data are located. Now type this into the command line in R:

tmp <- read.table("filename.txt", header = F)

This puts the data into an object called tmp, which is a 2D table of rows and columns. tmp will be a data frame: a somewhat unusual R object in which each column can hold a different data type (integer, character string, logical, floating-point, etc).

You can see what is contained in tmp by just typing it at the command line:

tmp

This probably dumps a large bunch of data to your screen. A better thing might be just to look at a subset of the rows and columns contained in tmp. You can see how many rows and columns there are like this:

dim(tmp)

…and you can look at the entries in a subset of rows and columns like this:

tmp[1:5,1:10]

Square brackets indicate that you will refer to a subset of the rows and columns of the object. The entry before the comma indicates rows, and the entry after it indicates columns. The colon indicates that you are referring to a range. So the above command says: report rows 1 through 5 for columns 1 through 10 in the object tmp.

If you do this with the output of testAllActs, you will see 10 columns of data for each of the first 5 ticks of the data file. The first two columns will indicate the pattern number and the tick number, respectively. The remaining columns will contain the unit activations for the groups you indicated, in the order listed in the testAllActs command.
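If you don't have a LENS output file handy, the following sketch writes a small made-up file and reads it back in, so you can try the commands above. The layout (pattern number, tick number, then four unit activations) and all of the values are assumptions for illustration only; they are not real LENS output.

```r
# Hypothetical stand-in for a LENS output file: 3 patterns x 4 ticks,
# with columns for pattern number, tick number, and 4 unit activations.
set.seed(1)
fake <- data.frame(pattern = rep(1:3, each = 4),
                   tick    = rep(1:4, times = 3),
                   matrix(round(runif(12 * 4), 3), nrow = 12))

# Write it to a temporary text file with no header row and no row names.
path <- tempfile(fileext = ".txt")
write.table(fake, path, row.names = FALSE, col.names = FALSE)

# Read it back exactly as in the tutorial.
tmp <- read.table(path, header = FALSE)
dim(tmp)        # number of rows, then number of columns
tmp[1:5, 1:4]   # rows 1 through 5, columns 1 through 4
```

Because the file has no header, R names the columns V1, V2, and so on.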

Because tmp is a data frame object, there are some kinds of operations that you can’t perform on it. Specifically, many operations that expect a matrix object will not work on a data frame. The chief difference between a data frame and a matrix is that all columns of a matrix must contain the same kind of data. Luckily, it is easy to convert a data frame to a matrix. To create a matrix that contains *only* the unit activations (and not the pattern and tick number), do this:

activations <- as.matrix(tmp[,3:n])

…where n is the total number of columns in the data frame (ie, the second number returned by dim(tmp)).

NOTE that in this syntax:

tmp[,3:n]

…there is no row number specified. If you omit the rows or the columns, all rows (or all columns) are returned by default. So this part says “Take all the rows and columns 3 through n.” The as.matrix command will try to coerce its argument into matrix form. So all together this command strips out columns 1 and 2 of tmp, converts the rest to a matrix, and assigns the result to an object called activations.
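Here is a minimal sketch of the conversion on a toy data frame. The values are made up, and n is computed with ncol rather than typed in by hand, which is one convenient option rather than the only way to do it.

```r
# Toy data frame in the assumed layout: pattern number, tick number, 3 units.
tmp <- data.frame(V1 = c(1, 1, 2, 2),
                  V2 = c(1, 2, 1, 2),
                  V3 = c(0.1, 0.2, 0.3, 0.4),
                  V4 = c(0.9, 0.8, 0.7, 0.6),
                  V5 = c(0.5, 0.5, 0.5, 0.5))

n <- ncol(tmp)                         # same as the second number from dim(tmp)
activations <- as.matrix(tmp[, 3:n])   # all rows, columns 3 through n

is.matrix(activations)   # check that the result really is a matrix
dim(activations)         # rows, then columns of the new matrix
```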

So now you have a matrix of unit activations!

But wait, you also want to know what the *name* of each pattern is. If you have the most recent version of my_procs.tcl, these names will be written in the second column of the data file. To extract these into a single vector containing a character string for each pattern name, do this:

patnames <- unique(as.character(tmp[,2]))

as.character converts the column to a vector of character strings (rather than a factor, which is how versions of R before 4.0 read text columns by default).

unique pulls out just the unique names—so names that are repeated several times in the column come out just once in the returned vector.
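A small sketch of what these two functions do, using invented pattern names (cat, dog, cab) in the second column of a toy data frame. The column is stored as a factor on purpose, to show why as.character is applied first.

```r
# Toy data: each hypothetical pattern name is repeated once per tick.
tmp <- data.frame(V1 = rep(1:3, each = 2),
                  V2 = rep(c("cat", "dog", "cab"), each = 2),
                  stringsAsFactors = TRUE)

class(tmp[, 2])                              # the column is a factor here
patnames <- unique(as.character(tmp[, 2]))   # one entry per pattern
patnames
```

unique keeps the names in order of first appearance, so patnames lines up with the order of the patterns in the file.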

Looking at the data

One simple way to look at the data is through bar plots. To look at the pattern of activation over a set of units at a single tick of time, try this:

barplot(activations[row, cols], beside = T, ylim = c(0,1))

…where row indicates the row of the matrix you want to plot, cols indicates which columns of the matrix you want to plot (just leave blank to plot all of them), and the beside=T component indicates that you want to plot the bars beside one another (the default is to stack them one on top of another).

ylim is a generic plotting argument that sets the minimum and maximum values for the y axis of the plot. It takes a two-element vector, with the first value indicating the minimum and the second indicating the maximum.
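As a quick illustration, this sketch plots four units from one row of a simulated activation matrix. The random values and the unit labels u1 to u4 are assumptions made up for the example. Note that when you plot a single row, the bars are drawn side by side whether or not beside is given; as far as the barplot help page describes it, beside only changes the picture when a matrix with several rows is plotted at once.

```r
# Simulated activations (an assumption): 3 rows (ticks) x 6 units in [0, 1].
set.seed(1)
activations <- matrix(runif(3 * 6), nrow = 3)

# Plot units 1 through 4 of row 2, with a fixed 0-to-1 y axis.
mids <- barplot(activations[2, 1:4],
                ylim = c(0, 1),
                names.arg = paste0("u", 1:4),   # hypothetical unit labels
                ylab = "Activation")
mids   # barplot returns the x positions of the bar centres
```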

An aside about row indexing in R…

In general in R, you can create a 1-dimensional vector as follows:

c(x, x, x, x, x, ..... )

…putting numbers in place of the x's. So:

c(1,3,5,7,9)

…creates a 1D vector with 5 elements, containing the values 1, 3, 5, 7, 9.

R can also do simple arithmetic with column vectors. For instance, try this:

c(0:4) * 2 + 1

You get back a 5 element vector containing 1, 3, 5, 7, 9. What’s happening? c(0:4) creates a column vector containing the values 0 through 4; it then multiplies every element by 2 (so you get 0,2,4,6,8), then adds 1 to every element, yielding 1,3,5,7,9.

This provides a handy way of pulling items of interest out of a large matrix.

For instance, suppose your matrix contains data from 20 test patterns, and each pattern was processed for 50 ticks. So your data matrix contains 20 * 50 = 1000 rows of data. Suppose further that you are interested in comparing the activations generated by different patterns at tick 15. So, from the matrix you want to pull out row 15, 65, 115, 165, etc…

activations[c(0:19)*50 + 15,]

…which will return rows 15, 65, 115 and so on, up to row 19 * 50 + 15 = 965.
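You can check the index arithmetic on simulated data. This sketch assumes ticks are numbered 1 through 50 within each pattern and that rows are stored pattern by pattern; the variable names n_patterns, n_ticks, and tick are made up for the example.

```r
n_patterns <- 20
n_ticks    <- 50
tick       <- 15

# Row numbers holding the chosen tick for every pattern.
rows <- (0:(n_patterns - 1)) * n_ticks + tick
range(rows)   # first and last row selected

# Simulated matrix whose first column records each row's tick number,
# so we can confirm that the right rows were picked.
set.seed(1)
activations <- cbind(tick  = rep(1:n_ticks, times = n_patterns),
                     unit1 = runif(n_patterns * n_ticks))
picked <- activations[rows, ]
table(picked[, "tick"])   # every selected row should be tick 15
```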

Back to bar plots

Barplots aren’t that useful unless you can compare them. In R it is easy to generate multi-panel plots using the “par” command and setting the mfrow parameter.

par(mfrow = c(4,5))

This tells R to create a 4 x 5 grid on the plotting surface, and to fill it in, with each new plot command, along the rows. (mfrow = multipanel figure plotted by rows). After you have invoked this command, try this:

for (i1 in c(1:20)) barplot(activations[i1,], beside = T, ylim = c(0,1))

The “for” command loops through every value in the specified column vector (in this case c(1:20)), assigns the value to the dummy variable i1, then runs the corresponding barplot command. So the barplot command will be run with i1=1, then i1=2, then i1=3, etc. Each new plot will fill in a new square of the grid created by mfrow, moving along row 1, then row 2, etc. So here in two commands you have generated a display of 20 barplots, representing the first 20 ticks of time in your data.

So what happens if you do this?

for (i1 in (c(0:19)*50 + 15)) barplot(activations[i1,], beside = T, ylim = c(0,1))

Here i1 will loop through all the values specified in (c(0:19)*50 + 15), that is, the numbers 15, 65, 115, 165, etc. So in the scenario above (20 patterns, each associated with 50 ticks), this command would plot the 15th tick of each of the 20 patterns as separate panels in the same figure.
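The following sketch puts the grid and the loop together on a small simulated matrix (6 rows, 5 units, all values made up). It also saves the old par settings and restores them afterwards, which is an optional nicety so that later plots go back to one panel per figure.

```r
# Simulated data (an assumption): 6 rows standing in for ticks, 5 units.
set.seed(1)
activations <- matrix(runif(6 * 5), nrow = 6)

old <- par(mfrow = c(2, 3))   # 2 x 3 grid; par() returns the old settings
for (i1 in 1:6) {
  # One panel per row, each on the same 0-to-1 scale so they are comparable.
  barplot(activations[i1, ], ylim = c(0, 1), main = paste("row", i1))
}
par(old)                      # restore the previous layout
```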

Visualizing distances

Barplots allow you to intuitively see the activations generated for each unit and to visually inspect/compare these, but it can be very difficult to really figure out how the network is representing things just by looking at these. What we really want is some way of measuring how similar the patterns are to one another. Why? Patterns that are similar in a given layer will tend to generate similar activations in the next layer down. If the network is representing some patterns as more similar to one another than to some other set of patterns, this gives us a clue about how the network might generalize, or what information it is encoding in its internal representations.

There are a range of different ways of trying to visualize the similarity structure of a model’s representations, but all of them depend upon first measuring the similarity between every pair of patterns. One simple measurement of similarity is the Euclidean distance–the square root of the sum of the squared differences in activation on every unit. R has a built-in function for computing the distances between every pair of rows in a matrix:

dist(M)

…will return an object that contains the Euclidean distance between every possible pair of rows in the matrix M. If M has n rows, then there are n(n-1)/2 possible distances. The dist function returns a special type of data structure, namely a *distance* data structure, that will work with many of the built-in data visualization tools in R. Unfortunately, the distance structure is not easy to interpret just by inspection. Instead, it is often useful to convert it to a matrix:

as.matrix(dist(M))

…will return the distances in matrix form. If M has n rows and m columns, the distance matrix will be of dimension n x n, and each cell will contain the Euclidean distance between the corresponding pair of rows in M. For instance, cell (5,10) for the distance matrix will contain the Euclidean distance between row 5 and row 10 of the original matrix M.
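A tiny worked example makes this concrete. The three rows below are made-up points chosen so the distances come out as whole numbers (a 3-4-5 triangle); they are not model data.

```r
# Three made-up "patterns" over two units.
M <- rbind(a = c(0, 0),
           b = c(3, 4),
           c = c(6, 8))

d <- dist(M)        # compact distance object: n(n-1)/2 = 3 entries
length(d)

D <- as.matrix(d)   # full 3 x 3 matrix, labelled by row name
D

# Check one entry by hand: square root of the sum of squared differences.
sqrt(sum((M["a", ] - M["b", ])^2))
```

The hand-computed value matches the cell D["a", "b"], and the matrix has zeros on the diagonal.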

To see this, first let’s make a matrix containing the pattern of activation for each input at tick 15 for the Spelling network:

tick15.rep <- activations[c(0:19)*51 + 15,]

To make a matrix containing the Euclidean distances between rows, try:

tick15.dist <- as.matrix(dist(tick15.rep))

Check the dimensionality of this matrix:

dim(tick15.dist)

…should be 20, 20 if there are 20 patterns in the dataset.

You can look at the actual numbers in the matrix in the usual way:

tick15.dist[1:10,1:10]

…will show you the first 10 rows and columns. Note the values on the diagonal are all zero, because they contain a given item’s distance to itself, which is always zero. Also note that the matrix is symmetrical–the value in row r column c is the same as that in row c column r. This is because distances are symmetrical—if pattern 1 is 5 away from pattern 2, then pattern 2 has to be 5 away from pattern 1.

So what can you do with this distance matrix? The numbers themselves are not much more informative than the barplots, but visualizing the matrix can be useful. Try this:

image(tick15.dist[,n:1])

…where n=total number of columns. This will generate a “heat” plot of the matrix. By default the image plotted is “flipped” for reasons I don’t understand, but you can flip it back by reversing the order of the columns (this is what the [,n:1] does above).
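On the orientation: the help page for image describes the x axis as corresponding to row number and the y axis to column number, with column 1 at the bottom, so the picture is rotated relative to how a matrix prints. The sketch below, on four made-up points, reverses the columns and relabels the axes so the heat plot reads like the printed distance matrix. Because a distance matrix is symmetric, reversing the columns is enough here; a non-symmetric matrix would also need transposing.

```r
# Four made-up points forming two tight pairs.
D <- as.matrix(dist(rbind(c(0, 0), c(1, 0), c(5, 5), c(6, 5))))
n <- ncol(D)

# Reverse the column order so that column 1 is drawn at the top.
image(1:n, 1:n, D[, n:1], axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:n, labels = 1:n)   # x axis: index left to right
axis(2, at = 1:n, labels = n:1)   # y axis: index top to bottom
```

The zero-distance diagonal now runs from top left to bottom right, as it does in the printed matrix.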

Hierarchical cluster plots

plot(hclust(dist(M)), labels = patnames)

This will plot a hierarchical cluster plot of the Euclidean distances between the rows of data matrix M, and will label the leaves with the names stored in patnames. If the rows of M are labelled with the correct names, you don’t need to supply the labels = patnames argument.
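Here is a small sketch with labelled rows. The four pattern names and the activation values are invented so that two obvious clusters exist. hclust uses complete linkage unless you pass a different method argument, and cutree is an optional extra that reports which cluster each row falls into.

```r
# Made-up activations: the first two rows are similar, as are the last two.
M <- rbind(c(0.9, 0.1, 0.1),
           c(0.8, 0.2, 0.1),
           c(0.1, 0.9, 0.8),
           c(0.2, 0.8, 0.9))
patnames <- c("cat", "cab", "dog", "dig")   # hypothetical pattern names
rownames(M) <- patnames                     # label the rows of M directly

hc <- hclust(dist(M))   # complete linkage by default
plot(hc)                # leaves are labelled from the row names

groups <- cutree(hc, k = 2)   # cut the tree into two clusters
groups
```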

Multidimensional scaling plots

data.mds <- cmdscale(dist(M))

This will create a 2-dimensional classical multidimensional scaling (hence cmdscale) of the distances between the rows in data matrix M. You can then plot the solution in 2 dimensions:

plot(data.mds)

…Of course, you won’t know which points correspond to which patterns in M. You can add text to the plot as follows:

text(data.mds, labels = patnames)

…where patnames is a vector containing the set of labels you want to plot, ordered the same way the rows in M were ordered.
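A minimal end-to-end sketch, again with invented pattern names and activation values. Passing type = "n" to plot sets up the axes without drawing points, so the labels are not printed on top of them; that is a stylistic choice, not a requirement.

```r
# Made-up activations: two pairs of similar patterns.
M <- rbind(c(0.9, 0.1, 0.1),
           c(0.8, 0.2, 0.1),
           c(0.1, 0.9, 0.8),
           c(0.2, 0.8, 0.9))
patnames <- c("cat", "cab", "dog", "dig")   # hypothetical pattern names

data.mds <- cmdscale(dist(M))   # 2 dimensions by default
dim(data.mds)                   # one row per pattern, one column per dimension

# Empty axes first, then the pattern names at the MDS coordinates.
plot(data.mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(data.mds, labels = patnames)
```

Patterns with similar activations should land near each other in the plot.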