CS 2, Winter 2008
Programming for Interactive Digital Arts

Feb 25: Text and web


Strings and characters

Text is represented in Processing with a type of object called a String. We've actually been using strings for quite some time now, for displaying and printing text. Strings are just arbitrary lists of letters, numbers, etc., enclosed in double quotes, e.g., "Hello, there". We can declare a variable to be of type String, and then assign these values to it, e.g., String msg="hello, there";.

An individual letter, number, etc. in a string is a char (character). A character is enclosed in single quotes, e.g., 'H'. We've actually been using characters, too, for testing keyboard input (e.g., if (key=='Q') { ... }). We can declare char variables and assign values to them (e.g., char letter='h';).

We can think of a string as an array of chars. The method String.charAt() gets a character at a position in a string, just like we can use square brackets to access an element at a position in an array. We can't update characters in strings that way, though. Alternatively, we can convert a string to an array using the method toCharArray(), which returns an array of char. We can convert an array of char to a string with the constructor (e.g., String s = new String(chars)).
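Here is a minimal sketch of that round trip, written as plain Java (which Processing is built on) so it can run outside a sketch. The class name CharArrayDemo and the helper replaceCharAt() are made up for illustration; charAt(), toCharArray(), and the String constructor are the calls described above.

```java
class CharArrayDemo {
  // Returns a copy of s with the character at index i replaced by c.
  // The original string is not changed.
  static String replaceCharAt(String s, int i, char c) {
    char[] chars = s.toCharArray();  // copy the characters into an array
    chars[i] = c;                    // an array element can be updated
    return new String(chars);        // build a new string from the array
  }

  public static void main(String[] args) {
    String msg = "hello, there";
    System.out.println(msg.charAt(0));               // h
    System.out.println(replaceCharAt(msg, 0, 'j'));  // jello, there
    System.out.println(msg);                         // hello, there
  }
}
```

The last line prints the original text because the conversion works on a copy; the string itself never changes.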

Reas & Fry have a couple of examples, 36-07 and 36-08, illustrating the view of strings as arrays of chars. 36-07 animates each character independently of the others, changing size (using textSize()) periodically. 36-08 stages the entry of letters of a string -- the first letter moves up, then the second, etc. The function textWidth() gives the width that will be occupied by a given character, so that the proper position of the next one can be calculated. Given the ability to take apart a string into individual characters, we could do all kinds of such animations on them (e.g., have a Ball or Spring object for each, applying forces).

We can view characters more abstractly, too. Cinematic Particles provides a very interesting visualization of different movies, by controlling particles according to the dialogue (extracted from subtitles). That is, the letters control the acceleration and the particle size, according to specific rules.

Strings support a number of methods beyond looking at their individual characters. To compare the contents of two strings, we cannot rely on ==, which only checks whether the two sides are the very same object; instead, we must use the String.equals() method; e.g., "abc".equals("abc") would return true. This is useful in seeing if a variable has a particular value, e.g.,

if (msg.equals("hello, there")) 
  println("hello to you, too");
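To see the difference between == and equals(), here is a small plain-Java sketch. The class name EqualsDemo and the helper copyOf() are invented for the example; the helper just builds a second string object holding the same letters.

```java
class EqualsDemo {
  // Builds a brand new String object with the same characters as s.
  static String copyOf(String s) {
    return new String(s.toCharArray());
  }

  public static void main(String[] args) {
    String a = "hello, there";
    String b = copyOf(a);
    System.out.println(a.equals(b));  // true: same letters
    System.out.println(a == b);       // false: two different objects
  }
}
```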

Other methods of the String class allow us to see where a letter is in a string, get the length of a string, get an all-uppercase or all-lowercase version of a string, or extract just a substring. There are even more methods described in the underlying Java documentation for strings.
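A few of those methods in action, as a plain-Java sketch. The class StringMethodsDemo and the helper firstWord() are made up for illustration; indexOf(), length(), toUpperCase(), toLowerCase(), and substring() are the standard String methods.

```java
class StringMethodsDemo {
  // Returns the part of s before the first comma, or all of s if there is none.
  static String firstWord(String s) {
    int comma = s.indexOf(',');        // -1 if there is no comma
    if (comma == -1) return s;
    return s.substring(0, comma);      // characters 0 .. comma-1
  }

  public static void main(String[] args) {
    String msg = "Hello, there";
    System.out.println(msg.length());       // 12
    System.out.println(msg.indexOf('t'));   // 7
    System.out.println(msg.toUpperCase());  // HELLO, THERE
    System.out.println(msg.toLowerCase());  // hello, there
    System.out.println(msg.substring(7));   // there
    System.out.println(firstWord(msg));     // Hello
  }
}
```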

Splitting and matching

Processing has some other functions (not methods) useful for processing strings; these are in the "String Functions" subsection of the "Data" section of the Processing reference. In addition to splitting a string into characters, it's useful to be able to split it into words. The functions split() and splitTokens() do that. They take a string and return an array of strings, broken according to a rule. For split(), the string is broken wherever a given substring is found, while for splitTokens(), the string is broken wherever one or more of a set of characters is found. In both cases, the part used for breaking is discarded. The match() function is more powerful: rather than breaking a string apart, it uses "regular expressions" to find the parts that match a particular pattern. I'm glad to tell interested people more about it.
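The three ideas can be tried outside Processing with rough plain-Java stand-ins. This is only a sketch: splitOn(), tokens(), and firstMatch() are invented helper names, built on Java's own String.split(), StringTokenizer, and regular expression classes, and they may not agree with Processing's functions in every corner case (empty pieces, for instance).

```java
import java.util.Arrays;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
  // Break s wherever the exact substring delim appears (the split() idea).
  static String[] splitOn(String s, String delim) {
    return s.split(Pattern.quote(delim));
  }

  // Break s wherever one or more of the characters in delims appear
  // (the splitTokens() idea).
  static String[] tokens(String s, String delims) {
    StringTokenizer st = new StringTokenizer(s, delims);
    String[] out = new String[st.countTokens()];
    for (int i = 0; i < out.length; i++) out[i] = st.nextToken();
    return out;
  }

  // Return the first part of s matching the regular expression, or null.
  static String firstMatch(String s, String regex) {
    Matcher m = Pattern.compile(regex).matcher(s);
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(splitOn("a, b, c", ", ")));
    System.out.println(Arrays.toString(tokens("well, then... go!", " ,.!")));
    System.out.println(firstMatch("CS 2, Winter 2008", "[0-9]{4}"));
  }
}
```

The pattern [0-9]{4} means "four digits in a row", so the last line picks the year out of the course title.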

Reas & Fry 36-06 demonstrates one thing that can be done with an array of strings -- staging the display of words of a message. Try substituting a split() for the initial array. As we saw with character arrays, the words could be represented and manipulated in many different ways.

Reas & Fry 46-07 has a bigger example of string manipulation, reading in a whole book (from Project Gutenberg), counting the number of words, and displaying the long words. The loadStrings() function loads a file (in the data folder), representing each line as a separate string. Thus the file produces an array of strings. A loop then goes line by line, splitting each line into words. (There's a special format for Gutenberg books, so that the body of the text is within particular delimiters, detected by the startsWith() method.)

Recall that the "+" operation can combine two strings (e.g., "abc"+"def" is "abcdef"). The join() function lets us put back together a whole array of strings, separated by some given string (e.g., a comma). Between taking a string apart and putting it back together, we can process it however we like. For example, the following sketch is slightly modified from Reas & Fry 46-07, so that each word is translated into pig latin, and then the lines are reconstituted and saved out to a file. The saveStrings() function works like the inverse of loadStrings(), writing out each string in an array as a separate line in a file.

String[] lines = loadStrings("2895.txt");
// A processed copy of lines, with each line translated into pig latin
String[] processedLines = new String[lines.length];

boolean started = false; // Ignore lines until the *** START line

for (int i = 0; i < lines.length; i++) {
  processedLines[i] = "";
  if (lines[i].startsWith("*** START")) { // Start parsing text
    started = true;
  } else if (lines[i].startsWith("*** END")) { // Stop parsing text
    started = false;
  } else if (started == true) { // If we're in the useful region
    // List of characters and punctuation to ignore between
    // letters. WHITESPACE is all the whitespace characters
    String separators = WHITESPACE + ",;.:?()\"-";
    // Split the line anywhere that we see one or more of
    // these separators
    String[] words = splitTokens(lines[i], separators);
    // Go through the list of words on the line
    for (int j = 0; j < words.length; j++) {
      String word = words[j].toLowerCase();
      // Pig Latin rules (I think)
      char first = word.charAt(0);
      if (first == 'a' || first == 'e' || first == 'i' || first == 'o' || first == 'u')
        words[j] = word + "ay";
      else
        words[j] = word.substring(1) + first + "ay";
    }
    processedLines[i] = join(words, " ");
  }
}

saveStrings("pl.txt",processedLines);

A more complex (but perhaps more informative) processing is to compute the frequencies of the various words in the text. Such frequencies are at the heart of information retrieval (finding a document or page with similar content to a given one). The following sketch computes the frequencies and then displays some words with font size scaled according to frequency. The most frequent words ("the", "a", "and", etc.) tend not to be that interesting, and the least frequent ones aren't that important in describing the document. Rather than trying to use a dictionary to find just, say, nouns, verbs, and adjectives, as well as eliminate "too common" words, the sketch just lists all words within a given range of frequency.

// How frequent words must be, to be worth showing
int minF=50, maxF=100;

PFont font;  // It wasn't working unless I set the font every time

// The words and their frequencies -- words[i] appears counts[i] times
String[] words;
int[] counts;

// Indices of those words with the right frequencies
int[] goodIndices;
int lowF, highF;  // min and max frequencies within the good words

void setup()
{
  size(800,600);
  font = loadFont("Andalus-24.vlw");
  background(196);

  // Get the words and put them in alphabetical order
  String[] allWords = extractWords("2895.txt");
  allWords = sort(allWords);

  // Since they are in order, copies of the same word are next to each other.
  words = new String[allWords.length];
  counts = new int[allWords.length];
  int numWords = 0;
  // Definitely have at least one copy of the first word
  words[0] = allWords[0]; 
  counts[0] = 1;
  // Now go through remaining words
  for (int a=1; a<allWords.length; a++) {
    if (allWords[a].equals(allWords[a-1])) {  // another copy
      counts[numWords]++;
    }
    else {  // new word
      numWords++;
      words[numWords] = allWords[a];
      counts[numWords] = 1;
    }
  }
  // Truncate the arrays to the number of words actually found
  words = subset(words, 0, numWords+1);
  counts = subset(counts, 0, numWords+1);
  
  // Print the words and frequencies
  for (int i=0; i<words.length; i++) 
    println(words[i]+":"+counts[i]);
  println("***");
  
  findGoodWords();
  showGoodWords();
}

void draw()
{
  if (frameCount % 1000 == 0) {
    background(196);
    showGoodWords();
  }
}

// Sets the indices array to say which members of the counts array
// have counts between minF and maxF.  
// Sets lowF and highF to the min and max observed frequencies within the range
void findGoodWords()
{
  goodIndices = new int[counts.length];
  lowF = maxF; highF = minF;  // start them as opposites, so will be updated
  int n = 0;
  for (int i=0; i<counts.length; i++) {
    if (counts[i] >= minF && counts[i] <= maxF) {
      goodIndices[n] = i;
      if (counts[i] < lowF) lowF = counts[i];
      if (counts[i] > highF) highF = counts[i];  // not "else if": one count may update both
      n++;
    }
  }
  goodIndices = subset(goodIndices,0,n);
}

// Display the words that have the right frequencies
void showGoodWords()
{
  fill(0,100);
  textFont(font);
  for (int i=0; i<goodIndices.length; i++) {
    int j = goodIndices[i];
    // Size according to frequency
    textSize(map(counts[j],lowF,highF,9,48));
    text(words[j],random(width),random(height));
  }
}

// Returns an array of all the words in the Gutenberg text
String[] extractWords(String filename)
{
  String[] lines = loadStrings(filename);
  
  // Create an array with a guessed size (10 words per line); will expand as needed
  String[] words = new String[10*lines.length];

  // List of characters and punctuation to ignore between
  // letters. WHITESPACE is all the whitespace characters
  String separators = WHITESPACE + ",;.:?()\"-";

  boolean started = false; // Ignore lines until the *** START line
  int currentWord = 0; // How far into words array we are

  for (int l=0; l<lines.length; l++) {
    if (lines[l].startsWith("*** START")) { // Start parsing text
      started = true;
    }
    else if (lines[l].startsWith("*** END")) { // Stop parsing text
      started = false;
    }
    else if (started) { // If we're in the useful region
      // Split the line anywhere that we see one or more of
      // the separators
      String[] wordsOnLine = splitTokens(lines[l], separators);
      while (currentWord + wordsOnLine.length > words.length) // need more space
        words = expand(words);  // keep growing until the whole line fits
      // Go through the list of words on the line
      for (int w=0; w<wordsOnLine.length; w++) {
        words[currentWord] = wordsOnLine[w].toLowerCase();
        currentWord++;
      }
    }
  }
  
  // Truncate the array to the actual size used
  return subset(words,0,currentWord);
}

There's a lot going on in this sketch. The extractWords function basically packages up the core of the Reas & Fry 46-07 sketch, marching through the lines of an input file, and extracting all the words into one big array. We don't know how big the array has to be, so we make a guess when we create it, then expand() it if we're going to run out of room, and finally subset() it back to just the right size when we're done.
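The guess, grow, and trim pattern is easier to see on a tiny problem. Here is a plain-Java sketch; the class GrowDemo and the method evens() are made up for the example, and Arrays.copyOf() plays the roles that expand() and subset() play in the sketch above.

```java
import java.util.Arrays;

class GrowDemo {
  // Collects the even numbers in values, without knowing ahead of time
  // how many there will be.
  static int[] evens(int[] values) {
    int[] found = new int[2];  // deliberately small first guess
    int n = 0;                 // how many slots are actually used
    for (int v : values) {
      if (v % 2 == 0) {
        if (n == found.length)                          // out of room:
          found = Arrays.copyOf(found, 2 * found.length);  // double the size
        found[n] = v;
        n++;
      }
    }
    return Arrays.copyOf(found, n);  // trim to the part actually used
  }

  public static void main(String[] args) {
    int[] result = evens(new int[] {1, 2, 4, 5, 6, 8, 10});
    System.out.println(Arrays.toString(result));  // [2, 4, 6, 8, 10]
  }
}
```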

Once we've extracted all the words in the document, back in setup(), we put them in alphabetical order using sort(). Now the multiple copies of the same word are one right after another -- the array will have lots of "a"s, then maybe an "aardvark" or two, then, ..., down to some "zulu"s or whatever. So with a for-loop, we can count the number of duplicates just by noticing when the word we're looking at now differs from the preceding word. We store the unique words and their counts in arrays of those names (words and counts).
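The sort-then-count idea can be tried on a handful of words. This plain-Java sketch is a variation on the loop in setup(), not a copy of it: the class CountDemo and the method countWords() are invented names, and it returns "word:count" strings rather than filling two arrays.

```java
import java.util.Arrays;

class CountDemo {
  // Returns one "word:count" string per distinct word, in sorted order.
  static String[] countWords(String[] allWords) {
    if (allWords.length == 0) return new String[0];
    String[] sorted = allWords.clone();
    Arrays.sort(sorted);  // copies of the same word are now adjacent
    String[] result = new String[sorted.length];
    int numWords = 0;     // distinct words recorded so far
    int count = 1;        // copies seen of the current word
    for (int a = 1; a <= sorted.length; a++) {
      if (a < sorted.length && sorted[a].equals(sorted[a - 1])) {
        count++;          // another copy of the same word
      } else {            // the run of sorted[a-1] has ended
        result[numWords] = sorted[a - 1] + ":" + count;
        numWords++;
        count = 1;
      }
    }
    return Arrays.copyOf(result, numWords);  // trim to the part used
  }

  public static void main(String[] args) {
    String[] text = {"the", "cat", "and", "the", "hat", "and", "the"};
    System.out.println(Arrays.toString(countWords(text)));
    // [and:2, cat:1, hat:1, the:3]
  }
}
```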

We only show words that have frequencies within the specified range. If words[276] shows up 75 times (i.e., counts[276] == 75), then 276 will be added to the goodIndices array. Then to show just the good words, we iterate over the goodIndices, getting the appropriate word and scaling the font according to its frequency. The good words are shown in random positions, first from setup() and then again every 1000 frames from the draw loop.
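The font scaling is a straight-line conversion from the frequency range to the size range. Here is a plain-Java sketch of that arithmetic, assuming map() does ordinary linear interpolation; the class MapDemo and its own map() method are stand-ins written for this example, not Processing's code.

```java
class MapDemo {
  // Converts value from the range start1..stop1 to the range start2..stop2.
  static float map(float value, float start1, float stop1,
                   float start2, float stop2) {
    return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
  }

  public static void main(String[] args) {
    // Frequencies from 50 to 100 become text sizes from 9 to 48
    System.out.println(map(50, 50, 100, 9, 48));   // 9.0
    System.out.println(map(75, 50, 100, 9, 48));   // 28.5
    System.out.println(map(100, 50, 100, 9, 48));  // 48.0
  }
}
```

One thing to watch for under this assumption: if every good word has the same frequency, then lowF equals highF, the division is 0/0, and the result is not a usable text size.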

Browsing and searching

Both loadStrings() and loadImage() can take a URL instead of a filename. That's handy, but by itself isn't really much better than having the files in the data folder. The real power of the ability to load from the web comes when we use the web as the web -- search, follow links, etc.

The Switchboard contributed library (installed in the Mac lab; easy to do yourself) provides some basic abilities to do some web stuff within Processing. Here's a simple example of doing a Yahoo web search for the string "dartmouth". The structure of the sketch should look pretty natural by now -- import the library, declare a variable that will do the work, create it in setup() (passing this), and provide a function to handle the input we get. Here the input is a result from Yahoo, rather than a video frame or whatever. The Switchboard manual describes what we can get from the input; the example just prints the title and URL. There is one special thing required to use these web services: get a developer key (just a big string) and plug it into the sketch. The Switchboard documentation describes how to do that.

import org.switchboard.*;
 
Switchboard board;

void setup()
{
  board = new Switchboard(this);
  board.setYahooKey("INSERT YOUR KEY HERE");
 
  board.yahooWeb("dartmouth");
}

void draw()
{
}

// Called for each search result
void resultReceived()
{
  println(board.yahooWeb.getTitle() + " -- " + board.yahooWeb.getUrl());
}
 
// Called when results finished
void endOfResults()
{
  println("done");
}

The Switchboard "quick start" has an example (go down to step 8) that queries Amazon for albums from an artist (insert your favorite into the query), and gets and displays the images of the cover art. Note that Processing has changed from "framerate" (in the example) to "frameRate". One other new thing in the sketch is the use of an ArrayList. ArrayList is a Java class that serves as an expandable array (we never have to worry about expanding it). It's convenient for sketches like this where we have no idea ahead of time how many things will go in the array. (We handled this above by expanding and subsetting an array, but that's a pain.) Rather than brackets to access the array, we use the methods add (to put at the end), set (to put at a particular index), and get (to get from a particular index). The size method tells us how many elements there are. There are also a bunch of other methods, described in the Java documentation. One painful thing in the version of Java used in Processing: we have to say what type of thing we're getting out of the ArrayList, by putting the class in parentheses before get, e.g., PImage img = (PImage)images.get(i);.
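Here is a small plain-Java sketch of those ArrayList methods. It stores strings rather than PImage objects so that it runs outside Processing, and it declares the list as ArrayList<Object> to imitate a list that doesn't know what it holds, which is why the cast is needed. The class ListDemo and the method joinAll() are invented for the example.

```java
import java.util.ArrayList;

class ListDemo {
  // Glues together every element of the list, casting each one to String.
  static String joinAll(ArrayList<Object> list) {
    String result = "";
    for (int i = 0; i < list.size(); i++) {
      String s = (String) list.get(i);  // say what type we're getting out
      result = result + s;
    }
    return result;
  }

  public static void main(String[] args) {
    ArrayList<Object> titles = new ArrayList<Object>();
    titles.add("first ");     // goes at the end (index 0)
    titles.add("second");     // goes at the end (index 1)
    titles.set(0, "FIRST ");  // replaces what was at index 0
    System.out.println(titles.size());    // 2
    System.out.println(joinAll(titles));  // FIRST second
  }
}
```

In newer versions of Java, declaring the list as ArrayList<String> tells it what it holds, and the cast is no longer needed.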

The Warhol sketch uses this idea to great effect, producing an ever-changing (and timely) version of Warhol's cans of soup.

Switchboard also provides a Browser class that can extract some information (links, image names) from a web page (see the documentation for details). An example gets images from a given URL. The match() function can be used to get only images matching a certain pattern, e.g., perhaps we only want those of type jpg whose URL doesn't include "/ads/":

if (match(imgURL,"jpg") == null || match(imgURL,"/ads/") != null) 
  println("ignoring "+imgURL);
else {
  println("adding "+imgURL);
  ...
}

The match() function returns an array of the substrings that match the pattern, or null if there is no match; here we ignore the array's contents and just see if there is no match for "jpg" or there is a match for "/ads/".
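The same test can be tried outside Processing with Java's regular expression classes. In this sketch the class ImageFilterDemo, the helpers found() and keep(), and the example.com URLs are all made up for illustration; found() stands in for checking whether match() gives back something other than null.

```java
import java.util.regex.Pattern;

class ImageFilterDemo {
  // True if the regular expression matches somewhere in s.
  static boolean found(String s, String regex) {
    return Pattern.compile(regex).matcher(s).find();
  }

  // Keep only URLs that mention "jpg" and do not mention "/ads/".
  static boolean keep(String imgURL) {
    return found(imgURL, "jpg") && !found(imgURL, "/ads/");
  }

  public static void main(String[] args) {
    System.out.println(keep("http://example.com/pics/cat.jpg"));    // true
    System.out.println(keep("http://example.com/ads/banner.jpg"));  // false
    System.out.println(keep("http://example.com/pics/logo.png"));   // false
  }
}
```

Note that the pattern "jpg" matches anywhere in the URL, not just at the end; a stricter pattern such as "\\.jpg$" would insist on the file extension.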

Practice problems

  1. Connect with springs letters of a string or words of a sentence. [hints]
  2. Devise your own drawing language, converting words in a text into drawing commands. [hints]
  3. User interface: allow user control over the desired frequency range for words in a text.
  4. Show some images from each of the different "hits" for a particular search. [hints]