Trial Flight
On the Speech Made Visible Project
Michael Wojcik, 14 December 2009

This is a reflection on the plan, progress, success, and shortcomings of the Speech Made Visible project in WRA 417, Fall 2009. I was one of three team members, and focused on the analysis portion of SMV. Matt Penniman worked on the renderer, and Chris Huang on the project documentary. We all contributed to planning, research, design, testing, and presenting our results. The original concept was Matt's.

Sometimes software development is engineering. It begins with analyses of problems, costs, resources; defers to regulations and accepted practices; grows alongside specifications and process documentation; manages itself with planning and tracking and reports carefully compiled and filed.

That's the path software development often has to take. When software controls an airplane, or a medical scanner, or a nuclear reactor, sober caution has to mark its construction. Even humbler applications may demand it. It's expensive to fix a careless bug in the software of ten thousand cell phones or cameras.

But at other times software can be an experiment, a collection of code flung wildly toward an impractical idea. Programmers often fix their eyes on the far distance, seize whatever materials are at hand, and try to cobble together something new and exciting. This sort of programming is itself a form of multimediated text, lashing together related but disjoint elements. It's the spirit of hacker culture; of Douglas Engelbart's famous first demonstration of the GUI; of Marc Andreessen's introduction of the <img> element in HTML, giving us the first real integrated-media web; of Wolfenstein 3D, which popularized real-time 3D graphics in computer games and perhaps more than anything else spurred the commodification of multimedia-capable personal computers.

Sometimes these projects take flight. Sometimes they do so only to melt down and fall apart. At their best, though, software like this is a glorious mess, from first flap through plummet, and a vision of what could be.

Feathers and Wax

“Feather Detail” by bored-now

What Matt proposed was at once elegantly simple and intriguingly complicated: build an application that could show speech on screen. He wanted to create software that could render text in ways that conveyed some of the prosodic features of speech—pitch, intensity, speed—and automate as much of that as possible. We had seen people do very sophisticated versions of this sort of work manually, such as the famous Marcellus Wallace by isnochys, and Tara Rosenberger's Prosodic Font research. What Matt wanted to investigate was how much of that could be automated, to make it accessible to writers who were not graphic designers or visual rhetoricians. That was the idea that captivated me. I had planned to make my final project a continuation of my first one for WRA 417, but Matt's vision was too fascinating to pass up. (As it happened, Chris' documentary work was closely allied to what I had been doing in that first project, so in a sense I made progress on it, too. But more about that later.)
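To make the concept concrete: the renderer's job is, roughly, to turn prosodic measurements into typography. Purely as an illustration (this is not SMV's actual output format, and the particular property mappings are hypothetical), a single word might come out as something like:

    <!-- Illustrative only: one possible mapping of prosodic
         features onto CSS for a single rendered word -->
    <span style="font-weight: 700;        /* louder: heavier weight */
                 vertical-align: 0.4em;   /* higher pitch: raised   */
                 letter-spacing: 0.15em;  /* drawn out: spread wide */
                ">visible</span>

Automating the leap from a sound file to markup like that is the whole game.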

As I explained in the introductory video that Chris created, there were various reasons why I found Matt's idea compelling.

We said from the start that we were only hoping to create a prototype, to get something working, though we knew we'd finish a long way from a polished application that did everything we'd thought of. Even within those limits, ours were ambitious goals for an effort of a few weeks, when we all had many other irons in the fire. Of course the best thing to do under such circumstances is to leap right in with no regard for the difficulties.

We had a goal: take an audio recording and a transcript, and produce a text with that transcript styled to represent the spoken words. We had the outlines of a plan, with avenues for research and experimentation, a loose division of labor, and a rough schedule. We had examples of what had been done before, tools to apply to the problem, and our own fevered ideas. We had five weeks and a lot of unjustified enthusiasm.

From those feathers we made our prototype's wings. We fastened them with the products of our own efforts: my Praat scripts and PHP programs, Matt's stylesheets and scripts, and other bits and pieces, much as Chris edited together his movie from the audio samples and visuals he collected. And after a few tentative flaps—our manual demonstrations, the cycles of enhancements and fixes and tweaks—it took flight.

Sky

“Flight No JL 123” by Pandiyan

So we began with research, moved into experimentation, and then built something we could demonstrate to show the idea would fly. I investigated the analysis of speech, and in particular two problems: finding where words began and ended in an audio clip (so we could match them to the transcript), and extracting information about pitch, intensity, and timing for each word. I used Praat to do the heavy lifting, but I had to learn, on the one hand, the basics of prosodic science, and on the other, how the Praat user and programming interfaces worked. (Matt, concentrating on the renderer, studied various intricacies of CSS styling and JavaScript programming; Chris got to study at the school of A/V production, as well as doing some initial research into various ways people had approached the problem of visually representing speech.)

Using Praat

Praat proved early on to be both informative and a lot of fun; even later, when I was battling its idiosyncratic and often awkward scripting language, it was always interesting and often entertaining to work with. In those first weeks, Praat was indispensable simply for confirming that some of what we hoped to do would be possible. It was trivial to open a recording in Praat and get a graph of the pitch track, then play segments of it to see that, yes, here was a word and there was another one; that the pitch went from this to that; that, in short, we weren't asking for something completely beyond the bounds of what was available.
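A minimal sketch gives the flavor of the scripts the analyzer grew out of. The file name and every threshold below are illustrative stand-ins, not the values the real analyzer uses:

    # Sketch: compute a pitch track, then find candidate "words" by
    # detecting silences. All parameters here are illustrative.
    Read from file... sample.wav
    # Pitch analysis: default time step, search between 75 and 600 Hz
    To Pitch... 0.0 75 600
    # Back to the sound; label sounding/silent stretches in a TextGrid
    select Sound sample
    To TextGrid (silences)... 100 0 -25 0.1 0.05 silence word
    # Report timing and mean pitch for the first sounding interval
    # (interval 2 on tier 1, assuming the clip opens with silence)
    select TextGrid sample
    start = Get starting point... 1 2
    end = Get end point... 1 2
    select Pitch sample
    meanPitch = Get mean... start end Hertz
    printline word at 'start:2'-'end:2' s, mean pitch 'meanPitch:0' Hz

The real analyzer has to do quite a bit more, of course: walk every interval, cope with clips that don't open with silence, and write its results out for the renderer.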

Using Elan

While investigating SIDGrid, a facility for large-scale speech analysis over sizable corpora, I ran into Elan, the second key piece of software in the development of the SMV analyzer. Elan is a tool for annotating multimedia files. It can be used to time captions for video, for example, but what it's really designed for is creating timed transcripts and other annotations for recordings of speech, for archiving and use in later analysis. Elan let us mock up paired recordings and transcripts, so we could see visually how those pieces of data fit together. More importantly, it gave us a file format to adapt for passing data from the analyzer to the renderer. That not only spared us the effort of developing one, but also let me create output from the analyzer for Matt to render before the analyzer itself was done.
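For the curious, here is a radically stripped-down sketch of the kind of markup involved. Real EAF files carry considerably more header machinery, and the tier layout we adapted differs in its details, so treat this as illustrative only:

    <ANNOTATION_DOCUMENT>
      <TIME_ORDER>
        <!-- TIME_VALUE is in milliseconds -->
        <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="130"/>
        <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="610"/>
      </TIME_ORDER>
      <TIER TIER_ID="words">
        <ANNOTATION>
          <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
              TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
            <ANNOTATION_VALUE>speech</ANNOTATION_VALUE>
          </ALIGNABLE_ANNOTATION>
        </ANNOTATION>
      </TIER>
    </ANNOTATION_DOCUMENT>

The appeal is plain: each annotation is just a label tied to a start and end time, which is exactly the shape of the data the analyzer produces and the renderer consumes.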

Around this time we gave our checkpoint presentation, showing our work in progress. We had a basic, partial analyzer that, though it had to be run by hand, automated the analysis of the sound and created the file for the renderer; and a renderer that could be run, with a bit of coaxing, to display the results. For the technical core of the project, this was the greatest milestone: we had demonstrated all the pieces required to make something like what we had proposed.

Of course, we'd proposed an application that anyone could use, not just a handful of technologies that even we couldn't explain very well. Making a pair of wings doesn't mean you're off Crete.

It was time to get serious with the software development. I set up a development environment and a server to run our application at WIDE. Matt put together a first implementation of a web app, and we dropped all our code and samples in. We also began to refine our experimental code at this point, fixing bugs and making it more robust, cleaning it up and adding comments. (This is a hugely important part of creating viable software, especially if it's going to be released as open source.)

Editing the analyzer

Always conscious of Chris' work documenting the project, we also periodically created materials for him. We took screenshots (some of which appeared in our presentations) of our experiments and technologies at work. We occasionally recorded screencasts as we worked, to show what really happens on a developer's machine while software is created. (It makes sausage look better.) Matt and I both recorded some video of ourselves talking about our work; Chris ultimately didn't include any of that footage in the released Introduction video, but we've retained it for later use.

Ultimately, the bulk of the material for the documentary came from the several recording sessions in which Chris interviewed Matt and me. Some of those we did jointly; for others Chris interviewed us one at a time. While this process is always time-consuming (and the editing, of course, orders of magnitude more so), I think the hours we spent doing this contributed hugely to the project. Few software development efforts are documented this way. Newcomers to the SMV project—anyone who finds it interesting and wants to contribute to further development—can watch the Introduction and learn a tremendous amount about the ethos of the project, what we wanted to accomplish and how we set about doing it. And it could be similarly valuable to researchers in other areas, such as software studies. Chris wants to make more videos out of the materials he's gathered, and to gather more of the same; Matt and I are eager to contribute.

As all of this progressed, and we worked on getting our rather basic SMV web application running, we kept an eye on the future direction of the project. One aspect of that, on my part, was creating a series of wireframes for how the web application might look in the first polished, general-use release. I developed those using online software from iPlotz; and though I ran into various issues, such as the system's tendency to boot me out whenever I moved my laptop from one building to another (and consequently changed IP addresses), and its lack of a file-browser element to add to the wireframes, I was able to produce some decent mock-ups fairly quickly.

SMV wireframes: home page, edit page, and experience page mockups

Where we ran into major issues, we were able to resolve them at least to the point of having something usable. Sometimes these were simply the sort of bugs you get from working at a frantic pace. (For a relatively long time, between the checkpoint and final presentations, we'd completely broken the pitch representation in the renderer—and we were so busy working on other features we didn't even notice!) Other issues were thornier. From my perspective, the biggest challenge, after getting the analyzer going, was figuring out how to let users adjust their transcripts to match what the analyzer found, and then implementing that. That led to the edit page, a key piece of functionality for SMV.

It might be nice to claim I was surprised by how much more difficult some things were than we expected. If I was surprised, though, I shouldn't have been; if twenty-odd years in software development have taught me anything (have they?), it's that things always take longer than expected. And while we assumed that the areas we didn't know much about—prosody, using Praat, and so on—would be difficult, we underestimated the work involved (and the number of mysterious, time-consuming bugs) in the parts we thought were well understood, such as creating the web application infrastructure in PHP and HTML.

In the end, we had far more clever ideas than we had any time to try, much less achieve. That's fine. In fact, it's what we hope for a project like this: soar as high as possible, even knowing you won't make it as far as you'd hoped. The view is to die for.

Sun

“Big Sur - Into The Sun” by Maschinenraum

Of course the project is a success, even in its very preliminary and tentative state. We got the technology working, more or less. We created a user interface that may be clunky and a bit confusing, but is certainly usable. And I can't say I'm unhappy with the project's reception in WRA 417, which was embarrassingly enthusiastic. Still, it's worth reviewing where we flew a little close to the sun.

The prosodic analysis we're doing is still very primitive, a combination of Praat's basic pitch analysis and some rather naive heuristics. I've tweaked the analyzer to err on the side of identifying phrases as "words" rather than breaking actual words up, since that simplifies the editing process, but it still does the latter much more often than I'd like. We have many ideas for improving analysis: looking at intensity as well as pitch, examining the transcript for probable syllable boundaries (where false word breaks are likely to appear), and doing some basic speech recognition on the audio. For that matter, I'd like to spend some time reading through the sources I've found on prosodic analysis (I have a variety of online materials and a handful of books from the library that I've only skimmed). So much of our available time had to go to infrastructure work to get the prototype running that it simply wasn't possible to try more than a few of our ideas for analysis.
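As a taste of the first of those ideas: Praat computes intensity contours much the way it computes pitch, so extending the analysis might begin with something like this (parameters again illustrative, not tuned values):

    # Sketch: an intensity track to complement the pitch analysis
    Read from file... sample.wav
    # Minimum pitch 100 Hz, automatic time step, subtract mean pressure
    To Intensity... 100 0.0 yes
    # Mean intensity over the whole clip; per-word queries would pass
    # an interval's start and end times instead of 0 0
    meanDb = Get mean... 0 0 dB
    printline mean intensity: 'meanDb:1' dB

From there the interesting work is in the heuristics: deciding how pitch and intensity evidence should combine to place word boundaries.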

SMV edit page

The user interface for the application could use a lot of work. The editing page is a big step forward from the trial-and-error approach of earlier versions, and its underlying technology (using Praat to split the sound file into individual "words") is key. But having to resubmit the page to try out each change is tiresome, and manually editing the transcript to insert underscore characters is clunky. It'd be great to rewrite this page using a script-based dynamic approach for faster response, and with nicer controls for merging and splitting words.
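Roughly, and glossing over the details of the convention: if the analyzer hears a quickly spoken phrase as a single "word," the transcript has to be edited so its tokens line up with the segments the analyzer found. A contrived example:

    Analyzer segments:  [we're] [going to] [fly]   (3 segments)
    Transcript:         we're going to fly         (4 words; mismatch)
    Edited transcript:  we're going_to fly         (3 tokens; aligned)

Doing that by hand, one resubmission at a time, gets old quickly.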

On a more prosaic note, there are features we simply haven't implemented, such as the login/registration system, which will have to be present before we can advertise SMV even for experimental use. (Otherwise it'll get filled up with bot-posted spam.) Nor have we written any decent help text.

So, picking our feathers up off the waves, where do we go from here?

In the short term, SMV itself is moving to Sourceforge, where it will be available to other open-source developers. We've already created the Sourceforge project and begun to populate it; as soon as we have a chance (probably early January) the SMV code will move there, and we can make an initial release and begin to publicize it. This will accomplish our goal, explicit from the beginning of the project, of producing something other people can use and contribute to.

The code itself should continue to develop in various ways. The problems I mentioned above need attention, whether they're limitations that can be fixed once and for all (like the lack of a login system) or opportunities for refinement that will always be present in some form (such as the sophistication of the analyzer). At the same time, we want to make the entire system easier, more convenient, and more productive for end users. That will include enhancements to the user interface and several new features in the renderer, such as the ability to save the results as HTML plus CSS, styled HTML, or a static image.

I'm also excited to work with Chris on further documentary work. We have all these great materials already, and we should generate more. This is an opportunity to realize what I was playing with in my Code Show project, but for something of broader interest. So it has the potential to be both immediately useful and a model for documenting other software projects.

It's easy to launch ambitious projects, create a prototype, and then become distracted by other ideas. Certainly I've done that all too often (as my hundreds of pages' worth of unfinished dissertation from my first graduate degree attempt can attest). Speech Made Visible won't move on by itself; Matt, Chris, and I are going to have to commit to continuing our work on it. I hope the excitement it's generated, and its public presence on Sourceforge, will help us do that.