Rhetoric & Writing Portfolio Michael Wojcik Estimating Ethos

In the Estimating Ethos project I'm investigating ways of computationally estimating an author's ethos in a collection of texts. This project was suggested and supervised by Bill Hart-Davidson, who also provided many helpful materials. It was one of several suggestions he made for an independent study project, and it immediately caught my interest because it sits at the intersection of rhetorical theory and computational practice.

The products of this project fall into three main categories: the theory behind ethos estimation, the prototype software that implements some of that theory, and the analysis of the results. For now, I'm calling the prototype ethos-estimation application simply EE (for "Ethos Estimator"), a term I'll use below.

One other terminological note: I occasionally use ethic as the adjectival form of ethos, to avoid confusion with the more common "ethical".

Definition

Bill's original formulation of the problem was:

An algorithm (or two, or three) that facilitates formative assessment re: ethos as a complex function of interactions with others in a social networking (broadly speaking) system. Of course, making it about ethos means that it requires some interesting approaches to identity/subjectivity to avoid simplistic reification.

After some thought and discussion, I refined this into the following tasks:

  1. Define the goal of the project. What results might constitute an interesting and useful outcome?
  2. Consider definitions of ethos and devise one that fits this project: relatively simple and clear, and amenable to some kind of algorithmic heuristic. Most likely this will involve identifying possible proxies for ethos, such as citation (that is, a contributor who is cited may be assumed to have some ethic standing with the contributor doing the citing).
  3. Study some of the extant research relevant to obtaining this sort of information from text. (Along with the next item, this constituted most of the reading portion of the independent study.)
  4. List the attributes of a system that might be able to provide a useful estimate of ethos, and consider what sort of system design could provide them.
  5. Implement a prototype of such a system, for a limited domain (eg a single type of text, such as email, from a fairly constrained domain, such as a technical listserv).
  6. Examine the results, and evaluate the outcome of the project. Present these findings.

Defining Key Terms

For this project to be feasible at all, I needed to define key terms such as ethos and contributor in ways that were theoretically satisfactory, specific and concrete enough to guide the software design and make the results meaningful, and yet general enough to be usable and to avoid introducing excessive noise into the results. For example, if I tried to define a "contributor" as necessarily a single human author, I would have to filter the data to remove texts produced by groups, institutions, etc.

Starting from Aristotle's original definition of ethos as an appeal to the goodwill of the audience, I settled on a definition of ethos as weight in a reputation network. I discuss that further below.

I left the definition of a contributor (that is, an actor to which some text can be ascribed, and which can be assigned an ethos ranking) deliberately vague. EE is designed as an application of a more general text-processing engine I've designed (tentatively called Textmill), which makes it relatively easy to delegate decisions like "what is a contributor?" to different modules. Thus the definition of a contributor will not be hard-coded into the system; instead, any module (and the system can contain an arbitrary collection of modules) can add contributor-identifying metadata to the context of a document when it believes it has identified a contributor.

Similarly, the system itself does not restrict what constitutes a "text". Instead, any module can serve as a source of texts, which might be retrieved from some collection or database, or pulled from a feed, etc. Such "source" modules will label the texts they add to the system with metadata indicating provenance, format, and so forth, and other modules which accept texts in that format will then be triggered to process them. Modules may create texts from other texts — for example, an HTML-parsing module might accept HTML texts and produce plain-text versions of them for use by modules that only handle plain text.
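As an illustration only, a source module and a text-transforming module might look something like the following sketch. Every name here (inlineSource, htmlToPlain, the metadata fields) is hypothetical rather than part of Textmill, and the tag-stripping expression is deliberately naive.

```javascript
// Hypothetical sketch; none of these names come from Textmill itself.

// A "source" module: yields texts labelled with provenance and format metadata.
const inlineSource = {
  name: "inline-source",
  produce: function () {
    return [{
      content: "<p>Has anyone seen this <b>error</b> before?</p>",
      metadata: { format: "text/html", provenance: "inline-source" }
    }];
  }
};

// A transforming module: accepts HTML texts and derives plain-text versions.
// The regular expression simply drops anything that looks like a tag.
const htmlToPlain = {
  name: "html-to-plain",
  accepts: "text/html",
  process: function (text) {
    return {
      content: text.content.replace(/<[^>]*>/g, "").trim(),
      metadata: {
        format: "text/plain",
        provenance: text.metadata.provenance,
        derivedFrom: text // keep a link back to the original text
      }
    };
  }
};

// Route each text to the transformer only if its format matches.
const htmlTexts = inlineSource.produce();
const plainTexts = htmlTexts
  .filter(function (t) { return t.metadata.format === htmlToPlain.accepts; })
  .map(function (t) { return htmlToPlain.process(t); });
```

The derived text keeps its provenance and a reference to the text it came from, so later modules could still trace an estimate back to the original document.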

The ethos metric (or possibly metrics) itself will be represented in the system as a fuzzy value (or values) with optional metadata. This simplifies reconciling estimates made by different processing modules: if all modules produce fuzzy estimates (eg "it is unlikely this contributor has high ethos in this context", in effect), then the values don't have to be normalized to the same scale. Attaching optional metadata, which can include for example context information ("this estimate applies to subject matter X"), lets modules offer additional qualifications when they can, without forcing them to all make judgements that may not be possible for certain kinds of texts.
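One way to picture such fuzzy estimates is sketched below. This is an assumption-laden illustration, not EE's data model: the function names, the use of a 0-to-1 degree of membership in "high ethos", and the choice to reconcile estimates by simple averaging are all hypothetical.

```javascript
// Hypothetical sketch: not EE's actual representation.
// A fuzzy estimate is a degree of membership in "high ethos" (0 to 1),
// plus optional metadata qualifying the estimate.
function makeEstimate(contributor, degree, metadata) {
  if (typeof degree !== "number" || degree < 0 || degree > 1) {
    throw new RangeError("degree must be a number between 0 and 1");
  }
  return { contributor: contributor, degree: degree, metadata: metadata || {} };
}

// Reconcile estimates from several modules by averaging the degrees
// reported for each contributor. Averaging is an assumed combining rule.
function reconcile(estimates) {
  const sums = {};
  for (const e of estimates) {
    if (!sums[e.contributor]) sums[e.contributor] = { total: 0, count: 0 };
    sums[e.contributor].total += e.degree;
    sums[e.contributor].count += 1;
  }
  const result = {};
  for (const name of Object.keys(sums)) {
    result[name] = sums[name].total / sums[name].count;
  }
  return result;
}

const estimates = [
  // One module qualifies its estimate with a subject; another does not.
  makeEstimate("alice@example.org", 0.75, { subject: "networking" }),
  makeEstimate("alice@example.org", 0.25),
  makeEstimate("bob@example.org", 0.5)
];
const combined = reconcile(estimates);
```

Because every module reports on the same fuzzy scale, the reconciling step never has to know how an individual module arrived at its figure.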

Current Status

At this point, I've mostly done a fair bit of research and reading, significant rumination and note-making, and relatively little writing of text or code.

To provide a concrete goal and deadline for the project, I submitted a paper proposal (PDF) to Computers & Writing 2008, which was accepted. The conference is 21-25 May; that is now my absolute deadline for a working software prototype and initial analysis of the results. However, I intend to be able to present something close to my C&W presentation, including working software, before the end of the spring semester in April.

Project Details

Much of the following discussion I first developed in a series of Mind Maps, which I've found to be a useful tool for invention and exploration.

Theory

Ethos and Proxies

Ethos is a tricky term. Contemporary use seems to have broadened considerably beyond Aristotle's definition, where ethos meant simply appeals to the willingness of the audience to believe the speaker; today, ethos is often used in ways that include elements of tone, motive, or purpose. This project requires a simple and flexible definition amenable to analysis, so I returned to something more Aristotelian and defined ethos as the source's standing with the audience, with respect to persuasion: that is, the likelihood that the audience will be persuaded simply by the identity of the source. I'm in the process of developing a formal version of this position (draft).

This project is actually looking to identify and estimate not ethic arguments themselves, but a source's potential for making such arguments. In the collection of input texts, what can we say about a contributor's ethic standing, or indications that the other participants assign weight to that contributor's opinion? Who is considered trustworthy, accurate, insightful, worth attending to?

But of course ethic standing is not generally directly present in text, even in conversations with multiple exchanges (the kind of textual environment we're working with here). So the project actually seeks to identify discernable proxies for ethic standing. Citation and quotation of a source, for example, often indicate some degree of ethic standing for that source. (Even if the quotation is for the purpose of disagreement, it shows that the source is sufficiently notable to argue against.) In a collection of texts where conversations usually begin with a question (often true of email collections), the recipient of the first message in a conversation often has ethic standing with the author of that message; the initiator of the conversation believes there's some possibility that the recipient can reply authoritatively.
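The first-message proxy could be tallied along these lines. This is only a sketch under stated assumptions: the message records and their field names (from, to, inReplyTo) are invented for illustration, and a raw count is the crudest possible stand-in for ethic standing.

```javascript
// Hypothetical message records; the field names are assumptions.
const messages = [
  { id: "m1", from: "carol", to: ["alice"], inReplyTo: null },
  { id: "m2", from: "alice", to: ["carol"], inReplyTo: "m1" },
  { id: "m3", from: "bob", to: ["alice", "dave"], inReplyTo: null },
  { id: "m4", from: "dave", to: ["bob"], inReplyTo: "m3" }
];

// Count how often each contributor is addressed by a message that
// opens a conversation (one that is not a reply to anything).
function countInitialRecipients(msgs) {
  const counts = {};
  for (const m of msgs) {
    if (m.inReplyTo !== null) continue; // skip replies
    for (const recipient of m.to) {
      counts[recipient] = (counts[recipient] || 0) + 1;
    }
  }
  return counts;
}

const initialCounts = countInitialRecipients(messages);
// Here alice is addressed by two opening messages and dave by one.
```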

Textual and Metatextual Analysis

Some proxies are present in the text itself and can potentially be detected (heuristically, and with some, hopefully acceptable, rate of false positives and negatives) by textual analysis. This might include identifying quotations and the source being quoted, for example. Sometimes writers even argue for the ethos of sources they use (and even for their own ethos), though detecting that through algorithmic analysis might not be feasible.

Given the notorious difficulties of algorithmic textual analysis, though, it's currently not clear how much useful information can be determined this way, and whether it would be possible to get useful results separate from the noise of the inevitable errors. Fortunately, most digital texts come with considerable metadata that is more amenable to analysis (because it already follows standard protocols and is intended for machine consumption) and that (I argue) includes proxies for ethic standing. This enables metatextual analysis for estimating ethos.

The email source-and-recipient relationship I described above is one example. This can usually be determined simply by processing email headers. (Even with listservs, where messages flow to and from a single email address, other headers often provide sufficient information to reliably reconstruct the history of actual conversation participants.) Another example is the graph formed by document links in collections of hypertext documents (eg HTML anchor tags), which is most famously used by the Google PageRank algorithm.
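To make the graph idea concrete, here is a simplified PageRank-style power iteration over a small directed graph. It is a textbook-style sketch, not Google's implementation and not EE's algorithm; the function name, the damping value of 0.85, the fixed iteration count, and the example edges are all assumptions.

```javascript
// Simplified PageRank-style ranking (illustrative sketch only).
// edges: array of [from, to] pairs, e.g. "carol wrote to alice".
function rankNodes(edges, damping, iterations) {
  // Collect the distinct nodes and each node's outgoing links.
  const out = {};
  for (const [from, to] of edges) {
    if (!out[from]) out[from] = [];
    if (!out[to]) out[to] = [];
    out[from].push(to);
  }
  const nodes = Object.keys(out);
  const n = nodes.length;

  // Start with weight spread evenly over all nodes.
  let score = {};
  for (const v of nodes) score[v] = 1 / n;

  for (let i = 0; i < iterations; i++) {
    const next = {};
    let dangling = 0; // weight held by nodes with no outgoing links
    for (const v of nodes) {
      next[v] = 0;
      if (out[v].length === 0) dangling += score[v];
    }
    // Each node passes its weight equally along its outgoing links.
    for (const v of nodes) {
      for (const w of out[v]) next[w] += score[v] / out[v].length;
    }
    // Apply damping; dangling weight is shared equally among all nodes.
    for (const v of nodes) {
      next[v] = (1 - damping) / n + damping * (next[v] + dangling / n);
    }
    score = next;
  }
  return score;
}

const scores = rankNodes(
  [["carol", "alice"], ["bob", "alice"], ["alice", "dave"]],
  0.85,
  50
);
```

In this toy graph alice outranks carol and bob because both of them point to her, and the scores always sum to one, so they can be read as shares of the network's total weight.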

Generality and Extensibility: Textmill

With many possible definitions of ethos, many possible proxies, collections of texts of different sorts from different domains, and approaches to textual and metatextual analysis, it was clear that the prototype software would implement only a very small subset of the possible useful approaches to the problem. The overall system could either reflect that, or aim for generality through extensibility. I took the latter route.

Consequently, EE itself is actually an application for a general-purpose text processing system I've designed called Textmill. Textmill is not specific to estimating ethos; its aim is simply to make it convenient to implement algorithms for text processing. The actual algorithms, for ethos estimation in whatever form or for other tasks, are implemented in plug-in modules which can be replaced or combined.

This architecture lets me build a basic prototype for a specific, constrained, relatively simple algorithm and input, then augment it in the future with complementary or even entirely different mechanisms, and compare the results to evaluate what approaches work well with what texts.

Software

Language and Execution Engine

The software is implemented in Javascript. While Javascript is usually thought of (and used) as a website client-side scripting language, it's highly expressive, relatively simple, and widely known — all goals for this software.

The extensible nature of Textmill is only useful if researchers can write extensions. An expressive language, with high-level constructs and good support for abstraction paradigms such as object-oriented and functional programming, simplifies code and makes it easier to write and read. Javascript is both object-oriented and functional, and its underlying associative data model (names mapped to things) provides a consistent framework for storing, retrieving, and manipulating both data and operations on data. There are other OO functional languages, such as OCaml, many of which are excellent for developing this type of application. But they're less well-known than Javascript, so the support community is much smaller. Also, their syntax is often obscure (OCaml, for example, is hampered by its reliance on arbitrary use of odd punctuation for operators). Javascript, while no more "intuitive" or "natural" than other programming languages, at least broadly follows conventions popularized by many mainstream languages.

Javascript usually runs in web browsers, and it's likely that some of the prototype code will be developed within a browser, purely for the sake of convenience. For the full application, however, I'm currently targeting Konfabulator, the Yahoo! Widgets engine. Konfabulator lets Javascript programs run as first-class applications on the desktop, and provides a number of useful services, such as file access.

EE Design

EE, the actual ethos estimator, is designed as an application of Textmill, the general text-processing system I've designed (described in the next section). EE itself consists of a set of Textmill modules. Textmill text objects flow through these modules, which analyze, transform, or otherwise process them. Each module does one or more of the following:

For example, the initial EE prototype will probably handle email messages (in RFC 2822 format), so it will include a parser for email messages that will extract contributor identity from the email headers. Extending EE to handle, say, a web forum would simply be a matter of adding a source to get data from the forum (perhaps using Web Services) and one or more parsers to extract messages from the HTML and identify metadata in them.
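A rough idea of what such an email parser involves is sketched below. It is not EE's parser: the function names are hypothetical, it handles only header unfolding and the simplest From forms, and it ignores much of what RFC 2822 allows (comments, quoted strings, address groups, encoded words).

```javascript
// Naive sketch of RFC 2822 header parsing (illustrative only).
// Returns a map from lower-cased field names to arrays of values.
function parseHeaders(raw) {
  // The header section ends at the first empty line.
  const headerText = raw.split(/\r?\n\r?\n/)[0];
  // Unfold: a line break followed by whitespace continues the previous field.
  const unfolded = headerText.replace(/\r?\n([ \t])/g, "$1");
  const headers = {};
  for (const line of unfolded.split(/\r?\n/)) {
    const colon = line.indexOf(":");
    if (colon <= 0) continue; // not a "Name: value" line
    const name = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (!headers[name]) headers[name] = [];
    headers[name].push(value);
  }
  return headers;
}

// Pull a bare address out of the first From field.
// Handles only "Name <addr>" and "addr"; anything fancier is out of scope.
function contributorFrom(headers) {
  const from = (headers["from"] || [])[0];
  if (!from) return null;
  const angle = from.match(/<([^>]+)>/);
  return (angle ? angle[1] : from).trim().toLowerCase();
}

const sample = [
  "From: Alice Example <Alice@Example.org>",
  "To: list@example.org",
  "Subject: Re: a long subject line",
  " that has been folded",
  "Message-ID: <2@example.org>",
  "In-Reply-To: <1@example.org>",
  "",
  "Body text: not a header."
].join("\r\n");

const sampleHeaders = parseHeaders(sample);
const contributor = contributorFrom(sampleHeaders);
```

The Message-ID and In-Reply-To fields recovered this way are the kind of header data that could be used to reconstruct who was replying to whom.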

Textmill Design

I have a preliminary, high-level architecture diagram (PDF) for Textmill. The basic idea is a dataflow system with plug-in modules. Modules advertise what they accept for input and what they produce for output. In principle, Textmill constructs a network of modules by connecting matching types, letting data flow between modules as it takes on the appropriate forms. In practice, it continually iterates over the collection of active text objects, sending each one to every module input channel that matches one of the object's available forms, if that module has not previously processed that object. When no modules receive new data in an iteration, the objects are sent to all configured sink modules that have an appropriate input channel.
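The iteration described above might be sketched as follows. This is a guess at the shape of the loop, not Textmill's core: runMill, hasForm, the single accepts channel per module, and the three toy modules are all hypothetical.

```javascript
// Hypothetical sketch of the dataflow loop; not Textmill's actual core.

// True if the text object currently has the named form.
function hasForm(obj, form) {
  return Object.prototype.hasOwnProperty.call(obj.forms, form);
}

function runMill(objects, modules, sinks) {
  const processed = new Set(); // "module|object" pairs already handled
  let delivered = true;
  // Keep iterating until a full pass delivers nothing new.
  while (delivered) {
    delivered = false;
    for (const obj of objects) {
      for (const mod of modules) {
        const key = mod.name + "|" + obj.id;
        if (processed.has(key) || !hasForm(obj, mod.accepts)) continue;
        processed.add(key);
        mod.process(obj); // may add new forms to the object
        delivered = true;
      }
    }
  }
  // Finally, hand each object to every sink with a matching input channel.
  for (const obj of objects) {
    for (const sink of sinks) {
      if (hasForm(obj, sink.accepts)) sink.process(obj);
    }
  }
}

// Toy modules: each accepts one form and adds another.
const wordCounter = {
  name: "word-counter",
  accepts: "plain",
  process: function (obj) {
    obj.forms.wordCount = obj.forms.plain.split(/\s+/).filter(Boolean).length;
  }
};
const htmlStripper = {
  name: "html-stripper",
  accepts: "html",
  process: function (obj) {
    obj.forms.plain = obj.forms.html.replace(/<[^>]*>/g, "");
  }
};
const collected = [];
const collector = {
  name: "collector",
  accepts: "wordCount",
  process: function (obj) {
    collected.push({ id: obj.id, wordCount: obj.forms.wordCount });
  }
};

const activeObjects = [
  { id: "t1", forms: { html: "<p>two words</p>" } },
  { id: "t2", forms: { plain: "three plain words" } }
];
// The counter is listed first on purpose: t1 only reaches it on a later pass,
// after the stripper has given it a plain form.
runMill(activeObjects, [wordCounter, htmlStripper], [collector]);
```

Because modules are matched to objects only by the forms they accept, the order in which modules are registered does not change the final result, only the number of passes needed.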

Textmill implements the flow and control logic in its core, which also provides tools for use by modules, such as object factories and the fuzzy-logic evaluator.

Analysis

Obviously, with no running software, there are no results to analyze yet. Nor have I defined a methodology for analyzing them, though that is an important question that will have to be considered as the project evolves. (Likely I will only revisit it in the next iteration, after the C&W presentation.)

Bill and I have agreed, though, that there's something to be learned from nearly any result. If the system output has a significant correlation to even an informal ranking of ethos by a human judge, then there's reason to believe that the project is workable — that a system to computationally estimate ethos, at least in restricted domains, is feasible. If, on the other hand, there is no such discernable correlation, we'll have learned something about what kinds of characteristics do not serve as touchstones for a strong ethic standing.
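One conventional way to check for such a correlation, offered here purely as an assumption since no analysis methodology has been chosen, is Spearman's rank correlation between the system's scores and a human judge's scores. The function names and the example figures below are invented for illustration.

```javascript
// Spearman rank correlation (illustrative sketch; the data is made up).

// Convert values to 1-based ranks, giving tied values their average rank.
function toRanks(values) {
  const order = values
    .map(function (v, i) { return [v, i]; })
    .sort(function (a, b) { return a[0] - b[0]; });
  const result = new Array(values.length);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && order[j + 1][0] === order[i][0]) j++;
    const averageRank = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) result[order[k][1]] = averageRank;
    i = j + 1;
  }
  return result;
}

// Pearson correlation of two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const meanX = x.reduce(function (a, b) { return a + b; }, 0) / n;
  const meanY = y.reduce(function (a, b) { return a + b; }, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - meanX) * (y[i] - meanY);
    sxx += (x[i] - meanX) * (x[i] - meanX);
    syy += (y[i] - meanY) * (y[i] - meanY);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman's rho is the Pearson correlation of the ranks.
function spearman(x, y) {
  return pearson(toRanks(x), toRanks(y));
}

// Hypothetical scores for four contributors (higher means more ethos).
const systemScores = [0.9, 0.4, 0.7, 0.1];
const judgeScores = [4, 2, 3, 1];
const rho = spearman(systemScores, judgeScores);
```

A value of rho near 1 would mean the system orders contributors much as the judge does, a value near 0 would mean no discernable relationship, and a value near -1 would mean the orderings are reversed.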