This morning at EPFL, we had a very interesting talk by Carolyn Rosé (Carnegie Mellon University) about automatic corpus analysis. The work is summarized in a paper she wrote for CSCL 2005, "Supporting CSCL with Automatic Corpus Analysis Technology":

Process analyses are becoming more and more standard in research on computer-supported collaborative learning. This paper presents the rationale as well as results of an evaluation of a tool called TagHelper, designed for streamlining the process of multi-dimensional analysis of the collaborative learning process. In comparison with a hand-coded corpus coded with a 7-dimensional coding scheme, TagHelper is able to achieve an acceptable level of agreement (Cohen's Kappa of .7 or more) along 6 out of 7 of the dimensions when we commit only to the portion of the corpus where the predictor has the highest certainty. In 5 of those cases, the percentage of the corpus where the predictor is confident enough to commit a code is at least 88% of the corpus. Consequences for theory-building with respect to automatic corpus analysis are formulated. Potential applications as a support tool for process analyses, as real-time support for facilitators of on-line discussions, and for the development of more adaptive instructional support for computer-supported collaboration are discussed.
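
To make the evaluation criterion in the abstract more concrete, here is a minimal Python sketch of how agreement between a human coder and an automatic predictor could be measured with Cohen's Kappa, restricted to the segments where the predictor is confident enough to commit a code. This is not TagHelper's implementation: the function names, the `confidence` scores and the `threshold` parameter are my own assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa between two equally long lists of categorical codes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of segments given the same code by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's marginal code frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        # Both coders used one single identical code everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_on_confident_subset(human, predicted, confidence, threshold):
    """Kappa and coverage when only predictions with confidence >= threshold
    are kept (hypothetical helper; the threshold is an assumption)."""
    kept = [(h, p) for h, p, c in zip(human, predicted, confidence) if c >= threshold]
    coverage = len(kept) / len(human)
    if not kept:
        return None, coverage
    kept_human, kept_predicted = zip(*kept)
    return cohen_kappa(list(kept_human), list(kept_predicted)), coverage


# Toy data, invented purely for illustration (not from the paper).
human = ["a", "a", "b", "b"]
predicted = ["a", "a", "b", "a"]
confidence = [0.9, 0.8, 0.95, 0.4]
kappa, coverage = kappa_on_confident_subset(human, predicted, confidence, 0.5)
```

The trade-off described in the abstract shows up directly in the two returned values: raising the threshold tends to raise Kappa but lowers the share of the corpus that gets coded automatically.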

Why do I blog this? Coding a corpus is both time-consuming and tedious (I like this quote: "Between the training and the coding itself, one quarter of the total duration of the research project was used for the coding of collaborative processes"). If automatic analysis could support the coding of natural language corpus data, it would facilitate and potentially improve quantitative process analyses.