TY - GEN
T1 - On the feasibility of internet-scale author identification
AU - Narayanan, Arvind
AU - Paskov, Hristo
AU - Gong, Neil Zhenqiang
AU - Bethencourt, John
AU - Stefanov, Emil
AU - Shin, Eui Chul Richard
AU - Song, Dawn
PY - 2012
Y1 - 2012
AB - We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech: an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style. While there is a huge body of literature on authorship recognition based on writing style, almost none of it has studied corpora of more than a few hundred authors. The problem becomes qualitatively different at a large scale, as we show, and techniques from prior work fail to scale, both in terms of accuracy and performance. We study a variety of classifiers, both "lazy" and "eager," and show how to handle the huge number of classes. We also develop novel techniques for confidence estimation of classifier outputs. Finally, we demonstrate stylometric authorship recognition on texts written in different contexts. In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier the option of not making a guess, via confidence estimation we are able to increase the precision of the top guess from 20% to over 80% with only a halving of recall.
UR - http://www.scopus.com/inward/record.url?scp=84876100930&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84876100930&partnerID=8YFLogxK
U2 - 10.1109/SP.2012.46
DO - 10.1109/SP.2012.46
M3 - Conference contribution
AN - SCOPUS:84876100930
SN - 9780769546810
T3 - Proceedings - IEEE Symposium on Security and Privacy
SP - 300
EP - 314
BT - Proceedings - 2012 IEEE Symposium on Security and Privacy, S&P 2012
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE Symposium on Security and Privacy, S&P 2012
Y2 - 21 May 2012 through 23 May 2012
ER -