Text Characterization Toolkit (TCT)

  • Daniel Simig
  • Tianlu Wang
  • Verna Dankers
  • Peter Henderson
  • Khuyagbaatar Batsuren
  • Dieuwke Hupkes
  • Mona Diab

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present the Text Characterization Toolkit (TCT), a tool that researchers can use to study the characteristics of large datasets. Studying these characteristics can also help explain how they influence models' behaviour. In most NLP research, models are evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that, especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations, deeper analysis of results should become the de facto standard when presenting new models or benchmarks. TCT aims to fill this gap by facilitating such analysis at scale for training, development, and evaluation datasets. TCT includes both an easy-to-use tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from several different domains: TCT is used to predict difficult examples for given well-known trained models, and to identify (potentially harmful) biases present in a dataset.

Original language: English (US)
Title of host publication: System Demonstrations
Editors: Wray Buntine, Maria Liakata
Publisher: Association for Computational Linguistics (ACL)
Pages: 72-87
Number of pages: 16
ISBN (Electronic): 9781955917551
State: Published - 2022
Externally published: Yes
Event: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL-IJCNLP 2022 - Virtual, Online
Duration: Nov 20 2022 to Nov 23 2022

Publication series

Name: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Long Paper, AACL-IJCNLP 2022
Volume: 4

Conference

Conference: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL-IJCNLP 2022
City: Virtual, Online
Period: 11/20/22 to 11/23/22

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
