Assigning categories to documents is a difficult, labor-intensive task. Fully or partially automating it can yield significant savings in the time and cost of annotating documents. Increasing the degree of automation is the main goal of the KTN indexer.
Automated assignment of categories to documents can save considerable amounts of manual work. Consequently, document classification is one of the most useful tasks in natural language processing. However, statistical classification systems must be trained on annotated documents. In environments where the set of possible categories changes very frequently (e.g., press clipping), this is a major obstacle. The annotation overhead can be mitigated with active learning techniques, in which the system asks the user to annotate only the most informative documents. In collaboration with Novena d.o.o., TakeLab developed the KTN indexer, a text classification and indexing system. The system allows for very high classification accuracy with minimal human annotation effort. This makes it ideal for highly dynamic document processing environments, such as press clipping agencies.
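To make the active learning idea concrete, the following is a minimal sketch of one common selection strategy, least-confidence sampling. It is not a description of how the KTN indexer works internally; the strategy, the function names, and the example probabilities are all assumptions made for illustration. Given a classifier's predicted category probabilities for a set of unlabeled documents, it picks the documents whose top prediction is least certain, since labeling those is likely to be most informative.

```python
from typing import Dict, List


def uncertainty(probs: Dict[str, float]) -> float:
    """Least-confidence score: 1 minus the probability of the top category."""
    return 1.0 - max(probs.values())


def select_for_annotation(
    predictions: Dict[str, Dict[str, float]], k: int
) -> List[str]:
    """Return the ids of the k documents the classifier is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: uncertainty(predictions[doc_id]),
        reverse=True,  # most uncertain documents first
    )
    return ranked[:k]


# Hypothetical classifier output: document id -> category probabilities.
predictions = {
    "doc1": {"sports": 0.95, "politics": 0.05},
    "doc2": {"sports": 0.55, "politics": 0.45},
    "doc3": {"sports": 0.20, "politics": 0.80},
}

# Ask the annotator to label only the single most uncertain document.
to_annotate = select_for_annotation(predictions, k=1)
print(to_annotate)
```

In a full loop, the newly labeled documents would be added to the training set, the classifier retrained, and the selection repeated until accuracy is acceptable.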
Participants: Novena d.o.o.; TakeLab FER (Artur Šilić)
Duration: 3 years