Key Extraction in Table Form Documents: Insurance Policy as an Example


Cavusoglu D., Dayibasi O., Saglam R. B.

3rd International Conference on Computer Science and Engineering (UBMK), Sarajevo, Bosnia and Herzegovina, 20-23 September 2018, pp. 195-200

  • DOI: 10.1109/ubmk.2018.8566309
  • City of Publication: Sarajevo
  • Country of Publication: Bosnia and Herzegovina
  • Pages: pp. 195-200

Abstract

Automatic keyword/key-phrase extraction is a popular research area in text mining and information retrieval. It provides a brief summary of documents and enables efficient analysis of large amounts of textual material. Key-phrase extraction methodologies are generally based on the assumption that key-phrases are statistically important phrases with good coverage of the document. In this study, we propose a method to extract key-phrases that appear as keys with corresponding values in electronic documents such as insurance policies, bank receipts, or e-invoices. Unlike key-phrases, keys do not cover the whole document, and their term frequencies are relatively low; most appear only once in a document. They tend to appear in tables and forms at the beginning of documents. Based on these observations, we propose a classification method for key extraction in table-form electronic documents. Four machine-learning-based classification algorithms are applied and compared: decision trees, random forests, logistic regression, and extreme gradient boosting. Experimental results show that random forests performed well, with an overall accuracy over 98%.
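
The abstract does not list the features used, so the following is only a minimal sketch of how the stated observations (low term frequency, position near the start of the document, form-like layout) might be turned into per-candidate features for a classifier. The function name `candidate_features`, the colon-based splitting, the feature names, and the sample text are all illustrative assumptions, not the authors' method.

```python
from collections import Counter


def candidate_features(lines):
    """Build one feature dict per candidate phrase (hypothetical feature set)."""
    candidates = []
    for line_no, line in enumerate(lines):
        # Assumption: form-like lines separate a key from its value with a colon.
        parts = [p.strip() for p in line.split(":")]
        for idx, phrase in enumerate(parts):
            if phrase:
                # A phrase that precedes a colon is a plausible key.
                followed_by_colon = idx < len(parts) - 1
                candidates.append((phrase.lower(), line_no, followed_by_colon))

    # Term frequency of each candidate phrase within this document.
    counts = Counter(phrase for phrase, _, _ in candidates)
    n_lines = max(len(lines), 1)
    return [
        {
            "phrase": phrase,
            "term_frequency": counts[phrase],
            # 0.0 means the first line; values near 1.0 mean the end.
            "relative_position": line_no / n_lines,
            "followed_by_colon": followed_by_colon,
        }
        for phrase, line_no, followed_by_colon in candidates
    ]


# Made-up sample text, not taken from the paper's data set.
sample = [
    "Policy No: 12345",
    "Insured: Jane Doe",
    "This policy covers damage as described in the general conditions.",
    "This policy covers damage as described in the general conditions.",
]
features = candidate_features(sample)
```

Feature dictionaries of this kind could then be vectorized and passed to any of the four classifiers named in the abstract, with labels marking which candidates are true keys.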