
Keys for Table Extraction from PDF Files

This article provides information about required and optional keys used to configure the semi-automatic extraction of data from tables in PDF files when setting up a data source in Nexla.

The keys described below apply to table extraction settings for file-based sources when "Read as PDF File" is selected as the File Content Format option and "semi-auto" is selected as the Parsing Mode. See Section 1.8 in the Advanced Settings for File-Based Sources article to learn how to configure Nexla to extract data from tables in PDF files using these keys.

1. Key: columns

  • This key is required. Its value must be an array of the column names present in the tables in the PDF files from the data source.

  • List the column names in the order in which they appear in the tables: left to right, or top to bottom.

    • For columns without a name in the table, enter (blank).

    • For empty columns, enter (column) to produce a Nexset with an attribute containing the data from the unnamed column, or enter (column=Desired Name) to give that attribute the name Desired Name.
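As an illustration of the conventions above, the following Python sketch assembles a columns array that mixes real column names with the (blank) and (column=Desired Name) placeholders. The column names and the unnamed_column helper are hypothetical and are not part of Nexla; only the placeholder strings come from this article.

```python
def unnamed_column(desired_name=None):
    """Hypothetical helper: build the (column) or (column=Desired Name) placeholder."""
    return "(column)" if desired_name is None else f"(column={desired_name})"


# Entries are listed in the order in which the columns appear in the table.
# "Invoice Number", "Customer", "Notes", and "Total" are made-up example names.
columns = [
    "Invoice Number",
    "Customer",
    "(blank)",
    unnamed_column("Notes"),
    "Total",
]

print(columns)
```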

2. Key: forceBoldHeaders

  • Use this key to configure Nexla to treat only column headers in bold font as attribute names.

  • This key is a boolean property that defaults to "false" unless otherwise configured.

3. Key: forceFullWidth

  • Use this key to configure Nexla to treat the entire page in each PDF file as the table width.

  • When this key is not used, Nexla's parser treats the narrowest possible width as the table width.

  • This key is a boolean property that defaults to "true" unless otherwise configured.

4. Key: header

  • Use this key to specify whether Nexla should read columns horizontally or vertically.

    • To configure Nexla to read columns horizontally, enter this key with a value of ROW.

    • To configure Nexla to read columns vertically, enter this key with a value of COLUMN.
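The difference between the two header modes can be pictured with a small Python sketch. It rests on one plausible reading of the description above, which this article does not confirm: with ROW, the column names run left to right across the first row, and with COLUMN, they run top to bottom down the first column. The grid_to_records function and the sample data are hypothetical and do not represent Nexla's parser.

```python
def grid_to_records(grid, header="ROW"):
    """Illustrative only: turn a grid of cell strings into a list of records."""
    if header == "COLUMN":
        # Transpose so the names in the first column become the first row.
        grid = [list(cells) for cells in zip(*grid)]
    elif header != "ROW":
        raise ValueError("header must be ROW or COLUMN")
    names, *rows = grid
    return [dict(zip(names, row)) for row in rows]


# The same made-up table, laid out horizontally and vertically.
horizontal = [["Name", "Qty"], ["Bolt", "4"], ["Nut", "9"]]
vertical = [["Name", "Bolt", "Nut"], ["Qty", "4", "9"]]

print(grid_to_records(horizontal, header="ROW"))
print(grid_to_records(vertical, header="COLUMN"))
```

Under this reading, both calls yield the same two records, one for Bolt and one for Nut.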

5. Key: tupleNumber

  • Use this key to assign an integer value when the header mode is set to COLUMN and the table does not take up the entire page width of the PDF files from this source.

6. Key: spacing

  • Use this key to fine-tune how Nexla should read tables in complex PDF files from this source.

  • This key should be assigned a double-precision value.

  • The default value of this key is 2.5.
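To summarize the six keys, the Python sketch below collects them into a single settings object, applies the defaults stated in this article, and checks the value types before printing the result as JSON. The build_table_settings helper and the sample column names are hypothetical, and representing the settings as a JSON object is an assumption; refer to the Advanced Settings for File-Based Sources article for how the keys are actually entered in Nexla.

```python
import json

# Defaults stated in this article. header and tupleNumber have no documented default.
DEFAULTS = {"forceBoldHeaders": False, "forceFullWidth": True, "spacing": 2.5}
ALLOWED_KEYS = {"columns", "header", "tupleNumber", *DEFAULTS}


def build_table_settings(columns, **overrides):
    """Hypothetical helper: merge overrides with the defaults and check value types."""
    if not isinstance(columns, list) or not all(isinstance(c, str) for c in columns):
        raise ValueError("columns is required and must be an array of strings")
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")

    settings = {"columns": columns, **DEFAULTS, **overrides}

    for key in ("forceBoldHeaders", "forceFullWidth"):
        if not isinstance(settings[key], bool):
            raise ValueError(f"{key} must be a boolean")
    if "header" in settings and settings["header"] not in ("ROW", "COLUMN"):
        raise ValueError("header must be ROW or COLUMN")
    if "tupleNumber" in settings:
        value = settings["tupleNumber"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("tupleNumber must be an integer")
    spacing = settings["spacing"]
    if isinstance(spacing, bool) or not isinstance(spacing, (int, float)):
        raise ValueError("spacing must be a number")
    settings["spacing"] = float(spacing)
    return settings


# Example with made-up column names; unspecified keys fall back to the defaults.
settings = build_table_settings(["Date", "Description", "Amount"], header="ROW")
print(json.dumps(settings, indent=2))
```

Keeping the defaults in one place makes it easy to see which keys were set explicitly and which were left at their documented values.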