Advanced Settings for File-Based Sources

This article provides information about the options available in the Advanced Settings panel when setting up a file-based data source in Nexla and how to use them.

1. Data Format and Associated Settings

For file-based data sources, Nexla is automatically configured to detect the format in which each file should be parsed based on file extensions. However, this default parsing configuration can be overridden to customize how Nexla will process files from the source.

For example, specifying a parser is useful when the data source contains the following:

  • files with name extensions that do not match the parser that should be used to ingest their content
  • text files with custom delimiters
  • compressed files
  • files without an extension to indicate the type of parser that should be used

To force all files from a source to be parsed in a specified format, select the corresponding option from the File Content Format pulldown menu in the Advanced Settings panel. The following subsections cover additional settings available for some format selections, as well as specific use cases.

  FileContentFormat.png

1.1 Custom Text Files

For sources containing text files that require customized parsing settings, select "Read as Custom Text File" from the File Content Format pulldown menu. This will configure Nexla to read all files from the data source as custom text files according to the selected settings, which are discussed below.

  CustomTextFile.png

  • Specify Delimiter/Qualifier Characters:

    When reading delimiter-separated file formats, Nexla automatically analyzes both the file extension and its content to detect the delimiter and qualifier characters that it contains. However, Nexla can be configured to recognize a specified character or characters as the delimiter and/or qualifier in a text file.

    Delimiter – The delimiter character, or field separator, marks the boundary between attributes in each row of data.

    Qualifier – The qualifier character is used to wrap text that should be treated as a single attribute, even if it contains occurrences of the delimiter character.

    1. Select the character that Nexla should recognize as the delimiter from the Text Delimiter pulldown menu, or type the character in the field.

      Options listed in the pulldown menu include the most commonly used delimiter characters, but Nexla can be configured to recognize other characters as the delimiter. To do this, type the character directly into the Text Delimiter field, and select the "Use [character]" option that appears.

        TextDelimiter.png

    2. Optional: By default, Nexla will recognize the double quote character as the qualifier in a text file. To specify a different qualifier character, type the character in the Text Qualifier Character field.

        Qualifier.png

  • How Schema Attribute Names Should Be Assigned:

    The Schema Attribute Detection Mode pulldown menu can be used to specify how Nexla should determine attribute names in Nexset schema.

      AttributeDetectionMode.png

    1. For files containing a header row, select "Use Header Row" to configure Nexla to use the entries in the header row as attribute names.

        HeaderRow.png

    2. For files that do not contain a header row, select "Generate Attribute Names" to configure Nexla to automatically assign attribute names to each column of data based on its content.

        GenerateAttribute.png

  • Skip Lines in Structured Files:

    For a data source containing structured files with a fixed number of beginning lines that Nexla should ignore before ingesting data into records, enter the number of lines that should be skipped in the Skip Lines at the Head field.

    By default, this field is set to "0". This setting should be left at the default value if Nexla should not skip any lines before ingesting data in files from this source.

      SkipLines.png
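
The effect of the custom text file settings can be sketched with Python's standard csv module. This is only an illustration of the concepts (delimiter, qualifier, header row, skipped lines), not Nexla's parser; the function name, its parameters, and the generated attribute names are hypothetical.

```python
import csv

def read_custom_text(text, delimiter=",", qualifier='"', skip_lines=0, use_header=True):
    """Parse delimited text into a list of dict records (illustrative only)."""
    lines = text.splitlines()[skip_lines:]  # drop the leading lines to ignore
    reader = csv.reader(lines, delimiter=delimiter, quotechar=qualifier)
    rows = [row for row in reader if row]
    if not rows:
        return []
    if use_header:
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical placeholder names; Nexla's generated names may differ.
        names = [f"attribute_{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]

# One title line to skip, "|" as the delimiter, "'" as the qualifier.
sample = "Exported 2024-01-01\nid|name|note\n1|Ada|'a|b'\n2|Bob|plain\n"
records = read_custom_text(sample, delimiter="|", qualifier="'", skip_lines=1)
print(records)
```

In the sample, the value 'a|b' stays a single attribute because it is wrapped in the qualifier character, even though it contains the delimiter.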

1.2 Compressed ZIP and/or TAR Files

Nexla can ingest data from a source containing files compressed in ZIP and/or TAR format without requiring the files to be extracted at the data source.

  1. To configure Nexla to ingest compressed files from the data source, select "Read as Custom Text File" from the File Content Format pulldown menu.

      CustomTextFile.png

    • Nexla will now automatically decompress files from the source and ingest data from the decompressed files.
  2. Optional: The settings covered in Section 1.1 can also be used to further customize how Nexla should ingest and process data in compressed files from this source.

    These options include the ability to specify delimiter/qualifier characters, indicate how schema attribute names should be assigned, and define a number of lines to be skipped in structured files before beginning data ingestion.
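
As a rough illustration of reading compressed files without first extracting them at the source, the sketch below opens a ZIP archive held in memory with Python's standard zipfile module and reads each member directly. It is not Nexla's implementation; the helper name and sample file are hypothetical.

```python
import io
import zipfile

def iter_zip_text(archive_bytes, encoding="utf-8"):
    """Yield (member_name, text) for each file in an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with archive.open(info) as member:
                yield info.filename, member.read().decode(encoding)

# Build a small archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("orders.csv", "id,total\n1,9.50\n")

contents = dict(iter_zip_text(buffer.getvalue()))
print(contents)
```

The decompressed text can then be parsed with the same delimiter, header, and skip-line settings as an uncompressed file.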

1.3 EDI Files

To configure Nexla to ingest EDI content from a data source, including both EDI files and EDI-formatted content within files containing data in multiple formats, select "Read as EDI" from the File Content Format pulldown menu.

  EDI.png

  • Files Containing Only Some EDI Content:

    For files in which only a portion of the content is valid EDI content, enter the path (in XPath format) to the area of the file that should be processed as EDI content in the EDI XPath field.

      EDI_XPath.png

1.4 Excel Files

For a data source containing files in Excel format (XLSX or XLS), select "Read as Excel" from the File Content Format pulldown menu.

  Excel.png

  • How Schema Attribute Names Should Be Assigned:

    The Schema Attribute Detection Mode pulldown menu can be used to specify how Nexla should determine attribute names in Nexset schema.

      AttributeDetectionMode2.png

    1. For files containing a header row, select "Use Header Row" to configure Nexla to use the entries in the header row as attribute names.

        HeaderRow.png

    2. For files that do not contain a header row, select "Generate Attribute Names" to configure Nexla to automatically assign attribute names to each column of data based on its content.

        GenerateAttribute.png

  • Specify a Relevant Cell Range:

    If only some cells within Excel files from this source contain data that should be ingested, enter the relevant cell path(s)—e.g., sheet1!A1:B5—in the Data Records Cell Range field. Nexla will then ingest data from only the specified cells as individual records and their attributes.

    This field is optional and can be left blank to configure Nexla to ingest data from all cells within Excel files from this source.

      CellRange.png

    • To specify multiple cell ranges in this field, enter the ranges as comma-separated values.

      For example, when sheet1!A1:B5,sheet2!C2:D5 is entered, Nexla will ingest cells A1:B5 from sheet1 and cells C2:D5 from sheet2.

  • Specify a Relevant Metadata Cell Range:

    Some Excel files include common data that should remain associated with data ingested from a source. When this data is located in cells outside the specified Data Records Cell Range, enter the cell path(s) of the common data in the Metadata Cell Range field. Nexla will then include the data in the corresponding cells as metadata attributes in each ingested record.

      Metadata.png

    • Use the general format <sheetname>_<attributecell>:<valuecell>|...|<attributecellN>:<valuecellN> when entering the cell path(s).

    • To specify the sheet containing the metadata cells, begin the cell path with the sheetname enclosed in angle brackets, followed by "_" (e.g., <sheetname>_<attributecell>:<valuecell>|...).

      By default, Nexla will read the entered cells from the first sheet when no sheetname is specified.

    • Use the ":" delimiter to split attribute cells from value cells (e.g., <attributecell>:<valuecell>).

    • Use the "|" delimiter to separate key-value pairs (e.g., <attributecell_1>:<valuecell_1>|<attributecell_2>:<valuecell_2>).

  • Skipping Merged Cells During Ingestion:

    To configure Nexla to skip merged cells when ingesting data from this source, such as those used for table titles or other formatting, check the box next to "Skip Merged Cells".

      SkipMerged.png
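
To make the two cell-range formats concrete, here is a minimal sketch of how such strings could be broken into their parts. It is not Nexla's parser. The function names are hypothetical, and the metadata parser assumes the angle brackets around the sheetname are typed literally, as the description above suggests.

```python
import re

RANGE_RE = re.compile(r"^(?:(?P<sheet>[^!]+)!)?(?P<start>[A-Z]+\d+):(?P<end>[A-Z]+\d+)$")

def parse_cell_ranges(spec):
    """Split 'sheet1!A1:B5,sheet2!C2:D5' into (sheet, start, end) tuples."""
    ranges = []
    for part in spec.split(","):
        match = RANGE_RE.match(part.strip())
        if match is None:
            raise ValueError(f"Unrecognized cell range: {part!r}")
        ranges.append((match["sheet"], match["start"], match["end"]))
    return ranges

def parse_metadata_cells(spec):
    """Split '<sheet>_A1:B1|A2:B2' into (sheet, [(attribute_cell, value_cell), ...]).

    Assumption: the sheetname prefix is written with literal angle brackets.
    """
    sheet = None
    match = re.match(r"^<(?P<sheet>[^>]+)>_(?P<rest>.*)$", spec)
    if match:
        sheet, spec = match["sheet"], match["rest"]
    pairs = [tuple(pair.split(":", 1)) for pair in spec.split("|")]
    return sheet, pairs

print(parse_cell_ranges("sheet1!A1:B5,sheet2!C2:D5"))
print(parse_metadata_cells("<summary>_A1:B1|A2:B2"))
```

When no sheetname prefix is present, the metadata parser returns None for the sheet, mirroring the default of reading from the first sheet.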

1.5 Fixed-Width Files

To configure Nexla to read all files from a source as fixed width-formatted files, select "Read as Fixed Width File" from the File Content Format pulldown menu.

  FixedWidth.png

  • Field Length (Required):

    Enter the length of each field in files from this source in the Length of Each Field in File field.

    This field is required when "Read as Fixed Width File" is selected.

    Field lengths should be entered as a comma-separated list of numbers, e.g., 10,12,8,15,20.

  • How Schema Attribute Names Should Be Assigned:

    The Schema Attribute Detection Mode pulldown menu can be used to specify how Nexla should determine attribute names in Nexset schema.

      AttributeDetectionMode.png

    1. For files containing a header row, select "Use Header Row" to configure Nexla to use the entries in the header row as attribute names.

        HeaderRow.png

    2. For files that do not contain a header row, select "Generate Attribute Names" to configure Nexla to automatically assign attribute names to each column of data based on its content.

        GenerateAttribute.png

  • Specify the Field Separator:

    To specify the padding character used to separate individual fields within the fixed-width file, select the character from the Padding Character pulldown menu, or type the character directly into the field.

    Options listed in the pulldown menu include the most commonly used padding characters, but Nexla can be configured to recognize other characters as field separators. To do this, type the character directly into the Padding Character field, and select the "Use [character]" option that appears.

    • By default, Nexla is configured to recognize a single space as the padding character in fixed-width files. To use this default option, no action is required.

        PaddingCharacter.png

  • Line Separation Detection:

    Users can choose whether or not Nexla should automatically detect line separations in fixed-width files, depending on the type of data contained in files from this source and other workflow needs.

    1. To configure Nexla to detect new lines in files from this source, ensure that the box next to "Auto detect line separators?" is checked.

      This option is pre-selected by default.

      If Nexla should ingest data from this source without detecting line separations, uncheck this box.

        LineSeparators.png

    2. Select the character used to indicate a new line from the Line Separation Character pulldown menu.

        LineSeparators2.png

  • Quote Character Removal:

    If Nexla should remove quotes from strings within files from this source, check the box next to "Remove Quote Characters?".

    When this option is selected, the platform will only remove quotes from strings where it is safe to do so.

      RemoveQuote.png

    • Enter the character that should be recognized and treated as a quote character in the Quote Character field.

      By default, Nexla will recognize " as the quote character. To use the default setting, leave this field blank.

        QuoteCharacter.png

  • Scalar Coercion of String Values:

    To force the coercion of string values into scalar data types when Nexla reads files from this source, check the box next to "Force Scalar Coercion of String Values?".

    If string values should not be coerced into scalar data types, uncheck this box.

      ScalarCoercion.png
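
The role of the field-length list and the padding character can be sketched in a few lines of Python. This is an illustration of the general fixed-width technique, not Nexla's implementation; the function name and sample values are hypothetical.

```python
def parse_fixed_width(line, lengths, padding=" "):
    """Slice one line into fields by length and strip the padding character."""
    fields, start = [], 0
    for length in lengths:
        fields.append(line[start:start + length].strip(padding))
        start += length
    return fields

# The same comma-separated list format used in the field-length setting.
lengths = [int(n) for n in "10,12,8,15,20".split(",")]

# Build a sample line by padding each value with spaces to its field length.
values = ["A100", "Widget", "12.50", "Blue", "In stock"]
line = "".join(value.ljust(width) for value, width in zip(values, lengths))

print(parse_fixed_width(line, lengths))
```

Only padding at the edges of each field is removed, so a value such as "In stock" keeps its interior space.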

1.6 JSON Files

When Nexla should read all files from a data source as JSON files, select "Read as JSON" from the File Content Format pulldown menu.

  JSON.png

  • JSON Ingestion Mode (Required):

    Data processing systems can generate JSON text files formatted such that each row of the text file is a valid JSON object or, more rarely, the entire file is a valid JSON object. Use the JSON Ingestion Mode pulldown menu to designate how Nexla should parse JSON files from this source according to their formatting.

      JSON_Ingestion.png

    • Select "row" if files from this source are in JSON Line format, with each row of the text file being a valid JSON object.

        JSON_Row.png

    • Select "entire.file" if files from this source are formatted with the entire file being a single valid JSON object.

        JSON_EntireFile.png

      1. Optional: When the JSON ingestion mode is set to "entire.file", Nexla can be configured to ingest only part of the JSON object from the file. To do this, enter the JSON-formatted path of the file area that should be ingested from this source in the JSON Path to Data field.

          JSON_Path.png

      2. Optional: In some cases, when the JSON Path to Data field is used to designate a specific portion of the JSON files from this source for ingestion, additional common data located outside of the designated area should also be included in each record, such as metadata information in each file. Enter the JSON-formatted path to this additional data in the Path to Additional Data field to configure Nexla to include it in each record when ingesting files from this source.

          JSON_AdditionalData.png
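
The difference between the two ingestion modes can be sketched with Python's standard json module. This is a simplified illustration rather than Nexla's behavior: the function names are hypothetical, and plain top-level key names stand in for the JSON paths that the actual fields accept.

```python
import json

def ingest_rows(text):
    """'row' mode: every non-empty line is its own JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def ingest_entire_file(text, data_key=None, extra_key=None):
    """'entire.file' mode: parse once, then optionally select a sub-section."""
    document = json.loads(text)
    data = document[data_key] if data_key else document
    records = data if isinstance(data, list) else [data]
    if extra_key:
        # Copy the shared data onto every record.
        records = [{**record, extra_key: document[extra_key]} for record in records]
    return records

json_lines = '{"id": 1}\n{"id": 2}\n'
whole_file = '{"meta": {"source": "crm"}, "items": [{"id": 1}, {"id": 2}]}'

print(ingest_rows(json_lines))
print(ingest_entire_file(whole_file, data_key="items", extra_key="meta"))
```

Here "items" plays the part of the JSON Path to Data setting and "meta" the part of the Path to Additional Data setting.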

1.7 Log Files

To configure Nexla to read all files from a source as log-formatted files, such as those generated by IT system-monitoring tools, select "Read as Log File" from the File Content Format pulldown menu.

  LogFiles.png

  • Grok Pattern Selection (Required):

    The Grok pattern of a log file defines the regular expressions or pattern structure that allows log files to be parsed into structured data. Specify how Nexla should parse log files from this source by selecting the Grok pattern used in the files from the Grok Pattern pulldown menu.

      GrokPattern.png
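
A Grok pattern is essentially a set of named regular-expression fragments. The sketch below uses a plain Python regular expression with named groups to show the kind of structured record such a pattern yields from one line of a web-server access log. The expression and field names are illustrative assumptions and are not taken from Nexla's list of Grok patterns.

```python
import re

# Named groups play the role of the field names in a Grok pattern.
LOG_LINE = re.compile(
    r"(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None

line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(line)
print(record)
```

Lines that do not fit the pattern return None, which is why the selected pattern needs to match the format of the log files in the source.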

1.8 PDF Files

For a data source containing files in PDF format, select "Read as PDF File" from the File Content Format pulldown menu.

  PDF.png

  • Configure the Parsing Mode:

    Use the Parsing Mode pulldown menu to select the parsing mode that Nexla should use to extract text from PDF files from this source.

      PDF_Parsing.png

    1. When the "text" option is selected, Nexla will use the text strategy to extract the textual layer from the PDF file. The detected Nexset will contain one record per page of the PDF file, and each record will contain the following two attributes:

      • type – This attribute defines the record type and will have the value "text".
      • text – For each record, this attribute will have a value equal to the textual content extracted from the page.

        PDF_text.png

    2. When the "semi-auto" option is selected, Nexla will use formatting hints to extract structured data from the PDF file.

        PDF_semiauto.png

      1. Enter the configuration settings that Nexla should use to extract records from different slices of structured data in PDF files from this source in the Table Extraction Configuration field.

        This property must be entered as a valid JSON object.

        See the article Keys for Table Extraction from PDF Files for more information about configuring text extraction settings in this field.

          TableExtraction.png

      2. Optional: By default, Nexla will extract text from both table blocks and areas outside of table blocks as records when parsing a PDF file in semi-auto mode. To configure Nexla to extract only text from table blocks in PDF files from this source, uncheck the box next to "Extract Text Blocks?".

          PDF_TextBlocks.png

  • Document Password Settings:

    If PDF files from this source are password-protected, enter the password used to open the files in the Document Password field.

      PDF_Password.png

  • Empty Value Placeholder:

    In the Placeholder Text for Empty Values field, enter the text that should be used as a placeholder value when Nexla detects an empty cell in PDF files parsed from this source.

      PDF_Placeholder.png

1.9 XML Files

To configure Nexla to parse all files from a data source as XML-formatted files, select "Read as XML" from the File Content Format pulldown menu.

  XML.png

  • Configure the XML Ingestion Mode:

    Data processing systems can generate XML files formatted such that each row of the file is a valid XML object or, more rarely, the entire file is a valid XML object. Use the XML Ingestion Mode pulldown menu to designate how Nexla should parse XML files from this source according to their formatting.

      XML_IngestionMode.png

    • Select "row" if Nexla should parse files from this source with each row of the file being a valid XML object.

        XML_Row.png

    • Select "entire.file" if files from this source are formatted with the entire file being a single valid XML object.

        XML_EntireFile.png

      1. Optional: When the XML ingestion mode is set to "entire.file", Nexla can be configured to ingest only part of the XML object from the file. To do this, enter the XPath-formatted path of the file area that should be ingested from this source in the XPath to Data field.

          XPath.png

      2. Optional: In some cases, when the XPath to Data field is used to designate a specific portion of the XML files from this source for ingestion, additional common data located outside of the designated area should also be included in each record, such as metadata information in each file. Enter the XPath-formatted path to this additional data in the Path to Additional Data field to configure Nexla to include it in each record when ingesting files from this source.

          XML_AdditionalData.png
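
The "entire.file" options for XML can be sketched with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is an illustration under simplifying assumptions, not Nexla's parser; the function name, element names, and paths are hypothetical.

```python
import xml.etree.ElementTree as ET

def ingest_xml(text, data_path, extra_path=None):
    """Turn each element matched by data_path into a record (illustrative only)."""
    root = ET.fromstring(text)
    extra = {}
    if extra_path:
        node = root.find(extra_path)
        if node is not None:
            # Shared data that should accompany every record.
            extra = {child.tag: child.text for child in node}
    records = []
    for element in root.findall(data_path):
        record = {child.tag: child.text for child in element}
        record.update(extra)
        records.append(record)
    return records

xml_text = """<export>
  <meta><source>crm</source></meta>
  <orders>
    <order><id>1</id><total>9.50</total></order>
    <order><id>2</id><total>4.00</total></order>
  </orders>
</export>"""

records = ingest_xml(xml_text, "./orders/order", "./meta")
print(records)
```

In this sketch, "./orders/order" corresponds to the XPath to Data setting and "./meta" to the Path to Additional Data setting.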

2. Data Selection Options

When setting up a file-based data source, Nexla provides configuration options for specifying which data should be ingested from the source, allowing users to customize data ingestion to suit various use cases. Data can be selected for ingestion from file-based storage systems according to file modification dates, naming patterns, and/or subfolder paths.

2.1 Ingest All Files in the Selected Location

To configure Nexla to ingest all files from the data source, regardless of when the files were added or modified, delete the pre-populated date and time from the "Only read files modified after:" field under the Data Selection heading, and leave this field blank.

  Blank_AllFiles.png

2.2 Ingest Files According to Modification Date

When Nexla should only ingest newer or recently modified files from the data source, the platform can be configured to selectively ingest files modified after a specified date and time.

  1. To specify the file modification date and time that will be used to select which files should be read from this source, click the calendar icon in the "Only read files modified after:" field under the Data Selection heading to access the pulldown menu.

      ModifiedAfter1.png

  2. Click on a date in the calendar to select it as the modification date that should be referenced to identify new and/or modified files to be ingested from the source.

      ModifiedAfter2.png

    • Use the arrows in the top right corner of the calendar to navigate to the previous or next month.

        Calendar1.png

    • Click "Today" in the top right corner of the calendar to select the current date.

        Calendar2.png

    • To open the month-selection menu, click the month and year displayed in the top left corner of the calendar.

        Calendar3.png

    • To open the year-selection menu, first, open the month-selection menu as shown above; then, click the year displayed in the top left corner of the calendar.

        Calendar4.png

  3. In the field at the bottom of the calendar, enter the time (in 24-h format) on the selected date that should be referenced to identify new and/or modified files to be ingested from the source.

      Time.png
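
The selection logic described in Sections 2.1 and 2.2 amounts to comparing each file's modification time against a cutoff. A minimal sketch follows; the function name and file list are hypothetical, and whether a file modified exactly at the cutoff is included is an assumption (excluded here).

```python
from datetime import datetime, timezone

def modified_after(files, cutoff=None):
    """Return the names of files modified later than cutoff.

    A cutoff of None mirrors leaving the field blank: every file is selected.
    """
    if cutoff is None:
        return [name for name, _ in files]
    return [name for name, modified in files if modified > cutoff]

files = [
    ("old.csv", datetime(2024, 1, 5, 8, 0, tzinfo=timezone.utc)),
    ("new.csv", datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 2, 1, 0, 0, tzinfo=timezone.utc)

print(modified_after(files, cutoff))
print(modified_after(files))
```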

2.3 Ingest or Ignore Files According to Path Pattern(s)

Nexla can be configured to scan and/or ignore specific files or subfolders in the selected data source location based on path-naming patterns.

Path patterns to be scanned or ignored must be written in Apache Ant path-pattern syntax. For more information and example path patterns, see the Apache Ant Path documentation for directory-based tasks.

Entered patterns must also start from the root of the selected location accessible with the selected credentials.

  1. Check the box next to "Customize Paths to be Scanned/Ignored" under the Data Selection heading.

      CustomizePaths.png

  2. To configure Nexla to scan only files or subfolders that match a specific path pattern, enter the path pattern in the Paths to Be Scanned field.

    Only matching files or subfolders inside the selected location will be scanned.

    For example, when **/ABC/* is entered, only files in the subfolder "ABC" inside the selected location will be scanned.

      PathsScanned.png

  3. If Nexla should not scan files or subfolders that match a specific path pattern, enter the pattern in the Paths NOT to Be Scanned field.

    Only matching files or subfolders inside the selected location will be ignored.

    For example, when **/ABC/* is entered, files in the subfolder "ABC" inside the selected location will be ignored.

      PathsIgnored.png

  4. Enter the time zone referenced within the selected data source location in the Timezone for Path Format field.

      PathsTimezone.png
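
To show how a pattern such as **/ABC/* selects paths, the sketch below translates a simplified subset of Ant-style syntax ("**" for any number of directories, "*" for characters within one path segment, "?" for a single character) into a regular expression. It is not Nexla's or Apache Ant's matcher, and the function names are hypothetical.

```python
import re

def ant_to_regex(pattern):
    """Translate a simplified Ant-style path pattern into a compiled regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")           # anything, including "/"
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")        # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")         # exactly one character, not "/"
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

def ant_match(pattern, path):
    return ant_to_regex(pattern).fullmatch(path) is not None

def select_paths(paths, scan=None, ignore=None):
    """Keep paths matching any scan pattern and no ignore pattern."""
    selected = []
    for path in paths:
        if scan and not any(ant_match(p, path) for p in scan):
            continue
        if ignore and any(ant_match(p, path) for p in ignore):
            continue
        selected.append(path)
    return selected

paths = ["ABC/a.csv", "in/ABC/b.csv", "in/ABC/old/c.csv", "XYZ/d.csv"]
print(select_paths(paths, scan=["**/ABC/*"]))
print(select_paths(paths, ignore=["**/ABC/*"]))
```

Because "*" does not cross a "/" in this sketch, in/ABC/old/c.csv is not matched by **/ABC/*: the file sits in a subfolder of "ABC" rather than directly inside it.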

3. Nexset (Dataset) Creation Options

Nexla's intelligent data detection and analysis capabilities automatically ensure that ingested data is organized into logical, understandable Nexsets. However, users can also customize how Nexla processes ingested data to suit various use cases.

3.1 Enforce Data Processing into a Single Schema

When Nexla reads files from a data source, data from similar files is always processed into the same Nexset, regardless of any file extension differences. But some use cases require bypassing this automatic Nexset detection to force all data ingested from a source to be organized into a single schema.

Example use cases in which enforcing a single schema on all data ingested from a source is beneficial include the following:

  • Time-sensitive business data flows that could be impacted by any processing latencies, in which ingested data will always have the same structure. Nexla's schema detection function does not incur a large overhead for each file, but for high file volumes, bypassing schema detection could produce a noticeable improvement in the data-processing speed.

  • Data flows with a high likelihood of sparse data, in which data from multiple files should always be processed into the same Nexset. When ingested data is sparse, Nexla's schema detector might not be able to find significant overlaps between data from multiple files; thus, bypassing schema detection will ensure that all data from the source is added to the same Nexset.

To configure Nexla to process all data from the source into a single Nexset:

  1. Check the box next to "Force a Single Schema" under the Dataset Creation heading.

      SingleSchema.png

3.2 Configure Data Grouping

In some cases, files from a data source contain rows of data that should be combined into arrays of objects based on a key value. This can be achieved by enabling the grouping option when configuring the data source.

For example, a source might contain multiple CSV files with a column labeled "order_number" and additional columns containing information about each order.
With grouping, Nexla can be configured to process data ingested from this source into a Nexset containing two attributes: order_number, which contains the value in the "order_number" column for each record, and order_details, which is an array of objects containing the values present in the remaining columns for each record.

  1. Check the box next to "Enable Grouping" under the Dataset Creation heading.

      Grouping.png

  2. Enter the name of the attribute based on which grouping will be performed in the Grouping Key Attribute field.

    For example, in the scenario at the beginning of this section, grouping should be performed based on the values in the "order_number" column of the CSV files; thus, "order_number" would be entered in this field.

      GroupingKey.png

  3. Enter the name of the attribute that will contain the array of grouped objects in the produced Nexset.

    For example, in the scenario at the beginning of this section, grouping is performed based on the values in the "order_number" column of the CSV files, and the values in the remaining columns are grouped into an order_details attribute; thus, "order_details" would be entered in this field.

      GroupedField.png

  4. Specify how Nexla should handle rows that do not contain a value for the key attribute entered in Step 2 using the "Publish value of null key in grouping" checkbox.

      NullKey.png

    • When the box next to "Publish value of null key in grouping" is unchecked, the platform will ignore rows that do not contain a key attribute value.

    • When the box next to "Publish value of null key in grouping" is checked, the platform will assign each row that does not contain a key attribute value to a record with the grouping key attribute value of "null".
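
The grouping behavior described above can be sketched as follows, using the order_number and order_details names from the example scenario. This is an illustration, not Nexla's implementation: the function name and sample rows are hypothetical, Python's None stands in for "null", and treating an empty string as a missing key value is an assumption.

```python
def group_rows(rows, key, grouped_field, publish_null_key=False):
    """Combine rows that share a key value into one record with an array of objects."""
    groups = {}
    for row in rows:
        key_value = row.get(key)
        if key_value in (None, ""):
            if not publish_null_key:
                continue  # ignore rows without a key attribute value
            key_value = None
        details = {name: value for name, value in row.items() if name != key}
        groups.setdefault(key_value, []).append(details)
    return [{key: k, grouped_field: v} for k, v in groups.items()]

rows = [
    {"order_number": "A1", "sku": "pen", "qty": 2},
    {"order_number": "A1", "sku": "ink", "qty": 1},
    {"order_number": "B7", "sku": "pad", "qty": 5},
    {"order_number": None, "sku": "cap", "qty": 1},
]

print(group_rows(rows, "order_number", "order_details"))
print(group_rows(rows, "order_number", "order_details", publish_null_key=True))
```

With publish_null_key left at False, the row without an order number is dropped; with it set to True, that row appears in a separate record whose key value is null.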

4. Pipeline Ingestion Speed

Nexla can be configured to execute data flows, including data ingestion from the source, processing into detected and any transformed Nexsets, and sending the data to a destination, at a higher speed than the default setting. When the pipeline ingestion speed is increased, the data flow will be carried out with a larger capacity and data throughput infrastructure.

Important Note: Increasing the pipeline ingestion speed will result in a significant increase in the billable charges for the account. Please consult your Account Manager before using this option.

  1. To adjust the speed at which Nexla should ingest and process data from a source, select an option from the Speed Factor pulldown menu under the Pipeline Ingest Speed heading.

      SpeedFactor.png