Reading Data

Data Sources

Data source resources describe sources of data to be ingested, including details about the source type, ingestion schedule, associated data sets, and credentials for accessing the source. No matter where the data to be ingested resides, all information about where, when, and how to ingest it is contained in these Nexla resources.

A data source may have one or more datasets associated with it. These correspond to distinct schemas detected by Nexla in the source.

List All Sources

Both the Nexla API and the Nexla CLI support methods to list all sources in the authenticated user's account. A successful call returns detailed information about each source, including its id, owner, type, credentials, activation status, and ingestion configuration.

List All Sources: Request
GET /data_sources
Example:
curl https://api.nexla.io/data_sources \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
List All Sources: Response
[
{
"id": 5002,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example data source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5004,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "API",
"source_type": "api_push",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
]

Show One Source

Fetch a specific source accessible to the authenticated user. A successful call returns detailed information about that source, including its id, owner, type, credentials, activation status, and ingestion configuration.

When using the Nexla API, add an expand query param with a truthy value to get more details about the source. With this parameter, full details about the related resources (detected datasets, credentials, etc.) will also be returned.
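As an illustrative sketch, the expand parameter can be attached to the request URL in Python. The snippet builds the request without sending it, and the token value is a placeholder:

```python
from urllib.request import Request

# Placeholder token; substitute a real access token before sending.
ACCESS_TOKEN = "<Access-Token>"

# Build (but do not send) a GET request for one source with expand=1,
# which asks the API to also return full details of related resources.
req = Request(
    "https://api.nexla.io/data_sources/5003?expand=1",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/vnd.nexla.api.v1+json",
    },
    method="GET",
)
```

Sending the request (for example with `urllib.request.urlopen`) then returns the expanded source document.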

Show One Source: Request
GET /data_sources/{data_source_id}
Example
curl https://api.nexla.io/data_sources/5003 \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
Show One Source: Response
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}

Create A Source

Both the Nexla API and the Nexla CLI support methods to create a new data source in the authenticated user's account. The only required attribute in the input object is the data source name; all other attributes are set to default values.

Create Source: Request
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3"
}
Create Source: Response
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-12-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
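The create call above can be sketched in Python. This is an illustrative snippet that serializes the payload and builds the POST request without sending it; the token is a placeholder and the headers mirror the curl examples in this guide:

```python
import json
from urllib.request import Request

ACCESS_TOKEN = "<Access-Token>"  # placeholder

# Only "name" is strictly required; everything else (format, ingestion
# method, schedule, ...) falls back to server-side defaults.
payload = {"name": "Example S3 Data Source", "source_type": "s3"}

req = Request(
    "https://api.nexla.io/data_sources",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/vnd.nexla.api.v1+json",
        "Content-Type": "application/json",
    },
    method="POST",
)
```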

Create with Credentials

Data sources usually require some credentials for making a connection and ingesting data. You can refer to an existing data_credentials resource or create a new one in the POST call to /data_sources. In this example, an existing credentials object is used:

Create with Credentials: Request
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3",
"data_credentials": 5001
}

Here, the required attributes for creating a new data_credentials resource are included in the request:

Create with Credentials: Request
POST /data_sources
Example Request Body
...
{
"name": "Example FTP Data Source",
"source_type": "ftp",
"data_credentials": {
"name": "FTP CREDS",
"credentials_type": "ftp",
"credentials_version": "1",
"credentials": {
"credentials_type": "ftp",
"account_id": "XYZ",
"password": "123"
}
}
}

In either case, a successful POST to /data_sources with credential information returns a response that includes the full data source along with the encrypted form of its associated data credentials resource:

Create with Credentials: Response
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}

Update A Source

Nexla API supports methods to update any property of an existing source the authenticated user has access to.

Update Source: Request
PUT /data_sources/5023
Example Request Body
...
{
"name": "Updated S3 Data Source"
}
Update Source: Response
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}

Delete A Source

Nexla API supports methods to delete any source that the authenticated user has administrative/ownership rights to.

If the source is paused and none of its detected datasets have associated downstream resources, Nexla can safely delete the source. A successful request to delete a data source returns OK (200) with no response body.

If the source is active, or if downstream resources would be impacted, Nexla will not trigger deletion and will instead return a failure message explaining why deletion of the source was denied.

Delete Source: Request
DELETE /data_sources/{data_source_id}
Delete Source: Response
Empty response with status 200 for success
Error response with reason if source could not be deleted
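A minimal, hypothetical helper illustrates how a client might interpret the delete outcome; the function name and wording are illustrative, not part of the API:

```python
def interpret_delete(status: int, body: str) -> str:
    """Summarize a DELETE /data_sources/{id} outcome (illustrative helper)."""
    if status == 200:
        # Success: the API returns an empty body.
        return "deleted"
    # On failure the API returns a message explaining the denial,
    # e.g. the source is still active or has downstream resources.
    return f"not deleted: {body or 'unknown reason'}"
```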

Control Ingestion

Activate and Pause Source

Trigger Nexla to start ingesting data immediately by calling the activation method on the source. Note that a Nexla source usually contains parameters to schedule automatic ingestion based on cron intervals or the completion of other jobs. The activation method triggers an ingestion in addition to the scheduled automatic source ingestion.

Activate Source: Request
PUT /data_sources/{data_source_id}/activate

Conversely, call the pause method to immediately stop ingestion on a source. Any subsequent scheduled ingestion intervals will be ignored as long as the source is paused.

Pause Source: Request
PUT /data_sources/{data_source_id}/pause

Reingest Files

For file type sources, Nexla can be configured to reingest an already scanned file. This is useful if the file originally failed ingestion due to file errors and has since been modified.

To re-ingest a file for a data source, issue a POST request to the endpoint /data_sources/<data_source_id>/file/ingest with the file path in the request body. The file path must start from the root of the location that the source points to.

Reingest File: Request
POST /data_sources/{data_source_id}/file/ingest
...
Example Payload
{"file":"xls-merge/PostLog_TableOnlyXLS.xlsx"}
Reingest File: Response
{
"status": "ok"
}
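The re-ingest payload can be built with a small hypothetical helper. The leading-slash check is an illustrative assumption to catch paths that are not relative to the source root, not a documented API rule:

```python
import json


def reingest_payload(file_path: str) -> bytes:
    """Build the body for POST /data_sources/{id}/file/ingest.

    The path must start at the root of the location the source points
    to; rejecting absolute paths here is an illustrative sanity check.
    """
    if file_path.startswith("/"):
        raise ValueError("path must be relative to the source root")
    return json.dumps({"file": file_path}).encode("utf-8")
```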

Validate Source Configuration

All configuration about where and when to scan data is contained within the source_config property of a data source.

Because Nexla provides quite a few options to fine-tune and control exactly what slice of your data location you want to ingest and how, it is important to ensure the source_config contains all required parameters to successfully scan data. To validate the configuration of a given data source, send a POST request to the endpoint /data_sources/<data_source_id>/config/validate.

You can optionally send a JSON config in the request body; if no config is provided in the request, the stored source_config will be used for validation.

Validate Source Configuration: Request
POST /data_sources/{data_source_id}/config/validate
Validate Source Configuration: Response
{
"status": "ok",
"output": [
{
"name": "credsEnc",
"value": null,
"errors": [
"Missing required configuration \"credsEnc\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "credsEncIv",
"value": null,
"errors": [
"Missing required configuration \"credsEncIv\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "source_type",
"value": null,
"errors": [
"Missing required configuration \"source_type\" which has no default value.",
"Invalid value null for configuration source_type: Invalid enumerator"
],
"visible": true,
"recommendedValues": []
}
]
}
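A small illustrative helper can pull the failing parameters out of a validation response like the one above; the helper is hypothetical, not part of any Nexla SDK:

```python
def config_errors(validation: dict) -> dict:
    """Map each failing parameter name to its error list from a
    /config/validate response; parameters with no errors are omitted."""
    return {
        item["name"]: item["errors"]
        for item in validation.get("output", [])
        if item.get("errors")
    }
```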

Inspect Source Data

You can inspect the data that a source points to. These methods can be handy when trying to determine the exact source_config properties to set on the data source.

Inspect Source Content Hierarchy

You can inspect the tree structure of file and database sources to a particular depth. Note that not all data source types have a natural tree structure.

The following example shows the required request body structure for a /probe/tree call on an S3 data source.

Inspect Source Content Hierarchy: Request
POST /data_sources/<source_id>/probe/tree
...
{
"region": "us-west-1",
"bucket": "production-s3-basin",
"prefix": "events_v2/",
"depth": 3
}
Inspect Source Content Hierarchy: Response
{
"status": "ok",
"output": {
"events_v2": {
"2015": {
"11": {
"1": {},
"2": {},
"3": {}
},
"12": {
"1": {},
"8": {}
}
},
"2017": {
"2": {
"20": {}
}
}
}
}
}
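The nested tree output can be flattened into path prefixes with a short recursive sketch; the helper is hypothetical and shown only for illustration:

```python
def tree_paths(node: dict, prefix: str = "") -> list:
    """Flatten nested /probe/tree output into slash-joined path prefixes,
    one per leaf at the deepest listed level."""
    paths = []
    for name, child in node.items():
        full = f"{prefix}{name}/"
        if child:
            paths.extend(tree_paths(child, full))  # descend into subtree
        else:
            paths.append(full)  # empty dict marks the deepest level shown
    return paths
```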

Inspect Sample File Content

You can also get metadata and sample content from a file within a source. Note that the request payload must contain the path of the file starting from the root of the location that the data source points to.

Inspect File Content: Request
POST /data_sources/{data_source_id}/probe/files
...
{
"path" : "demo-in.nexla.com/test/Stock.json"
}
Inspect File Content: Response
{
"status": 200,
"message": "Ok",
"output": {
"format": "json",
"messages": [
{
"Stockname": "sociosqu ad",
"Total Debt": 8,
"Return on Assets": 1,
"Sector": "Mauris",
"Quick Ration": "1.2"
},
{
"Stockname": "ornare. In",
"Total Debt": 7,
"Return on Assets": 3,
"Sector": "lectus.",
"Quick Ration": "1.2"
},
{
"Stockname": "nec, diam.",
"Total Debt": 5,
"Return on Assets": 5,
"Sector": "eu",
"Quick Ration": "1.2"
}
]
},
"connection_type": "s3"
}

Test Potential Detected Schemas

When a data source is activated it will scan all data to detect unique schemas and create a dataset for each schema.

You can test what potential schemas might be detected from part of a data source, for example a specific file. The format of the request body object depends on the data source type. S3 data sources require a bucket attribute and accept an optional prefix. FTP data sources require only a file attribute, which must contain the full path to an FTP-based file.

Test Potential Detected Schemas: Request
POST /data_sources/<data_source_id>/probe/schemas
{
"bucket" : "ftp-nexla.com",
"prefix" : "finance/data"
}

The response to a successful /probe/schemas call contains an array of objects representing potential data sets. Each object contains source_schema and data_samples attributes along with other metadata.

Test Potential Detected Schemas: Response
[
{
"sample_service_id": 0,
"source_schema": {
"type": "object",
"properties": {
"Date": {
"type": "string"
},
"ShortExemptVolume": {
"type": "string"
},
"ShortVolume": {
"type": "string"
},
"TotalVolume": {
"type": "string"
}
},
"$schema-id": 869426765,
"$schema": "http://json-schema.org/draft-04/schema#"
},
"data_samples": [
{
"Date": "2016-12-01",
"ShortVolume": "2950.0",
"ShortExemptVolume": "0.0",
"TotalVolume": "3420.0"
},
{
"Date": "2016-11-17",
"ShortVolume": "157.0",
"ShortExemptVolume": "0.0",
"TotalVolume": "357.0"
},
{
"Date": "2016-11-02",
"ShortVolume": "159.0",
"ShortExemptVolume": "0.0",
"TotalVolume": "318.0"
},
{
"Date": "2016-10-26",
"ShortVolume": "100.0",
"ShortExemptVolume": "0.0",
"TotalVolume": "200.0"
},
{
"Date": "2016-10-24",
"ShortVolume": "100.0",
"ShortExemptVolume": "0.0",
"TotalVolume": "100.0"
}
],
"source_path": {
"file": "Short_Volume/FINRA-FORF_ARZGY.csv"
},
"resource_type": "SOURCE",
"resource_id": 5006,
"name": "Short_Volume/FINRA-FORF_ARZGY.csv",
"description": "Detected data set from Example Financial Data Source"
}
]
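As an illustrative sketch, the detected field names can be extracted from a /probe/schemas response like the one above; the helper is hypothetical:

```python
def schema_fields(probe_result: list) -> dict:
    """Map each potential data set's name to the sorted field names
    found in its JSON-schema "properties" object."""
    return {
        entry["name"]: sorted(entry["source_schema"]["properties"])
        for entry in probe_result
    }
```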

Monitor Source

Use the methods listed in this section to monitor all ingestion history for a source.

Lifetime Ingestion Metrics

Lifetime ingestion metrics methods return information about the total data ingested for a source since its creation. Metrics include the number of records ingested as well as the estimated volume of data.

Lifetime Ingestion Metrics: Request
GET /data_sources/5001/metrics
Lifetime Ingestion Metrics: Response
{
"status": 200,
"metrics": {
"records": 4,
"size": 582
}
}

Aggregated Ingestion Metrics

Aggregated ingestion metrics methods return information about the total data ingested each day for a source. Metrics include the number of records ingested as well as the estimated volume of data.

Aggregations can be fetched in different aggregation units. Use the method below to fetch reports aggregated daily:

Daily Ingestion Metrics: Request
GET /data_sources/5001/metrics?aggregate=1
...
Optional Payload Parameters:
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
}
Daily Ingestion Metrics: Response
{
"status": 200,
"metrics": [
{
"time": "2017-02-08",
"record": 53054,
"size": 12476341
},
{
"time": "2017-02-09",
"record": 66618,
"size": 15829589
},
{
"time": "2017-02-10",
"record": 25832,
"size": 6645994
}
]
}
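A quick illustrative helper can total the daily entries from a response like the one above; the helper is hypothetical, not part of the API:

```python
def total_ingested(daily: dict) -> tuple:
    """Sum record counts and estimated byte sizes across the entries
    of a daily aggregated metrics response."""
    records = sum(m["record"] for m in daily["metrics"])
    size = sum(m["size"] for m in daily["metrics"])
    return records, size
```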

Sources can be configured to scan for data at a specific ingestion frequency. Use the methods below to view ingestion metrics per ingestion cycle.

Aggregated By Ingestion Frequency: Request
GET /data_sources/5001/metrics/run_summary
...
Optional Payload Parameters:
{
"runId": <starting from unix epoch time of ingestion events>,
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
}
Aggregated By Ingestion Frequency: Response
{
"status": 200,
"metrics": {
"1539970426049": {
"records": 1364,
"size": 971330,
"errors": 0
},
"1539990426049": {
"records": 330,
"size": 235029,
"errors": 0
}
}
}
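The keys of the run_summary metrics object are unix epoch timestamps in milliseconds. A small hypothetical sketch converts them to UTC datetimes:

```python
from datetime import datetime, timezone


def run_started_at(run_id: str) -> datetime:
    """Convert a run_summary key (unix epoch in milliseconds) to a
    timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(run_id) / 1000, tz=timezone.utc)
```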

Granular Ingestion Status Metrics

Apart from the aggregated ingestion metrics methods above, which provide visibility into the total number of records and total volume of data ingested over a period of time, Nexla also provides methods to view granular details about ingestion events.

You can retrieve the ingestion status of a file source to find information such as how many files have been read fully, have failed ingestion, or are queued for ingestion in the next ingestion cycle.

File Source Ingestion Status: Request
GET /data_sources/5001/metrics/files_stats
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL"
}
File Source Ingestion Status: Response
{
"status": 200,
"metrics": {
"data": {
"COMPLETE": 17
},
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}
}

You can view the ingestion status and history of each file in a file source. The file source ingestion history methods below return one entry per file by aggregating all ingestion events for each file.

Ingestion History Per File: Request
GET /data_sources/5001/metrics/files
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
Ingestion History Per File: Response
{
"status": 200,
"metrics": {
"data": [
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 436180,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-000000000000.json",
"id": null,
"lastModified": "2018-06-04T11:31:24Z",
"error": null,
"lastIngested": "2018-06-04T11:43:11Z",
"errorCount": null
},
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 423605,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-0000000000001.json",
"id": null,
"lastModified": "2018-06-04T11:31:27Z",
"error": null,
"lastIngested": "2018-06-04T11:43:04Z",
"errorCount": null
}
],
"meta": {
"currentPage": 2,
"totalCount": 12,
"pageCount": 2
}
}
}

You can also bypass per-file aggregation and fetch the full ingestion history of each file, even if it was scanned multiple times.

Raw File Ingestion Status: Request
GET /data_sources/5001/metrics/files_raw?from=2017-09-25T02:25:26&to=2017-09-28T02:25:26
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
Raw File Ingestion Status: Response
{
"status": 200,
"metrics": [
{
"dataSourceId": 1542,
"dataSetId": 4062,
"size": 3692,
"ingestionStatus": "COMPLETE",
"recordCount": 25,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124681,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:36Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124680,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:23Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124679,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:19Z",
"errorCount": null
}
]
}
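The per-file aggregation described earlier can be approximated client-side from the raw entries. This hypothetical sketch keeps only the most recent entry per file name, ordering by lastIngested:

```python
def latest_per_file(raw_metrics: list) -> dict:
    """Collapse raw per-scan entries to the most recent entry per file.

    Relies on lastIngested being an ISO-8601 UTC string, so plain
    string comparison orders the timestamps correctly.
    """
    latest = {}
    for entry in raw_metrics:
        name = entry["name"]
        if name not in latest or entry["lastIngested"] > latest[name]["lastIngested"]:
            latest[name] = entry
    return latest
```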

You can call the methods below to retrieve source ingestion status per ingestion poll cycle.

Ingestion Status By Frequency: Request
GET /data_sources/5003/metrics/files_cron
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
Ingestion Status By Frequency: Response
{
"data": [
{
"dataSourceId": 5003,
"dataSetId": null,
"size": 2064,
"ingestionStatus": "COMPLETE",
"recordCount": 5,
"name": null,
"id": null,
"lastModified": "2018-09-20T04:56:44Z",
"runId": 1537394123916,
"error": null,
"lastIngested": "2018-09-20T04:57:13Z",
"errorCount": null
}
],
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}

Other Monitoring Events

See the section on Monitoring resources for methods to view source errors, notifications, quarantine samples, and audit logs.