Datasets

A Nexla dataset is a virtual representation of a data model. It contains the schema, samples, and metadata inferred from the data. The distinguishing attributes of a dataset are its input schema (either detected in the source or inherited from a parent dataset) and its set of transformations. The transformations applied to the input schema, or to data records matching that schema, define the outgoing schema, which may be associated with a dataset.

List All Datasets

Both the Nexla API and the Nexla CLI support methods to list all datasets in the authenticated user's account. A successful call returns detailed information such as the id, the owner, the dataset's parent (a source or another dataset), the schema (input and output), and the transform rules used to generate each dataset.

List All Datasets: Request
GET /data_sets

Example:
curl https://api.nexla.io/data_sets \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
List All Datasets: Response
[
{
"id": 8159,
"owner": {
"id": 82,
"full_name": "Kunjal Sharma",
"email": "kunjal@nexla.com",
"email_verified_at": "2018-03-06T22:24:47.000Z"
},
"org": {
"id": 1,
"name": "Nexla",
"email_domain": "nexla.com",
"email": null
},
"version": 19116,
"name": null,
"description": null,
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"parent_data_sets": [
{
"id": 8158,
"name": null,
"description": null,
"updated_at": "2019-08-12T12:59:19.000Z",
"created_at": "2019-08-12T12:54:50.000Z"
}
],
"data_sinks": [
{
"id": 5888,
"name": "test_s3",
"sink_type": "s3"
}
],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-12T12:59:22.000Z",
"created_at": "2019-08-12T12:56:00.000Z",
"tags": []
},
{
"id": 8158,
"owner": {
"id": 82,
"full_name": "Kunjal Sharma",
"email": "kunjal@nexla.com",
"email_verified_at": "2018-03-06T22:24:47.000Z"
},
"org": {
"id": 1,
"name": "Nexla",
"email_domain": "nexla.com",
"email": null
},
"version": 19115,
"name": null,
"description": null,
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"parent_data_sets": [
{
"id": 8157,
"name": "test_s3_1",
"description": "DataSet #1 detected from test_s3",
"updated_at": "2019-08-12T12:54:51.000Z",
"created_at": "2019-08-12T12:25:53.000Z"
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-12T12:59:19.000Z",
"created_at": "2019-08-12T12:54:50.000Z",
"tags": []
}
]
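
As a rough illustration of reading this response, the sketch below pulls the id, parent dataset ids, and sink names out of each entry. The helper name `summarize_datasets` is hypothetical, and the sample is trimmed by hand from the response above; only the field names come from the documented payload.

```python
import json

# Trimmed copy of the list response shown above (only the fields used here).
SAMPLE_RESPONSE = """
[
  {"id": 8159,
   "parent_data_sets": [{"id": 8158}],
   "data_sinks": [{"id": 5888, "name": "test_s3", "sink_type": "s3"}]},
  {"id": 8158,
   "parent_data_sets": [{"id": 8157}],
   "data_sinks": []}
]
"""


def summarize_datasets(datasets):
    """Return (id, parent ids, sink names) for each dataset in a list response."""
    summary = []
    for ds in datasets:
        parent_ids = [p["id"] for p in ds.get("parent_data_sets", [])]
        sink_names = [s["name"] for s in ds.get("data_sinks", [])]
        summary.append((ds["id"], parent_ids, sink_names))
    return summary


summary = summarize_datasets(json.loads(SAMPLE_RESPONSE))
for ds_id, parents, sinks in summary:
    print(ds_id, "parents:", parents, "sinks:", sinks)
```

Reading the parent ids this way makes the chain visible: dataset 8159 is derived from 8158, which in turn is derived from 8157.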

List Datasets for Source

You can retrieve a list of all data sets associated with a particular data source by including a data_source_id query parameter. You can limit the list further by including the source_schema_id query parameter in the GET request. The API will return only data sets for the given data source which have the matching source_schema_id attribute.

List Datasets By Source: Request
GET /data_sets?data_source_id={data_source_id}&expand=1
List Datasets By Source: Response
[
{
"id": 8085,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #1 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
"type": "object",
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
},
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #2 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
"type": "object",
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
}
]
List Dataset By Source Schema: Request
GET /data_sets?data_source_id={data_source_id}&source_schema_id={source_schema_id}&expand=1
List Dataset By Source Schema: Response
[
{
"id": 8085,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #1 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
}
]
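
The two filtered requests above differ only in their query string. A minimal sketch of building those URLs follows; the function name `datasets_by_source_url` is hypothetical, the base URL is taken from the curl example earlier on this page, and the ids are the ones from the sample responses.

```python
from urllib.parse import urlencode

API_BASE = "https://api.nexla.io"  # from the curl example earlier on this page


def datasets_by_source_url(data_source_id, source_schema_id=None, expand=True):
    """Build the GET /data_sets URL filtered by source and, optionally, schema."""
    params = {"data_source_id": data_source_id}
    if source_schema_id is not None:
        # Narrows the list to data sets with this source_schema_id.
        params["source_schema_id"] = source_schema_id
    if expand:
        params["expand"] = 1
    return f"{API_BASE}/data_sets?{urlencode(params)}"


by_source = datasets_by_source_url(5963)
by_schema = datasets_by_source_url(5963, source_schema_id="1072858493")
print(by_source)
print(by_schema)
```

The same Authorization and Accept headers as in the curl example would still need to be sent with either URL.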

Show One Dataset

Fetch a specific dataset accessible to the authenticated user. With the Nexla API, add an expand query parameter with a truthy value (for example, expand=1) to get more details about the dataset. With this parameter, the response also includes full details about related resources (detected datasets, credentials, etc.).

A data set has either a non-null source_schema or parent_data_set. The former refers to a schema detected in data read from the data source itself. The latter refers to the data set which precedes the current one in the pipeline of data processing.

A data set always has a transform attribute, which may be null or an empty object. This transform is applied to data incoming from the source or parent data set to produce outgoing data matching the output_schema.

A data set may have a non-empty data_samples attribute containing one or more objects matching the schema from source_schema or parent_data_set.
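
The source-versus-parent distinction described above can be checked on a fetched dataset. The sketch below is one possible way to do it; the helper `dataset_origin` is hypothetical, and it assumes the parent appears as the `parent_data_sets` list seen in the sample responses on this page rather than as a single `parent_data_set` field.

```python
def dataset_origin(data_set: dict) -> str:
    """Classify a dataset as detected from a source or derived from a parent.

    Assumption: an empty object/list means "not set", as in the sample
    responses on this page.
    """
    if data_set.get("source_schema"):
        return "source"
    if data_set.get("parent_data_sets"):
        return "parent"
    return "unknown"


# Trimmed examples modeled on the sample responses on this page.
detected = {"id": 8085, "source_schema": {"type": "object"}, "parent_data_sets": []}
derived = {"id": 8086, "source_schema": {}, "parent_data_sets": [{"id": 8085}]}

print(detected["id"], dataset_origin(detected))
print(derived["id"], dataset_origin(derived))
```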

Show One Dataset: Request
GET /data_sets/{data_set_id}?expand=1
Show One Dataset: Response
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}

Create A Dataset

Creating a dataset requires a parent dataset, which defines the input, and a transform, which defines how that input is modified to generate the output. See the section on transforms for the different ways of creating transforms.

Create A Dataset: Request
POST /data_sets

Example Request Body
...
{
"name": "Test Dataset",
"description": "",
"parent_data_set_id": 22186,
"has_custom_transform": true,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [],
"custom": true
}
}
Create A Dataset: Response
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "Test Dataset",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
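
For illustration, the create call could be assembled in Python as sketched below. The request is built but not sent. The function name `create_dataset_request` is hypothetical, the base URL and Accept header come from the curl example earlier on this page, and the `Content-Type` header is an assumption that this page does not state.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # from the curl example earlier on this page


def create_dataset_request(access_token, body):
    """Build (without sending) a POST /data_sets request with a JSON body."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
            "Content-Type": "application/json",  # assumption, not stated on this page
        },
        method="POST",
    )


# Same body as the example request above.
new_dataset = {
    "name": "Test Dataset",
    "description": "",
    "parent_data_set_id": 22186,
    "has_custom_transform": True,
    "transform": {"version": 1, "data_maps": [], "transforms": [], "custom": True},
}

request = create_dataset_request("<Access-Token>", new_dataset)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```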

Update a Dataset

The Nexla API supports methods to update any property of an existing dataset that the authenticated user has access to.

Update Dataset: Request
PUT /data_sets/5023

Example Request Body
...
{
"name": "Test Dataset"
}

Update A Dataset: Response
{
"id": 5023,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
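
A partial update such as the rename above only needs the changed properties in the body. The sketch below builds that PUT request without sending it; `update_dataset_request` is a hypothetical helper, and the `Content-Type` header is an assumption not stated on this page.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # from the curl example earlier on this page


def update_dataset_request(access_token, data_set_id, changes):
    """Build (without sending) a PUT /data_sets/{id} request with changed fields."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets/{data_set_id}",
        # json.dumps always emits valid JSON, so no trailing commas can slip in.
        data=json.dumps(changes).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
            "Content-Type": "application/json",  # assumption, not stated on this page
        },
        method="PUT",
    )


rename = update_dataset_request("<Access-Token>", 5023, {"name": "Test Dataset"})
print(rename.get_method(), rename.full_url, rename.data.decode("utf-8"))
```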

Update with Custom Transform

Data set transforms are normally constructed through the schema editor in the Nexla UI, which contains logic for translating user actions on the data set attributes into transform rule syntax.

You can set a transform directly on a data set by including it in your POST or PUT input. Note that the has_custom_transform attribute should be omitted or set to false if the transform you're saving is compatible with the schema editor in the Nexla UI. If your transform contains syntax or modifiers that are not supported in the UI, set has_custom_transform to true. This disables the transform tools in the schema editor, which might otherwise override or delete your custom modifications.

You can also specify a transform_id of a previously created transform instead of the transform object.

Update Dataset: Request
PUT /data_sets/{dataset_id}

Example Request Body
...
{
"has_custom_transform": true,
"transform" : {
"version" : 1,
"data_maps" : [],
"transforms" : [
{
"operation" : "shift",
"spec" : {
"time": "timestamp",
"userId": "userId",
"eventType": "eventType"
}
}
]
}
}
Update A Dataset: Response
{
"id": 5023,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform" : {
"version" : 1,
"data_maps" : [],
"transforms" : [
{
"operation" : "shift",
"spec" : {
"time": "timestamp",
"userId": "userId",
"eventType": "eventType"
}
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
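
The choice between an inline transform object and a transform_id, together with the has_custom_transform flag, can be captured in a small body builder. This is only a sketch: `transform_update_body` is a hypothetical helper, and rejecting a call that passes both or neither is a choice made here, not documented API behavior.

```python
def transform_update_body(transform=None, transform_id=None, custom=False):
    """Build the JSON body for a PUT that sets a dataset's transform.

    Pass either an inline transform object or the id of a previously
    created transform. Set custom=True when the transform uses syntax
    the schema editor in the Nexla UI does not support.
    """
    if (transform is None) == (transform_id is None):
        raise ValueError("pass exactly one of transform or transform_id")
    body = {"has_custom_transform": custom}
    if transform is not None:
        body["transform"] = transform
    else:
        body["transform_id"] = transform_id
    return body


# The shift transform from the example request above.
shift_transform = {
    "version": 1,
    "data_maps": [],
    "transforms": [
        {
            "operation": "shift",
            "spec": {"time": "timestamp", "userId": "userId", "eventType": "eventType"},
        }
    ],
}

inline_body = transform_update_body(transform=shift_transform, custom=True)
by_id_body = transform_update_body(transform_id=10858)
print(inline_body["has_custom_transform"], by_id_body)
```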

Delete A Dataset

The Nexla API supports methods to delete any dataset that the authenticated user has administrative or ownership rights to.

If the dataset is paused and does not have any associated downstream resources, Nexla can delete the dataset safely. A successful request to delete a dataset returns OK (200) with no response body.

If the dataset is active or there are downstream resources that would be impacted, Nexla will not trigger deletion and will instead return a failure message explaining why the dataset could not be deleted.

Delete Dataset: Request
DELETE /data_sets/{data_set_id}
Delete Dataset: Response
Empty response with status 200 for success
Error response with reason if dataset could not be deleted
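
A caller therefore has two outcomes to handle. The sketch below builds the DELETE request and interprets a status and body pair; both helper names are hypothetical, and because the format of the error body is not specified on this page, it is treated as opaque text.

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # from the curl example earlier on this page


def delete_dataset_request(access_token, data_set_id):
    """Build (without sending) a DELETE /data_sets/{id} request."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets/{data_set_id}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="DELETE",
    )


def describe_delete_result(status, body_text=""):
    """Turn a delete response into a short message for logs or a UI."""
    if status == 200:
        return "deleted"
    # The error body format is not documented here, so pass it through as-is.
    reason = body_text.strip() or "no reason given"
    return f"not deleted (HTTP {status}): {reason}"


delete_req = delete_dataset_request("<Access-Token>", 5023)
print(delete_req.get_method(), delete_req.full_url)
print(describe_delete_result(200))
```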

Activate and Pause Datasets

Trigger Nexla to start ingesting data immediately by calling the activation method on a source. Note that a Nexla source usually contains parameters to schedule automatic ingestion based on cron intervals or completion of other jobs. The activation method triggers an ingestion in addition to the scheduled automatic source ingestion.

Activate Source: Request
PUT /data_sources/{data_source_id}/activate

Conversely, call the pause method to immediately stop ingestion on that source. Any subsequent scheduled ingestion intervals will be ignored as long as the source is paused.

Pause Source: Request
PUT /data_sources/{data_source_id}/pause
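
Both calls share the same shape and differ only in the final path segment, so a single helper can cover them. In the sketch below, `source_action_request` is a hypothetical name, the source id 5963 is borrowed from the sample responses on this page, and the request is built but not sent.

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # from the curl example earlier on this page
SOURCE_ACTIONS = ("activate", "pause")


def source_action_request(access_token, data_source_id, action):
    """Build (without sending) a PUT /data_sources/{id}/activate or /pause request."""
    if action not in SOURCE_ACTIONS:
        raise ValueError(f"action must be one of {SOURCE_ACTIONS}")
    return urllib.request.Request(
        f"{API_BASE}/data_sources/{data_source_id}/{action}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="PUT",
    )


activate_req = source_action_request("<Access-Token>", 5963, "activate")
pause_req = source_action_request("<Access-Token>", 5963, "pause")
print(activate_req.get_method(), activate_req.full_url)
print(pause_req.get_method(), pause_req.full_url)
```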

Monitor Dataset

See the section on Monitoring resources for methods to view dataset errors, notifications, quarantine samples, and audit logs.