Datasets

A Nexla dataset is a virtual representation of a data model: it contains the schema, samples, and metadata inferred from the data. A dataset is distinguished by its input schema (either detected in the source or inherited from a parent dataset) and its set of transformations. Applying the transformations to the input schema, or to data records matching that schema, defines the outgoing schema, which may be associated with a dataset.

List All Datasets

Both the Nexla API and the Nexla CLI support listing all datasets in the authenticated user's account. A successful call returns details for each dataset, such as its id, owner, parent (a source or another dataset), input and output schemas, and the transform rules used to generate it.

Nexla API
Nexla CLI

List All Datasets: Request
GET /data_sets

Example:
curl https://api.nexla.io/data_sets          \
  -H "Authorization: Bearer <Access-Token>"     \
  -H "Accept: application/vnd.nexla.api.v1+json"

List All Datasets: Request

nexla dataset list
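The same request can be built from a script. The following is a minimal sketch using only the Python standard library; the host and headers are copied from the curl example above, while the function names `list_datasets_request` and `fetch_json` are illustrative and not part of any Nexla SDK.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # host taken from the curl example
ACCEPT = "application/vnd.nexla.api.v1+json"


def list_datasets_request(access_token: str) -> urllib.request.Request:
    """Build (but do not send) the GET /data_sets request."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": ACCEPT,
        },
        method="GET",
    )


def fetch_json(request: urllib.request.Request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(list_datasets_request(token))` with a valid access token would then be expected to return the list of dataset records.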

Nexla API
Nexla CLI

List All Datasets: Response
[
  {
    "id": 8159,
    "owner": {
      "id": 82,
      "full_name": "Kunjal Sharma",
      "email": "kunjal@nexla.com",
      "email_verified_at": "2018-03-06T22:24:47.000Z"
    },
    "org": {
      "id": 1,
      "name": "Nexla",
      "email_domain": "nexla.com",
      "email": null
    },
    "version": 19116,
    "name": null,
    "description": null,
    "access_roles": ["owner"],
    "status": "ACTIVE",
    "sample_service_id": null,
    "source_path": {},
    "public": false,
    "managed": false,
    "data_source_id": null,
    "parent_data_sets": [
      {
        "id": 8158,
        "name": null,
        "description": null,
        "updated_at": "2019-08-12T12:59:19.000Z",
        "created_at": "2019-08-12T12:54:50.000Z"
      }
    ],
    "data_sinks": [
      {
        "id": 5888,
        "name": "test_s3",
        "sink_type": "s3"
      }
    ],
    "sharers": [],
    "external_sharers": [],
    "has_custom_transform": false,
    "output_schema_validation_enabled": false,
    "generate_output_schema": false,
    "updated_at": "2019-08-12T12:59:22.000Z",
    "created_at": "2019-08-12T12:56:00.000Z",
    "tags": []
  },
  {
    "id": 8158,
    "owner": {
      "id": 82,
      "full_name": "Kunjal Sharma",
      "email": "kunjal@nexla.com",
      "email_verified_at": "2018-03-06T22:24:47.000Z"
    },
    "org": {
      "id": 1,
      "name": "Nexla",
      "email_domain": "nexla.com",
      "email": null
    },
    "version": 19115,
    "name": null,
    "description": null,
    "access_roles": ["owner"],
    "status": "ACTIVE",
    "sample_service_id": null,
    "source_path": {},
    "public": false,
    "managed": false,
    "data_source_id": null,
    "parent_data_sets": [
      {
        "id": 8157,
        "name": "test_s3_1",
        "description": "DataSet #1 detected from test_s3",
        "updated_at": "2019-08-12T12:54:51.000Z",
        "created_at": "2019-08-12T12:25:53.000Z"
      }
    ],
    "data_sinks": [],
    "sharers": [],
    "external_sharers": [],
    "has_custom_transform": false,
    "output_schema_validation_enabled": false,
    "generate_output_schema": false,
    "updated_at": "2019-08-12T12:59:19.000Z",
    "created_at": "2019-08-12T12:54:50.000Z",
    "tags": []
  }
]

List All Datasets: Response
  id      status       name
----     --------     -----------------------------
5081      PAUSED       test_dataset
5666      INIT         test1_dataset
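If you only need a compact overview from the API response, similar to the CLI listing, you can reduce each record to a few fields. This is a sketch that assumes the response has already been decoded from JSON; the field names come from the sample response above, and the `summarize` helper is hypothetical.

```python
import json

# Trimmed copy of the sample API response shown above.
SAMPLE_RESPONSE = """
[
  {"id": 8159, "status": "ACTIVE", "name": null,
   "parent_data_sets": [{"id": 8158, "name": null}]},
  {"id": 8158, "status": "ACTIVE", "name": null,
   "parent_data_sets": [{"id": 8157, "name": "test_s3_1"}]}
]
"""


def summarize(datasets):
    """Return one (id, status, name, parent_ids) tuple per dataset record."""
    rows = []
    for ds in datasets:
        parent_ids = [parent["id"] for parent in ds.get("parent_data_sets", [])]
        # A dataset name may be null in the response; show it as an empty string.
        rows.append((ds["id"], ds["status"], ds.get("name") or "", parent_ids))
    return rows


rows = summarize(json.loads(SAMPLE_RESPONSE))
```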

List Datasets for Source

You can retrieve a list of all data sets associated with a particular data source by including a data_source_id query parameter. You can limit the list further by including the source_schema_id query parameter in the GET request. The API will return only data sets for the given data source which have the matching source_schema_id attribute.

Nexla API

List Datasets By Source: Request
GET /data_sets?data_source_id={data_source_id}&expand=1

Nexla API

List Datasets By Source: Response
[
  {
    "id": 8085,
    "owner": {
      "id": 82,
      ...
    },
    "org": {
      "id": 1,
      ...
    },
    "version": 18906,
    "name": "1 - echo",
    "description": "DataSet #1 detected from echo",
    "access_roles": ["owner"],
    "status": "INIT",
    "sample_service_id": null,
    "source_path": {},
    "public": false,
    "managed": false,
    "data_source_id": 5963,
    "data_source": {
      "id": 5963,
      ...
    },
    "source_schema_id": "1072858493",
    "source_schema": {
      "type": "object",
       ...
    },
    "parent_data_sets": [],
    "data_sinks": [],
    "sharers": [],
    "external_sharers": [],
    "has_custom_transform": false,
    "transform_id": null,
    "transform": {
      "version": 1,
      "data_maps": [],
      "transforms": []
    },
    "output_schema": {
      "type": "object",
      ...
    },
    "output_validation_schema": {},
    "output_schema_validation_enabled": false,
    "generate_output_schema": false,
    "updated_at": "2019-08-01T12:13:54.000Z",
    "created_at": "2019-07-11T10:16:52.000Z",
    "tags": []
  },
  {
    "id": 8086,
    "owner": {
      "id": 82,
      ...
    },
    "org": {
      "id": 1,
       ...
    },
    "version": 18906,
    "name": "1 - echo",
    "description": "DataSet #2 detected from echo",
    "access_roles": ["owner"],
    "status": "INIT",
    "sample_service_id": null,
    "source_path": {},
    "public": false,
    "managed": false,
    "data_source_id": 5963,
    "data_source": {
      "id": 5963,
      ...
    },
    "source_schema_id": "1072858493",
    "source_schema": {
      "type": "object",
      ...
    },
    "parent_data_sets": [],
    "data_sinks": [],
    "sharers": [],
    "external_sharers": [],
    "has_custom_transform": false,
    "transform_id": null,
    "transform": {
      "version": 1,
      "data_maps": [],
      "transforms": []
    },
    "output_schema": {
      "type": "object",
      ...
    },
    "output_validation_schema": {},
    "output_schema_validation_enabled": false,
    "generate_output_schema": false,
    "updated_at": "2019-08-01T12:13:54.000Z",
    "created_at": "2019-07-11T10:16:52.000Z",
    "tags": []
  }
]

Nexla API

List Dataset By Source Schema: Request
GET /data_sets?data_source_id={data_source_id}&source_schema_id={source_schema_id}&expand=1

Nexla API

List Dataset By Source Schema: Response
[
  {
    "id": 8085,
    "owner": {
      "id": 82,
      ...
    },
    "org": {
      "id": 1,
      ...
    },
    "version": 18906,
    "name": "1 - echo",
    "description": "DataSet #1 detected from echo",
    "access_roles": ["owner"],
    "status": "INIT",
    "sample_service_id": null,
    "source_path": {},
    "public": false,
    "managed": false,
    "data_source_id": 5963,
    "data_source": {
      "id": 5963,
       ...
    },
    "source_schema_id": "1072858493",
    "source_schema": {
    ...
    },
    "parent_data_sets": [],
    "data_sinks": [],
    "sharers": [],
    "external_sharers": [],
    "has_custom_transform": false,
    "transform_id": null,
    "transform": {
      "version": 1,
      "data_maps": [],
      "transforms": []
    },
    "output_schema": {
      "type": "object",
      ...
    },
    "output_validation_schema": {},
    "output_schema_validation_enabled": false,
    "generate_output_schema": false,
    "updated_at": "2019-08-01T12:13:54.000Z",
    "created_at": "2019-07-11T10:16:52.000Z",
    "tags": []
  }
]
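Both filtered requests above differ only in their query string, so one helper can compose either. A minimal sketch, assuming the host from the earlier curl example; `datasets_by_source_url` is an illustrative name.

```python
from urllib.parse import urlencode

API_BASE = "https://api.nexla.io"  # host taken from the earlier curl example


def datasets_by_source_url(data_source_id, source_schema_id=None, expand=True):
    """Compose the GET /data_sets URL filtered by source and, optionally, schema."""
    params = {"data_source_id": data_source_id}
    if source_schema_id is not None:
        # Narrows the list to datasets with a matching source_schema_id.
        params["source_schema_id"] = source_schema_id
    if expand:
        params["expand"] = 1
    return f"{API_BASE}/data_sets?{urlencode(params)}"
```

For example, `datasets_by_source_url(5963, "1072858493")` yields the URL for the "By Source Schema" request with the ids used in the sample responses.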

Show One Dataset

Fetch a specific dataset accessible to the authenticated user. With the Nexla API, add an expand query parameter with a truthy value (for example, expand=1) to get more details about the dataset. With this parameter, the response also includes full details about related resources (detected datasets, credentials, etc.).

A data set has either a non-null source_schema or parent_data_set. The former refers to a schema detected in data read from the data source itself. The latter refers to the data set which precedes the current one in the pipeline of data processing.

A data set always has a transform attribute, which may be null or an empty object. This transform is applied to data incoming from the source or parent data set to produce outgoing data matching the output_schema.

A data set may have a non-empty data_samples attribute containing one or more objects matching the schema from source_schema or parent_data_set.
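The rule that a data set takes its input either from a detected schema or from a parent can be checked on a decoded response. The sketch below assumes the field names seen in the sample responses in this section (`source_schema` as an object and `parent_data_sets` as a list) and treats an empty object or empty list as "not set"; `input_origin` is a hypothetical helper.

```python
def input_origin(dataset: dict) -> str:
    """Report where a data set's input schema comes from."""
    if dataset.get("parent_data_sets"):
        # Derived data set: it follows another data set in the pipeline.
        return "parent_data_set"
    if dataset.get("source_schema"):
        # Detected data set: its schema was found in data read from the source.
        return "source_schema"
    return "unknown"


detected = {"source_schema": {"type": "object"}, "parent_data_sets": []}
derived = {"source_schema": {}, "parent_data_sets": [{"id": 8085}]}
```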

Nexla API
Nexla CLI

Show One Dataset: Request
GET /data_sets/{data_set_id}?expand=1

Show One Dataset: Request
nexla dataset get <dataset_id>

Nexla API
Nexla CLI

Show One Dataset: Response
{
  "id": 8086,
  "owner": {
    "id": 82,
    ...
  },
  "org": {
    "id": 1,
    ...
  },
  "version": 18914,
  "name": "echo",
  "description": "",
  "access_roles": ["owner"],
  "status": "ACTIVE",
  "sample_service_id": null,
  "source_path": {},
  "public": false,
  "managed": false,
  "data_source_id": null,
  "source_schema_id": null,
  "source_schema": {},
  "parent_data_sets": [
    {
      "id": 8085,
      ...
    }
  ],
  "data_sinks": [],
  "sharers": [],
  "external_sharers": [],
  "has_custom_transform": false,
  "transform_id": 10858,
  "transform": {
    "version": 1,
    "data_maps": [],
    "transforms": [
      {
        ...
      }
    ]
  },
  "output_schema": {
    "type": "object",
    "properties": {
      ...
    },
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$schema-id": 734129478
  },
  "output_validation_schema": {},
  "output_schema_validation_enabled": false,
  "generate_output_schema": false,
  "updated_at": "2019-08-01T12:57:54.000Z",
  "created_at": "2019-07-11T10:16:56.000Z",
  "tags": []
}

Show One Dataset: Response
{
  "id": 8086,
  "owner": {
    "id": 82,
    ...
  },
  "org": {
    "id": 1,
    ...
  },
  "version": 18914,
  "name": "echo",
  "description": "",
  "access_roles": ["owner"],
  "status": "ACTIVE",
  "sample_service_id": null,
  "source_path": {},
  "public": false,
  "managed": false,
  "data_source_id": null,
  "source_schema_id": null,
  "source_schema": {},
  "parent_data_sets": [
    {
      "id": 8085,
      ...
    }
  ],
  "data_sinks": [],
  "sharers": [],
  "external_sharers": [],
  "has_custom_transform": false,
  "transform_id": 10858,
  "transform": {
    "version": 1,
    "data_maps": [],
    "transforms": [
      {
        ...
      }
    ]
  },
  "output_schema": {
    "type": "object",
    "properties": {
      ...
    },
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$schema-id": 734129478
  },
  "output_validation_schema": {},
  "output_schema_validation_enabled": false,
  "generate_output_schema": false,
  "updated_at": "2019-08-01T12:57:54.000Z",
  "created_at": "2019-07-11T10:16:56.000Z",
  "tags": []
}

Create A Dataset

Creating a dataset requires a parent dataset, which defines the input, and a transform, which defines how that input is modified to generate the output. See the section on transforms for the different ways of creating transforms.

Nexla API
Nexla CLI

Create A Dataset: Request
POST /data_sets

Example Request Body
...
{
  "name": "Test Dataset",
  "description": "",
  "parent_data_set_id": 22186,
  "has_custom_transform": true,
  "transform": {
    "version": 1,
    "data_maps": [],
    "transforms": [],
    "custom": true
  }
}
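The request body above can also be assembled programmatically. This is a sketch rather than an official client: the body fields mirror the example, the host and the Authorization and Accept headers come from the earlier curl example, and the `Content-Type: application/json` header is an assumption about what the API expects for JSON bodies.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # host taken from the earlier curl example


def create_dataset_request(access_token, name, parent_data_set_id, transform,
                           custom=False, description=""):
    """Build (but do not send) a POST /data_sets request."""
    body = {
        "name": name,
        "description": description,
        "parent_data_set_id": parent_data_set_id,  # defines the input
        "has_custom_transform": custom,
        "transform": transform,                    # defines the output
    }
    return urllib.request.Request(
        f"{API_BASE}/data_sets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
            "Content-Type": "application/json",  # assumed, not shown in the source
        },
        method="POST",
    )
```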

Nexla API

Create A Dataset: Response
{
  "id": 8086,
  "owner": {
    "id": 82,
    ...
  },
  "org": {
    "id": 1,
    ...
  },
  "version": 18914,
  "name": "Test Dataset",
  "description": "",
  "access_roles": ["owner"],
  "status": "ACTIVE",
  "sample_service_id": null,
  "source_path": {},
  "public": false,
  "managed": false,
  "data_source_id": null,
  "source_schema_id": null,
  "source_schema": {},
  "parent_data_sets": [
    {
      "id": 8085,
      ...
    }
  ],
  "data_sinks": [],
  "sharers": [],
  "external_sharers": [],
  "has_custom_transform": false,
  "transform_id": 10858,
  "transform": {
    "version": 1,
    "data_maps": [],
    "transforms": [
      {
        ...
      }
    ]
  },
  "output_schema": {
    "type": "object",
    "properties": {
      ...
    },
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$schema-id": 734129478
  },
  "output_validation_schema": {},
  "output_schema_validation_enabled": false,
  "generate_output_schema": false,
  "updated_at": "2019-08-01T12:57:54.000Z",
  "created_at": "2019-07-11T10:16:56.000Z",
  "tags": []
}

Update a Dataset

The Nexla API supports updating any property of an existing dataset that the authenticated user has access to.

Nexla API

Update Dataset: Request
PUT /data_sets/5023

Example Request Body
...
{
  "name": "Test Dataset"
}

Nexla API

Update A Dataset: Response
{
  "id": 5023,
  "owner": {
    "id": 82,
    ...
  },
  "org": {
    "id": 1,
    ...
  },
  "version": 18914,
  "name": "echo",
  "description": "",
  "access_roles": ["owner"],
  "status": "ACTIVE",
  "sample_service_id": null,
  "source_path": {},
  "public": false,
  "managed": false,
  "data_source_id": null,
  "source_schema_id": null,
  "source_schema": {},
  "parent_data_sets": [
    {
      "id": 8085,
      ...
    }
  ],
  "data_sinks": [],
  "sharers": [],
  "external_sharers": [],
  "has_custom_transform": false,
  "transform_id": 10858,
  "transform": {
    "version": 1,
    "data_maps": [],
    "transforms": [
      {
        ...
      }
    ]
  },
  "output_schema": {
    "type": "object",
    "properties": {
      ...
    },
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$schema-id": 734129478
  },
  "output_validation_schema": {},
  "output_schema_validation_enabled": false,
  "generate_output_schema": false,
  "updated_at": "2019-08-01T12:57:54.000Z",
  "created_at": "2019-07-11T10:16:56.000Z",
  "tags": []
}

Update with Custom Transform

Data set transforms are normally constructed through the schema editor in the Nexla UI, which contains logic for translating user actions on the data set attributes into transform rule syntax.

You can set a transform directly on a data set by including it in your POST or PUT input. Omit the has_custom_transform attribute, or set it to false, if the transform you're saving is compatible with the schema editor in the Nexla UI. If your transform contains syntax or modifiers that are not supported in the UI, set has_custom_transform to true to disable the transform tools in the schema editor (which might otherwise override or delete your custom modifications).

You can also specify a transform_id of a previously created transform instead of the transform object.

Nexla API

Update Dataset: Request
PUT /data_sets/{dataset_id}

Example Request Body
...
{
  "has_custom_transform": true,
  "transform" : {
    "version" : 1,
    "data_maps" : [],
    "transforms" : [
      {
        "operation" : "shift",
        "spec" : {
          "time": "timestamp",
          "userId": "userId",
          "eventType": "eventType"
        }
      }
    ]
  }
}
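The two ways of supplying a transform, inline or by transform_id, can be captured in a small body builder. This is a sketch: `transform_update_body` and its `ui_compatible` flag are illustrative names, and rejecting a call that passes both or neither is a choice made here, not documented API behavior.

```python
def transform_update_body(transform=None, transform_id=None, ui_compatible=True):
    """Build a PUT /data_sets/{dataset_id} body that sets a transform."""
    if (transform is None) == (transform_id is None):
        raise ValueError("provide exactly one of transform or transform_id")
    body = {}
    if not ui_compatible:
        # Disables the schema editor's transform tools for this data set.
        body["has_custom_transform"] = True
    if transform_id is not None:
        body["transform_id"] = transform_id  # reuse a previously created transform
    else:
        body["transform"] = transform        # inline transform object
    return body
```

For a transform the schema editor cannot represent, pass `ui_compatible=False` so that the body carries `has_custom_transform: true`, as in the example request above.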

Nexla API

Update A Dataset: Response
{
  "id": 5023,
  "owner": {
    "id": 82,
    ...
  },
  "org": {
    "id": 1,
    ...
  },
  "version": 18914,
  "name": "echo",
  "description": "",
  "access_roles": ["owner"],
  "status": "ACTIVE",
  "sample_service_id": null,
  "source_path": {},
  "public": false,
  "managed": false,
  "data_source_id": null,
  "source_schema_id": null,
  "source_schema": {},
  "parent_data_sets": [
    {
      "id": 8085,
      ...
    }
  ],
  "data_sinks": [],
  "sharers": [],
  "external_sharers": [],
  "has_custom_transform": false,
  "transform_id": 10858,
  "transform" : {
    "version" : 1,
    "data_maps" : [],
    "transforms" : [
      {
        "operation" : "shift",
        "spec" : {
          "time": "timestamp",
          "userId": "userId",
          "eventType": "eventType"
        }
      }
    ]
  },
  "output_schema": {
    "type": "object",
    "properties": {
      ...
    },
    "$schema": "http://json-schema.org/draft-04/schema#",
    "$schema-id": 734129478
  },
  "output_validation_schema": {},
  "output_schema_validation_enabled": false,
  "generate_output_schema": false,
  "updated_at": "2019-08-01T12:57:54.000Z",
  "created_at": "2019-07-11T10:16:56.000Z",
  "tags": []
}

Delete A Dataset

The Nexla API supports deleting any dataset that the authenticated user has administrative or ownership rights to.

If the dataset is paused and does not have any associated downstream resources, Nexla can delete the dataset safely. A successful request to delete a dataset returns OK (200) with no response body.

If the dataset is active, or there are downstream resources that would be impacted, Nexla does not trigger deletion and instead returns a failure message stating why the deletion was denied.

Nexla API

Delete Dataset: Request
DELETE /data_sets/{data_set_id}

Nexla API

Delete Dataset: Response
Empty response with status 200 for success
Error response with reason if dataset could not be deleted
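A client has to distinguish these two outcomes. The sketch below uses the Python standard library, where non-2xx responses raise `urllib.error.HTTPError`; the `delete_dataset` helper and its injectable `opener` parameter are illustrative, and the exact error status and body format are not specified here, so the body is returned as plain text.

```python
import urllib.error
import urllib.request

API_BASE = "https://api.nexla.io"  # host taken from the earlier curl example


def delete_dataset(data_set_id, access_token, opener=urllib.request.urlopen):
    """Send DELETE /data_sets/{data_set_id}; return (deleted, reason)."""
    request = urllib.request.Request(
        f"{API_BASE}/data_sets/{data_set_id}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="DELETE",
    )
    try:
        with opener(request) as response:
            # Success is an empty response with status 200.
            return response.status == 200, ""
    except urllib.error.HTTPError as err:
        # Deletion was denied; the response body carries the reason.
        return False, err.read().decode("utf-8", errors="replace")
```

A caller might pause the dataset or remove downstream resources and retry when the returned reason indicates one of the conditions described above.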

Activate and Pause Datas

Trigger Nexla to start ingesting data immediately by calling the activation method on a source. Note that a Nexla source usually contains parameters to schedule automatic ingestion based on cron intervals or on completion of other jobs. The activation method triggers an ingestion in addition to the scheduled automatic source ingestion.

Nexla API
Nexla CLI

Activate Source: Request
PUT /data_sources/{data_source_id}/activate

Activate Source: Request
nexla source activate <source_id>

On the flip side, call the pause method to immediately stop ingestion on that source. Any subsequent scheduled ingestion intervals will be ignored as long as the source is paused.

Nexla API
Nexla CLI

Pause Source: Request
PUT /data_sources/{data_source_id}/pause

Pause Source: Request
nexla source pause <source_id>
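Because the activate and pause endpoints differ only in the final path segment, a single helper can build either request. A minimal sketch, assuming the host and headers from the earlier curl example; `source_state_request` is a hypothetical name.

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # host taken from the earlier curl example
ACTIONS = ("activate", "pause")


def source_state_request(data_source_id, action, access_token):
    """Build (but do not send) PUT /data_sources/{data_source_id}/{action}."""
    if action not in ACTIONS:
        raise ValueError(f"action must be one of {ACTIONS}")
    return urllib.request.Request(
        f"{API_BASE}/data_sources/{data_source_id}/{action}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="PUT",
    )
```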

Monitor Dataset

See the section on Monitoring resources for methods to view dataset errors, notifications, quarantine samples, and audit logs.
