In this tutorial we will create a Jupyter notebook that fetches data from a Nexla dataset and prints it as rows of data. Once set up, this notebook will always have access to the most recent data from the Nexla dataset.
Follow along with the steps below and you'll be up and running in minutes. Or just skip ahead to the bottom and download the sample Jupyter notebook we created for this tutorial.
Jupyter notebooks are a great tool for doing data analysis. But getting data into a shape where it is ready to be analyzed is a non-trivial challenge considering the heterogeneity of data sources (APIs, SaaS services, Warehouses, File Storage systems, etc.), never mind the other operational challenges of managing data pipelines.
With Nexla, you can get the best of both worlds:
- Use Nexla's no-code / low-code UI, API, or CLI to quickly set up and manage all your data pipelines
- Use Nexla's Pandas / Jupyter integration to access your data right within your notebook and perform any analysis task.
Step 1: Create Nexla Pandas Destination
Since Nexla homogenizes data from disparate sources into similar looking datasets, let's start from the dataset we wish to use in our Jupyter notebook. We'll use Nexla UI for this tutorial, but we could do this step using Nexla API or Nexla CLI too.
- We'll log in to our Nexla UI instance and check out the Data Flow we want to work with. Next, we'll select the Dataset we wish to access in Jupyter.
- We can modify the Dataset with any transformation logic we wish. In this case, we just want to move the data as is.
- Now we'll select Pandas as a Destination
- Finally, let's configure a storage location for this Pandas destination. This can be any file storage system supported by Nexla. For this demo we will use a path in our S3 bucket for storing files. Nexla will automatically catalog data files by date.
Configure Pandas Destination
That's it! We are all done setting up the Data Flow. All we need now is the Destination ID of this destination. For this tutorial we will work with Destination ID 7228.
Step 2: Install and Import Nexla CLI
This step needs to be executed only once
Let's download and install the Nexla CLI SDK to our local development machine. Easy peasy!
Now, for this tutorial we also need to be able to access the S3 bucket for downloading the data files generated by Nexla. We'll use Nexla's built-in S3 function for managing the download. In this step we just need to configure S3 access information. Check out Boto3's configuration guide for storing credentials.
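For example, boto3 picks up credentials from the standard AWS credentials file. A minimal version looks like this (the values below are placeholders; use your own S3 keys):

```ini
# ~/.aws/credentials -- placeholder values, substitute your own
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```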
Perfect! We are now ready to start creating our notebook. Let's fire up a Jupyter notebook server and start a new notebook. First we will add a cell for importing required Nexla packages.
Step 3: Authenticate User Session
All Nexla requests need to be authenticated by fetching a short lived Access Token from Nexla UI. Let's add a new cell for launching the authentication window.
This new window will display a Nexla API URL and Access Token after successful login. Let's copy these over into our Jupyter notebook and call Nexla CLI's Auth method to start an authenticated Nexla CLI session.
Once the output of this cell says You are now logged into Nexla CLI as <Your Username> [your email], we can move on to the next step.
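As a sketch, the authentication cell might look like the following. The URL and token are placeholders you replace with the values from the login window, and the commented-out auth call is an assumption about the SDK — use the actual Auth method name from the Nexla CLI docs:

```python
# Placeholder values -- copy the real ones shown after logging in to the Nexla UI.
NEXLA_API_URL = "https://your-instance.nexla.io/nexla-api"  # hypothetical example URL
ACCESS_TOKEN = "paste-your-short-lived-access-token-here"

# Hypothetical call -- substitute the actual Auth method from the Nexla CLI SDK:
# nexla.auth(api_url=NEXLA_API_URL, access_token=ACCESS_TOKEN)
print("Nexla API URL configured:", NEXLA_API_URL)
```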
Step 4: List Generated Destination Files
Now that we have an authenticated session going, we'll list all data files generated by Nexla for this Pandas destination (ID: 7228). For this tutorial we will only use files generated in the last 10 days, but Nexla CLI's get_sink_files method offers more control.
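The 10-day filter itself is just date arithmetic. Here is a self-contained sketch of that filtering step — note the record shape below is hypothetical, not the actual Nexla CLI response format, and a live notebook would compute the cutoff from the current time rather than a fixed date:

```python
from datetime import datetime, timedelta

# Hypothetical file records -- the real get_sink_files response fields may differ.
files = [
    {"path": "s3://bucket/nexla/2024-01-05/part-0.parquet", "created_at": "2024-01-05T10:00:00"},
    {"path": "s3://bucket/nexla/2023-11-01/part-0.parquet", "created_at": "2023-11-01T09:00:00"},
]

# Fixed "now" for a reproducible example; use datetime.utcnow() in a live notebook.
cutoff = datetime(2024, 1, 10) - timedelta(days=10)
recent = [f for f in files if datetime.fromisoformat(f["created_at"]) >= cutoff]
print([f["path"] for f in recent])
```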
Step 5: Download Remote Files
Next we need to download the files in this dynamic list of Parquet files generated by Nexla. Let's first create a local download area.
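Creating the local download area is a one-liner with pathlib (the directory name is our choice, not anything Nexla mandates):

```python
from pathlib import Path

# Local staging area for the Parquet files we are about to download.
download_dir = Path("nexla_downloads")
download_dir.mkdir(parents=True, exist_ok=True)
print("Download area:", download_dir.resolve())
```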
Now we just need to call Nexla's S3 download function with the list of remote file paths and our local download directory.
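As a rough sketch of what that download step does under the hood, here is an equivalent helper written against plain boto3 (the Nexla CLI's own S3 function may have a different name and signature — check its docs):

```python
import os

def local_path_for(key: str, dest_dir: str) -> str:
    """Map an S3 object key to a local file path, keeping the base filename."""
    return os.path.join(dest_dir, os.path.basename(key))

def download_files(bucket: str, keys: list, dest_dir: str) -> list:
    """Download each S3 object in `keys` into `dest_dir`; return the local paths."""
    import boto3  # imported lazily so the helper can be defined without boto3 installed
    s3 = boto3.client("s3")
    paths = []
    for key in keys:
        path = local_path_for(key, dest_dir)
        s3.download_file(bucket, key, path)
        paths.append(path)
    return paths
```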
Step 6: Load Data into Pandas Dataframe
Once we have the data files available locally within the Jupyter notebook server, we can process them and create a dataframe by calling standard Pandas Parquet file read methods.
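For instance, assuming the files landed under a local `nexla_downloads/` folder (and that a Parquet engine such as pyarrow or fastparquet is installed), the cell can be as simple as:

```python
import glob
import pandas as pd

# Gather all downloaded Parquet files, including date-partitioned subfolders.
files = sorted(glob.glob("nexla_downloads/**/*.parquet", recursive=True))
frames = [pd.read_parquet(f) for f in files]

# Stack everything into a single dataframe (empty if no files were found).
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f"Loaded {len(df)} rows from {len(files)} files")
```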
Finish Line: Process Dataframe
That's it! We have a dataframe with a stream of data directly from the Nexla flow. As a proof of completion let's just print out the frame for sanity check.
Sweet success! With just a few clicks and a few lines of code we have an automated flow that regularly ingests data from our source (a Webhook), processes it as desired, and delivers it to our notebook within minutes.
As a final test, let's push 3 more records to the Webhook and (after a few minutes for Nexla to process and upload the files) re-run this notebook. Here's what we will see in the final cell.
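Pushing those test records is an ordinary HTTP POST. A sketch with the standard library follows — the webhook URL is a placeholder for the ingest URL shown on your Nexla source, and the actual send is left commented out:

```python
import json
import urllib.request

WEBHOOK_URL = "https://example.nexla.io/webhook/your-source"  # placeholder URL
records = [{"id": i, "message": f"test record {i}"} for i in range(3)]

# Build a JSON POST request carrying the three test records.
req = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(records).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to actually send
print("Prepared", len(records), "records for", WEBHOOK_URL)
```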
Download Sample Notebook
Here's the final notebook that we created as part of this tutorial. Just enter your information in the relevant fields (Nexla API URL, Access Token, and Destination ID) and you are good to go!