The ultimate guide to engineering data flows in Apache NiFi


Apache NiFi is an interactive, web-based application that can be used to transfer and manipulate data as it moves from point A to point B. The application automates most of your data flows. In this guide we explore how to use Apache NiFi to engineer our data flows.

Step 1: Installing Apache NiFi from the terminal on Mac/Linux.

Apache NiFi has a couple of requirements depending on what you want to use it for, but whatever the use case, you will need Java. To install Java on Linux we use the command shown below:
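The original command isn’t reproduced here; on a Debian/Ubuntu system, installing OpenJDK 8 (a Java version NiFi 1.x supports) might look like the following. The package name is an assumption for apt-based distributions:

```shell
# Refresh the package index and install the OpenJDK 8 JDK.
# (Debian/Ubuntu package name; on Amazon Linux the rough equivalent
# is `sudo yum install java-1.8.0-openjdk`.)
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
```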

Once the installation is done, confirm that the right version of Java is installed. You can do this with the command shown below:
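A sketch of the check, assuming `java` is on your PATH:

```shell
# Print the installed Java version; NiFi 1.x expects Java 8 or 11.
java -version
```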

The next step is to get NiFi up and running on your local machine/EC2 instance.

For this you will want to go to this – LINK and pick the download mirror closest to your location.

Once you find the link, copy it and paste it into the command shown below:
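As an illustration, the download with wget might look like this, using the Apache archive as a stand-in for your mirror link and 1.9.2 as a hypothetical version number:

```shell
# Download the NiFi binary distribution.
# Substitute the mirror URL you copied; 1.9.2 is a placeholder version.
wget https://archive.apache.org/dist/nifi/1.9.2/nifi-1.9.2-bin.tar.gz
```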

Next, use the command below to extract the downloaded archive onto your machine.
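Assuming the tarball downloaded above, extraction would look roughly like:

```shell
# Unpack the NiFi binary distribution into the current directory.
tar -xzf nifi-1.9.2-bin.tar.gz
```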

Use the command below to check whether the nifi.sh file exists on your local machine/EC2 instance.
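A sketch of the check, assuming the distribution was unpacked into a nifi-1.9.2 folder:

```shell
# The startup script ships in the bin/ folder of the extracted distribution.
ls -l nifi-1.9.2/bin/nifi.sh
```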

To start NiFi we can use the command shown below:
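Assuming the same unpack location as above:

```shell
# Launch NiFi as a background service.
./nifi-1.9.2/bin/nifi.sh start

# Optionally confirm that it is running.
./nifi-1.9.2/bin/nifi.sh status
```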

Once NiFi has started you can open it in your web browser at:

http://localhost:8080/nifi

You can stop NiFi with the command shown below:
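Again assuming the unpack location used above:

```shell
# Gracefully shut down the NiFi service.
./nifi-1.9.2/bin/nifi.sh stop
```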

Step 2: Build your first Data Flow

In this section we explore how we can get the most fundamental data flow running in order to illustrate the various components and terms that you will encounter while working with Apache NiFi.

Data transfer to and from directories in an EC2 instance:

Brief overview: This data flow is engineered to transfer a CSV file called ‘protein.csv’ from the /home/datauser/storage folder in the EC2 instance to the /home/datauser/dest folder in the same EC2 instance.

Step 1: Drag and drop the GetFile processor.

Drag the ‘Processor’ icon found at the top left of your NiFi workspace onto the canvas.
Type ‘GetFile’ in the search box and click on the ‘Add’ option as shown in the image below.

The GetFile processor is used to retrieve a chosen file from any directory on our local machine or the EC2 instance.

Configure the processor with the properties shown below and click on Apply:

In the configuration above, the Input Directory is the directory from which we want to extract the file. The File Filter can be used to select the specific file that we want to extract. In this case we are going to extract the ‘protein.csv’ file.
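Since the screenshot isn’t reproduced here, the GetFile properties described above would look roughly like this (other properties left at their defaults):

```
Input Directory : /home/datauser/storage
File Filter     : protein.csv
```

Note that by default GetFile removes the source file once it has been picked up; if you want to keep the original in place, set the Keep Source File property to true.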

Step 2: Drag and drop the PutFile Processor

The PutFile processor takes the file retrieved by the GetFile processor and writes it into a new directory of choice on our local machine or the EC2 instance.

Configure the PutFile processor with the properties shown below:

In this case the Directory is the directory that we want to transfer the ‘protein.csv’ file into.
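As with GetFile, the screenshot isn’t reproduced here; based on the description, the key PutFile properties would be roughly:

```
Directory                  : /home/datauser/dest
Create Missing Directories : true
```

Create Missing Directories (true by default) tells PutFile to create the destination folder if it does not already exist.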

Configure the ‘SETTINGS’ tab of the PutFile processor as shown below:

In the configuration above we are checking the boxes that tell NiFi to automatically terminate the ‘failure’ and ‘success’ relationships, since PutFile is the last processor in this flow and its output does not need to be routed anywhere else.

The best part about Apache NiFi is that you can schedule your data flows to run at specific intervals of time. You can also have event-based data flows that are only triggered when a specific event has taken place.

Scheduling can be configured from the scheduling tab, which can be found in any of the processors, as illustrated below:
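The scheduling screenshot isn’t reproduced here; as an illustration, a timer-driven schedule that triggers a processor once a minute would be configured roughly as:

```
Scheduling Strategy : Timer driven
Run Schedule        : 1 min
```

The CRON driven strategy accepts Quartz-style expressions instead, e.g. 0 0 0 * * ? to run once a day at midnight.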


Step 3: Connect the two processors together.

Connect the two processors by hovering over the GetFile processor, then clicking the connection arrow that appears in the middle and dragging it onto the PutFile processor:

This will produce a pop-up asking you to configure the connection. Configure the connection as shown in the image below:


Step 4: Run the processors

To run the processors, hold Shift while dragging your mouse to select all the processors, as illustrated below.

Once the selection is made and all the processors have been highlighted, we can start the data flow by clicking the play option in the Operate box found on the left of your NiFi workspace.


The dataflow successfully transfers files to and from folders within the same EC2 instance as expected.

Conclusion:

NiFi can be used for much more expansive data flow engineering, such as manipulating data with Python scripts, transferring data from SFTP servers, or monitoring data flows to and from S3 buckets. Having a web-based interface to see how your data flows work offers a unique advantage: anyone on your team with a basic understanding of data can follow them.

Happy data flows with NiFi!
