Using Node.js to Read Really, Really Large Files (Pt 1)
This blog post has an interesting inspiration point. Last week, someone in one of my Slack channels posted a coding challenge he’d received for a developer position with an insurance technology company.
It piqued my interest, as the challenge involved reading through very large files of data from the Federal Election Commission and displaying specific data from those files. Since I’ve not worked much with raw data, and I’m always up for a new challenge, I decided to tackle this with Node.js and see if I could complete the challenge myself, for the fun of it.
Here are the four questions asked, and a link to the data set that the program was to parse through.
- Write a program that will print out the total number of lines in the file.
- Notice that the 8th column contains a person’s name. Write a program that loads in this data and creates an array with all name strings. Print out the 432nd and 43243rd names.
- Notice that the 5th column contains a form of date. Count how many donations occurred in each month and print out the results.
- Notice that the 8th column contains a person’s name. Create an array with each first name. Identify the most common first name in the data and how many times it occurs.
When you unzip the folder, you should see one main .txt
Not too terrible, right? Seems doable. So let’s talk about how I approached this.
The Two Original Node.js Solutions I Came Up With
Processing large files is nothing new to JavaScript. In fact, the core functionality of Node.js includes a number of standard solutions for reading from and writing to files.
The most straightforward is fs.readFile(), wherein the whole file is read into memory and then acted upon once Node has read it. The second option is fs.createReadStream(), which streams the data in (and out), similar to other languages like Python and Java.
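To illustrate the difference between the two approaches, here is a minimal sketch that counts newline characters both ways. The helper names (countLinesWithReadFile, countLinesWithStream, countNewlines) are hypothetical and chosen only for this example. This is not the solution code for the challenge.

```javascript
const fs = require('fs');

// Option 1: fs.readFile() loads the entire file into memory,
// then calls back once with the full contents.
function countLinesWithReadFile(filePath, callback) {
  fs.readFile(filePath, 'utf8', (err, contents) => {
    if (err) return callback(err);
    callback(null, countNewlines(contents));
  });
}

// Option 2: fs.createReadStream() delivers the file in chunks,
// so only one chunk needs to be held in memory at a time.
function countLinesWithStream(filePath, callback) {
  let count = 0;
  fs.createReadStream(filePath, { encoding: 'utf8' })
    .on('data', (chunk) => { count += countNewlines(chunk); })
    .on('error', callback)
    .on('end', () => callback(null, count));
}

// Hypothetical helper: counts '\n' characters in a string.
// A final line with no trailing newline is not counted.
function countNewlines(text) {
  let n = 0;
  for (let i = text.indexOf('\n'); i !== -1; i = text.indexOf('\n', i + 1)) {
    n++;
  }
  return n;
}
```

Both functions should report the same number for the same file. The practical difference is that the first one has to fit the whole file in memory at once, while the second only ever holds a single chunk.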
The Solution I Chose to Run With & Why
Since my solution needed to involve such things as counting the total number of lines and parsing through each line to get donation names and dates, I chose to use the second method: fs.createReadStream(). Then, I could pull the necessary data from each line of the file as I streamed through the document.
It seemed easier to me than having to split apart the whole file once it was read in and run through the lines that way.
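To make the line-by-line idea concrete, here is a minimal sketch that layers Node's built-in readline module on top of fs.createReadStream(). It is a generic illustration, not the solution code for the challenge. The pipe delimiter, the "LAST, FIRST" name layout, and the names summarize, firstNameOf, and mostCommon are all assumptions made for this example.

```javascript
const fs = require('fs');
const readline = require('readline');

// Assumption: fields are pipe-delimited and the name sits in the
// 8th column (index 7). Adjust both to match the real file.
const DELIMITER = '|';
const NAME_COLUMN = 7;

// Assumption: names look like "LAST, FIRST MIDDLE".
// Returns '' when no first name can be found.
function firstNameOf(fullName) {
  const afterComma = (fullName.split(',')[1] || '').trim();
  return afterComma.split(/\s+/)[0] || '';
}

// Returns [key, count] for the most frequent key in a Map of counts.
function mostCommon(counts) {
  let best = [null, 0];
  for (const [key, n] of counts) {
    if (n > best[1]) best = [key, n];
  }
  return best;
}

// Streams the file one line at a time and gathers a few summaries.
async function summarize(filePath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let lineCount = 0;
  const names = [];
  const firstNameCounts = new Map();

  for await (const line of rl) {
    lineCount++;
    const name = line.split(DELIMITER)[NAME_COLUMN] || '';
    names.push(name);
    const first = firstNameOf(name);
    if (first) {
      firstNameCounts.set(first, (firstNameCounts.get(first) || 0) + 1);
    }
  }

  return { lineCount, names, mostCommonFirstName: mostCommon(firstNameCounts) };
}
```

Calling summarize() with a file path resolves to an object holding the line count, the array of names, and the most common first name with its count. Because each line is handled as it arrives, there is no step where the whole file has to be split apart after being read in.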
Node.js createReadStream() & readFile() Code Implementation
Below is the code I came up with using Node.js’s fs.createReadStream() function. I’ll break it down below.