Problem Solving in real life: My internship project
My friend was drawing on the board while I looked on. We were in different teams, each assigned a project. His program currently took 5 hours to execute. The requirement stated one hour. As the ‘algorithm enthusiast’, I was called for help.
The program was to find the friends of all users in a social network. After explaining the use-case, my friend showed me his approach. I was very determined to find an optimization, hence I did. Immediately.
“An adjacency matrix?! No, use an adjacency list. The execution time will reduce in your case to O(N). Or you could use an adjacency tree! That would — blah blah….”
I made some calculations and guaranteed that the program could not take more than a second to run. My friend hurried to his computer, and I got back to my project. He came back the next day. The program had taken 5 hours again.
There must be a bug in his code. Looking into it, I found the problem. And it wasn’t any bug.
“You are making a DB call for each user?”
“Where is your DB?”
“California, I think.”
Each DB (Database) call was travelling halfway around the world to query a database before heading back. Every call took 5 seconds, and there were 1700 users. To add to it all, the DB gateway got suspicious at a large number of requests from a single client. It was throttling them.
“A stored procedure will do the job. You can code the algorithm into it.”
A stored procedure, like the name suggests, is a piece of code stored in a database. It executes within the database and sends results in a single call. This would avoid all the back and forth.
Within a few hours, his team was celebrating. Their program’s execution time had dropped to 12 minutes.
My own project was client-facing. The little backend work was coded in a script, written by the engineering manager.
From the start, I found the code disproportionately long. 300 lines to make a simple DB call? Here is the procedure:
- The database table was being queried entirely.
- Stored in a map.
- Filtered on a date range.
- Sorted according to the dates.
I changed it to:
- The database table was queried on a date range, with the results ordered and returned in a list. Only the relevant columns would be sent back.
It looks like I have just squashed the previous list into a single line. That makes a big difference though. Now, the sorting and filtering are at the DB layer and the network load is significantly smaller. The program speed went up by 400%.
I soon realized: it didn’t matter. Our data set was small, requiring no optimization. The previous script ran almost as fast, and I never mentioned the changes to the engineering manager.
I am now studying system design in detail. This involves understanding how organizations write software to handle millions of users. Specifically, how do large organizations stay available, reliable and fast? This despite the enormous load on their machines.