
Ayende Rahien works for Hibernating Rhinos LTD, an Israeli-based company producing developer productivity tools for OLTP applications such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler (l2sprof.com), Entity Framework Profiler (efprof.com), and more. Ayende is a DZone MVB.

Question 6 is a trap, a very useful one

08.21.2014

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?

Questions 1–5 should take anywhere from 10–15 minutes to an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file, and allow efficient searching on it.

Questions 1–5 are basically gatekeeper questions: if you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much means we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this time, if you answer question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach (var line in File.ReadLines("big_data.csv")) // streams the file lazily, line by line
{
    var fields = line.Split(',');
    var email = fields[2];
    // one output file per record, named after the MD5 hash of the e-mail
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it also generates 400 billion files and takes up more than the original file’s size in additional space. Also, NTFS has a limit of roughly 4 billion files per volume, and other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv"); // loads the entire 15 TB file into memory
var users = new List<User>();
foreach (var line in allData)
{
    users.Add(User.Parse(line));
}
var serializer = new XmlSerializer(typeof(List<User>));
using (var output = File.Create("big_data.xml"))
{
    serializer.Serialize(output, users);
}

// search by:
List<User> users;
using (var input = File.OpenRead("big_data.xml")) // reads the whole XML file back into memory
{
    users = (List<User>)new XmlSerializer(typeof(List<User>)).Deserialize(input);
}
var matches = users.AsParallel().Where(x => x.Email == "the@email.wtf");

So: take the 15 TB file, load it all into memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it back out as XML (fails #3, #4, #5), read the entire (now larger than 15 TB) XML file back into memory (fail #6), then do a parallel brute-force search over a dataset of 400 billion records (fails #7 through #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.

Published at DZone with permission of Ayende Rahien, author and DZone MVB.


Comments

Iftah Haimovitch replied on Thu, 2014/08/21 - 8:47am

Really, this is the difficult question? And really, 8 hours? On Unix or Mac, where the "sort" system command is available, it can be done real quick, maybe 30 minutes tops:

1. Note the index creator should indicate which field(s) of the CSV are indexed (e.g. first+last name, id, phone, etc.) - this is an input to the program (i.e. as column numbers), as well as the CSV file itself.

2. Loop through the file and, for each row, write a line to a text file in the following format:
hash(indexed values), file offset of the row in the input CSV

(the file offset can be obtained with ftell or some such file API in your programming language, or kept track of while reading the rows).

3. Run the "sort" program on the resulting file in Linux (it can sort files larger than available memory).

4. The sorted file is your index - use binary search to find the line in the index:
a. hash the search value
b. jump to the 50% offset in the index file, read up to "\n", read the next line, split it by the separator, compare the hash you read to the hash of the search string, and move the offset up or down in binary-search fashion until you find the right line. Once you find the line in the index (O(log N)), it gives you the offset of the full line in the CSV.


At 400 billion items, make sure the hash you choose isn't 32-bit...

If you can't or don't want to use the sort command to sort a huge file, you can use multiple files, using the first characters of the hash as the filename, i.e. the line with hash ABCD123451233245 will be written to file ABCD12. Those files can be small enough to sort in memory using quicksort.
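For concreteness, here is a minimal C# sketch of steps 2-4 above. It is only a sketch under a few assumptions: the method names and the FNV-1a hash are illustrative choices, lines are assumed to be ASCII with "\n" endings, and entries are padded to a fixed width so the binary search can seek straight to a record instead of hunting for newlines (the variable-width, skip-to-the-next-newline variant described above works too). The external sort is assumed to be run as "LC_ALL=C sort" so the file sorts in plain byte order.

using System;
using System.IO;
using System.Text;

static class CsvIndexSketch
{
    // "0123456789abcdef,00000000000000001234\n" -> 16 + 1 + 20 + 1 bytes per entry
    const int EntryWidth = 38;

    // Step 2: write "hash,offset" for every row, tracking the byte offset ourselves.
    public static void WriteIndexEntries(string csvPath, string entriesPath, int keyColumn)
    {
        long offset = 0;
        using (var reader = new StreamReader(csvPath, Encoding.ASCII))
        using (var writer = new StreamWriter(entriesPath, false, Encoding.ASCII))
        {
            writer.NewLine = "\n";
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string key = line.Split(',')[keyColumn];
                // 64-bit hash: with ~400 billion rows a 32-bit hash would collide constantly
                writer.WriteLine("{0:x16},{1:D20}", Fnv1a64(key), offset);
                offset += line.Length + 1; // assumes ASCII and "\n" line endings
            }
        }
        // Step 3 runs outside this program:  LC_ALL=C sort entries.txt -o index.txt
    }

    // Step 4: binary search over the sorted, fixed-width index file.
    public static long? FindCsvOffset(string indexPath, string searchValue)
    {
        string target = Fnv1a64(searchValue).ToString("x16");
        using (var index = File.OpenRead(indexPath))
        {
            long lo = 0, hi = index.Length / EntryWidth; // number of entries
            var buffer = new byte[EntryWidth];
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                index.Seek(mid * EntryWidth, SeekOrigin.Begin);
                ReadExactly(index, buffer);
                string entry = Encoding.ASCII.GetString(buffer);
                int cmp = string.CompareOrdinal(entry.Substring(0, 16), target);
                if (cmp == 0)
                    return long.Parse(entry.Substring(17, 20)); // seek to this offset in the CSV
                if (cmp < 0) lo = mid + 1; else hi = mid;
            }
            return null; // not found (the caller should still verify the CSV row: hashes can collide)
        }
    }

    static void ReadExactly(Stream stream, byte[] buffer)
    {
        int read = 0;
        while (read < buffer.Length)
            read += stream.Read(buffer, read, buffer.Length - read);
    }

    // FNV-1a, just to have a concrete 64-bit hash in the sketch.
    static ulong Fnv1a64(string s)
    {
        ulong hash = 14695981039346656037UL;
        foreach (char c in s) { hash ^= (byte)c; hash *= 1099511628211UL; }
        return hash;
    }
}

Looking up a record is then FindCsvOffset("index.txt", "the@email.wtf") followed by a single seek-and-read in the CSV, so each query costs O(log N) reads of the index plus one read of the data file.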



Dapeng Liu replied on Thu, 2014/08/21 - 11:07am

In the absence of requirements, I can either purchase a supercomputer with 16 TB of RAM,

or I can buy N servers from AWS, and then it is pretty much map-reduce... what is the big deal?

And *efficient search* - the term is ambiguous... time efficiency or space efficiency?

Lots of doubts to clarify before a sensible solution can be given.

Adam Davis replied on Thu, 2014/08/21 - 5:24pm

I can see why this type of question can be useful (it gives you an idea of the candidate's knowledge of algorithms), but it's so artificial. If someone at work wanted me to "create an index for a 15 TB CSV file" I'd wonder WHY. Why don't they just load the file into a database table? Why would they wait for the file to get to 15 TB before they asked someone to create an index? Why is the file so large? Why do they need an index? Why do they want me to create a 'bandaid' for them when it seems like their process is broken? Without more background than what you've given here, I would skip this question and pick 3 of the others.

This may be a little harsh, but if your questions are completely divorced from reality then you shouldn't complain (too much) when the answers you are given are similarly unreal. In addition, I'd think that answering question 6 correctly would indicate the candidate is great at solving defined problems, but not necessarily good at defining problems or determining root causes. That would make them a "must hire" for some positions, but not for all positions.

Matthew Williams replied on Thu, 2014/08/21 - 10:33pm in response to: Adam Davis

I'm right there with you, Adam. I've seen this sort of hiring strategy before and I understand it, but a company that hires the sort of individuals who answer question 6 is not the sort of place I want to work, for two reasons.

  1. There are not enough details in the question to give any sort of sensible answer, which means the person asking it probably has a narrow-focus view (they make a bunch of assumptions they are not even aware they are making - where is the file? Local or remote, and if it is remote, what is your connection? What OS is hosting the file and running the app? Does it have to be done just in Java? Can external tools be used?) and they expect you to have a similarly narrow-focus view.
  2. People who can't look at problems like this and think in "real world" terms are not the sort of people who are generally worth working with. They live far too much in the theoretical world, often leaving the real work to others. They are great people to have in an R&D department, but when there are timeframes and pressure to implement, you need people who can get stuff done - load the sucker into a database and use that, for example - rather than writing your own code that will inevitably have some bug you do not expect.

If the question were more along the lines of "list some of the issues you might face while trying to parse and index a 15 TB CSV on the local machine using Java and nothing other than the JDK API", then that would be a whole other story. Expecting someone to write the code off the cuff and then picking on the solutions they write is .......

8 hours? I doubt it.

Raging Infernoz replied on Sat, 2014/08/30 - 7:25pm

For question 6:

1. Don't waste time reinventing the database; find one which can work fast with huge row counts, and use batch inserts to load the data.

2. If the data has common patterns, implement dictionary compression on the indexes and row data; this can significantly reduce the data size and speed up queries. I did this with web logs several years ago - I used separate dictionary tables with a 64-bit key - and the space savings and query speed-up were quite impressive!
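As a rough illustration of the dictionary-compression idea (the class and method names here are mine, not from the setup described above): repeated values are interned once, rows store only a compact 64-bit id, and a separate table maps ids back to values. This in-memory sketch assumes the number of distinct values is small enough to hold in RAM, which is the whole point of dictionary compression on data with common patterns.

using System.Collections.Generic;

// Sketch: intern repeated field values and store only their 64-bit ids in the rows.
// In a database this maps to a dictionary table (id -> value) plus an id column in the data table.
class ValueDictionary
{
    readonly Dictionary<string, long> _idsByValue = new Dictionary<string, long>();
    readonly List<string> _valuesById = new List<string>();

    // Returns the existing id for a value, or assigns the next free one.
    public long Intern(string value)
    {
        long id;
        if (_idsByValue.TryGetValue(value, out id))
            return id;
        id = _valuesById.Count;
        _valuesById.Add(value);
        _idsByValue[value] = id;
        return id;
    }

    // Turns an id back into the original value when reading rows out.
    public string Lookup(long id)
    {
        return _valuesById[(int)id]; // sketch only: assumes fewer than int.MaxValue distinct values
    }
}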
