Processing text files in Ruby simultaneously
Say you need to process several big text files with Ruby at the same time, in an orchestrated way. Processing a single file is trivial in Ruby, but processing multiple files simultaneously is quite a different problem.
I thought the best option would be using some external iterator mechanism that allowed active control over the processing flow. Of those, I wasn't sure whether an Enumerator would be better than IO#gets. I also wondered how threads and internal iterators would compare to those. It sounded like a fun experiment, so I went ahead and wrote some code to play with each approach.
For the sake of processing something, the experiments work on several 30MB files where each line is a number. For every line, they take the number from the first file and subtract the numbers on the same line of the rest of the files, then sum all those results. It is just a silly calculation that lets me confirm that each experiment computes the same value.
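All the experiments below inherit from a BaseExperiment superclass that isn't shown in the post. Based on how it is used, a minimal sketch could look like this, assuming it does nothing more than store the list of file paths and expose it through a `files` reader:

```ruby
module Multiple
  # Minimal sketch of the shared base class. The real one may differ;
  # the experiments only rely on it exposing the list of file paths.
  class BaseExperiment
    attr_reader :files

    def initialize(files)
      @files = files
    end
  end
end
```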
Loading contents into memory
```ruby
module Multiple
  class InMemoryExperiment < BaseExperiment
    def run
      # Read every line of every file into memory up front
      lines_lists = files.collect { |file| File.readlines(file) }
      count = 0
      lines_lists[0].length.times do |index|
        numbers = lines_lists.collect { |lines| lines[index].to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
Loading everything in memory will, well, load everything in memory, so it’s probably not the best option for processing big files. I say probably because, if memory is not a concern, it’s pretty fast and flexible, since dealing with an array of lines is easy.
Enumerator
```ruby
module Multiple
  class EnumeratorExperiment < BaseExperiment
    def run
      count = 0
      # File.foreach without a block returns an Enumerator
      enumerators = files.collect { |file| File.foreach(file) }
      loop do
        # Enumerator#next raises StopIteration at EOF, which ends the loop
        numbers = enumerators.collect { |enumerator| enumerator.next.to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
Most iteration methods in the standard Ruby library return an Enumerator when no block is provided. The IO and File classes are no exception.
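For example, File.foreach without a block hands you an Enumerator that you can advance by hand (the file name here is just illustrative):

```ruby
enumerator = File.foreach("numbers-1.txt") # no block: returns an Enumerator
enumerator.next  # => first line, e.g. "42\n"
enumerator.peek  # => second line, without consuming it
enumerator.next  # => second line
# When the file is exhausted, #next raises StopIteration,
# which Kernel#loop rescues to terminate cleanly.
```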
IO#gets
```ruby
class IO
  # Mimic the Enumerator interface: return the next line,
  # or raise StopIteration at EOF so Kernel#loop terminates
  def next
    gets || raise(StopIteration)
  end
end
```
```ruby
module Multiple
  class ExternalIteratorExperiment < BaseExperiment
    def run
      count = 0
      io_files = files.collect { |file| File.open(file) }
      loop do
        numbers = io_files.collect { |io_file| io_file.next.to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
IO#gets lets you consume the next line in the file every time you invoke it. Like with enumerators, this gives you full control over the iteration.
In this example, I added a next method to IO just to simulate the Enumerator interface. It raises StopIteration when there are no more lines. It would be trivial to add other methods, such as peek, if you needed to; see the sketch below.
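As an illustration (not part of the experiments), a peek could be layered on top of IO#gets by buffering a single line:

```ruby
class IO
  # Hypothetical peek: returns the next line without consuming it,
  # buffering it for the following call to #next
  def peek
    @peeked ||= gets || raise(StopIteration)
  end

  def next
    if @peeked
      line = @peeked
      @peeked = nil
      line
    else
      gets || raise(StopIteration)
    end
  end
end
```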
Threads
```ruby
module Multiple
  class ThreadsAndQueuesExperiment < BaseExperiment
    def run
      queues = Array.new(files.length) { Queue.new }
      # One producer thread per file, pushing its lines into a dedicated queue
      threads = files.collect.with_index do |file, index|
        queue = queues[index]
        Thread.new do
          File.foreach(file) do |line|
            queue << line.to_f
          end
          queue.close
        end
      end
      # The main thread consumes the queues and does the actual calculation
      count = 0
      loop do
        numbers = queues.collect { |queue| queue.pop.to_f }
        count += numbers.reduce(:-)
        break unless queues.find { |queue| !queue.empty? || !queue.closed? }
      end
      threads.each(&:join)
      count
    end
  end
end
```
This approach uses a multi-threaded producer-consumer strategy for processing the files. It creates a thread and a queue for each file to be processed. Each thread uses an internal iterator to read the numbers from its file and place them into its queue. The main thread reads the numbers from the queues and does the processing. Notice that Ruby's Queue implements the locking semantics needed for this kind of threaded processing.
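If Queue's semantics are unfamiliar, here is a tiny standalone illustration of the behavior the experiment relies on: #pop blocks while the queue is empty, and returns nil once the queue has been closed and drained:

```ruby
queue = Queue.new

producer = Thread.new do
  3.times { |i| queue << i } # push some work
  queue.close                # signal that no more items are coming
end

3.times { queue.pop } # => 0, 1, 2 (blocks while the queue is empty)
producer.join
queue.closed? # => true
queue.pop     # => nil once the queue is closed and empty
```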
Performance results
Results for processing 3 files of 30MB each (4 million numbers per file) on my 4.2 GHz i7.
Name | Time | Memory |
---|---|---|
Load contents into memory | 3.824s | 175.605MB |
Enumerator | 4.919s | 0MB |
IO#gets | 3.465s | 0MB |
Threads and Queues | 4.936s | 0MB |
- IO#gets is the fastest approach, way faster than using an Enumerator.
- The threads-based approach is the slowest of all of them.
- Unsurprisingly, loading contents in memory is bad for memory. The other approaches, working in streaming mode, remove that concern altogether.
Conclusions
IO#gets is the way to go for processing multiple files. I was expecting it to be faster than the enumerator counterpart, but I wasn't expecting such a big difference (it is 31% faster). I have already talked about enumerators being slow but wonderful. I still believe they are wonderful, but you definitely have to pay a tax for the internal sorcery they do with fibers.
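If you want to see that tax in isolation, a quick micro-benchmark along these lines (using the standard benchmark library; the path is illustrative and the numbers will vary per machine) compares the per-line cost of Enumerator#next against IO#gets:

```ruby
require "benchmark"

file = "numbers-1.txt" # illustrative path: any big text file works

Benchmark.bm(12) do |bm|
  # Advance an Enumerator by hand until StopIteration ends the loop
  bm.report("Enumerator") do
    enumerator = File.foreach(file)
    loop { enumerator.next }
  end

  # Consume lines directly with IO#gets until it returns nil at EOF
  bm.report("IO#gets") do
    File.open(file) do |io|
      nil while io.gets
    end
  end
end
```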
I want to point out that, if memory is not a concern, loading files into memory is a perfectly valid option. This is not likely to be the case for web apps, or for apps where you don't control the files being read, but in other scenarios it might be suitable. It is fast and flexible.
I was really curious about playing with threads for this case. I have little experience with concurrent programming, so I would imagine there are better approaches than the one I tested. If you think there is one, please let me know or, even better, create a pull request in the experiments GitHub project.