Processing text files in Ruby simultaneously
Say you need to process several big text files with Ruby at the same time, in an orchestrated way. Processing a single file is trivial in Ruby, but processing multiple files simultaneously is quite a different problem.
I thought the best option would be using some external iterator mechanism that allowed active control over the processing flow. Of those, I wasn't sure whether an Enumerator would be better than IO#gets. I also wondered how threads and internal iterators would compare to those. It sounded like a fun experiment, so I went ahead and wrote some code to play with each approach.
For the sake of processing something, the experiments work on several 30MB files where each line is a number. For every line, they take the number from the first file and subtract the numbers on the same line of the rest of the files, then sum all those results. It is just a silly calculation that lets me confirm that each experiment computes the same value.
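All the experiments below inherit from a BaseExperiment superclass that isn't shown in the post. Based on how it is used, a minimal sketch could look like this, assuming it does nothing more than store the list of file paths and expose it through a `files` reader:

```ruby
module Multiple
  # Minimal sketch of the shared base class. The real one may differ;
  # the experiments only rely on it exposing the list of file paths.
  class BaseExperiment
    attr_reader :files

    def initialize(files)
      @files = files
    end
  end
end
```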
Loading contents into memory
```ruby
module Multiple
  class InMemoryExperiment < BaseExperiment
    def run
      # Read every line of every file into memory up front
      lines_lists = files.collect { |file| File.readlines(file) }
      count = 0
      lines_lists[0].length.times do |index|
        numbers = lines_lists.collect { |lines| lines[index].to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
Loading everything in memory will, well, load everything in memory, so it’s probably not the best option for processing big files. I say probably because, if memory is not a concern, it’s pretty fast and flexible, since dealing with an array of lines is easy.
Enumerator
```ruby
module Multiple
  class EnumeratorExperiment < BaseExperiment
    def run
      count = 0
      # File.foreach without a block returns an Enumerator
      enumerators = files.collect { |file| File.foreach(file) }
      loop do
        # Enumerator#next raises StopIteration at EOF, which ends the loop
        numbers = enumerators.collect { |enumerator| enumerator.next.to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
Most iteration methods in the standard Ruby library return an Enumerator when no block is provided. The IO and File classes are no exception.
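For example, File.foreach without a block hands you an Enumerator that you can advance by hand (the file name here is just illustrative):

```ruby
enumerator = File.foreach("numbers-1.txt") # no block: returns an Enumerator
enumerator.next  # => first line, e.g. "42\n"
enumerator.peek  # => second line, without consuming it
enumerator.next  # => second line
# When the file is exhausted, #next raises StopIteration,
# which Kernel#loop rescues to terminate cleanly.
```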
IO#gets
```ruby
class IO
  # Mimic the Enumerator interface: return the next line,
  # or raise StopIteration at EOF so Kernel#loop terminates
  def next
    gets || raise(StopIteration)
  end
end
```
```ruby
module Multiple
  class ExternalIteratorExperiment < BaseExperiment
    def run
      count = 0
      io_files = files.collect { |file| File.open(file) }
      loop do
        numbers = io_files.collect { |io_file| io_file.next.to_f }
        count += numbers.reduce(:-)
      end
      count
    end
  end
end
```
IO#gets lets you consume the next line in the file every time you invoke it. Like with enumerators, this gives you full control over the iteration.
In this example, I added a next method to IO just to simulate the Enumerator interface. It raises StopIteration when there are no more lines. It would be trivial to add other methods, such as peek, if you needed to; see the sketch below.
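As an illustration (not part of the experiments), a peek could be layered on top of IO#gets by buffering a single line:

```ruby
class IO
  # Hypothetical peek: returns the next line without consuming it,
  # buffering it for the following call to #next
  def peek
    @peeked ||= gets || raise(StopIteration)
  end

  def next
    if @peeked
      line = @peeked
      @peeked = nil
      line
    else
      gets || raise(StopIteration)
    end
  end
end
```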
Threads
```ruby
module Multiple
  class ThreadsAndQueuesExperiment < BaseExperiment
    def run
      queues = Array.new(files.length) { Queue.new }
      # One producer thread per file, pushing its lines into a dedicated queue
      threads = files.collect.with_index do |file, index|
        queue = queues[index]
        Thread.new do
          File.foreach(file) do |line|
            queue << line.to_f
          end
          queue.close
        end
      end
      # The main thread consumes the queues and does the actual calculation
      count = 0
      loop do
        numbers = queues.collect { |queue| queue.pop.to_f }
        count += numbers.reduce(:-)
        break unless queues.find { |queue| !queue.empty? || !queue.closed? }
      end
      threads.each(&:join)
      count
    end
  end
end
```
This approach uses a multi-threaded producer-consumer strategy for processing the files. It creates a thread and a queue for each file to be processed. Each thread uses an internal iterator to read the numbers from its file and place them into its queue. The main thread reads the numbers from the queues and does the processing. Notice that Ruby's Queue implements the locking semantics needed for this kind of threaded processing.
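If Queue's semantics are unfamiliar, here is a tiny standalone illustration of the behavior the experiment relies on: #pop blocks while the queue is empty, and returns nil once the queue has been closed and drained:

```ruby
queue = Queue.new

producer = Thread.new do
  3.times { |i| queue << i } # push some work
  queue.close                # signal that no more items are coming
end

3.times { queue.pop } # => 0, 1, 2 (blocks while the queue is empty)
producer.join
queue.closed? # => true
queue.pop     # => nil once the queue is closed and empty
```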
Performance results
Results for processing 3 files of 30MB each (4 million numbers per file) on my 4.2 GHz i7.
Name | Time | Memory |
---|---|---|
Load contents into memory | 3.824s | 175.605MB |
Enumerator | 4.919s | 0MB |
IO#gets | 3.465s | 0MB |
Threads and Queues | 4.936s | 0MB |
- IO#gets is the fastest approach, way faster than using an Enumerator.
- The threads-based approach is the slowest of all of them.
- Unsurprisingly, loading contents in memory is bad for memory. The other approaches, working in streaming mode, remove that concern altogether.
Conclusions
IO#gets is the way to go for processing multiple files. I was expecting it to be faster than the enumerator counterpart, but I wasn't expecting such a big difference (it is 31% faster). I have already talked about enumerators being slow but wonderful. I still believe they are wonderful, but you definitely have to pay a tax for the internal sorcery they do with fibers.
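If you want to see that tax in isolation, a quick micro-benchmark along these lines (using the standard benchmark library; the path is illustrative and the numbers will vary per machine) compares the per-line cost of Enumerator#next against IO#gets:

```ruby
require "benchmark"

file = "numbers-1.txt" # illustrative path: any big text file works

Benchmark.bm(12) do |bm|
  # Advance an Enumerator by hand until StopIteration ends the loop
  bm.report("Enumerator") do
    enumerator = File.foreach(file)
    loop { enumerator.next }
  end

  # Consume lines directly with IO#gets until it returns nil at EOF
  bm.report("IO#gets") do
    File.open(file) do |io|
      nil while io.gets
    end
  end
end
```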
I want to point out that, if memory is not a concern, loading files into memory is a perfectly valid option. This is not likely to be the case for web apps, or for apps where you don't control the files being read, but in other scenarios it might be suitable. It is fast and flexible.
I was really curious about playing with threads for this case. I have little experience with concurrent programming, so I would imagine there are better approaches than the one I tested. If you think there is one, please let me know or, even better, create a pull request in the experiments GitHub project.