We’ve been using the rake task, extract_fixtures (from Rails Recipe #42), as a convenient way of lifting data generated in development up into production (where we invoke the rake task, db:fixtures:load, to populate the db tables). Whilst not particularly quick to do the extract or load, it is very convenient, since the fixtures (yaml files) are part of the deployment.
This was great until we tried it with a large table (approx 100k rows, 5 columns, mostly short pieces of text). The extract took ~1hr and the load barfed, complaining about stack depth being exceeded, or something along those lines.
Some googling gave the following solution, to be set in the production shell:
ulimit -s 32768
… and that seemed enough to allow extract_fixtures to work, albeit still very slowly.
Digging around inside the code to extract and load the fixtures, it is clear that the fixtures approach, as coded, is only really practical for small tables.
- To extract a table’s contents into a yaml file,
- the table is loaded in its entirety into a hash in memory,
- then converted from the hash into a large yaml string in memory,
- then written to a file.
- To populate a table from a yaml file,
- the yaml file is loaded in its entirety into a string in memory,
- then converted from the string into a hash in memory,
- then inserted into the table one row at a time.
By slurping up the entire table into memory each time, it obviously not going to scale well.
Looking inside Fixtures.read_fixture_files, it seems to handle CSV files differently, streaming the data from the file into the table one row at a time.
All that is needed then is something like extract_fixtures_to_csv to stream the tables efficiently into files, and this whole approach should work ok for large tables (much faster, no table size limits, smaller data files):
- CSV::Writer is happy to write one row at a time, so that’s ok.
- That just leaves the use of ActiveRecord::Base.connection.select_all( sql % tablename ) which extracts the data from the table into a big list of hashes. Not sure what is the nicest way to read in the table one row at a time…
this feed
2 Comments
July 12, 2007 at 11:56 pm
I ran into the same issue today…I wrote a quick and dirty rake task that generates a yaml or csv file conditionally (based on the number of rows in the table). My extract has sped up quite a bit. I was hoping the csv files would speed up the load time as well, but it has been 15 minutes and still loading…Any thoughts on how to speed this up?
Rake task: http://pastie.caboo.se/78457
July 13, 2007 at 2:27 pm
hm. Probably two different issues going on here.
Moving to CVS in the extract stage will probably use up less memory, and hence be a bit quicker (less paging going on), even if the entire table is still being slurped into memory.
However, in the load stage, one of the main limiting factors will be the sql inserts, and that doesn’t go away with CSV.