December 10, 2007

:has_many, :belongs_to, :through, and :include

Hi,

I’m sure the following has been discussed hundreds of times, and there is probably an obvious answer; either that or it’s all been fixed in Rails 2.0 …

We have a simple relationship between three models, written in shorthand as follows:

Venue :has_many Gig :has_many Image

and

Image :belongs_to Gig :belongs_to Venue.

It is easy to answer the question, “What are all the images for all the gigs associated with a venue?”.

Answer: specify “has_many :images, :through => :gigs” in the Venue model (shown below). This gives you the venue.images method, which will generate a single sql statement to do the job.

We found it harder to answer this question, “Given a list of image ids, what are all the associated venues?”, since :belongs_to does not provide the :through option. We couldn’t see how to get Rails to generate a single sql statement to answer this, although we could write our own by hand.

After some experimentation, we can get it down to two Rails-generated sql statements to answer the question. (But surely it’s doable in one?)

# given a list of image_ids, get a list of gig_ids, then a list of venue_ids.

gig_ids = Image.find(image_ids, :include => :gig ).map {|i| i.gig_id}.uniq
venue_ids = Gig.find(gig_ids, :include => :venue).map {|g| g.venue.id}.uniq

The use of :include means the :belongs_to can pre-emptively pull in the next model in the relationship as part of the sql generated by find. Incidentally, we did not realise until now that you could pass a list of ids to find and it will incorporate it into the one sql statement.
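For comparison, here is a rough sketch of the hand-written, single-statement version mentioned above. It only builds the SQL string; the table and column names assume the standard Rails conventions for the models shown in this post, and venues_for_images_sql is a made-up helper name, not part of Rails.

```ruby
# Build one SQL statement that walks images -> gigs -> venues.
# Hypothetical helper; table and column names assume Rails naming conventions.
def venues_for_images_sql(image_ids)
  # Coerce every id to an Integer so nothing unexpected ends up in the SQL.
  ids = Array(image_ids).map { |id| Integer(id) }.uniq
  raise ArgumentError, 'image_ids must not be empty' if ids.empty?

  <<-SQL.gsub(/\s+/, ' ').strip
    SELECT DISTINCT venues.*
    FROM venues
    INNER JOIN gigs ON gigs.venue_id = venues.id
    INNER JOIN images ON images.gig_id = gigs.id
    WHERE images.id IN (#{ids.join(', ')})
  SQL
end

puts venues_for_images_sql([1, 2, 3])
```

Inside the app, the resulting string could be handed to Venue.find_by_sql to get Venue instances back from the one query.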

For the record, here are the models, where each model has a :string column called :name, and the relevant *_id foreign key:

class Image < ActiveRecord::Base
  belongs_to :gig

  def self.venues( image_ids )
    gig_ids = Image.find( image_ids, :include => :gig ).map {|i| i.gig_id}.uniq
    Gig.find(gig_ids, :include => :venue).map {|g| g.venue.id}.uniq
  end

end

class Gig < ActiveRecord::Base
  belongs_to :venue
  has_many :images, :dependent => :destroy
end

class Venue < ActiveRecord::Base
  has_many :gigs, :dependent => :destroy
  has_many :images, :through => :gigs
end

And we quite liked this bit of code in a migration to create some initial dummy data to experiment with. Each venue has 3 gigs, and each gig has 3 images. By using << to hook up all the model instances to each other, e.g. adding a gig to a venue and adding an image to a gig, a single venue.save causes that venue and all its gig instances and all their image instances to be saved as well.

class CreateInitialHierarchy < ActiveRecord::Migration

  # Number of venues, gigs per venue, and images per gig.
  # A constant, so that both up and down can see it.
  NUM = 3

  def self.up
    (1..NUM).each do |v|
      venue = Venue.new( :name => "venue #{v}" )

      (1..NUM).each do |g|
        gig = Gig.new( :name => "gig #{g} for venue #{v}" )
        venue.gigs << gig

        (1..NUM).each do |i|
          image = Image.new( :name => "image #{i} for gig #{g} for venue #{v}" )
          gig.images << image
        end
      end

      # Saving the venue also saves its new gigs and their new images.
      venue.save
    end
  end

  def self.down
    (1..NUM).each do |v|
      venue = Venue.find_by_name( "venue #{v}" )
      venue.destroy unless venue.nil?
    end
  end

end

July 12, 2007

using db:fixtures:load and extract_fixtures on large tables

We’ve been using the rake task, extract_fixtures (from Rails Recipe #42), as a convenient way of lifting data generated in development up into production (where we invoke the rake task, db:fixtures:load, to populate the db tables). Whilst not particularly quick to do the extract or load, it is very convenient, since the fixtures (yaml files) are part of the deployment.

This was great until we tried it with a large table (approx 100k rows, 5 columns, mostly short pieces of text). The extract took ~1hr and the load barfed, complaining about stack depth being exceeded, or something along those lines.

Some googling gave the following solution, to be set in the production shell:

ulimit -s 32768

… and that seemed enough to allow extract_fixtures to work, albeit still very slowly.

Digging around inside the code to extract and load the fixtures, it is clear that the fixtures approach, as coded, is only really practical for small tables.

  • To extract a table’s contents into a yaml file,
    • the table is loaded in its entirety into a hash in memory,
    • then converted from the hash into a large yaml string in memory,
    • then written to a file.
  • To populate a table from a yaml file,
    • the yaml file is loaded in its entirety into a string in memory,
    • then converted from the string into a hash in memory,
    • then inserted into the table one row at a time.

By slurping up the entire table into memory each time, it is obviously not going to scale well.

Looking inside Fixtures.read_fixture_files, it seems to handle CSV files differently, streaming the data from the file into the table one row at a time.

All that is needed then is something like extract_fixtures_to_csv to stream the tables efficiently into files, and this whole approach should work ok for large tables (much faster, no table size limits, smaller data files):

  • CSV::Writer is happy to write one row at a time, so that’s ok.
  • That just leaves the use of ActiveRecord::Base.connection.select_all( sql % tablename ) which extracts the data from the table into a big list of hashes. Not sure what is the nicest way to read in the table one row at a time…
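One possible answer to the last point, offered as a sketch rather than something we have run against the real table: fetch the rows in fixed-size batches and write each batch out before fetching the next. The code below is plain Ruby using the current CSV library (not the older CSV::Writer), and extract_table_to_csv is a made-up name. The block is where a select_all call with LIMIT and OFFSET (ordered by primary key) would go; here an in-memory array stands in for the table.

```ruby
require 'csv'
require 'stringio'

# Write a table to CSV in fixed-size batches, so that only one batch of
# rows is held in memory at a time.
#
# io         - any writable IO (a File in real use)
# columns    - column names, written first as the header row
# batch_size - number of rows requested per fetch
# The block receives (limit, offset) and must return an array of row hashes.
def extract_table_to_csv(io, columns, batch_size = 1000)
  csv = CSV.new(io)
  csv << columns
  offset = 0
  written = 0
  loop do
    rows = yield(batch_size, offset)
    break if rows.empty?
    rows.each { |row| csv << columns.map { |column| row[column] } }
    written += rows.size
    offset += batch_size
  end
  written
end

# Stand-in for a database table, used only to exercise the sketch.
FAKE_TABLE = (1..5).map { |i| { 'id' => i, 'name' => "row #{i}" } }

out = StringIO.new
count = extract_table_to_csv(out, %w[id name], 2) do |limit, offset|
  FAKE_TABLE[offset, limit] || []
end
puts count
puts out.string
```

OFFSET-based paging gets slower the deeper it goes on some databases, so paging on "id greater than the last id seen" may well be the nicer variant for a 100k-row table.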

July 12, 2007

streaming with send_file, or redirect_to and apache

We recently had cause to generate and fling large pdf files from our rails app. In development, we used send_file to stream the pdf file out via mongrel, but it seemed sensible to redirect_to apache for the heavy lifting in production, freeing up mongrel and rails for the delicate stuff.

So, at the end of the action, with the pdf file already generated and sitting in pdf_dir = 'public/static/':

if ENV['RAILS_ENV'].eql?('production')
  # Hand the heavy lifting to apache.
  pdf_url = @request.protocol + [
    @request.host_with_port,
    'pdfing',
    'static',
    pdf_filename].join('/')

  redirect_to( pdf_url )
else
  # The first argument is the path on disk;
  # :filename is the name suggested to the browser.
  send_file( pdf_dir + pdf_filename,
    :type => 'application/pdf',
    :disposition => 'inline',
    :filename => pdf_filename )
end

There seems to be no nice railsy way of generating the absolute url other than using the @request object.

Presumably we could extract the project name from the request url rather than hardcoding it as ‘pdfing’ above.
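A sketch of what that extraction might look like, in plain Ruby with the standard URI library. Both method names are made up, and the whole thing rests on an assumption that holds for our setup but not in general: that the first path segment of the request url is the project name. If the app were mounted at the root, the first segment would be the controller name instead.

```ruby
require 'uri'

# Return the first path segment of a url, e.g. 'pdfing' for
# http://host/pdfing/reports/show/1, or nil if the path is empty.
# Assumption: the app is mounted under its project name.
def project_prefix(request_url)
  URI.parse(request_url).path.split('/').reject { |segment| segment.empty? }.first
end

# Build the absolute url of a file under <project>/static.
def static_pdf_url(request_url, pdf_filename)
  uri = URI.parse(request_url)
  # Leave the port out when it is the default for the scheme.
  host_with_port = uri.port == uri.default_port ? uri.host : "#{uri.host}:#{uri.port}"
  path = [project_prefix(request_url), 'static', pdf_filename].compact.join('/')
  "#{uri.scheme}://#{host_with_port}/#{path}"
end

puts static_pdf_url('http://example.com:3000/pdfing/reports/show/1', 'report.pdf')
```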

Our apache is configured to serve the files in pdfing/static directly.

July 12, 2007

changing environment variables followed by mongrel_rails restart

When making a change to environment variables such as PATH and JAVA_HOME, don’t use “mongrel_rails restart” to bounce the rails app to pick up the changed values. You could find yourself wasting much of the morning futilely chasing obscure shell/mongrel/capistrano issues, desperately seeking any kind of hint as to why the rails app was refusing to pick up the changes.

No, don’t do that at all.

Much better to do “mongrel_rails stop” and “mongrel_rails start + assorted params” right away and find it all works perfectly well.

May 1, 2007

Rails’ find :include, Ruby’s Marshal, and ArgumentError: undefined class/module

Fresh from the perils of not knowing that a place belongs_to a country, we generated a data structure that consisted of places, other places, and countries. This took a while to compute so, naturally, to speed things up during development iterations we wanted to cache it. Equally naturally, we didn’t want to make use of something as classy and as lauded (and already written) as acts_as_cached. No, we rolled our own.

And got an error:

ArgumentError: undefined class/module Country

It comes down to the following.

If you grab some nested models,

places = Place.find( :all, :include => [:country])

and then store the data in a file,

File.open( stored_filename, 'w+' ) do |f|
  Marshal.dump( places, f )
end

and then, having restarted the process, retrieved the cached data,

f = File.new( stored_filename, 'r' )
places = Marshal.load( f )
f.close()

you get “ArgumentError: undefined class/module Country”.

It seems Ruby won’t deserialise (aka unmarshal) (aka Marshal.load) a class it has not loaded yet. The Place class was ok, because all this was happening inside place.rb. But instances of the Country class which come from the initial Place.find, using :include to preload the child rows, are stored along with the Place instances. When the Marshal.load happens in a new process, there has been no Place.find, and no auto-loading of the Country class, and the error is thrown because Ruby barfs on the presence of instances of an undefined class.
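The failure can be reproduced outside Rails in a few lines of plain Ruby. This is only a sketch: the Country class here is a stand-in for the real model, and removing the constant simulates a fresh process in which nothing has triggered the loading of that class yet.

```ruby
class Country
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

dumped = Marshal.dump([Country.new('France')])

# Simulate a fresh process in which nothing has referenced Country yet.
Object.send(:remove_const, :Country)

load_error = begin
  Marshal.load(dumped)
  nil
rescue ArgumentError => e
  e.message
end
puts load_error

# Once the class is defined again (in Rails, once it has been loaded),
# the very same bytes load without complaint.
class Country
  attr_reader :name
end

countries = Marshal.load(dumped)
puts countries.first.name
```

So the data in the file is fine; what matters is that every class it mentions is defined before Marshal.load runs.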

The quick and unsatisfactory fix for now was to prefix the caching code with a dummy reference to the Country class, e.g.

Country.class

f = File.new( stored_filename, 'r' )
places = Marshal.load( f )
f.close()

Presumably there is a better way.

In terms of fixing the ‘bug’ in Ruby, perhaps it is not easy for Marshal to establish where not-yet-loaded classes are defined.

May 1, 2007

in Rails, a place belongs_to a country, and don’t you forget it

This time, some wrassling with relationships between models. Foolishly rejecting the easy option of doing everything the rails way, we hoped to be able to shoehorn a yucky data structure into rails.

We had a table called places, where each place had a two-letter country_code. We wanted to map this to full country names, the data for which was mixed up in the same table. Did say it was yucky.

Using migrations, it was simple to extract the name data, and insert it into a newly created table. In particular,

class LoadCountriesData < ActiveRecord::Migration
  def self.up

    # Probably not the kosher way of doing things, but seems simplest for now.
    # Can tidy up later, take a snapshot of entire db schema and contents.

    execute "insert into countries (code, name) select country_code, name from places where source='ISO'"
  end

  def self.down
    Country.delete_all
  end
end

Then, how to link the countries and places tables? Immediately assumed a place has_one country, but no, didn’t seem to work. Tried a variety of combinations: belongs_to, foreign_key, association_foreign_key, etc. Ignorance just wasn’t bliss. Googling revealed a vibrant chord of newbie confusion out there (and in here), and finally Duane’s Brain helped begin to clear things up.

In actual fact, a place belongs_to :country, in app/models/place.rb. Yes. This means the countries table can sit there knowing nothing but country things, not caring about places, whilst the places table knows about countries, and holds the foreign key that indexes into the countries table. A mnemonic for this might be something like “X holds a foreign_key which belongs_to Y”.

Still not working though. How to tell rails we wanted to hook belongs_to from places.country_code to countries.code, rather than the default rails approach from places.country_id to countries.id? Possibly the first part can be handled by using the :foreign_key option when specifying belongs_to in place.rb, but to be more railsy we created a new field, places.country_id, and copied the value from places.country_code (using migrations of course). The second part was a bit harder to work out. The auto-generated id field was useless to us since the only references we were using were two-letter country codes. Perhaps we could have done some shenanigans by looking up the id of each country_code in the countries table and writing that back into the places.country_id field. Turns out this was easier for now:

class Country < ActiveRecord::Base
  self.primary_key = 'code'
end
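To make the direction of the relationship concrete, here is a toy model in plain Ruby, with no Rails involved. The structs and method names are made up for illustration: the place holds the key (so the place belongs_to the country), and the country side is found by searching places for that key, which is roughly what has_many would do.

```ruby
# Toy stand-ins for the two tables; not ActiveRecord models.
Country = Struct.new(:code, :name)
Place   = Struct.new(:name, :country_code)

countries = [Country.new('FR', 'France'), Country.new('CH', 'Switzerland')]
places    = [Place.new('Chamonix', 'FR'), Place.new('Verbier', 'CH'), Place.new('Tignes', 'FR')]

# Index the countries by their code, i.e. treat code as the primary key.
countries_by_code = {}
countries.each { |country| countries_by_code[country.code] = country }

# "belongs_to" direction: the place holds the key, so one lookup finds its country.
def country_for(place, countries_by_code)
  countries_by_code[place.country_code]
end

# "has_many" direction: the country holds nothing, so we search the places.
def places_in(country, places)
  places.select { |place| place.country_code == country.code }
end

puts country_for(places.first, countries_by_code).name
puts places_in(countries.first, places).map { |place| place.name }.join(', ')
```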

So, there you have it, a morning wasted not doing things the rails way. Migrations are very nice, though.

March 30, 2007

Rails: xml page caching with a different page_cache_directory

So, there were some generated xml pages that needed caching. After some assorted reading, the basic caches_page approach seemed sufficient.

First step was to tweak config/routes.rb to bring the url params into the path to make the url cacheable. Something like:

map.connect "feed/resorts_and_holidays/:country_name/:how_recent", :controller => 'feed', :action => 'resorts_and_holidays'

Then add the following to app/controllers/feed_controller.rb, to cache the output of the resorts_and_holidays page:

caches_page :resorts_and_holidays

Oh yes, and the following added to config/environments/development.rb to ensure that caching takes place in development:

config.action_controller.perform_caching = true

And it worked! Hurrah! Stuff was duly being cached into public/feed/…

However, on closer inspection, all was not well. All the cache filenames ended with .html, which was a bad thing since they contained xml. Argh but, luckily, better minds than mine had reached the same argh. There was lots of discussion and several alternatives, but it seemed right to plump for altering the map.connect config to explicitly specify a .xml extension:

map.connect "feed/:country_name/:how_recent/resorts_and_holidays.xml", …

And this did indeed work, although now you could not have default values for the url params: country_name and how_recent must be set explicitly, e.g. feed/all/3/resorts_and_holidays.xml. A small price to pay. Hurrah once again.
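Why the route change works, as far as we can tell: page caching names the cache file after the request path, and only tacks on .html when the last path segment has no extension of its own. The sketch below is a simplified plain-Ruby model of that rule, not the actual Rails source, and both method names are ours.

```ruby
# Simplified model of how a request path becomes a cache file name.
# An assumption based on observed behaviour, not a copy of the Rails code.
def page_cache_file(path, extension = '.html')
  name = (path.empty? || path == '/') ? '/index' : path
  # Only add the default extension if the last segment has none.
  name += extension unless File.basename(name).include?('.')
  name
end

def page_cache_path(cache_dir, path)
  File.join(cache_dir, page_cache_file(path))
end

puts page_cache_path('public', '/feed/resorts_and_holidays/all/3')
puts page_cache_path('public', '/feed/all/3/resorts_and_holidays.xml')
```

With the first route the file ends up with .html appended; with the second, the .xml in the url is already an extension, so the name is left alone.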

Flushed with success, that cache now needed expiring. The rails-caching-tutorial describes sweeping, and sure enough it does work. Trying the direct approach, there was lots of faffing needed to plonk expire_page into all the relevant places in the models, taking care to ensure the right cache entries were swept for particular calls. It became clear that being precise was not a quick route to happiness.

After further reading of the caching tutorial, and an article about lazy sweeping, blatting the entire cache every time new data was written was the way to go. This would not be terribly efficient if you had lots of little writes that do only affect a few cache entries each time, but with a daily update that more or less rewrites everything, a full sweep was in fact likely to be quicker than a little sweep per model update.

Following the instructions to sweep lazily, the first step was to move the cache directory (presumably to make the recursive file deletion easier/safer), by tweaking config/environment.rb:

config.action_controller.page_cache_directory = RAILS_ROOT + '/public/cache/'

Then define a new sweeper class (by copying the lazy sweeping code) in a new directory, app/sweepers/site_sweeper.rb, not forgetting to add the following to config/environment.rb:

config.load_paths += %W( #{RAILS_ROOT}/app/sweepers )

To hook the sweeper into the controller, add the following into app/controllers/application.rb, in the parent class of the controller to be cached:

cache_sweeper :site_sweeper, :only => [ :new, :create, :edit, :update, :destroy, :do_upload_from_file ]

specifying the actions which are to trigger the lazy sweeping.

Another opportunity for a hurrah perhaps? No, something was wrong. The cache files were being written with the correct .xml extension, swept, but otherwise bypassed or ignored by the controller when handling a request. Every response was still being generated in full rather than being served from cache.

It turned out, after much delving and hair pulling and googling, that the problem was with the change to config.action_controller.page_cache_directory. Rails cannot properly handle this being changed.

One solution was to rely on apache rewrite rules so that apache does the checking of the cache before passing the request on to rails, and this is the way to go anyway in production (and is probably why the various blogs which had suggested changing the cache dir had not noticed or mentioned the rails bug). A non-apache option was to leave that page_cache_directory param on its default setting, and amend the site_sweeper to sweep the directory public/feed (where feed was the controller being cached). Thus, the caches_page approach was finally working.
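The file-deleting part of such a sweeper boils down to something like the following plain-Ruby sketch. It is not the lazy sweeping code we copied; sweep_page_cache is a made-up name, and the check that the target sits inside the cache root is our own precaution, given that the method ends in a recursive delete.

```ruby
require 'fileutils'
require 'tmpdir'

# Delete every cached page for one controller, e.g. public/feed.
# Returns the number of cached files that were removed.
def sweep_page_cache(cache_root, controller_name)
  root   = File.expand_path(cache_root)
  target = File.expand_path(controller_name, root)
  # Refuse to delete anything that is not strictly inside the cache root.
  unless target.start_with?(root + File::SEPARATOR)
    raise ArgumentError, "#{target} is not inside #{root}"
  end
  removed = Dir.glob(File.join(target, '**', '*')).select { |f| File.file?(f) }.size
  FileUtils.rm_rf(target)
  removed
end

# Exercise the sketch against a throwaway directory standing in for public/.
Dir.mktmpdir do |public_dir|
  cached_dir = File.join(public_dir, 'feed', 'all', '3')
  FileUtils.mkdir_p(cached_dir)
  File.write(File.join(cached_dir, 'resorts_and_holidays.xml'), '<feed/>')
  puts sweep_page_cache(public_dir, 'feed')
  puts File.exist?(File.join(public_dir, 'feed'))
end
```

In the real sweeper the first argument would come from the configured page cache directory and the call would sit in the after-save callbacks.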

There were two further, minor tweaks, paying homage to the spirit of DRY; a last few polishes to the gleaming wonder of cached xml pages:

  • refer to the cache root dir in the sweeper class using ActionController::Base.page_cache_directory, rather than hard-coding it.
  • use a named route when constructing urls. So, instead of map.connect “…” in config/routes.rb, you can have map.feed_route “…”, and construct the url for that route in the controller by invoking feed_route_url(…) instead of using url_for(…).

Looking back on the coding needed to get this working, it really amounts to a few lines here and there, and could have been done by One Who Knows in about 20 mins. Instead, it took the best part of two days. Hopefully this write-up will be of some use to Others Who Come After.

March 22, 2007

Hello world!

Hi, another blog. Sorry.

Still, at least this one is intended to be useful to anyone thinking of giving Rails a try. We recently started from scratch, not knowing Ruby or Rails. We chose Rails as a framework for prototyping our dabblings, partly to try something new, partly to try out a ‘happening’ technology, partly to make our CVs more buzzword compliant. It’s been interesting/frustrating so far, and would have been even more so without direct input from the calm and wise Matt Biddulph, and lots of googling for blog entries explaining how to deal with each issue as it arose.

The available documentation is sort of ok, ish, maybe, as long as you know what questions to ask, but does not adequately help unveil the ropes and pulleys behind the magic of it-just-(doesn’t quite always)-work(s) Rails. To be honest, the documentation has been our biggest whinge, mitigated to a large extent by the many helpful Rails blogs out there. You can get down and dirty with the Ruby code underlying Rails, and the layers of Rails configurations, which is a good thing, but only if you understand what you are looking at, which mostly we didn’t.

This blog will act for a while as a jotter for assorted difficulties wot we have succumbed to, googled after, and possibly even recovered from. Nothing earthshaking, but if we can spare Those Who Come After a few hours of fruitless mongrel_rails restarts, we can consider our karmic debt decreased a notch.