[geeks] faster www cache, better news

Charles Shannon Hendrix shannon at widomaker.com
Fri Sep 8 10:46:47 CDT 2006


OK, I spent some time tuning squid and analyzing it further.

Since we never get summaries on geeks, I'll try to start a trend by
following up on what I found out.

cachemgr.cgi statistics are not trustworthy, or at least they don't give
you information that is useful for tracking effectiveness.

Hit rate is a pair of numbers. One is the hit rate per request, as in
how many requests were hits. The other is the hit rate per megabyte, as
in how much data transferred was from the cache.

It's the first one, the URL hit rate, that matters most to you in
reducing latency, and it is usually going to be the higher number.

The data hit rate can be low, depending on how you use the proxy.  If
you use your proxy for any kind of downloading, your data hit rate will
look awful, because when you download a file you tend to get it once and
never again.

In reality, the majority of the data you get from WWW browsing might be
data hits, but file downloading skews the statistics so it looks a lot
worse than it really is.
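
As a rough illustration of the two numbers, here is a small Python
sketch that computes both from squid access.log lines.  It assumes
squid's native log format (result code in the fourth field, size in
bytes in the fifth) and counts any result code containing "HIT" as a
hit.  The sample lines and the hit_rates name are made up for the
example; they are not real log data.

```python
def hit_rates(log_lines):
    """Return (request_hit_rate, byte_hit_rate) as fractions from 0 to 1."""
    requests = hits = total_bytes = hit_bytes = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        result_code = fields[3].split("/")[0]  # TCP_HIT from TCP_HIT/200
        try:
            size = int(fields[4])
        except ValueError:
            continue  # size field is not a number
        requests += 1
        total_bytes += size
        if "HIT" in result_code:
            hits += 1
            hit_bytes += size
    request_rate = hits / requests if requests else 0.0
    byte_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return request_rate, byte_rate


# Invented sample: four small browsing requests plus one large download.
SAMPLE = [
    "1157730401.100 12 10.0.0.2 TCP_HIT/200 10000 GET http://example.com/a.gif - NONE/- image/gif",
    "1157730402.100 9 10.0.0.2 TCP_MEM_HIT/200 20000 GET http://example.com/b.png - NONE/- image/png",
    "1157730403.100 15 10.0.0.2 TCP_IMS_HIT/304 10000 GET http://example.com/c.jpg - NONE/- image/jpeg",
    "1157730404.100 230 10.0.0.2 TCP_MISS/200 10000 GET http://example.com/ - DIRECT/192.0.2.1 text/html",
    "1157730405.100 9000 10.0.0.2 TCP_MISS/200 950000 GET http://example.com/pkg.tgz - DIRECT/192.0.2.1 application/x-gzip",
]

req, byt = hit_rates(SAMPLE)
print(f"all traffic:   requests {req:.0%}, bytes {byt:.0%}")   # 60%, 4%

req, byt = hit_rates(SAMPLE[:4])  # same traffic without the download
print(f"browsing only: requests {req:.0%}, bytes {byt:.0%}")   # 75%, 80%
```

One large download that is fetched only once is enough to drag the data
hit rate down to a few percent, even though most of the browsing traffic
came out of the cache.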

By default, squid is tuned largely to cache static data, and uses hints
from the remote server quite literally when deciding what to cache and
when to hit rather than fetch.  A lot of remote servers are deliberately
configured to lie to proxy servers, which makes your cache less
effective. 

The good news is that you can fix this problem fairly easily in most
cases.

So, what does all this mean?

For starters, it turns out that my squid proxy has a 63% URL hit rate,
and around a 16% data hit rate.  The data hit rate is low because I use
the proxy for downloading a lot of files, including *BSD updates and
packages, Slackware packages, images, and even movies that show up on
web sites.  Since most of that is downloaded only once, it makes the
hit rate look worse than it is.

I found a program called squij that shows your cache effectiveness.  In
playing with this program I also learned a lot more about how squid
works, and how web servers talk to browsers.  squij reports statistics
on your squid pattern setup, which almost no one ever bothers with.
The default pattern setup for squid is not particularly effective,
especially for dynamic sites.

In your squid.conf file there are lines starting with "refresh_pattern",
where you can control how squid caches various URLs.  You will want to
do this because most servers either lie about the staleness of URLs, or
send hints that are inaccurate for other reasons.

Even on a totally dynamic website, most of the images are still static.
I found a couple of white papers on long term analysis of fresh versus
stale proxy hits, and they found that most servers are far too
conservative in terms of what URLs are still fresh.  You can use the
refresh_pattern lines in squid.conf to overcome this problem, increasing
your hit rates without hurting the web sites.

In fact, you can even have squid override the HTTP protocol for extreme
caching, although that is a protocol violation.  It can be useful for
working around problems with certain file types if they are a major
issue for your users.  For example, if you happen to know that every
PDF file your users download is guaranteed fresh for 24 hours, you can
force squid to cache them for that long.

I reconfigured my squid server to use a set of recommended refresh
patterns for images, PDF, HTML, movies, FTP, etc., and it has made a
difference.  Some really complex websites now load faster, even
newegg.com.

Note: my usage will not be the same as your usage.  I play with KDE a
lot, and I have several programs which fetch newsfeeds, images, and
other WWW information running in the background.  This causes my cache
hit rate to be higher than it might otherwise be because I'm hitting the
same data over and over.

Your hit rate could be even higher, depending on what you visit.

I'll stop for now as I really need more time to know for sure the effect
of all my changes.

Unfortunately, I didn't save the white papers I read, but they were not
hard to find once I got curious enough to look.  Try searching for
papers mentioning refresh_pattern and squid for starters.

Here are my refresh patterns.  I copied most of them from a white paper,
and they change as I play with them, so I'm making sure I post their
(mostly) unmodified test configuration here.

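The format of each line is:

keyword (refresh_pattern)
optional -i flag for a case-insensitive match
regular expression matched against the URL
minimum time in minutes that an object with no explicit expiry is
  considered fresh (1440 is one day)
percentage of the object's age (the time since it was last modified)
  during which it is still considered fresh
maximum time in minutes that an object with no explicit expiry is
  considered fresh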

refresh_pattern     ^ftp:       2880    20%     10080
refresh_pattern     ^gopher:    2880    0%      1440
refresh_pattern -i  \.gif$      1440    500%    262800
refresh_pattern -i  \.png$      1440    500%    262800
refresh_pattern -i  \.jpg$      1440    500%    262800
refresh_pattern -i  \.htm$      40      500%    40320
refresh_pattern -i  \.html$     40      500%    40320
refresh_pattern     \/$         15      25%     20160
refresh_pattern -i  \.exe       2880    1000%   262800
refresh_pattern -i  \.ps        2880    1000%   262800
refresh_pattern -i  \.pdf       2880    1000%   262800
refresh_pattern -i  \.zip       2880    1000%   262800
refresh_pattern -i  \.gz        2880    1000%   262800
refresh_pattern -i  \.mov       2880    1000%   262800
refresh_pattern -i  \.avi       2880    1000%   262800
refresh_pattern -i  \.mpg       2880    1000%   262800
refresh_pattern -i  \.rm        2880    1000%   262800
refresh_pattern -i  \.ram       2880    1000%   262800
# The catch-all goes last: squid uses the first pattern that matches.
refresh_pattern     .           60      40%     4320
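
Squid checks refresh_pattern lines in the order they appear and uses the
first one whose regular expression matches the URL, so ordering matters:
a catch-all "." line placed above more specific lines will shadow them.
Here is a rough Python sketch for checking which line a URL would land
on.  The PATTERNS list is a hypothetical subset written out by hand,
first_match is a made-up helper, and Python's re module only
approximates the regex library squid uses, although it agrees for simple
patterns like these.

```python
import re

# Hypothetical subset of refresh_pattern lines, in config order:
# (case_insensitive, regex, min_minutes, percent, max_minutes)
PATTERNS = [
    (True,  r"\.gif$",  1440, 500, 262800),
    (True,  r"\.html$", 40,   500, 40320),
    (False, r"\/$",     15,   25,  20160),
    (False, r".",       60,   40,  4320),   # catch-all, kept last
]


def first_match(url, patterns):
    """Return the first pattern tuple whose regex matches url, or None."""
    for pattern in patterns:
        case_insensitive, regex = pattern[0], pattern[1]
        flags = re.IGNORECASE if case_insensitive else 0
        if re.search(regex, url, flags):
            return pattern
    return None


for url in ("http://example.com/logo.GIF",
            "http://example.com/",
            "http://example.com/page.php"):
    print(url, "->", first_match(url, PATTERNS)[1])

# With the catch-all moved to the top, every URL matches it first.
shadowed = [PATTERNS[-1]] + PATTERNS[:-1]
print(first_match("http://example.com/logo.GIF", shadowed)[1])
```

Running your own URLs through something like this is a cheap way to
confirm that the specific patterns are actually being reached.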

I bumped up the percentage in the html? patterns a bit from the paper
because it doesn't seem too bad for staleness, but I might lower it to
25%.
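
To get a feel for what the percentage does, here is a simplified Python
sketch of the freshness check for an object that carries no explicit
expiry.  It follows the rule as described in the squid documentation
(too old beyond max is stale; otherwise the percentage of the
last-modified age decides; min only applies when there is no
Last-Modified), but it ignores Expires, Cache-Control, and client
request headers, so treat it as an approximation.  The is_fresh name and
its parameters are invented for the example, and all times are in
minutes.

```python
def is_fresh(age, lm_age, min_age, percent, max_age):
    """Approximate squid's refresh_pattern decision.

    age     -- minutes since squid fetched the object
    lm_age  -- minutes between the object's Last-Modified time and the
               fetch, or None if the server sent no Last-Modified
    min_age, percent, max_age -- the three refresh_pattern numbers
    """
    if age > max_age:
        return False                      # older than max: always stale
    if lm_age is not None and lm_age > 0:
        # Fresh while the object's cache age is below percent of the
        # time it had gone unmodified when it was fetched.
        return age < lm_age * percent / 100.0
    return age <= min_age                 # no Last-Modified: use min


# An HTML page that was last modified 2 hours before squid fetched it,
# checked one hour after the fetch.
print(is_fresh(60, 120, 40, 500, 40320))  # 500%: fresh for 600 minutes
print(is_fresh(60, 120, 40, 25, 40320))   # 25%: fresh for only 30 minutes
```

Under these assumptions, 500% keeps that page for ten hours while 25%
would revalidate it after half an hour, which is the trade-off between
hit rate and staleness being tuned here.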

I recommend you don't just go modifying these setups without reading
the documentation and a couple of papers about it.  If you try to force
things to stay cached for more than 2-3 hours, let alone days, you will
certainly break some websites.  My settings might even be a bit extreme,
but I wanted to see if I got the same benefits as the paper mentioned,
and so far I think I have.

Then again, I've only been testing for a few hours.

I'd be interested in any feedback from you guys if you try tuning this
stuff on large sites or different usage patterns.


-- 
shannon "AT" widomaker.com -- ["An Irishman is never drunk as long as he
can hold onto one blade of grass and not fall off the face of the earth."]


