> While our log analysis, conducted from March 5 through March 8, confirmed that this was a rare issue, it could not rule out the possibility that a session had been incorrectly returned but then never used. This was not a risk we were willing to take, given the potential impact of even one of these incorrectly returned sessions being used.
It is funny how they make it seem like they are super cautious, when in fact they had no other choice but to do it once you rephrase the problem as "some people may have gotten admin rights to the GitHub accounts of some of our customers".
You say that, but resetting all the sessions is very publicly visible and plenty of smaller companies would paper over such things. Honestly I don't really expect any less from a company at GitHub or Microsoft scale, but don't kid yourself into thinking that kind of thing goes un-noticed, un-investigated or un-actioned elsewhere.
To the contrary, I find my session tokens get reset left and right by websites that don’t even care about how much of an inconvenience it is for the customers.
If your concern is tracking, clearing the cookies doesn't help much. The next day you will log in to the same services and websites that you logged into before, and if they share data with 3rd parties they will keep sharing it, with or without cookies.
"Delete cookies and site data when Firefox is closed" has always bothered me; it should be "Delete cookies and site data when Firefox is opened", because abnormal termination is not "closing", and you'd still want the cookies deleted when you start up again.
It is not. But a ton of sites seem to be inconsistent about it, or it feels that way at least. Some sites I frequent don't log me out for months at a time, and then suddenly they do. And then they do it again a few weeks later. Followed by multiple months of not doing it once more.
I use a food delivery service (website only, no app) that doesn't even remember the chosen language setting, and of course, never keeps you logged in past 24hrs. So effectively, I need to log in every time I make an order...
That happens to me even on YouTube. Every couple of months I have to fix the language setting and switch back to dark theme. What can't be fixed are Google's weird attempts at auto-translating video titles from other languages, making me click on videos that I simply cannot understand. Don't do that, Google. Do full title, thumbnail and video translation with 90% accuracy or just...don't. Half-assed tech is easy but not helpful for anyone.
This is my biggest pet peeve about delivery service websites. What are the chances that I've moved house since I last ordered a pizza two days ago? Why do you incessantly ask for my postcode? When I log in, assume I'm the same person at the same address. If I have moved and accidentally order to the old house...well that's on me and I'd have to be pretty stupid (or hungry) to do such a thing.
Because Discord doesn't let you change avatars I have created three different Discord accounts. I get logged out of all three all the time. It's a massive pain.
Maybe part of the sentence is missing and they meant change avatars _per server_? That would be the only reason I can imagine someone might want to have multiple accounts.
Obviously if they are logging all actions of all users, and there's some decent retention period, they can find out how many people got unintended access and what all they did with it. Recently, I got advice from a C-level executive to include such analytics in ASPSecurityKit [0], because that's what companies are looking for these days. This GH incident makes me consider his suggestion more seriously.
This points to the last A in the AAA of security, which stands for Authentication, Authorization & Accounting. The AAA moniker is commonly used in reference to either RADIUS or Diameter (network protocols), but the concept is widely used for software application security as well.
So Accounting covers what resources were accessed, at what time, by whom, and what commands were issued?
Well that's embarrassing. Passing a mutable dictionary—containing callbacks that could be closing over heaven-even-knows-what, no less!—between threads? And it's re-used between requests?
They are lucky this wasn't much, much worse.
Why does the error reporting service need to call the web thread back anyway?
Yet another case where immutable-everything would have prevented this hairball from accreting in the first place.
An IDOR is embarrassing - this was a complex bug and Github's investigation/response was great here. I'd be proud if my company handled a security incident like this.
For the people as confused as me, it stands for "Insecure Direct Object Reference" and not "International Day of Radiology" or "Iowa Department of Revenue".
Unicorn isn't GitHub's code, though. So the object-reuse that was the root cause here wasn't their code, and treating your webserver as a working black box is pretty typical (right up until you hit an edge case like this).
And Unicorn is pretty old. It wouldn't surprise me if this object reuse dates back to mongrel's code base, back when doing anything with threads in Ruby would have been unusual.
It's still bad code on unicorn's part (when I got to that part of the post I said "Unicorn does WHAT?" out loud), and GH should have been more careful introducing threads to an environment that didn't have concurrency before, but every step of the chain is pretty understandable IMHO.
I think this issue is relevant for Python programmers as well. uWSGI for many years shipped a default allocation mode for the WSGI environment that simply cleared it between requests. One had to switch over to `wsgi-env-behaviour=holy` to prevent it. Holy is now the default, but many people still set it to cheap because that prevented some segfaults in uWSGI for a while.
> On March 2, 2021, we received a report via our support team from a user who, while using GitHub.com logged in as their own user, was suddenly authenticated as another user.
Sobering thought for me is how probable it is that a similar issue might be present in other platforms. The title calls it "rare" but given (a) the number of complex-enough web/app platforms in use today, (b) just the endless number of ways/infra permutations for "doing x", and (c) the size of a typical company's codebase and dependencies, could it be a more common occurrence than we think?
Total anecdote, up for you to believe: Back in April 2016, one of my then-housemates bought a brand-new Macbook Air. A week into his ownership, he exclaims surprise that he is logged-in to another Facebook account, someone we are totally not acquainted with. I borrow his MBA and try to investigate but, alas, I lack expertise for it. The only relevant thing I can eke out is that they both went to the same event a week or so ago and probably shared a Wifi AP, where a "leak" could've happened. Or maybe a hash collision in FB's side?
I would've reported it to FB but then I don't have enough details; with the two conjectures I have, I'm basically holding water with a sieve. But now the thought that someone I totally don't know might end up logged in to my FB account is a possibility that bothers me. And I have no idea how to prevent it other than minimizing what could be compromised (and no, "Get off FB!" just isn't in the cards, I'm sorry).
Kudos to GitHub for investigating and fixing this issue. Another thought that occurred to me while writing this is how many users of other platforms might be filing issues like this but can't provide enough detail, and so their reports are eventually marked as "Could Not Reproduce". Not exactly leaky sessions, but just general security mishaps that they would struggle to describe to support.
> Hello everyone! This is forum regular fronzelneekburm, posting from the account of some poor Chinese guy that gog randomly logged me into. I'll explain the situation in further detail in another post from my actual account once I logged out of this one.
> The underlying bug existed on GitHub.com for a cumulative period of less than two weeks at various times between February 8, 2021 and March 5, 2021. Once the root cause was identified and a fix developed, we immediately patched GitHub.com on March 5. A second patch was deployed on March 8 to implement additional measures to further harden our application from this type of bug. There is no indication that other GitHub.com properties or products were affected by this issue, including GitHub Enterprise Server. We believe that this session misrouting occurred in fewer than 0.001% of authenticated sessions on GitHub.com.
If we use the 0.001% figure and assume all 56 million of their users are authenticated, then this occurred with 560 sessions at most.
You are right in that when something only happens to a minuscule % of users it is hard to investigate and repro. But at the same time, something as critical as "I logged in and was authenticated as another user" should probably be immediately escalated to the highest levels within your company.
So I am one of the users who reported this. I quickly filed a support ticket but then I also searched my network to find someone on GH's security team to email. I'm not sure if support or the security person I emailed escalated it first.
I recently reported a phishing site to GitHub which looked exactly like GitHub.com and GitHub responded within a day. So maybe your escalation wouldn't even have been necessary. :)
I believe they are referring to the rarity of hitting this bug during the period of time it was broken, and not the rarity of the bug in general.
People were speculating on this here: https://hackertimes.com/item?id=26395138, and I even left a comment about this type of bug, but on Node. This category of bug, where session data leaks across requests, is actually incredibly common.
> The only relevant thing I can eke out is that they both went to the same event a week or so ago and probably shared a Wifi AP, where a "leak" could've happened. Or maybe a hash collision in FB's side?
Given two users in extremely close geographic proximity, the chances of them having manually logged in on the device themselves with your housemate either forgetting or not knowing are much much higher than the chances of an authentication exploit in the world's most popular site that would have gone undiscovered for a subsequent five years at this point.
Probably somewhere between the two (but closer to the "they just logged in on that computer" side of things) is the chance of the LAN being some eldritch abomination that just occasionally serves responses to the wrong local IPs, but that's not something any online service needs to care about.
I once had the exact same thing happen on PayPal. I logged into my account, but was greeted by the information of another user. I immediately logged out. I never reported it and it never happened again.
Really interesting write-up. Curious how they do logging, and how they were able to get such a detailed look at the previous requests, including the HTTP response. Wouldn’t it be a massive amount of data if all HTTP requests/responses were logged? (Not to mention the security implications of that)
It would indeed be a massive amount of data, but bear in mind that only a small fraction of HTTP requests hitting GitHub are actually both authenticated and attempting to change the state of the system. Most requests are unauthenticated read-only and cause no state changes in GitHub.
It's a reasonable theory, but step one required to trigger this bug is an unauthenticated request. It's unclear whether their logs indicated that, or whether they got lucky during the repro attempt and someone thought to try an unauthenticated request first when back-to-back sessions didn't repro.
We recently fixed a bug with incomplete requests by noticing in nginx logs that the size of the headers + the size of the body was the size of the buffer of a linux socket.
It's pretty common to do exhaustive logging even if the data is enormous. It's valuable for debugging and security investigations. As far as the security of storing log data, you need log messages to not contain PII regardless of how long you retain them.
It's not uncommon to see companies talk about processing terabytes or even petabytes of log data per day. In the end the storage isn't that expensive [1], and you can discard everything after a little while.
[1] Assuming of course that any company generating a petabyte of logs every day is doing so because they have an enormous number of customers, AFAIK the cost of storing 1PB of logs on something like Glacier is in the 100-300k USD/yr range. Absolutely a tolerable expense for the peace of mind you get by knowing you can go back in 2 months and find every single request that touched a security vulnerability or hunt down the path an attacker took to breach your infrastructure.
Purely a guess, but I've sometimes been able to work backwards from the content length (very often included in HTTP server logs) to figure out what the body had to be.
Yes, it is a large amount of data, but it's worth paying to store it because it lets you go back in time and work out wtf happened, which will save you far more hours and solve far more problems.
I guess the incident happened more often than they wanted to tell us. So as a first step they probably turned up the logging and waited for new reports...
Given the rarity of this bug, I doubt a small sample size would have captured the right requests to be able to debug this. The probability would be the probability of this bug occurring * (sample percentage)^3 [you'd need to have sampled the correct 3 consecutive requests].
As expensive as it is, datadog provides the required level of insight via, among other drill-down methods, detailed flamegraphs, to get to the bottom of problems like this one.
No, not affiliated with datadog, and not convinced the cost/benefit ratio is positive for us in the long run.
That provides more information but doesn’t fully answer any of my questions.
Things that are still unclear:
- GitHub provides an estimate of the prevalence at 0.001%, but doesn’t state how many leaked sessions were actually identified.
- Is GitHub confident that they identified all leaked user sessions? This has bearing on my question about notification of users whose sessions were leaked; if they did not identify all leaked sessions, they obviously could not contact those users.
- GitHub states that this bug could not be intentionally triggered by a malicious party (why?), but they don’t state whether any of the natural occurrences may have been used maliciously.
- How long/for how many requests could these sessions have been valid between their leak and when they were definitively revoked on the 8th?
- Is it possible that any private user data was leaked? If so, what data?
People with technical understanding can infer most of this information, but I think GitHub should issue some straightforward answers rather than beating around the bush.
(Correction to my original comment: the bug was live for a cumulative period of “less than 2 weeks”)
> GitHub states that this bug could not be intentionally triggered by a malicious party (why?),
Because this bug requires two users making authenticated requests at nearly the same time. An attacker can't force their target to perform actions, especially not in the same small time frame that is likely required (milliseconds? less?). I suppose if the malicious party didn't care what account they got swapped into, and GitHub didn't rate limit their requests, perhaps they could trigger this bug with enough time?
GitHub says there could have been sessions that were leaked but can't be identified now:
"While our log analysis, conducted from March 5 through March 8, confirmed that this was a rare issue, it could not rule out the possibility that a session had been incorrectly returned but then never used."
Those sessions are now invalidated and were apparently never used.
Brutal - Unicorn didn't use Rack.env in a thread safe way. It's tougher given that this was in a widely used app-server and not even in Rails code.
Impressed with GH folks in finding this, though I wonder how many more of these types of bugs are out there in Ruby libs. It'd be nice to make this more difficult to do, a la FP languages, by making it really intentional to modify state. Maybe in Ruby 4??
I feel like if you dig down deep enough, every single site on the web has a serious security bug in its app code or the layers it runs on top of. How many more bugs are there still in the CPU your server runs on?
> Unicorn didn't use Rack.env in a thread safe way.
Was Unicorn itself not thread safe? It sounds like Unicorn had the requirement that once you send back the response to the user you must stop accessing the request object. Maybe Unicorn didn't document that requirement clearly, but even if it's not documented it's not clear to me that Unicorn itself is not thread safe; rather, it has undocumented pitfalls that can easily lead to thread unsafety.
unicorn is single-threaded by design, and is not expected to support any background threading workarounds that people build on top of it. That "feature" in github shouldn't be passing the env hash.
This is a good example of why anything related with threading makes me nervous when using Ruby. When a language is mutable and thread-unsafe by default, keeping the whole stack sane at all times seems prohibitively arduous. One cannot code-review a whole ecosystem.
I believe that this could be addressed with an architecture that favored single-threaded processes that can handle a single request at a time. Said processes would be numerous, have a (best-effort) low memory footprint, and be pooled under some master process written in a concurrent/performant manner.
The result wouldn't be necessarily pretty or resource-efficient (especially in terms of RAM), but it also can be seen as a consequence of Ruby's requirements for safety.
An org like GH almost certainly can afford to use more/beefier servers if it means that entire classes of bugs will go away.
To be fair to Ruby, most languages are largely mutable and without any strong thread-safety guarantees. Does that make it okay? No, but it’s far from a Ruby-specific problem.
> I believe that this could be addressed with an architecture that favored single-threaded processes that can handle a single request at a time.
Unicorn (the application web server GitHub is using) does follow this model. GitHub added their own additional thread within that process that broke that design in an unforeseen way.
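To make that concrete, here is a minimal Ruby sketch of the hazard (hypothetical method and key names, not GitHub's or Unicorn's actual code): the unsafe version lets a reference to the live env hash escape into a background thread, while the safer version copies out only the values it needs before spawning the thread.

    # UNSAFE: the background thread keeps a reference to the live env hash.
    # If the server recycles that same hash for the next request, the thread
    # now reads another user's request data.
    def check_features_async_unsafe(env)
      Thread.new do
        sleep 0.5                         # deferred/periodic work
        user = env["warden.user"]         # may now belong to a different request
        puts "checked features for #{user}"
      end
    end

    # SAFER: copy out the values you need up front, so no reference to the
    # shared, recycled env object escapes the request's lifetime.
    def check_features_async_safe(env)
      user = env["warden.user"]
      Thread.new do
        sleep 0.5
        puts "checked features for #{user}"
      end
    end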
I just want to make clear that for me, write ups containing this level of detail analyzing the error that occurred give me greater confidence in the company rather than less confidence. Greater even than if I had never heard of the original bug. There are many bugs I have never heard about, but any company that uses this level of care and openness is likely to be very reliable.
I have to say the presentation is exceptional, including very good diagrams of the problem.
But there's a thought in the back of my head wondering what kind of stuff hit the fan, and wondering who had to create the presentation. Was it the person who checked in the broken code, was it the bug fixer or his manager, or was it a pm that got stuck having to write up what happened?
My guess is that someone close to the remediation wrote it (or it was written by the eng team) and it went through 5 levels of editing before publication.
I agree with you in general, but in this particular case they knowingly deployed thread-unsafe code in production while increasing the threaded concurrency. That should have broken internal development rules.
Now imagine what happens behind the scenes at other service providers. I’d bet the ones that have policies that bury the details of incidents, on average, have worse problems.
Yes. Writing code with bugs breaks internal development rules. The issue wasn't thread-unsafe code. It was confusion about object lifecycle. An env object was assumed to no longer be used by the web framework after a request terminated, while in reality it was recycled for future requests as a performance optimization.
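A toy Ruby illustration of that lifecycle confusion (an assumed simplification, not Unicorn's actual code): because the very same hash object is cleared and refilled, any reference held past the end of a request silently starts showing the next request's data.

    # One hash object, recycled for every "request" handled by this worker.
    ENV_HASH = {}

    def handle_request(path, user)
      ENV_HASH.clear                    # reuse instead of allocating a new hash
      ENV_HASH["PATH_INFO"]    = path
      ENV_HASH["current_user"] = user   # hypothetical key, for illustration
      ENV_HASH
    end

    env_a = handle_request("/repos", "alice")
    env_b = handle_request("/settings", "bob")

    puts env_a.equal?(env_b)     # => true: both variables point at the same object
    puts env_a["current_user"]   # => "bob": the "old" reference now shows the
                                 #    next request's data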
Agreed. However, I smirked a bit when I read your comment, since this is ultimately coming from Microsoft. Let's hope some of this mentality seeps into other Microsoft projects
What would be the benefit for Microsoft in that case? They are transforming themselves from a platform company based around Windows into a platform company based around Azure. GitHub is a relevant part of that.
This doesn't mean they are only doing good™ with GitHub, but destroying it and open source vendors would run counter to their interests.
Every bug is a "n00b mistake" when you are looking right at it. There is not a single large project that is not filled with or at least has had trivial bugs with serious repercussions.
> Every bug is a "n00b mistake" when you are looking right at it.
Except the ones that are just the universe conspiring against you, like bit flipping in the SM-1800 because of heavily irradiated cattle being transported on a nearby train line.
Some time ago I dealt with this exact bug for an e-commerce site with millions of requests per day, and it was also Rails + Unicorn.
ANY time ruby 1.8 raised a timeout, it was silently swallowed, and Unicorn returned the adjacent request session_id. We had to constantly defend against any code that might hit the native timeout.
> We initially thought this to be an internal reporting problem only and that we would see some data logged for an otherwise unrelated request from the background thread.
So they already conceded that their code is fundamentally not thread-safe and they decided to let it slide.
Yes, but in fairness, we all make tradeoffs to get more done in less time. It's much easier to justify throwing your top engineers at a race condition for a few days when you know it's placing security at risk.
> One such performance improvement involved moving the logic to check a user’s enabled features into a background thread that refreshed on an interval rather than checking their enablement while the request was being processed
It's never a good idea to perform multithreaded operations within the web server unless you have a strong knowledge of its internal workings. For instance, if you mismanage threads in an IIS web application you can end up having your web requests queued, since the web server by default has a fixed number of worker threads to serve web requests.
From their bug description it's clear that Ruby on Rails is no exception to this rule. In such cases, where background processing is required due to some condition triggered by a web request, it's best to stick with a proper queue system [1]
Another case where using immutable data by default would have prevented any shenanigans.
It also shows the highly entangled relationship between state and complexity.
(in this case, the env object)
Immutable data forces you to design your system in a different way and remove layers of complexity.
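As a small Ruby-flavoured illustration of that point (a sketch only, not a claim that Rack could simply do this): if the per-request data were frozen, both the "recycle the same object" optimization and any late write from another thread would fail loudly instead of silently leaking.

    env = { "PATH_INFO" => "/repos", "current_user" => "alice" }.freeze

    begin
      env.clear                       # the reuse-the-same-hash optimization
    rescue FrozenError => e
      puts "caught: #{e.message}"
    end

    begin
      env["current_user"] = "bob"     # a late write from a background thread
    rescue FrozenError => e
      puts "caught: #{e.message}"
    end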
While in general I agree, in this case that aspect of the problem entirely came down to "don't use object pools that span multiple requests". That's a conscious optimization that was made and then retracted, not a subtlety that accidentally slipped in due to mutability itself.
I guess the point here is that an object pool isn't the default anyway: they've chosen to do something non-default for performance, which you can still do in immutable-by-default languages.
How would that help here? They would just exchange one piece of immutable data (a not-signed-in user's cookie) for another piece of wrong immutable data (someone else's session).
"While the immediate risk has been mitigated, we worked with the maintainer of Unicorn to upstream the change to make new requests allocate their own environment hashes. If Unicorn uses new hashes for each environment, it removes the possibility of one request mistakenly getting a hold of an object that can affect the next request."
> One such performance improvement involved moving the logic to check a user’s enabled features into a background thread that refreshed on an interval rather than checking their enablement while the request was being processed.
This type of thing is why PHP's shared-nothing model is vastly undervalued IMO.
Does PHP have the ability to use threads if you really need it? Yes.
Are they generally a PITA to use? Yes.
Is this a problem for the vast, vast, vast majority of projects? Not in the slightest.
Having a separate invocation of the runtime for each request is a feature, not a failing.
This is some fantastic debugging. I love war stories like this! They give me confidence that the world is full of mere mortals like myself, and we can do great things nonetheless.
I may be missing something here. I do see that the patch[1] now creates a new request object, and thus a new "env" Ruby hash for each request.
But I don't see the old behavior described as "allocates one single Ruby Hash that is then cleared (using Hash#clear) between each request"...even after poking around in http_request.rb and other places. It reads like the pre-patched version would just re-use the hash as-is, only overwriting/adding if it saw a new key or value, but not deleting any keys.
> Because the Rack environment is the same object for all requests
Who in the heck thought this would be a good idea? For the record, back when Desk.com existed, I also saw strange session issues that were probably complex and irreproducible race conditions as well.
Note that in Elixir/Phoenix, this class of bug is literally impossible, thanks to immutability-all-the-way-down. Turing bless it.
At D2L we have a large C# code base which gets deployed in a few different subsets, but is largely a monolithic fleet of web servers.
To prevent these kind of problems we have a few approaches, but the main way is to prevent shared mutable state. To do this we have a custom C# code analyzer (source here: https://github.com/Brightspace/D2L.CodeStyle/tree/master/src... , but it's not documented for external consumption at this point... ImmutableDefinitionChecker is the main place to look.) It goes like this:
    [Immutable]
    public class Foo : Bar {
        // object is a scary field type, but "new object()"
        // is always an immutable value.
        private readonly object m_lock = new object();

        // we'll check that ISomething is [Immutable] here
        private readonly ISomething m_something;

        // Fine because lambdas in initializers can't
        // close over mutable state that we wouldn't
        // otherwise catch.
        private readonly Action m_abc = () => {};

        // see the constructor
        private readonly Func<int, int> m_xyz;

        // danger: not readonly
        private int m_zzz;

        // danger: arrays are always mutable
        private readonly int[] m_array1 = new[]{ 1, 2, 3 };

        // ok
        private readonly ImmutableArray<int> m_array2
            = ImmutableArray.Create( 1, 2, 3 );

        public Foo( ISomething something ) {
            m_something = something;

            // ok: static lambdas can't close over mutable state
            m_xyz = static _ => 3;

            // danger: lambdas can in general close over mutable
            // state.
            int i = 0;
            m_xyz = x => { i += x; return i; };
        }
    }
Here we see that the type has the [Immutable] attribute, so we will check that all the members are readonly and also contain immutable values (by looking at all assignments, which is easy for readonly fields/properties). Additionally, we will check that instances of Bar (our base class) are known to be immutable.
Any class that were to extend Foo (as a base class) will be required to be [Immutable] as well. There's a decent number of aspects to this analysis (e.g. a generic type can be "conditionally immutable" -- ImmutableArray<int> is, but ImmutableArray<object> is not), check the source if you're interested.
We require all static (global) variables be immutable, and any "singleton" type (of which we have tens of thousands if I remember correctly) also must be.
Another important thing we do is to cut off any deferred boot-up code from accessing customer data (via a thread-local variable that is checked before access through the data layer). This prevents accidentally caching something for a particular customer inside, say, a persistent Lazy<T> (or the constructor for a singleton, etc.)
We've adapted our code base to be very strict about this over the last few years and it discovered many obscure thread-safety bugs and doubtlessly prevented many from happening since.
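For readers outside the C# world, a rough Ruby analogue of that boot-time guard idea (all names here are hypothetical; the real implementation described above is a C# analyzer plus runtime checks):

    # Mark the current thread while deferred boot-up / warm-up code runs.
    def with_boot_guard
      Thread.current[:booting] = true
      yield
    ensure
      Thread.current[:booting] = false
    end

    # Every data-layer entry point checks the flag, so customer-specific data
    # can't accidentally get cached into long-lived singletons during boot.
    def load_customer_record(id)
      raise "customer data access during boot" if Thread.current[:booting]
      { id: id }   # stand-in for the real lookup
    end

    begin
      with_boot_guard { load_customer_record(42) }
    rescue RuntimeError => e
      puts "caught: #{e.message}"    # => customer data access during boot
    end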
This is really cool, it seems similar in intent to Rust's "shared or mutable, but not both" philosophy. I hope it spreads to C# more generally somehow.
This post mortem and changes to unicorn should be useful for a lot of other ruby shops. I suspect github aren't the only people who are not defensively copying rack data when sharing it with background threads.
How do you even have logs for something like this? I feel like the log level to see the raw cookie responses would be a no go since you'd easily have hundreds of gigabytes of logs every day.
This is the kind of bug that makes me leery of frameworks and tall application stacks. Having to trust more than one layer of multithreaded code between an application and the kernel gets riskier with every layer.
Also another reason not to mix the main application code with the authentication/authorization code. Setting security cookies should be done by dedicated code in a dedicated process to avoid this kind of interaction, and the load-balancer layer should filter the setting of security cookies from non-security backends to enforce it.
I really like Python, but the difficulties in using the same process for multiple requests really get me (and I think this bug highlights a similar sort of difficulty with Ruby).
There are of course safe ways to do all of this kind of stuff, but there are _so many_ unsafe ways to do things, and accidentally share things across requests. At least in the Django world I think ASGI is going to make it more likely that stuff is done the right way but there's a big learning curve there.
It doesn't seem like they took this security incident very seriously. It's a pretty serious bug that was reported by two different users within a few hours. It could have been abused by others, but it doesn't look like it occurred to them to roll back recent changes whilst they investigated the problem.
There is a lesson here about the way we go about sharing memory between threads that is implicit and fundamentally broken. Adding extra threading in order to improve performance is clearly the wrong tradeoff when you're in an environment where threads can introduce bugs like these. When it takes 10 senior engineers to figure out why one user session gets data from another session you've already lost.
Shared memory environments are very unsafe and we should be working towards a memory model where data that belongs to different organizations can never get mixed. This isn't impossible, but it requires new programming languages that allow us to express which memory areas can't get mixed up.
We've got to start designing our backend infrastructure differently such that this entire class of memory errors can't ever occur.
Replace Rails and similar server-side frameworks with CGI scripts, and you're done. Every request is a new process with its own memory space.
Yes, performance is an issue with that old approach. Around the turn of the century I was responsible for a Perl web application that ran as a CGI script, and to speed it up I wrote my own Perl httpd server that had my web application code pre-loaded. Then for each request I forked a new process, which carried over all of the pre-loaded code and already-running perl.exe, so there was very little startup time (compared to a normal CGI script startup.) I was careful to take full advantage of the operating system's Copy-on-write memory semantics, so that the only memory that had to be allocated for the new process was Perl's runtime stack. All of the memory containing the application code was shared across processes because it never got written to.
This web application is still running today, though I left the company in 2010. Back then, it was handling about six million requests per day, about 75% of which was during US daytime working hours. So, around 160 requests/second, spread across five or six servers that were getting old in 2010.
The modern frameworks have taken the same concept a step further, and replaced the process fork with a thread pool. That's faster, but it allows memory leaking in a way process forking does not. You have to go out of your way to share memory between processes, but with threads it's easy to do accidentally.
In many languages and web frameworks, the assumption is that the application server will do this for you: for example, by issuing a new env object with each request so that requests can't easily share memory with other requests, though as noted that might not stop background threads from retaining such memory.
In other cases, you might assume containers would be enough separation, or maybe different databases (sharding, for example). One approach I've considered but not yet implemented is to take end-user JWT cookies and use them to authenticate to a database such that a user can only ever see the rows they should be allowed access to, and all other SQL queries would fail.
But no matter how you implement a security control, it's possible to make mistakes and allow others to see data you didn't intend to share with them. Even Rust, famous for its explicit memory ownership model, could still have similar bugs, they just might be a bit more obvious to spot in some session connection pool implemented incorrectly, for example, or maybe exposed via a compiler optimization gone wrong. Code review will still likely be one of the primary defenses of this kind of bug, though hopefully better tooling and language support to detect race conditions or shared ownership warnings will help over time?
This happened to a student colleague of mine in I think 2014. We were working on uni projects and when he logged in to check the repository, he was suddenly another user and we were all really surprised.
I'm not sure what difference it makes. Ultimately, all memory gets reused (you don't throw out your 32GB DIMM after you've written 32GB to it), so some tooling is required to manage the reuse. You can read this story and say "I wouldn't reuse a per-request data structure between requests" or especially "if I did, I wouldn't pass it to a background thread", but it doesn't get at the core of the problem -- if your program has multiple threads of execution, you need some way to check that there isn't unexpected sharing. One way is a tool like ThreadSanitizer, run your tests against it, run 1% of your production traffic against it, and you have a higher probability of detecting data races. Another way is to have some sort of structured concurrency (see "goroutines considered harmful" which was on HN just the other day).
Ultimately, I don't think one can ever be happy about "there was a giant security hole in our app, and we only found out because a customer noticed it and reported it to us, it turns out it was in a random library that we didn't even write and so now that we found this everything is fine!" Everything is not fine, and this will just happen again.
Food for thought: if they had used Puma (which is sort of what people have been using since 2017 when it comes to Ruby/Rails), would this bug have still existed?
Huh? I want my bank to be VERY clear on when I transferred money for my home closing (and keep that record). They need to keep records and the browser used, IP used, authentication flow etc. Why does this have to be thrown away?
You're right but the parent's point is that multithreaded code with shared objects is known to be particularly difficult to do perfectly, and hard work to debug. It's so difficult that the rational strategy may be to architect to avoid it. We will have bugs, but not those bugs. Those bugs stink.
This is the kind of multithreading bug that Rust’s borrow checker can prevent: the compiler would have complained that the background task and the main request handler have write access to the env object at the same time, which is not allowed.
Writing your website in PHP would also have prevented it, because its amateur process per request design is actually the best way to do things, scales well, and has the right security properties. Having an app server that can have requests from different users in the same memory space is a mistake.
Oof, my brain. With that giant fiasco master/slave thread the other day, I glanced at “race condition” and “GitHub” and immediately thought this was related to that.
It doesn't. The issue here seems like a more vanilla logical data race + object reuse that Rust can't really do anything about. Rust only makes sure that only one thread at a time writes and other reads don't race that write. It can't guarantee that the relative ordering of reads and writes makes sense for your use-case.
It can simplify certain abstractions of course so that you can provide easier tools for devs to avoid footguns.
What rust would do for this particular situation is make it painful to share an object like this across threads and make it obvious that such a share is happening. That wouldn't prevent this problem, but it would make it harder to stumble into.
Maybe, maybe not. A claim like that is definitely in the realm of "people think race condition bugs are solved by switching to Rust". I don't know Rust well enough to evaluate your claim but that feels like an incorrect statement - sharing data between threads should be as easy as wrapping something in Arc & Mutex to get similar semantics I think.
Even if that particular bug is actually harder to stumble into, there's no guarantee the space of thread synchronization issues one might stumble into is inherently smaller for Rust than for Ruby, so all you're doing is exchanging one class of failure modes for another.
As I said, they aren't solved but they are hard to do unknowingly.
In most other languages, sharing mutable data across threads is unprotected. You are lucky if you get a compiler warning. Rust disallows that completely. You cannot share mutable data across threads without first wrapping that data in thread safety constructs.
That's what happened here. A mutable variable was shared between threads and updated from different threads. Rust makes it obvious that this variable is accessed from multiple threads even if it doesn't prevent 2 threads from writing to the variable at wrong times. That's where it is a partial solution to the problem.
Sharing is as easy as wrapping the thing in a Mutex, and if you are a programmer that sees something wrapped in a mutex you should immediately start thinking of the thread implications. You don't have to, but you should.
That mutex follows the variable around anywhere you want to start changing it.
In a language like C or Ruby, that's not something that is patently obvious. Shared variables aren't wrapped with anything. It's perfectly understandable for someone to not realize this bit of shared state is mutated across threads.
I wonder if Microsoft has long-term plans to move Github off of things like Unicorn/Rails/Erlang into a technology stack that would be more familiar to them.
I'm not judging the choices of technology, but I'm thinking it was easier for them to buy GitHub outright than to try to compete with it. I assume GitHub still operates independently, with engineers who are familiar with this stack.
Agreed, the comment was short sighted. Was hesitant to even post it considering many involved are likely here on HN.
I wasn't sure how independent Github is from Microsoft after the acquisition. My thoughts were initially in moving engineers around internally from within Microsoft to work on Github.
Tracking down multi-threaded race conditions in a Rails app under Unicorn/Rack takes a lot of experience with specific tooling.
I love how everyone here on HN shits on Gab's mishap just because they happen to host a lot of nonsense alt-right garbage and their CEO is a weirdo, but everyone's totally cool with GitHub knowingly having inconsistencies across threads:
> We initially thought this to be an internal reporting problem only and that we would see some data logged for an otherwise unrelated request from the background thread. Though inconsistent, we considered this safe since each request has its own request data and Rails creates a new controller object instance for each request.
I mean this is like an SNL skit: "though inconsistent, we considered this safe" -- inconsistency, no matter where it's happening, literally means your code isn't safe. Like, jeez... this is GITHUB we're talking about. I mean, I give money to GitHub -- this is pretty serious and we should really stop with the "omg I love these detailed post-mortems!" stuff. A billion-dollar company knowingly having code that's not thread safe in their flagship product and going "eh, it's probably fine" is bonkers to me.
I think there's a pretty big difference between an engineer writing thread-unsafe code, testing and seeing inconsistencies that (incorrectly) appear to not affect any external users, shipping it, and then the security team finding the problem and developing new processes to prevent this in the future — vs the CTO of Gab writing raw SQL using string concatenation, shipping it without code review, and then trying to cover it up afterwards.
No software maintained by a large team will be bug-free forever, and that includes security bugs. The response to bugs is what matters. In this case Github's response was quite mature; in Gab's case, it wasn't.
The billions don't automatically scale development. You know that. Development is still done at the human level. But you still might be right that there is no excuse.
Since this was essentially a potential organizational administrator account takeover security vulnerability, I can see why this got a write up. And as others have said, it's impressive.
Github Actions has had many problems that affect many people this year alone. While the missed deadlines and stress for people can't be quantified by Github, they are happening.
I'd urge Github to take ongoing service interruptions in Github Actions more seriously. I hope Github will take a stronger stance on providing customer communication on migration timelines between Packages and Container Registry.
Just because a bug or service interruption doesn't risk random user session exposure doesn't mean it isn't important to communicate about and describe how the business is working to prevent it from happening again.
> We believe that transparency is key in earning and keeping the trust of our users and want to share more about this bug.
I find it ironic that the largest repository of open source code is closed source and owned by the largest closed source software company on the planet, yet they somehow still "believe in transparency."
That's like McDonald's "transparently" recalling burgers because of a chemical contamination, and then when you ask what else is in the burgers they tell you it's a secret.
Strategic issues aside, open sourcing a codebase that is over a decade old is extremely hard. You need to be absolutely sure there's nothing in the code or commit history that's not suitable for release.
Just one example: what if someone pasted an error log message into a commit message that included a piece of PII?
Fortunately Microsoft, like McDonalds, can use their billions of dollars in the bank to produce high quality graphics to 'explain the situation' - and when people visit the site, they will be Delighted and Impressed with these high quality graphics - instantly remedying the situation. 'Look how nice we explained it - look at the cute cartoon characters solving a puzzle in the header at the top of the page, see how slick we are?'