Tuesday, July 17, 2012

Asymmetric Keying -- after implementation

I wrote a straw proposal for how to use asymmetric keying, and while the idea is a little frightening from the localStorage perspective, I still don't see it as a deal breaker. At least for now. And at least for the types of sites that I'm interested in thinking about, which are Phresheez-like sites. That is, sites where people would ordinarily use low to medium value passwords. Since then, I've shopped the idea around and aside from the expected asshats who dismiss anything unless you're part of their tribe, I've actually been encouraged that this isn't entirely crazy. So I went ahead and implemented it about 3 weeks ago over a weekend.

The main difficulty was frankly getting a crypto library together that had all of the needed pieces. The quality of the crypto library is certainly not above reproach, but that's fundamentally a debugging problem -- all of the Bignum, ASN.1 and other bits and pieces are well specified and while you might think that writing them in javascript is bizarre, it's just another language at the end of the day. Random number generation -- a constant problem in crypto -- is still a problem. But if this proves popular enough, there's nothing stopping browser vendors from exposing things like various openssl library functions up through a crypto object in javascript too, so the PRNG can be seeded with /dev/urandom, say. So I'm just going to ignore those objections for now because they have straightforward longer term solutions.

In any case, there seems to be a library that lots of folks are using for bignum support from a guy at Stanford. There's also a javascript RSA project on sourceforge, though it lacks support for signing. I found another package, from Kenji Urushima, that extends the RSA package to do signatures. His library was missing a few bits and pieces, but I finally managed to cobble them all together. So I now had something to sign a blob, as I outlined in the strawman post, something to do keygen, and something to extract a naked public key. All I needed to do then was sit down and write it.

Signed Login

It was surprisingly easy. I decided that the canonicalization that I'd use is just to sign the URL itself in its encoded format. That is, after you run all the parameters through encodeURIComponent (). I extended the library's RSAKey object to add a new method signURL:


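RSAKey.prototype.signURL = function (url) {
    // Append the public key and a timestamp so that both are covered by the signature.
    url += '&pubkey='+encodeURIComponent (this.nakedPEM ());
    url += '&curtime='+new Date ().getTime ();
    // Sign everything built so far; the signature itself goes last.
    url += '&signature=' + encodeURIComponent (hex2b64 (this.signString(url, 'sha1')));
    return url;
};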

All it does is append a standard set of url parameters to the query. It doesn't matter if it's a GET or POST, so long as it's application/x-www-form-urlencoded if it's a POST. The actual names of the url parameters do not need to be standardized: they just need to be agreed upon between the client javascript and the server, of which a website controls both. Likewise, there need be no standardized location: I just use:

http://phresheez.com/site/login.php?mode=ajax&uname=Mike&[add the standard sig params here]

My current implementation requires signature to be the last parameter, but that's only because I was lazy and didn't feel like making it position independent. That's just an implementation detail though, and should probably be fixed. I should note that nakedPEM returns a base64 encoding of the ASN.1 DER encoding of the public key exponent and modulus, but minus the ----begin/end public key---- stuff and stripped of linebreaks.

On the server side, I'm just using the standard off the shelf version of openssl functions in PHP to do the openssl_verify after doing an openssl_get_publickey with the supplied public key in the url itself. Note I've signed the pubkey and the curtime. In my implementation, I'm only signing the script portion of the full url, and not the scheme/host/port. That's mostly an artifact of PHP (what isn't?) not having the full url available in a $_SERVER variable. This is most likely wrong, but I'm mostly after proof of concept here, not the last word on cryptography -- if this is an overall sound idea, they'll extract their pound of flesh.
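A rough sketch of the sign-then-verify round trip, written against Node's built-in crypto module rather than the browser library and PHP used above. Everything here is an assumption made for illustration: the function names, the use of sha256 instead of sha1, and the guess that the naked public key is a base64 SubjectPublicKeyInfo DER blob. As in the post, the signature is expected to be the last parameter.

```javascript
const crypto = require('crypto');

// Stand-in for RSAKey.prototype.signURL, using Node key objects.
function signURL(url, privateKey, publicKey) {
  const nakedPEM = publicKey.export({ type: 'spki', format: 'der' }).toString('base64');
  url += '&pubkey=' + encodeURIComponent(nakedPEM);
  url += '&curtime=' + Date.now();
  const sig = crypto.sign('sha256', Buffer.from(url), privateKey).toString('base64');
  return url + '&signature=' + encodeURIComponent(sig);
}

// Server side: returns the vetted parameters, or null if anything is off.
function verifySignedURL(signedUrl) {
  const marker = '&signature=';
  const idx = signedUrl.lastIndexOf(marker);
  if (idx < 0) return null;
  const signedPart = signedUrl.slice(0, idx); // everything the client signed
  const sig = Buffer.from(decodeURIComponent(signedUrl.slice(idx + marker.length)), 'base64');
  const params = new URLSearchParams(signedPart.slice(signedPart.indexOf('?') + 1));
  const pubkeyB64 = params.get('pubkey');
  if (!pubkeyB64) return null;
  let publicKey;
  try {
    publicKey = crypto.createPublicKey({
      key: Buffer.from(pubkeyB64, 'base64'), format: 'der', type: 'spki',
    });
  } catch (e) {
    return null; // not a parseable public key
  }
  if (!crypto.verify('sha256', Buffer.from(signedPart), publicKey, sig)) return null;
  return {
    uname: params.get('uname'),
    pubkey: pubkeyB64,
    curtime: Number(params.get('curtime')),
  };
}

// Usage: sign only the script portion of the URL, as described above.
const { privateKey, publicKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const signed = signURL('/site/login.php?mode=ajax&uname=Mike', privateKey, publicKey);
const result = verifySignedURL(signed);
```

A valid signature only proves possession of the private key. The server still has to check that the supplied pubkey is one enrolled for that uname, and vet curtime.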

So the server can verify the blob now, as well as create new accounts in much the same manner (you just add an email address). I've implemented a "remember me" feature which doesn't store the keys after keygen if it's not checked, and removes keys if they are available.

Enrolling New Devices

On the enroll from a new device front, the logic is pretty simple: if you type in a user name, and there isn't a key for it in localStorage, it pings the server and asks for it to set up a new temporary password for you to enroll the new browser. This could be the source of a reflection attack, so server implementations should be careful to rate limit such replies. The server just sends mail to the registered account at that point with instructions. The mail has both the temporary password, as well as a URL to complete the login. The first is in case you're reading the email from a different device than the new browser device. The second is the more normal case where you click through to complete the enrollment like a lot of mailing lists use. I should note that at first my random password generator was using a very large alphabet. Bad idea: typing in complicated stuff on a phone is Not Fun At All. Keep it simple, even if it needs to be a little longer. This is just an OTP after all. Once the temporary password is entered, it's just appended to the login URL and signed as usual.
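To make the "small alphabet, slightly longer" point concrete, here is a minimal sketch of a temporary password generator using Node's crypto.randomInt. The alphabet, the default length and the function name are all made up for illustration; this is not the generator Phresheez uses.

```javascript
const crypto = require('crypto');

// Lowercase letters and digits only, minus look-alikes (0/o, 1/l/i),
// so nothing needs the shift key or a symbol keyboard on a phone.
const OTP_ALPHABET = 'abcdefghjkmnpqrstuvwxyz23456789';

function makeTempPassword(length = 10) {
  let out = '';
  for (let i = 0; i < length; i++) {
    // randomInt draws from the system CSPRNG without modulo bias.
    out += OTP_ALPHABET[crypto.randomInt(OTP_ALPHABET.length)];
  }
  return out;
}

const tempPassword = makeTempPassword();
```

With 31 symbols, ten characters come to roughly 49 bits, which seems plenty for something that is single use and times out.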

Replay

I mentioned in the original post on the subject that replay was obviously a concern. For the time being, I've kept this pretty simple in that it has the expectation of synchronization between the browser's and server's clocks. The javascript client just puts the current system time into the URL, and the server side vets it against its system time. Like Kerberos, I also keep a replay cache for within the timeout window. This is done using a mysql table that is keyed off of the signature. If the signature is in the replay table, it gets rejected. If the timestamp in the signed URL is older than, say, an hour it gets rejected. I haven't quite figured out what to do about timestamps in the future; I believe they just need to be within ± 1 hour or they get rejected. That said, I do have some concerns about synchronization. It's a very NTP world these days, but with the tracking aspect of Phresheez I've seen some very bizarre timestamps. Like, as in, anywhere from an hour or so to years in the future. I can't really be certain if these are GPS subsystem related problems (probably), or something wrong with the system time. It's a reason to be cautious about a timestamp related scheme, and if needed a nonce-based scheme could be introduced. I'm not too worried about that as the crypto-pound-of-flesh folks will surely chime in, and this is definitely not new ground.
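The timestamp window plus replay cache can be sketched in a few lines. This version keeps the cache in an in-memory Map as a stand-in for the mysql table, and the class and method names are hypothetical. It treats past and future timestamps the same way, which is one possible answer to the future timestamp question and not necessarily the right one.

```javascript
const WINDOW_MS = 60 * 60 * 1000; // the ± 1 hour window

class ReplayCache {
  constructor(windowMs = WINDOW_MS) {
    this.windowMs = windowMs;
    this.seen = new Map(); // signature -> time after which it can be forgotten
  }

  // Returns true if the request is inside the window and has not been seen before.
  check(signature, curtime, now = Date.now()) {
    // Forget entries whose timestamps would be rejected by the window anyway.
    for (const [sig, expires] of this.seen) {
      if (expires < now) this.seen.delete(sig);
    }
    if (!Number.isFinite(curtime)) return false;
    if (Math.abs(now - curtime) > this.windowMs) return false; // too old or too far ahead
    if (this.seen.has(signature)) return false;                // replay
    this.seen.set(signature, curtime + this.windowMs);
    return true;
  }
}

// Usage
const cache = new ReplayCache();
const t0 = Date.now();
const first = cache.check('sig-1', t0, t0);  // fresh request
const second = cache.check('sig-1', t0, t0); // same signature again
```

An entry only needs to live as long as its timestamp would still be accepted, which is what keeps the table from growing without bound.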


Sessions

I should note that I am not doing anything different on the session front. I'm still using the standard PHP session_start () which inserts a session cookie into the output to the browser, and logout just nukes the session cookie with session_destroy (). Nothing changes here. If you hate session cookies because of hijacking, you'll hate what I've done here too. If it bothers you enough, use TLS. It's not the point of this exercise.

I should note that an interesting extension of this is that you don't actually need sessions per se with this mechanism if you are willing to sign each outgoing URL. While this is not very efficient and is quite cumbersome for markup, it is possible and may be useful in some situations where you have a more casual relationship with a site.

Todo

A lot of this mechanism relies on the notion that the supplied email address is a good one when a user joins. That happens to be an OK assumption with Android based joins since you can get their gmail address which is known good, but other platforms not so much. So for joins, I might want to require a two step confirm-by-email step. This sort of step decreases usability though. It may be ok to require the email confirmation, but not require it right away though -- enable the account right away, but put the account in purgatory if it hasn't been confirmed after a day or two of nagging. I still haven't worked that out. There isn't much new ground here either though as anything that requires an operational/owned email address confronts the same usability issues, and it's honestly an issue with the way Phresheez currently does passwords now.

I haven't tackled much on the shared device kinds of problems either. Those include being able to revoke keys both locally as well as the stored public keys on the server (two separate problems!). Nor have I implemented local passwords to secure the localStorage credentials. They should be part of the solution, but I just haven't got around to it. It's mostly coming up with the right UI, which is never easy.

Oddities

One thing I hadn't quite grokked initially about the same-origin policy is that subdomains do not have access to the same localStorage as parent domains do. It's a strict equality test. For larger/more complex sites this could be an issue that requires enrollment into subdomains with the requisite boos from users. One way to get around this is to use session cookies as they can be used across subdomains. That is, you login using a single well-known domain, and then use session cookies to navigate around your site.

Performance

Actually performance was surprisingly good. I have an ancient version of Chrome on my linux box, and keygen -- the most intensive part of the whole scheme -- is generally less than a couple of seconds. I was quite surprised that on my Android N1 (very old these days), it was quite similar. Firefox seemed to take longest, which is sort of surprising as it's up to date so its js engine should be better. I haven't tried it on a newer IE yet which is likely to suck given how it usually just sucks. For signing, there is absolutely no perceptible lag. Again, it would be better on many fronts to have the browser expose openssl, say, to javascript, but the point here is that it's not even that bad with current-ish javascript engines.

Conclusion

It really only took me a couple of days to code this all up, and the better part of that was just getting the javascript libraries together. Once I had that, setting up the credential storage in localStorage, modifying the login UI, and signing URLs on the client side was quite easy. Likewise, setting up the table of public keys per UID was simple, and the replay cache code wasn't particularly difficult either. My test bed has been running it for several weeks now, and it seems to work. I still have concerns about UX.

Another interesting thing that's happened is that the LinkedIn debacle has obviously hit a nerve with a lot of us. I found out that there's been an ongoing discussion on SAAG about this, and that Stephen Farrell and Paul Hoffman have a sketch of a draft they call HOBA which is trying to do this at the HTTP layer using a new Authorization: method. I've been talking to Stephen about this, and we're mutually encouraging each other. I hope that their involvement will bring some grounding to this problem space from the absolutist naysayers who like to inhabit security venues and throw darts at the passers-by. If anything happens here to back out of this Chinese finger trap we call password authentication, we've done some good.

Saturday, July 7, 2012

ASN.1 considered harmful

So I've recently spent a bunch of time playing with Javascript crypto libraries. There are Javascript crypto libraries you ask? Well yes, such as they are. The one that seems most complete for my purposes is jsrsasign, but it's still missing things that I needed, so I had to scrounge the net to cobble them together. It itself is a frankenlib cobbled together from various sources and extended as the author needed.

The one thing that the library didn't have was a PEM public key decoder method. PEM is a base64 encoded ASN.1 DER (distinguished encoding rules) encoding of the public key's exponent and modulus. That is, two numbers. It also has some meta information about the keys, but I've never had a need to find out what's in there so I can't tell you what it is. That last point is actually symptomatic of why I have such incuriosity: it's part of the ASN.1 train wreck.

So I decided that this can't be too hard so I'll code it up myself. There was an example in the code which decodes the RSA private key from its PEM format, so how hard could this be? Very hard, as it turns out. Ridiculously hard. It's just two fucking numbers. Why is this so hard? If this were JSON encoded, it would have taken 10 seconds tops to write routines which encode and decode those two numbers. In another 15 seconds, I could have written the encoder for the private key too -- hey, it's got several more fields and it takes time to type. With ASN.1 DER encoding? It took literally 2 days of futzing with it. And that's just the decoding of the PEM public key. Had I needed to encode them as well, it would have been even longer.
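For the curious, here is roughly what pulling the two numbers out looks like. This is a minimal sketch and not the decoder described above: it assumes the key is a SubjectPublicKeyInfo structure (what sits between the BEGIN/END PUBLIC KEY lines), only walks the exact path to the two INTEGERs, and uses Node's Buffer and BigInt rather than the library's bignum type. The helper names are made up.

```javascript
const crypto = require('crypto');

// Read one DER tag-length-value header starting at `offset`.
function readTLV(buf, offset) {
  const tag = buf[offset];
  let len = buf[offset + 1];
  let pos = offset + 2;
  if (len & 0x80) {             // long form: low 7 bits = number of length bytes
    const count = len & 0x7f;
    len = 0;
    for (let i = 0; i < count; i++) len = len * 256 + buf[pos++];
  }
  return { tag, start: pos, end: pos + len };
}

function expectTag(tlv, tag) {
  if (tlv.tag !== tag) throw new Error('unexpected ASN.1 tag ' + String(tlv.tag));
  return tlv;
}

function toBigInt(bytes) {
  return BigInt('0x' + bytes.toString('hex'));
}

// Input: base64 DER with the header/footer lines and linebreaks stripped.
function decodeRsaPublicKey(nakedPem) {
  const der = Buffer.from(nakedPem, 'base64');
  const outer = expectTag(readTLV(der, 0), 0x30);           // SEQUENCE
  const algId = expectTag(readTLV(der, outer.start), 0x30); // the "meta information"
  const bits = expectTag(readTLV(der, algId.end), 0x03);    // BIT STRING
  const inner = expectTag(readTLV(der, bits.start + 1), 0x30); // skip unused-bits byte
  const n = expectTag(readTLV(der, inner.start), 0x02);     // INTEGER modulus
  const e = expectTag(readTLV(der, n.end), 0x02);           // INTEGER exponent
  return {
    modulus: toBigInt(der.subarray(n.start, n.end)),
    exponent: toBigInt(der.subarray(e.start, e.end)),
  };
}

// Usage: generate a key with Node, strip it to a naked PEM, decode it.
const { publicKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const naked = publicKey.export({ type: 'spki', format: 'der' }).toString('base64');
const decoded = decodeRsaPublicKey(naked);
```

The skipped "meta information" turns out to be an algorithm identifier: the object identifier for RSA plus a NULL parameter.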

What's particularly bad about this is that I've actually had experience with SNMP in the ancient past, so I both knew about ASN.1 and have coded MIBs etc, which require BER (basic encoding rules). Yes, ASN.1 has not just one encoding, not two, but at least 3 different encodings -- the last being PER (packed encoding rules). All binary. All utterly opaque to the uninitiated. Heck, I'll say that they're all utterly opaque to the initiated too.

So why am I ranting about ASN.1? After all, once the library is created nobody will have to deal with the ugliness. But that misses the point in two different areas. As a programmer, there's lots of stuff that's abstracted away so that you don't have to deal with the nitty-gritty details. That's goodness to a point: if you're using something regularly, it's good to at least be somewhat familiar with what's under the hood even if you have no reason to tinker with it. In this case, I really had no idea what was in a PEM formatted public key file, and I've dealt with them for years. It's just two fucking numbers. In the abstract I knew that, but ASN.1's opaqueness took away all curiosity to actually understand it, and when I saw methods that actually required the two parameters of modulus and exponent, I'd get all panicky since I didn't really know how they related to the magic openssl_get_publickey which decodes the PEM file for you. Seriously. How stupid is that? Had it been a json or XML blob I'd have been able at once to recognize that there's nothing to be afraid of. But it took actually finding an ASN.1 dumper and looking at the contents to realize how silly this was.

That gets to my next point: after I found out that it was really just two fucking numbers, it still took me two days to finally slay the PEM public key decoding problem. The problem here is innovation. Maybe there are masochists who would prefer spending time encoding and decoding ASN.1 and they are entitled to their kink, but the vast majority of us want nothing to do with it. Even if there were good ASN.1 encoding and decoding tools -- which there are not in the free software world -- I'd still have to go to the trouble of writing things up in their textual language and running it through an ASN.1 compiler. Ugh. It's not just javascript that has this problem, it's everything. ASN.1 lost. It's not supported. It makes people avoid it at almost all costs. That hampers innovation because if you want to add even one field to a structure, you're most likely breaking all kinds of software. Or at least that's what any sane programmer should assume: most of this ASN.1 code is purpose built, and not generalized, so you should be very scared that you'll break vast quantities of it if you add something. End result: stifled innovation.

I write this because I think that a lot of the problem with getting people to understand crypto is tied up with needless distractions (the other is that certificate PKI != public key cryptography, but that's a rant for another day). Crypto libraries are hard to understand generally because, let's face it, crypto is hard. But crypto libraries'/standards' use of ASN.1 makes things much, much more difficult to understand, especially when all you're talking about is two fucking numbers. It's a lot of what the problem is in my opinion, and it's a real shame that there doesn't seem to be any practical way out of this predicament.




Friday, June 29, 2012

localStorage security

In my previous post, I wrote about a scheme to replace symmetric key login with an asymmetric key scheme where it keeps its keys in localStorage. I mentioned that the use of localStorage is certainly a questionable proposition, and looking around the web there are definitely a lot of OMG YOU CAN'T DO THAT!!! floating around. So it's worth going over the potential attacks on this scheme which so relies on the relative security of localStorage.

First off, what are the security properties of localStorage itself? Not much. The browser walls localStorage off from other origins so that other sites can't snarf up its contents, but other than that anything executing within the context of the origin site can grab the entire localStorage object and have at it. That sounds pretty scary to be saving rsa private keys. But taking the next step, I can really only see two vectors that could be used to reveal those keys. The first is local access: somebody or something with access to, say, a javascript console could trivially grab those keys. The other vulnerability is XSS and other kinds of injection attacks where malicious javascript loaded into the browser could gain access to the keys.

Local Access to LocalStorage

So how much of a threat is malware or somebody with console access snooping on keys? Well, it's obviously a big threat. If somebody has unfettered access to your device they can do just about anything they want including going into your browser and opening up the javascript console, for example, and snarfing your keys. So that sounds fatal, right? Well yes it is fatal, but it's not uniquely fatal: they can also snarf up all of the stored browser passwords trivially, install keyboard loggers to grab your credit cards, and generally wreak havoc. So pointing out that bad guys 0wning your computer is bad doesn't say much about this scheme. Does this scheme make things even worse? I don't see it. You might point out that I can lock my browser with a password which assumedly encrypts its stored passwords. Assuming that that password hasn't been compromised (a big assumption), it provides some protection against snoopers. But there's nothing fundamental about my scheme that prevents using a local PIN/password to unlock the key storage as well. Indeed, some apps may very well want to do that to increase the security at the cost of more hassle for the user. Another possibility is that we can do some cat-and-mouse games with the keys. See below.
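As a sketch of what a local PIN/password unlock might look like: derive a key from the PIN and encrypt the private key before it goes into storage. This uses Node's scrypt and AES-256-GCM as stand-ins for whatever a browser library would offer. The function names and the stored JSON layout are made up for illustration.

```javascript
const crypto = require('crypto');

// Encrypt a PEM private key under a key derived from the user's PIN.
// Returns a string suitable for putting into localStorage.
function wrapKey(privateKeyPem, pin) {
  const salt = crypto.randomBytes(16);
  const iv = crypto.randomBytes(12);
  const key = crypto.scryptSync(pin, salt, 32); // slow on purpose
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const data = Buffer.concat([cipher.update(privateKeyPem, 'utf8'), cipher.final()]);
  return JSON.stringify({
    salt: salt.toString('base64'),
    iv: iv.toString('base64'),
    tag: cipher.getAuthTag().toString('base64'),
    data: data.toString('base64'),
  });
}

// Throws if the PIN is wrong or the stored blob was tampered with.
function unwrapKey(stored, pin) {
  const { salt, iv, tag, data } = JSON.parse(stored);
  const key = crypto.scryptSync(pin, Buffer.from(salt, 'base64'), 32);
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, Buffer.from(iv, 'base64'));
  decipher.setAuthTag(Buffer.from(tag, 'base64'));
  const plain = Buffer.concat([decipher.update(Buffer.from(data, 'base64')), decipher.final()]);
  return plain.toString('utf8');
}

// Usage with a placeholder key body
const pem = '-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----';
const stored = wrapKey(pem, '4821');
const recovered = unwrapKey(stored, '4821');
```

A short PIN can still be brute forced offline by anyone who copies the blob, so this raises the bar for a casual snooper and doesn't do much beyond that.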

Evil Code and Key Security

Like most environments, executing evil code is not a particularly good idea. For javascript, once a piece of code has been eval'd, it's just so much byte code in the interpreter and the browser cares the least about its provenance with respect to who can see what within that browser session. So the short answer is that evil code loaded into your site's browser window is going to give complete access to the user's private key. We can maybe make that a little harder for the attacker (see below), but at its heart I cannot see how you can truly avoid disclosure. That said, code injection is a serious threat period: if evil code is inserted into your browser session, it can snoop on your keyboard strokes to get blackmail material when you're on sites that maybe you ought not be, make ajax calls back to your site in the user's name to, oh say, buy those UGG's the attacker has been drooling over, and generally do, well, any number of evil things. So there is ample reason to protect against code injection attacks and the web is full of information about what one must do to protect oneself. So I don't see how we're making things worse by keeping private keys in localStorage.

Storing Passwords vs Asymmetric Keys

It's worth noting that we're talking about asymmetric keys and not passwords when evaluating threats. As LinkedIn shows, the real threat of a password breach is not really so much to the site that was breached, it's all of the sites you used the same password on too. That is, the collateral damage is really what's scary because you've probably forgotten 90% of the sites that have required you to create an account. So storing a password in localStorage gives a multiplier effect to the attacker: one successful crack gains them potential access to 100's if not 1000's of sites. That's pretty scary especially considering that the attacker just has to penetrate one site that's clueless about XSS to get the party rolling. Worse: revocation requires the user to diligently go to every site they used that password on and change it. Assuming you remember every site. Assuming the bad guys haven't got there first.

So for the inherent vulnerabilities of storing asymmetric keys in localStorage, it's worth noting that we gain some important security properties, namely the localization of damage due to being cracked. If an attacker gains access to a user's private key they only gain access to the site that the key is associated with, not the entire Internet full of sites that require accounts. And revocation is trivial: all that is required to stanch a breach is to revoke the individual key that was compromised on the individual browser that stored it. It's not even clear that you ought to revoke all of the keys associated with a given compromised user's account since they'd still need to have access to the individual devices with the two attacks above. But it might be prudent to do so anyway. And it seems very prudent when storing the public keys associated with a user account to store some identifying information like the IP address, browser string, etc so that users can try to get some clue whether new enrollments have been performed (which would indicate that their email password has been breached too given the way we do enrollment). 
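A sketch of what the server-side bookkeeping for this could look like: each enrolled public key carries a little context about where it came from, and keys are revoked one at a time. The class, method and field names are hypothetical, and a Map stands in for a database table.

```javascript
class KeyRegistry {
  constructor() {
    this.byUser = new Map(); // username -> array of enrolled key records
  }

  // Record a new public key along with where the enrollment came from.
  enroll(username, pubkey, meta) {
    const keys = this.byUser.get(username) || [];
    keys.push({
      pubkey,
      ip: meta.ip,
      userAgent: meta.userAgent,
      enrolledAt: meta.enrolledAt === undefined ? Date.now() : meta.enrolledAt,
    });
    this.byUser.set(username, keys);
  }

  isEnrolled(username, pubkey) {
    return (this.byUser.get(username) || []).some((k) => k.pubkey === pubkey);
  }

  // Revoke a single compromised key; other devices keep working.
  // Returns true if a key was actually removed.
  revoke(username, pubkey) {
    const keys = this.byUser.get(username) || [];
    const kept = keys.filter((k) => k.pubkey !== pubkey);
    this.byUser.set(username, kept);
    return kept.length !== keys.length;
  }

  // What a user would look at to spot enrollments they don't recognize.
  listEnrollments(username) {
    return (this.byUser.get(username) || []).map(({ ip, userAgent, enrolledAt }) => (
      { ip, userAgent, enrolledAt }
    ));
  }
}

// Usage
const registry = new KeyRegistry();
registry.enroll('bob', 'PUBKEY-LAPTOP', { ip: '192.0.2.10', userAgent: 'Chrome', enrolledAt: 1 });
registry.enroll('bob', 'PUBKEY-PHONE', { ip: '192.0.2.77', userAgent: 'Android', enrolledAt: 2 });
registry.revoke('bob', 'PUBKEY-PHONE');
```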

Cat and Mouse 

One last thing: everybody with half a security head will most likely tell you that security through obscurity is worthless. Heck, I've said it myself any number of times. And it's still mostly true. However, it's also true that bad guys are as a general rule extremely opportunistic and are much more likely to go after something that's easy to attack than something that's harder -- all things being equal. And if an attack gives you a multiplier effect like getting a widely shared password, even better. So while we can't absolutely thwart bad guys once they're in and have free rein, we can make it a little harder to both find the keys in localStorage, and to make use of them once they find them. For one, we could conceivably encrypt the private key in localStorage with a key supplied by the server. Yes of course the attacker can get that key if it has access to the code, but the object here is to make their life harder. If they have to individually customize their attack scripts, they're going to ask the obvious question of opportunity cost: is it worth their while to go to the effort, especially if the code fingerprint changes often? Most likely they'll go after easier marks.

Likewise, we could change the name of the localStorage key that contains the credentials and set up a lot of decoys in the localStorage dictionary to require more effort on the attacker. And server-side, we could be on the lookout for code that is in succession going through decoy keys looking for the real key. And so on. The point here is to get the attacker to go try to mug somebody else.

So is it Safe?

Security is a notoriously tricky thing so anybody who claims that something is "safe" is really asking for it. Are there more attacks than I've gone over here? It would be foolish to answer "no". But these are the big considerations that I see. Security in the end is about weighing risks. We all know that passwords suck in the biggest possible way, so the real question is whether this scheme is safer. Unless there are some serious other attacks against the localStorage -- or the scheme as a whole -- that I'm unaware of, it sure looks to me that the new risks introduced by this scheme are better than the old risks of massively shared passwords. But part of the reason I've made this public is to shine light on the subject. If you can think of other attack vectors, I'm all ears.

Friday, June 22, 2012

Asymmetric Key Login/Join


Using Asymmetric Keys for Web Join/Login

I've written in the past about how Phresheez does things a little different on the username/password front by auto-generating a password for each user at join time. This has the great property that if Phresheez has a compromise, it doesn't affect zillions of other accounts on the net. However, the password is still a symmetric key which has to be stored encrypted for password recovery. Storing any symmetric key is not ideal.

As I mentioned in the previous post, what we're really doing here is enrolling a new device (browser, phone, etc) to be able to access the Phresheez server resources -- either the first time as when you join, or for subsequent logins. So I've come up with a new method using asymmetric keys which neatly avoids the problem for storing sensitive symmetric key data on the server. The server instead stores public keys, which by definition are... public. So if Phresheez is compromised, they get none of the sensitive credential information, just public keys which are worth something to Phresheez, but worthless to everybody else unless they have the corresponding private key, which is supposedly hard to obtain. This takes this compromise isolation scheme to the next step, and is surprisingly straightforward.

For web use, I'm taking advantage of the new html 5 localStorage feature to store the asymmetric key. localStorage seems sort of frightening, but the browser does enforce a same origin policy to limit who can use it. If we believe that the browser protections are adequate (and that's worth questioning), then we can use it to store the asymmetric key. Note that although these flows do this in terms of web sites, there is nothing that *requires* using web technology. The cool thing is that it works within the *confines* of current web technology, which is sort of a least common denominator.

Join


  1. To join, the app generates a public/private key pair and prompts the user for a username and some recovery fallbacks (eg, email, sms, etc). The asymmetric key is stored in localStorage for later use along with the username it is bound to. The app then signs the join information (see below) using the private key of the key pair, and forms a message with the public key and the signed data.
  2. The server receives the message and verifies the signed data using the supplied key. This proves that the app was in possession of the private key for the key pair. The server creates the account and adds the public key to a list of public keys associated with this account.

Login


  1. Each time the user needs to log in to the server, it creates a login message (see below) and does a private key signature of the message using the asymmetric key stored in localStorage. The signed login message along with the associated public key is sent to the server.
  2. The server receives the message and verifies the signed data. If the supplied public key is amongst the set of valid public keys for the supplied username, then the login proceeds.  See below for a discussion about replay.

Enrolling a New Display Head


  1. When a user wants to start using a different device, they have two choices: use a currently enrolled device to permit the enrollment or resort to recovery mode using email, sms, etc
  2. To enroll a new device using an existing app, the app can prompt the user for a temporary pass phrase on the currently enrolled app. This password is a one-use password and expires in a fixed amount of time (say, 30 minutes). It doesn't need to be an overly fussy password since it's one-time and timed out. The app sends this temporary password to the server with a login message (see below). The server saves the temporary password and timestamps it for deletion -- say less than one hour. An alternative is that the app can generate a one-time password for the user and send it to the server. Either works.
  3. Alternatively, if an enrolled device is not available, a user can request that a temporary password be mailed, sms'ed, etc to the user. The server stores the temporary and timestamped password as above. The user receives the temporary password and follows steps 4 and 5 to complete the enrollment.
  4. The user then goes to the new device where they are prompted for a username and the temporary password. The new device creates a public/private key pair as in the join flow, and signs a json blob with the username and temporary password (see below). The new key pair is stored in the new device's localStorage
  5. The server receives the signed json data along with the new public key and checks the temp password stored in step 2 or 3. If they match, the temp password is deleted on the server, and the new public key is added to the list of acceptable public keys for this user. Subsequent logins from this device follow the login flow.
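The server side of the temporary password handling in the steps above might be sketched like this. It is an illustration under assumptions and not a reference implementation: the names are hypothetical, a Map stands in for a database table, and hashing the stored password is an extra precaution the steps don't call for.

```javascript
const crypto = require('crypto');

const TTL_MS = 30 * 60 * 1000; // "say, 30 minutes"

function sha256(text) {
  return crypto.createHash('sha256').update(text).digest();
}

class TempPasswords {
  constructor() {
    this.pending = new Map(); // username -> { hash, expires }
  }

  // Steps 2 and 3: remember the temporary password and when it dies.
  issue(username, password, now = Date.now()) {
    this.pending.set(username, { hash: sha256(password), expires: now + TTL_MS });
  }

  // Step 5: check the password; on a match it is deleted so it can't be reused.
  consume(username, password, now = Date.now()) {
    const entry = this.pending.get(username);
    if (!entry) return false;
    if (now > entry.expires) {
      this.pending.delete(username);
      return false;
    }
    // Both buffers are 32 byte digests, so timingSafeEqual's length rule holds.
    if (!crypto.timingSafeEqual(entry.hash, sha256(password))) return false;
    this.pending.delete(username);
    return true;
  }
}

// Usage
const temps = new TempPasswords();
const start = Date.now();
temps.issue('bob', 'kq7mz3hx9p', start);
const wrongGuess = temps.consume('bob', 'not-the-password', start);
const firstUse = temps.consume('bob', 'kq7mz3hx9p', start);
const secondUse = temps.consume('bob', 'kq7mz3hx9p', start);
```

In practice the failed guesses should be rate limited or counted as well, since the password is short by design.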

Message Formats


Note that there isn't anything sacrosanct about json here. It could be done using a GET/POST URLEncoded form data too. I just happen to find json a nice meta language. And by Signed, I mean a sha1|256 hash over the data and signed with the private key. I suppose I could sign the pubkey as well, but that's just details, just like I'm not specifying the canonicalization of what's in the hash.

login/join message


{"pubkey":"--pubkey data--", "signature":"RSA-signature over login-blob", "body":"--signed login/join--"}

signed login


{"cmd":"login", "username":"bob", "timestamp":"unix-timestamp", "optional-temp-password":"otp"}

signed join


{"cmd":"join", "username":"bob", "timestamp":"unix-timestamp", "email":"bob@example.com", "sms":"1.555.1212"}

replies (not exhaustive)


{"sts":200, "comment":"ok"}
{"sts":400, "comment":"database down"}
{"sts":500, "comment":"bad encrypted data"}
{"sts":500, "comment":"timestamp expired"}
{"sts":501, "comment":"username taken"} // joins, but a re-join with a enrolled key is ok
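Putting the message formats together, here is a sketch of how a client might build a login/join message and how a server might vet it. It uses Node's crypto with sha256 and a base64 SubjectPublicKeyInfo public key, both of which are assumptions. The function names and the final catch-all reply are made up, and timestamp/replay checking is left out to keep it short.

```javascript
const crypto = require('crypto');

// Client: sign the body and wrap it with the public key and signature.
function makeMessage(cmd, fields, privateKey, publicKey) {
  const body = JSON.stringify({ cmd, ...fields, timestamp: Math.floor(Date.now() / 1000) });
  return JSON.stringify({
    pubkey: publicKey.export({ type: 'spki', format: 'der' }).toString('base64'),
    signature: crypto.sign('sha256', Buffer.from(body), privateKey).toString('base64'),
    body, // sent as the exact string that was signed, which sidesteps canonicalization
  });
}

// Server: enrolledKeys is a Map of username -> Set of base64 public keys.
function handleMessage(raw, enrolledKeys) {
  const { pubkey, signature, body } = JSON.parse(raw);
  const key = crypto.createPublicKey({
    key: Buffer.from(pubkey, 'base64'), format: 'der', type: 'spki',
  });
  if (!crypto.verify('sha256', Buffer.from(body), key, Buffer.from(signature, 'base64'))) {
    return { sts: 500, comment: 'bad encrypted data' };
  }
  const msg = JSON.parse(body);
  const keys = enrolledKeys.get(msg.username);
  if (msg.cmd === 'join') {
    // A re-join with an already enrolled key is ok; anything else is a clash.
    if (keys && !keys.has(pubkey)) return { sts: 501, comment: 'username taken' };
    enrolledKeys.set(msg.username, (keys || new Set()).add(pubkey));
    return { sts: 200, comment: 'ok' };
  }
  if (msg.cmd === 'login' && keys && keys.has(pubkey)) {
    return { sts: 200, comment: 'ok' };
  }
  return { sts: 500, comment: 'unknown key or command' }; // hypothetical reply
}

// Usage
const enrolled = new Map();
const bob = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const joinReply = handleMessage(
  makeMessage('join', { username: 'bob', email: 'bob@example.com' }, bob.privateKey, bob.publicKey),
  enrolled
);
const loginReply = handleMessage(
  makeMessage('login', { username: 'bob' }, bob.privateKey, bob.publicKey),
  enrolled
);
```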

Replay Protection

It's worth discussing replay protection. Here I have a timestamp which would assumedly need to be fairly well synchronized with the server time, and be relatively short lived -- say a few minutes. Alternatively, if it's acceptable to add a step, the client can request that the server send a nonce and add the nonce to the signed blobs instead.

In all cases, however, it should be assumed that the entire transaction is sent using TLS so that the server-client communication is private. Subsequent transactions may or may not be sent over tls... session management, etc is out of scope of this idea.
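The nonce alternative could be sketched as follows: the server hands out a random value, remembers it briefly, and accepts it back exactly once inside a signed blob. The class and method names are hypothetical, the five minute lifetime is an arbitrary pick, and a Map stands in for real storage.

```javascript
const crypto = require('crypto');

class NonceIssuer {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.outstanding = new Map(); // nonce -> expiry time (ms)
  }

  // Extra round trip: the client asks for this before signing its login.
  issue(now = Date.now()) {
    const nonce = crypto.randomBytes(16).toString('hex');
    this.outstanding.set(nonce, now + this.ttlMs);
    return nonce;
  }

  // Called after the signature on the blob containing the nonce has verified.
  // A nonce is removed on first sight, so a replayed blob fails here.
  redeem(nonce, now = Date.now()) {
    const expires = this.outstanding.get(nonce);
    this.outstanding.delete(nonce);
    return expires !== undefined && now <= expires;
  }
}

// Usage
const issuer = new NonceIssuer();
const issuedAt = Date.now();
const nonce = issuer.issue(issuedAt);
const firstRedeem = issuer.redeem(nonce, issuedAt);
const replayRedeem = issuer.redeem(nonce, issuedAt);
```

This trades the clock synchronization worry for server-side state and one more request per login.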

Multiple Accounts on One Device

A shared device with multiple accounts is possible if the username is stored along with the asymmetric key pair, binding them to each other. Multiple entries can be kept, one for each credential, and selected by the current user. This, of course, is fraught with the possibility for abuse, since you're enrolling the device potentially long-term. A couple of things can possibly be done to combat that. First, the user can request that the credential be erased from localStorage. Similarly, in the enrollment phase, a user could request that the key pair only be kept for a certain amount of time, or that it not be stored at all. Last, it's probably best to just not use shared devices at all since that's never especially safe.
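A minimal sketch of that storage layout, with one entry per username, an optional expiry, and an explicit erase. The storage key and function names are invented for illustration. The functions take any object with the getItem/setItem interface, so in a browser you would pass window.localStorage; the in-memory stand-in is only there so the sketch runs outside one.

```javascript
// Tiny stand-in with the same getItem/setItem shape as localStorage.
function memoryStorage() {
  const items = new Map();
  return {
    getItem: (k) => (items.has(k) ? items.get(k) : null),
    setItem: (k, v) => { items.set(k, String(v)); },
  };
}

const STORE_KEY = 'credentials'; // hypothetical localStorage key

function loadAll(storage) {
  return JSON.parse(storage.getItem(STORE_KEY) || '[]');
}

function saveAll(storage, list) {
  storage.setItem(STORE_KEY, JSON.stringify(list));
}

// Bind a key pair to a username; expiresAt (ms) is optional.
function saveCredential(storage, username, privateKeyPem, expiresAt = null) {
  const others = loadAll(storage).filter((c) => c.username !== username);
  others.push({ username, privateKeyPem, expiresAt });
  saveAll(storage, others);
}

// The user asked for this credential to be erased from the device.
function forgetCredential(storage, username) {
  saveAll(storage, loadAll(storage).filter((c) => c.username !== username));
}

// Returns the credential, or null if absent or past its requested lifetime.
function getCredential(storage, username, now = Date.now()) {
  const cred = loadAll(storage).find((c) => c.username === username);
  if (!cred) return null;
  if (cred.expiresAt !== null && now > cred.expiresAt) {
    forgetCredential(storage, username);
    return null;
  }
  return cred;
}

// Usage: two accounts on one device, one of them time limited.
const storage = memoryStorage();
saveCredential(storage, 'bob', 'BOB-PRIVATE-KEY');
saveCredential(storage, 'alice', 'ALICE-PRIVATE-KEY', 1000);
const bobCred = getCredential(storage, 'bob', 5000);
const aliceCred = getCredential(storage, 'alice', 5000); // already expired
```

Erasing the local copy does nothing about the public key still enrolled on the server, which has to be revoked separately.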

About Public Key Encryption

I'm a little creaky on RSA right now so forgive me if I get some of the details wrong. I've checked on a newish linux box running Chrome and public key verifies are cheap while private key signatures from within javascript are more painful (~1s for a 1024-bit RSA key). I doubt that's a deal breaker, and in the long term giving native BIGNUM support to javascript may not be a bad idea. For hybrid apps like Phresheez, it could even reach out to the native layer to get keys and signatures if it's a big problem. Likewise, generating keys can be slow, but possibly backgrounded while the user is typing username, email, etc if they need to. And of course, there's the perennial question of the RNG. Just how good Math.random() in js is remains an interesting question. However, we have to keep in perspective what we're really changing from, which is crappy megashared user passwords. A 512 bit RSA key with a not terribly good RNG is most likely still better than the current situation, and there's a pretty darn good chance that we can do better - maybe even much better.

Conclusion

In conclusion, this mechanism provides a way to finally break the logjam of pervasive insecure shared secret schemes that are so prevalent not only on the web, but everywhere. The server never needs to keep a long term and potentially sensitive symmetric key, nor does it ever need to store anything that is not fundamentally public (ie, a public key). This wasn't really available for use on the web until we could use localStorage to store credentials in the browser. In conjunction with out of band recovery mechanisms like email, SMS, as well as currently enrolled devices, we can enroll new devices that generate their own credentials so that a compromise of one device doesn't even compromise other devices you own.

The big question is whether we can make the user experience close to what people's current expectations are, but with a few twists -- like, for example, making clear that "recovery" isn't a moral failing, but the expected way you enroll new devices. UX is a tricky thing, and should not be discounted, but it seems there is at least hope that it could be successful.

Saturday, June 9, 2012

Client vs Server Charts

Charts are a very quick way to view statistical data, and good charting packages can bring a lot of neat ways to slice and dice that data. Since Phresheez started out as a fairly typical server side web site, it was pretty natural to generate the charts server side as well. Back then, javascript was still pretty slow, and html5 canvas support nonexistent. A more serious problem in reality is that I hadn't taken the step to process and cache the statistics for a day, so the amount of data required to be sent to the browser could be pretty big -- on the order of 100kb typically. So I never really considered it.

Besides, the graphing package I used (jpgraph) is pretty complete and I've really had no complaints about it per se. Its biggest problem honestly is its reliance on an underlying library -- libgd -- which isn't the best. Ok, having done graphics kernels before, it pretty much sucks. In particular, the curve algorithm really sucks, producing jaggies, and that really bothers me (it's almost as if they're using a two direction ellipse step algorithm rather than three directions, and didn't Bresenham figure that out ages ago?). And it can't figure out how to write text on the baseline when it's anything other than horizontal, which makes the graphs look rather amateurish. But they have served me well, and it's definitely useful because sometimes only an image will work, like when you need to post goodies to Facebook which doesn't allow arbitrary blobs of html and javascript for pretty obvious reasons.

In the past year, I had made some changes to the server side graphs to freshen them up. This included using a graphic artist's best friend -- gradients. Without getting ratholed about whether that's a good or bad thing, one noticeable effect of using gradients is that the size of the .png (.jpg's look horrible) goes up dramatically. No surprise, but the once 10-20kb graphs were now 30-40kb each. Given that they looked slicker, it seemed a decent tradeoff. A more pernicious issue, however, was caching. People jump between pages with various graphs all of the time. Since people are looking at the graphs, oh say, at lunch when they've been skiing, the images cannot reasonably be cached -- the GPS uploaders are all busy at work for both you and your friends, and you expect that the charts will be kept up to date. So in reality, that 30-40kb is multiplied by the number of charts, your number of friends, and the number of times you look at the app again. While it was certainly server load, I was much more concerned about user experience since often the reception at resorts sucks and trying to download a 30-40kb image each time seems... slow.

So I had long ago fixed the stat aggregation caching problem for its own obvious benefit. I had been playing around with html5 canvas stuff and was generally impressed with how well it behaved cross platform -- even ie9 does a pretty good job from what I can tell. So I decided on a whim to start looking for a javascript package that does graphing. I'll admit that my research on the subject wasn't the deepest -- in the beginning I was mainly interested in just testing the waters -- but I eventually settled on RGraph. Since the server side graphs had been evolving for years, I was rather worried about how long it would take to just get to the baseline of what I had server side, but I was rather pleasantly surprised that it only took me a week, maybe two tops to get to parity. Better is that the rendering on browsers is much better than libgd, so goodbye jaggies. And it can do cute little animations. And since it's client side, I can attach events to the graphs more easily -- yes, I know that it's a hack since it's a Canvas rather than SVG, but still it's easier to contemplate than krufty image maps.

I had been vacillating about whether to make the change for quite some time for one reason: it increases the size of the web widget by about 100kb, which was pretty substantial. What finally won me over was that I realized that I was being penny wise pound foolish: the cost of an image is say 30kb, and you might look at 3 of them for yourself at one sitting, several for your friends and then you may have several sittings as well. This all adds up, and as I mentioned it creates noticeable lag in the user experience. The client side graphs, on the other hand, all use the same cached statistics blob which is about 10k uncompressed, 3-4k compressed over the wire. So where you might be looking at 200-300kb or more of data transfer over a day, doing it client side is probably on the order of 10-20kb, if that. And it appears almost instantaneously, especially if it doesn't need to refresh the stats blob. Compare that to the upfront 100kb code investment, which is amortized over the life of the web widget (generally a couple of weeks, maybe longer), and it became obvious that this was a no-brainer.
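The back-of-the-envelope comparison can be written out as a small calculation. The per-image, per-blob and code sizes come from the paragraph above; the counts (5 charts per sitting, 2 sittings, 4 blob fetches a day, a 14 day widget lifetime) are assumptions picked to land in the ranges quoted, not measurements.

```javascript
// Server-side charts: every chart view is a fresh image download.
function serverSideKbPerDay(chartsPerSitting, sittingsPerDay, kbPerImage) {
  return chartsPerSitting * sittingsPerDay * kbPerImage;
}

// Client-side charts: a small stats blob per refresh, plus the one-time
// charting code spread over the days until the widget is next updated.
function clientSideKbPerDay(blobFetchesPerDay, kbPerBlob, codeKb, daysBetweenUpdates) {
  return blobFetchesPerDay * kbPerBlob + codeKb / daysBetweenUpdates;
}

// Assumed usage: 3 of your own charts + 2 friends' charts, 2 sittings a day.
const serverKb = serverSideKbPerDay(5, 2, 30);
// Assumed usage: 4 blob fetches at 4kb each, 100kb of code over 14 days.
const clientKb = clientSideKbPerDay(4, 4, 100, 14);

console.log(serverKb, Math.round(clientKb)); // prints: 300 23
```

Even with the amortized code download counted against it, the client side comes out at less than a tenth of the server side traffic under these assumptions.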

So I've managed to convert everything over and push out a release. Everything seems to be working, but corner cases on graphs are hard to ferret out (think labeling, grrr) so I expect there will be some futzing as they crop up. The support at RGraph was very quick and they're receptive to upgrades, of which I have a few smallish ones that I've dropped the ball on. It's a client side world.

Friday, June 8, 2012

Phresheez Join Passwords

Phresheez requires that you create an account because in order to do anything interesting it needs a place to send points to on the backend. However, we had quite a bit of evidence that users abandon the app before ever signing up. There's probably a variety of reasons for this, but it probably boils down to one of two reasons: either they just don't want to have yet another thing that requires a username and password, or they find that it is too onerous to type all of the necessary information in. I read a very interesting piece on iPad Usability which mostly applies to phones too, and one not very surprising observation is that people really dislike typing on their phones. For Phresheez that's probably even worse because they are probably finding out about us through friends and are probably in a hurry to get out skiing.

So I asked myself, what can I do to lower that energy barrier? Starting out with a naked form is the least friendly, and auto-generating a user account is the best. So I started looking on Android and lo and behold, there is a way to get the user's gmail address. Groovy. Since the left hand side of a gmail address is very likely to be globally unique, I can then use that as the seed to create a unique user id. For the email address, it's a no brainer since we already have the email address. That just leaves the password.
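One way the seeding could work is sketched below. The post doesn't say how collisions are handled, so the numeric suffix scheme, the character filtering and the isTaken callback (standing in for a lookup on the backend) are all assumptions.

```javascript
// Derive a candidate username from the left hand side of an email address,
// adding a numeric suffix until the (hypothetical) isTaken check passes.
function usernameFromEmail(email, isTaken) {
  const local = email.split('@')[0].toLowerCase();
  // Keep only letters and digits; fall back to "user" if nothing is left.
  const seed = local.replace(/[^a-z0-9]/g, '') || 'user';
  let candidate = seed;
  for (let n = 2; isTaken(candidate); n++) {
    candidate = seed + n;
  }
  return candidate;
}
```

For example, with nothing taken, "Bob.Smith@gmail.com" becomes "bobsmith"; if "bobsmith" and "bobsmith2" already exist, the result is "bobsmith3".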

When I started thinking about this, it occurred to me that I could just auto-create a good strong password for them. The app stores the password so it doesn't have to be something they need to remember. Well that's almost true: Phresheez is both an app and a web site, so they may want to know the password to see their stuff on the site as well. I fretted about this quite a bit, but ultimately I decided that a compromise was that I'd auto-generate their password, but leave it in clear text until they decided to lock it which gave them the opportunity to type their password into the web site. That and there's always password recovery. That's where things currently stand.
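A minimal sketch of generating such a password, using Node's crypto module as a stand-in for whatever the app actually uses. The 16 character length and the alphabet are assumptions; the alphabet skips look-alike characters such as l, I, O, 0 and 1 because the user may have to retype the password on the web site.

```javascript
const crypto = require('crypto');

// Letters and digits minus easily confused ones (assumed alphabet).
const ALPHABET = 'abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789';

// Build a password by drawing each character uniformly from ALPHABET
// with a cryptographically strong random source.
function generatePassword(length = 16) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += ALPHABET[crypto.randomInt(ALPHABET.length)];
  }
  return out;
}
```

Because the password is generated per site and never chosen by the user, it can't be one they already use somewhere else.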

In doing this I realized that this method has a very interesting security property: since Phresheez generates the password for you, any compromise of Phresheez will not compromise other sites where you might otherwise use the same/similar password. Yes we all know that it is bad to use the same password on multiple sites, but it is the reality of the world that people do this. And why wouldn't they? People are required to join probably hundreds if not thousands of sites for various reasons. Are we really to expect that they create a unique and hard to guess password for each site? Of course not, that's complete idiocy and anybody who spouts such a thing should be flayed alive.

The LinkedIn fuckup got me to thinking about this again though. In my annoyance, I posted to NANOG what I thought was so completely wrong about the blog post's posturing toward st00pid lusers. Many people chimed in that anybody who isn't using a password vault thingamajig deserves what's coming to them. But that really misses the point: putting the onus on users to protect themselves is first of all a provably losing proposition, but also obscures the fact that we have been putting them in a completely untenable situation. The current username/password scheme is nearly 50 years old and it really shows. Everybody knows it sucks, so scolding users for being human is not the answer for what is really an engineering failure.

What occurred to me is that the real security advantage of the way that Phresheez does things is that it puts security in the hands of Phresheez rather than users who don't have any clue. They don't have to know to download and use some password vault thingy. Apps can already store your credentials, and all browsers have password rememberers. And even if the browser doesn't have a rememberer, you can almost certainly use html5 localStorage to remember it. As for the need for cross-device passwords that vexed me? Well, now that I consider it, the real answer is password recovery. Every site needs the ability to recover usernames and/or passwords and it is done via your supplied email address. This is just a fact, and is completely orthogonal to password generation. If password recovery has to be there anyway, why not use it as a feature rather than a necessary evil? Since it is a necessity we shouldn't make password recovery a semi-shameful thing that you "forgot", but the normal way of enrolling a new display head to the site. Maybe we should put a positive spin on password recovery from being something you "forgot" to being something that allows you to add a new device to see your goodies on. That it's the *normal* and expected way to see stuff on multiple display heads, not a failure of character.

In conclusion, I started down this road because auto-generating passwords was more user friendly, but it has turned out that it is seemingly a much more secure way of enrolling users as well. And it puts the onus for better security on developers rather than end users. Snicker all you like about that, but at least there's a chance that developers can be beaten to do the right thing, especially since this isn't all that hard to do.

Sunday, March 18, 2012

Phresheez Has a Yard Sale

Kaboom

I wouldn't have said that Phresheez had an ironclad disaster recovery plan, but at least we had a plan. We do mysql database replication back to my server on mtcc.com and do daily backups of the entire database. We have backups of the server setup, and more importantly have a step-by-step build-the-server-from-the-distro guide which, along with the config files, is checked into svn. We only have one active production server, so that implicitly accepts that significant downtime is possible. On the other hand, we have been running non-stop since 2009 with exactly one glitch where a misconfigured Apache ate all its VM and wedged. That lasted about an hour or so -- hey, we were skiing at the time, so all in all not terrible for a shoestring budget.

Our downtime doesn't take into account routine maintenance, and I had been in need of doing a schema update on our largest table, the GPS point database. So I happened to wake up at 2am and decided to use the opportunity to make the change. Nothing complicated -- just take the site offline and make a duplicate table with the new schema. That's when the fun began. Each of several times I tried, the master side gave up complaining about something going wrong with the old table. I then tried to do a repair table on it, and it bombed too. Strange. After the fact -- a mistake, but it didn't matter as it turns out -- I decided to do a file system copy of the database table file. Death. Dmesg is definitely not happy either about the disk. I tried to see if the index file was ok, and same problem. I tried other large tables, and they seemed ok. A mysql utility confirmed that it was just the GPS point database, even though that was pretty bad by itself.

So... I was pretty much hosed. Something had blown a hole into the file system and torched my biggest table -- some 15 Gig big. Fsck with some prodding discovered and repaired the file system, but couldn't salvage the files themselves. So it's restore from backup time. Ugh.

There were two options at this point: do a backup of the slave or just copy over the slave's data file. I wasn't entirely coherent (it was early), and decided to give the first a try. Here's where the first gigantic hole in our strategy came in: either method required that a huge file be copied from the slave server to the master. Except the slave is a machine on a home DSL uplink getting about 100 KB/sec throughput: scp was saying about 12 hours transfer time. Oops.
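To see how much the uplink dominates, here is the arithmetic as a small sketch. The 100 KB/sec and 12 hour figures are the ones quoted above; the 10 MB/sec server-to-server rate is purely an assumed number for comparison, and KB and MB are taken as decimal units.

```javascript
// Hours needed to move sizeBytes at a steady bytesPerSecond.
function transferHours(sizeBytes, bytesPerSecond) {
  return sizeBytes / bytesPerSecond / 3600;
}

// Bytes that fit through a link in the given number of hours.
function bytesMovedIn(hours, bytesPerSecond) {
  return hours * 3600 * bytesPerSecond;
}

const dslRate = 100e3;   // ~100 KB/sec home DSL uplink (from the text)
const cloudRate = 10e6;  // 10 MB/sec between servers (assumption)

// What scp's 12 hour estimate implies about the amount of data to copy.
const impliedBytes = bytesMovedIn(12, dslRate);
// The same copy over the assumed faster link, in minutes.
const cloudMinutes = transferHours(impliedBytes, cloudRate) * 60;

console.log(impliedBytes / 1e9, Math.round(cloudMinutes)); // prints: 4.32 7
```

So the estimate corresponds to a bit over 4 GB crawling through the DSL line, and the same copy would be a matter of minutes between two reasonably connected servers.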

The long and short is that after about 12 hours, the file was copied over, the index was regenerated and Phresheez was up again no worse for wear as far as I can tell. A very long day for me, and a bunch of unhappy Phresheez users.

Post Mortem

So here's what I learned out of all of this.

  • First and foremost, the speed of recovery was completely dependent on the speed of copying a backup to the production server. This needs to be dealt with some way. First might be copying the backups to usb flash and finding somebody with a fast upstream to be able to copy stuff to the production machine. Better would be to spend more money per month and put the slave on its own server in the cloud. But that costs money.
  • Large tables are not so good. I've heard this over and over, and have been uncomfortable about the GPS point table's size (~300M rows), but had been thinking about it more from a performance standpoint than a disaster standpoint. I've had a plan to shard that table, but wasn't planning on doing anything until the summer low season. However, since the downtime was purely a function of the size of the damaged table, this is really worth doing.

Disaster on the Cheap

The long and short of this is that when you have single points of failure, you get single points of failure. Duh. The real question is how to finesse this on the cheap. The first thing is that getting access to copy the backup over the net quickly would have cut the downtime by about an order of magnitude in this case. Sharding would have also cut the downtime significantly, and for that table really needs to be done anyway.

However, this is really just nibbling at the edges of what a "real" system should be. Had the disk been completely cratered, it would have required a complete rebuild of the server and its contents and it would have still been hours, though maybe not the 12 hours of downtime we suffered. Throwing some money at the problem could significantly reduce the downtime though. Moving the replication to another server in the cloud instead of on home DSL would help quite a bit because the net copy would take minutes at most.

A better solution would be to set up two identical systems where you can switch the slave to being a master on a moment's notice. The nominal cost is 2x-3x or more because of the cost of storing the daily backups -- disk space costs on servers. The slave could be scaled down for CPU/RAM, but that only reduces cost to a point. Another strategy could be to keep the current situation where I replicate to cheap storage at home and keep the long term backups there, but keep a second live replication on another server in the cloud. The advantage of this is that it's likely that a meltdown on the master doesn't affect the slave (as was the case above), so a quick shutdown of the cloud slave to get a backup, or switching it over to be a master would lead to much better uptime. Keeping the long term backups on mtcc.com just becomes the third part of triple redundancy and is only for complete nightmare scenarios.

Is it worth it? I'm not sure. It may be that just getting a fast way to upload backups is acceptable at this point. One thing to be said is that introducing complexity makes the system more prone to errors, and even catastrophic ones. I use replication because it would be unacceptable to have 2 hours of nightly downtime to do backups. However, mysql replication is, shall we say, sort of brittle and it still makes me nervous. Likewise, adding a bunch of automated complexity to the system increases the chances of a giant clusterfuck at the worst possible time. So I'm cautious for now -- what's the smallest and safest thing I can do to get my uptime after disaster into an acceptable zone? For now, that's finding a way to get backups onto that server pronto, and I'll think about the other costs/complexities before I rush headlong into it.

So What Happened to the File System Anyway?

At some level, shit just happens. It's not whether something will fail, but when it will fail and how quickly you can recover. But this is the second time in about a month or so that I've had problems that required fsck to come to the rescue. My provider had recently moved me to a new SAN because the previous one was oversubscribed. Did something happen in the xfer? Or is their SAN gear buggy leading to corruption? I dunno. All I know is that I haven't had any reboots of any kind since I moved to the new SAN so there shouldn't have been a problem unless it went back before the SAN move. I sort of doubt that the underlying Linux file system is the cause -- there's so much mileage on that software I'd be surprised. However, after fighting with my provider about horrible performance (100kb/second transfer for days on end) with their SANs and now this... I'm thinking very seriously about the options.