Tuesday, August 3, 2021

Some of the Things You Get Thrown into When First Hired

Drinking from the Fire Hose

 

So you're fresh out of school with a shiny new CS or Software Engineering degree, congratulations! You are totally unprepared for the reality of being a software engineer! Seriously, you are about to be thrown into a pit of despair and suffer imposter syndrome! You're going to have all of your algorithms at the tip of your tongue, your data structures and graph theory raring to go and... find out that almost all of it is rarely used because it's all implemented in libraries for the languages you use. Which isn't to say that it was useless, because you still need to know what you're looking for. But so disappointing.

This piece isn't even comprehensive, so in fact it's much worse than what I paint here. And if you're in some specialized field you'll have to learn that specialization on top of all of this, like if you're doing scientific or robotics programming. But all is not lost: despite this long, long list of things you're going to have to deal with, the vast majority of engineers make it, and even excel, because they're so excited after the grind of college to actually start using the knowledge for their craft.

Development 

Development Environments

Let's start with the physical development environment. Pre-pandemic, the standard issue space for engineers was to put them at banquet tables in row crops, pretty much shoulder to shoulder. The rationale was to "promote interaction and communication", but the reality is that the first thing people do is get headphones and turn them up high so they don't hear anything and are impossible to distract. The rationale never passed the sniff test, which lays bare why HR really does this: money. The pandemic has fortunately shown that the shoulder to shoulder physical "networking" that never happened in real life is trivially replaced with a longer wire, which is the way it actually happened anyway. Chatting across town is just the same as chatting two seats down. This is a win for everybody, but you will have to learn how to get into routines when you telecommute and how to deal with human interaction when it's needed. Fortunately, just about everybody is in the same boat figuring this out, so you actually have the advantage of not having any expectations.

Next up, they probably had you write some code while you were in school. You may already be writing your own code for your own projects. Maybe they specified the development environment, maybe they let you choose your own. If the former, they likely threw you into some fancy integrated development environment (IDE), which is often language or platform specific (think Eclipse for Java or Xcode for Apple). IDE's can be nice, but they aren't inevitable and they can too often become their own ends rather than a means to an end (see the section on Frameworks). Learning them and their idiosyncrasies can be time consuming. My experience is that they are often rife with bugs (looking at you, Xcode) and inexplicable behavior where googling is your only hope. And there are lots of them, making the likelihood of fuckery exponential.

While you're not always going to have a choice because of operating environments or company mandates, it's really good to have a baseline capability to edit code, compile or do whatever you need to run it, and be able to debug it, where debugging by printf is perfectly valid and good. There are tons of editors out there and it's religious which one is best, but vi, emacs, and others are all good choices. Emacs is, of course, the best because my god told me so and I believe Him.

Debugging

The ability to intelligently debug is an essential capability of any software engineer. They probably didn't teach you how to debug other than mentioning that debuggers exist for whatever languages they were using, and more likely not even that. There are a variety of ways to debug something, from simple use of printf and tailing logs to sophisticated debuggers with breakpoints, watchpoints, the ability to look up variables symbolically with arbitrary expressions, etc. For things like the web, there are built in application specific debuggers that make traversing the DOM really easy.
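
Here is a minimal sketch of both ends of that spectrum in Python (the function and values are invented for illustration): printf-style output via the logging module, and a spot where breakpoint() would drop you into the interactive debugger.

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger(__name__)

    def average(values):
        # printf-style debugging: cheap, ugly, and perfectly valid
        log.debug("average() called with %r", values)
        if not values:
            # breakpoint() drops you into pdb right here so you can poke at
            # locals, walk the stack, and figure out who passed in junk
            breakpoint()
        return sum(values) / len(values)

    print(average([1, 2, 3]))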

Debugging is much akin to the Scientific Method. First you find out that something is misbehaving, so you make a hypothesis about what might be going on. You then create experiments to test your hypothesis, and rinse and repeat until you have a hypothesis that meets the observations. You can then attempt to fix the problem, which further confirms your hypothesis. Code review is really nothing less than peer review in the Scientific Method, where outsiders can throw darts when the fix looks like a Rube Goldberg contraption that fundamentally misses the core of the problem, and what you have is a bandaid, not a fix.

You will learn about one of the deadliest of all bugs: the Heisenbug. Like the Heisenberg Uncertainty Principle, which states that you can't know a particle's position and momentum at the same time with accuracy, a Heisenbug similarly vanishes when an attempt to observe it is made. It is maddening and especially common in multi-threaded code with race conditions.

A cousin of the Heisenbug is the Schrödinbug. While hunting for an obscure and intermittent bug you finally find the cause and how to fix it. Unfortunately for you, you've caused the bug's wave function to collapse: the cat died and all hell breaks loose until the patch is applied. Once you discover that the code should never have worked at all, it's curtains for the kitty.

Learning Languages

It really annoys me and is a peeve that companies hire for $LANGUAGE programmers. The reality is that languages are tools and frankly the differences are mostly angels on a pinhead. You will learn languages over your life (see Shiny below) and they will change and evolve whether they need to or not (see For Its Own Sake below).

It's not to say that language differences are superficial, but a lot of them are. Far more important, in my opinion, is the richness and consistency of the libraries that you can access easily from them. Some languages get traction with new stuff going on and become the go-to. A current example is machine learning and Python. I haven't checked for sure, but I doubt there is anything inherent in Python that makes it good for ML; it probably just sort of happened. If you have a not-run-of-the-mill problem, libraries and other goodies should be a big consideration in choosing a language.

Last, there are some important considerations for language choice. Memory management with garbage collection or not. Object oriented or not. Raw access to hardware features or not. Typed or not. They all have their tradeoffs and, as always, there ain't no such thing as a free lunch. My experience is that most tasks don't require many resources, so it's far more important to optimize for the speed of coding and maintainability -- not the speed of the code. Even if you end up having a scary inner loop, you can often architect it to drop into low level code which is controlled by a higher level language, either as a native extension or some other means.
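
As a sketch of what "drop into low level code" can look like, here's Python's stdlib ctypes calling straight into the C math library. The library lookup assumes a typical Unix-ish box, so treat the specifics as illustrative rather than gospel.

    import ctypes
    import ctypes.util

    # Find and load the C math library (name resolution is platform dependent).
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Tell ctypes the C signature: double cos(double)
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))  # 1.0, computed by the C library, not by Python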

API's and Libraries

API's and libraries are the lifeblood of writing code. When you're learning a new language, a lot of your time is going to be spent hunting down how to not reinvent the wheel. No seriously, you don't need to reinvent malloc (though I have). You need to develop the skills to find things so that you can concentrate on whatever the problem at hand is. When you're new to the scene you're probably going to be given a language to work with, which will imprint on you like a baby bird. Mama bird is goodness and never wrong. There can only be one best mama bird and her ways will always be the one true way. Those of us who have been around a long time see mama bird -- and the whole flock behind her which looks for all intents and purposes the same.

That's not to say that all API's are of similar quality, of course. The C runtime library is sort of a mess and definitely shows its age, which is why things like PHP and Perl, which slavishly copied it, really missed their chance. But once you get to standardized facilities like, oh, say, hashes, it really doesn't make a lot of difference whether they are called a dictionary in Python or an Object in JavaScript: they all pretty much do the same thing, albeit with different interfaces and/or syntax. Your job is to recognize these patterns and then go hunt for the equivalent in whatever the base API's are for your language.
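
The pattern really is the same everywhere; here it is spelled in Python, and the JavaScript or Ruby version differs only in syntax:

    # The universal "hash" pattern: map a key to a value, look it up later.
    ages = {}                  # a dict in Python; an Object or Map in JavaScript
    ages["alice"] = 42         # insert
    print(ages.get("bob", 0))  # lookup with a default -> 0
    print("alice" in ages)     # membership test -> True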

API's can also be used to define calls over the net for various services. In this case you are not required to make the API represent a Remote Procedure Call (RPC), and frankly those seem to have gone out of fashion (buh-bye SOAP), but it is conceptually the same as pushing parameters onto a runtime stack to call a method. With both, you'll learn that what is needed is protocol agreement. That is, the thing making the call and the thing interpreting the call must agree on what parameters are there, what is optional, and so on.

Network based API calls may or may not come with language specific wrappers that hide some of the messiness. You should probably take some time to see what's going on under the hood, at least for a few of them, just so it's not so mysterious, because some day you might be called on to design one yourself, as in the next section.
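
To make "under the hood" concrete, here's a hedged sketch of what those wrappers do for you, using nothing but Python's stdlib; the endpoint is hypothetical, so substitute a real one.

    import json
    import urllib.request

    # What a client library does for you: build the request, agree on the
    # protocol (here JSON over HTTP), send it, and parse the reply.
    url = "https://api.example.com/v1/widgets?limit=10"   # hypothetical endpoint
    req = urllib.request.Request(url, headers={"Accept": "application/json"})

    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    print(data)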

Writing Libraries and/or API's 

Hopefully this is not one of your first tasks, because designing libraries and API's is really an art as much as a piece of technical competence. There are a lot of variables and considerations in designing them. Designing them upfront is usually the easy part because you have some functionality that you want to expose that is purpose built for a given task. Great! That was easy. You most assuredly fucked up.

Requirements change. Features are added. You get more and more users using your API. How easy is it to modify the API? Can you make breaking changes? Can you deprecate things that were fuck ups? How do you design API's that are more resistant to breaking changes? How do you go about deprecating mistakes and/or obsolete features? How much churn for users is acceptable?
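
One tiny, language level sketch of hedging against breaking changes (the function is invented for illustration): grow an API through optional, defaulted parameters so existing callers keep working.

    # Version 1 shipped as: def fetch_report(user_id): ...
    # New requirements arrive. Keyword arguments with defaults mean every
    # existing call site keeps working; new behavior is opt-in.
    def fetch_report(user_id, *, format="json", include_archived=False):
        report = {"user": user_id, "format": format}
        if include_archived:
            report["archived"] = True   # stand-in for the new behavior
        return report

    print(fetch_report(42))                  # old callers untouched
    print(fetch_report(42, format="csv"))    # new capability, explicitly requested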

I'm not going to say what makes a great API designer because I'm not an expert at it, and when I've had the chance I like to beg, borrow, or steal from existing API's. But the one piece of advice I'd give is to be very reluctant to expose a net facing API for as long as possible, and to make certain that it makes sense from a business standpoint. Maintaining code that is used by thousands of sites and that started out as "gee, it would be fun for my programmer friends to be able to play with this" will make you very sorry you didn't think it through.

Algorithms

The truth of the matter is that the vast majority of programming is mundane. For any one project, there are likely to be only one or two interesting algorithms. If you're at a place with more senior engineers, that algorithm is not going to have your name on it. I've been lucky to have been able to design some really interesting algorithms at a young age, but I was working at startups, at one of which I was literally the only software engineer. Sometimes the sharks aren't hungry when you're thrown into that shark tank, but these days VC money is usually not naive on that front, and if they are, they view it as a lottery ticket which will get re-engineered if needed.

What you can do is find those key algorithms and find out why they are key. Was it implemented well? What are its advantages? What are its deficiencies? Has it been optimized? Does it really need to be optimized to make an operational difference? If you think it could be improved and it will make a positive difference, should you bring it up with whoever maintains it? These sorts of things are usually somebody's baby and you're about to call it ugly. You have to learn to be tactful. If you can prove your changes in reality rather than theory, that bolsters your case.

"Hey, I've been trying to understand $ALGORITHM and have been playing with it on my own. Here are some things I hacked on and made it $X percent faster is this reasonable or am I missing something?"

Frameworks

Ok, I'll be right up front: I am a framework skeptic. See my section on For Its Own Sake for one of the big reasons. Like designing a computer language or writing an operating system, it is often a life goal for every self-appointed hot shot engineer to design a framework to do something. Every other framework that has come before it is shit and Only I Can Save You. Sorry, you're not, and the chances that your framework is anything beyond mediocre are vanishingly small. Unless your entire existence is wrapped up in your framework and evangelizing it, your chances of it being important are pretty much zero.

As for frameworks themselves, they are far too often Procrustean. The author has a view of the world and the only way to salvation is to view it that way too. Rails shamelessly advertises itself as having that attitude, but the fact of the matter is that they all have that attitude even if it's not stated. Frameworks get old and creaky, often victims of their own success, without acknowledging their shortcomings. New frameworks are a dime a dozen, usually riddled with bugs and poor design, and in the end they don't solve the problem any better than what they are trying to replace.

To keep whipping Rails, people went oooh-ahhh when a single command could generate a web site with all of the CRUD operations generated from templates connected to database tables. Nobody had the presence of mind to ask who would use such a web site. In the early 80's, Ingres was a relational database which had a front end program called QBF (Query By Form). Rails is essentially QBF 40 years removed. And thus, without asking those basic questions, an entire generation of programmers started using Rails, only to find that that is not how real web sites are designed.

The flip side of frameworks is their users, which for young engineers leads us to the next section...

Shiny

Shiny is a subspecies of Fear of Missing Out (FOMO). Young engineers are completely convinced that most senior engineers are complete idiots who are stuck in their ways, and that if only they had youth and vigor they would be able to appreciate the sheer beauty and worth of $SHINY. Of course it's going to revolutionize everything. I mean, they say so themselves! I'm reminded of AWS Lambda when it first came out. What is this, I asked? Investigate a little: oh, new age batch jobs. Oh, and Docker, what is this? Investigate a little: oh, new age time sharing. For the most part nothing is new under the sun and it's all been done before. The canonical trap that young engineers spring with Shiny is lock in. Somebody isn't giving you this wondrous new miracle for the good of humanity. They are far too often trying to lock you into their walled garden. Shiny is almost always the enemy and should always be viewed with extreme suspicion. We didn't get these grey hairs for nothing.

A subsection of Shiny is language-isms. Lots of languages like to generate new and shiny ways to code something up that are completely idiosyncratic to that particular language and impenetrable to somebody not as familiar with the language, or even to people who are very familiar but are not caught up in the desire for Shiny. If you can design something with relatively language independent constructs and it doesn't materially hurt performance goals, it is far preferable to do that from a maintenance standpoint. Not all engineers have the same level of language arcana, and even if they don't need to fix something, they might need to look under the hood at how it works for maybe a similar problem. Don't be that dick who makes it impenetrable gratuitously.

For Its Own Sake 

All things fill to available capacity. It's the law of the land. Something similar happens with software projects: they don't know when they are done. They can't know when they are done, because that admits there is actually an end state, which is tantamount to defeat among all of the similar projects who can't know they are done for the same reason.

Like Shiny, newer should not be taken as better on its face. If something is working well for your purposes and has good bug and security patching, there isn't a lot of motivation for upgrading for the sake of upgrading. Upgrades cause churn and either create bugs or expose bugs. The latter is OK, but the former is not worth it unless there is good motivation. 

Interacting with the OS 

At this point the world looks pretty Unix-y. I don't know what the percentages are for servers in the backend, but Linux has to be dominant. For the front end you basically have three choices: Web, iOS (Unix), and Android (Linux). Windows as an OS is not terribly relevant since writing native apps for it is pretty stagnant. Yes, laptops and desktops are still overwhelmingly Windows, but that doesn't mean it has a lot of relevance to you as a new programmer. Linux and its distros are generally free and you are free to pick and choose. Windows is a business model with walled gardens they are enticing you to go into. It's best to stay out. Same goes for other walled gardens like AWS.

So you'll need to learn the basics of how Unix OS calls work. Much of that is abstracted away in higher level languages where you can open a URL as easily as you can open a local file, but a lot of the API's in the OS parts of languages are patterned after Unix system calls and Berkeley sockets. You don't need to go crazy understanding every system call -- mmap and brk are probably not going to be thrown at you any time soon -- but open/close/read/write/lseek may be. Unless you're writing relatively low level code, you're probably not going to need to know much about how signals work, but if they gave you an intro hardware architecture course: they are hardware interrupts translated into user space.
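
Higher level languages tend to wrap those calls pretty thinly. A quick sketch with Python's os module (POSIX assumed), which maps nearly one-to-one onto the system calls of the same names:

    import os

    # os.open/read/write/lseek/close are thin wrappers over the Unix system
    # calls -- no buffering, just file descriptors.
    fd = os.open("/tmp/demo.txt", os.O_CREAT | os.O_RDWR, 0o644)
    os.write(fd, b"hello, syscalls\n")
    os.lseek(fd, 0, os.SEEK_SET)   # rewind to the start
    print(os.read(fd, 100))        # b'hello, syscalls\n'
    os.close(fd)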

Mostly you need to get familiar with the basics of the OS itself, but just as importantly you need to get familiar with all of the utilities on the command line. It's OK to not know how to use find(1) off the top of your head because man(1) is there to help you. Using locate(1) to find files, learning how to redirect output so you can show somebody else that something hosed is going on... all of these things are going to be a daily part of your job. You're going to have to learn them quickly because they didn't teach you any of this in school.

Networking in Reality

They probably taught you about the OSI network stack. Unless you're a networking geek like me, that's probably about the amount you need to understand the plumbing that goes on after your bits leave your program. But there really is much more to it than that within your app. The world is pretty much clients and servers, with clients doing CRUD'ful operations and servers serving them up. You also need to learn about things like WebSockets where, unlike the client server paradigm, servers push data out proactively to clients. Think chat clients. You'll also need to get an understanding of how to distribute load for server pushes, so you'll end up needing to understand message buses like RabbitMQ.

If you are doing anything remotely web related you have to learn about Ajax calls (aka XMLHttpRequest). They are a fundamental part of creating a modern interactive web app. My personal favorite design pattern is to have a skinny backend which has two purposes: serving up data and performing access control. The front end takes the raw data and builds the UI. Others may like more in the backend, but fundamentally it's going to be a mix at the very least. The days of static web sites are long gone.
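
A minimal sketch of the "skinny backend" half of that pattern, using only Python's stdlib; access control is omitted and the data is made up. The front end would fetch /api/items with Ajax and build whatever UI it likes out of it.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # A toy skinny backend: it only serves data. Building the UI out of that
    # data is the front end's job.
    class ApiHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/api/items":
                body = json.dumps([{"id": 1, "name": "widget"}]).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), ApiHandler).serve_forever()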

If you want to add audio and video, and especially conferencing, you're going to have to learn how all of that is done. You don't necessarily need to know the nitty gritty of RTP transporting the output of codecs, but it is useful to know that you have those tools at your disposal if they are appropriate for an app. You might also need to know that integrating a point to point conference is cheap and easy, but if you need audio mixers and video muxes it becomes a more costly endeavor.

Bottom line is that the world is a network and it's the way you deliver data from source to destination.

Optimization

There are 10 types of people in the world: those who have heard Knuth's quote about optimization, and those who haven't. Knuth's quote is:

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."

If you don't even know who Donald Knuth is, your degree is revoked. While it's good to not write gratuitously gross algorithms, the truth is that most code does not need to scale to anything significant. Your main job is to just get something up and running, not create a master's thesis. O(N) is just fine for the vast majority of in-memory searches.

My general philosophy is to get something up and running and figure out where the hot spots are later. This is especially relevant with the unending changing requirements and scope creep. The thing you were assigned in the first place may end up looking nothing like what it ultimately needs to do in the end. Any optimization you do is very likely to be wasted time. Instead of spending a lot of time optimizing, spend more time making your code easier to refactor. That is, always hedge your bets that code might need to change in ways that reflect new requirements. One way to do this is to write pluggable architectures. Don't go crazy though. Only do it for the highest level stuff.
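
Here's one cheap flavor of "pluggable", sketched in Python with invented names: a registry of handlers, so bolting on new behavior later doesn't mean rewriting the dispatch code.

    # A registry of export formats: a new requirement becomes a new entry,
    # not a rewrite of everything that calls export().
    import json

    EXPORTERS = {}

    def exporter(name):
        def register(fn):
            EXPORTERS[name] = fn
            return fn
        return register

    @exporter("csv")
    def to_csv(rows):
        return "\n".join(",".join(map(str, r)) for r in rows)

    @exporter("json")
    def to_json(rows):
        return json.dumps(rows)

    def export(rows, fmt):
        return EXPORTERS[fmt](rows)   # dispatch never changes as formats are added

    print(export([[1, 2], [3, 4]], "csv"))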

For core mechanisms and key algorithms some amount of optimization is fine, mainly to prove that it can be optimized further if needed. Optimization follows an exponential curve with efficiency on X and effort on Y. Sometimes companies will pay millions of dollars to optimize algorithms whose remaining gains are asymptotically close to zero, like fintech trading, but those cases are exceedingly rare.

Performance Profiling

So you've finally been forced to face the optimizing you assiduously avoided by reading the previous section and postponing it. Optimizing is something of a science and an art. It's both getting concrete data about what is going on, and then often re-imagining how the same requirements can be met in better ways.

There are really two different kinds of profiling: ad hoc, where you are pretty sure you know what's running hot, and using the profiling software provided by the distro or the language you're using. The former is easier to grok because you know what you're looking for and are trying to confirm or deny that it is part of the problem. The big problem with it is that you can be wrong in your conclusion about the problem and instead just nibble at the edges.

Bring in performance profilers. Profilers usually work by setting up a hardware clock interrupt and sampling what's running to produce a histogram of where the program has been and how much time it's spending there. The classic for C programs is gprof. Various higher level languages have bindings for their own profilers too, which is necessary since something like gprof doesn't understand, say, Python's internal symbol table or anything like that. They are no panacea. Often they can be misleading because they show core functions getting hit, but not who is calling them. Some may have the ability to record farther up the call stack, but usually at first blush it's just the active call that is recorded. And then of course profilers are invasive, so if you have any kind of real time quality to your code, you get a Heisenprof to contend with too. Last, for upper level languages, their profilers can be fiddly and not terribly well supported. They are often idiosyncratic in my experience and not all that easy to understand.
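
For what it's worth, Python's built in profiler is about as easy as they come; a hedged sketch (the workload is made up, and cProfile is a deterministic tracer rather than a sampling profiler, so the invasiveness point applies doubly):

    import cProfile
    import pstats

    def slow_square_sum(n):
        # deliberately dumb hot spot for the profiler to find
        return sum(i * i for i in range(n))

    def work():
        for _ in range(50):
            slow_square_sum(100_000)

    profiler = cProfile.Profile()
    profiler.enable()
    work()
    profiler.disable()

    # Sort by cumulative time so callers show up above the functions they call.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)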

Fortunately profilers are not needed or even useful for a major source of poor performance: database interactions. The curse of trying to map an object model onto a relational database is the source of all kinds of gigantic fails. Remember that "wow, I pressed enter on 'rails new foo' and I now have a web app"? Uh, yeah. Rails is hardly the only perp here, but it is very representative of the mindset of not thinking about what is happening in the underlying database. Out of sight, out of mind. Except when it matters.

It is essential that you understand how the underlying database works, and how to see whether the ORM (Object-Relational Mapping) is making good decisions about mapping objects onto the database. This is true of *any* database type, not just SQL. By way of example, to show that even I, who have been around relational databases for 4 decades, can get snookered too: the simple way with Rails is to create models, which are really just a layer on top of database tables, and let the ORM deal with the joins. I had a small table with the definitions of particular items, which I'd join with another table that needed those definitions. This saves space in the other table since otherwise it's just endlessly repeated data.

The problem is that the other table got really, really big and it was indexed using a B-tree. B-trees, as you know full well with your shiny new degree, are O(log n). So looking up a key in the big table may involve reading many, many intermediate nodes of the tree at disk speeds (even with flash, it's still bad). I might be getting exactly how the problem manifested wrong because it's been many years, but the gist of it is that joins are not always your friend. Sometimes data replication is far and away the better choice.
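
One way to keep yourself honest is to ask the database what it actually plans to do rather than trusting the ORM. A toy sketch with sqlite3 from the Python stdlib (the tables are invented, and real databases have their own flavors of EXPLAIN):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE defs (def_id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE events (id INTEGER PRIMARY KEY, def_id INTEGER, payload TEXT);
        CREATE INDEX events_def ON events(def_id);
    """)

    # Ask what the join the ORM writes for you will actually cost.
    plan = con.execute("""
        EXPLAIN QUERY PLAN
        SELECT e.payload, d.label
        FROM events e JOIN defs d ON e.def_id = d.def_id
    """).fetchall()
    for row in plan:
        print(row)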

I'll end here by saying that you should always be suspicious of tools that make things "easy" as you scale up. You should also have the humility to say that you have no clue what is causing something to be slow and that it will take some time to investigate. Last, know when to say good is good enough. It's extremely tempting, once you've opened the hood, to tinker with it for a month, weeks longer than it really needed to be tinkered with. Stop while you're ahead.

Databases

Databases are the beating heart of almost all applications. As shown above, they can also make or break an app when used incorrectly or carelessly. While nobody is saying that you should be a full blown DBA who knows every nook and cranny of SQL arcana, it is good to understand the basics of what they do and why they do it.

There has been a somewhat recent backlash against SQL, like in the last decade, with NoSQL databases. This is by and large Shiny in my opinion. In the vast majority of cases you do not need Cassandra and its ilk, and using it is pretty much showing that you have no clue and are just following trends. Likewise things like MongoDB. If you don't understand ACID and its tradeoffs, you are doomed to repeat every mistake that comes from being ignorant of it. Databases are really simple and fast when you don't support the hard things that slow them down.

Here's the thing though: it's not like relational database vendors are blind and can't see the good parts of current trends. Mongo pioneered storing JSON blobs and querying based on them. Postgres saw that and went "great idea, we can do that too!". I don't even know at this point if Mongo can be ACID compliant (I imagine it is now), but they had to retrofit it, while things like Postgres have had it for decades and know how to optimize when the various aspects of ACID are not needed.

The other thing you need to learn is that databases are pretty much forever. Once you make the decision to use one, you are going to be saddled with it on a project pretty much for life. Consider the problem you'd face: trying to migrate between databases that, even if they are both SQL, have different extensions which are incompatible. Multiply this by fail if one or more are different kinds of database. Now consider that you are making this change while your whole system is running in a world that doesn't sleep.

Lock-in is multiplied by a zillion when it's a database in a hardware walled garden like AWS. Not only are you locked into a database for life, now you're locked into a hardware platform too. You are fucked squared. Don't do that.

Your biggest take away should be that you are not a database expert just because you have a new CS degree and that generations of people studying this vital problem are not idiots. There are places for specialized databases where scaling is extreme but the likelihood you have or will have that problem is almost zero. Be extremely conservative.

Threading and Concurrency with Cores

Like a lot of things, parallelization is as much an art as it is a skill. When I was young I found out that I had a somewhat uncanny ability to visualize concurrency, and especially race conditions, to the point that the engineers at the company we were contracting for were amazed. I don't really know if it's ingrained or not, but it is something you'll need to at least be aware of if you're using threads and other parallel programming.

They no doubt taught you about mutual exclusion, but it's trickier to figure out whether it's needed or not. Mutexes are definitely not free, so you want to avoid using them if at all possible, but if you miss one that is needed, prepare to weep because it will probably arrive as a Heisenbug. So you're going to need to be able to visualize the situations where one thread can mash on common data, causing another thread to puke.
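
The classic lost-update case, sketched in Python. The unlocked version may or may not come up short on any given run -- which is exactly the Heisenbug quality mentioned above -- while the locked version is always exact.

    import threading

    counter = 0
    lock = threading.Lock()

    def racy_add(n):
        # Read-modify-write with no lock: another thread can sneak in between
        # the read and the write, and one of the updates gets clobbered.
        global counter
        for _ in range(n):
            tmp = counter
            counter = tmp + 1

    def safe_add(n):
        global counter
        for _ in range(n):
            with lock:           # mutual exclusion around the critical section
                counter += 1

    def run(worker, n=100_000, nthreads=8):
        global counter
        counter = 0
        threads = [threading.Thread(target=worker, args=(n,)) for _ in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter

    print("expected:", 8 * 100_000)
    print("unlocked:", run(racy_add))   # often short, sometimes not: a Heisenbug
    print("locked:  ", run(safe_add))   # always exact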

One last thing, since I don't want to belabor this much: there is an unfortunate misconception that throwing cores at the problem with multiple threads will solve everything. If you are using an interpreted language, think again: you need to understand the Global Interpreter Lock (GIL). With almost all higher level languages, there is lots of code that is not reentrant and thus not thread safe. Languages get around this by locking the interpreter when they need to execute that code. The net effect is that all of those nice and fancy cores you are paying more for by the minute cannot be used except in cases where a thread blocks. There are quickly diminishing returns on hardware cores in the face of a GIL. There are some languages which don't have a GIL but they are few and far between. The answer for the most part is to just have a bunch of processes with different interpreter instances. Sorry threading, you are no panacea.
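
You can see both halves of that with a CPU-bound toy workload: threads buy you roughly nothing because of the GIL, while separate interpreter processes actually use the cores (a sketch; exact numbers depend on your machine and Python build).

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        # Pure-Python CPU work; it holds the GIL the entire time.
        total = 0
        for i in range(n):
            total += i * i
        return total

    def timed(executor_cls, workers=4, n=2_000_000):
        start = time.perf_counter()
        with executor_cls(max_workers=workers) as ex:
            list(ex.map(burn, [n] * workers))
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial: the GIL serializes it
        print("processes:", timed(ProcessPoolExecutor))  # scales with cores, minus fork overhead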

Top Down/Bottom Up/Middle Out

Now we come to development styles. Different organizations or even dev groups are going to have different styles. Sometimes that will, out of necessity, be strictly enforced. You don't want your space telescope, launched to a Lagrange point, to have coders haphazardly playing around and fixing bugs as needed. I personally have never understood bottom up, but there are people who work that way, including a friend of mine, and I found it pretty maddening.

Top down is also necessary in situations where you outsource writing code. If you get into a situation where some suit has the bright idea that they can save money by outsourcing writing code, the first thing you learn is that outsourcers will write exactly what you asked for and nothing more. They won't hedge bets, build in future proofing, or prepare for obvious new features. Nothing. They write to the spec and that is that. If you are bad at writing specs, you're fucked and it's your fault. If you have changes in requirements, as always happens, or there are ambiguities in the requirements, you pay. If they are significant you may pay a lot. I personally would not want to do this, but there may be a time when you have to. The key takeaway is that they are a business trying to get your money.

My personal favorite way to develop is middle-out. I often have an idea that sounds interesting but I'm not sure exactly how it will play out. I don't have all of the requirements and am not entirely sure what it is that I'm going for. So I build quick prototypes and see what happens, not paying much attention to code cleanliness or speed or anything else, just trying to understand the problem space. This lends itself well to rapid prototyping, especially if you need to do a sell job to management to show where your head is at and why what you're working on is useful. If it's not evident, I'm obviously a fan of the 20% own-time kind of thing that Google and others have.

Refactoring

Middle-out programming explicitly relies on refactoring as a strategy: you find out that something is worthwhile and then you refactor it to clean it up and make it real. Refactoring is the process of looking at all of the moving parts and seeing how they interact with each other. You often find that parts are reusable in ways that you weren't thinking of when you originally designed a piece of code. This isn't a failing, and if there is a failing to be had, it is on the side of over-generalizing things that are not in fact general. In my opinion, the more private methods you have, the better. Methods should only be promoted to public when there is a clear need, because public implies support. If you have a public method, others have the right to bitch if it doesn't work correctly for their use.

Refactoring is just a way of life with writing code. You'll be doing it often and it is inevitable. But there is good refactoring and bad refactoring. Good refactoring is like washing your car to get rid of the gunk that accumulates over time. Bad refactoring is like finally getting around to fixing your car after three wheels have fallen off. Be the former and treat your car well.

Compartmentalization   

Everybody is taught about modularity, but in reality it is a skill you have to learn on the fly. Like most things, there is a happy medium. Lots of young programmers make a zillion and 7 modules or classes that have one or two methods and that's it. Since they most often end up in separate files, it makes it a pain to search for them. Many methods are really purpose built support methods for a public method. The likelihood of their reuse is minimal, and it can be detrimental if you have to hunt down who is using a method when you want to change its functionality.

In my opinion, methods should be private until proven otherwise. DRY (don't repeat yourself) is nice in principle, but it can be taken to extremes where instead of a nice purpose built helper function you have this rusty Swiss army knife with blood all over it from people trying to use it. If a method has a shitload of mode and flag qualifiers, it's probably that knife you'd sooner avoid.
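
A small sketch of the difference, with invented names. In Python the leading underscore is the "private until proven otherwise" convention: it tells readers the helper isn't supported for outside use.

    # The rusty Swiss army knife: one "reusable" function, four flags deep.
    # Every caller pays for every mode, and every change risks all of them.
    def render(data, as_json=False, pretty=False, upper=False, skip_empty=False):
        ...

    # Purpose-built, private-by-convention helpers instead:
    import json

    def _render_json(data):
        return json.dumps(data, indent=2)

    def _render_text(data):
        return "\n".join(str(item) for item in data if item)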

On top of that, short little functions/methods likely make it harder for things like Clang and GCC to do loop unrolling. Maybe they can do this across modules/classes, but they most likely have to be more careful. I don't know how much loop unrolling various interpreted languages do, but it's probably even harder. The main point here is not to go crazy with DRY as a unifying principle.

Tools

Building tools is an often overlooked essential skill. Lots of programming is repetitive, where you need to monitor something or generate output to be munched on to get stats. Basically anything and everything. When I was a young engineer I was the lone software engineer in one company supporting dozens of engineers in another company with a hardware product (a laser printer) they had contracted us to build. They had no experience with embedded systems and were getting a crash course on how you write for and debug them on the fly. This was before email across sites, and I lived an hour away from them, so I didn't go down and visit often.

I had the idea that I needed a debugger for the hardware for my own purposes. I decided to make it pretty fancy so that I could view variables symbolically, set break points, do profiling, etc. I thought that this was pretty neat from a feather in cap standpoint, but frankly in hindsight it is probably the single most important thing I did that caused the project to succeed. The reason is that it gave our client's engineers a familiar looking way to run and debug their code in something that was otherwise totally alien. 

The moral of the story is don't underestimate how hugely important tools can be. It's easy to get wrapped up with tools as their own ends, but most companies are tool-poor not tool-rich. 

Googling and Other Scrounging 

So you freshly have your degree and can spout every algorithm and data structure with pinpoint accuracy to the recruiters in the hiring process. You land your job and proceed to use that knowledge to write your version of all of those algorithms. They then fire you. Why did that happen? Because wheel reinvention is a waste of time, and almost certainly your implementation is going to suck in comparison to somebody else's, who probably wrote a master's thesis on it decades ago. So why do they ask you those questions in interviews? Because you are green and they are lazy.

The reality is that Google-fu and being able to scrounge on the net is the way actual research is done these days. Stack Overflow is not just a handy site on the net for programmers, it is the expected way you'll find answers to your questions. You have a bizarre error message that makes no sense? Google it verbatim and see who else was stumped. Google-fu is its own skill and you need to learn it. Figuring out the right incantation can be a dark art in many cases, but the more you work on it, the better you'll get.

UI Design

You are not a UI designer, you say. That is for somebody else, and never shall your estimable engineering hands be soiled with such dirty, inconsequential details. You are in for a rude surprise. Nobody says you have to be an ace graphic designer with taste right out of Italian fashion houses, but everything has input that needs to be ground on and then shipped out. You may not need to do the actual window dressing itself, but you may need to make a first approximation that is so hideous that actual UI designers can't wait to get rid of your affront to their design sensibilities.

Frankly, everybody should know the basics of HTML layout and some CSS. Even if that's not your day job, you often come in contact with the need for internal tools. Far too often internal IT is not going to fund one of their folks to do this work and then -- more importantly -- maintain it. And your UI designers aren't wasting their limited time making your tool look better than the turd it is. So it's going to be up to you, and you're going to have to learn this on the fly too.

Security as Actually Practiced 

Interacting with the World 

Network Security Basics

OS Security Basics  

Web Security

Logging In

Role Based Access

Permission Based Access

Figuring Out Security Requirements

Testing

Testing Along the Way

Unit Testing

Integration Testing

Regression Testing

Interacting with DevTest

Working with Teams

The Mythical Man Month

If you haven't read The Mythical Man Month, drop everything and read it. Twice. In a nutshell: the best way to make a late software project later is to add more people to it. Fred Brooks' (RIP) insights are precious and hard won on his part. It's from the days of mainframe computers in the 70's, but it is every bit as applicable today as it was back then. There are many other gems in it, including The Second System Effect and No Silver Bullet.

Interacting with Others

Source Control

Requirement Gathering 

You will find that when it comes to defining problems, everybody is a software architect. That's especially true of marketing types, sales engineers, etc. They can make your life miserable because they create an architecture that isn't right for the job or doesn't actually solve the customer's problem. You must train them to tell you what they want, not how to design it. Customers are not immune to this either, so it can be a delicate dance. It's an important one though, because you are the one who has access to the internals and the architecture and all of the subtle tradeoffs that entails.
 
The other problem with this is that customers often don't have a very good sense of what the larger problem they are pointing out is. Often there is a kernel of "this is what I want" that wants to be expanded and rationalized rather than patched and hacked. 

Bug Management

Feature Management 

Constantly Changing Requirements

Meetings

White Boarding

The Interrupt Stack 

Development Process 


Different Process Types 

Waterfall, Agile

Time Management

Estimating Time

Working at Software Sausage Factories

Working at Startups

What Dev Managers Are For

Requirements and the Debugging Thereof

Deployment

Build Management

Packaging and Release Management

Provision Servers, etc

Monitoring

Scaling Services 

Fire Drills 



Sunday, April 4, 2021

Quic: the Elephant in the Room

Elephants should be here, not in rooms

[foreword: I revised this several times, expanding my thoughts and working on getting packet sizes and counts correct, but it's quite possible I've made some mistakes in the process.]

I was recently thinking about Quic, the combined TLS and transport protocol that Google initially designed to streamline session start up, along with a wish list of other improvements to better target web traffic needs. The motivation is mainly latency and the number of round trips needed to start flowing the underlying HTTP traffic. While Quic certainly does that and is an improvement over the strict layering with TCP, looking at this from an outside perspective (I am no TLS expert), the biggest source of latency at startup seems to be sending the server certificates themselves. I looked at my certs and the full chain pem file (ie, my cert plus the signer's cert) is about 3500 bytes. I tried gzip'ing it and it made some difference, but it was still about 2500 bytes all said and done, and TLS doesn't seem to be doing that anyway. So that's a minimum of three MTU sized packets just for the credentials and one MTU'ish sized packet for the ClientHello. While the cert packets are sent in parallel subject to the congestion window like TCP, they are still MTU sized packets which have the latency that Google was trying to get rid of. One curious thing I noticed is that Wireshark seemingly said that it was using IP fragmentation which, if true, is really a Bad Thing. I sure hope that Wireshark got that wrong.
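
The back-of-the-envelope part is easy to reproduce. Here's a hedged sketch with Python's stdlib; note that ssl.get_server_certificate only returns the leaf certificate, not the whole chain, so treat the numbers as a lower bound.

    import gzip
    import ssl

    # Grab the leaf certificate (PEM) from a live server and see what it
    # weighs on the wire; the full chain will be bigger still.
    pem = ssl.get_server_certificate(("example.com", 443))
    raw = pem.encode()
    print("PEM bytes:   ", len(raw))
    print("gzip'd bytes:", len(gzip.compress(raw)))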

If I understand Quic correctly, they basically got rid of the TCP handshake and used the TLS handshake instead, since it's a 3 way handshake too. So the flow goes sort of like this:

 

  • DNS A/AAAA Lookup ->
  • DNS Response A/AAAA <-
  • ClientHello+Padding ->
  • ServerHello+QuicHandshake1 (cert) <-
  • QuicHandshake2 (cert cont) <-
  • QuicHandshake3 (cert cont) <-
  • QuickHandshake (finish) ->

So in all the server is sending ~3 MTU sized packets. This is on the assumption that they are sending PEM, which might not be a good assumption as they could be sending the straight binary X.509, but from the looks of it on Wireshark they're just sending PEM. I'm assuming that the ClientHello is small, but I read that there are issues with reflection attacks so they are relatively large. Assuming I read that right, it's about 1200 bytes each for the client and server Hello, so all told 4 ~MTU sized packets and a small client finished handshake packet. So in bytes, we have about 1300+1500+1500+1000+100 which is ~5400 bytes.

Getting Rid of Certificates using DNS

What occurs to me is that if they weren't using certificates it could be much more compact. The rule for the reflection attack is that the server should send no more than 3 times the ClientHello packet size. Suppose instead of using certificates we used something like DANE (RFC 6698) or a DKIM (RFC 4871) selector-like method:

  • DNS A/AAAA Lookup ->
  • [ DNS TLSA Lookup -> ]
  • DNS A/AAAA Response <-
  • ClientHello+Padding ->
  • ServerHello+QuicHandshake <-
  • [ DNS TLSA Response <- ]
  • QuicHandshake (finish) ->
The server QuicHandshake would be relatively small depending on whether you fetch the public key from the DNS or just query DNS as to whether, say, a fingerprint of a sent public key is valid (DANE seems to do the latter). In either case, the size of the QuicHandshake is going to be quite a bit less than an MTU, say 600 bytes. That means that the ClientHello only needs to be about 200 bytes or so, so it is a medium sized packet. Thus we've reduced the sizes of the packets considerably. That's three small packets and two medium ones. In bytes it's something like 200+600+100+300+100, which is about 1300 bytes, roughly 4x smaller.
 
But wait, there's more: DNS is cachable, so it's pretty likely that the DNS response is going to be sitting in cache, in which case it becomes 3 smallish messages and ~900 bytes, which is about 6x smaller and 2 messages fewer. It also doesn't have any problem with IP fragmentation, if that's really what's going on. Plus we're back to the traditional 3 packet handshake as with TCP. Note that DNSSec requires additional lookups for DNSKEY and DS RR's, but many of these will end up in caches, especially for high traffic sites.
 
Using DNS in this case would obviously require DNSSec to fully reproduce the security properties of certificates, but that shouldn't be an impediment. As with the original Quic, Google owns both browsers and servers so it controls whether they come to agreement or not. All they have to do is sign their DNS repository (which I assume they already do) and the browser needs to make certain that the DNS response is signed properly. All of this can happen in user space that is completely under their control.

Update: I moved the DNS TLSA lookup to be speculative after the A/AAAA record lookup if it's not in cache. The client could keep track of the domains that have produced TLSA records in the past as a means to cut down useless speculative lookups. A better solution would be to have the TLSA record "stapled" to the A/AAAA lookup, but I'm not sure what the rules for such things are, and of course it would require buy in from the DNS server to add them to the Additional RRset.

DNS Implications

Using DNS as a trust root is a much more natural way to think about authentication: domains are what we are used to either trusting or not. Certificates created an alternative trust anchor and frankly that trust anchor is pretty self-serving for a whole lot of certificate vendors. It would obviate the need for that side channel trust anchor and get it on the authority of the domain itself directly. Gone would be the need to constantly renew certificates with all of the hassle. Gone would be the need to pay for them. Gone would be the issue of having dozens of certificate roots. Gone would be the risk of one of those roots being compromised. Gone would be a business model that was predicated on 40 year old assumptions of the need for offline verification which is obviously not needed for an online transport layer protocol. 

Another implication is wildcards. Certificates can have wildcards in the name space, so that foo.example.com and bar.example.com can share one certificate for *.example.com. DNS has wildcards too, but whether they would meet the security properties needed is very questionable, as I'm pretty sure there is a lot of agreement that DNS wildcards are messed up to begin with. If they don't, you'd have to enumerate each subdomain's DANE records. I'm willing to bet that DANE addresses this, but I haven't seen it specifically in my skim of it.

Another implication is that a lot of clients rely on upstream resolvers, which is a thorny issue when authentication is involved. However, my experience is that browsers either implement their own stub resolver or rely on an OS stub resolver. Given ecommerce, etc, my feeling is that trying to eke out some sort of CPU performance benefit is generally a bad tradeoff and that browsers can and should actually authenticate each transaction before storing it in a local cache. RSA/ECDSA verifies are extremely cheap these days, and besides, browsers are already doing those verifies for certificates.

TLS Implications

 
I am by no means a TLS expert and can barely play one on TV, but my understanding is that TLS allows for naked public keys these days. Update: this is specified in RFC 7250 and uses X.509 syntax, but strips everything out but the public key. I'm not sure how TLS deals with validating the raw public key, but I assume that it just hands it up to the next layer and says "it validates; whether you trust it is now your problem". That takes the DNS/DANE exchange completely out of the hands of TLS, so implementers wouldn't need to get buy in from TLS library maintainers.
 

An Alternative for Certificates

While certificates require 3 packets to transmit, it is not inevitable that they must be sent each time a session is started. A client could in principle send the fingerprint(s) of certificates that it has cached for the domain in the ClientHello, and the ServerHello could then reply with the chosen certificate fingerprint if it has possession of its key. That too would cut the exchange down to 3 packets instead of 5. The downside is that it would require buy in from the TLS community to implement the new protocol extension. Additionally, the ClientHello would still be required to be an MTU'ish sized packet since the client wouldn't necessarily know whether the server supports that extension or not.

Conclusion


I've stressed throughout this that a Google-like company could take this into their own hands and just implement it without buy in from anybody. That is what made Quic possible in the first place, since anything else means beating up against an ossified and sclerotic industry. Indeed, the Certificate Industrial Complex would completely lose their shit as their gravy train is shut down. Given DANE and DKIM, the use of DNS to authorize public keys for use elsewhere is well understood and should be completely safe given DNSSec, and arguably safer given that there are far fewer middle men CA's involved to screw up.
 
A real life implementation would go a long way toward proving how much latency it would cut out, because my numbers here are all back of the envelope. It remains to be seen what the actual improvement is. But if it did nothing more than break the back of CA's, that would be an improvement in and of itself. Admittedly, this only changes the startup cost, not the per packet cost, which might contribute to some of the gains that Quic sees. Since Quic allows longer lived connections and multiplexing of requests to deal with head of line blocking, it's not clear whether the gains would be significant or not. The business side implications, on the other hand, are clearly significant, though it has to be said that X.509 would need to be supported for a good long time.



Monday, March 8, 2021

Certificates Confuse Everything

Not the solution to everything

 

I'm fairly certain I had a basic understanding of how certificates for identity worked, though not much about the underlying technology, before 1998. But in 1998 all of that had to change really quickly, because I opened my mouth about the security problems of the residential voice over IP project I was working on at Cisco, and in particular the signaling protocol we were using called MGCP (nee SGCP). MGCP is a pretty simple command/response protocol where a server tells a home POTS (eg phone) gateway to, say, go off hook, or ring the ringer, etc. Needless to say, having some script kiddie able to ring the ringer or listen in on the microphone would not be ideal. For opening my mouth I got told to solve it. So there I was, having to do a crash course on network security, all of its protocols, and how it really worked at all.

My group in particular was tasked with creating the residential gateway, which was a box that had a couple of POTS ports and was integrated together with either a cable or DSL modem. These needed to be authenticated both ways so that the service providers could prevent rogue gateways from getting access to their telephone network. In this case the gateway is the client device in a client/server relationship. Normally clients use passwords, but that doesn't seem especially elegant for a box sitting in the corner, though now that I think about it that is exactly what my router does when connecting using PPPoE to my ISP. There was a requirement that the user wouldn't have access to the gateway, so that would have made it more difficult, especially for manufacturers if they had to pre-provision the secret keys.

So it was time to learn about asymmetric keys. Well, rather, the first thing to learn about was certificates, because that's how they always got couched. Certificates were these magic identity thingies that through some math voodoo allowed the other side of the conversation to know who they were talking to. Once you had a certificate, all of that math voodoo became mostly irrelevant, so I mostly concentrated on certificates rather than on how asymmetric keys actually work. To give an understanding of how clueless I was at the time, I remember asking another engineer whether we could just RSA sign the MGCP packets or something like that. Looking back that seems like a silly question to ask, but as it turns out it was exactly the right question to ask when it came to signing email for DKIM just a few years later.

So everything was in terms of certificates: how to get them onto the box, what to do with them once they were there, and how this all related to keeping kiddie scripters from ringing my phone in the dead of night. Some of my previous group were working on IPsec, so I got up to speed with that, and it seemed like a good solution to the crypto needs of our residential gateway security problems. Though TLS (then SSL, I think) was definitely in the air back then, MGCP was a UDP based protocol, and TLS only works over TCP (though it's now integrated with QUIC). I was persistent on this point in the SIP working group too -- SIP could also be run over UDP -- because I thought IPsec in transport mode was a better choice since it dealt with UDP as well. Instead, others went off and designed DTLS to meet the UDP requirement. Oi. The irony now is that SIP is so bloated that it wouldn't even fit in an MTU sized packet anymore, so we were both "wrong" in that deprecating UDP would have been the better choice.

Now that we had an underlying crypto mechanism, it was back to getting those certs onto the gateway, and what were these certs anyway? The general idea was to have a root CA which vouched for approved manufacturers (I was by that time participating in Packetcable, Cablelabs' residential voice standardization project). To me this was still all rather mysterious and something of a black box. I finally started to grok the larger picture when I was in a meeting with Van Jacobson and he said "ah, the enrollment problem". We didn't have a certificate problem, we had an enrollment problem. How do you enroll those devices such that the server knows who is who? That is the basic problem going on, and in the race to solve it with certificates nobody asked why they were needed at all.

That's sort of how it always seems to go when people start talking about using certs, as if they were some magic incantation and The Way you use asymmetric keys. Nobody asked why we needed to bind a name to a public key in the first place. I finally started to understand the underlying math, how IKE worked, and especially how RSA signing and encryption worked. Being able to determine who you're talking to doesn't require having a key bound to a name at all. The public key itself is unique and can be used directly as an identifier. Certs completely obscure that property. Later, when Jim Fenton and I designed IIM, which is one of the precursors of DKIM, we took advantage of that property and just used the public key as the identifier itself. It was DK that had a somewhat gratuitous name/key binding in the form of selectors, but it didn't hurt anything and allowed me to have a selector name called "fluffulence".

So why do I like to bag on certificates? Because they confuse getting to the bottom of what you're trying to do. Like I said, since everything is very certificate oriented, nobody asks the obvious question of why you need a name to key binding at all. In the Packetcable case I recall us struggling with what exactly the name in the cert should be. That right there says that first principles almost certainly need to be revisited. We didn't have a naming problem, we had an enrollment problem; the name was irrelevant, and thus there was no requirement to carry it around in an obscure and ossified bag of bits in the form of X.509 and ASN.1. The other part that I dislike about certificates is that they are a business model. It costs nothing to put a public key into the DNS or in some database. It costs enough to support lots of CA vendors' bottom lines for certificates though. There is one use case that certificates can handle that is not easy to reproduce in other ways: offline verification. This was an important use case when they first arrived, since the expectation of being online was a rare beast in the 80's. Today the need for offline verification is niche and the whole world is a connected internet. So we're supporting a billion dollar business model for a feature almost nobody uses.

When our residential VoIP project was going on over 20 years ago it might have been somewhat justifiable, because a whole lot of us were getting a crash course in network security. However, I don't really think much has changed on that front. Everybody proceeds from a cert-first mindset as if it were a given, without first working out what the actual requirements are, then deciding whether a name/key binding is needed at all, and only then determining how that binding is achieved if it is. It's also unfortunate that so many protocols have a built-in expectation that certificates must be used, though it's my understanding that TLS and IPsec both allow for naked public keys, obviating the need for actual CA-issued certificates. I'm not sure if they implement it just by sending a self-signed cert where the server just ignores the CA signature, or whether there truly is a means of sending a naked public key (the latter would certainly be better since the intent is clear).

In the VoIP/PacketCable use case, client certificates were never needed. The naked public key (or a hash of it) was perfectly serviceable as an identifier for the residential gateway. All that needed to happen was to get it enrolled somehow, and there are many ways to do that depending on the security requirements. Dispensing with the complex X.509 infrastructure makes the entire problem both easier to administer and much simpler to understand. It should be a dead giveaway, for anything that proposes client-side certificates, to ask why they are needed. In the wild, client-side certificates are exceedingly rare, so why is this case different?
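
Here's roughly what "just get it enrolled somehow" could look like, again as a hedged sketch with made-up names: the operator records the key at provisioning time, and later the gateway proves possession of the matching private key by signing a nonce. No CA, no name, no ASN.1.

    import { createPublicKey, verify } from "node:crypto";

    // Hypothetical enrollment table mapping a key identifier to a PEM public key,
    // filled in however provisioning actually happens (factory list, first-use
    // pairing, a truck roll -- whatever the security requirements call for).
    const enrolledKeys = new Map();

    function enroll(keyId: string, publicKeyPem: string) {
      enrolledKeys.set(keyId, publicKeyPem);
    }

    // Later, the gateway authenticates by signing a server-supplied nonce with
    // its private key. No cert, no name, no CA.
    function authenticate(keyId: string, nonce: Buffer, signature: Buffer): boolean {
      const pem = enrolledKeys.get(keyId);
      if (!pem) return false;                              // never enrolled
      return verify("sha256", nonce, createPublicKey(pem), signature);
    }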

The reason I decided to write this post is that, as of this writing, I was having a conversation in which I voiced my dislike of certificates, and most especially the certificate-centric view most people have of authentication with asymmetric keys. I had brought up SSH, which along with DKIM is one of the most used tools built on asymmetric keys (TLS being the most used of all), and neither uses or needs certificate-based identity. Somebody pointed out that SSH allows for client certificates, so I looked it up: it seems they hacked the protocol to get that to work, and apparently it's used as a replacement for the SSH authorized_keys file on servers, which is supposedly better at scale. When I pointed out that it would be easier to just put the SSH public key into the user's profile in an LDAP directory or some such, I got told that it was infinitely easier to create a certificate and put it on the client. Since both have to upload the public key to something, that cancels out. How can putting certificates on a client be easier than doing nothing at all? Magic, I guess. Or confusion. Lots of confusion.

The moral of this story is to not start with certificates as a given if you are thinking about using asymmetric keys for authentication. That just confuses everything. You need to understand what problems you are trying to solve first and foremost. What are the requirements for authentication? Do those requirements need a key/name binding? Do they need authentication to be verifiable while the verifier is offline? If the answer to both is yes, then you should consider using certificates. If the answer to the offline question is no, then you don't need certificates, and the system can be designed without them using naked public keys. Simplicity is always good in security. Certificates are not at all simple and should be used only when necessary.


Sunday, January 24, 2021

Birthing DKIM


Foreword

This is, needless to say, completely from my perspective. I really wish Mark Delany in particular would write something similar, as his side is the other half of the equation and his perspective would be really enlightening. DKIM is a remarkable piece of convergent evolution.

IIM

Tasman Drive


In 2004 Cisco, just like everybody else, was being inundated by spam. On my personal mail server, SpamAssassin couldn't keep up with the permutations. Cisco had no visibility or expertise in email, but we were heavy users of it, so we had an outsider's view that the situation was really bad and didn't seem like it would get better any time soon. So Dave Rossetti assembled me, Fred Baker, Eliot Lear, Jim Fenton, and maybe one other I'm forgetting to talk about what Cisco could do about the spam problem. The main thing going on at the time was Bayesian filtering, but that was being defeated by image spam. After one of these meetings, I came up with the idea that if mail servers did nothing more than apply an unanchored digital signature to the mail -- but with a consistent key -- then maybe the Bayesian filters could latch onto that as a signal for spam or ham. I remember talking to Eliot after a meeting telling him my idea, and he was interested as I recall, but dubious that a free-floating key would work. Some time after I told Jim too, but he had a better idea: why not anchor the key to a domain? And thus the genesis of Identified Internet Mail, IIM. I'm fairly certain Jim came up with the name IIM, because if left to me I would have probably tried to make some cutesy tortured acronym a la KINK.

Since we now had a trust anchor (i.e., the sending domain), it became obvious that we could also publish a record which said whether the sending domain signed all of its mail or not. If the receiving domain got unverified mail and the sending domain said it signs everything, it would be a forgery in the eyes of the sending domain. Thus the concept of Sender Signing Policy (SSP) was born.

So off we went. Jim was still part of his group, and I was still working for Dave Oran at the time, so we were more or less doing this free-form and under the radar. Jim wrote most of the IIM draft, and I wrote the actual IIM code, telling Jim what the syntax of the header was from my running code, and how I implemented the SSP code. IIM had the concept of a key registration server (KRS) that ran on top of HTTP. For discovery, we used a well-known top-level SRV record to find the KRS. We were a little nervous about the overhead of fetching the key over HTTP, but we had a means to allow it to be cached, so we figured it was probably acceptable. We were also really nervous about the overhead of the RSA signing operation. But when I wrote the code as a Sendmail milter I quickly found that the signing overhead was drowned out by the overall processing of the message, so it wasn't a problem.

While this was going on we had heard of some exec at another company falling for a spear phishing attack purportedly from another employee. We didn't think our execs were any brighter or more security savvy -- and frankly, neither were the engineers, since it isn't easy to figure out even if you're looking for it. So with Dave Rossetti we decided that spear phishing was a scary problem for Cisco and created a research group within Cisco charged with dealing with this employee-to-employee spear phishing attack, where I was employee #1 (Jim stayed in his group throughout this). We got some coworkers we had worked with before, including one -- Bailey Szeto -- who had close ties to Cisco IT. The objective was to create an IIM signing/verifying MTA and insert it into the mail pipeline to sign and verify signatures.

While this was going on, we were starting to reach out and socialize the ideas externally. Our co-worker Dan Wing was good friends with Jon Callas, then at PGP Corp, so we had him over to talk it over and make certain we weren't crazy. I'm not sure if Jon was impressed or not, but he didn't find anything substantially wrong as I recall, so at least we weren't going to badly embarrass ourselves going to IETF. We were making fast progress on actually implementing IIM internally as well while this was happening, and getting buy-in from the IT folks to insert my IIM code into the email pipeline. Finally, holding our breath, we went live with IIM in the mail pipeline. A little at first, then a little more, until we were signing and verifying signatures for an entire Fortune 100 company. A company that lives and dies by email, I'll add.

DomainKeys

Tasman Adjacent

We kept our feelers out beyond Cisco and eventually found out that right down the street, a mile or two away at Yahoo!, Mark Delany was working on something called DomainKeys (DK) and had actually deployed it into their mail pipeline. What was remarkable about DK is how similar it was to IIM. He too was working on an internet draft documenting DK. DK also had a signing policy mechanism, but as I recall Mark saying, it was more tentative and maybe aspirational, which makes sense from the perspective of an email provider. When we finally became aware of each other we started meeting with a larger group of interested people -- maybe about a dozen, informally called the Email Signing Technical Group -- to try to figure out what to do, both with the two I-Ds and with how to standardize something in general. Barry Leiba was part of that early group and, along with Stephen Farrell, would go on to chair the DKIM working group. Nothing is simple in the IETF world, and it takes time to agree on the color of the sky even on good days, so it's usually the best plan to have quite a bit of buy-in and a coherent front for the inevitable pushback and vested interests. Mark's DK worked. Ours worked. His was deployed just like ours was. They were both fundamentally doing the same thing.

The IIM draft was first published on June 3, 2004, and DK was published on June 24, 2004. As I recall we both had live implementations running when we published our drafts. I don't know when Yahoo! started signing its outgoing mail; I have always assumed it was before us, but who knows (and if you do, let me know and I'll update this).

The Fusion of DKIM

Mark, being at Yahoo!, was very service-provider oriented. Our situation at Cisco, coming from an enterprise standpoint, was more complex, and the IIM draft laid out a bunch of the use cases that needed to be supported. It wasn't entirely clear whether they could be supported by DK or not. As I recall, we met with Mark at Cisco to see if we could hammer out a combined spec instead of the usual IETF routine of two competing drafts and the pissing matches that ensue. That pissing match was already happening in SPF-land with SenderID. There was a real engineering trade-off between using DNS and using HTTPS. Security was easy for HTTPS, much more of a stretch for DNS. But DNS lookups are cheap compared to HTTPS, and we kept going around on that, though neither of us was dogmatic. I liked DK's header syntax better, as mine was a little overwrought. The big deal though was whether DK could do the enterprise-y things that we wanted.

After the meeting I thought about it for several days, reading the DK draft and comparing it to IIM and its use cases, until I convinced myself that it was a product of convergent evolution; DK could just be extended for our needs. I bit the bullet and told our group we should just adopt the DK mechanism and add the things we needed. The lingering concern about HTTP performance outweighed the security concerns about DNS. The irony these days is that DNS over HTTPS (DoH) is now a thing, so we're back to where we started with IIM: we could have used HTTP from a performance standpoint after all. The other part of basing it off DK was tactical: Yahoo! was a big fish in the email world while Cisco was a barely hatched fry. That said, I think IIM had it right in the long run. DKIM gets knocked all the time about DNS and the lack of DNSSEC deployment. While I think that is overblown, you can't deny that setting up TLS on an HTTP server was a well-known skill even in those days.

At that time we already had IIM deployed throughout Cisco and were starting to gather stats for our stated goal of dealing with spear phishing. Part of the problem was identifying the sources of email in the company that were not routed through the Cisco mail pipeline; that was daunting and proved something of an Achilles heel, though not entirely. DMARC's reporting facility would have been very helpful, but of course that requires wide deployment by other domains, and we didn't even have a merged protocol yet. Our main problem was with external mailing lists, of which we were painfully aware because that's how IETF did its business. I wrote a bunch of heuristics to recover signatures that went through mailing lists to see if they could be validated. I got tantalizingly close, with about 90% recovery, but we had a lot of unsigned email from other sources so we couldn't take action.

[Photo: where Eric Allman, Mark Delany, Jim Fenton, Jon Callas, Miles Libbey, and I hammered out DKIM at my place in San Francisco]

The combined spec was coming together. Eric Allman was given the editor's pen for the combined spec, which was hammered out in my dining room in San Francisco with all of the named authors in attendance. When enough of DKIM was cobbled together I got to work converting my IIM implementation into DKIM. I had found out that Murray Kucherawy at Sendmail had a DK implementation written as a milter as well (it was never clear to me if that's what Yahoo! was using. Edit: Mark says it was Murray's milter). So the race was on. I got done enough that I sent Murray email (signed!) telling him I was ready to interop. Murray was right behind me, and the next day we started to debug our implementations. Murray was at a big advantage because on the outside the protocol looked a lot like DK. Our main interop issue was me getting the NFWS body canonicalization correct, as I recall. Beyond that I think we had interop possibly that day, but certainly within a few days.

As it turns out, that was a theme, with lots of implementations to follow and, most importantly, lots of interest across the industry. The next step was to take the combined DKIM draft to IETF. As I mentioned, IETF is a painful process, and getting a working group spun up is always extremely difficult because everybody and their brother gets their $.02 in. DKIM had the advantage that it was a fully formed spec with, by that point, a lot of vetting from a lot of eyeballs as well as implementations. If I recall correctly it was at the Paris IETF in 2005 where we had our debutante's ball. There was a lot of sound and fury from the usual attack poodles. Jim got saddled with writing an informational threats RFC, much of it written sitting on the floor in the halls of the Paris IETF venue as I recall. The one thing I do recall out of all the sound and fury was Harald Alvestrand (then IETF chair) standing up and saying this entire process was ridiculous and should just proceed. Thanks Harald!

I don't recall whether we actually were chartered in Paris, but I do remember filling up a friend's tiny restaurant with an assortment of IETF folks, with wonderful food coming from her postage-stamp kitchen, including a chocolate mousse with cayenne. Everybody loved it. So anyway, the working group was chartered, the threats draft was published, and work began on what was already a pretty mature draft with a growing number of interoperable implementations. Probably the single biggest change to the original draft was the message body canonicalization: NFWS turned into "relaxed" for reasons I don't really recall. Relaxed seemed better, but not that much better, and it required us to re-interop. Oh well, something had to change. We did eventually have an in-person interop with probably 20 different implementations, hosted by the affable Arvel Hathcock at Alt-N (now MDaemon) in Dallas. We were treated to a Brazilian restaurant where prodigious amounts of meat were consumed.
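
For the curious, this is roughly what the body canonicalization fuss was about -- a simplified TypeScript sketch of my reading of the "relaxed" rules in RFC 6376, not the normative algorithm: collapse runs of whitespace, strip trailing whitespace on each line, and drop empty lines at the end of the body.

    function relaxedBody(body: string): string {
      const lines = body.split("\r\n").map(line =>
        line
          .replace(/[ \t]+/g, " ")    // collapse runs of whitespace to a single space
          .replace(/ +$/g, "")        // strip whitespace at the end of each line
      );
      // Ignore empty lines at the end of the body.
      while (lines.length > 0 && lines[lines.length - 1] === "") {
        lines.pop();
      }
      // A non-empty canonicalized body ends with CRLF; an empty body stays empty.
      return lines.length ? lines.join("\r\n") + "\r\n" : "";
    }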

So at this point DKIM was pretty well set and would go on to become proposed standard RFC 4871 in mid-2007. Believe it or not, that was good speed for the IETF process, but we did have the advantage of an interoperable spec without any competing specs -- in IETF parlance, the rough consensus and running code were there before the working group was even formed. On the home front I continued to run experiments as we tried, with increasing frustration, to find all of our sources of email.

SSP/ADSP

Early on, I believe after the working group formed, it was decided to split DKIM and SSP apart. That's a fine decision in retrospect -- they are two different on-the-wire protocols. But SSP elicited, shall we say, fervor from people who disliked it. It still seems to elicit similar fervor in its DMARC instantiation, which makes me wonder why the people who dislike it participate at all. Suffice it to say there was a lot of resistance to SSP. It was at some point renamed ADSP for reasons lost to me, but for all of the bickering it remained pretty much the same SSP with some tag wordsmithing, I assume so as to justify the name change. One of the authors was even in the resistance crowd, which again makes you wonder why you'd work on something you don't support. To this day DMARC, which is yet another bite at the apple, is fundamentally the same as ADSP (modulo the reports). It also added support for SPF to be used as a policy check alongside DKIM. As for DMARC, I really don't know why they went off to reinvent ADSP instead of just extending it, but it's possible that the, shall we say, fervent poisoned the well too much. One of them even wrote an article for a tech rag against its existence after participating in -- mainly delaying -- its production. Finally RFC 5617 was made a proposed standard in mid-2009.
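
For reference, the policy mechanics boil down to very little, whatever the name of the week is. ADSP publishes a TXT record at _adsp._domainkey under the domain, such as "dkim=all" or "dkim=discardable"; DMARC publishes one at _dmarc, such as "v=DMARC1; p=reject". Here's a heavily simplified sketch of the receiver's decision (real DMARC also folds in SPF and its alignment rules):

    type Disposition = "accept" | "suspect" | "reject";

    // What should a receiver do with a message whose author-domain signature
    // did not verify, given the domain's published policy record?
    function applyPolicy(record: string, authorSignatureVerified: boolean): Disposition {
      if (authorSignatureVerified) return "accept";
      if (/dkim=discardable/.test(record) || /\bp=reject/.test(record)) return "reject";
      if (/dkim=all/.test(record) || /\bp=quarantine/.test(record)) return "suspect";
      return "accept";   // "dkim=unknown" / "p=none": the domain asserts nothing
    }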

That's All Folks

At Cisco we had deployed DKIM into the mail pipeline, but we were also working on a more ambitious project that could take multiple protocols and apply security to the various streams, not just email. I was most intrigued by SIP, because SIP has a lot of the same issues that email does, the inter-domain problem being the biggest. Since I had been working on VoIP stuff before DKIM, I still kept tabs on what was going on with SIP. SIP was at that time creating what was called the P-Asserted-Identity header, which supposedly told you what caller-id was being asserted. I was a regular Cassandra shouting at the top of my lungs that their assumption that voice would remain an old-boys network just like the old days was wrong, and that this was going to backfire on them since there was no authentication mechanism. I even hacked up a SIP stack and started DKIM-signing SIP INVITEs to prove it could be done with little or maybe no change to DKIM. More on that later.

Cisco had decided a ways back that it was in fact interested in getting into the email security business. I did due diligence on a number of companies including IronPort, which was eventually chosen and integrated, making our prototyping work redundant (they even participated in the Alt-N interop too). Both Jim and I had figured that we'd just move over to IronPort. Apparently we were too "Old Cisco" and both of us were rejected, with me at least labeled completely unqualified to write code or something like that. We just won you the fucking startup lottery, assholes. Thanks a fucking lot, you ingrates. Have I mentioned how much I dislike puffed up egos?

Epilog

Off to Ski

My group (one that I was responsible for forming) had decided to go off on some wacky scheme with Skype which I had absolutely no input on and absolutely no interest in. As I was looking around for something new to do at Cisco, I had also taken my Garmin GPS skiing at Kirkwood and dumped all of my track points into Google Maps. I was completely fascinated by this and all of the possibilities: finding your friends on the mountain, seeing how fast you were going ("It can tell me how fast I'm going?!"), and gaming with your friends. Since I didn't find anything interesting at Cisco and was bored, I left in August 2008 to go ski for 5 years. Two months later Android came out and my adventure into phone apps began.

DKIM to STD 

While I was off skiing, the DKIM working group kept bumping along. While most RFCs stay at the proposed standard level, there is a complicated process to move one from proposed to draft standard and then to a full Internet Standard. By the time I decided to ski for a living, I had had it with the petty politics, stopped paying attention to the working group altogether, and unsubscribed from the mailing list. I have no idea what happened in the intervening three years, but in 2011 DKIM became STD 76. That number alone makes it clear that not many protocols make it to full standard. By the time I left, DKIM was already widely deployed at the major email providers, with billions of pieces of email signed and verified every day.

One of the interesting things that came out of DKIM is that it implements a public key infrastructure (PKI) and is probably the second largest PKI, next only to HTTPS/TLS. What I particularly like is that it shows it is not inevitable that a PKI needs to use certificates. In fact, DKIM shows that X.509 is particularly dated and unnecessarily complex with its CAs, ASN.1, blah blah blah. TLS is water under the bridge at this point, of course, but there seems to be some magical thinking that if you use asymmetric keys, certificates are required. DKIM proves that is emphatically wrong, and that it can be as simple as publishing the key and name in DNS, or pointing to a web server to fetch them.
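
The entire "PKI" amounts to this: the DKIM-Signature header carries d= (the signing domain) and s= (the selector), the verifier joins them into a DNS name of the form selector._domainkey.domain, and the TXT record there holds the public key, something like "v=DKIM1; k=rsa; p=..." with the base64 key in the p= tag. A sketch of the lookup (TypeScript/Node, using example.com and a hypothetical selector):

    import { resolveTxt } from "node:dns/promises";

    // Fetch the base64 public key published at selector._domainkey.domain.
    async function fetchDkimKey(selector: string, domain: string) {
      const name = `${selector}._domainkey.${domain}`;
      const records = await resolveTxt(name);        // e.g. [["v=DKIM1; k=rsa; p=..."]]
      const record = records.map(chunks => chunks.join("")).find(r => r.includes("p="));
      return record?.match(/p=([^;]*)/)?.[1]?.trim(); // base64 of the public key, if present
    }

    // fetchDkimKey("fluffulence", "example.com").then(console.log);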

Ah SIP, My Old Friend

As a sad case in point I submit to you STIR/SHAKEN (RFC 8226). My main beef with STIR is that it solves the wrong problem -- they are trying to determine whether somebody is allowed to assert a given telephone number rather than just holding the senders accountable as DKIM did. They also clung to the X.509 world, which made it much less comprehensible in the process. On top of that, there are many classes of deployments that STIR can't address at all. The RFC was published in 2018, 10 years after I had shown that they could just reuse DKIM. STIR is so rife with errors and under-specification that I had to stop writing a blog post about it. If it flops -- and there is a good chance it may -- there is always the DKIM route, which has the added benefit that it also solves for the non-bellheaded use cases in the From address.

ARC, WTF?

I had vaguely heard that a set of people had created a successor to ADSP, which at some point was brought to IETF as an informational RFC. I looked it over more carefully and it seems to unify policy with SPF, which is fine -- we didn't care about SPF at the time because they had their own policy mechanism, so why pick needless fights? It also has a reporting mechanism for when signatures fail, etc., which is in reality a completely different protocol from ADSP and gains no particular advantage from being tied to the ADSP policy mechanism.

That said, looking at the headers of a message I happened to see a weird DKIM-like signature called ARC-Signature, along with ARC-Seal and ARC-Authentication-Results. I joined the DMARC working group to try to figure out what this was about. There were a lot of faces new to me in the working group, but also a lot of people who should have known better than to think ARC brings anything new to the table beyond plain old DKIM. The main premise is that ARC is supposed to solve the mailing list traversal problem, or more generally intermediaries who invalidate the originating DKIM signature. There is definitely a lot of magical thinking, because when pressed on how ARC will do what DKIM supposedly can't, the answer is that it depends on the receiving domain trusting the ARC-Signature's domain. Doh. Uh, folks, that reduces to a previously unsolved problem: intermediaries DKIM-sign all the time these days, and there has been absolutely nothing stopping a receiving domain from trusting those domains for the past dozen years. I really can't understand how the IESG let this happen, because it is really ill conceived, though at least it is just a (failed, imo) experimental RFC. Through the process, however, I have come to the conclusion that we should just ignore the mailing list traversal problem, set p=reject, and let the chips fall where they may. For the vast majority of domains it is unlikely to ever be a problem. I wrote a post here about why.

Conclusion

DKIM is definitely one of the biggest achievements of my life and I'm very proud of it. Starting from a kooky idea about feeding Bayesian filters, working up a fully fleshed-out implementation and internet draft, finding convergent evolution just down the street, and marrying the two off instead of having a protracted pissing match, all the way to full Internet Standard 76. What a trip!

I recently came across a really interesting study about DKIM, SPF, and DMARC showing what effect they have had: TL;DR, not a silver bullet -- nothing is with spam -- but they're having a noticeable effect on the problem. It's an interesting if long read, but worthwhile if you're into email security.



Saturday, January 2, 2021

Storing Asymmetric Keys in Client Storage for Login and Enrollment

Very unlocal storage

I'm genuinely conflicted about whether storing private keys in browser storage is OK all things considered, or really wrong. It certainly feels wrong as hell, and this goes back years. With my HOBA revisited post (using asymmetric keys for login and enrollment), I dusted off my ancient prototype code and used WebCrypto to generate keys and sign enrollment and login requests. Obviously you don't want to re-enroll keys each time you go to log in, so that implies that the private keys need to be stored in a place that is accessible to the WebCrypto signing code at the very least. To my knowledge, there is no way to get WebCrypto to keep the keys themselves inaccessible on a permanent basis, so that means they need to be exported, stored by the app, and reimported on the next login attempt.
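
For the curious, the shape of it looks something like this (a sketch assuming ECDSA P-256 and JWK export, not my actual HOBA prototype code): generate an extractable keypair, export it so it can be stashed somewhere, and re-import it to sign the next login challenge.

    // Generate an extractable keypair and export both halves as JWK so they can
    // be stored and re-imported later.
    async function makeKeyPair() {
      const pair = await crypto.subtle.generateKey(
        { name: "ECDSA", namedCurve: "P-256" },
        true,                                  // extractable, so it can be exported
        ["sign", "verify"]
      );
      return {
        publicJwk: await crypto.subtle.exportKey("jwk", pair.publicKey),
        privateJwk: await crypto.subtle.exportKey("jwk", pair.privateKey),
      };
    }

    // Re-import the stored private key and sign the server's login challenge.
    async function signLogin(privateJwk: JsonWebKey, challenge: Uint8Array) {
      const key = await crypto.subtle.importKey(
        "jwk", privateJwk, { name: "ECDSA", namedCurve: "P-256" }, false, ["sign"]
      );
      return crypto.subtle.sign({ name: "ECDSA", hash: "SHA-256" }, key, challenge);
    }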

So what is the implication? That you'll be storing those keys in localStorage or IndexedDB. They could be kept in the clear -- say, PEM form -- or wrapped using a password to encrypt the blob for extra security, in which case it would look much like a normal login form even though the password would not be sent to the server. From everything I can tell this is frowned upon, but strikingly not categorically. OWASP says that it is not good form or somesuch to put sensitive information in client storage. They give some rationale such as XSS attacks, but seem unable to work up the courage to say "NO! Don't ever do this!" This leads me to believe that they have the same yucky feelings I do, but can't quite come up with a logical reason to say that it's always bad.
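
The "wrapped using a password" variant could look something like this (another sketch, with an arbitrary iteration count and a made-up storage key name): derive an AES-GCM key from the password with PBKDF2, encrypt the exported private key, and park the blob in localStorage. The password never leaves the browser; it only unlocks the local blob.

    // Derive an AES-GCM key from the password and store the encrypted private key.
    async function wrapAndStore(privateJwk: JsonWebKey, password: string) {
      const enc = new TextEncoder();
      const salt = crypto.getRandomValues(new Uint8Array(16));
      const iv = crypto.getRandomValues(new Uint8Array(12));

      // Password to AES-GCM key via PBKDF2 (the iteration count here is arbitrary).
      const baseKey = await crypto.subtle.importKey(
        "raw", enc.encode(password), "PBKDF2", false, ["deriveKey"]);
      const aesKey = await crypto.subtle.deriveKey(
        { name: "PBKDF2", salt, iterations: 600000, hash: "SHA-256" },
        baseKey, { name: "AES-GCM", length: 256 }, false, ["encrypt"]);

      // Encrypt the exported key and stash everything needed to undo it later.
      const ciphertext = await crypto.subtle.encrypt(
        { name: "AES-GCM", iv }, aesKey, enc.encode(JSON.stringify(privateJwk)));

      localStorage.setItem("login-key", JSON.stringify({   // "login-key" is a made-up name
        salt: Array.from(salt),
        iv: Array.from(iv),
        blob: Array.from(new Uint8Array(ciphertext)),
      }));
    }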

Part of the problem is that if you have XSS problems, you are well and truly fucked; browser-based storage is a problem, but with XSS everything in your entire existence is a problem too. So to my mind that says to just take that rationale off the table, because you would never allow it as an excuse for shitty practices like SQL injection vulnerabilities: it's just not acceptable, full stop. After that, the rationale for not using browser storage gets murky in a real hurry. As far as I can tell, it reduces down to that yuck factor.

Thinking about this some more, browsers can and do store immense amounts of sensitive information. That information is available to JavaScript code running in the browser sandbox too. That sensitive information comes from the browser's form fill-in helpers. They collect names, passwords, social security numbers, credit cards, your first dog's name... you name it, they have it. Yet people do not think twice about the browser helping out. All of those things are completely available to JavaScript, and to any rogue XSS script that manages to run too. Yet I've not heard anybody sound the klaxons about this, and if anybody has, nobody seems to be listening. Which is to say, that ship has long since sailed.

So are there any differences between this and browser form fill-in? Yes, but they are subtle and I'm not sure they really change the attack surface much. Browser helpers do give you the ability to say "no, don't fill this in", which is definitely a nice feature and enhances security, since not handing the data over to the DOM means the client code won't see it. I think, however, that that is a chimeric victory: if everybody uses the feature because the browser is good at figuring out what needs to be filled in, then the victory is just nibbling around the edges of the actual security problem.

Now to be certain, I think a situation where the browser stores the keys such that they are not visible to the JavaScript code, and where the user is given that choice, would be great. What I visualize is that private keys are stored along with all of the rest of the sensitive data the browser stores, but somehow delivered down to the client JS code only when the user allows it, as with form data. The keys could be encrypted with a secret only the browser knows, or maybe the crypto routines in WebCrypto could take a new key type which is really just a reference to a key rather than the key itself. Which is to say, it's an implementation detail rather than anything fundamental. This would be the ideal way to really solve the problem. It would, of course, require changes to the browser and definitely standardization. Which is to say, it's a long way off, but it's definitely possible.
 
The question that remains is a riff on the Saint Augustinian bargain: give me better security, but not just yet. That is, should we stay chaste until a better solution comes along, or should we make do in the meantime with a less perfect solution? Given what I can tell of the risks, I put myself in the "just not yet" camp. Passwords over the wire are such a huge problem that almost anything would be better. Given that browsers are already exposing sensitive information including passwords, I'm not sure exactly what the big problem is. The threats are much worse with passwords given their reuse, so it seems to me that the incremental gain is completely worth it even if it is not the best long-term solution. That is to say, even if I manage to steal your private key for one site, that gives me exactly no amplification attack, unlike reused passwords.

So in conclusion, yes, I totally get the ick factor. But the ick is part and parcel of the entire beast, not just the browser storage mechanisms. What that tells me is that one needs to put this in perspective. As with all security, risk is not absolute and needs to be viewed in the context of the alternatives. The alternative in this case is the additional amplification factor of reused passwords. Since all other things are fairly equal given browser form fill-in, I think that's pretty significant and a good reason to override the "should not put sensitive information in client storage" recommendations.