Base64 Tutorial and Sample Code

Note to readers:  The first half of this post is a simple introduction to Base64 intended mainly for beginners.  The second half discusses implementing Base64 along with sample code, and is intended for intermediate- to advanced-level programmers.

What is Base64?

Base64 is used to transform raw binary data into printable text data.  If you’re totally new to Base64, let’s break down this definition a bit further.  Maybe you’re wondering, “what’s the difference between text vs. binary data.  An example of text data is this paragraph.  Remember that from a computer’s perspective, each letter is just a number.  For example, to a computer, the letter “W” is just 87.  With this in mind, “text data” is simply a string of numbers that all correspond to letters, numbers, symbols, etc.

Binary data often contains numbers that don’t match up to any letter or number.  For example, picture, song, and movie files all contain binary data.  Have you ever accidentally opened a picture or song in a word processor, and ended up with a screen full of junk that looks like “0:09PrintIM0300öÇ äùÇí”à’à@ê02??  20êöêÆëë¬í”?  That’s an example of binary data.  Base64 gives us a way to convert this binary data to text data.

Why would we want to convert binary data into text data?

Here’s a scenario where Base64 would be useful:  You’re talking with a friend via instant message (IM), and you want to send a picture.  Unfortunately, for whatever reason (corporate firewalls, bad/different IM apps, etc.), you can’t send the file directly.  However, you could simply convert the picture to Base64, and copy/paste the resulting text into the IM window.  Your friend can take the Base64 text and convert it back into binary data, then just double click and open the picture file.

Also, when you visit a webpage, the images on that page are sent in Base64.  Your web browser decodes them back into binary data and displays the images for you.

If you’re interested in other uses of Base64, check out the Wikipedia article.

The Base64 Encoding Algorithm

Alright, now it’s time to discuss implementing a Base64 encoder/decoder.  I’m using C++ for this example, but I suspect that it would be easy to translate the code to other languages.

Base64 is simply a base-64 number system, hence the name.  It’s an extension of the Decimal and Hexadecimal number systems.  To review:

  • In Decimal we use the digits: 0 1 2 3 4 5 6 7 8 9.
  • In Hexadecimal we use the digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F.
  • In Base-64 we use the digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z + /

At its root, encoding a message in Base64 is just a simple substitution of binary data to Base64 digits.  Because bytes are composed of 8 bits, and a base64 character can only represent 6 bits, it prevents us from just performing a 1:1 substitution.  Encoding is done three bytes at a time, producing four base64 digits.  Note that three bytes is 3 * 8 = 24 bits, and three Base64 digits is 4 * 6 = 24 bits.  Refer to the Wikipedia article for a more complete example of this encoding.  They have an excellent table demonstrating this conversion.

Before we get too far into the details, note that there are several different variants of Base64.  My code supports MIME-Base64 and FileName-Base64.  MIME is used when encoding images and data for web pages, and FileName always produces output that is a valid file name.  To be specific, the only change is that FileName-Base64 uses the ‘-’ character instead of the ‘/’ character.

At this point we can outline a basic algorithm for conversion:

  1. Load up a block of binary data and store it into an array called ‘original’.
  2. Load 3 bytes of data from the ‘original’ array and combine into one 24-bit number.
  3. Take the 24-bit number and split it into four 6-bit numbers.
  4. Translate each 6-bit number into a base64 digit.
    1. This can be done with a series of ‘if’ statements.
    2. The same can be accomplished using a look-up table (i.e. a dictionary of sorts) that translates each 6-bit number into a Base64 digit.  This is the approach I have used in my code.
  5. Go back to step 2 and repeat until there are < 3 bytes of data left.
  6. If there are 1 or 2 bytes remaining, encoding them must be handled as a special case.
    1. You cannot just insert 1 or 2 zeros and convert the last block, because upon decoding the data, there would be no way to tell whether those zeros were just padding or actual data.
    2. Wikipedia does a good job of explaining what to do in this case.  Refer to the “Padding” section of the Base64 article.
  7. Celebrate because you’re done!

Sample Code

Below I’ve provided links to the source and header file.  I’ve decided not to go through the code line-by-line in this post, because I believe that if the code is elegantly written, the code itself is its best documentation.  I’ve included some extra comments that identify different parts of the encoding/decoding functions, as well as additional details in the comment blocks at the top of both the source and header file.

HTML files: (Good for viewing in a web browser, but doesn’t compile)

base64.h.html
base64.cpp.html

Source files: (These are the actual files you want to add to your project)

base64.h
base64.cpp

Questions?

Feel free to comment or ask questions if anything is unclear.  If you have suggestions on how to improve the code, I’d be happy to hear them as well.

Cyclic Redundancy Check (CRC) Tutorial and Sample Code

The goal of this post is to provide free, robust code for computing a CRC, both in embedded firmware, and general computer software.  I have written the code in C/C++, but it can be easily adapted to most other languages.  Now, onto business.

What is a CRC?

The cyclic redundancy check (CRC) is an error detection code computed from a block of digital data.  It is especially well-suited to detect accidental corruption when data is transmitted, copied, or stored in an unreliable manner.  For example, some archive files compute CRCs for files when they are added.  Later, when files are extracted from the archive, the CRCs can be used to check for data corruption.

How does it work?

It’s not my intention to fully explain the math behind the CRC, rather, supply just enough information to allow you (the developer) to use the code quickly and correctly.

In brief, the CRC is computed via a “division-like process”, where the remainder is the final result.  Typical CRC sizes are 8-bit, 16-bit, and 32-bit, although larger integers can be used.  This article focuses on 32-bit CRCs.

What else do I need to get started?

All you need now is to pick a generator number and a bit order.  These parameters are explained below.

  1. The generator number, or, more formally, the “generator polynomial” is the divisor for the “divison-like” CRC algorithm.  When it comes to error detection, some generators are much better than others in terms of performance.  Luckily for us, there are a relatively small set of standardized generators.  These generators are specific numbers that are better at detecting errors than a number chosen at random.  Some systems use the ANSI CRC Generator Polynomial for backward compatibility, but there are newer generators that offer better error detection than the old ANSI standard.  Several common generators are defined in the table below.  My recommendation: if you need backward compatibility with another application, use the ANSI generator, otherwise use the Castagnoli generator.
  2. The bit-order is the order in which the CRC is computed.  Computers store data in bytes (8 bits), while the CRC is computed one bit at a time.  Therefore, the developer has to decide whether the algorithm starts with the left-most (most significant) bit or the right-most (least significant) bit.  If you’re not doing any embedded firmware development, you should probably just choose most-significant-bit-first (msb-first) and move on, but embedded developers should give this some thought.  A CRC should always be computed in the same bit-order as it was transmitted / copied / received.  For example, USB transmits data least-significant-bit-first, thereby implying that the CRC should be computed lsb-first.
Polynomial Name Value MSB-First Value LSB-First
ANSI X3.66 0x04C11DB7 0xEDB88320
Castagnoli 0x1EDC6F41 0x82F63B78
Koopman 0x741B8CD7 0xEB31D82E
AIXM 0x814141AB 0xD5828281

How do I implement the CRC algorithm in embedded firmware?

Answer: you don’t have to!  I’ve posted the code and a short example program.  Important: please read the comment header at the top of the file.

Click here to view the code as formatted html (good for reading, but won’t compile or copy/paste well).

Click here to view/download the code as a plain-text source file (easy to copy/paste and compile).

How do I implement the CRC algorithm on a PC, Mac, or Smartphone?

Just as before, I’ve already published the sample code.  I’ve written a class called “Crc32″.  The documentation and the class declaration in the header, along with an example program, and info from this post should make the class’s usage rather intuitive.   Here is the download link and some usage guidelines to follow.

Download all files as a 7z archive (faster download).
Download all files as a zip archive (slower download, but more compatible).

  1. Try to instantiate as few CRC classes as possible.  It takes a significant amount of time and memory to build the look-up table, and will slow down code significantly.  For example, imagine you’re writing a program, and two of your classes are named Data and DataOutputter, respectively.  The program sends packets (in the form of a Data object) to DataOutputter, which then outputs the data somewhere.  The Crc32 class should be placed in DataOutputter, not in the “Data” class.  Ideally, it would be nice for each Data object to “know” it’s own CRC, but it’s impractical to do so.
  2. When I first wrote this code, I ran into major problems because I had to manually assemble all parts of my data into a single block of memory, which was often quite difficult.  To solve this problem, one can call compute() multiple times, allowing the developer to compute a full CRC when the data is still in pieces.  I will demonstrate this in this example code.
  3. Always call the invert() function before and after computing the CRC, unless you’re trying to conform to someone else’s standard.
  4. The Crc32 class will choose the right generator for you based on the BitOrder you specify.  If you are using a custom generator, enter it msb-first regardless of bit order.  If necessary, Crc32 will automatically convert the generator to lsb-first.
  5. The Crc32 class is not thread safe.
  6. Use the result() function to get the CRC().  
  7. For large amonts of data (>= 1 MiB) the look-up table method is about 10 times faster than direct computation.
  8. If you are calculating the CRC of very small amounts of data rather infrequently, it may be faster to just compute the CRC directly.  The Crc32 class has a static function called compute_direct() that will return the CRC without generating a look-up table.

Further Help?

I’m rather brain-fried after writing this post and code, so the post is probably filled with grammar mistakes.  I’ve also probably forgot to explain certain things, so please, comment with questions or recommendations.  With respect to the code, it should be robust though.  I’ve used it before for multiple projects.

How to Pick a Password

There’s a lot of talk (i.e. rants, arguments, confusion) on the Internet about choosing good passwords.  Until fingerprint and eye-scanning technology becomes commonplace, passwords will continue to be the cornerstone of computer security (wow, I’m sounding like a college student that has to meet a word count for a paper).  My goal here is to teach the reader how to create passwords that are easy to remember, but difficult to guess, and do so in plain English.  Let’s get started!

Out of necessity, people usually pick passwords that are easy to remember.  For example “matchstick1″, is much easier to remember than “U/7@xX+Ae’OSfI%-”.  I don’t know about you, but it would take me hours to memorize that second password.

There are many ways that a cyber-attacker can attempt to access your money and personal information.  Perhaps the most prominent method is by trying to guess your password.  It’s important to remember that an attacker usually writes a computer program that tries to guess the passwords of several different people at once.  Usually, they aren’t able to guess the password, but with so many computers out there on the Internet, chances are that the attacker will successfully guess a few passwords.  The takeaway from this is: don’t be the low-hanging fruit!  You don’t have a super-complicated password, but don’t use an embarrassingly simple password.

A bunch of attackers trying to guess your passwords may sound outlandish and paranoid, but consider this.  I use my desktop computer to backup my (and my friends’) files.  Generally, between the hours of midnight and 8:00 AM, I generally get around 15,000 different guesses from these attackers.  And this happens every day!

Because there’s only a limited number of words, it’s possible for an attacker to try every single word in the English language.  If that doesn’t work, an attacker’s program can try every single word in every single known language in sometimes as little as 30 minutes.

This brings us to a rather ironic point.  A password should never be a word…

So what can we do?  One easy method is the xkcd password method.  In summary, pick four random words.  The original example for the xkcd method was “correct horse battery staple”.  A typical 16-year-old knows about 12,000 words.  With four random words, the number of possible combinations is 12,000 x 12,000 x 12,000 x 12,000 which equals an unfathomable 20,736,000,000,000,000 (20 quadrillion possible passwords).  If you guessed 15,000 passwords per day as stated before, it would take about 4 billion years to guess all the combinations.

Another simple method is to use a long acronym.  These passwords appear random, but you can always remember them by a memorable phrase.  For example, the password:

anteinasiftfcmfisicismcowicilaltfiteaehammmttidimw

seems impossible!  But it’s actually an acronym for:

“And now, the end is near, and so I face the final curtain, my friend i’ll say it clear, I’ll state my case, of which I’m certain, I’ve lived a life that’s full, I traveled each and every highway, and more much more than this, I did it my way.”   —”My Way” by Frank Sinatra

Even using only lowercase letters, there are 5,606,184,657,6641,933,068,511,861,128,435,847,024,473,459,936,647,893,758,520,097,689,829,376 different passwords that are 50 letters long, making the password virtually un-crackable.  And if you ever forgot it, you can just hum the song in your head and type the password as you go.  Sometimes throwing in a capital letter or two will make the password more secure.

To summarize and add a few more guidelines:

  1. Your password MUST be at least 12 letters long, preferably 14 if you have stuff like medical records, bank account numbers, nuclear launch codes, etc.
  2. Do not use any word in any language, unless you’re using 4 or more words in your password (see the xkcd method above).
  3. Acronyms from songs, books, with a capital letter, number, or symbol thrown in here and they make for very, very strong, memorable passwords.
  4. Do not use any phrase, acronym, or any password derived from the Torah, Bible, Quran, Vedas, or any other religious text.
  5. Clever passwords are virtually always bad.  I’ve broken into many computers (with the user’s permission, of course) where the password was password1, 123456, Jesus, letmein, monkey, baseball, qwerty, 111111, or similar.  Check this link for the list of most common passwords.

Soon, I will be posting a program that can generate passwords using four or more random english words via the xkcd method, as well as an auto-acronym-izer.  The program will generate the password on your computer, meaning I have no access to the password, and it will not be sent over the internet.

If you learned anything new here, go change any weak passwords you have.  Identity theft can be a nightmare if it happens to you.  Look for my password generation tools soon!

—9375

P.S. A note for technically knowledgeable readers.  I know that I’ve glossed over the differences between guessing over a network vs. cracking a stolen hash, wordlist mangling rules, password vs. message digest hashes, use of salts, precomputed hash tables, GPU hash cracking, LM vs. NTLM hashes, etc.  Feel free to start a more technical comment thread if you desire.

An Introduction to Mathematica

When it comes to technical programming, model simulation, equation solving, etc., Wolfram Mathematica is, by far, my favorite tool.  The language is much more powerful and expressive than MATLAB, and, in general, creates higher-quality plots and charts.  It takes a while to get used to the syntax, but it’s well worth spending the time.

For example, pretend you’re an engineer, and you have to plot a system of differential equations.  In MATLAB, the coefficients need to be loaded into a matrix, and then the user must select the proper algorithm to run the simulation.  Usually ode45() works well for smooth systems, but for stiff systems it usually fails miserably.  The engineer is then required to back up, swap out the solving algorithm, and continue.

Compare this to Mathematica.  The user can input the differential equations as a list of properly typesetted equations.  Fractions are vertical, there are superscripts, not the ^ operator.  The equations appear exactly as they do in the textbook.  If Mathematica can solve the system analytically, it will do so, provide the solution equation, and generate the plot, all in one step.  If the system cannot be solved analytically, Mathematica will automatically select the most appropriate algorithm to numerically solve the system.  It can automatically detect stiff systems and instances where alternative algorithms are required. Of couse, the user could always override Mathematica’s autoselection, but, in general, Mathematica’s API is much more robust and efficient than MATLAB.

Well, that was a long-winded example.  Here, I have posted a basic Mathematica notebook demonstrating some of the application’s features.  This list is by no means complete; it is only meant to be an introduction.  I will post more advanced code samples later.

Click this link to access the sample Mathematica notebook.  Unfortunately, I can only post a pdf because WordPress will not let me upload a Mathematica notebook for security reasons.  I will try to externally host these files in the future, allowing everyone direct access to the notebook file.  Mathematica is not free software.  It costs $300 retail price, but most schools and Universities have free licenses for  students and faculty.  There is also a free Mathematica viewer available on the Wolfram website.  It will allow you to view, but not modify the notebook.

View the notebook!

Keyboard Shortcuts for Mac

I originally compiled this list as a guide for my grandfather when he switched to a MacBook as his primary computer.  This list is by no means complete; there are hundreds of keyboard shortcuts in Mac OS, but this lists some of the more prominent ones.  A few of these are specific to Microsoft Word, but most shortcuts are general.  Please comment if you feel that I have left out a critical shortcut.

Click here for the list!

An Introduction to LaTeX

I’ve been a fan of LaTeX ever since I was introduced to it three years ago.  As many of you may know, LaTeX makes it much easier to manage large documents and reports.  That is, if you use it correctly.  If you’ve never tried LaTeX, I strongly recommend that you check it out.  Here are the download links:

MacTeX — LaTeX for Mac (Warning: 2.0 GiB Download)

MiKTeX — LaTeX for Windows

Below I’ve posted links to a sample LaTeX document and the corresponding PDF file it produces.  This example is by no means complete.  If you’re familiar with LaTeX and would like to share additional tips and tricks, by all means comment or email me and I’ll include them.

 The LaTeX source as formatted html.  This is easier to inspect in a web browser but it’s not a plain text file that LaTeX requires.

The LaTeX source as plain text.  You will have to change the extension from .txt to .tex in order for the LaTeX editor to recognize the file.  WordPress doesn’t allow uploading .tex files for security reasons.

The resulting pdf file.

Hello there, Internet!

After many hours of Wikipedia and WordPress.org reading, the site is finally up and running.  Ended up using “9375″ as the site title to avoid unintentionally stealing someone else’s clever title.  Also, the header image is a time-lapse photo of a Space Shuttle Launch and is property of National Geographic.  I hope they don’t mind my use of their awesome picture.