Protocol Buffers for (Coding) Dummies

A (hopefully) more entertaining introduction to Google protos, for people who got bored of reading Google’s official documentation and want a slightly less formal tone of writing.

Why do we even need Protocol Buffers?

Short answer: Efficient data serialization. Take my word for it.

Long answer: You technically have other options for data serialization. Let’s look at each of them and explain why they kind of suck:

The (self-declared) winner: protos!

All internal Google applications use protos for their data serialization, so that should count for something. Lots of Googlers don’t even use Chromebooks (I’m a Google engineer typing this out on my MacBook), so we don’t blindly adopt products just because our employer made them.

Example of a (heavily commented) Proto Message

Here is a .proto message representing the data needed in an address book. Scroll down more to see my detailed notes on different aspects of a proto message.

// Specify whether you're using proto2 or proto3 at the top of the file
syntax = "proto2";

/***********/
// PACKAGE //
/***********/
// Your message will be packaged under "tutorial" now
// It's good practice to use packages to prevent naming conflicts
// between different projects. Fittingly, in C++ our generated
// classes will be placed in a namespace of the same name ("tutorial")
package tutorial;

message Person {
    // REQUIRED V.S. OPTIONAL FIELDS
    // Be careful about setting fields as required; it can be a headache
    // if the fields end up becoming optional later on. Some developers
    // just stick to making all fields optional no matter what.
    required string name = 1;
    required int32 id = 2;
    optional string email = 3;

    // ENUMS
    // It's good practice to define enums if you have fields with a
    // fixed set of values they could possibly take. Much more readable
    enum PhoneType {
        MOBILE = 0;
        HOME = 1;
        WORK = 2;
    }

    // NESTED MESSAGE DEFINITIONS
    // You can define messages within other message definitions
    message PhoneNumber {
        required string number = 1;
        optional PhoneType type = 2 [default = HOME];
    }

    // REPEATED FIELDS
    // The repeated keyword allows us to have multiple numbers
    repeated PhoneNumber phones = 4;
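}

// Our address book file is just a list of people
message AddressBook {
    repeated Person people = 1;
}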

Fields

A proto message is effectively a glorified set of fields. So if you understand fields, you basically know what a proto message is.

Field Types

All the standard simple data types work: bool, int32, int64, float, double, string, plus a few more such as uint32, uint64, and bytes.

If you’re handling some complicated data and those primitive types don’t suffice, you can use other proto messages as field types as well! Notice the lines repeated PhoneNumber phones = 4; and repeated Person people = 1;. The first is a field in the Person message and uses PhoneNumber as its type; the second belongs to the AddressBook message (the one our C++ code uses later to hold the list of people) and uses Person as its type. This “nesting” is what makes it so easy to represent complex data structures as proto messages.

Required v.s. Optional Fields

You’ll notice that some fields are prepended with the optional modifier while others have the required modifier. As suggested by the name, optional fields are optional and required fields are required. That’s probably one of the most redundant sentences I’ve formed in my entire life.
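To make the difference a bit more concrete, here is a minimal sketch in plain C++. It is not protoc output: PersonSketch and DemoRequiredFields are hypothetical stand-ins written by hand. The one real thing it imitates is IsInitialized(), a method on generated message classes that tells you whether every required field has been set.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the generated Person class (not protoc output).
class PersonSketch {
public:
    void set_name(const std::string& value) { name_ = value; has_name_ = true; }
    void set_id(std::int32_t value) { id_ = value; has_id_ = true; }
    void set_email(const std::string& value) { email_ = value; has_email_ = true; }

    bool has_email() const { return has_email_; }

    // Imitates IsInitialized(): true only once every required field is set.
    // The optional email field plays no part in the check.
    bool IsInitialized() const { return has_name_ && has_id_; }

private:
    std::string name_;
    std::string email_;
    std::int32_t id_ = 0;
    bool has_name_ = false;
    bool has_id_ = false;
    bool has_email_ = false;
};

// Usage: the message only counts as complete after both required fields are set.
inline void DemoRequiredFields() {
    PersonSketch person;
    person.set_name("Ada");
    assert(!person.IsInitialized());  // id is still missing
    person.set_id(1);
    assert(person.IsInitialized());   // email can stay unset
}
```

This is also why required fields are painful to relax later: any reader still built against the old definition will keep treating a message without that field as incomplete.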

If you’re interested, here are some important nuances:

Tag Numbers

You can assign different numbers to fields. Notice the = 1 and = 2 in my code snippet. These aren’t literal values of the fields, but unique identifiers (a.k.a. “tags”) for each respective field. The compiler will freak out if you try to use the same tag for several fields in one message. Identity theft is a serious crime.
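If you are wondering where the tag ends up, here is a small sketch based on the protobuf wire encoding: each field value is preceded by a key made of the tag number shifted left by three bits, combined with a wire type. The names MakeKey, FieldNumberOf, WireType, and DemoKeys are my own for illustration and are not part of the protobuf library.

```cpp
#include <cassert>
#include <cstdint>

// Two of the wire types defined by the protobuf encoding.
enum WireType : std::uint32_t {
    kVarint = 0,          // int32, int64, bool, enum, ...
    kLengthDelimited = 2  // string, bytes, nested messages
};

// Builds the key that precedes a field value on the wire.
constexpr std::uint32_t MakeKey(std::uint32_t field_number, WireType type) {
    return (field_number << 3) | static_cast<std::uint32_t>(type);
}

// Recovers the tag number from a key.
constexpr std::uint32_t FieldNumberOf(std::uint32_t key) { return key >> 3; }

// Usage: keys for the first three fields of the Person message.
inline void DemoKeys() {
    assert(MakeKey(1, kLengthDelimited) == 0x0A);  // required string name = 1;
    assert(MakeKey(2, kVarint) == 0x10);           // required int32 id = 2;
    assert(FieldNumberOf(0x1A) == 3);              // optional string email = 3;
}
```

Because the tag is what gets written to the wire (the field name is not), changing a tag number on a field that is already in use breaks compatibility with data written earlier.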

Default Values of Fields

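Any field that hasn’t been set takes a default value when you read it: either the one you declared with [default = ...] (like HOME for the phone type in the example above), or a type-specific fallback such as 0, false, or an empty string.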
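As a rough illustration, the sketch below imitates how the accessors for optional PhoneType type = 2 [default = HOME]; behave. PhoneNumberSketch and DemoDefaults are hypothetical hand-written stand-ins, not generated code; only the accessor names (has_type, type, set_type, clear_type) follow the pattern protoc uses.

```cpp
#include <cassert>

enum PhoneType { MOBILE = 0, HOME = 1, WORK = 2 };

// Hypothetical stand-in for the generated PhoneNumber class.
class PhoneNumberSketch {
public:
    bool has_type() const { return has_type_; }

    // While the field is unset, the getter hands back the declared default.
    PhoneType type() const { return has_type_ ? type_ : HOME; }

    void set_type(PhoneType value) { type_ = value; has_type_ = true; }

    // Clearing a field puts it back into the "unset" state.
    void clear_type() { has_type_ = false; }

private:
    PhoneType type_ = HOME;
    bool has_type_ = false;
};

// Usage: reading an unset field is safe and yields the default.
inline void DemoDefaults() {
    PhoneNumberSketch phone;
    assert(!phone.has_type());
    assert(phone.type() == HOME);
    phone.set_type(WORK);
    assert(phone.has_type() && phone.type() == WORK);
}
```

The point to take away is that the getter alone cannot tell you whether a value was set explicitly; that is what the has_ method is for.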

Repeated Fields

To express a “list” or an “array”, create a repeated field by putting the repeated keyword in front of the field type. This allows a field to be repeated any number of times (including zero). The order of the values is preserved in the protocol buffer.

The opposite of a repeated field is a singular field, but we don’t explicitly specify this. It’s just the default type.

// A list of phone numbers that is optional to provide at signup
repeated string phone_numbers = 7;

Enums

If you know in advance all the values a field can take, use an enum type. Note that the first enum value listed is the default unless you specify otherwise.

// we currently consider only 3 eye colors
enum EyeColor {
    // The default value is UNKNOWN_EYE_COLOR, unless specified otherwise
    UNKNOWN_EYE_COLOR = 0; 
    EYE_GREEN = 1;
    EYE_BROWN = 2;
    EYE_BLUE = 3;
}

// it's an enum as defined above
EyeColor eye_color = 8;
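On the C++ side, a proto enum becomes an ordinary C++ enum with the same numeric values. The sketch below is hand-written under that assumption; EyeColorLabel and DemoEyeColor are hypothetical helpers of mine, not functions from the generated code.

```cpp
#include <cassert>
#include <string>

// Same names and numbers as the EyeColor enum in the .proto snippet.
enum EyeColor {
    UNKNOWN_EYE_COLOR = 0,
    EYE_GREEN = 1,
    EYE_BROWN = 2,
    EYE_BLUE = 3
};

// Hypothetical helper: turns a value into a readable label.
inline std::string EyeColorLabel(EyeColor color) {
    switch (color) {
        case EYE_GREEN: return "EYE_GREEN";
        case EYE_BROWN: return "EYE_BROWN";
        case EYE_BLUE:  return "EYE_BLUE";
        default:        return "UNKNOWN_EYE_COLOR";
    }
}

// Usage: a value-initialized enum is zero, which maps to the "unknown" entry.
inline void DemoEyeColor() {
    EyeColor color = EyeColor();
    assert(color == UNKNOWN_EYE_COLOR);
    assert(EyeColorLabel(EYE_BLUE) == "EYE_BLUE");
}
```

Reserving 0 for an "unknown" entry means an unset field never gets mistaken for a real eye color.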

Packages

Add a line package my.package.name; at the top of the .proto file to place a protocol buffer message into a particular package.

Packages help to prevent name conflicts, just like C++ namespaces (my.package.Person). Be careful when importing protos into other protos; you need to specify the correct package name when accessing packaged Protos, or else your compiler will freak out at you. Yes, I’ve done this many times. The bane of my existence is incorrect import statements.
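Here is a tiny plain C++ sketch of the name conflict that packages prevent. The tutorial namespace mirrors the package in our example; the billing namespace and everything inside both structs are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Mirrors "package tutorial;" from the example .proto file.
namespace tutorial {
struct Person {
    std::string name;
};
}  // namespace tutorial

// A hypothetical second package that also wants a type called Person.
namespace billing {
struct Person {
    int account_id = 0;
};
}  // namespace billing

// Same short name, two unrelated types.
static_assert(!std::is_same<tutorial::Person, billing::Person>::value,
              "the namespaces keep the two Person types apart");

// Usage: both types can live in the same program as long as you qualify them.
inline void DemoPackages() {
    tutorial::Person contact;
    contact.name = "Ada";
    billing::Person customer;
    customer.account_id = 42;
    assert(contact.name == "Ada" && customer.account_id == 42);
}
```

Writing plain Person here would not compile outside either namespace, which is the C++ version of the compiler freaking out at a proto that forgot the package prefix.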

Time to compile - how to make protoc magically create code files from proto files

.proto files are cool and all, but what we really care about is generating code files from them that we can actually work with. Your C++ program doesn’t know what to do with a .proto file.

The protocol buffer compiler (also referred to as protoc) does its magic and generates the code file in whatever programming language you specify. After you download everything you need and have all your .proto files ready to go, invoke the compiler in your terminal with the command below:

protoc --proto_path=IMPORT_PATH --cpp_out=DST_DIR path/to/file.proto

Let’s break down what the heck is going on in that line.

Once you run that command, you’ll get these two files in your specified destination directory:

Protocol Buffer C++ API

Our wonderful protocol buffer compiler generated a custom protocol buffer API from our addressbook.proto file! Look at all the code protoc generated from that terminal command:

  // name field
  inline bool has_name() const;
  inline void clear_name();
  // Normal getter
  inline const ::std::string& name() const;
  inline void set_name(const ::std::string& value); // setter
  inline void set_name(const char* value);          // setter
  // Mutable getter, which gives you a direct pointer to the string
  // so that you can mutate it (thus the name)
  inline ::std::string* mutable_name();

  // id field
  inline bool has_id() const;
  inline void clear_id();
  inline int32_t id() const;
  inline void set_id(int32_t value);

  // email field
  inline bool has_email() const;
  inline void clear_email();
  inline const ::std::string& email() const;
  inline void set_email(const ::std::string& value);
  inline void set_email(const char* value);
  inline ::std::string* mutable_email();

  /**********************/
  // REPEATED FIELD API //
  /*********************/
  // Get the number of phone numbers
  inline int phones_size() const; 
  inline void clear_phones();
  inline const ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >& phones() const;
  inline ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >* mutable_phones();
  inline const ::tutorial::Person_PhoneNumber& phones(int index) const;
  inline ::tutorial::Person_PhoneNumber* mutable_phones(int index);
  inline ::tutorial::Person_PhoneNumber* add_phones();

The compiler will generate getters and setters for each of your fields. The mutable getters return a direct pointer to the value. Note that if the field hasn’t been set yet, calling a mutable getter simply initializes it to an empty instance and returns a pointer to that.
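The “initialize on first touch” behavior is easier to see in code. This is a hand-written sketch of the idea, not what protoc actually emits: NameHolderSketch and DemoMutableGetter are hypothetical, and only the accessor names follow the generated pattern.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a message with a single string field "name".
class NameHolderSketch {
public:
    bool has_name() const { return name_ != nullptr; }

    // Normal getter: never allocates, falls back to an empty string.
    const std::string& name() const {
        static const std::string kEmpty;
        return name_ ? *name_ : kEmpty;
    }

    // Mutable getter: creates an empty value on first use, then returns
    // a pointer the caller can write through.
    std::string* mutable_name() {
        if (!name_) {
            name_.reset(new std::string());
        }
        return name_.get();
    }

private:
    std::unique_ptr<std::string> name_;
};

// Usage: writing through the pointer is enough to set the field.
inline void DemoMutableGetter() {
    NameHolderSketch holder;
    assert(!holder.has_name());
    holder.mutable_name()->append("Ada");
    assert(holder.has_name());
    assert(holder.name() == "Ada");
}
```

One consequence worth knowing: just calling the mutable getter marks the field as set, even if you never write anything through the pointer.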

Repeated Fields API

Repeated fields get more than just setters and getters.
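To get a feel for how those methods fit together, here is a hand-written imitation of the phones accessors listed earlier. It is only a sketch: PhoneEntrySketch, PersonPhonesSketch, and DemoRepeated are hypothetical, and I use a std::deque as the container purely for convenience, whereas the generated code uses RepeatedPtrField.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for the PhoneNumber message.
struct PhoneEntrySketch {
    std::string number;
};

// Hypothetical stand-in for the repeated "phones" field of Person.
class PersonPhonesSketch {
public:
    int phones_size() const { return static_cast<int>(phones_.size()); }
    void clear_phones() { phones_.clear(); }

    // Read-only access to one element.
    const PhoneEntrySketch& phones(int index) const {
        assert(index >= 0 && index < phones_size());
        return phones_[static_cast<std::size_t>(index)];
    }

    // Pointer to one element so the caller can modify it in place.
    PhoneEntrySketch* mutable_phones(int index) {
        assert(index >= 0 && index < phones_size());
        return &phones_[static_cast<std::size_t>(index)];
    }

    // Appends an empty element and returns it for the caller to fill in.
    // std::deque keeps earlier element pointers valid across push_back.
    PhoneEntrySketch* add_phones() {
        phones_.push_back(PhoneEntrySketch());
        return &phones_.back();
    }

private:
    std::deque<PhoneEntrySketch> phones_;
};

// Usage: elements come back in the order they were added.
inline void DemoRepeated() {
    PersonPhonesSketch person;
    person.add_phones()->number = "555-0100";
    person.add_phones()->number = "555-0199";
    assert(person.phones_size() == 2);
    assert(person.phones(0).number == "555-0100");
}
```

Notice there is no set_phones(): you add an element first and then fill it in through the returned pointer, which is exactly what the address book code later in this tutorial does.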

Enums and Nested Classes API

Where are the nested messages and enums that I defined in my .proto file? For us, that would be PhoneNumber and PhoneType respectively.
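One way to picture the answer, judging from the names in the generated snippet above (Person_PhoneNumber) and the names the C++ code later in this tutorial uses (Person::PhoneNumber, Person::MOBILE): the nested types are emitted as top-level types with a Person_ prefix, and Person then exposes them under their short names. The sketch below is a hand-written approximation under that assumption; the exact generated code depends on your protoc version, and the sketch namespace is mine.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace sketch {

// Nested enum, flattened to a top-level name with a Person_ prefix.
enum Person_PhoneType {
    Person_PhoneType_MOBILE = 0,
    Person_PhoneType_HOME = 1,
    Person_PhoneType_WORK = 2
};

// Nested message, flattened the same way.
class Person_PhoneNumber {
public:
    std::string number;
};

class Person {
public:
    // Short aliases so callers can write Person::PhoneNumber and Person::HOME.
    typedef Person_PhoneNumber PhoneNumber;
    typedef Person_PhoneType PhoneType;
    static const PhoneType MOBILE = Person_PhoneType_MOBILE;
    static const PhoneType HOME = Person_PhoneType_HOME;
    static const PhoneType WORK = Person_PhoneType_WORK;
};

// Both spellings name the very same type.
static_assert(std::is_same<Person::PhoneNumber, Person_PhoneNumber>::value,
              "Person::PhoneNumber is an alias, not a new type");

}  // namespace sketch

// Usage: the nested-looking names behave like the flattened ones.
inline void DemoNestedNames() {
    sketch::Person::PhoneNumber phone;
    phone.number = "555-0100";
    assert(phone.number == "555-0100");
    assert(sketch::Person::HOME == sketch::Person_PhoneType_HOME);
}
```

In practice you would use the Person::PhoneNumber spelling in your own code and only run into the Person_ names in compiler errors and generated headers.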

Standard Message Methods

There are even MORE methods in each message class that let you do stuff to the message itself, not just individual fields within it. This protocol buffers compiler is giving you lots of power.

Parsing and Serialization

Finally (I swear, I’ll stop introducing new methods after this section), our lovely compiler generated methods for writing and reading messages.

There are way more parsing and serialization methods, but those are the ones we’ll use in this tutorial!
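If you are curious what serialization actually produces, here is a minimal hand-rolled sketch of the wire format for a single integer field, assuming the standard protobuf varint encoding. The SerializeToOstream() and ParseFromIstream() calls used in the program later in this tutorial do all of this for you; AppendVarint, ReadVarint, SerializeIdOnly, and DemoRoundTrip are names I made up, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends value as a base-128 varint: 7 payload bits per byte,
// with the high bit set on every byte except the last.
inline void AppendVarint(std::uint64_t value, std::string* out) {
    while (value >= 0x80) {
        out->push_back(static_cast<char>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<char>(value));
}

// Reads one varint starting at *pos and advances *pos past it.
// Returns false if the input ends before the varint does.
inline bool ReadVarint(const std::string& in, std::size_t* pos,
                       std::uint64_t* value) {
    std::uint64_t result = 0;
    for (int shift = 0; shift < 64 && *pos < in.size(); shift += 7) {
        const std::uint8_t byte = static_cast<std::uint8_t>(in[(*pos)++]);
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            *value = result;
            return true;
        }
    }
    return false;
}

// Serializes only the id field of Person (tag 2, varint wire type 0).
inline std::string SerializeIdOnly(std::uint32_t id) {
    std::string out;
    AppendVarint((2u << 3) | 0u, &out);  // key: tag number plus wire type
    AppendVarint(id, &out);              // value
    return out;
}

// Usage: write the field, then read the key and value back.
inline void DemoRoundTrip() {
    const std::string bytes = SerializeIdOnly(150);
    std::size_t pos = 0;
    std::uint64_t key = 0;
    std::uint64_t id = 0;
    assert(ReadVarint(bytes, &pos, &key) && (key >> 3) == 2);
    assert(ReadVarint(bytes, &pos, &id) && id == 150);
}
```

An id of 150 comes out as just three bytes (0x10 0x96 0x01), which is a decent hint at why protos are compact compared to text formats.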

Let’s write a message

Instead of staring at the documentation, let’s actually use our protocol buffers API! Let’s say I want my address book app to write personal details to an address book file. In order to do this, I need to do the following:

  1. Create and populate instances of my protocol buffer classes (generated by the protocol compiler).
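  2. Write the info to an output stream.

#include <fstream>
#include <iostream>
#include <string>

// Header generated by protoc from addressbook.proto
#include "addressbook.pb.h"

using namespace std;

// This function fills in a Person message based on user input.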
void PromptForAddress(tutorial::Person *person)
{
    cout << "Enter person ID number: ";
    int id;
    cin >> id;
    person->set_id(id);
    cin.ignore(256, '\n');

    cout << "Enter name: ";
    getline(cin, *person->mutable_name());

    cout << "Enter email address (blank for none): ";
    string email;
    getline(cin, email);
    if (!email.empty())
    {
        person->set_email(email);
    }

    while (true)
    {
        cout << "Enter a phone number (or leave blank to finish): ";
        string number;
        getline(cin, number);
        if (number.empty())
        {
            break;
        }

        tutorial::Person::PhoneNumber *phone_number = person->add_phones();
        phone_number->set_number(number);

        cout << "Is this a mobile, home, or work phone? ";
        string type;
        getline(cin, type);
        if (type == "mobile")
        {
            phone_number->set_type(tutorial::Person::MOBILE);
        }
        else if (type == "home")
        {
            phone_number->set_type(tutorial::Person::HOME);
        }
        else if (type == "work")
        {
            phone_number->set_type(tutorial::Person::WORK);
        }
        else
        {
            cout << "Unknown phone type.  Using default." << endl;
        }
    }
}

// Main function:  Reads the entire address book from a file,
//   adds one person based on user input, then writes it back out to the same
//   file.
int main(int argc, char *argv[])
{
    // Verify that the version of the library that we linked against is
    // compatible with the version of the headers we compiled against.
    GOOGLE_PROTOBUF_VERIFY_VERSION;

    if (argc != 2)
    {
        cerr << "Usage:  " << argv[0] << " ADDRESS_BOOK_FILE" << endl;
        return -1;
    }

    tutorial::AddressBook address_book;

    {
        // Read the existing address book.
        fstream input(argv[1], ios::in | ios::binary);
        if (!input)
        {
            cout << argv[1] << ": File not found.  Creating a new file." << endl;
        }
        else if (!address_book.ParseFromIstream(&input))
        {
            cerr << "Failed to parse address book." << endl;
            return -1;
        }
    }

    // Add an address.
    PromptForAddress(address_book.add_people());

    {
        // Write the new address book back to disk.
        fstream output(argv[1], ios::out | ios::trunc | ios::binary);
        if (!address_book.SerializeToOstream(&output))
        {
            cerr << "Failed to write address book." << endl;
            return -1;
        }
    }

    // Optional:  Delete all global objects allocated by libprotobuf.
    google::protobuf::ShutdownProtobufLibrary();

    return 0;
}

We wrote a message, so now let’s read one.