There are many ways to extend Apache Spark and one of the easiest is with functions that manipulate one or more columns in a DataFrame. When considering different Spark function types, it is important not to overlook the full set of options available to developers.
Beyond the two types of functions described in the previous link (simple Spark user-defined functions (UDFs) and functions that operate on Column), there are two more types of UDFs: user-defined aggregate functions (UDAFs) and user-defined table-generating functions (UDTFs). sum() is an example of an aggregate function and explode() is an example of a table-generating function. The former processes many rows to produce a single value. The latter uses value(s) from a single row to “generate” many rows. Spark supports UDAFs directly and UDTFs indirectly, by converting them to Generator expressions.
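As a loose analogy in plain Ruby (not Spark code), an aggregate collapses many values into one, while a table-generating function expands one value into many:

rows = [1, 2, 3]
rows.sum                          # aggregate-like (sum): many rows -> one value => 6
rows.flat_map { |n| [n] * n }     # explode-like: one row -> many rows => [1, 2, 2, 3, 3, 3]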
Beyond all types of UDFs, Spark’s most exciting functions are its native functions, which are how the logic of most of Spark’s Column and SparkSQL functions is implemented. Internally, Spark native functions are nodes in the Expression trees that determine column values. Very loosely speaking, an Expression is the internal Spark representation of a Column, just as a LogicalPlan is the internal representation of a data transformation (Dataset/DataFrame).
Native functions, while a bit more involved to create, have three fundamental advantages: better user experience, flexibility and performance.
The better user experience and flexibility come from native functions having a lifecycle with two distinct phases:
Analysis, which happens on the driver, while the transformation DAG is created (before an action is run).
Execution, which happens on executors/workers, while an action is running.
The analysis phase allows Spark native functions to dynamically validate the type of their inputs to produce better error messages and, if necessary, change the type of their result. For example, the return type of sort_array() depends on the input type. If you pass in an array of strings, you’ll get an array of strings. If you pass in an array of ints, you’ll get an array of ints.
A user-defined function, which internally maps to a strongly-typed Scala/JVM function, cannot do this. We can parameterize an implementation by the type of its input, but a registered UDF still has a single, fixed return type.
Think of native functions like macros in a traditional programming language. The power of macros also comes from having a lifecycle with two execution phases: compile-time and runtime.
Performance comes from the fact that Spark native functions operate on the internal Spark representation of rows, which, in many cases, avoids serialization/deserialization to “normal” Scala/Java/Python/R datatypes. For example, internally Spark strings are UTF8String. Further, you can choose to implement the runtime behavior of a native function by code-generating Java and participating in whole-stage code generation (reinforcing the macro analogy) or as a simple method.
Working with Spark’s internal (a.k.a., unsafe) datatypes does require careful coding but Spark’s codebase includes many dozens of examples of native functions: essentially, the entire SparkSQL function library. I encourage you to experiment with native Spark function development. As an example, take a look at array_contains().
For user experience, flexibility and performance reasons, at Swoop we have created a number of native Spark functions. We plan on open-sourcing many of them, as well as other tools we have created for improving Spark productivity and performance, via the spark-alchemy library.
The startup anti-patterns section of my blog summarizes the repeatable ways startups waste time & money and, often, fail. Learning from startup failure is valuable because there are many more examples of failures than of successes: (anti-)patterns become more noticeable and easier to verify.
For the same reason, it’s useful to read the failure post-mortems founders write. It takes meaningful commitment to discover the posts and to distill the key insights from the sometimes lengthy prose (an exercise in therapy at least as much as reporting of the facts). Luckily, there is a shortcut: the CB Insights summary of startup failures. It’s part table of contents and part CliffsNotes. It can help you pick the ones that are worth reading in full.
Some of the insights from post-mortems come from understanding the emotional biases of founders, CXOs and investors. In the uncertain startup execution environment these biases have the ability to affect behavior much more than in situations where reality is inescapable and readily quantifiable.
Speaking of emotional biases, Bill Gurley’s post on the Unicorn pressure cooker now that the magic has worn off is a must-read.
When lots of players are lining up to feed at the advertising money trough, it sometimes becomes difficult to separate reality from marketing hype. The programmatic hype is that it brings efficiency to advertising (and does your laundry to boot). The reality is very different. While there are many benefits to programmatic advertising, it also causes and exacerbates many problems in the advertising ecosystem that hurt publishers, advertisers and consumers in the long run. The root cause is that the leading open programmatic protocol, OpenRTB, fails to align marketplace interests. This is what happens when adtech optimizes for volume as opposed to quality.
The question of luck came up and a commenter linked to my work on data-driven patterns of successful angel investing with the subtext that being data driven implies index investing. That’s certainly not what I believe or recommend.
The goal of my Monte Carlo analysis was to shine a light on the main flaw I’ve seen in casual angel investing, which is the angel death spiral:
Make a few relatively random investments
Lose money
Become disillusioned
Give up angel investing
Tell all your friends angel investing is terrible
Well, you can’t expect a quick win out of a highly skewed distribution (startup exits are a very skewed distribution). That’s just math and math is rather unemotional about these things.
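To make that concrete, here is a small, illustrative Monte Carlo sketch in Ruby. The exit distribution below is made up for illustration only, but it is skewed the way startup exit distributions are, and its expected value per deal is positive (about 1.9x). Even so, small portfolios lose money a large fraction of the time.

# Illustrative only: a made-up, highly skewed distribution of exit multiples.
OUTCOMES = [
  [0.0, 0.50],   # 50% of deals: total loss
  [0.5, 0.20],   # 20%: partial loss
  [2.0, 0.20],   # 20%: modest win
  [10.0, 0.09],  # 9%:  big win
  [50.0, 0.01]   # 1%:  outlier
]

def random_multiple
  r = rand
  OUTCOMES.each do |multiple, probability|
    return multiple if r < probability
    r -= probability
  end
  OUTCOMES.last.first
end

def portfolio_return(size)
  (1..size).sum { random_multiple } / size.to_f
end

[5, 25, 100].each do |size|
  trials = 10_000
  losers = trials.times.count { portfolio_return(size) < 1.0 }
  puts "#{size}-deal portfolios lose money #{100.0 * losers / trials}% of the time"
end

Run it a few times: the 5-deal portfolios lose money far more often than the 100-deal ones, which is the point about shots on goal.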
You can get out of the angel death spiral in one of two ways. You can take the exit distribution for what it is. In that case, you need many more shots on goal (dozens of investments) to ensure a much better outcome. Alternatively, you can try to pick your investment opportunities from a different, better distribution. That’s what I like to do and this is what Jerry is advocating.
The main influencer of return for angel investors is the quality of deal flow that you can win. Why? Because this changes the shape of your personal exit distribution and, in most cases not involving unicorn hunting, improves your outcomes at any portfolio size.
As an investor, you sell cash + you and buy equity. To see better deals and win them you need to increase the value of “you.” After all, anyone’s cash is just as good as everyone else’s. The easiest way to do this is via deep, real, current expertise and relationships that are critical to the success of the companies you want to invest in, backed by a reputation for being a helpful, easy-to-work-with angel. One way to maximize the chance of this being true is to follow some of Jerry’s advice:
Invest in markets that you know
Make multiple investments in such markets
Help your companies
There is a bootstrap problem, however, when new markets are concerned. How do you get to know them? Well, one way to do it is to make a number of investments in a new space. In this case, your investments have dual value: in addition to the financial return expectations (which should be reduced) you have the benefit of learning. Yes, it can be an expensive way to learn but it may be well worth it when you consider the forward benefits that affect the quality of your deal flow and your ability to win deals.
As an aside, I’ve always advised angels to not invest just for financial return. Do angel investing to increase your overall utility (in the multi-faceted economic theory sense) and do it so that it generates a return you are happy with.
In summary:
Don’t attempt to pick unicorns as an angel.
Where you can get high-quality deal flow you can win, do a smaller number of deals.
Where needed, and if you can afford it, use higher-volume investing as a way to signal interest in a market and to learn about it so that you can get higher-quality deal flow.
At Swoop we have many terabytes of JSON-like data in MongoDB, Redis, ElasticSearch, HDFS/Hadoop and even Amazon Redshift. While the internal representations are typically not JSON but BSON, MsgPack or native encodings, when it comes time to move large amounts of data for easy ad hoc processing I often end up using JSON and its bulk cousin, JSONlines. This post is about what you can quickly do with this type of data from the command line.
The best JSON(lines) command line tools
There has been a marked increase in the number of powerful & robust tools for validating and manipulating JSON and JSONlines from the command line. My favorites are:
jq: a blazingly fast, C-based stream processor for JSON documents with an easy yet powerful language. Think of it as sed and awk for JSON but without the 1970s syntax. Simple tasks are trivial. Powerful tasks are possible. The syntax is intuitive. Check out the tutorial and manual. Because of its stream orientation and speed, jq is the most natural fit when processing large amounts of JSONlines data. If you want to push the boundaries of what is sane to do on the command line there are conditionals, variables and UDFs.
underscore-cli: this is the Swiss Army knife for manipulating JSON on the command line. Based on Node.js, it supports JavaScript and CoffeeScript expressions with built-in functional programming primitives from the underscore.js library, relatively easy JSON traversal via json:select and more. This also is the best tool for debugging JSON data because of the multitude of output formats. A special plus in my book is that underscore-cli supports MsgPack, which we use in real-time flows and inside memory-constrained caches.
jsonpath: Ruby-based implementation of JSONPath with a corresponding command line tool. Speedy it is not but it’s great when you want JSONPath compatibility or can reuse existing expressions. There are some neat features such as pattern-based tree replace operations.
json (a.k.a., jsontool): another tool based on Node.js. Not as rich as underscore-cli but has a couple of occasionally useful features having to do with merging and grouping of documents. This tool also has a simple validation-only mode, which is convenient.
Keep in mind that you can modify/extend JSON data with these tools, not just transform it. jsontool can edit documents in place from the command line, which can be useful, for example, for quickly updating properties in JSON config files.
JSON and 64-bit (BIGINT) numbers
JSON has undefined (as in implementation-specific) semantics when it comes to dealing with 64-bit integers. The problem stems from the fact that JavaScript does not have a 64-bit integer type: its numbers are IEEE-754 doubles, which can represent integers exactly only up to 2^53. There are Python, Ruby and Java JSON libraries that have no problem with 8-byte integers but I’d be suspicious of any Node.js implementation. If you have this type of data, test the edge cases with your tool of choice.
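A quick way to check a stack is to round-trip a value near the 64-bit boundary and see whether it survives. Here is what that looks like in Ruby (illustrative; Ruby integers are arbitrary precision, so nothing is lost):

require 'json'

# 2**63 - 1 cannot be represented exactly as an IEEE-754 double,
# which is the only number type JavaScript has.
doc = '{"id":9223372036854775807}'
parsed = JSON.parse(doc)
puts parsed['id']            # => 9223372036854775807
puts JSON.generate(parsed)   # => {"id":9223372036854775807}

A JavaScript-based tool is liable to turn that id into 9223372036854775808 (or worse) somewhere along the way.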
JSONlines validation & cleanup
There are times when JSONlines data does not come clean. It may include error messages or a mix of STDOUT and STDERR output (something Heroku is notorious for). At those times, it’s good to know how to quickly validate and clean up a large JSONlines file.
To clean up the input, we can use a simple sed incantation that removes all lines that do not begin with [ or {, the start of a JSON array or object. It is hard to think of a bulk export command or script that outputs primitive JSON types. To validate the remaining lines, we can filter through jq and output the type of the root object.
This will generate output on STDERR with the line & column of any bad JSON.
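If you would rather stay out of sed/jq, a rough Ruby equivalent of the cleanup-and-validate pass might look like this (illustrative sketch; data.jsonl is a hypothetical input file):

require 'json'

# Skip lines that do not start with [ or { and report any that fail to parse.
File.foreach('data.jsonl').with_index(1) do |line, lineno|
  next unless line =~ /\A\s*[\[{]/
  begin
    JSON.parse(line)
  rescue JSON::ParserError => e
    warn "line #{lineno}: #{e.message}"
  end
end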
Pretty printing JSON
Everyone has their favorite way to pretty print JSON. Mine uses the default jq output because it comes in color and because it makes it easy to drill down into the data structure. Let’s use the GitHub API as an example here.
# List of Swoop repos on GitHub
API='https://api.github.com/users/swoop-inc/repos'
alias swoop_repos="curl $API"
# Pretty print the list of Swoop repos on GitHub in color
swoop_repos | jq '.'
JSON arrays to JSONlines
GitHub gives us an array of repo objects but let’s say we want JSONlines instead in order to prepare the API output for input into MongoDB via mongoimport. The --compact-output (-c) option of jq is perfect for JSONlines output.
# Swoop repos as JSONlines
swoop_repos | jq -c '.[]'
The .[] filter breaks up an array of inputs into individual inputs.
Filtering and selection
Say we want to pull out the full names of Swoop’s own repos as a JSON array. “Own” in this case means not forked.
In both cases we are not saving that much code but not having to create files just keeps things simpler. For comparison, here is the code to output the names of Swoop’s own GitHub repos in Ruby.
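The original gist is not reproduced here, but a minimal sketch of that Ruby version, using only the standard library, might look like this (illustrative; it ignores API pagination):

require 'json'
require 'net/http'

# Fetch the repo list and keep only the repos that are not forks.
repos = JSON.parse(Net::HTTP.get(URI('https://api.github.com/users/swoop-inc/repos')))
own = repos.reject { |repo| repo['fork'] }.map { |repo| repo['full_name'] }
puts JSON.pretty_generate(own)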
Math is beautiful and, sometimes, math becomes even more beautiful with the help of a bit of computer science. My favorite proof of all time combines the two in just such a way.
This is an old problem dating back to Cantor with many proofs: show that the rational numbers are countable, i.e., that they can be put in one-to-one correspondence with a subset of the natural numbers.
The traditional proof uses a diagonal argument: geometric insight that lays out the numerator and the denominator of a rational number along the x and y axes of a plane. The proof is intuitive but cumbersome to formalize.
There is a short but dense proof that uses a Cartesian product mapping and another theorem. Personally, I don’t find simplicity and beauty in referring to complex things.
There is a generative proof using a breadth-first traversal of a Calkin-Wilf tree (a.k.a. the H tree, because of its shape). Now we are getting some help from computer science but not in a way that aids simplicity.
We can do much better.
Proof:
Given a rational number p/q in canonical form, write it as the hexadecimal number pAq. QED
Examples:
0/1 → 0A1 (161 in decimal)
3/4 → 3A4 (932 in decimal)
12/5 → 12A5 (4773 in decimal)
Code (because we can):
def to_natural(p, q)
  # Read the decimal digits of p, the digit A, then the decimal digits of q as one hexadecimal number.
  "#{p}A#{q}".to_i(16)
end
It is trivial to extend the generation to all rationals, not just the positive ones, as long as we require p/q to be in canonical form:
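One possible encoding (my sketch, not the original): keep using A as the separator and reserve the hex digit B as a sign marker. Encodings of negative rationals start with B while non-negative ones start with a decimal digit, so the mapping stays one-to-one.

def rational_to_natural(p, q)
  # Assumes canonical form: q > 0, the sign carried by p, gcd(|p|, q) == 1.
  prefix = p < 0 ? "B" : ""
  "#{prefix}#{p.abs}A#{q}".to_i(16)
end

rational_to_natural(3, 4)    # => 932
rational_to_natural(-3, 4)   # => 45988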
To me, this CS-y proof feels much simpler and more accessible than any of the standard math-y proofs. It is generative, reducible to a line of code, and does not require knowledge of any advanced concepts beyond number systems that are not base 10, which are a straightforward, intuitive extension of base-10 positional arithmetic.
Note: we don’t need to use hexadecimal. The first time I heard this proof it was done in base 11 but I feel that using an unusual base system does not make the proof better.
At Swoop we use Redis extensively for caching, message processing and analytics. The Redis documentation can be pithy at times and recently I found myself wanting to look in more depth at the Redis wire protocol. Getting everything set up the right way took some time and, hopefully, this blog post can save you that hassle.
Redis MONITOR
The Redis logs do not include the commands that the database is executing but you can see them via the MONITOR command. As a habit, during development I run redis-cli MONITOR in a terminal window to see what’s going on.
Getting set up with WireShark
While normally we’d use a debugging proxy such as Charles to look at traffic in a Web application, here we need a real network protocol analyzer because Redis uses a TCP-based binary protocol. My go-to tool is WireShark because it is free, powerful and highly customizable (including Lua scriptable). The price for all this is dealing with an X11 interface from the last century and the expectation that you passed your Certified Network Engineer exams with flying colors.
To get going:
WireShark needs X11. Since even Mac OS X stopped shipping X11 by default with Mountain Lion, you’ll most likely want to grab a copy, e.g., XQuartz for OS X or Xming for Windows.
Start WireShark. If you see nothing, it may be because the app shows as a window associated with the X11 server process. Look for that and you’ll find the main application window.
Redis protocol monitoring
WireShark’s plugin architecture allows it to understand dozens of network protocols. Luckily for us, jzwinck has written a Redis protocol plugin. It doesn’t come with WireShark by default, so you’ll need to install it by following the instructions that come with the plugin.
If WireShark is running, restart it to pick up the Redis plugin.
Now let’s monitor the traffic to a default Redis installation (port 6379) on your machine. In WireShark, you’ll have to select the loopback interface.
To reduce the noise, filter capture to TCP packets on port 6379. If you need more sophisticated filtering, consult the docs.
Once you start capture, it’s time to send some Redis commands. I’ll use the Ruby console for that.
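The original gist is not shown here; a minimal equivalent using the redis gem (gem install redis) looks roughly like this, with made-up keys and values:

require 'redis'

redis = Redis.new(host: 'localhost', port: 6379)
redis.set('greeting', 'hello from irb')   # => "OK"
redis.get('greeting')                     # => "hello from irb"
redis.incr('counter')                     # => 1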
In WireShark you’ll be able to see the binary data moving between the client and Redis with the benefit of the command and its parameters clearly visible.
Check out the time between request and response. Redis is fast!
I am roaming the halls of Google I/O 2013 and wondering whether Google’s platform passes the ecosystem test.
… no platform has become hugely successful without a corresponding ecosystem of vendors building significant businesses on top of the platform. Typically, the combined revenues of the ecosystem are a multiple of the revenues of the platform.
So much activity but what’s the combined revenue of the businesses building on top of Android, Chrome & Apps?
I’ve been asked to explain how online ads are delivered many times and every time I’m surprised by the complexity of covering even the most basic elements of how ads appear on Web pages. Since Wikipedia’s article on ad serving is not much help, I’ll try to explain one common way ads are delivered using a concrete example.
Side note: this is not how Swoop works. At Swoop we use a much simpler and more efficient model because we’ve built an end-to-end system. This eliminates the need for lots of different systems to touch (+ cookie + track) users. It also allows us to create deeper and more relevant matches by placing Swoop content not in arbitrary fixed slots but in dynamic slots right next to the part of a page it relates to. If you are looking for an analogy, think about Google AdWords on SERPs. It’s an end-to-end system where Google has complete control over ad placement and no ads are shown if there are no relevant ads to show.
Tools of the trade
If you want to know how adtech works, there is no better tool than Ghostery. Ghostery was created by my friend David Cancel and, later, when he was starting Performable (now part of Hubspot), my previous startup, Evidon, became the custodian of the Ghostery community. Ghostery will show you, page-by-page, all the different adtech players operating on a site. For example, on Boston.com’s sports page, there are 35 (!) separate adtech scripts running today.
Ghostery will show you what is happening but not how it happened. If you are technical and want to understand the details of how ad delivery works, there is no better tool than a debugging proxy such as Charles or Fiddler. Just be prepared for the need to use code deobfuscators. If you don’t have time for wading through obfuscated code and you really want to know what’s going on with your site(s) or campaign(s), it is worth taking a look at Evidon Encompass. It’s an advanced analytics suite built on top of the Ghostery data.
The example
The example we’ll use is the arthritis page on Yahoo!’s health network. We will focus on the leaderboard ad at the top, which is a Celebrex (arthritis drug) ad from Pfizer.
What Yahoo! sent the browser
The initial response my browser got from Yahoo!’s server included a chunk of HTML that sets up the leaderboard ad unit.
This content was most likely emitted by the Yahoo publishing system without direct coordination with the Yahoo ad server, instead using conventions about categories, page types, etc., hence parameters like rs=cnp:healthline that you see on the URLs.
Display advertising units use standard IAB formats. In this case, we are dealing with a 728×90 leaderboard unit. The DIV with id yahoohealth-n-leaderboard-ad sets up the location where the ad unit will be displayed. The DIV under it serves the dubious function of controlling some styling related to the ad content.
Beyond this there are two things going on here. The first is the delivery of the ad script and the second is the delivery of a tracking pixel via a tracking pixel script.
Tracking pixels
Tracking pixels are 1×1 invisible images served from highly-parameterized URLs. They are not used for their content but for the request they generate to a server. The request parameters are used for record-keeping and the response could be used to cookie the user, though this did not happen in this case.
The tracking pixel is delivered via the script inside the <center> tag. Its contents are described below.
The script uses the JavaScript document.write function to write some HTML into the page. In this case, the HTML is an invisible image (display: none, height: 0, width: 0) whose source is the tracking pixel, a very long, highly parameterized URL. A lot of data gets sent, most likely to record the impression opportunity parameters.
Yahoo! ad delivery script
There are two ways to deliver an ad unit. The preferred way is via a script. If scripting is disabled in the browser, however, Yahoo doesn’t want to lose the ad impression opportunity and so there is the <noscript> option to show the ad in an iframe, probably as an image.
The code for the Yahoo! ad delivery script, which comes from the Yahoo! ad server, does several things, covered in the sections below.
AdChoice came about in 2010 as the online advertising industry’s response to FTC pressure to rein in some poor privacy practices and provide consumers with more transparency and choice when it comes to interest-based advertising, a.k.a., behavioral targeting (BT in adtech parlance).
The AdChoice icon is a triangle with an i in it. Its color can vary; Yahoo!’s is gray. Next time you see an ad with it, click on the AdChoice notice. You should see information about who targeted the ad at you and get some options to opt out of interest-based advertising. We started Evidon back in 2009 to bring more transparency to adtech and we helped create AdChoice. Evidon is now the leading independent player in this space.
In the case of the Celebrex ad from our example, the AdChoice icon is tied to a very long URL.
If you click on the AdChoice icon, Yahoo! will record information about which ad you are selecting to learn more about and then redirect you to the page at the end of the URL, which is the Yahoo learn more about this ad page. The long URL is just for bookkeeping.
BTW, the reason why you don’t see AdChoice notice with Swoop is because Swoop does not do any behavioral targeting at this time. Still, because we want to make it clear that Swoop is serving content, you’ll see our logo on our units.
Cache busting
After the AdChoice notice setup comes a line of script that creates a random number. This is used for cache busting.
A cache-buster is a unique piece of code that prevents a browser from reusing an ad it has already seen and cached, or saved, to a temporary memory file.
Adding a random number to a URL does that nicely.
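The idea, sketched in Ruby rather than the JavaScript the ad script actually uses (the URL and parameter name are made up):

require 'securerandom'

ad_url = 'https://ads.example.com/leaderboard'   # hypothetical ad URL
busted = "#{ad_url}?rnd=#{SecureRandom.random_number(10**12)}"
# Every request now has a unique URL, so the browser cannot serve it from cache.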
Google ad delivery script activation
Next, a script tag loads Google’s ad delivery script from ad.doubleclick.net. We will look at this later on.
Yahoo impression tracking
Remember how Yahoo already fired one tracking pixel to record the impression opportunity? Here, at the end of the script, they fire another tracking pixel, but this time the purpose is to record the impression of the Google ad. As before, you can see lots of data being passed.
As before, in the case that the browser does not have JavaScript enabled, Yahoo doesn’t want to miss the opportunity to deliver an ad, which is why they have the option to display the Google ad as an image.
In that case, Yahoo is also positioned to capture the click and then redirect to Google. This is achieved by wrapping the image (<img>) in a link (<a>). Getting click feedback data is valuable for Yahoo as it allows it to optimize better. If the unit is sold on a cost-per-click (CPC) basis, then getting click data is a requirement for good record-keeping.
Google/DoubleClick ad delivery script
It’s time for us to take a look at what Google’s ad delivery script does. Alas, the guys at Google don’t want to waste bandwidth so they’ve packed everything into a single unreadable document.write call. You can scroll to the right for a very long time…
Unpacking that document.write reveals what Google is actually trying to write into the HTML page.
Don’t worry about the volume of code. There are basically two things going on here: delivering a Flash ad and lots of third party ad verification.
Delivering a Flash ad
Flash ads are richer and more interactive but there are browsers where Flash ads don’t do so well. The first part of the Google/DoubleClick ad delivery script is about carefully determining whether a Flash ad unit can be used and falling back to images otherwise. As before, all clicks are tracked via redirects.
Third party verification
We saw Yahoo! attempting to fire three types of tracking pixels: (a) for impression opportunities, (b) for actual impressions and, in the case of no scripting, (c) for clicks. This is to help optimize the performance of Yahoo!’s ad network. This is first party verification. Google/DoubleClick does the same with its own systems.
Third party verification happens when the advertiser asks the delivery network, in this case Google/DoubleClick, to include additional verification tags (scripts) to prove that the campaign is delivered based on its configured parameters.
In the case of this Celebrex campaign, Pfizer is using four separate verification vendors. At the top level we have only Nielsen NetRatings and DoubleVerify, however NetRatings’s script loads AdSafe as well as Facebook in the pattern we are familiar with: a script that writes out <script> tags to load more scripts.
Putting it all together
Let’s try to piece together the requests that allow this one single ad unit for Celebrex to appear:
Yahoo ad unit delivery script
    Google ad delivery script
        Flash movie
            ??? (not easy to track Flash traffic)
        Nielsen NetRatings tracking script
            AdSafe pixel
            Facebook iframe
                Facebook tracking
                    ??? (did not analyze)
        DoubleVerify script
            Tracking script (like a pixel)
    Yahoo impression tracker
        Tracking pixel
Yahoo impression opportunity tracker
    Tracking pixel
All in all, 13 separate HTTP requests to 6 separate companies, not counting redirects and cacheable scripts. With this much complexity and independent parties doing their own accounting, it’s no surprise the display advertising value chain is in a bit of a mess right now.
One of the fastest ways for a startup to grow has always been to ride on the shoulders of a successful platform: from Microsoft/OSS in software to AWS in cloud computing to iOS/Android in mobile to Facebook/Twitter/Pinterest in social to IAB/Google in advertising and the many SaaS players. Betting on a platform focuses product development both because of technology/API choices and because of the automatic reduction in the customer/user pool. Also, platforms that satisfy the ecosystem test help the startups that bet on them make money. That is, until they don’t.
I’ve been involved with three startups that have been significantly helped by platforms initially and then hurt by them. Two cases involved Microsoft. One case involved Twitter. The first time it happened, our eyes were closed and it hurt. It prompted me to learn more about how platform companies operate and how they use and abuse partners—companies small and large—to help them compete with other platforms. The basic reality is that platform companies will do whatever it takes to win and they typically don’t care much about the collateral damage they cause.
Just like hacking fast & loose, which accumulates technical debt, accelerating the growth of a startup by leveraging a platform may come with substantial platform risk.
Note: links to undocumented anti-patterns will take you to the main list.
Startup Anti-Pattern: Platform Risk
What it is
Platform risk is the debt associated with adopting a platform. Platform risk becomes an anti-pattern when three conditions are met:
The platform dependency becomes critical to company operations.
The company is unaware of the extent of the risk it has assumed.
There is increased likelihood of adverse platform change.
Here are the top 10 sub-patterns of platform risk hurting startups that I’ve seen:
Lock-in. Startups that adopt a closed platform can be locked into their choice typically for the duration of the company’s life. This is not a problem until the need arises to support another platform. At that point the time & cost associated with the work could be substantial, especially if the core architecture was not designed with this in mind. In many cases, it is cheaper to start from scratch.
Forced upgrades. When software came in boxes, if you didn’t like the new version or if it was incompatible with your own software, you and your customers did not need to upgrade. You could take the time to make things work and upgrade on your own schedule. In the platform-as-a-service world, you do not have this option. Instead, forced upgrades are the norm. You have to deal with them on the platform vendor’s schedule, which may be quite inconvenient and costly. You do not have the option to ignore the update. Vendors vary widely in how they manage their partner ecosystems with respect to forced upgrades. Google has been pretty good when it comes to its APIs and has acted like a not-so-benevolent dictator when it comes to non-API-related behaviors of services such as search and advertising. Facebook and Google have both been accused of manipulating the behavior of their systems to force businesses to spend more money in their advertising platforms. In the case of Facebook, the issue has been pay-to-play for likes. Google has come repeatedly under fire for manipulating the search user experience to (a) shape traffic away from large publishers it competes with and (b) reduce advertiser choices and drive more ad dollars to AdWords. If your business depends on SEO or SEM, these changes can be very significant. The former CEO of a large advertising agency once summarized this as “Google giveth and Google taketh away.”
Forced platform switch. A forced platform switch usually comes as a side effect of platform vendors playing turf wars. For example, Apple severely hurt Adobe’s Flash platform as a way to limit write once, run anywhere options in mobile, thus also slowing Android’s adoption a bit. Thousands of small game & other types of content developers in the Web & Flash ecosystem were affected and had to either abandon iOS development or find new costly talent.
The partner dance. The partner dance is most commonly seen in enterprise software. It was popularized by Microsoft. As one former MS exec described it to me: “first you design your partners in and then you design your partners out.” During the design-in phase, a platform vendor partners with, and in some cases spends meaningful resources helping, an innovative startup whose solutions compete with those of another platform vendor. As the platform company’s own product roadmap matures, it designs its partners out and starts directly competing with them.
Swinging. Swinging is a variation of the partner dance where rather than competing directly with a startup, the platform vendor partners with one of the startup’s competitors. Some years ago I was on the board of a European company that was Microsoft’s preferred partner in a fast-growing market. After winning against much bigger players such as EMC and IBM, the startup convinced MS that there was a big business to be built in this market. At that point MS promptly terminated the startup’s preferred status and partnered with a much bigger competitor. We were expecting the move: Microsoft now wanted to move hundreds of millions of dollars of its platform products in this space and the startup, despite closing significant business, could not operate at this scale. The Facebook/Zynga saga is an example from the online world.
Hundred flowers. The name of this sub-pattern comes from the famous Chairman Mao quote “Let a hundred flowers blossom.” Mao fostered “innovation” in Chinese socialist culture—open dissent—and then promptly executed many of the innovators. It seems that Twitter, Facebook and other social platforms have studied the Chairman quite well, judging by how efficiently they have moved from relying on the adopters of their APIs for growth and traffic to restricting their access and hurting their businesses. The prototypical example is Twitter driving much of its traffic from third party clients and then moving against them.
Failure to deliver. Startups pick platforms not just because of their current capabilities and distribution but also because of their expected future capabilities and distribution power. If the platform does not deliver, the startup’s ability to execute can be significantly hampered. One of the most common use cases of this sub-pattern relates to open-source platforms where the frequent lack of a single driving force behind a product or service could lead to substantial delays. At various points, teams I’ve been involved with have had to dedicate significant resources to accelerate development of OSS, e.g., Apache Axis, which turned out to be the most popular Web services engine, and Merb, whose adoption turned out to be a bad platform decision for my startup. It’s rewarding work but it also usually is plumbing work that generates little business value.
Divergence. Divergence is a form of failure to deliver rooted in a change of strategic direction of the platform. Divergence can be very costly over time and difficult to diagnose correctly because it happens very slowly. The analogy that comes to mind is of a frog in a pot of water on the stove. I knew a startup with a neat idea on how to provide significant value on top of the Salesforce platform APIs. They just needed one improvement that was “on the roadmap.” The improvement remained on the Salesforce roadmap for more than two years as the startup ran out of money. The hidden reason was that Salesforce had grown less interested in the use case. Another Salesforce-related example is the recent hoopla about the unannounced changes in Heroku’s routing mechanism, which cost RapGenius a lot of money. In this case, the reason was Heroku moving from being a great place to host Ruby apps to being a great place to host any apps and in the process becoming a less great place to host Ruby apps.
Poison pill. A platform choice made years ago could turn out to be a poison pill when it comes to selling your company to another larger platform vendor. As an example, consider the case of Google buying a company whose products are built on Microsoft’s .NET platform or Microsoft buying a SaaS collaboration solution that runs on Google Apps. Alas, most startups do not think about the exit implications as they make platform decisions early on.
Exit pressure. Platform companies may sometimes exert substantial pressure on partners when they want to acquire them. When Photobucket did not want to sell to MySpace they somehow experienced “integration problems” with MySpace, which affected their traffic. The sale soon completed. This goes to show that talking softly while controlling the source of traffic tends to deliver results. This week we learned that Twitter’s acquisition of social measurement service Bluefin Labs involved some threats, which must have been perceived as credible since 90% of Bluefin’s data came from Twitter.
Diagnosis
Good diagnosis of the platform risk anti-pattern is exceptionally difficult because it requires predicting the future path of a platform as well as those of the platforms it competes with. The basic strategy for diagnosing this anti-pattern involves three parts:
Investment in ongoing deep learning about the platform and its key competitors. This should cover the gamut from history to technology to business model to the personalities involved.
Developing relationships with industry experts with a deep perspective of the platform, whose businesses, like telltales on a sailboat, in some way provide leading indicators of platform change. You don’t want just smart people. You want people with proprietary access and data. For enterprise software try preferred channel partners. For open-source software try high-end OSS consultants. For advertising, find the right type of agency.
Network into the group(s) responsible for the platform, both involving people currently on the job as well as senior people who’ve recently left. This latter group has been the most helpful in my experience.
Ignorance is the most common anti-pattern that makes the diagnosis of platform risk difficult.
Misdiagnosis
A common misdiagnosis stems from failure to consider the effects of competitive platforms on the platform a startup has adopted. Sometimes it is these competitors’ actions that trigger the negative consequences, as was the case of Apple’s decisions hurting the Adobe Flash developer ecosystem.
Refactored solutions
Once diagnosed, the key question regarding the platform risk anti-pattern is whether anything at all should be done about it. Most companies choose to live with the risk, though very few fully use the diagnosis strategies to get an accurate handle on the net present value of the risk.
The refactoring of platform risk is typically very, very expensive as well as very distracting. For example, some would argue that Zynga’s fight on two fronts, (a) refactoring its platform risk related to Facebook and (b) shipping new games, is what hurt the company’s ability to execute.
In the case of platform risk, prevention is far better than any cure. In the words of Fred Wilson (an investor in Twitter): “Don’t be a Google Bitch, don’t be a Facebook Bitch, and don’t be a Twitter Bitch. Be your own Bitch.” Being your own bitch doesn’t mean not leveraging platforms. It means getting in the habit of doing the following three things in a lightweight, continuous process:
Explicitly evaluate platform adoption decisions, once you have sufficient information. Having sufficient information usually involves more than reading a few blogs. For example, at Swoop we recently had to make a search platform choice. We decided to go with ElasticSearch but not before I had talked to the company, not before Benchmark invested significantly in ES, and not before I’d talked to friends who ran some of the largest ES deployments to get the lowdown on what it was like to operate ES at scale.
Invest the time to learn about the platform and develop the relationships that would help you have special access to information about the platform. Here is my simple rule of thumb with respect to any platform critical to your business: someone on your team should be able to contact one of the platform’s leaders and get a response relatively quickly. This is especially important if you are dealing with new or not super-popular open-source projects. The best way to achieve this is to think about how you and your business can help the platform.
Every now and then spend a few minutes to honestly evaluate your company’s level of platform risk and think about how you’d mitigate it and when you’d have to put mitigation in action.
Remember, the goal is not to eliminate platform risk. You cannot do this while at the same time taking advantage of a platform. The goal is to efficiently reduce the likelihood of Black Swan-like events related to the platform hurting your business. If you understand the mechanics of how platforms operate and how platform risk accrues, you will be able to predict and prepare for events that take others by surprise. These are sometimes the best times to scale fast and leapfrog competitors.
When it could help
Betting on a platform can be hugely helpful to a startup, despite some level of platform risk. There is never a benefit from platform risk increasing to the anti-pattern level.
***
The startup anti-pattern repository is work in progress helped immensely by the stories and experiences you share on this blog and in person. Subscribe to get updates and share this anti-pattern with others so that we can avoid more preventable startup failures.