
Why Forcing LinkedIn to Allow Scrapers Sets a Dangerous Precedent for the API Economy

A federal judge has ruled that Microsoft's LinkedIn service must make the data found in most of its users' profiles freely available to third parties who want to programmatically "scrape" the site for that data instead of going through the service's official API and abiding by its Terms of Service.

Last night, I thought my work day was over as I was doing one last scan of the interwebs when I saw it. Usually, a one-word tweet -- "Whoa" -- isn't enough to get my attention. It could be anything. But it came from Sam Ramji, now working in developer relations at Google and formerly of Apigee and Cloud Foundry; someone who I know is not easily surprised. My work day apparently wasn't over. I drilled backwards through the tweetstream to find another friend, Redmonk principal analyst Steve O'Grady, who tweeted, "so this was interesting and surprising."

They were responding to news that a federal judge has ordered Microsoft's LinkedIn to, within 24 hours, remove any technical barriers that prevent third parties from scraping the data off of any LinkedIn profiles designated as public (which must be like, all of them). As it sank in, I gasped.

"Scraping" gets its name from the phrase "screen scraping." Back in the PC days, before the Web was hot, some very clever programmers wrote routines that could reverse-engineer whatever data an application was displaying on a PC's screen and then siphon that data off to a structured database. Along came the Web with its highly repetitive structural HTML patterns from one page to the next on data intensive sites like LinkedIn and eBay and now, developers didn't even have to intercept the pixels. They just had to retrieve and parse the HTML with the same kind of Regular Expressions that drove the success of search engines like Google and Yahoo!  

For sites that don't offer an API, making scraped Web data available through an API (sometimes called a "Scrape-P-I") can be an invaluable workaround for remixing a site's data into new and innovative applications. There are even services that will do the scraping for you on sites that allow it because they don't have an API. ProgrammableWeb recently reviewed one (see How to Turn Existing Web Pages into RESTful APIs with import.io).
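The pattern itself is simple enough to sketch. The example below wraps the scraper from above in a small Flask endpoint; it's an illustration of the general idea, not how import.io or any particular service actually works, and the site and markup are again hypothetical.

```python
# A toy "Scrape-P-I": expose scraped page data through a small REST
# endpoint. Flask and the target markup are illustrative choices only.
import re
import urllib.request

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/jobs/<profile_slug>")
def jobs(profile_slug):
    url = f"https://example.com/profiles/{profile_slug}"  # hypothetical site
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    titles = re.findall(r'<span class="job-title">(.*?)</span>', html)
    return jsonify({"profile": profile_slug, "job_titles": titles})

if __name__ == "__main__":
    app.run(port=8080)
```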

However, like many sites containing valuable data that third parties would like to freely access, LinkedIn puts manual, technical, and legal barriers in the way of scraping. After LinkedIn blocked hiQ Labs from scraping its site, hiQ Labs filed a lawsuit and prevailed on the basis that LinkedIn was "unfairly leveraging its power in the professional networking market for an anticompetitive purpose." The judge likened LinkedIn's argument to the idea of blocking "access by individuals or groups on the basis of race or gender discrimination."

In my opinion, the judge got it wrong, and the implications of this terrible decision for API providers should not be underestimated. For those of you who know me and my history of defending open computing, you would think that I might hail this decision. After all, who is LinkedIn to hoard all that data for itself? However, when it comes to collecting data and organizing it for ease of searching through it, displaying it, and consuming it with a machine (e.g., through an API), my opinion on this matter is deeply affected by the strategic and tactical investments we make in order to provide ProgrammableWeb's API and SDK directories as a free service.

A long, long time ago, whether it had to do with personal profiles or API profiles, the vast majority of related data that lived wherever it lived was (and often still is) both disorganized and unstructured. Highly "disaggregated," as we like to say. Where it was organized or structured, it was only in pockets. For example, your contact data might have been structured and organized according to the contact management system you used, like the one embedded in your email system. But other information, like the list of jobs you held and what you did at each of them, was likely scattered across resumes and other text files, if it was recorded anywhere at all.

When the engineers at a service like LinkedIn sit down to think about their data model, they have to think about what sort of experiences they want to offer to their various constituencies, what sort of data is required to enable those experiences, where that data can be found, and, once it is found, how to best store it in a database. This involves the design and construction of schemas, tables, vocabularies, taxonomies, and other important artifacts that, taken together, enable great and performant user experiences. For example, as soon as you discover that one entity type might have two or more of another entity type connected to it, you've got what's called a one-to-many relationship that must be factored into your data model. An example might be how one person has multiple jobs and each job is connected with a company.
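In code, that relationship might look something like the sketch below. The field names and types are my own illustrative choices, not anyone's actual schema.

```python
# Sketch of the one-to-many relationship described above: one person,
# many jobs, each job tied to a company. Everything here is illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Company:
    name: str

@dataclass
class Job:
    title: str
    company: Company
    start_year: int
    end_year: Optional[int] = None  # None means "current position"

@dataclass
class Person:
    full_name: str
    jobs: List[Job] = field(default_factory=list)  # the "many" side

jane = Person("Jane Doe")
jane.jobs.append(Job("Staff Engineer", Company("Example Corp"), 2015, 2019))
jane.jobs.append(Job("Engineering Manager", Company("Example Corp"), 2019))
```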

Or, in ProgrammableWeb's case, when we first made a provision in our API profiles for the location and type of an API's public-facing API description file (like an OAS, RAML, or Blueprint-based description), we wrongly assumed there would be only one such description file per API. Just including the fields for the API description in our profile was an important data model decision aimed at serving the needs of both API providers and developers (our primary audience). But then, when we saw services like Stoplight.io and Microsoft's Azure API Management offering more than one description file per API (Stoplight offers both OAS and RAML; Azure offers both OAS and WADL), we decided to fix our data model to accommodate those use cases.


Such decisions do not come for free. Never mind the time of the stakeholders involved in the decision-making process. There's the engineering time that goes into adjusting the data model. Then, the Web designers must design a variety of experiences to support that new data model. How is the data displayed in a profile? How is the data collected, and if data entry forms are required, who designs those and what's the validation logic? Who writes the help text and tooltips for the new fields and codes the HTML? Then, the old data model has to be migrated to the new data model. In the aforementioned ProgrammableWeb use case, data that was originally kept with the base API data now has to be broken out into a separate table. All code that previously referred to the old data model has to be fixed to support the new model. Then there's the testing. Did the migration work? How about regression testing of all the new code (gotta make sure we didn't break anything)? And finally, the data entry to populate the new data model.
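To give a feel for what "broken out into a separate table" means, here's a hedged sketch of that kind of migration using SQLite. The table and column names are hypothetical, not ProgrammableWeb's actual schema.

```python
# Sketch of breaking a single embedded field out into its own one-to-many
# table so that one API can carry many description files.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Old model: one description file baked into the API record itself.
cur.execute("""CREATE TABLE api (
    id INTEGER PRIMARY KEY,
    name TEXT,
    description_format TEXT,
    description_url TEXT)""")
cur.execute("INSERT INTO api VALUES (1, 'Example API', 'OAS', "
            "'https://example.com/openapi.json')")

# New model: description files get their own table, one-to-many with api.
cur.execute("""CREATE TABLE api_description (
    id INTEGER PRIMARY KEY,
    api_id INTEGER REFERENCES api(id),
    format TEXT,
    url TEXT)""")

# Migrate the existing rows into the new table. In a real migration, the
# old columns would then be removed and every query that touched them
# rewritten -- that's the code, testing, and regression work described above.
cur.execute("INSERT INTO api_description (api_id, format, url) "
            "SELECT id, description_format, description_url FROM api")

# Now one API can carry any number of description files.
cur.execute("INSERT INTO api_description (api_id, format, url) VALUES "
            "(1, 'RAML', 'https://example.com/api.raml')")
conn.commit()
```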

For example, here at ProgrammableWeb, we have API researchers adding and updating API and SDK profiles every day. Well, as a result of one little data model decision, now we have to go back and update more than 18,000 records. I'm not saying we have a perfect model or that some things couldn't have been done more efficiently or with better technology. But it's a system that works and we're delivering real value to over a half-million people every month. 

Now, imagine that you're LinkedIn. You've made and continue to make thousands of decisions like this, and because of the investment you've made in organizing and aggregating all that information on behalf of your users -- something that's never been done that way before -- you've become the most popular service of your kind on the Internet. That data model and the way you've organized it for your users is your unique selling proposition. It's the key differentiator that sets you apart from other similar offerings. And for other businesses that view your investment as the best opportunity to do other work with that data, you offer (and monetize) an API on terms that keep licensees from replicating your service.

That is, until a federal judge says you can't; that more than a decade's worth of investment must be given away without any reimbursement whatsoever. 

This decision, in my estimation, is a complete travesty. It's a failure on the judge's behalf to understand the fundamentals of Silicon Valley as an engine of innovation and a key contributor to the growth of the US economy. Think about it. With this precedent set, just about any Web site that involves profiles of entities -- be they users, businesses, products, or anything else -- is now on notice. That includes Facebook, Instagram, Pinterest, eBay, Twitter, CraigsList, Match.com, Reddit, and ProgrammableWeb, just to name a few. All the companies that have invested millions of dollars into inventing something of real and unique value that aggregates and organizes data, marketplaces, or emotions, and who have hired thousands if not millions of employees in the course of running their businesses, must now give away their most precious asset.

But wait. It gets worse.

Not only must LinkedIn (and potentially others, based on the precedent) give away their most precious asset, they must also allow third parties to circumvent their APIs in order to retrieve the data they want. For any API provider that also runs a public Web site as a means of viewing the data it manages, this is a dagger in the heart.

What the judge surely doesn't realize is that as much as APIs are a great way for machines to interface with a database, scraping Web pages is an activity that inflicts serious harm on the operator of a Web site. For example, with the right API management system and DevOps processes in place, an API should be reasonably easy to scale according to demand for data. API providers will also have some idea of who is hitting their databases, for what applications, and how that drives additional value for the company. 
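That visibility and control is easy to illustrate. The sketch below shows the kind of per-key identification and rate limiting that an API (or an API management layer) makes possible and that anonymous scraping doesn't; the key store and limits are made up for the example.

```python
# Illustrative sketch: every API request carries a key that can be
# identified, rate limited, and attributed to a specific application.
import time
from collections import defaultdict

API_KEYS = {"abc123": "Partner App A", "def456": "Partner App B"}  # hypothetical
RATE_LIMIT = 100       # requests allowed...
WINDOW_SECONDS = 60    # ...per rolling minute, per key

_request_log = defaultdict(list)

def check_request(api_key: str) -> str:
    """Return the calling app's name, or raise if the key is unknown or over limit."""
    if api_key not in API_KEYS:
        raise PermissionError("Unknown API key")
    now = time.time()
    recent = [t for t in _request_log[api_key] if now - t < WINDOW_SECONDS]
    if len(recent) >= RATE_LIMIT:
        raise RuntimeError("Rate limit exceeded for " + API_KEYS[api_key])
    recent.append(now)
    _request_log[api_key] = recent
    return API_KEYS[api_key]
```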

But once third parties can scrape your site, scaling and understanding the connection between your data and the value it's driving pretty much go out the window. You might argue, for example, that Web sites should auto-scale according to demand anyway. True. There are engineers whose job it is to develop an understanding of the patterns generated through human usage and to set the entire Web stack up to scale accordingly. But once you throw machine-backed scrapers into the mix, all bets are off. All it takes is for a few scrapers to show up at your front door unannounced and you've got the equivalent of a Denial of Service attack.

Web site analytics is another part of running a Web site that scrapers interfere with. Most Web sites are measuring their traffic on a daily if not hourly or minute-by-minute basis. Suddenly, you have hundreds, thousands, or millions of hits showing up on your Web site and you think you've struck gold in terms of traffic. But then your phone rings and it's your analytics analyst with bad news; it was a scraper. Or two, or three. You use Google Analytics and it does a great job. But it never sends you an alert that a scraper is in your midst, screwing up your traffic. Out of the box, it's just not that smart.
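Separating that kind of machine traffic from real traffic usually comes down to heuristics over your own logs. Here's a rough sketch of one such heuristic -- flagging any client that requests pages far faster than a human plausibly would -- with the threshold and log format being assumptions of mine.

```python
# Rough sketch of spotting scraper-inflated "traffic" in a plain access
# log: flag any client issuing far more requests per minute than a human
# plausibly would. Threshold and log format are assumed for the example.
from collections import Counter

HUMAN_CEILING_PER_MINUTE = 30  # assumed ceiling; tune for your own site

def flag_probable_scrapers(log_lines):
    """log_lines: iterable of 'minute_bucket client_ip path' strings."""
    hits = Counter()
    for line in log_lines:
        minute, client, _path = line.split(maxsplit=2)
        hits[(minute, client)] += 1
    return sorted({client for (_minute, client), count in hits.items()
                   if count > HUMAN_CEILING_PER_MINUTE})
```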

Then, there's the efficiency issue. There are ways to design APIs such that data requests and responses are built to minimize the stress on systems and networks. It has to do with what data is delivered in an API (and what data isn't). It has to do with whether it's a hypermedia API. Or maybe it's a GraphQL-based API that retrieves an entire graph of data. Or perhaps the API is based on Protocol Buffers or Apache Thrift and the data moving back and forth is compressed into a binary format to save resources and improve performance. There are all sorts of considerations that go into APIs in order to make them the most efficient way to retrieve data. But the minute a third party resorts to crawling Web pages to do the same thing, it's like a boat returning to the dock after each fish is caught instead of waiting until the hold is full. All the involved resources will be stressed, maybe to the point of breakage.
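A bit of back-of-the-envelope arithmetic makes the point. The numbers below are hypothetical, but they show how scraping pays the full page overhead once per record while a batched API amortizes it across many records.

```python
# Back-of-the-envelope illustration of the "boat returning to the dock"
# point. All figures are hypothetical, chosen only to show the arithmetic.
RECORDS_NEEDED = 10_000
AVG_HTML_PAGE_BYTES = 150_000   # full profile page: markup, layout, widgets
AVG_API_RECORD_BYTES = 2_000    # just the fields, JSON or binary-encoded
RECORDS_PER_API_CALL = 100      # pagination / batching

scrape_requests = RECORDS_NEEDED                      # one page fetch per record
scrape_bytes = RECORDS_NEEDED * AVG_HTML_PAGE_BYTES

api_requests = RECORDS_NEEDED // RECORDS_PER_API_CALL
api_bytes = RECORDS_NEEDED * AVG_API_RECORD_BYTES

print(f"Scraping: {scrape_requests} requests, {scrape_bytes / 1e6:.0f} MB")
print(f"API:      {api_requests} requests, {api_bytes / 1e6:.0f} MB")
```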

Finally, there are many startups out there right now that see an amazing opportunity (both as a Web site and as an API) in organizing some domain that has yet to be organized the way LinkedIn has organized its own. But with this precedent hanging out there, the risk is too great that someone will come along, circumvent your API, and, worse yet, use your own data against you. Innovation will have been stifled. Jobs that could have been won't be, and dreams will be broken.

I could go on. But I think you get the point. Thankfully, Microsoft will be appealing. Hopefully, they'll prevail. 

Oh, and one more thing: one of the most common questions we get is, "Is there an API to the ProgrammableWeb API Directory?" The answer is, "No, but there will be." Few things aggravate me more than not having an API. Our data model is the key enabler of two features: a killer power search and an amazing API. We have a few final tweaks that we're making to our data model, and once that's done, both the power search and the API will be within reach.
