A predictable web of data - the why of YQL

An introduction to YQL

This is a demo chapter of a book covering YQL that was planned for release with Yahoo Press this year. I wrote this chapter over Christmas last year as a pitch, and it then got lost in paperwork. As I am leaving Yahoo, I took it with me, and now you can enjoy it here for free.

A changing web

Web development has changed drastically in the last few years. When you look at how people use the web these days, we are not talking about surfing any longer. We don't go to one web site and spend hours there browsing and looking for content before moving on to the next.

Instead we communicate with other people on the web, get recommendations, find information by following experts and learn from our friends and contacts what the cool thing of the moment is. Products like Facebook, Twitter, Upcoming, Last.fm, MySpace, Flickr and LinkedIn are the starting points these days, and if what you built gets recommended there, you are part of the "cool stuff" that happens on the web. The impact of the social web goes so far that search engines, traditionally the "yellow pages" of the web, have started mixing their indexes with "real-time web" updates from Twitter and Facebook. Content that has been verified by people you trust and know shows up higher than content a machine considered worthwhile.

The web is now much more social and human, and this is what the whole "Web 2.0" hype is about. If you want to be a success in this new world of loosely connected information with human review and promotion, your product and information need to be available in the easiest fashion imaginable.

We spend millions of dollars on advertising and search engine optimization to leave our footprint on the web and get people to visit what we have. Much of this is not necessary: if you embrace the web as a construct of linked data with human curators, advertising your services and reaching high-quality end users becomes much easier.

In the past we tried to do everything on our own servers, building software that only we used, with data that only we had access to. This is changing as more and more companies use third-party systems and share their own information through APIs.

APIs, or Application Programming Interfaces, are ways for developers to reach the information they need by sending request parameters. Instead of doing a web search for "puppies" and getting the latest results ranked by whatever dark magic the search engine applies, a search API allows you to define "puppies" as the search term, filter the results by language or date range, or ask for 20 results starting at the 503rd, among other ways to customize the data returned to you. And instead of getting back an HTML document that is the search result page, you get the data as XML; you can also request only the links and the titles, leaving out the other information the result page displays, such as descriptions or the number of links pointing to each site.
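
As a purely hypothetical sketch - the endpoint and parameter names here are made up for illustration and do not belong to any real API - such a request could look like this:

http://api.example.com/search?query=puppies&market=en&count=20&offset=503&format=xml

Every customization - the search term, the language, the number of results, the starting offset and the output format - travels as a parameter in the URL, and the response comes back as machine-readable XML instead of a rendered page.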

For example, you can go to Microsoft Bing, search for puppies and get the result in a browser:

screenshot of Microsoft Bing
Figure 1: A search result page

By using the Bing API you can get the same information as XML:

Bing search results as XML
Figure 2: Using the API you can get the same data as XML

When you start using APIs in earnest, you realize that what we have considered the web up to now - the web sites - is just one view of the information contained in it. By digging deeper we can create a much more flexible and better-performing web.

Offering data with APIs

Offering an API means that you allow developers programmatic access to your information. Instead of publishing your data as a web site and hoping people will come to it, you let people tap into your data and show your information on their sites. You also allow them to build applications driven by your data on other platforms - Facebook, mobile phones, game consoles, TV sets or wherever they please.

By separating your data from your web product and offering an API, you reap several rewards: your data shows up in places and products you would never have built yourself, and other people do the work of spreading it for you.

A lot of companies have already understood that concept and offer APIs. If you look at http://programmableweb.com - a kind of "yellow pages" for APIs - you can see just how many APIs are out there:

Programmableweb.com
Figure 3: At the time of writing, Programmable Web lists 1573 APIs to play with.

Probably the biggest success story of offering an API is Twitter. Twitter was not a success out of the box because of an amazing web interface - on the contrary, the interface kept changing constantly, catering to the wishes and needs of a changing community (another clever move on Twitter's part). The main success factor was that from the start Twitter allowed people to update it through several channels: text message, email, the web site or the API. The real breakthrough came when developers integrated Twitter into systems people already used and created much handier applications for reading updates and posting what you are doing right now:

Screenshots of different Twitter clients
Figure 4: Twitter and different Twitter clients.

Funnily enough, even governments and not-for-profit organizations are starting to see the web the same way. The US government offers its data at http://data.gov, as does the Australian government at http://data.australia.gov.au/ and the UK at http://data.gov.uk. A clever move: when it comes to building human interfaces, the open market has far more specialized, keen and able developers than government departments, which are hindered by red tape and hierarchies.

screenshot of data.gov.uk
Figure 5: It is all about the data - the UK government has its own data offerings - democracy in action.

So much for APIs from the data publisher's perspective - but what if all you want is to offer a web site? How do APIs work for you?

Consuming the web of data

The web is full of specialized services with social elements. These are a great opportunity for you as a web developer. Instead of trying to do everything yourself, you can use expert systems built to fulfil a certain task.

Say, for example, you have photos to display on your web site and you don't want to convert them into web formats yourself. You can use Flickr (http://flickr.com) instead: upload the photos there, let Flickr handle the resizing and conversion, add titles, tags and descriptions, and pull the results back into your own site.

The alternative would be to write your own image uploading and conversion tool, create a tagging and description interface, and deal with user management to boot.

On Flickr you can even limit access to the photos to a certain group, or invite people to see them with a Guest Pass.

Flickr in editing mode
Figure 6: Flickr is a full content management system for images and has an API to get the data back in handy formats to re-use.

You can do the same with other kinds of data - events, music and many other kinds of content have their own specialized hosting services with APIs.

Using APIs, web services and hosting services for specialized data means you can concentrate on your own product while experts run the infrastructure for you, and your data becomes available wherever you need it.

So, hosting and spreading your data means that you already take part in the web of data - but the really cool thing is mixing it up.

Mixing and matching to create something new

Mashups - "mixing several data sources to create something new" - have been quite a buzz in the web development market for the last few years. A lot of people celebrate this way of thinking and developing as the most innovative thing that has happened to software development in the last 20 years.

Probably the first mashup in the web world was taking photos and placing them on a map to give them more context and make them easier to take in. Flickr now even has an interface for that:

Flickr's map interface
Figure 7: Placing photos on a map in Flickr.

The interesting thing is that the same trick of putting information on a map was used to solve a medical puzzle years and years before the first computer. John Snow was a physician in London, and he was as confused as everyone else in medicine about the cause of the 1854 cholera outbreak in London. He solved the puzzle by marking the location and number of deaths on a map of the city (all details are available on Wikipedia: http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak):

map of london with incidents of cholera related deaths as bars
Figure 8: John Snow's cholera map. By placing the deaths of cholera victims on a map, John Snow proved that the water supply was the cause of the outbreak - the first mashup ever, so to speak.

While you will probably not solve the great mysteries of life with a web mashup, it is still a very interesting way to make information more understandable and to show the relationships between different sets of data. For example, you could take the latest articles from the New York Times, analyze their content, extract keywords and then show relevant photos next to them. For all of this there are APIs, and putting them together is fun. However, there is a nagging issue.

Where it all becomes tricky

It is not all fun and games, though. The biggest issue with mashups is that there is no standard for API development: every API team made its own assumptions and had its own ideas about the best way to build one. Add the great developer tradition of either not documenting at all or documenting only for those who are already experts in the system, and you have your work cut out for you. Right now we spend most of our mashup-building time reading documentation or trying to get the different APIs to work, first on their own and then with one another. Each API works in different, and sometimes mysterious, ways.

So, all in all, the problems of using APIs are - as always - a lack of standardization and predictability. You have to be an expert in a system before you can use it, even though the system was intended to make access to information easy.

Now scale this up to industrial and enterprise size

Now imagine you are a company like Yahoo. Everything inside Yahoo works with APIs. This is simply a necessity, as things need to scale to an amazing number of requests and an amazing amount of traffic.

If a web site is hammered as hard as the Yahoo home page or search result page, or stores as much information as Flickr, with dozens of images uploaded every second, you cannot let the front end talk directly to the databases or do heavy calculations and conversions on the fly. All of this is abstracted into APIs, cached, packed and whatever else we can come up with to make the end-user experience as swift as possible without killing the servers.

The problems with APIs, however, are the same. In a company of Yahoo's size and age, APIs have been built by departments completely independent of one another, and some were built years ago with techniques that are now woefully outdated. Some smaller APIs delivered amazing data but were not hosted on infrastructure that could take massive traffic - and the list of issues goes on.

Using several APIs for a new product was less a technical exercise than an exercise in communication: trying to find the right people to talk to and documentation that was not completely outdated or written for people who enjoy reading regular expressions as a pastime.

This is why we needed a simpler way to use and mix the old APIs and to release new ones.

Looking back at what we've done before: Yahoo Pipes

Actually, for end users Yahoo already had a system that makes it dead easy to take data off the web, mix it and get it back in an easy-to-use format: Yahoo Pipes.

the yahoo pipes interface
Figure 9: Yahoo Pipes is a visual interface to remix the web of data.

Yahoo Pipes is a graphical interface for mixing and filtering web data. It looks and feels like a database schema tool or Visio, and it was therefore a big success with the visual community and with people who do not like to program but feel much more comfortable using a mouse to put pieces of information together.

The usability of Pipes is pretty amazing. When you add a new data source and want to connect it to filters or mix it with other sources, all you need to do is drag a handle, and a line appears that can be connected wherever it makes sense. Possible targets are highlighted, so there is no way for you to make mistakes.

The visual nature of Pipes is also its problem. First of all, the interface is not accessible at all: users who cannot use a mouse or cannot see have no chance of using Pipes.

Furthermore, pipes that do very complex filtering and collating can slow down your browser. Maintenance is another issue: there is no versioning of pipes, so any change you make destroys what you had before - unless you save the pipe under a different name.

So Pipes had the right idea, but it didn't scale to what Yahoo needed. This is why Yahoo took the concept of Pipes and put it back into code - and YQL was born.

YQL - select * from Internet as the solution

YQL, or Yahoo Query Language, is a language that describes what you want to get from the Internet and how it should be returned to you. It is closely related to SQL, which has been the standard way of accessing databases for decades. Instead of making YQL an interface like Pipes, Yahoo decided to make it a web service. That way it can be used by anyone, regardless of ability or technical environment, and it scales to levels of complexity no graphical interface ever could.
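
As a first taste of the syntax - a minimal sketch, with the feed URL chosen purely as an example - here is a statement that fetches just the item titles from an RSS feed:

select title from rss where url="http://rss.news.yahoo.com/rss/topstories"

If you know SQL, this should look familiar: you name a table (rss), pick the fields you want (title) and pass the parameters of your request in the where clause.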

We will go into the details of YQL in the following chapters, but as an example let's try what we talked about earlier: take the latest New York Times articles on the term "carbon", analyze their content, extract keywords and then get relevant photos.

Without YQL, we'd need to sign up for API keys with the New York Times, Yahoo and Flickr and read three different sets of documentation. Our code then needs to:

  1. Authenticate with the NYT API and get the right content
  2. Take the results, filter them down to the bare necessities and call the Yahoo Term Extractor API for each result
  3. For each of the resulting keywords, go to Flickr and get photos relating to that term.

All in all this is a lot of authentication and requesting: say that for 10 articles you find 5 keywords each. This means you need to call the term extractor 10 times and the Flickr API 50 times. Depending on the API, this could quickly put you over the allowed number of requests per hour. In YQL, it becomes as easy as this:

select * from flickr.photos.search where text in(
  select content from search.termextract where context in(
    select body from nyt.article.search where query="carbon"
  )
)

Your code makes one single request - all the rest happens on the YQL server farm for you. All you have to do is call the YQL web service (without any authentication, unless you want to use it) and you get the data. You can even limit the amount of information you want and sort the results.
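
To give you an idea of what that single request looks like - a sketch, with the query placeholder shown generically - you send your URL-encoded YQL statement to the public YQL web service and choose an output format:

http://query.yahooapis.com/v1/public/yql?q={your URL-encoded YQL statement}&format=xml

Limiting and sorting are part of the language itself; for example, this statement (with a simple Flickr query standing in for the nested one above) fetches only the photo titles, keeps ten results and sorts them alphabetically:

select title from flickr.photos.search where text="puppies" limit 10 | sort(field="title")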

Benefits of using YQL

Using YQL allows Yahoo to work much more efficiently: old and new APIs can be offered through one consistent, scalable interface, and teams no longer spend their time hunting for the right contacts and the right documentation.

By making things easier for Yahoo, we also created a very easy and elegant way to use the web of data without being an API expert - on either the publishing or the consuming side.

YQL Benefits for data consumers

As a data consumer or mashup creator, YQL makes things a lot easier for you: one syntax gives you access to many different APIs, one request replaces dozens of individual calls, and you get the data back filtered and sorted the way you want it.

Quite a bundle of arguments, isn't it? But what if you want to publish data on the web?

YQL Benefits for data providers

If you are a data provider - either with an API already or thinking about offering one - YQL has a lot of benefits for you, too: developers can reach your data with a syntax they already know, and you can concentrate on building a great data endpoint rather than documenting yet another proprietary interface.

Interested? Walk with me…

This is what YQL is: a way to make the web of data accessible to people who want to use the information it contains without being API experts themselves. For the experts it means that they can concentrate on building great data endpoints and implementations rather than reading and writing documentation for things that should not be hard to do.

In the following chapters you will get to know the YQL endpoints, the syntax of the language and the console, which makes it dead easy to find data sources and mix them. We will then look at implementations of YQL and some very hands-on examples of how YQL makes it easy to build mashups.