Web Crawling 101: How Bots Index and Enrich New Content
As Talkwalker’s web crawler now indexes more than 300 million new articles daily and adds five new sources every second, let’s take a look at how web crawling works.
On this blog, we usually focus on what you can do with the data that Talkwalker finds for you, but there’s a behind-the-scenes process we’ve never highlighted: how do posts and articles turn up in a project?
Talkwalker pulls in data from a range of social networks via APIs, but we actually collect a large portion of the data ourselves.
How the Web Crawler Identifies the Right Content
Web crawlers, or spiders, systematically browse the web to index content. Talkwalker runs a proprietary crawler that indexes content on websites such as online news sites, blogs, forums and message boards. He’s called Roger.
As a web page is indexed, the bot needs to decide which blocks on the site are relevant content. If you take a look at this blog page, for example, you’ll notice a sidebar giving reading recommendations. While this provides value to readers on the page, for the purpose of crawling and indexing, this content is not relevant.
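Talkwalker’s extraction logic is proprietary, but a common heuristic for this kind of block filtering is text-to-link density: article bodies tend to be long runs of text with few links, while sidebars and menus are short and link-heavy. Here’s a toy sketch of that idea (the thresholds are illustrative assumptions, not Talkwalker’s actual rules):

```python
# Toy heuristic for separating main content from navigation/sidebar blocks.
# Content blocks tend to have lots of text and few links; sidebars and menus
# are short and dominated by links. Thresholds here are illustrative only.

def link_density(text: str, link_texts: list[str]) -> float:
    """Fraction of a block's characters that belong to hyperlinks."""
    total = len(text)
    if total == 0:
        return 1.0
    linked = sum(len(t) for t in link_texts)
    return min(linked / total, 1.0)

def looks_like_content(text: str, link_texts: list[str],
                       min_chars: int = 200,
                       max_link_density: float = 0.3) -> bool:
    """Keep blocks that are long enough and not dominated by links."""
    return len(text) >= min_chars and link_density(text, link_texts) <= max_link_density
```

A long article body passes this filter, while a short "recommended reading" sidebar made up mostly of link text gets discarded.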
Roger has to be smart enough to decide what to import and what to discard. “It’s my team’s job to make sure that only clean results show up in the platform,” says Sebastien Wagener, Head of Data at Talkwalker.
“To ensure we consistently get those high quality results, we automatically detect the website structure and type of website. We then identify areas where new content will be published in the future and automatically create extraction templates for new articles that include information for dates and timezones,” explains Wagener.
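To picture what such an extraction template might look like, here is a rough structural sketch. The field names and selector syntax are assumptions made for illustration; Talkwalker’s real templates are generated automatically and are not public:

```python
# Illustrative shape of an auto-generated extraction template: once the
# crawler has learned where a site publishes new articles, it records where
# to find each field, including date and timezone information.
# All field names and selectors below are hypothetical examples.
from dataclasses import dataclass

@dataclass
class ExtractionTemplate:
    site: str
    article_link_selector: str  # where links to new articles appear
    title_selector: str
    body_selector: str
    date_selector: str
    timezone: str               # needed to normalize publication dates

template = ExtractionTemplate(
    site="example-news.com",
    article_link_selector="div.latest a",
    title_selector="h1.headline",
    body_selector="article .body",
    date_selector="time.published",
    timezone="UTC",
)
```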
Adding New Sources and Content Automatically
Social media analytics is only as valuable as the data behind it. Cutting corners on data can come back to bite you fast, especially when you’re missing an important source that may post critical information about your brand.
Talkwalker’s crawler adds 5 sources per second. “We automatically extract all links from the posts we crawl, including social network posts, to detect sites we are not yet crawling. So whenever a new blog, message board, or news site is created, there is a high probability that we’ll be adding it very quickly,” says Wagener.
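In essence, this discovery step harvests outbound links from crawled posts and flags any domain that isn’t in the crawl index yet. A minimal sketch, assuming a simple set of known domains standing in for the real crawler’s state:

```python
# Sketch of the source-discovery step: collect outbound links from a crawled
# post and return the domains we are not yet crawling. The known_domains set
# is a simplified stand-in for a real crawler's source index.
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.links.append(href)

def discover_new_sources(html: str, known_domains: set[str]) -> set[str]:
    """Return domains linked from a post that we are not yet crawling."""
    parser = LinkExtractor()
    parser.feed(html)
    return {urlparse(link).netloc for link in parser.links} - known_domains
```

Every newly discovered domain can then be queued for classification and crawling.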
Automatic schedulers make sure that every site gets visited frequently. “We adapt our crawling smartly based on previously crawled data. Through machine learning we can predict when new posts will be published and crawl them even faster,” explains Wagener.
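The simplest version of such adaptive scheduling estimates a site’s posting cadence from the timestamps of posts already crawled. Talkwalker describes using machine learning for this; in the sketch below, a plain median-interval estimate stands in for that model:

```python
# Minimal sketch of adaptive crawl scheduling: predict the next visit from
# the observed posting cadence. A real system would use a learned model;
# here the median interval between past posts stands in for it.
from statistics import median

def next_crawl_time(post_timestamps: list[float]) -> float:
    """Predict the next visit as last post time + median posting interval."""
    if len(post_timestamps) < 2:
        # Not enough history: fall back to re-checking in an hour (assumed default).
        return post_timestamps[-1] + 3600
    ts = sorted(post_timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    return ts[-1] + median(intervals)
```

A site that posts roughly every 100 seconds would be revisited about 100 seconds after its latest post, rather than on a fixed global schedule.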
Many of Talkwalker’s clients want to monitor niche sites that are important to them. Eray Yumurtaci, Territory Manager UK, knows many brands want to focus on regional coverage. “It’s not uncommon for us to get a long list of regional news sources or sources in a specific language the client wants to monitor. Then customer support works with the data team to add those sources, free of charge.”
How Talkwalker's Web Crawler Cleans Content
When a new site is detected, the web crawler automatically determines whether the site structure is a blog, message board, or other, and starts crawling. “We don’t require RSS feeds to be present and don’t have to write manual parsers for those sites,” says Wagener. Instead, they start crawling the website instantly, completely automatically.
But not all crawled articles will be delivered to the Talkwalker platform. “We have set up advanced filtering and rules to deal with duplicates, spam and pornographic content, so that Talkwalker users can run their analysis on a high quality set of posts.”
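Talkwalker’s filtering rules aren’t public, but one building block of any dedup stage is a normalized content fingerprint: hash a cleaned-up version of the text and drop anything you’ve already seen. A sketch (exact-hash dedup catches verbatim copies only; near-duplicate detection needs techniques like shingling or SimHash):

```python
# Sketch of exact-duplicate filtering via a normalized content hash.
# This catches verbatim and near-verbatim copies (whitespace/case changes);
# fuzzier near-duplicates require more advanced techniques in practice.
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash of whitespace- and case-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_duplicates(articles: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct article."""
    seen: set[str] = set()
    unique = []
    for text in articles:
        fp = content_fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique.append(text)
    return unique
```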
Enriching Content After Crawling
For brands to make the most of the data, the article on its own isn’t enough, which is why there’s another set of steps between retrieving the article and adding it to the platform.
New posts are processed and enriched using machine learning. As Wagener explains, “we cluster similar articles together, calculate the sentiment, extract entities, topics and smart filters, remove duplicates and normalise all posts.” All this is done in a matter of a few seconds, so that new results are available in real time.
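Structurally, these enrichment steps form a pipeline that each new post flows through. The sketch below shows only that shape; every stage is a stub standing in for a real model (sentiment scoring, entity extraction, and so on), which is well beyond a blog-post example:

```python
# Structural sketch of an enrichment pipeline: each stage adds or normalizes
# one piece of metadata. The stages below are stubs standing in for real
# machine-learning models; only the chaining pattern is the point here.
def normalize(post: dict) -> dict:
    post["text"] = " ".join(post["text"].split())
    return post

def add_sentiment(post: dict) -> dict:
    post["sentiment"] = "neutral"  # stub; a real model scores the text
    return post

def add_entities(post: dict) -> dict:
    post["entities"] = []  # stub; a real extractor finds brands, people, places
    return post

PIPELINE = [normalize, add_sentiment, add_entities]

def enrich(post: dict) -> dict:
    for stage in PIPELINE:
        post = stage(post)
    return post
```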
Determining language is another important step before the article can appear in Talkwalker. “The platform is built internationally from the ground up, because borders don’t make a lot of sense online. We put a lot of effort into correctly detecting country and region, language, timezone and date of all our crawled articles, so that you can do meaningful analysis in Talkwalker later,” says Wagener. The platform currently supports 187 languages, including right-to-left languages like Arabic and specific Indian dialects.
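Production language detectors are trained statistical models over character n-grams; as a toy illustration of the detection step only, here is a stopword-overlap guesser. The word lists are tiny illustrative samples, not real model data:

```python
# Toy stopword-based language guesser, illustrating the detection step.
# Real systems use trained models over character n-grams; the word lists
# below are tiny illustrative samples only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "et", "est", "de", "la"},
    "de": {"der", "und", "ist", "von", "die"},
}

def guess_language(text: str) -> str:
    """Pick the language whose stopwords overlap most with the text."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```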
“To top things off, we also retrieve the number of times an article has been shared or liked on Twitter, Facebook, LinkedIn and many other social networks, allowing brands to measure the impact of their coverage,” says Wagener. “For all images linked in this data, we run our proprietary image recognition technology, which currently processes more than 100 million images daily. This allows clients to query not just text, but also images.”
Brands can align these metrics with their company KPIs or other internal data, which can be integrated through the Talkwalker API.
In Seconds from Crawling to Delivery
Articles that our web crawler detects show up in the platform seconds later. Clients can analyze sentiment, apply geographic filters or discover influencers, while the data team finds new ways to improve the crawling process further.
"We take a lot of pride in what we do and we put an immense amount of effort into the quality of the data which we crawl. Being one of the only social media providers to do the crawling on their own, without relying on any third party data provider for crawling or data processing, makes us a quality leader in this field," says Wagener.
One of his team’s priorities is to provide even more third-party data. “Today, through our partners, we have already integrated print and broadcast data, making Talkwalker the industry-leading 360-degree social media monitoring tool. But we won’t stop there and keep adding new sources constantly,” says Wagener.