Tarikh Korula’s company, Seen.co, is closing in on a hundred thousand monthly unique viewers, a stat that grew an alarming 72% last month. Much of this is organic traffic from search engines like Google, where Seen pages appear more than thirteen million times per month.
Natural Language Processing, or NLP, is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
Here are useful APIs that help bridge the human-computer interaction:
Broadly speaking, a startup goes through three phases of growth.
The goals, metrics, channels, focus, team structure, everything evolves and changes as you move through these three phases. Knowing where you are in this path helps you understand what…
This post contains some notes useful to me I took about web crawlers. To other readers, the bibliography below should be much more useful.
In  you can find a good explanation of the basic algorithm of a crawler, with observations about some common problems (e.g. the exponential growth of the URLs to crawl and the need to limit it in some way, caching DNS lookups, duplicate documents, keeping TCP connections open), and about some crawler models (general coverage, offline crawler, adaptive crawler).
- Offline crawler: it starts with a queue, ordered in some fashion; the queue is also called a segment. As it crawls pages, it just stores them. A separate process reads the stored pages, extracts the links, decides which links should be crawled, sorts them by domain, and creates a new segment. When the crawler finishes crawling the initial segment, it grabs a new segment and starts crawling again.
- Adaptive crawler: a crawler that reacts in real time to be more or less selective about what it downloads. It makes it possible to limit the queue for better performance, especially to find more relevant information faster and with less bandwidth. This is based on a live queue, a way to prioritize the URLs it encounters, and the ability to prune the queue when it becomes too large.
A thread constantly re-prioritizes the queue based on the information that the crawler collects from analyzing the documents it finds. In this way, the URLs with the lowest priority are simply removed.
An example: stop crawling a particular path because it contains useless documents.
The trade-off is that an adaptive crawler consumes more CPU and memory.
How ranking works: if a page leads the crawler to a target page, it gets a good rank; otherwise it does not.
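The live queue of an adaptive crawler can be sketched as a bounded priority queue. This is only an illustration under my own assumptions: the class name `AdaptiveQueue`, its methods, and the numeric scores are hypothetical and do not come from any particular crawler.

```python
import heapq
import itertools


class AdaptiveQueue:
    """Bounded priority queue of URLs; a higher score is crawled sooner."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []                   # entries are (-score, insertion_order, url)
        self._order = itertools.count()   # tie-breaker for equal scores

    def push(self, url, score):
        heapq.heappush(self._heap, (-score, next(self._order), url))
        if len(self._heap) > self.max_size:
            self.prune()

    def prune(self):
        # Keep only the max_size best entries; the lowest-priority URLs are dropped.
        self._heap = heapq.nsmallest(self.max_size, self._heap)
        heapq.heapify(self._heap)

    def reprioritize(self, score_fn):
        # Re-score every queued URL with what the crawler has learned so far.
        self._heap = [(-score_fn(url), n, url) for _, n, url in self._heap]
        heapq.heapify(self._heap)
        self.prune()

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a real crawler, `reprioritize` would be called from the background thread, with a scoring function that, for example, gives a low score to every URL under a path found to contain useless documents.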
Crawlers that bring more visitors are welcome.
The User-Agent of every request should include a name that identifies the crawler and a link to a page that contains information about the crawler, like “MSbot, http://domain.com/msbot”.
This page should contain at least:
- general information about the crawler: a description of the company, the kind of information the crawler is looking for, and what you will be doing with this information. If the crawler is part of a school project, say so
- whether it drives more visits to the site it is crawling
- list of IP addresses that the crawler operates from
- mention that the crawler respects the robots.txt file
- information on how to block the crawler in robots.txt (specifying “User-agent: <name of the crawler>” and “Disallow: /”)
- include an email address or form that people can use to contact you about the crawler
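As a minimal sketch of the identification advice, this is how a request carrying such a User-Agent could be built with Python's standard library. The crawler name and URL are the ones from the example above; the helper name `build_request` is my own.

```python
import urllib.request

CRAWLER_NAME = "MSbot"                         # crawler name from the example above
CRAWLER_INFO_URL = "http://domain.com/msbot"   # information page from the example above


def build_request(url):
    """Return a Request whose User-Agent names the crawler and its info page."""
    user_agent = f"{CRAWLER_NAME}, {CRAWLER_INFO_URL}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The returned object can be passed to `urllib.request.urlopen`; nothing is sent until then.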
For robots.txt information see http://en.wikipedia.org/wiki/Robots.txt. Keep in mind that it has long been a de facto convention rather than a formal standard, and that it has some extensions that are not supported by everyone.
Too many requests per second from the crawler may be mistaken for a DoS/DDoS attack. In that case, the owner of the site can block the crawler and report you to your ISP for taking part in a denial of service attack.
A good delay between requests is the one specified in the “Crawl-delay” field of robots.txt.
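Python's standard `urllib.robotparser` module can read both the access rules and the Crawl-delay value. The sketch below parses an inline robots.txt that I made up for illustration, so it needs no network access; the 2-second fallback and the helper name `delay_for` are my own assumptions.

```python
import urllib.robotparser

# Hypothetical robots.txt content, only for illustration.
ROBOTS_TXT = """\
User-agent: MSbot
Crawl-delay: 5
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(parser, user_agent, default=2.0):
    """Seconds to wait between requests: Crawl-delay if present, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Check a URL before fetching it, then sleep delay_for(...) seconds between requests.
allowed = parser.can_fetch("MSbot", "http://example.com/index.html")
blocked = not parser.can_fetch("MSbot", "http://example.com/private/page")
```

For a live site, `set_url` and `read` on the same parser object would fetch the real file instead.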
If you receive a complaint from a site owner, an answer may be:
Dear Mr. Webmaster:
Thank you for contacting me about my Web crawler. I make a concerted effort to ensure that the crawler operates within accepted community standards, and I apologize if it caused problems on your site.
I would like to research this matter to discover the source of the problem, which could very well be a bug in the crawler. Can you please send me the domain name for your site, and the approximate date and time from your Web server’s logs? It would be best if I could see the exact entry from your server log. That way I can match up the crawler’s logs with your information and more reliably discover the source of the problem.
Thank you again for notifying me of this problem. I will contact you again when I have discovered the source of the problem.
For more information about my crawler and what I’m doing with the information the crawler collects, please visit my information page at http://example.com/crawlerinfo.html
Banned requests typically return a non-200 response (like 403/503) or redirect to a captcha page. In this case, stop crawling that domain or move the request to another ‘clean’ node, as Crawlera does.
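A rough way to detect such a ban from the response is sketched below. The status codes are the ones mentioned above; looking for the word "captcha" in the redirect target is a heuristic I am assuming, and real sites may behave differently.

```python
BAN_STATUSES = {403, 503}                      # statuses mentioned in the text
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def looks_banned(status, location=None):
    """Guess whether a response signals a ban.

    status is the HTTP status code; location is the Location header
    of a redirect, if there is one.
    """
    if status in BAN_STATUSES:
        return True
    # Heuristic (assumption): a redirect whose target mentions "captcha".
    if status in REDIRECT_STATUSES and location and "captcha" in location.lower():
        return True
    return False
```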
- rotate user agent from a pool of well-known ones from browsers
- disable cookies, as some sites may use them to spot bot behaviour
- use download delays (2 seconds or higher)
- if possible use Google cache
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally. An example is Crawlera
Regarding Crawlera: it uses an algorithm to avoid getting banned. It distributes the crawling task, throttles the requests sent to each site, and blacklists an IP address that gets banned so that it is not used for future requests to that domain.
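The per-domain blacklisting idea can be sketched as follows. This is not Crawlera's actual algorithm, which I don't know in detail; the class `ProxyPool`, its methods, and the addresses (taken from a range reserved for documentation) are hypothetical, and throttling is left out.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


class ProxyPool:
    """Pick an outgoing IP per request, skipping IPs banned on that domain."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.banned = defaultdict(set)   # domain -> IPs banned on that domain

    def pick(self, url):
        domain = urlparse(url).netloc
        clean = [p for p in self.proxies if p not in self.banned[domain]]
        # None means that every IP is banned on this domain.
        return random.choice(clean) if clean else None

    def report_ban(self, url, proxy):
        # The IP stays usable for every other domain.
        self.banned[urlparse(url).netloc].add(proxy)


pool = ProxyPool(["192.0.2.10", "192.0.2.11"])
pool.report_ban("http://example.com/page", "192.0.2.10")
```

After the `report_ban` call, `pool.pick` only returns `192.0.2.11` for example.com URLs, while both addresses remain available for other domains.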
Google cache can be used to avoid sending requests to the domain you want to crawl.
To access the Google cache for a particular page, you can use the following endpoint, where http://example.com is the page you want to download
To get only the text, add the parameter '&strip=1' to the url.
Keep in mind that Google cache doesn’t accept unusual User-Agent values; you should use the user agent of a browser.
Consider using AWS spot instances.
As described , you may crawl 1-3 million pages daily for just $3 (excluding storage costs).
Public architectures of Web Crawlers
In  you can find this list
 - http://crawlera.com/
With most systems, trying to run a database of any significant size requires specialized knowledge, both to build your app and to manage the database it runs on top of. MongoDB makes your first 100GB simple - from running the database to writing the code. As your database gets larger, though, it…
I’m very excited to announce that my team, FlowsBy, was selected to receive the 25,000 euro funding from TechPeaks! We are tackling the problem of how to keep shoppers on your ecommerce site longer. My teammates include:
Riccardo Osti is from Rome, and has founded 2 Italian startups…
During my Skillshare class I cover a wide variety of topics on Lean Product Management to give my students a good overview. One topic we get into before the workshop section is Tools for Lean PMs. These are web tools for creating MVPs, measuring analytics, and watching user behavior.
A very interesting set of slides from Christophe Pettus looking at the features in PostgreSQL that would allow one to use it as a document database:
- built-in type
- can handle very large documents (2GB)
- XPath support
- export functions
- no indexing, except defining custom ones using…