Automatic Extraction API

AutoExtract FAQ

How do I use the API?

The AutoExtract API is a service for automatically extracting information from web content. You provide the URLs that you are interested in and the type of content you expect to find, e.g. product or article. The service then fetches the content and applies a number of techniques behind the scenes to extract as much information as possible. Finally, the extracted information is returned to you in structured form. Documentation can be found here.
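For example, here is a minimal sketch of a request in Python, assuming the v1 endpoint and the API-key-as-username Basic auth scheme described in the documentation; the API key and product URL are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; use your own key

# One query per URL; pageType tells the service what kind of page to expect.
queries = [{"url": "https://example.com/some-product", "pageType": "product"}]

response = requests.post(
    "https://autoextract.scrapinghub.com/v1/extract",
    auth=(API_KEY, ""),  # API key as username, empty password
    json=queries,
)
response.raise_for_status()

for result in response.json():
    # Each query yields either an extracted item or an "error" field.
    print(result.get("product") or result.get("error"))
```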

Why isn't data extracted correctly?

Web scraping is complex: there are bans, location-specific content, issues with remote websites, and misbehaving web pages. And, like humans, any useful machine learning (ML) technology is never perfect. If you spot an error, please let us know, as we may be able to improve our system to extract the data correctly.

How should I use the "probability" field?

This value indicates how confident we are that a page is an individual product or article page, depending on whether pageType is "product" or "article". The closer the value is to 1, the more confident we are. For example, when you're scraping products, "probability" is high on product pages and low on product list pages, blog pages, 404 error pages, etc. You can use this field to filter out non-product or non-article pages: keep only results with a probability above a certain threshold. The recommended default threshold is 0.5 (i.e. use probability > 0.5), but you may choose a different value.
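As an illustration, here is a minimal filtering sketch, assuming results are shaped like the response in the first example, with "probability" nested inside the extracted item:

```python
THRESHOLD = 0.5  # recommended default; raise it for higher precision

def keep_confident_products(results, threshold=THRESHOLD):
    """Keep only results that look like individual product pages."""
    return [
        r for r in results
        if r.get("product", {}).get("probability", 0) > threshold
    ]
```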

What are the possible errors and how should my code handle them?

Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server can't process the input that it receives. Query-level errors occur when a specific query cannot be processed. You can detect these by checking the "error" field in query results. Documentation can be found here.
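As a sketch of handling both categories (reusing API_KEY and queries from the first example): a non-200 status signals a request-level error, while an "error" field in an individual result signals a query-level one.

```python
response = requests.post(
    "https://autoextract.scrapinghub.com/v1/extract",
    auth=(API_KEY, ""),
    json=queries,
)
if response.status_code != 200:
    # Request-level error: the whole request was rejected.
    raise RuntimeError(f"request failed: {response.status_code} {response.text}")

for result in response.json():
    if "error" in result:
        # Query-level error: this particular query could not be processed.
        print("query failed:", result["error"])
    else:
        print(result.get("product"))
```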

What should I do if my request returns with HTTP status code 429 ("too many requests")?

This status code indicates that the service is too busy: either a per-user or a system-level rate limit has been hit. The best thing to do is to keep sending requests while respecting the rate limits assigned to you; the error message is expected to contain details. If you need higher limits, please contact sales.
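For example, here is a minimal retry-with-backoff sketch; the backoff schedule below is illustrative, not a documented requirement.

```python
import time

def post_with_retries(queries, max_retries=5):
    """Resend the request after 429 responses, backing off exponentially."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.post(
            "https://autoextract.scrapinghub.com/v1/extract",
            auth=(API_KEY, ""),
            json=queries,
        )
        if response.status_code != 429:
            return response
        time.sleep(delay)  # wait before resending the same request
        delay *= 2
    raise RuntimeError("still rate-limited after retries")
```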

Can I pass custom cookies to be used to download a web page?

Not right now, but we are working on this feature. If it's important to you, please contact us so that we can notify you when it's available.

Is JavaScript executed?

We enable or disable JavaScript execution automatically, choosing whichever gives the best extraction result.

Do I have to send requests to the API in a polite manner, or will the API schedule requests so that it doesn't DDoS the target site?

The API server rate-limits requests; we try to avoid causing any problems for target websites.

Are the content extraction techniques language agnostic?

Yes, the Automatic Extraction API works on pages in all languages and from all countries.

