The AutoExtract API is a service for automatically extracting information from web content. You provide the URLs you are interested in and the type of content you expect to find, for example product or article. The service then fetches the content and applies a number of techniques behind the scenes to extract as much information as possible. Finally, the extracted information is returned to you in structured form. Documentation can be found here.
Web scraping is complex: there are bans, location-specific content, issues with remote websites, and misbehaving web pages. Like humans, no useful machine learning (ML) technology is ever perfect. If you spot an error, please let us know, as we may be able to improve our system to extract the data correctly.
The "probability" value is an indicator of how confident we are that a page is an individual Product or Article page, depending on whether pageType is "product" or "article". The closer the value is to 1, the more confident we are. For example, when you're scraping products, "probability" is high on product pages and low on product list pages, blog pages, 404 error pages, etc. You can use this field to filter out non-product or non-article pages: keep only results with a probability larger than a certain threshold. The recommended default threshold is 0.5 (i.e. use probability > 0.5), but you may choose a different one.
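As a rough illustration of the filtering described above, the sketch below keeps only records whose probability exceeds a threshold. It assumes each extracted record is a plain dict with a top-level "probability" key; the real response layout may nest this field differently, and the function name filter_by_probability is hypothetical, not part of the API.

```python
def filter_by_probability(records, threshold=0.5):
    """Keep records whose "probability" is strictly greater than threshold.

    Assumption: each record is a dict with a top-level "probability" key.
    Records without that key are dropped rather than guessed at.
    """
    kept = []
    for record in records:
        probability = record.get("probability")
        if probability is not None and probability > threshold:
            kept.append(record)
    return kept


# Made-up sample data, for illustration only.
sample = [
    {"name": "product page", "probability": 0.93},
    {"name": "product list page", "probability": 0.12},
    {"name": "record without a probability"},
]
product_pages = filter_by_probability(sample)
```

Raising the threshold trades recall for precision: fewer non-product pages slip through, but more genuine product pages may be discarded.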
Errors fall into two broad categories: request-level and query-level. Request-level errors occur when the HTTP API server can’t process the input that it receives. Query-level errors occur when a specific query cannot be processed. You can detect these by checking the error field in query results. Documentation can be found here.
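A minimal sketch of the query-level check might look like the following. It assumes the query results arrive as a list of dicts and that a failed query carries a non-empty "error" key; the helper name split_query_results and the sample values are hypothetical.

```python
def split_query_results(query_results):
    """Separate query results into (succeeded, failed) lists.

    Assumption: a result is treated as failed when it has a non-empty
    "error" key, and as successful otherwise.
    """
    succeeded, failed = [], []
    for result in query_results:
        if result.get("error"):
            failed.append(result)
        else:
            succeeded.append(result)
    return succeeded, failed


# Made-up sample data, for illustration only.
results = [
    {"query": "first", "product": {"name": "Widget"}},
    {"query": "second", "error": "could not process query"},
]
ok, errors = split_query_results(results)
```

Request-level errors are not covered here, since they concern the whole HTTP request rather than an individual entry in the results.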
This status code indicates that the service is too busy and that either a per-user or a system-level rate limit has been hit. The best thing to do is to continue sending requests while respecting the rate limits assigned to you; the error message is expected to contain details. If you need higher limits, please contact sales.
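One way to stay within an assigned rate limit is to pace requests on the client side. The sketch below is a generic throttle and not part of the AutoExtract API: the Throttle class, its max_per_second parameter, and the injectable clock and sleep arguments are all hypothetical names chosen for illustration, and the actual limit should come from your account details or the error message.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no call has been made yet

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

Calling wait() immediately before each request keeps the request rate at or below the configured value. The clock and sleep parameters exist only so the pacing can be exercised without real delays.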