[flagged] Show HN: AI-powered API that instantly obtains website information (siteprofile.io)
43 points by kianworkk on June 10, 2024 | hide | past | favorite | 42 comments
I built SiteProfile (siteprofile.io) to simplify how we access detailed website information. Its main feature is that a single, simple API call returns all the relevant information about a website. With one call, you immediately get the following:

  1. Real-time Webpage Screenshots:
  Instantly capture real-time screenshots in both PC and Mobile views.

  2. AI-Generated Content:
  Generate content based on user prompts and website data, e.g. "Describe the core functions of this website."

  3. Comprehensive Website Info:
  Social Media Links, Contact Info, Basic Details, and Assets, all in one place.
With just this API, you can instantly create an AI directory or similar website.
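To make the "single API call" idea concrete, here is a minimal sketch of composing such a request. The endpoint URL, parameter names, and prompt parameter are all assumptions for illustration, not the real siteprofile.io API:

```python
# Hypothetical sketch of a one-call client for an API like SiteProfile.
# The base URL and parameter names below are placeholders, not the real API.
from urllib.parse import urlencode

API_BASE = "https://api.siteprofile.example/v1/profile"  # placeholder endpoint

def build_profile_request(target_url, prompt=None):
    """Compose the request URL for a single profile lookup."""
    params = {"url": target_url, "screenshot": "desktop,mobile"}
    if prompt:
        # e.g. "Describe the core functions of this website."
        params["prompt"] = prompt
    return f"{API_BASE}?{urlencode(params)}"

req = build_profile_request("https://news.ycombinator.com",
                            "Describe the core functions of this website.")
print(req)
```

The point is only the shape of the interface: one GET with the target URL and an optional prompt, returning screenshots, AI content, and metadata in a single response.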

SiteProfile is currently in the testing phase. During this period, all subscriptions are discounted. There are still a few minor issues, and I’m working hard to fix them.

If you try it out, please let me know if you found it useful, if you have anything you’d like me to add, or if you have any other feedback. And if you have any questions, I’d be happy to answer. Much appreciated.



I like the idea. Looks like you've got some kinks, happens to the best of us. I look forward to playing with it when you get those ironed out.

Also, I don't care if you respect robots.txt.


I am working to fix them. Thank you very much for trying it out.


we've all been there, go get 'em


does your bot respect robots.txt directives?

pray tell, what is your bot's user agent string, [so i can nicely block you from my web-properties.]


I tried with a nonexistent path, here are the user agents I got:

  Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
It's apparently built to evade detection.
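For anyone who wants to repeat this probe on their own server: request a path that doesn't exist, then pull the User-Agent field out of the access log for that path. A minimal sketch, assuming the common "combined" log format; the log line and probe path here are stand-ins for your real log:

```python
# Sketch: extract User-Agent strings for hits on a probe path from an
# access log in nginx/Apache "combined" format. LOG_LINES is a stand-in;
# read your real log file instead.
import re

LOG_LINES = [
    '1.2.3.4 - - [10/Jun/2024:12:00:00 +0000] "GET /probe-xyz HTTP/1.1" 404 0 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"',
]

# In combined format the line ends with "referrer" "user-agent".
ua_re = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')

agents = {
    m.group("ua")
    for line in LOG_LINES
    if "/probe-xyz" in line and (m := ua_re.search(line))
}
for ua in sorted(agents):
    print(ua)
```

Requesting a nonexistent path filters out ordinary visitors, so anything that shows up is almost certainly the service's fetcher.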


the more reasonable assumption is that it uses a third-party crawl service.

but still, it doesn't matter; it's acting on behalf of a user, and you aren't entitled to know what software your users run. One consequence of putting stuff on the public internet is that it's, well, public.


what sort of evasion technique is being used here, random user agents on each attempt?


> what sort of evasion technique is being used here

Using a user-agent that looks like a desktop user-agent, rather than including the name of the actual project/product that is being used.

I understand why though, plenty of websites block anything that looks like non-desktop/mobile user-agents, so makes sense. Besides the pragmatic reason, I also agree with menacingly that it doesn't matter, people should be able to use whatever user-agent they want to access the content you've put publicly on the internet.


The goal of SiteProfile is not to scrape data. It only accesses publicly available web pages, such as the homepage, about page, and pricing page. It does not access non-public content on websites, nor does it offer users the functionality to scrape website data.


What does it mean exactly for the service to provide information about a website without scraping it? How could summaries or LLM responses be generated without scraping pages?


Presumably the same way that Firefox makes an HTTP request to the webserver then formats the page for the human user. This is just formatting that page differently. This is no more a scraper than is Firefox's Reader Mode.

That said, lying about the UA is not cool.


I have something that sends a UA of "Sitetruth.com site rating system". Many sites won't talk to that.


I've used a reader mode library that I think was created by Mozilla, which handles converting a site to reader mode locally. Does the Firefox browser do it locally, or at least on demand? If so, I wouldn't really consider that scraping, since they aren't parsing the site and storing data for later use.


It does scrape the site in order to summarise it, no?


Your statement doesn't answer their question.


I meant that it is not supported yet. I will add this to the to-do list, and I believe it does not conflict with the goals of SiteProfile. Thank you for your feedback.


so you scrape + store + process contact info etc, presumably.

sounds like a privacy nightmare

no doubt this is not GDPR compliant.

no doubt this is not legal in some parts of the world - unless people can opt out and get their data removed.


> no doubt this is not GDPR compliant.

Unless the project is open source, no doubt you cannot know this. If they don't store any of those details anywhere (including not in logs) but just pass it along, GDPR won't apply.


> does your bot respect robots.txt directives?

Would be a bit strange if it did, as the service is not a crawler/robot by any measure.

Bit like asking if cURL is "respecting" robots.txt.

It's just another user-agent after all.


It is a service that seems to crawl a website for content and feed that content into some LLMs. It should absolutely respect robots.txt. This is exactly what robots.txt is used for, to tell automated crawlers of a website what they should and should not do.


I disagree - this is not a crawler that just blindly stumbles around any random website that it finds. It is more akin to a user agent. The only requests it makes are derived from specific instructions by the user to do so.

Having said that, people may use it as a crawler, just like you might be able to script Firefox to be a crawler, but it is not in itself a crawler.


It doesn't need to be blindly stumbling around the web. But you might be right about it only grabbing one page, and if you are, then I agree that abiding by robots.txt would only upset a tiny minority. When they talk about websites, it makes me think they crawl all the pages linked from the homepage, because the question-asking part is extremely limited if all it does is look at one page. If they crawl, then I think they need to abide. If they don't, I think it's ok.


> It is a service that seems to crawl a website for content and feed that content into some LLMs.

It doesn't seem to work like that at all, to me.

As far as I understand, you give it a specific URL, and it extracts content from that URL and that URL only. A "crawl" would mean it would also follow links automatically, which I don't see any evidence of being done, from the landing page at least.


The link to "Can’t find the answer you’re looking for? Reach out to our customer support team." just takes me to the top of the page.

And I have to sign in to see the Privacy Policy or the Terms of Service?


Sorry, this is a bug. I will fix it as soon as possible; all my focus has been on building the API.


Hrm...

"Application error: a server-side exception has occurred (see the server logs for more information). Digest: 2269195897"

That was during sign up.


Yes, today is the first day of testing for SiteProfile. I have seen the issues everyone is encountering in the logs, and I will fix them as soon as possible. Thank you very much for trying it out.


Site tries to access "quality-sawfly-29.clerk.accounts.dev", rejected by the EFF's Privacy Badger.


You should make the JSON graphic on your homepage bigger; I can't read it even after enlarging it.


I completely agree, web design is very challenging for me. I am trying to create a better hero image...


I have actually this week been building a similar API for my own internal use. Cool project!

Any intention of open sourcing it?


I'm not sure yet, but I might consider it in the future


Wow! This is great. I was looking for a service with this exact feature set. Keep up the good work.


I'm very glad you like this idea. I think we should continue to add more website meta information in the future, such as WHOIS data and SimilarWeb traffic, but I'm not sure if everyone needs this.


I don't. I just need the AI summary-of-a-website service as an API that I can call.


Wtf is this?


I don't understand: does this just answer LLM-type questions about a website?


Looks pretty similar to ez-extract.com, but more focused on metadata?


I've done things like this adhoc by just feeding gpt html.

Anyone can do this.


How is it different from or better than builtwith.com?


BuiltWith's goal is to tell you which technologies a website is built with. SiteProfile's goal is to provide you with a website's meta information. Just like getting to know a person, I created a profile for websites, including images, social media links, and other information, so you can quickly understand the site.


Great, I could see that now. Thank you.



