1. I would rename pixel.png to something like image.png. Never name your script or image anything like tracking, analytics, or pixel when you don't want to be blocked by ad blockers. We use hello.js and hello.gif. [1]
> Log file parsing is an old-skool but effective way of measuring the traffic to your site.
2. By using a pixel image you can bypass caching. With server logs alone you only see the non-cached requests, so an image like the one you use is a better approach.
3. Your image is being cached, so if somebody revisits your website the image will not be loaded and you will find nothing in your logs. Just disable your ETag and set an expiry date in the past.
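The no-cache advice in point 3 boils down to a handful of response headers. A minimal sketch in Python (the names `PIXEL_GIF` and `pixel_headers` are my own; the header values are standard HTTP caching directives):

```python
# A minimal 1x1 transparent GIF (43 bytes), the classic tracking pixel.
PIXEL_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
    b"\x00\x00\x02\x02D\x01\x00;"
)

def pixel_headers():
    """Response headers that defeat caching, so every page view reaches the log."""
    return {
        "Content-Type": "image/gif",
        "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
        "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",  # an expiry date in the past
        # Deliberately no ETag or Last-Modified: nothing to revalidate against.
    }
```

Served with these headers and no validators, the browser has to re-fetch the pixel on every visit, so each page view lands in the access log.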
It's awesome that more and more people are replacing Google Analytics with simpler tools. GA is overly complicated and doesn't have the best privacy mindset. I built Simple Analytics [2] as a privacy-friendly alternative.
If your business is to track people, yes. If it's to gather statistics without tracking people (like page views), I think it's perfectly fine to bypass ad blockers. We even have a dedicated feature for bypassing ad blockers [1] because we think page views are not privacy-invasive. We drop the IP address from every request, so there is no personal data in our database or logs.
If you really want to block us you can enable the Do Not Track setting, although I think this should only be used when sites are actually tracking people (we don't). So this feature might be removed in the future. It has already been removed by Safari because it is just another parameter for fingerprinting a browser.
The privacy game is about power, not about who is doing what right now. People shun Google's data collection because of what Google can do with the data, not what it has done or is doing; it only takes a single case of data misuse to reveal the power dynamics even if nothing has happened to them personally.
You don't have to play the privacy game -- there is a lot of space between really respecting users' privacy and breaking privacy laws. But if you do, you should put the power back in the users' hands.
There have been countless cases for me personally where Google's tracking is creepy: Google Maps recommendations, YouTube videos I should watch, misleading ads that send users to malware based on interests Google has inferred about them, and the exposure of unauthenticated Google+ APIs that allowed access to sensitive data, to name a few.
I think saying nothing bad has happened is disingenuous. As soon as Google gets exposure similar to what FB is getting right now, internal whistleblowers might come forward with more stories as well.
Also, a couple of years ago Google was thoroughly compromised by at least one foreign government. Ever wonder how much data was stolen?
Thank you for that comment. It means a lot that such a thoughtful consideration of privacy is coming from someone working at Google, there is hope yet for a humane approach to analytics and data collection.
I think it is fair to know exactly what data I'm giving up and what Google is doing with it, if I choose to exchange it for free services and goods. Hopefully someone with a spine comes to power and enacts regulation that gives users a much clearer understanding of this. Also, people dislike Google for many things it has already done, not just because of an imagined future problem. And... nothing personal against you!
I don't think what you're doing is the end of the world, but this answer is extremely patronizing.
You've basically said you know some users block your stuff, but you think they probably shouldn't really want to block your stuff, so you devise an end run around them.
Then, finally, you conclude that if they really want to block your stuff, they should do something that you yourself admit won't block your stuff, but that's okay because they shouldn't want to block your stuff.
This example of tracking is tiny and harmless. But there's a wide spectrum of tracking behavior from sites.
> whiny, entitled users who think they have a right to use any website they want on their terms
Users do have a right to use any website they want to on their own terms. If I make an HTTP GET to your site, it's up to you to decide what HTML to return. Once you do, it's up to me whether to request the images, scripts, etc, and whether to read the sidebar, etc.
It's up to you to decide whether to show me any content of substance without first collecting payment. I can't demand that you publish content for free. I'm not entitled to that.
But it's up to me to decide what I consume. You can't demand that I view ads, send back tracking cookies, etc. You're not entitled to that.
If you don't like site visitors refusing to be tracked, then don't let them view your website. Nobody forced you to.
I'd assume that any client/server connection has some sort of logging mechanism. I don't think browser fingerprinting and large-scale cross-site tracking are good, but it seems hard to take issue with being counted when visiting someone else's website.
How do I know what physical trackers are being used before entering a physical location? I use my eyes.
Businesses often count footfall with IR or laser sensors; they're generally on the door. How do I know that businesses with cameras claiming to only count traffic are not actually gathering a whole lot more information to use, sell, or change their mind about later?
I have no skin in this game but the original comment was more focused on "we've decided to bypass your adblocker because we feel that our interests outweigh yours"
I completely agree. I don't think Do Not Track was ever about knowing how many visitors you have or which browsers they use; that's basic knowledge needed to improve the site. Tracking becomes an issue when it's shared across sites and personal data is collected without consent to be sold to the highest bidder.
The author said it himself: he could have just put CloudFront in front of GitHub Pages and parsed the access logs there. The referrer and user agent, which are sent by the browser, are in the request headers. There would be no need for JS or an image if the website owner just changed their setup.
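Pulling the referrer and user agent out of access logs is a few lines of code. A hedged sketch in Python, assuming the common "combined" log format (the regex and field names are my own):

```python
import re

# Combined log format, one line per request:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str):
    """Return the fields of one access-log line as a dict, or None if unparseable."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

Feeding each line of the log through `parse_line` and tallying the `referer` and `agent` fields already gives you the basic traffic picture the article is after.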
Not at the moment. We will implement user-flow tracking in a privacy-minded way. I asked Hacker News a while ago whether they would consider it privacy-friendly [1]
To be clear about the other thread about bypassing ad blockers: we will not have this feature for user-flow tracking. People should have the right to block events if they want. But for basic info such as page views we will keep offering the ad blocker bypass feature.
> Obviously with log parsing you don’t get as much information as a JavaScript-heavy, Google Analytics-style system. There’s no screen sizes, no time-on-page metrics, etc. But that’s okay for me!
This also bypasses ad blockers. If you have a large percentage of technical visitors (who presumably have an ad blocker installed), this log can be far more accurate than GA.
However this still requires setting up the image on a third-party server. I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
> I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
They already owned an analytics service for a while, but it changed owners recently. It still looks almost the same though: https://get.gaug.es/ - so I guess they decided it's not part of their core business.
> I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
Cloudflare does this for free, and you can run it on top of Netlify, Github/Gitlab pages, etc.
I don't think Cloudflare's info is very helpful, at least in the free/cheap plans. It can help to roughly identify where your visitors come from and how many visits are from bots but that's about it.
> However this still requires setting up the image on a third-party server.
While this is true, we found a solution to bypass ad blockers (which could be implemented by Google Analytics as well). My experience is that ad blockers only block scripts and pixels that are deployed across multiple websites [1]. With a custom domain and a script or URL that doesn't have an analytics-sounding name, ad blockers are unlikely to block you. At Simple Analytics we created a feature for this where customers can point a CNAME to our server [2]. We set up SSL and proxy all requests to our server. This makes it almost impossible for ad blockers to block the stats of those customers.
For those of you hosting on your own server, this is a PSA that awstats is still working. It reads server logs and makes little graphs via a cron job. It's fun to have analytics going back to 2002 in the same format.
Yes, I've also been using awstats for many years. It definitely works for a small site, has a lot of useful features and occasional releases still keep up with new user agents, etc. that pop up over time. Each time I want to switch I fail to find a suitable replacement.
That said, behind the curtain, awstats has plenty of problems and shows its age. Most of it is a single ~20k-line script with hundreds of global variables, so it's very challenging to debug. There are no tests. Over time it has also had plenty of security issues [1]. I wouldn't recommend running it in any mode other than generating static HTML reports from an unprivileged cronjob.
I've made my own test suite and I'm using a slightly patched version with ~20 commits on top of the latest release that fix problems I found and that upstream didn't merge (still from the SourceForge days - since they switched to GitHub they do seem to be a bit better about accepting pull requests). It doesn't help that when you submit patches, concerns regarding GDPR compliance, for example, are met with responses like [2].
I read issue [2] and I think it's reasonable to do this at the LogFormat level. awstats doesn't need to do everything; that is probably how it got to be 20,000 lines.
I always use awstats to generate static pages, and that's what any security-conscious operator should do.
I see users with a valid issue (even quoting relevant laws) being called names and told in a patronizing tone that widely accepted interpretations of said laws are wrong.
I like GoAccess because it works well with multiple vhosts, if you have a lot of sites and want to see relative busyness. Also you can see if any particular site or individual resource is consuming too much bandwidth.
I wish they had a simple Docker image available with everything setup and configurable from a single file. The installation process is complex and under documented, IMHO. But the software is pretty nice!
Why do you need Docker for a simple PHP & MySQL script? To install it you just need to set up a MySQL database and enter the credentials in the setup script. Finally, add the tracking code to your site, but that's the same for every analytics engine.
And you also need to set up a bunch of things in your web server, add the cron jobs, run the update script and hope everything still works afterwards etc.
If you take the trouble to host everything on a stack you actually control, raw server logs plus GoAccess are a highly capable and, for most cases, sufficient monitoring tool. I use nothing but that.
Same here: Nginx logs + GoAccess. I occasionally log in to the server and generate the GoAccess HTML to check the traffic. I also anonymize IPs in Nginx - that way I don't have to deal with GDPR. No cookies, JS, third-party images, etc. I really see no need for anything else for a purely content-based website. Then again, the website is purely non-commercial for now.
"We do not access or use your content for any purpose without your consent. We never use your content or derive information from it for marketing or advertising"
I think it is incredibly fair to say that you shouldn't put data onto the servers of a company if you mind them reading that data. Why should I assume they aren't? At a bare minimum, the metadata is being actively consumed by the parent company. I don't think this is AWS specific at all.
I would love to hear one case where that has happened. Information is free money, something virtually no company making money on information turns down.
You’re mixing things up. A corporation is not opposed to corporate profits, and in this case they make profits by safely, securely, and privately storing your data as requested.
Not doing so would be a major hit to their business. Whether you trust their reputation and compliance certifications is up to you, but a completely different issue.
I'm not accusing Amazon or anyone else of anything in particular. I'm just reminding people that even if you encrypt all of your data at rest, that at a minimum, every service provider of something like AWS is collecting plenty of metadata about your data. They have to. How else can they tell where your data is and track how much they need to charge you?
> How else can they tell where your data is and track how much they need to charge you?
Are you being serious with this question? Regardless of whether they look at your data behind the scenes without telling you, they don't need to know exactly what data and metadata you have in order to charge you. My AWS bill charges me for CPU hours of server uptime, network requests, load balancing, etc. None of this requires them to know your data or even much metadata (other than which domain and server to route to, obviously, but that's public information anyway).
By your logic, if someone is using end to end encryption, are you saying Amazon wouldn't be able to charge them because they can't look at the encrypted data?
I was thinking more about S3, where I know for a fact that they charge on a per-request and per-volume basis. Amazon can and should absolutely charge irrespective of the contents of the blob of data, and regardless of whether they read it or not; my point is, encrypt it if you're worried.
Oh yea, I totally agree with the "encrypt it if you're worried" part. In my opinion, if you have any user sensitive info, you should absolutely use end to end encryption and ssl etc.
> How else can they tell where your data is and track how much they need to charge you?
You say companies need to track metadata for pricing and don't know where your data is without tracking? Huh, that answer is surreal. I thought you had a console to set everything up and pricing is based on cost plus desired profit.
AWS runs at something like a 33% profit margin. None of their pricing is based on GA :p (like, wtf)
I agree with that; something like that could happen. The question is, how do we stop it? This reminds me of when Facebook owned Parse.com and a lot of developers used Parse for storing user data. Given Facebook's track record with privacy, I wouldn't be surprised if they used it.
My personal stance is to just make it harder. Encrypt the data, and use your own server so there's less metadata. Of course, even then, if the data center company really wanted to, was happy to break the law, and went deep enough into the attack, they could probably find a way to get to the data. But it's much harder and much more illegal, and even they will think longer before doing that versus "just copy the data and don't tell anyone" when it's sitting in plain text on their machine.
It's more work to do it this way and forsake the convenience provided by Amazon & Co, but I prefer it. Plus I learn lots of things while doing it which I would've outsourced otherwise.
There's also the problem that Amazon might stay 100% truthful and trustworthy and never access my data. But their employees might not. The NSA's surveillance data was misused by employees to stalk romantic interests, so there's no reason to believe that an Amazon employee couldn't wrongfully do the same. They might be fired for it, but the damage is done, privacy has been compromised.
This looks like a solution in need of a problem. I previously tried similar things at scale, mainly because I wanted unsampled reports and GA charges $100,000 for premium services. What I found is that raw logs are not reliably accurate: the volume of traffic in certain countries was accurate but not reliably "real" users, even after accounting for known bots, search engines, etc. GA has a way of accounting for these and giving you a better overall picture. The second thing is that GA has improved its service, so you now get unsampled reports even at scale for tier 1 reports. At low volume, they're unsampled already.
I’m not sure why anyone would want to waste time with this.
I prefer the raw logs. One reason is that Google performs "statistical improvement" like "sessionizing." This destroys the value of point-process data; you wouldn't find the operation in a stats textbook because it throws away valuable information. Also, the concept of Visitors and Users that GA uses isn't transparent to me.
Another fun fact: since the tracking happens on the client side, there's potentially a ton of truncated data that GA simply misses. Backend server instrumentation doesn't suffer the same way.
I think those are fair points; mileage will depend on your end goals. We want to know how our traffic relates to real-world ad deliverability, real users in our funnel, etc. I'd agree that "the concept of Visitors and Users that GA uses isn't transparent to me," but I'd add that whatever they do is more accurate than what we could get from raw logs as it relates to relatable business metrics.
Taking out the tracking code snippet increases your Google PageSpeed score. That makes sense, but I was always amused that Google was basically taking points away for using their own analytics on your site.
I struggle with this one. We have been on the Google roller coaster a few times now, where for no apparent reason our traffic goes up or down wildly. From this I have developed a (likely irrational) fear that messing with things, by removing Google Analytics or even adding a competitor like Hotjar, will put us back on the roller coaster.
I'm using Ahoy (https://github.com/ankane/ahoy) with my Rails applications and I'm very happy with it. Geocoding via the IP is included, and the thing that matters for me is being able to set an additional conversion flag on the visit.
I used to spend a lot of time looking at logs, trying to make pages better and adding more content based upon the stats. Then Google stopped passing keywords and search phrases; since then I've found stats of little use most of the time.
I wish Google would make search-keyword hiding opt-in for users, and perhaps automatically opted in when using incognito mode. I am sure most of my visitors would be glad to provide their search phrases, knowing that it helps us make more things and make them better. But Google does not let them opt in to sharing; they are all basically opted out.
I think it's a great ability - and a great option. I'd like this to be optional with sites and browsers. Give users the ability to change these settings.
Give websites a way to say: thanks for visiting, we noticed you are using a browser or search portal that strips info from us; would you please click to enable sharing this small bit of info? (More about how we use it, and what info, here.)
Something like this could help sites and users, and I'd like to toggle it myself. I like how Startpage scrambles URL queries, but I would turn it off for some sites, whitelisting them like some are with uBlock, etc. I also don't like how p-hub and some others keep queries in the URL, and would like an option to scramble them - via the site, browser settings, proxies, whatever it takes - to give more options and more choice.
Does GSC show the search terms used to find each page these days? It's been so long since I used it, I'm not sure if it's changed or if I never knew where to look.
Way back in the day, awstats, Webalizer, and similar server-side stats tools would show which keywords were searched as totals, and which pages were found by each set of keywords (with totals for each page/key phrase) - this info was valuable. I'm not aware of any way to get that info since the Google changes some time ago.
Another alternative for more privacy with GA is to proxy all requests to GA via your own simple proxy server (analytics.yoursite.com) and drop the last bytes of the visitor IP when proxying.
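The IP-truncation step can be sketched in a few lines. A hedged illustration in Python, assuming the proxy is something you can add a hook to (the function name is my own; zeroing the last IPv4 octet or the IPv6 host bits is the same idea as Google Analytics' own IP-anonymization option):

```python
import ipaddress

def anonymize_ip(ip: str) -> str:
    """Zero the host portion of an address before logging or forwarding:
    keep a /24 for IPv4, a /48 for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)
```

Calling `anonymize_ip("203.0.113.42")` yields `"203.0.113.0"`, which is still useful for rough geolocation but no longer identifies an individual visitor.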
Nice post! It's always fun reading about people being creative and challenging the analytics status quo (aka GA). Besides the joy of doing it yourself, you've accomplished a couple other things worth mentioning:
1. You'll never be sampled. GA samples historical data pretty heavily, and you have to pay for 360 to retain unsampled event data (to the tune of $160k+ per year).
2. You have full access to all generated data.
I'd highly recommend using Snowplow's javascript tracker (https://github.com/snowplow/snowplow-javascript-tracker) in a very similar manner to what you've outlined here. You'll get a ton of extra functionality out of the box, which would add yet another level of insight. With snowplow, you get the following for free:
1. Sessionization, which is consistent with google analytics' definition - effectively a 30 minute window of activity.
2. User identification - the tracker drops a persistent cookie (just like GA), so you can see returning visitors.
7. Ability to make your event tracking 100% first-party
(Disclaimer: I don't work for them, but I've seen the system work very well a number of times.)
I'm running a similar setup on my blog, and it costs well under $1 per month: https://bostata.com/client-side-instrumentation-for-under-on.... I'm doing the same exact thing with Cloudfront log forwarding and have several lambdas that process the files in S3. From there, I visualize traffic stats with AWS Athena (but retain a ton of flexibility, since they are all structured log files).
Yeah, good question. I wanted to do that, but GoAccess is a web server log parser and doesn't support custom fields (you don't get screen resolution via web logs, so it kinda makes sense). See: https://goaccess.io/man
I could probably hack it and overload different HTTP status codes to mean different screen sizes or something, but I didn't consider device size to be important for me. GoAccess does break down the User-Agent into OS, so I can see mobile usage via the "iOS" and "Android" OS usage. Breakdown for my site: Windows 24%, iOS 22%, macOS 19%, Android 20%, Linux 11%, other 4%. So mobile usage is probably about 45%.
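The OS breakdown GoAccess gives can be approximated from the User-Agent alone. A rough sketch (simple substring matching like this is crude; real parsers use full UA grammars, so treat the function as illustrative):

```python
def detect_os(user_agent: str) -> str:
    """Very rough OS bucket derived from a User-Agent string."""
    ua = user_agent.lower()
    # Order matters: Android UAs also contain "linux",
    # and iPhone UAs contain "mac os x".
    if "android" in ua:
        return "Android"
    if "iphone" in ua or "ipad" in ua:
        return "iOS"
    if "windows" in ua:
        return "Windows"
    if "mac os" in ua or "macintosh" in ua:
        return "macOS"
    if "linux" in ua:
        return "Linux"
    return "other"
```

Running each logged User-Agent through this and counting the Android and iOS buckets gives the same mobile-share estimate described above.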
If you're using JavaScript you could make a request in the background with a bunch of extra HTTP headers, like X-Screen-Resolution, and then have your web server log them.
I was thinking about switching to log-based user tracking some time ago. Not because of big-brother issues; rather, my intention was to remove the cookie-banner nonsense required by the EU. No cookies, no banner required, right? There are surely some downsides, but at the current stage of our analytics, logs should hold enough information for the analytics we need.
It is always possible to overcome ad blocker blocking with JS trackers if you are self-hosting and have the option to modify strings in the JS SDK - there are several ways to do this. You can also get more data with Countly, Matomo, or Fathom without using direct server logs.
That's because it's getting harder and harder to get that info from the referrer. Better to use Google Search Console and Bing Webmaster Tools; those will give you information even on search queries that listed your pages but didn't lead to a click.
Once upon a time you could get this information from the HTTP_REFERER header, since search terms were essentially encoded in the query string portion of the URL the search results showed up on.
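When referrers did carry the search URL, extracting the terms was simple query-string parsing. A sketch of that old-school approach in Python (the parameter names `q`/`query`/`p` reflect the engines of that era; treat them as illustrative):

```python
from urllib.parse import urlparse, parse_qs

def search_terms(referer: str):
    """Extract the search query from an old-style search-engine referrer,
    or return None if no query parameter is present."""
    qs = parse_qs(urlparse(referer).query)
    for key in ("q", "query", "p"):  # Google/Bing used q, Yahoo used p
        if key in qs:
            return qs[key][0]
    return None
```

For example, `search_terms("https://www.google.com/search?q=server+log+analytics")` recovers the phrase the visitor searched for, which is exactly the data Google later stripped out.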
It was removed by Google: all search result clicks go through an intermediate URL that strips the keywords. You can get some information from the Google URL [1].
The real power of Google Analytics is not in the tracking code; it's in the front-end interface that any non-technical product manager or marketer can use. Tracking users is easy. Giving non-technical folks easy-to-use analysis tools is much harder.
The (lack of) speed and the complexity of GA astound me every time I go in (I use it across a portfolio of businesses). Using it on a 100Mb connection is still like pulling teeth.
I've researched building out a desktop app that pulls GA data over the API in the background so you can get key stats out much quicker, but it's quite an investment of time to be beholden to Google's platform.
Now I'm doing some dogfooding on a web analytics service I've been evolving that tries to answer the "why" of changes in traffic/behaviour over time ("traffic's up today... not sure why?"). Google does this with their GA mobile app ("Insights"), but what and when they show you doesn't seem too predictable.
> The ever changing google analytics dashboard is not that easy to use for non technical folks.
True, but easier to learn than custom programming and visualizations. A smart English major can still figure it out on their own. Giving product managers the tools to do in-depth analysis is a huge plus. Otherwise, they just ask devs to run reports all the time, burdening devs and slowing down analysis.
Yep, they redesigned it so much that I had to build a custom dashboard for a customer (think of it as a glorified counter from the 1990s), and I have to admit that even I get lost in Analytics and AdWords.
When I joined the startup I was at for a few years, the original GA had been set up for our CEO by somebody not especially technical. The CEO was also non-technical and could never get it to do anything he wanted, so somehow it ended up with me, the techie, having to make things work.
[1] https://docs.simpleanalytics.com/script
[2] https://simpleanalytics.com