1. I would rename pixel.png to something like image.png. Never name your script or image anything like tracking, analytics, or pixel when you don't want to be blocked by ad blockers. We use hello.js and hello.gif. [1]
> Log file parsing is an old-skool but effective way of measuring the traffic to your site.
2. By using a pixel image you can bypass caching. With server logs alone you only see the non-cached requests, so an image like the one you use is a better approach.
3. Your image is being cached, so if somebody revisits your website the image will not be loaded and you will find nothing in your logs. Just disable your ETag and set an expiry date in the past.
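The no-cache advice in point 3 boils down to a handful of response headers. A minimal sketch in Python (the names `PIXEL_GIF` and `pixel_headers` are my own; the header values are standard HTTP caching directives):

```python
# A minimal 1x1 transparent GIF (43 bytes), the classic tracking pixel.
PIXEL_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01"
    b"\x00\x00\x02\x02D\x01\x00;"
)

def pixel_headers():
    """Response headers that defeat caching, so every page view reaches the log."""
    return {
        "Content-Type": "image/gif",
        "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
        "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",  # an expiry date in the past
        # Deliberately no ETag or Last-Modified: nothing to revalidate against.
    }
```

Served with these headers and no validators, the browser has to re-fetch the pixel on every visit, so each page view lands in the access log.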
It's awesome that more and more people are replacing Google Analytics with simpler tools. GA is overly complicated and doesn't have the best privacy mindset. I built Simple Analytics [2] as a privacy-friendly alternative.
If your business is to track people, yes. If it's to gather statistics without tracking people (like page views), I think it's perfectly fine to bypass ad blockers. We even have a dedicated feature for bypassing ad blockers [1] because we think page views are not privacy-invasive. We drop the IP address from every request, so there is no personal data in our database or logs.
If you really want to block us you can enable the Do Not Track setting, although I think this should only be used when sites are actually tracking people (we don't). So this feature might be removed in the future. It has already been removed by Safari because it is just another parameter for fingerprinting a browser.
The privacy game is about power, not about who is doing what right now. People shun Google's data collection because of what Google can do with the data, not what it has done or is doing; it only takes a single case of data misuse to reveal the power dynamics even if nothing has happened to them personally.
You don't have to play the privacy game -- there is a lot of space between really respecting users' privacy and breaking privacy laws. But if you do, you should put the power back in the users' hands.
There have been countless cases for me personally where Google's tracking is creepy: Google Maps recommendations, YouTube videos I should watch, misleading ads that send users to malware based on interests Google has inferred about them, and the exposure of unauthenticated Google+ APIs that allowed access to sensitive data, to name a few.
I think saying nothing bad has happened is disingenuous. As soon as Google gets exposure similar to what FB is getting right now, internal whistleblowers might come forward with more stories as well.
Also, a couple of years ago Google was thoroughly compromised by at least one foreign government. Ever wonder how much data was stolen?
Thank you for that comment. It means a lot that such a thoughtful consideration of privacy is coming from someone working at Google, there is hope yet for a humane approach to analytics and data collection.
I think it is fair to know exactly what data I'm giving up and what Google is doing with it, if I choose to exchange it for free services and goods. Hopefully someone with a spine comes to power and enacts regulation that gives users a much clearer understanding of this. Also, people dislike Google for many things it has already done, not just because of an imagined future problem. And... nothing personal against you!
I don't think what you're doing is the end of the world, but this answer is extremely patronizing.
You've basically said you know some users block your stuff, but you think they probably shouldn't really want to block your stuff, so you devise an end run around them.
Then, finally, you conclude that if they really want to block your stuff, they should do something that you yourself admit won't block your stuff, but that's okay because they shouldn't want to block your stuff.
This example of tracking is tiny and harmless. But there's a wide spectrum of tracking behavior from sites.
> whiny, entitled users who think they have a right to use any website they want on their terms
Users do have a right to use any website they want to on their own terms. If I make an HTTP GET to your site, it's up to you to decide what HTML to return. Once you do, it's up to me whether to request the images, scripts, etc, and whether to read the sidebar, etc.
It's up to you to decide whether to show me any content of substance without first collecting payment. I can't demand that you publish content for free. I'm not entitled to that.
But it's up to me to decide what I consume. You can't demand that I view ads, send back tracking cookies, etc. You're not entitled to that.
If you don't like site visitors refusing to be tracked, then don't let them view your website. Nobody forced you to.
I'd assume that any client/server connection has some sort of logging mechanism. I don't think browser fingerprinting and large-scale cross-site tracking are good, but it seems hard to take issue with being counted when visiting someone else's website.
How do I know what physical trackers are being used before entering a physical location? I use my eyes.
Businesses often count footfall with IR or laser sensors; they're generally on the door. How do I know that businesses with cameras claiming to only count traffic are not actually gathering a whole lot more information to use, sell, or change their mind about later?
I have no skin in this game but the original comment was more focused on "we've decided to bypass your adblocker because we feel that our interests outweigh yours"
I completely agree. I don't think Do Not Track was ever about knowing how many visitors you have or which browsers they use; that's basic knowledge needed to improve the site. Tracking becomes an issue when it's shared across sites and personal data is collected without consent to be sold to the highest bidder.
The author said it himself: he could have just put CloudFront in front of GitHub Pages and parsed the access logs there. The referrer and user agent, which are sent by the browser, are in the request headers. There would be no need for JS or an image if the website owner just changed their setup.
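Pulling the referrer and user agent out of access logs is a few lines of code. A hedged sketch in Python, assuming the common "combined" log format (the regex and field names are my own):

```python
import re

# Combined log format, one line per request:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str):
    """Return the fields of one access-log line as a dict, or None if unparseable."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

Feeding each line of the log through `parse_line` and tallying the `referer` and `agent` fields already gives you the basic traffic picture the article is after.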
Not at the moment. We will implement user-flow tracking in a privacy-minded way. I asked Hacker News a while ago whether they would consider it privacy-friendly [1]
To be clear about the other thread about bypassing ad blockers: we will not have this feature for user-flow tracking. People should have the right to block events if they want. But for basic info such as page views we will keep offering the ad blocker bypass feature.
> Obviously with log parsing you don’t get as much information as a JavaScript-heavy, Google Analytics-style system. There’s no screen sizes, no time-on-page metrics, etc. But that’s okay for me!
This also bypasses ad blockers. If you have a large percentage of technical visitors (who presumably have an ad blocker installed), this log can be far more accurate than GA.
However this still requires setting up the image on a third-party server. I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
> I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
They already owned an analytics service for a while, but it changed owners recently. It still looks almost the same though: https://get.gaug.es/ - so I guess they decided it's not part of their core business.
> I would really love it if GitHub pages or Netlify can provide some simple server-side tracking. It doesn't match GA but in some cases that's all I need.
Cloudflare does this for free, and you can run it on top of Netlify, Github/Gitlab pages, etc.
I don't think Cloudflare's info is very helpful, at least in the free/cheap plans. It can help to roughly identify where your visitors come from and how many visits are from bots but that's about it.
> However this still requires setting up the image on a third-party server.
While this is true, we found a solution to bypass ad blockers (which could be implemented by Google Analytics as well). My experience is that ad blockers only block scripts and pixels that are deployed across multiple websites [1]. With a custom domain and a script or URL that doesn't have an analytics-sounding name, ad blockers are unlikely to block you. At Simple Analytics we created a feature for this where customers can point a CNAME to our server [2]. We set up SSL and proxy all requests to our server. This makes it almost impossible for ad blockers to block the stats of those customers.
For those of you hosting on your own server, this is a PSA that awstats is still working. It reads server logs and makes little graphs via a cron job. It's fun to have analytics going back to 2002 in the same format.
Yes, I've also been using awstats for many years. It definitely works for a small site, has a lot of useful features and occasional releases still keep up with new user agents, etc. that pop up over time. Each time I want to switch I fail to find a suitable replacement.
That said, behind the curtain, awstats has plenty of problems and shows its age. Most of it is a single ~20k-line script with hundreds of global variables, so it's very challenging to debug. There are no tests. Over time it has also had plenty of security issues [1]. I wouldn't recommend running it in any mode other than generating static HTML reports from an unprivileged cronjob.
I've made my own test suite and I'm using a slightly patched version with ~20 commits on top of the latest release that fix problems I found and that upstream didn't merge (still from the SourceForge days - since they switched to GitHub they do seem to be a bit better about accepting pull requests). It doesn't help that when you submit patches, concerns regarding GDPR compliance, for example, are met with responses like [2].
I read issue [2] and I think it's reasonable to do this at the LogFormat level. awstats doesn't need to do everything; that is probably how it got to be 20,000 lines.
I always use awstats to generate static pages, and that's what any security-conscious operator should do.
I see users with a valid issue (even quoting relevant laws) being called names and told in a patronizing tone that widely accepted interpretations of said laws are wrong.
I like GoAccess because it works well with multiple vhosts, if you have a lot of sites and want to see relative busyness. Also you can see if any particular site or individual resource is consuming too much bandwidth.
I wish they had a simple Docker image available with everything setup and configurable from a single file. The installation process is complex and under documented, IMHO. But the software is pretty nice!
Why do you need Docker for a simple PHP & MySQL script? To install it you just need to set up a MySQL database and enter the credentials in the setup script. Finally, add the tracking code to your site, but that's the same for every analytics engine.
And you also need to set up a bunch of things in your web server, add the cron jobs, run the update script and hope everything still works afterwards etc.
If you take the trouble to host everything on a stack you actually control, raw server logs plus GoAccess are a highly capable and, for most cases, sufficient monitoring tool. I use nothing but that.
Same here: Nginx logs + GoAccess. I occasionally log in to the server and generate the GoAccess HTML to check the traffic. I also anonymize IPs in Nginx - that way I don't have to deal with GDPR. No cookies, JS, third-party images, etc. I really see no need for anything else for a purely content-based website. Then again, the website is purely non-commercial for now.
"We do not access or use your content for any purpose without your consent. We never use your content or derive information from it for marketing or advertising"
I think it is incredibly fair to say that you shouldn't put data onto the servers of a company if you mind them reading that data. Why should I assume they aren't? At a bare minimum, the metadata is being actively consumed by the parent company. I don't think this is AWS specific at all.
I would love to hear one case where that has happened. Information is free money, something virtually no company making money on information turns down.
You’re mixing things up. A corporation is not opposed to corporate profits, and in this case they make profits by safely, securely, and privately storing your data as requested.
Not doing so would be a major hit to their business. Whether you trust their reputation and compliance certifications is up to you, but a completely different issue.
I'm not accusing Amazon or anyone else of anything in particular. I'm just reminding people that even if you encrypt all of your data at rest, that at a minimum, every service provider of something like AWS is collecting plenty of metadata about your data. They have to. How else can they tell where your data is and track how much they need to charge you?
> How else can they tell where your data is and track how much they need to charge you?
Are you being serious with this question? Regardless of whether they look at your data behind the scenes without telling you, they don't need to know exactly what data and metadata you have in order to charge you. My AWS bill charges me for CPU hours of server uptime, network requests, load balancing, etc. None of this requires them to know your data or even much metadata (other than which domain and server to route to, obviously, but that's public information anyway).
By your logic, if someone is using end to end encryption, are you saying Amazon wouldn't be able to charge them because they can't look at the encrypted data?
I was thinking more about S3, where I know for a fact that they charge on a per-request and per-volume basis. Amazon can and should absolutely charge irrespective of the contents of the blob of data, and regardless of whether they read it or not; my point is, encrypt it if you're worried.
Oh yea, I totally agree with the "encrypt it if you're worried" part. In my opinion, if you have any user sensitive info, you should absolutely use end to end encryption and ssl etc.
> How else can they tell where your data is and track how much they need to charge you?
You say companies need to track metadata for pricing and don't know where your data is without tracking? Huh, that answer is surreal. I thought you had a console to set everything up and pricing is based on cost plus desired profit.
AWS runs at something like a 33% profit margin. None of their pricing is based on GA :p (like, wtf)
I agree with that; something like that could happen. The question is, how do we stop it? This reminds me of when Facebook owned Parse.com and a lot of developers used Parse for storing user data. Given Facebook's track record with privacy, I wouldn't be surprised if they used it.
My personal stance is to just make it harder. Encrypt the data, and use your own server so there's less metadata. Of course, even then, if the data center company really wanted to, was happy to break the law, and went deep enough into the attack, they could probably find a way to get to the data. But it's much harder and much more illegal, and even they will think longer before doing that versus "just copy the data and don't tell anyone" when it's sitting in plain text on their machine.
It's more work to do it this way and forsake the convenience provided by Amazon & Co, but I prefer it. Plus I learn lots of things while doing it which I would've outsourced otherwise.
There's also the problem that Amazon might stay 100% truthful and trustworthy and never access my data. But their employees might not. The NSA's surveillance data was misused by employees to stalk romantic interests, so there's no reason to believe that an Amazon employee couldn't wrongfully do the same. They might be fired for it, but the damage is done, privacy has been compromised.
This looks like a solution in need of a problem. I previously tried similar things at scale, mainly because I wanted unsampled reports and GA charges $100,000 for premium services. What I found is that raw logs are not reliably accurate: the volume of traffic in certain countries was accurate but not reliably "real" users, even after accounting for known bots, search engines, etc. GA has a way of accounting for these and giving you a better overall picture. The second thing is that GA has improved its service, so you now get unsampled reports even at scale for tier 1 reports. At low volume, they're unsampled already.
I’m not sure why anyone would want to waste time with this.
I prefer the raw logs. One reason is that Google performs "statistical improvement" like "sessionizing." This destroys the value of point-process data; you wouldn't find the operation in a stats textbook because it throws away valuable information. Also, the concept of Visitors and Users that GA uses isn't transparent to me.
Another fun fact: since the tracking happens on the client side, there's potentially a ton of truncated data that GA simply misses. Backend server instrumentation doesn't suffer the same way.
I think those are fair points; mileage will depend on your end goals. We want to know how our traffic relates to real-world ad deliverability, real users in our funnel, etc. I'd agree that "the concept of Visitors and Users that GA uses isn't transparent to me," but I'd add that whatever they do is more accurate than what we could get from raw logs as it relates to relatable business metrics.
Taking out the tracking code snippet increases your Google PageSpeed score. That makes sense, but I was always amused that Google was basically taking points away for using their own analytics on your site.
I struggle with this one. We have been on the Google roller coaster a few times now, where for no apparent reason our traffic goes up or down wildly. From this I have developed a (likely irrational) fear that messing with things, by removing Google Analytics or even adding a competitor like Hotjar, will put us back on the roller coaster.
I'm using Ahoy (https://github.com/ankane/ahoy) with my Rails applications and I'm very happy with it. Geocoding via the IP is included, and the thing that matters for me is being able to set an additional conversion flag on the visit.
I used to spend a lot of time looking at logs, trying to make pages better and adding more content based upon the stats. Then Google stopped passing keywords and search phrases; since then I've found stats of little use most of the time.
I wish Google would make search-keyword hiding opt-in for users, and perhaps automatically opted in when using incognito mode. I am sure most of my visitors would be glad to provide their search phrases, knowing that it helps us make more things and make them better. But Google does not let them opt in to sharing; they are all basically opted out.
I think it's a great ability - and a great option. I'd like this to be optional with sites and browsers. Give users the ability to change these settings.
Give websites a way to say: thanks for visiting, we noticed you are using a browser or search portal that strips info from us; would you please click to enable sharing this small bit of info? (More about how we use it, and what info, here.)
Something like this could help sites and users, and I'd like to toggle it myself. I like how Startpage scrambles URL queries, but I would turn it off for some sites, whitelisting them like some are with uBlock, etc. I also don't like how p-hub and some others keep queries in the URL, and would like an option to scramble them - via the site, browser settings, proxies, whatever it takes - to give more options and more choice.
Does GSC show the search terms used to find each page these days? It's been so long since I used it, I'm not sure if it's changed or if I never knew where to look.
Way back in the day, awstats, Webalizer, and similar server-side stats tools would show which keywords were searched as totals, and which pages were found by each set of keywords (with totals for each page/key phrase) - this info was valuable. I'm not aware of any way to get that info since the Google changes some time ago.
Another alternative for more privacy with GA is to proxy all requests to GA via your own simple proxy server (analytics.yoursite.com) and drop the last bytes of the visitor IP when proxying.
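The IP-truncation step can be sketched in a few lines. A hedged illustration in Python, assuming the proxy is something you can add a hook to (the function name is my own; zeroing the last IPv4 octet or the IPv6 host bits is the same idea as Google Analytics' own IP-anonymization option):

```python
import ipaddress

def anonymize_ip(ip: str) -> str:
    """Zero the host portion of an address before logging or forwarding:
    keep a /24 for IPv4, a /48 for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)
```

Calling `anonymize_ip("203.0.113.42")` yields `"203.0.113.0"`, which is still useful for rough geolocation but no longer identifies an individual visitor.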
Nice post! It's always fun reading about people being creative and challenging the analytics status quo (aka GA). Besides the joy of doing it yourself, you've accomplished a couple other things worth mentioning:
1. You'll never be sampled. GA samples historical data pretty heavily, and you have to pay for 360 to retain unsampled event data (to the tune of $160k+ per year).
2. You have full access to all generated data.
I'd highly recommend using Snowplow's javascript tracker (https://github.com/snowplow/snowplow-javascript-tracker) in a very similar manner to what you've outlined here. You'll get a ton of extra functionality out of the box, which would add yet another level of insight. With snowplow, you get the following for free:
1. Sessionization, which is consistent with google analytics' definition - effectively a 30 minute window of activity.
2. User identification - the tracker drops a persistent cookie (just like GA), so you can see returning visitors.
7. Ability to make your event tracking 100% first-party
(Disclaimer: I don't work for them, but I've seen the system work very well a number of times.)
I'm running a similar setup on my blog, and it costs well under $1 per month: https://bostata.com/client-side-instrumentation-for-under-on.... I'm doing the same exact thing with Cloudfront log forwarding and have several lambdas that process the files in S3. From there, I visualize traffic stats with AWS Athena (but retain a ton of flexibility, since they are all structured log files).
Yeah, good question. I wanted to do that, but GoAccess is a web server log parser and doesn't support custom fields (you don't get screen resolution via web logs, so it kinda makes sense). See: https://goaccess.io/man
I could probably hack it and overload different HTTP status codes to mean different screen sizes or something, but I didn't consider device size to be important for me. GoAccess does break down the User-Agent into OS, so I can see mobile usage via the "iOS" and "Android" OS usage. Breakdown for my site: Windows 24%, iOS 22%, macOS 19%, Android 20%, Linux 11%, other 4%. So mobile usage is probably about 45%.
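The OS breakdown GoAccess gives can be approximated from the User-Agent alone. A rough sketch (simple substring matching like this is crude; real parsers use full UA grammars, so treat the function as illustrative):

```python
def detect_os(user_agent: str) -> str:
    """Very rough OS bucket derived from a User-Agent string."""
    ua = user_agent.lower()
    # Order matters: Android UAs also contain "linux",
    # and iPhone UAs contain "mac os x".
    if "android" in ua:
        return "Android"
    if "iphone" in ua or "ipad" in ua:
        return "iOS"
    if "windows" in ua:
        return "Windows"
    if "mac os" in ua or "macintosh" in ua:
        return "macOS"
    if "linux" in ua:
        return "Linux"
    return "other"
```

Running each logged User-Agent through this and counting the Android and iOS buckets gives the same mobile-share estimate described above.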
If you're using JavaScript you could make a request in the background with a bunch of extra HTTP headers, like X-Screen-Resolution, and then have your web server log them.
I was thinking about switching to log-based user tracking some time ago. Not because of big-brother issues; rather, my intention was to remove the cookie-banner nonsense required by the EU. No cookies, no banner required, right? There are surely some downsides, but at the current stage of our analytics, logs should hold enough information for the analytics we need.
It is always possible to overcome ad blocker blocking with JS trackers if you are self-hosting and have the option to modify strings in the JS SDK - there are several ways to do this. You can also get more data with Countly, Matomo, or Fathom without using direct server logs.
That's because it's getting harder and harder to get that info from the referrer. Better to use Google Search Console and Bing Webmaster Tools; those will give you information even on search queries that listed your pages but didn't lead to a click.
Once upon a time you could get this information from the HTTP_REFERER header, since search terms were essentially encoded in the query string portion of the URL the search results showed up on.
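When referrers did carry the search URL, extracting the terms was simple query-string parsing. A sketch of that old-school approach in Python (the parameter names `q`/`query`/`p` reflect the engines of that era; treat them as illustrative):

```python
from urllib.parse import urlparse, parse_qs

def search_terms(referer: str):
    """Extract the search query from an old-style search-engine referrer,
    or return None if no query parameter is present."""
    qs = parse_qs(urlparse(referer).query)
    for key in ("q", "query", "p"):  # Google/Bing used q, Yahoo used p
        if key in qs:
            return qs[key][0]
    return None
```

For example, `search_terms("https://www.google.com/search?q=server+log+analytics")` recovers the phrase the visitor searched for, which is exactly the data Google later stripped out.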
It was removed by Google: all search result clicks go through an intermediate URL that strips the keywords. You can get some information from the Google URL [1].
The real power of Google Analytics is not in the tracking code; it's in the front-end interface that any non-technical product manager or marketer can use. Tracking users is easy. Giving non-technical folks easy-to-use analysis tools is much harder.
The (lack of) speed and the complexity of GA astound me every time I go in (I use it across a portfolio of businesses). Using it on a 100Mb connection is still like pulling teeth.
I've researched building out a desktop app that pulls GA data over the API in the background so you can get key stats out much quicker, but it's quite an investment of time to be beholden to Google's platform.
Now I'm doing some dogfooding on a web analytics service I've been evolving that tries to answer the "why" of changes in traffic/behaviour over time ("traffic's up today... not sure why?"). Google does this with their GA mobile app ("Insights"), but what and when they show you doesn't seem too predictable.
> The ever changing google analytics dashboard is not that easy to use for non technical folks.
True, but easier to learn than custom programming and visualizations. A smart English major can still figure it out on their own. Giving product managers the tools to do in-depth analysis is a huge plus. Otherwise, they just ask devs to run reports all the time, burdening devs and slowing down analysis.
Yep, they redesigned it so much that I had to build a custom dashboard for a customer (think of it as a glorified counter from the 1990s), and I have to admit that even I get lost in Analytics and AdWords.
When I joined the startup I was at for a few years, the original GA had been set up for our CEO by somebody not especially technical. The CEO was also non-technical and could never get it to do anything he wanted, so somehow it ended up with me, the techie, having to make things work.
[1] https://docs.simpleanalytics.com/script
[2] https://simpleanalytics.com