Looking for a static proxy thing

proxy

#1

I really like CDN pull zones. The way they work is: the CDN pulls content from your origin as needed, then caches it. It is great because, say you have a website with a bunch of images, you just point them all at the CDN URL and it will grab each one from its normal location.

I do this on maiki.blog, so when a request hits https://cdn.maiki.blog/example.png, the CDN system will grab the image from https://maiki.blog/example.png.

A lot of the options are configurable, so you can set how long it caches (most places default to 24 hours), how to purge the cache, and which origin domain to point at.
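To make the pull-zone behavior concrete, here is a toy Python sketch of the logic; the fetcher, paths, and cache are all made up, not any real CDN's API:

```python
import time

# Hypothetical origin fetcher: in a real pull zone this would be an
# HTTP request to the origin server (e.g. https://maiki.blog/example.png).
def fetch_from_origin(path):
    return f"contents of {path}"

class PullZoneCache:
    """Toy model of a CDN pull zone: serve from cache when fresh,
    otherwise pull from the origin and cache the result."""

    def __init__(self, ttl_seconds=24 * 60 * 60):  # 24-hour default TTL
        self.ttl = ttl_seconds
        self.cache = {}  # path -> (fetched_at, content)

    def get(self, path):
        entry = self.cache.get(path)
        if entry is not None:
            fetched_at, content = entry
            if time.time() - fetched_at < self.ttl:
                return content  # cache hit, origin never contacted
        content = fetch_from_origin(path)  # cache miss: pull from origin
        self.cache[path] = (time.time(), content)
        return content

    def purge(self, path):
        # Manual purge, like a CDN's "purge cache" button.
        self.cache.pop(path, None)
```

The nice property is visible right in the sketch: the origin is only contacted on a miss or after the TTL lapses, which is exactly what makes pull zones so low-maintenance.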

I really like this, but I want to do something else: I want to proxy data from multiple sources, for my build script.

Hugo has the ability to consume data for data templates, and I want to pull in a bunch, but once I start pinging three or four sources, I have to be aware of how often I am requesting data, how often I rebuild the site, yadda yadda. I don’t want network connectivity to multiple nodes to be a requirement to deploy.

I’ve thought of two ways:

  1. Build a repo that is just data (JSON files or whatever), and load it as a sub-module at build time for the site, and that works. But it means I need a method to periodically pull down the latest data and commit it to the repo. Would be neat!
  2. Use a CDN-like proxy to cache the data docs, giving me a single online point to pull data from.
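For option 1, the periodic pull could be a small script that a cron job (or a GitLab CI schedule) runs before committing the repo; a Python sketch, with made-up source names and URLs:

```python
import json
from pathlib import Path
from urllib.request import urlopen

# Hypothetical source list: names and URLs are placeholders.
SOURCES = {
    "example": "https://example.com/api/data.json",
}

def fetch_json(url):
    with urlopen(url) as resp:
        return json.load(resp)

def snapshot_sources(sources, data_dir, fetch=fetch_json):
    """Fetch each source and write it as a JSON file into the data
    directory. A cron job could run this, then commit and push."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, url in sources.items():
        payload = fetch(url)
        (data_dir / f"{name}.json").write_text(json.dumps(payload, indent=2))
```

The build then just loads the committed files from the data directory, so deploys work even when the upstream sources are unreachable.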

I am not sure how to approach either of these. I mean, I guess I could cron job a build process for each data source if I were collecting them in a repo. GitLab can probably do everything for me, with a very complex .gitlab-ci.yml.

But if I had a system where I could just feed it URLs and optionally an auth method, and then have my personal access ready, that is preferable! I also don’t need real-time feedback, so these data sources would be updated on 24 hour or 7 day cycles. A slow web approach to gathering data in one place.
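If I built that feed-it-URLs system myself, the heart of it is just a per-source refresh schedule. A minimal sketch; the URLs, tokens, and intervals below are all placeholders:

```python
import time

# Hypothetical registry: URLs, auth headers, and cycles are placeholders.
SOURCES = [
    {"url": "https://example.com/feed.json",
     "headers": {"Authorization": "Bearer TOKEN"},   # per-source auth
     "refresh_seconds": 24 * 60 * 60},               # daily cycle
    {"url": "https://example.org/stats.json",
     "headers": {},                                  # public, no auth
     "refresh_seconds": 7 * 24 * 60 * 60},           # weekly cycle
]

def due_for_refresh(source, last_fetched, now=None):
    """True when a source's slow-web cycle says it is time to re-fetch."""
    now = time.time() if now is None else now
    if last_fetched is None:  # never fetched before
        return True
    return now - last_fetched >= source["refresh_seconds"]
```

Everything not yet due just gets skipped, so the whole thing can run as an hourly cron job without hammering anyone.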

I’ve started searching through the projects listed at GitHub - Kickball/awesome-selfhosted, but nothing has popped out at me. Are there one or more projects I could use to put this together? :slight_smile:


#2

I wonder if I could pull this off with just nginx pointing to various external sources. However, I think some of them require auth, so not sure that is gonna work very well…


#3

The last few days have found me learning a lot about forward proxies, and surveying candidates that could do what I want.

After talking it over with @tim, I’ve decided to do a weird thing that maiki does: use WordPress!

Why? Well, I needed to do the research to understand this, but at the end of the day, while I do intend to use maybe a dozen or so sources, I don’t want to run a separate service for this.

Okay, that is a bit lazy! There are some other reasons, too:

  • Data doesn’t feel static, and therefore feels like it should not be in version control next to my content.
  • I already do this with WordPress (see below).
  • This is slow: I am not updating my data sources more than once a day, maybe once a week. A new service would run all the time, doing nothing.
  • If I process data external to my build script, I can make deployments faster by prepping my content for the build (also explained below)

I already do this with WordPress

So I already pull some data into WordPress, and during the import I operate on it: creating new records, updating existing ones, or just storing it as-is. Then I expose my own APIs in WordPress, which are queryable!

That means for my purposes, if I just want a couple of datasets to parse, I don’t have to pull down everything into my Hugo data directory and sort/filter results in a data template. Instead, I can craft a query that gives me the exact data I need, speeding up template development and deployment.
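For example, the build script only needs to assemble the right query URL; `/wp-json/` is WordPress’s standard REST base, though the exact routes and parameters would depend on my own endpoints:

```python
from urllib.parse import urlencode

def wp_query_url(site, route, **params):
    """Build a WordPress REST API query URL.

    /wp-json/ is WordPress's standard REST base; custom routes
    (and these parameter names) are site-specific assumptions."""
    base = f"{site}/wp-json/{route}"
    if not params:
        return base
    # Sort for a stable URL, which also makes CDN caching friendlier.
    return f"{base}?{urlencode(sorted(params.items()))}"
```

Usage would look like `wp_query_url("https://example.com", "wp/v2/posts", per_page=5, orderby="date")`, and the build script just fetches that URL and feeds the JSON to a data template.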

Also, nearly every WordPress site I host has a CDN serving as much as possible, so this level of caching is fine for me: my “data engine” can slurp, operate, spit out whatever, but as far as my build script is concerned, it is pulling down a handful of JSON files from a CDN nearest it.

Why didn’t I just do that from the beginning?

Briefly:

  • I sometimes fear I use WordPress for everything because I can, and there is a lot out there that does some things better.
  • I want to learn more systems, stay relevant across more domains.
  • I am super insecure about using APIs. Let me say it plainly: I don’t understand half of what programmers say, but they act like APIs are just this thing that everyone gets. I feel like I am constantly playing catchup.

There may come a time when I decide I’d rather do this a different way. But when that happens, I’ll have solid requirements based on prior usage.


#4

I filed the above comment and moved on to another project, but it may be related: I am looking into creating a geographic search engine, specifically one focused on Oakland, CA.

I found a neat combo that I realized could also be used for this: Scrapy + Elasticsearch.

My brain hadn’t considered just using a web crawler for caching/indexing API responses, but of course that is viable!
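For the indexing side, Elasticsearch’s newline-delimited `_bulk` format is simple enough to generate straight from a crawl; a sketch, where the index and document shape are made up:

```python
import json

def es_bulk_lines(index, doc_id, document):
    """Format one crawled document as an Elasticsearch _bulk API pair:
    a JSON action line followed by the JSON document source line.
    The index name and document fields here are placeholders."""
    action = {"index": {"_index": index, "_id": doc_id}}
    return json.dumps(action) + "\n" + json.dumps(document) + "\n"
```

A crawler callback would concatenate these pairs and POST the result to the cluster’s `_bulk` endpoint, so each crawl pass doubles as a reindex.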


#5

Hmmm, I’ve been sitting on an idea to rate websites by an arbitrary set of criteria, mostly centered around tracking and surveillance. If I scrape sites this way, I could use an item pipeline component not just to save each page to the database, but also to check for externally linked resources that might load on a given page.
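A rough sketch of what that pipeline component might look like, in plain Python; a real Scrapy pipeline has the same `process_item` shape, and the item fields here are invented:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ExternalResourceFinder(HTMLParser):
    """Collect src/href hosts that differ from the page's own host."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.external = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                if host and host != self.page_host:
                    self.external.add(host)

class TrackerCheckPipeline:
    """Scrapy-style item pipeline sketch: besides passing the item
    along for storage, record which third-party hosts the page
    would load resources from (a rough tracking signal)."""

    def process_item(self, item, spider=None):
        finder = ExternalResourceFinder(urlparse(item["url"]).netloc)
        finder.feed(item["html"])
        item["external_hosts"] = sorted(finder.external)
        return item
```

That list of third-party hosts could then feed whatever scoring rubric I settle on for the tracking-and-surveillance ratings.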