optimizing renders on massive lists with React.memo

A common task I use React for is rendering large datasets in the UI. For example, a large list of movies or books. Here’s a simple component that renders a list of movies.

import React from 'react';

const MovieList = ({ movies }) => {
  return (
    <ul>
      {movies.map((movie) => (
        <li key={movie.id}>
          <strong>{movie.name}</strong>: {movie.description}
        </li>
      ))}
    </ul>
  );
};

export default MovieList;

As long as you’re using a unique `key` attribute in this case, renders are pretty fast. In the example above, only simple list items are being rendered for each movie. It’s a small set of elements and there’s no complex state involved.

But what if…

  1. The individual movie items are a lot more expensive to render and contain a lot of state
  2. The state cannot be localized to the smaller child components and needs to live at the root level (because it’s shared with other components)

Here’s an example where we have a parent level state that maintains rating data and passes that down to render both the movie list and a sibling recommendations component.

import React, { useState } from 'react';

const ExpensiveMovieItem = ({ movie, movieRating, setMovieRatings}) => {
   return (
     ... very expensive render ...
   );
}

const MovieList = ({ movies }) => {
  const [movieRatings, setMovieRatings] = useState({})
  return (
    <div>
      <ul>
        {movies.map((movie) => (
          <ExpensiveMovieItem key={movie.id} movie={movie} movieRating={movieRatings[movie.id]} setMovieRatings={setMovieRatings}/>
        ))}
      </ul>
      <Recommendations movieRatings={movieRatings}/>
    </div>
  );
};

export default MovieList;

In this case, if setMovieRatings gets called in any of the child ExpensiveMovieItem components, the parent state will update and all of its children will re-render (even if the props for the majority of components in the list stay the same). One common misunderstanding I had for a long time is that when props stay the same, a component does not re-render. In reality, any time a parent UI component’s state changes as a result of setState, all of its descendants re-render.

If this is a large list (1000+ items), this re-render can create noticeable lag. In this case, if the re-render takes 200ms, it’ll take 200ms between when a rating for a movie is updated and when it’s actually reflected in the UI. Since React does not skip re-renders automatically based on props, it’s up to you to tell it when to skip a full re-render for a component.

React.memo

React.memo is a function that accepts a component (and an optional prop comparison function) and returns another component. This new component has special memoization behavior that skips re-renders based on either the built-in or a user-provided prop check.

Going back to the original example, here’s how you turn a normal expensive component into a memoized one:

import React, { useState } from 'react';

const MemoizedExpensiveMovieItem = React.memo(({ movie, movieRating, setMovieRatings}) => {
   return (
     ... very expensive render ...
   );
});

const MovieList = ({ movies }) => {
  const [movieRatings, setMovieRatings] = useState({})
  return (
    <div>
      <ul>
        {movies.map((movie) => (
          <MemoizedExpensiveMovieItem key={movie.id} movie={movie} movieRating={movieRatings[movie.id]} setMovieRatings={setMovieRatings}/>
        ))}
      </ul>
      <Recommendations movieRatings={movieRatings}/>
    </div>
  );
};

export default MovieList;

Now if you update the parent state, only the children with changed props will re-render. This technique works out of the box if all of your props are primitives (strings, numbers), but you’ll have to be more careful if you have objects because the default comparison uses Object.is, and it’s pretty common for the identity of objects to change across re-renders in React even if the underlying values are the same. For example, if you’re re-creating functions that are being passed in as props, you’ll cause a re-render. The same goes for object cloning in setState, which creates new objects with the same values but different identities. You can get around these issues by either simplifying the props or providing a custom comparison function.
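For illustration, here’s a rough sketch of what a custom comparison function could look like for the movie item (which props matter here is just an assumption – note that a useState setter like setMovieRatings is stable across renders, so it’s safe to leave out of the check):

const MemoizedExpensiveMovieItem = React.memo(
  ({ movie, movieRating, setMovieRatings }) => {
    return (
      ... very expensive render ...
    );
  },
  // return true to skip the re-render: only re-render when the movie or its rating changes
  (prevProps, nextProps) =>
    prevProps.movie.id === nextProps.movie.id &&
    prevProps.movieRating === nextProps.movieRating
);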

how big can you make a JWT?

i’ve had a lot of JWT related discussions at work lately and today I wondered how big is too big for a JWT to fit through an HTTP header. The HTTP spec doesn’t really impose a limit, but most servers do set limits that range between 8K and 16K bytes.

I figured I could whip up a quick JWT generator to get a rough sense of how big JWTs can get!

for simplicity I made the key value pairs small strings (these will vary in real life of course) and defined a byte limit of 8K. Also to save battery I increased the key counts exponentially 😀

ok here’s the script. can you guess what the key limit is using back of napkin calc?

require "jwt"
byte_limit = 8000
bytesize = 0
key_count = 1
rsa_private = OpenSSL::PKey::RSA.generate 2048
while bytesize < byte_limit
  payload = {}
  key_count.times do |i|
    payload["foo_#{i.to_s}"] = "bar"
  end
  token = JWT.encode payload, rsa_private, 'RS256'
  bytesize = token.bytesize
  puts "bytesize: #{bytesize}, key_count: #{key_count}"
  key_count *= 2
end

And here’s the output:

bytesize: 384, key_count: 1
bytesize: 403, key_count: 2
bytesize: 440, key_count: 4
bytesize: 515, key_count: 8
bytesize: 672, key_count: 16
bytesize: 992, key_count: 32
bytesize: 1632, key_count: 64
bytesize: 2950, key_count: 128
bytesize: 5680, key_count: 256
bytesize: 11142, key_count: 512

so with 512 keys we exceeded 8K bytes ~
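back of napkin: each extra key adds roughly "foo_123":"bar", which is about 16 characters of JSON, and base64url encoding inflates that by roughly 4/3 to ~21 bytes per key (that matches the marginal growth in the output: (11142 − 5680) / 256 ≈ 21 bytes per key). so an 8K limit works out to roughly 8000 / 21 ≈ 380 keys of this size – consistent with crossing the limit somewhere between 256 and 512 keys.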

what’s the deal with third party cookies phase out?

chrome is planning on phasing out support for third party cookies in 2024.

Third-party cookies, also known as cross-site cookies, are cookies set by a website other than the one you are currently on. For example, cnn.com might have a Facebook like button on their site. The like button will set a cookie that can be read by Facebook.

https://support.mozilla.org/en-US/kb/third-party-cookies-firefox-tracking-protection

these cookies are a big deal for big tech because they’re the primary enabler of web tracking and advertising. if you work for a tech company that relies on digital ads, chances are they depend on third party cookies for those ads to follow you around the web.

cookie basics

when a browser issues a request to example.com, example.com can set a cookie on the browser under the domain example.com by responding to the request with a set-cookie response header.

once the cookie is saved, if the browser makes additional requests to example.com again in the future, the cookie under the matching domain will be forwarded back to the server.

because HTTP is a stateless protocol, it’s this cookie behavior that allows websites to “remember” their clients.
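for illustration, a first exchange with example.com might look roughly like this (the cookie name and value are made up):

GET / HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Set-Cookie: session_id=abc123; Path=/; HttpOnly

and any later request to example.com automatically carries the cookie back:

GET /account HTTP/1.1
Host: example.com
Cookie: session_id=abc123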

now, whether a cookie is first party or third party depends on the domain you’re currently on. if the cookie’s domain is the same as the domain you’re currently on (in your browser’s address bar), it’s a first party cookie. the cookie belongs to the domain. every other cookie set by other domains is third party. so whether a cookie is first party or not depends on two things:

  1. The domain of the cookie
  2. The current domain of the page

if i’m on example.com, all the cookies set under example.com are first party. the browser may contain other cookies, saved under other domains like yahoo.com or wikipedia.org. those are all third party!

so from a user perspective, a cookie isn’t first party in an absolute sense. if the user visits a different site under a different domain, the same cookie is no longer considered first party. so again, the site you’re currently on according to the browser determines whether a cookie is considered first party or third party.

issue with third party cookies

so remember how a website can set cookies under its own domain? well, turns out they can also indirectly set cookies under other (third party) domains! all thanks to the magic of HTML.

HTML documents can contain many references / links to resources on other domains. every <img src...> or <link ...> or <script src...> is basically a GET request to another server that may live under a different domain than the one you’re currently on!
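for example, a page might embed something like this (made-up ad network domain):

<img src="https://ads.example-network.com/pixel.gif" width="1" height="1">

rendering that single tag fires a GET request to ads.example-network.com, even though you never typed that domain into the address bar.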

those third party servers, e.g. myspace.com and xanga.com, can also set cookies under their own domains. the browser doesn’t really make a distinction between a request you made directly by typing example.com into the address bar and ones that are fired as a result of HTML rendering.

as a result, the same website you’re visiting transmits its own cookies AND cookies from other websites. some of those websites may even be… ad networks that set cookies for purposes of analytics and tracking.

this basic auto storage and transmission behavior of cookies combined with the hyperlinking nature of web sites is at the core of how ad tech works.

if you’re an advertiser, third party cookie data allows you to learn about visitor behaviors, such as websites they frequently visit and recent purchases, and target them with ads.

real world scenario time

here’s a real example that i’ve researched and verified myself.

hypothetically speaking, let’s say bob is researching a rice cooker on amazon.com. bob sees one he likes, but feels overwhelmed, so he gives up and decides to go bake cookies instead.

bob then visits a site with a cookie recipe and sees an advertisement for the same rice cooker he was looking at earlier. since bob’s not on an amazon-owned site but is seeing amazon related ads, this advertisement was triggered by a transmission of third-party cookie data.

the third party domain in this case is amazon’s advertising network. amazon has an advertising network similar to Google’s doubleclick.

here’s an example of a real cookie that’s set by the ad network when you visit amazon.com:

Set-Cookie:

ad-privacy=0; Domain=.amazon-adsystem.com; Expires=Sun, 01-Oct-2028 14:16:44 GMT; Path=/; Secure; HttpOnly; SameSite=None

when a page on amazon loads, it makes additional requests to the amazon ad network. the ad network sets this very long-lived cookie (at the time of writing, that expiry date is more than 4 years into the future) that will be used to track your behavior. for the next 4 years, this cookie data will be forwarded along with any request to the ad network’s domain .amazon-adsystem.com.

a quick aside about ad networks

there are three primary groups involved in an ad network:

  1. sellers that have something to sell (companies selling rice cookers on amazon.com) and want their products to be advertised as widely and cheaply as possible
  2. those who want to make money by showing ads (content creators, bloggers, etc). amazon calls those who want to show ads through their web content “associates”.
  3. consumers / buyers / site visitors

the value of an ad network is proportional to the size of these groups. if there are no consumers or buyers, there’s nobody to sell to. if there are no content creators / youtubers, ads are limited in their reach. if there are no sellers… well, then there are probably no ads either!

back to bob

now that the third party cookie from the ad network is set, when bob visits a recipe site that’s also part of the ad network, the page loads a third party tracking script (also known as a pixel) from .amazon-adsystem.com. that request carries the cookie that was set earlier, the ad network looks up the user by ID on amazon’s servers, and it serves a targeted ad based on the information the ad network has on the user identified by the cookie.

over time, these scripts loaded by sites that are affiliated with the ad network continue to track bob’s behavior through the long lived cookie for as long as they’re loaded, and transmit the data back to the ad network. this builds a rich representation of bob as a consumer for re-targeting purposes.

with the “end” of third party cookies, this basic mechanism is threatened. the recipe site will be restricted to transmitting only first party (its own) cookies. cookies previously set by an ad network (from a user’s visit to amazon.com properties) will not be transmitted behind the scenes for ad retargeting.

i should mention that most browsers do give you the option now of disabling third party cookies. yes, even chrome – if you dig into chrome’s settings you’ll see that, as you would expect, third party cookies are currently allowed by default. i recommend turning that off.

so no more ads following me around in future?

while this looks like a positive direction for data privacy, it’s definitely not the end for ad retargeting. remember that big tech makes an ungodly amount of money from their advertising platforms and it’s not in their interest to kill their golden goose.

the current google-led proposal to replace third party cookies in chrome is called the privacy sandbox initiative. a major part of this initiative is to offer users more control over how their data gets shared. i’m going to be reading more on this in the future and will write about it in another post.

stateful vs stateless JWT’s

JSON Web Tokens (JWTs) are cryptographically signed JSON objects. The crypto signing is what provides the trust guarantees, since consumers of a JWT can verify the signature using a public key. There are two types of JWTs: stateful and stateless.

Stateless JWTs are probably the most common kind. All the information needed by the application about an entity is contained in the JSON (username, role, email, etc). When a stateless JWT is transmitted in a request, either via cookie or header, the application base64-decodes the JWT, verifies the signature, and takes some sort of action.
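As an illustration, the decoded payload of a stateless JWT might look something like this (the claim names beyond registered ones like sub, iat, and exp are made up):

{
  "sub": "user_123",
  "email": "jane@example.com",
  "role": "admin",
  "iat": 1700000000,
  "exp": 1700003600
}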

Stateful JWTs contain a reference to information about an entity that is stored on the server. This reference is typically a session ID that references a session record in storage. When this JWT is transmitted to the backend, the backend performs a lookup in storage to get the actual data about the entity.

There’s almost no good reason to use stateful JWTs because they are inferior in almost every way to regular session tokens:

  • Both session tokens (a simple key-value string pair) and JWTs can be signed and verified by the server, but that whole process is simpler (and more battle-tested) with session key-value pairs compared to JWTs, which involve an additional decoding step as well as plucking keys out of the decoded JSON for the verification step. Since most applications rely on third party JWT libraries to handle this (and they vary greatly in their implementation and security), this further increases the vulnerability of JWTs.
  • Encoding a simple string in JSON adds additional space usage. There’s nothing to be gained from this extra space if you’re just transmitting a single value.

Stateless JWTs, from my experience, are most commonly used in service-oriented and microservice architectures where there’s some collection of backend services and frontend clients. There’s typically a central authentication service that talks to a database with user information. Client requests from the frontend are authenticated with this service and receive a JWT in return. The clients may also need to communicate with backend services that are behind access control, so they pass along the stateless JWT as a bearer token in an HTTP header. The backend service verifies the JWT and uses the claims information to make an authz decision.

The reason this is a popular flow is that teams are able to act on the JWT without having to perform an additional lookup about the entity at a user service. They do have to verify the JWT, but they don’t need to maintain any additional state about the user or communicate with another service. This level of trust only works because JWTs are digitally signed by an issuing party (in this case, the auth server). If clients were just passing plain JSON, there would be no way for services to verify the integrity of the information (has it been tampered with?) or the source (how can i be sure it’s issued by the auth server?).

jwt as session

One common argument against stateless JWTs concerns their use as sessions. The application basically offloads the session expiry mechanism entirely to the JWT. The two strongest arguments i know of against using JWTs as sessions are around data freshness and invalidation.

If you’re using a stateless JWT as a session, the only way to really expire it is to set a new JWT on the client with an expiry in the past, or to change the issuing key (which invalidates all sessions). Fully client-side cookie sessions have similar limitations around invalidation. Data freshness is a related issue – when invalidation is hard, so is updating. If you need to revoke permissions, you can’t really do that without updating the JWT, and you can’t do that either if you’re relying on the client-side state of a JWT for session management.

why can’t you tamper with a JWT?

jwt tokens are a very popular way of transmitting claims information between systems. They’re typically based on a public key system, so the claims can be verified and the verifier can be confident that the claim was issued by a trusted entity.

microservice architectures will commonly use the claims to perform access control. For example, the claim may contain a user’s ID and their roles. This information can then be used to allow or deny access to resources.

One question that inevitably comes up when implementing JWT flows is:

How can I be sure that this JWT isn’t fake? How do I know it’s not tampered with??

if you don’t verify the signature, you really can’t be sure. JWT tokens contain a “signature” which is the output of a signing algorithm such as RS256 (RSA with SHA-256). The issuer of the token will hash the header and payload of the JWT using a one-way hash. This hashed output is then encrypted using the issuer’s private key, and the final output gets stored inside the token. So what gets stored is an encrypted signature. If anything about the contents of the JWT changes, the signature will change.

on the receiving side, the only way to trust the token is to verify the signature. First, the signature in the token needs to be decrypted using the issuer’s public key (this is usually made available by the issuer). If you can successfully decrypt this value, you can be confident that the token was issued by the trusted party. However, at this point you haven’t verified whether the contents have been tampered with / changed.

to verify the integrity of the actual payload, you need to perform the same hash on the header and payload and compare the hashed output to the decrypted signature. If they match, you now have confidence that the claims were not tampered with! So there are two levels of verification that happen. The first is the successful decryption of the signature. If decryption fails, the claim must not have been issued by the trusted party. For example, if I generated a JWT using some random private key, its signature can only be decrypted by the corresponding public key. If I don’t share that public key with another party, they cannot trust me. So if a service is unable to decrypt the signature using the public key it has, it cannot establish trust.

by the same token (hehe), if the hash comparison fails, it’s possible that the token was issued by a trusted party but the contents of the JWT changed or do not match what was used to generate the original signature. This is a sign of tampering – either by another party, or even by accident by the JWT consuming service (perhaps there’s a bug in the signature verification code).
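here’s a rough sketch of what that looks like on the receiving side using the jsonwebtoken npm library (assuming an RS256 token and that you already have the issuer’s public key – the library performs both the signature check and the hash comparison for you):

const jwt = require('jsonwebtoken');

function verifyToken(token, publicKey) {
  try {
    // throws if the signature doesn't verify against the public key (or the token is expired)
    const claims = jwt.verify(token, publicKey, { algorithms: ['RS256'] });
    return claims; // e.g. { sub: 'user_123', role: 'admin', ... }
  } catch (err) {
    // wrong issuer key, tampered contents, expired token, etc.
    return null;
  }
}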

what the reflected XSS

reflected XSS attacks are a common way of tricking a user’s browser agent into executing malicious code. I’ll share one definition I found from mozilla and unpack the key terms / concepts.

When a user is tricked into clicking a malicious link, submitting a specially crafted form, or browsing to a malicious site, the injected code travels to the vulnerable website. The Web server reflects the injected script back to the user’s browser, such as in an error message, search result, or any other response that includes data sent to the server as part of the request. The browser executes the code because it assumes the response is from a “trusted” server which the user has already interacted with.

https://developer.mozilla.org/en-US/docs/Web/Security/Types_of_attacks#cross-site_scripting_xss

there’s a few key phrases here that are important to understanding XSS:

  • malicious link – this is the link sent by the attacker to a user of a web service. Let’s assume you’re the user and the service is my-bank.com. This link may look like “my-bank.com?alert=<script>… malicious code …</script>”, which contains code that the attacker wants to execute in your browser.
  • malicious site – this is a site owned by the attacker. A malicious site doesn’t need to exist for an attack to happen, but it’s one place an attacker can get you to submit details that they can use to construct a scripted attack. For example, let’s say they need your email address to perform an attack. The site might have a fake form that collects your email address and then redirects you to “my-bank.com” with an embedded URL script.
  • vulnerable website – this is the site that is vulnerable to XSS attacks. In general, that includes any site that doesn’t escape / sanitize inputs from the client. The problem with this is that it can lead to the browser agent executing user provided (via an attacker) javascript.
  • web server – this is the bank service’s backend service
  • “trusted” server – this is the bank server that returned the HTML containing malicious code that the users browser executed. Trusted is in quotes here because the server is returning javascript that it did not intend to. So it can’t really be trusted.

XSS attacks take advantage of:

  • A web service that liberally accepts user provided inputs (an attacker can replace a safe input with client code) and renders that input without sanitization (permitting arbitrary code execution via injected script tags) – see the sketch after this list
  • An established trust between the user agent and the web service. Any malicious code may execute in the context of an established session between the user and the service. For example, the user may be logged in and therefore all requests originating from the page (which will contain auth related cookies like session cookies) are trusted by the backend.
  • An unsuspecting user that blindly clicks a link (perhaps emailed to them) or fills out a form on a malicious website
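here’s a minimal sketch of what a vulnerable endpoint could look like (a made-up Node server – the point is just that the query param gets reflected into HTML without any escaping):

const { createServer } = require('node:http');

// hypothetical vulnerable endpoint: whatever is in the "q" query param gets
// reflected into the HTML response without escaping or sanitization
createServer((req, res) => {
  const q = new URL(req.url, 'http://localhost').searchParams.get('q') || '';
  res.writeHead(200, { 'Content-Type': 'text/html' });
  // a link like /search?q=<script>...steal cookies...</script> executes in the victim's browser
  res.end(`<h1>Search results for: ${q}</h1>`);
}).listen(3000);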

err how is this different from stored XSS?

the only difference between reflected XSS and stored XSS is that with stored XSS, the malicious code is actually stored on the vulnerable web service’s servers. For example, let’s say twitter is the vulnerable website and you’re a twitter user. Now let’s assume you’re following someone who’s an attacker, and they submit a tweet containing malicious code that they know will be executed by browser agents when it gets rendered, say, in the newsfeed of followers.

so here’s how it goes down – they submit the tweet containing script code. That tweet gets stored on the twitter servers (this is where the word “stored” comes from). That tweet then gets rendered in users’ news feeds, and when it does, the contents of the tweet get executed as javascript. Boom, that’s the XSS attack. Just like reflected XSS, this can be prevented by ensuring that user input (in this case, tweets) is sanitized.

Additional resources

  • https://www.stackhawk.com/blog/what-is-cross-site-scripting-xss/
  • https://developer.mozilla.org/en-US/docs/Web/Security/Types_of_attacks#cross-site_scripting_xss

most devs are not the audience for the CAP theorem

eric brewer presented CAP at a distributed computing conference in 2000 to designers of distributed systems, most of whom were familiar with relational databases and their consistency guarantees. His intention was to start a conversation about the trade-offs that need to be made between consistency and availability when designing early cloud-based storage systems that needed to be highly available. His main argument was that for cloud storage systems to be highly available, some level of consistency needed to be sacrificed.

most developers like myself are using distributed systems – not designing them. more importantly, i find that i’m often required to think about app data usage patterns and how the storage system supports those usage patterns at levels far more granular than what CAP offers. as a distributed systems user, CAP can basically be reduced to “there’s a trade-off between availability and consistency” – a concept that i find much easier to digest on its own.

more useful questions / concepts than CAP to grapple with are…

  • is replication synchronous or asynchronous? Is this adjustable?
  • In the event of node or network failures, what is the data recovery process like? what writes get rolled back?
  • is there support for transactions? what level of isolation is supported (are dirty reads possible?)
  • can I read my own writes?
  • how do I scale reads and writes as my system grows? what does the process of adding additional nodes to the system look like?
  • how are concurrency conflicts handled when there is contention over shared data?

mongoDB majority read concern is confusing AF

one misconception of mongo’s read concern: majority is that it guarantees read-your-write consistency by reading from a majority of nodes. this is a common misunderstanding because its counterpart, write concern: majority, requires acks from a majority of nodes.

… but that’s not at all what read concern does!

reads always get submitted to a single node using a server selection process that takes into account your read preference (primary, primaryPreferred, secondary, etc).

  • if you have a primary read preference, a read will always go to the primary.
  • if you have a secondary read preference, a read will get submitted to a single secondary. if you have two replicas, a read goes to one of them

when the read concern (not the same as read preference, great naming mongodb) is set to majority, that’s basically saying “only return data for this query that has been committed / successfully written to the majority of nodes”.

this does not mean that you’re always reading the latest write

to understand this, you need to understand mongo’s notion of a “majority commit”.

the majority commit value for any write is determined by the primary during the standard replication process. when data gets replicated to a secondary node, it’ll check with the primary whether to update its “majority commit” snapshot of the data. if that value has not been majority committed, the majority commit snapshot will keep its previous majority committed value.

that’s why it’s still possible to read stale values with read: majority. the majority commit snapshot on any given node is only updated for a particular value once that value has actually been replicated from the primary to a majority of the nodes. so the node you’re reading from may not have the majority-committed version of the data you literally just wrote. it’ll give you the previous majority committed value instead of the latest value that perhaps has not yet propagated to a majority of nodes.

errr so what’s the point even MONGO

what it does mean is that you can trust that the data you’re reading has a high level of durability because in the event of a failure the value you’re reading is unlikely to be rolled back since it’s been majority committed.

read your own writes for real

reading your own writes is a special case of causal consistency and while having a majority read concern is not sufficient, it is a necessary component of setting up read your own write consistency in mongo.

to achieve reading your own writes, you need to ensure the following mAgiCaL settings:

  1. operations are done inside a session with causal consistency enabled
  2. write concern is majority
  3. read concern is majority
  4. lol?

why do you need to set specific read and write concerns even though causal consistency is enabled? I’m not sure, but it’s an extremely confusing and misleading API

theoretically/formally speaking… causal consistency includes read your own writes consistency, but in MongoDB enabling causal consistency is not sufficient for reading your own writes!

if you do have these settings on, MongoDB will track operations with a global logical clock and your reads will block until it’s able to read the most recent majority committed write from the same session.
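here’s a rough sketch of what those settings look like with the Node.js driver (the collection and field names are made up; error handling omitted):

const { MongoClient } = require('mongodb');

async function readYourOwnWrite(uri) {
  const client = await MongoClient.connect(uri);
  const orders = client.db('shop').collection('orders');

  // 1. a causally consistent session
  const session = client.startSession({ causalConsistency: true });
  try {
    // 2. write concern majority
    await orders.insertOne(
      { item: 'rice cooker', qty: 1 },
      { session, writeConcern: { w: 'majority' } }
    );
    // 3. read concern majority – within this session, the read reflects the write above
    return await orders.findOne(
      { item: 'rice cooker' },
      { session, readConcern: { level: 'majority' } }
    );
  } finally {
    await session.endSession();
    await client.close();
  }
}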

without causal consistency enabled, a write may go to a majority of nodes but the read may still wind up returning non-majority-committed data from a node that does not have the write that just happened in the same session. the causal consistency session is what causes reads to block if they attempt to read a stale write.

wow this sounds painful what are my other options

situations involving multi-document operations are most likely going to bring up requirements around read/write consistency. the best way to skip all that is to use mongo / nosql as it was designed and focus on single document, atomic operations that do not require you to interleave reads and writes. this means modeling your data in a de-normalized way and not treating mongo too much like a relational db. it’s actually what they recommend too!

OR just dont use mongo. postgresql4lyfe

SAML and OAuth purposes

i’ve been thinking about saml and oauth a lot lately due to work. i’m putting down some of my thoughts here on how i believe they differ in purpose.

let’s say there’s a COOL WEB APP with typical authentication using email and password. these creds get transmitted in an HTTP request, the server backend verifies the credentials and grants access to the user (hopefully user credentials are stored in a hashed form…).

but there are many situations in which you may not want to be the one in charge of creds. maybe you’re scared. or maybe you want potential users to log in using their existing account with a different service (say, Google or Facebook) for convenience. or both.

if the user authenticates with a third party service (that you trust), you are basically delegating identity verification to a separate service. this is sort of the point of both SAML and OAuth at a high level, but they’re designed for different problems.

oauth

implementing OAuth in the web app lets users of that app access the service through a trusted third party that does the auth. The user goes through a login flow with the third party and eventually the third party grants a token that your app can use to retrieve user information (such as email) that can be used to grant access to the COOL WEB APP.

as a user, i might access multiple services with different oauth flows. if i use service A, i might oauth with my google login. if i use service B, i might oauth with my facebook login. but what if i want to be auth’d into both service A and service B using a single auth step, AKA single sign on? i’m lazy and i don’t want to use different oauth flows for each service. or maybe i’m a big corp and i want to centralize IT auth flows.

OAuth, however, is not designed for single sign on scenarios (please correct me if i’m wrong!)

saml

that’s what SAML addresses! Implementing SAML for a web app is similar to OAuth in that auth is being delegated to a third party, but in this case that third party is a specific entity known as an “Identity Provider” (IDP).

the responsibility of an identity provider is to grant a user access to multiple cool web apps, not just the one COOL WEB APP. the way this works is that each of those web apps has a trust relationship with the identity provider. so first, trust is established with the IDP, and that trust is then used to gain access to the related services. that’s basically single sign on in a nutshell.

the user signs into a single service (the identity provider) and that service has a trust relationship with N services. The user can then log into N services (one of which is yours), all mediated through SAML. this is why i believe someone came up with the fancy $100 phrase “federated identity”.

there is no such thing as “non-relational” data

in data modeling discussions I often hear the phrase “non-relational data”. it’s usually someone making a case for why data should be in a NoSQL store and live denormalized. the argument is usually that the data itself is somehow inherently non-relational and so it should be put in a non-relational database.

in reality, most complex data have relationships and every conversation about storage systems is fundamentally about how to most effectively store and fetch that data. it’s a lower level implementation detail. so depending on the use cases and constraints, that data may be stored in a relational database or a non-relational database, but the data itself is not inherently “non-relational”.

for instance, let’s say you’re storing user analytics data and each user has many devices. there’s a one-to-many between a user and their devices, but if the devices list is append-only and you’ll never have to mutate any individual device, maybe there’s no need for a normalized representation of users and devices that makes you pay processing time for reads that require JOINs.

{
  "_id": ObjectId("5f5a1f5e3dc1264c7ee03d5a"),
  "username": "john_doe",
  "email": "[email protected]",
  "devices": [
    {
      "device_id": "123456789",
      "device_name": "Smartphone",
      "os": "iOS"
    },
    {
      "device_id": "987654321",
      "device_name": "Laptop",
      "os": "Windows"
    },
    {
      "device_id": "567890123",
      "device_name": "Tablet",
      "os": "Android"
    }
  ]
}

still, I don’t think that makes the data non-relational – there’s a one-to-many relationship. what it does mean is that there’s a processing performance trade-off if it’s stored in a normalized form (which you can usually do in a non-relational database nowadays).

how two phase locking prevents lost updates

two phase locking is an old method of ensuring serializable transactions using locks.

a common issue with non-serializable isolation (which is typically what most RDBMSes run with by default…) in the face of concurrent writes is the lost update problem.

here’s an example of a lost update. let’s assume:

  • a record in a database with column x that has a value of 5
  • there are two independent processes that will perform a read-modify-write on that record
  • the two processes read the current value at the same time. assuming readers don’t block readers, they’ll read the same value
  • process A will attempt to read the current value of 5 and increment it by 2
  • process B will attempt to read the current value of 5 and increment it by 5
  • process A writes first and the new value of x is 7
  • process B writes last and the new value of x is 10. The update from process A is considered lost / as if it never happened!

some databases will detect a write conflict like this and raise an error and others may just proceed without a hiccup. In either case, there’s a potential conflict that may need to be resolved depending on the expectations of a client. If this is a bank account, a lost update is a huge problem. If this is a view counter, maybe it’s not as big of a problem.
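to make the interleaving concrete, here’s a little sketch of the read-modify-write race (an in-memory variable standing in for the database record):

// simulate two clients doing a read-modify-write on the same record with no locking
let x = 5;

async function readModifyWrite(delta, delayMs) {
  const current = x;                                // read
  await new Promise((r) => setTimeout(r, delayMs)); // some processing time
  x = current + delta;                              // write based on the (now stale) read
}

(async () => {
  // both "processes" read x = 5 before either one writes
  await Promise.all([readModifyWrite(2, 10), readModifyWrite(5, 20)]);
  console.log(x); // 10 – the +2 update is lost; a serial execution would have produced 12
})();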

(conservative) two phase locking

there are two types of locks in two phase locking (2PL): shared locks and exclusive locks. exclusive locks block both reads and writes; only shared locks can be shared. in snapshot isolation levels, exclusive locks from writers only block other writers – readers can still read records that are exclusively locked. with serializable, writers can block readers.

the reason 2PL is able to guarantee serializability is by splitting its locking behavior into two phases – one before actual query execution and the other after.

prior to actually executing the query, the query planner figures out what locks it needs to obtain. If a transaction only does a read, only a shared lock is acquired for that record. If, however, a transaction does a read of a record followed by a write to the same record, the shared lock is “upgraded” to an exclusive lock.

once all the locks are obtained (lock point), the locks are held until execution is complete. If you have two concurrent transactions, the first one with an exclusive lock to reach the lock point will block the other. This is what ensures serial execution.

and finally, the reason this is sometimes referred to as conservative 2PL is that it obtains all the locks upfront, regardless of whether or not it ends up needing them.

mongoDB read preferences

MongoDB read preferences give you control over read behavior when using replica sets. Writes always go to the primary, but reads can be configured to go to a secondary or the primary based on various criteria. In most versions of mongo, the read preference defaults to primary in the client, but you should check your version for the default.

primary

Read from the primary. if the primary is down, the read will fail. This is useful if you need the latest successful write and have zero tolerance for stale data. The terms “latest” and “successful” are a bit of a sliding scale depending on the read concern. For example, with read concern local to a primary, it’s possible to read writes from other clients that have not been acknowledged by the primary node. In practical terms, this just means that the data could be lost in the event of a rollback event since it may not have been replicated to a majority of nodes.

Regardless of a write’s write concern, other clients using "local" or "available" read concern can see the result of a write operation before the write operation is acknowledged to the issuing client.

https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/#read-uncommitted

But I thought mongo maintained a WAL using an on-disk journal? Why would data be rolled back at all?

Yes, the journal is used for data recovery in the event of a node shutdown. But when a primary node goes down, a new node is elected as the primary, and the previous primary rejoins the cluster as a secondary, it may have data / writes that are ahead of the new primary. Since the new primary is the source of truth, those writes need to be rolled back to be consistent with the new primary!

If you want data that has the highest durability level, use majority read concern. When read concern is majority, it will only return data that has been successfully acknowledged by a majority of nodes. However, this does sacrifice some level of data freshness because the latest write from a client may not have been replicated to the majority of nodes yet, so you may be reading an older value even though you’re reading from the primary.

Another factor to consider when choosing this setting is your primary node’s capacity and current load – if your primary is close to capacity on cpu or memory use, switching reads to primary may impact writes.

primaryPreferred

Read from the primary, but if the primary is unavailable, the read goes to a secondary. This setting is appropriate for cases where you want the most recent write but can tolerate occasional stale reads from a secondary. Replication lag can range pretty widely from single digit ms to minutes depending on the write volume, so keep in mind that during periods of high replication lag you’re more likely to be reading stale data.

secondary

All reads go to secondaries. Not all secondaries though – just one based on a server selection process. The server selection works like this:

First, there’s the default threshold value, which is 15ms. This isn’t the whole window used for server selection though – it’s only a part of it. The other part is the lowest average network round trip time (RTT) among the secondary nodes. So if the closest node has an RTT of 100ms, the total set will be nodes that fall within 100 + 15 (115ms). It’ll then randomly select a node from that set to forward the request to.
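Here’s a tiny sketch of that selection logic (the 15ms default is what mongo calls localThresholdMS; the node list is made up):

const LOCAL_THRESHOLD_MS = 15;

// any node within (lowest RTT + threshold) is eligible, then one is picked at random
function selectSecondary(nodes) {
  const minRtt = Math.min(...nodes.map((n) => n.rttMs));
  const eligible = nodes.filter((n) => n.rttMs <= minRtt + LOCAL_THRESHOLD_MS);
  return eligible[Math.floor(Math.random() * eligible.length)];
}

// the 100ms node sets the window to 115ms, so the 200ms node is never selected
console.log(selectSecondary([
  { host: 'secondary-a', rttMs: 100 },
  { host: 'secondary-b', rttMs: 110 },
  { host: 'secondary-c', rttMs: 200 },
]));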

This read preference can make sense if your system is very read heavy AND you can tolerate stale reads. Scaling out with many secondary nodes is one way to spread your read load and improve read availability. At the same time, it may also add load to your primary (due to the increased oplog syncing), increase inconsistency and the likelihood of data staleness because there are now more nodes that can be out of sync with writes, and potentially increase write latency for writes that require majority acknowledgement, because you’ll need N / 2 + 1 acknowledgements (e.g. 3 out of 5 nodes instead of 2 out of 3).

secondaryPreferred

All reads go to secondaries, but if not available, read from the primary. The same concerns for “secondary” read preference applies here. The additional factor to consider here is whether your primary can afford to take on this additional load.

nearest

Doesn’t matter which, just pick the one with lowest latency based on the server selection algorithm I mentioned in the secondary read preference section.

why the words in “CAP theorem” are confusing

let’s go through each one, ok? my goal is to make you as confused as i am

consistency

the word “consistency” is super overloaded in the realm of distributed systems. In CAP, consistency means (informally) that every read reflects the most recent write. This is also known as single-copy consistency or strict / strong consistency. Data is replicated to multiple nodes in these systems, but any reads to this storage system should give the illusion that reads and writes are actually going to a single node, even in a multi-user / concurrent environment. Put another way – there are multiple copies of the data in practice, but readers always see the latest (single) copy, by some loose definition of “latest”.

unfortunately, “strong” consistency is a grey area with lots of informal definitions. Some distributed systems researchers such as Martin Kleppmann equate CAP consistency with linearizability, which, depending on your definition of strong consistency, is either a weaker form of strong consistency or a more precise and formal definition of it. Does consistency mean that every write is instantaneously reflected on all nodes? Or, in the case of linearizability, that even though reads may not reflect the latest write, they will always reflect the correct ordering of writes based on some agreed upon global time?

data stores that support transactions are also characterized by their ACID properties (Atomic, Consistent, Isolated, Durable). ACID consistency has nothing to do with ensuring that readers get the latest write; ACID is about the preservation of invariants in the data itself (such as keeping debits and credits balanced) within a transaction, which is often the responsibility of the application itself rather than the storage system. Relational databases support specific forms of invariant enforcement through referential integrity, but many invariants cannot be easily enforced by the storage system. Are they related at all? Well, they’re related insofar as CAP consistency trade-offs affect ACID consistency for system designers, but they’re still completely different notions of consistency! For example, if it’s possible for a read to return stale data during a transaction (dirty reads), this can make it difficult to maintain invariants for the application.

availability

availability in CAP has a very strict, binary definition. CAP availability states that every non-failing node (even if isolated from the rest of the system) can respond successfully to requests. This is confusing because most site operators talk about availability in terms of uptime and percentages. For example, consider a system that currently serves most reads from secondary nodes. Let’s assume that most of the secondary nodes fail and can no longer serve requests. The system switches all reads to the primary and is still technically “available” from a user standpoint. This fails the CAP definition of availability, but passes the more practical operational definition of availability. One way to distinguish one from the other when they’re discussed at the same time is to refer to CAP availability as “big A” and practical availability / uptime as “little a”.

in practice, availability (“little a”) isn’t black and white (can EVERY non-failing node respond?). most real world systems don’t even offer this strong guarantee. This definition might make sense in a theoretical system and make it easier to reason about formally / mathematically, but in reality many storage systems allow users to choose between different levels of availability, and for different operations. For example, some databases allow users to tune availability for reads or writes using quorums. Systems that prioritize data durability in the face of failures may sacrifice some availability for writes: if writes cannot be propagated to a majority of nodes in the system, the system returns an error. At the same time, it may also prioritize availability for reads, so perhaps any response from any replica is considered successful, even if that data might be stale. If you’re trying to reason about the availability of a storage system, these details matter and are far more relevant than whether or not the system is capable of achieving big A (CAP availability).
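for a concrete version of that quorum tuning: in a 5-node system, requiring writes to be acknowledged by 3 nodes and reads to consult 3 nodes guarantees that every read set overlaps the latest acknowledged write set (3 + 3 > 5). drop the read side to 1 node and you gain read availability and latency, but you give up that overlap guarantee and may read stale data.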

partition tolerance

partition tolerance is confusing on multiple levels.

the word “partition” itself is confusing because partitions in a distributed system usually refer to splits or shards of a data store that are maintained for purposes of scalability. In CAP, a partition refers to a network partition, which is basically a network error that disrupts node to node communication. if a system remains operational in the presence of a network partition, then it is considered partition tolerant in the CAP model.

second, CAP appears to present partition tolerance as something a distributed system has or does not have. but because networks are fundamentally unreliable, system designers need to design systems to tolerate network errors. there’s not really a choice or trade-off to be made – it’s more of a requirement. unlike availability and consistency (which are actual trade-offs that need to be made when there is a network partition), partition tolerance is not something you can relax in practice. so designers are really choosing between consistency and availability, assuming that there will be network partitions. this is why some people say CAP is really just CP and AP: you either choose consistency with partition tolerance or availability with partition tolerance.

finally, partitions in the network are relatively rare events. two data centers or nodes within a data center may lose communication, but that’s rare compared to individual node failures (disk full, over-capacity, random crashes) that happen all the time in a large network. in the model of CAP, however, a single node failure isn’t a network failure, so it falls outside the stated rules of CAP. eric brewer has clarified though that in practice, nodes that fail to respond within some time window can be treated as conceptually equivalent to an actual partition event.

SICP Exercise 1.11 – Recursive & Iterative

Prompt

A function f is defined by the rule that f(n) = n if n < 3 and f(n)=f(n-1) + 2*f(n-2) + 3*f(n-3) if n >= 3. Write a procedure that computes f by means of a recursive process. Write a procedure that computes f by means of an iterative process. Hint: for iteration, think about how you might code this using a for-loop in a language like Python.

Solution

The recursive form is easier to tackle – LISPs probably do the best job of expressing recursive mathematical expressions due to their extremely minimal syntax.

(define (fn n)
  (if (< n 3)
      n
      (+
       (fn (- n 1))
       (* 2 (fn (- n 2)))
       (* 3 (fn (- n 3)))
      ) 
   )
)

Here’s an (incomplete) expansion using substitution of a call to (fn 10). Each newline denotes the expanded expression of the previous (the (* 2 …) and (* 3 …) wrappers are left out below to keep the expansion readable). Note the crazy combinatoric explosion of having not one but three recursive calls in the body – this creates a huge tree of deferred addition and multiplication operations.

(fn 10)

(+ (fn 9) (fn 8) (fn 7))

(+
 (+ (fn 8) (fn 7) (fn 6))
 (+ (fn 7) (fn 6) (fn 5))
 (+ (fn 6) (fn 5) (fn 4)))

(+
 (+
  (+
   (+ (fn 7) (fn 6) (fn 5))
   (+ (fn 6) (fn 5) (fn 4))
   (+ (fn 5) (fn 4) (fn 3))
  )
  (+
   (+ (fn 6) (fn 5) (fn 4))
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
  )
  (+
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
   (+ (fn 3) (fn 2) (fn 1))
  )
 )
 (+
  (+
   (+ (fn 6) (fn 5) (fn 4))
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
  )
  (+
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
   (+ (fn 3) (fn 2) (fn 1))
  )
  (+
   (+ (fn 4) (fn 3) (fn 2))
   (+ (fn 3) (fn 2) (fn 1))
   (+ (fn 2) (fn 1) (fn 0))
  )
 )
 (+
  (+
   (+ (fn 6) (fn 5) (fn 4))
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
  )
  (+
   (+ (fn 5) (fn 4) (fn 3))
   (+ (fn 4) (fn 3) (fn 2))
   (+ (fn 3) (fn 2) (fn 1))
  )
  (+
   (+ (fn 4) (fn 3) (fn 2))
   (+ (fn 3) (fn 2) (fn 1))
   (+ (fn 2) (fn 1) (fn 0))
  )
 )
)
...

Eventually, this long recursive process bottoms out at the base case when we reach (fn 2), (fn 1), and (fn 0) since none of those invocations will reach the recursive section of the procedure. For example, (fn 2) will return 2 since 2 < 3. These values will play an important role in developing the iterative solution.

Iterative Form

There are a few ways to think about this problem to get an intuition for what the iterative form might look like. One way is to consider a similar problem such as the fibonacci sequence.

Fibonacci Sequence

Consider a similar mathematical expression such as the fibonacci sequence expressed in scheme as a recursive procedure.

(define (fib n)
  (if (< n 3)
      1
      (+ (fib (- n 1)) (fib (- n 2)))))

It’s simpler since there aren’t additional multiplication operations and you’re only dealing with two recursive calls in the body instead of three, but it exhibits the same pattern of process growth (exponential expansion followed by contraction).

(fib 5)

(+ (fib 4) (fib 3))

(+ (+ (fib 3) (fib 2)) (fib 3))

(+ (+ (+ (fib 2) (fib 1)) (fib 2)) (fib 3))

...

In the iterative form, you compute the next result using two values you currently have. You accumulate the answer from bottom up.

Here’s an example in scheme:

(define (fib n)
  (define (iter x a b)
    (if (> x n)
        a
        (iter (+ x 1) b (+ a b))))
  (if (< n 2) 1 (iter 2 1 1)))

a holds the value from two steps back and b holds the value from one step back. If we’re dealing with an n that’s less than 2, return 1. Otherwise kick off the recursive process with (iter 2 1 1). In each step, we compute the next result (+ a b) and feed that into the next recursive call as the new b.

There’s a few key insights from this iterative fibonacci definition:

  1. You start with two values and you use them to compute the next value. As the computation evolves, so do the two values you’re holding at any given point.
  2. Each step contains all the state you need to compute the next recursive step – there are no deferred operations
  3. Some notion of a “base case” is still necessary for knowing when to stop the accumulation. In this example we use x and we stop when x is greater than the initial argument to the fibonacci function.

Back to the problem

The fibonacci definition has a key trait in common with our problem: the result for n is computed from the previous results – the last three in our problem instead of the last two for fibonacci. In fact, they’re so similar that if you squint you might see the iterative form of our problem!

(define (fn n)
  (define (iter x a b c)
    (if (> x n)
        c
        (iter (+ x 1) b c (+ (* 3 a) (* 2 b) c))))
  (if (< n 3) n (iter 3 0 1 2)))

Unlike our fibonacci problem, we need to perform both addition and multiplication (+ (* 3 a) (* 2 b) c) in order to generate the next result. This is what moves the computation along. With fibonacci, there’s just an addition (+ a b) that gets computed as the new next value – but the same underlying pattern applies. Compute the next value, shift all three numbers forward and repeat.
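To see the state evolve, here’s a quick trace of (fn 4) – each line is the next call to iter, and the final answer matches f(4) = f(3) + 2*f(2) + 3*f(1) = 4 + 4 + 3 = 11:

(fn 4)
(iter 3 0 1 2)
(iter 4 1 2 4)
(iter 5 2 4 11)
11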

Commentary

We got this problem on the first day of David Beazley’s SICP course and it completely destroyed my brain when I initially attempted to come up with the iterative solution to this.

The new syntax combined with the noise from all the additional computation in the recursive function made it very challenging. I didn’t really have an AHA moment to this until David shared his solution with us and I only realized the connection to fibonacci when I set out to write this post and revisit my understanding of the exercise.

I think this is a great problem in terms of noticing the tension between the simplicity and beauty of recursive procedure definitions and their performance. You have this simple definition that grows wildly during runtime. Then you go forth and optimize it so that it doesn’t accumulate on the stack, but you get a far less intuitive looking definition as a result. There’s also the fun challenge of porting a recursive definition to an iterative one and noticing some of the difficulties in coming up with the right base cases. It’s really easy to make off-by-one errors with these problems, and it’s not immediately obvious why the results are wrong until you go back and trace the mathematical definitions manually, which can be tedious.

SICP Exercise 1.9 – Process Illustration

Prompt

Each of the following two procedures defines a method for adding two positive integers in terms of the procedures inc, which increments its argument by 1, and dec, which decrements its argument by 1.

(define (inc x) (+ x 1))
(define (dec x) (- x 1))

; Implementation 1
(define (plus-v1 a b)
  (if (= a 0) 
      b 
      (inc (plus-v1 (dec a) b))))

; Implementation 2
(define (plus-v2 a b)
  (if (= a 0) 
      b 
      (plus-v2 (dec a) (inc b))))

Using the substitution model, illustrate the process generated by each procedure in evaluating (plus-v1 4 5) and (plus-v2 4 5). Are these processes iterative or recursive?

Solution

An important distinction to reiterate here is the difference between a recursive procedure and a recursive process. A procedure that references itself is a recursive procedure. In the implementations of plus above, both procedures are recursive. It’s a static construct. But a recursive procedure, at runtime, can generate an iterative process! Whether a process is iterative or recursive depends on the manner in which the computation is carried out.

A common way to illustrate each evaluation step of a procedure application process is using the substitution model to show the expression that gets evaluated.

Here’s an example of using substitution applied to the first plus procedure.

(plus-v1 4 5)
(inc (plus-v1 3 5))
(inc (inc (plus-v1 2 5)))
(inc (inc (inc (plus-v1 1 5))))
(inc (inc (inc (inc (plus-v1 0 5)))))
(inc (inc (inc (inc 5))))
(inc (inc (inc (+ 5 1))))
(inc (inc (inc 6)))
(inc (inc (+ 6 1)))
(inc (inc 7))
(inc (+ 7 1))
(inc 8)
(+ 8 1)
9

The first (plus-v1 4 5) expression is our initial procedure invocation.

If we evaluate this we will get our next result. Since 4 does not equal 0, the else branch of the if is taken and we get (inc (plus-v1 3 5)). If we evaluate this again, notice that we don’t expand the expression for inc yet, because we don’t know what the value of its argument is. We need to continue expanding the plus expressions until we reach an actual primitive value.

In other words, we are deferring the evaluation of inc until later. If you look at our full expansion, we only start calling inc when we get the value 5 back from (plus-v1 0 5). This chain of deferred operations during the evaluation process makes this a recursive process!

Now contrast this to the other plus procedure plus-v2.

(plus-v2 4 5)
(plus-v2 3 6)
(plus-v2 2 7)
(plus-v2 1 8)
(plus-v2 0 9)
9

In each step of the evaluation, no operation is deferred. We’re incrementing one number, decrementing the other and calling the function again. There are no function calls waiting for a result. This is an iterative process. In iterative processes, the complete state is computed in each step. In other words, prior to calling the procedure, all the parameters that would bind to the arguments are provided with primitive values.

Comments

When we did this exercise with David Beazley in his SICP course, he joked that this exercise was sort of like desert island coding. You’re stranded on a desert island… if you only had increment and decrement, how would you implement addition and multiplication? Funny enough, he was actually stuck in a scenario not too far from that when he had to re-invent the shell using python while locked in a secure vault (very entertaining talk!).

What I like about this exercise is how effectively it gives students a feel for the difference between recursive and iterative processes. By expanding out an expression by hand using substitution, you can see operations expanding and contracting. This raises a lot of interesting ideas and questions around performance, recursive function base-cases, and whether all recursive forms have an iterative form and vice-versa.

SICP Exercise 1.6 – Cond as If?

Prompt

Alyssa P. Hacker doesn’t see why if needs to be provided as a special form. “Why can’t I just define it as an ordinary procedure in terms of cond?” she asks. Alyssa’s friend Eva Lu Ator claims this can indeed be done, and she defines a new version of if:

(define (new-if predicate 
                then-clause 
                else-clause)
  (cond (predicate then-clause)
        (else else-clause)))

Eva demonstrates the program for Alyssa:

(new-if (= 2 3) 0 5)
5
(new-if (= 1 1) 0 5)
0

Delighted, Alyssa uses new-if to rewrite the square-root program:

(define (sqrt-iter guess x)
  (new-if (good-enough? guess x)
          guess
          (sqrt-iter (improve guess x) x)))

What happens when Alyssa attempts to use this to compute square roots? Explain.

Solution

The new-if appears to work since we’re getting the correct results in the simple examples. However, because new-if is an ordinary procedure, it’s subject to applicative-order evaluation, which is a subtle but important difference from the real if special form.

With applicative-order evaluation of procedure applications, both the then-clause and the else-clause get evaluated before the body of new-if runs. We end up always evaluating the recursive sqrt-iter call, even if the guess is good enough. This completely breaks recursive procedures that rely on a base case condition to terminate: the recursion never ends because sqrt-iter is evaluated on every call.
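To see the failure in isolation, here’s a small sketch (not part of the exercise) where the else-clause is an expression that should never run:

(define (new-if predicate then-clause else-clause)
  (cond (predicate then-clause)
        (else else-clause)))

; the real if special form never evaluates the alternative when the
; predicate is true, so this returns 0
(if #t 0 (error "not reached"))

; new-if is an ordinary procedure, so both clauses are evaluated before
; its body runs, and the error is signaled even though the predicate is true
(new-if #t 0 (error "boom"))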

I really like this exercise because at a glance, it seems perfectly reasonable to implement if in terms of cond. After all, cond already has the evaluation behavior you want: the else clause isn’t evaluated at all if the predicate is true. So why not just substitute one for the other?! That sounds good! But in trying to do this by defining a new procedure, you immediately lose that special-form evaluation behavior. This really drives home how essential lazy evaluation behavior is for conditionals.

I think often as consumers of programming languages we take this behavior for granted and don’t think twice about how conditional expressions are evaluated. But this sort of exercise (and many exercises in SICP!) puts you in the seat of the language designer.

Difference between Docker ENTRYPOINT and CMD

ENTRYPOINT and CMD are two Dockerfile instructions that sound interchangeable, but there are important differences that I’ll cover in this post. I suspect CMD is probably the more familiar instruction, so I’ll go over what that does and how it differs from ENTRYPOINT.

Here’s the purpose of CMD, taken straight from the docker manual:

The main purpose of a CMD is to provide defaults for an executing container.

https://docs.docker.com/engine/reference/builder/#cmd

If you start a container via docker run or docker start and you don’t supply any commands, the last CMD instruction is what gets executed. In most Dockerfiles, this effectively acts as “main” or … “entrypoint”. I put entrypoint in quotes both to distinguish it from the formal ENTRYPOINT instruction and to show you why this naming is confusing!

Here’s an example of a Dockerfile that runs the Rails server using CMD:

# Use a base image with Ruby and Rails pre-installed
FROM ruby:3.2

# Set the working directory inside the container
WORKDIR /app

# Copy the Gemfile and Gemfile.lock to the working directory
COPY Gemfile Gemfile.lock ./

# Install dependencies
RUN bundle install

# Copy the rest of the application code to the working directory
COPY . .

# Set the default command to run the Rails server
CMD ["rails", "server", "-b", "0.0.0.0"]

CMD does not create a new image layer, unlike instructions like RUN. It does not do anything at build time. So when you run docker build to create a Docker image from a Dockerfile, rails server is not being executed. It’s purely a runtime (container runtime) construct. This interleaving of instructions in a Dockerfile that are intended for different stages of the container lifecycle is also a common source of confusion for beginner users of Docker.

In practice, at least from my experience, CMD is sufficient. I work mostly on web services, and the vast majority of containers run a server process of some kind using CMD <start server> after the application dependencies are built. For most situations, this is enough and you never have to think about or even be aware of the existence of ENTRYPOINT.

Have a nice day.

Just kidding, okay lets go over ENTRYPOINT now.

Here’s a rather confusing description from Docker’s website on what ENTRYPOINT is:

An ENTRYPOINT allows you to configure a container that will run as an executable.

https://docs.docker.com/engine/reference/builder/#entrypoint

So CMD is to provide defaults for an executing container and ENTRYPOINT is to configure a container that will run as an executable. That’s a little confusing because using CMD sort of also allows you to run a container as an executable right? You start the container and something executes!

Here’s a simpler definition:

An ENTRYPOINT is always run. It doesn’t matter what arguments you pass to docker run; they never replace the ENTRYPOINT (that requires the separate --entrypoint flag). Instead, any arguments you pass are appended to the end of what’s already specified in ENTRYPOINT.

Here’s an example of using the ENTRYPOINT instruction in a Dockerfile:

FROM ubuntu:latest
ENTRYPOINT ["echo", "Hello, World!"]

In this example, the ENTRYPOINT is set to the command echo "Hello, World!". When a container is created from this image and started, the message “Hello, World!” will be printed to the console. Running docker run my-image "Welcome" would result in the message “Hello, World! Welcome” being printed.

Let’s take a look at an example using CMD to demonstrate the override behavior:

FROM ubuntu:latest
CMD ["echo", "Hello, World!"]

In this case, the CMD instruction specifies the default command as echo "Hello, World!". However, if a user runs the container with a different command, the specified command will override the CMD instruction. For instance, running docker run my-image "Welcome" would output “Welcome” instead of the default message “Hello, World!”. The entire command is overridden.

Both CMD and ENTRYPOINT can be used together in situations where you want

  • A particular command that always executes and cannot be overridden by docker run arguments (this is where ENTRYPOINT comes in handy)
  • A set of default arguments for the ENTRYPOINT command

This offers some flexibility in how users of your image can provide custom arguments to alter the runtime behavior of your container. In cases where you don’t need that customization or where it’s perfectly fine for users to provide their own “entrypoints”, just use CMD.
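Here’s a minimal sketch of combining the two (the image name my-image is just for illustration):

FROM ubuntu:latest
# Always runs, with a fixed first argument
ENTRYPOINT ["echo", "Hello,"]
# Default argument for the ENTRYPOINT, used when docker run gets no arguments
CMD ["World!"]

Running docker run my-image prints “Hello, World!”, while docker run my-image "Docker!" prints “Hello, Docker!” because the run arguments replace CMD but still get appended to ENTRYPOINT.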

running docker container as non-root

one common misconception is that containers provide a secure and isolated environment and therefore it’s fine for processes to run as root (this is the default). I mean, it’s not like it can affect the host system right? Turns out it can and it’s called “container breakout”!

[image: dall·e 2 rendering of a “container breakout”]

with containers, you should also apply the principle of least privilege and run processes as a non-root user. This significantly reduces the attack surface because any vulnerability in the container runtime that happens to expose host resources to the container is less likely to be exploitable by a container process that does not have root permissions.

here’s a rough skeleton of how you can do this by using the USER directive in the Dockerfile:

# Start from a Debian/Ubuntu-based image so adduser is available (example base image)
FROM ubuntu:22.04

# Create a non-root user named "appuser" with UID 1200
RUN adduser --disabled-password --gecos "" --uid 1200 appuser

# Set the working directory
WORKDIR /app

# Grant ownership and permissions to the non-root user for the working directory
RUN chown -R appuser /app

# Switch to the non-root user before CMD instruction
USER appuser

# ... install app dependencies ...
# this may involve copying over files and compiling

# Execute script
CMD ["./run"]

one thing worth mentioning in this example is permissions.

we use the USER instruction early on, right after we change the ownership of the working directory. since this happens before we enter the application build phases, it’s possible that appuser has insufficient permissions for subsequent commands. For example, maybe at some point in your Dockerfile you need to change the permissions of a file that appuser doesn’t own, or maybe it needs to write to a bind-mounted directory owned by a host user. If this applies to your situation, you can either adjust permissions as needed prior to running USER or move USER farther down towards the CMD instruction.

generally speaking, it’s also good practice to ensure that the files that are copied over from the host have their ownership changed to appuser. this isn’t as critical as ensuring that the process itself is running as non-root via USER since if an attacker gains privileged access, they can access any file in the container regardless of ownership. nonetheless it’s a good practice that follows the principle of least privilege in that it scopes the ownership of the files to the users and groups that actually need it.
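one way to do that is the --chown flag on COPY, which sets ownership as files are copied into the image (a small sketch, reusing the appuser from the skeleton above):

# copy application files in as appuser instead of root
COPY --chown=appuser:appuser . .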

other resources if you’re interested in learning more about this topic:

  • https://medium.com/jobteaser-dev-team/docker-user-best-practices-a8d2ca5205f4
  • https://www.redhat.com/en/blog/secure-your-containers-one-weird-trick
  • https://www.redhat.com/en/blog/understanding-root-inside-and-outside-container

Distributed Caching Pattern: Write Through

Write-through caching is a caching mechanism where data modifications are written to both the source database and the cache. In this approach, every write operation triggers a write to a RAM cache (such as redis) and the source database (such as postgres).

This is commonly used to complement the look-aside caching pattern so that when subsequent read operations request the same data, there’s a greater likelihood of a cache hit, reducing response time and minimizing load on the source data store. Additionally, the cached data being returned will be fresher than with a strategy that relies only on TTL or expiration-based invalidation.

The key benefit to using a write-through cache is data freshness. By updating the cache during updates to the source database, you ensure that the cached data remains up to date. One of the common problems with relying strictly on expiry based invalidation to serve fresh data is the risk of stale data during the period between when the data is cached and when it expires.

How to implement (Ruby on Rails Example)

Let’s look at a simple example. When data needs to be written, it is first updated in the source database and then persisted in the cache. Here’s an example using Rails. In this example, we’re reading a list of blog posts using look-aside caching. On writes (updates to an individual blog post), we update the cache. This keeps the list of blog posts in the cache fresh!

class PostsController < ApplicationController
  def index
    @posts = Rails.cache.fetch('posts', expires_in: 1.hour) do
      Post.all.to_a
    end
    render json: @posts
  end

  def create
    @post = Post.new(post_params)
    if @post.save
      update_posts_cache
      render json: @post, status: :created
    else
      render json: @post.errors, status: :unprocessable_entity
    end
  end

  def update
    @post = Post.find(params[:id])
    if @post.update(post_params)
      update_posts_cache
      render json: @post
    else
      render json: @post.errors, status: :unprocessable_entity
    end
  end

  def destroy
    @post = Post.find(params[:id])
    @post.destroy
    update_posts_cache
    head :no_content
  end

  private

  def post_params
    params.require(:post).permit(:title, :content)
  end

  def update_posts_cache
    Rails.cache.write('posts', Post.all.to_a, expires_in: 1.hour)
  end
end

In this example, the index action retrieves the list of blog posts from the cache using the key `posts`. If the cache contains the posts, they are returned directly. Otherwise, the posts are fetched from the database (Post.all.to_a) and stored in the cache with a 1-hour expiration time.

Implementing write-through isn’t always straightforward because cached data may not always be easily re-computed on writes. The extent to which you can re-compute cached data during a write greatly affects the feasibility of a write-through implementation. In the example above, we re-computed the list of posts in the cache by invoking Post.all.to_a in the write path. But what if instead of serving all posts on reads, our application only serves unread posts for logged-in users? If you have 10 logged-in users, they need to be served different lists of posts! (As an exercise, think about how you would handle this before peeking at the sketch below!)
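One way to approach it, just a sketch: scope the cache key to the user and, on writes, invalidate rather than recompute, since recomputing every user’s list on each write isn’t practical. The unread_by scope and the key scheme are made up for illustration, and delete_matched isn’t supported by every cache store (memcached notably lacks it).

# read path: each user gets their own cache entry
def index
  cache_key = "users/#{current_user.id}/unread_posts"
  @posts = Rails.cache.fetch(cache_key, expires_in: 1.hour) do
    Post.unread_by(current_user).to_a # hypothetical scope
  end
  render json: @posts
end

# write path: instead of recomputing every user's list, delete the affected
# entries and let subsequent reads repopulate them
def update_posts_cache
  Rails.cache.delete_matched("users/*/unread_posts")
end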

Sometimes cache re-computability is affected by the API paradigm you choose. RESTful services are able to leverage write-through caching better than GraphQL services. Let’s see why.

GraphQL and Write Through Caching

At work we heavily use GraphQL APIs, and GraphQL presents unique challenges for caching. The entire paradigm of GraphQL rests on allowing clients to query for exactly the data they need. This leads to higher variability in queries of nearly unbounded complexity (you can have highly nested queries). Common caching techniques can be easily applied to applications serving a small set of frequently executed but fixed queries, but if every query is likely to be unique, the value of caching is diminished.

There’s a bigger reason though why caching (particularly write-through caching) is trickier with GraphQL. In reality, most GraphQL servers are built for internal product teams and client queries aren’t infinitely variable. There’s typically some fixed number of clients that use some set of queries to power an experience.

For example, at a video streaming company there may be a recommendations team (the client) that uses a set of queries for fetching video metadata from an internal video catalog service (graphql server).

Lets assume…

  1. There’s a high traffic query that looks like this defined within the recommendation client: query { videos { title createdAt reviews { reviewerFullName } } }.
  2. Within the video service, query results are cached on read with look-aside caching.

Now let’s say you want to implement write-through caching to keep the cache of the recommendation team’s query results warm. If video metadata gets updated throughout the day in the video catalog service (let’s just say by some other service via HTTP), how does the cache get updated?

Here’s what we know:

  1. The query is defined on the client (unlike REST, where the query is defined on the server).
  2. The result of the query operation is computed at runtime in the video service resolvers.

This is a situation where the cached data (JSON result of the GraphQL query) is no longer easily computable. In order to compute that data, you need to execute the original query against the schema using the same inputs. But wait, GraphQL servers don’t define the queries, clients do!

Unless the application changes the type of data that gets cached or the server records queries being executed against the API (to be “replayed” on relevant writes), warming this type of cache using write-through is not really possible. So in practice, most GraphQL servers focus on lazy-loading read-based caching strategies such as automatic persisted queries using GET requests or fragment caching on the server side.

Risks

Adopting write-through caching isn’t without risks – before implementing it, it’s worth considering the following in the context of your unique problem.

  • Increased Write Latency. Write-through caching introduces additional overhead due to the need to write data to both the cache and the main data store before returning a result. This can be an issue for applications with high write volumes that cannot tolerate additional write latency. That said, users generally have a higher tolerance for slow writes than for slow reads.
  • Redundant / Unnecessary data in cache. The cache may fill up regardless of whether the data being written is going to be used. A write isn’t really a guarantee of a future read, so you may end up with a cache filled with unread data (which defeats the purpose of a cache). If you don’t have proper eviction policies in place, systems with high write volumes can rapidly fill up a cache and make it unusable. On the other hand, if there are eviction policies in place and you haven’t properly sized your cache, you end up with churn, where data written to the cache is evicted before it has a chance to be used by readers.
  • Delayed benefit. The cache is only updated when something changes as a result of a write. If nothing changes, the cache stays empty. This is also why you rarely see systems that rely exclusively on write-through caching. If updates are infrequent, the cache will stay cold forever unless you also cache when the data is being read.

Questions to ask before caching with a remote cache

Distributed caching with a system like memcached or redis comes at a cost! If you haven’t added a remote cache yet, here are some costs to consider:

  1. The operational cost of actually standing up and maintaining a remote caching service. Is it a managed service like AWS ElastiCache or self-managed? What are the failure modes? How does it get secured and upgraded?
  2. More complexity in having another system in the call stack that makes it harder to understand and troubleshoot application behavior. Data freshness and consistency issues come to the forefront whenever a cache (or really any derived data) is involved.
  3. General exposure to an entire class of new problems (cascading failures) that comes with sharing load with another system.

Even if you already operate or use a remote cache, caching any bit of data still adds complexity to your application – and not the essential kind. It’s easy to use caching strategies as a hammer for every performance issue, but here are some questions I recommend you ask before reaching for caching as a solution.

Have I figured out what the problem really is? Is speed really an issue?

Programmers tend to want everything to be fast – and while it’s important that apps generally feel snappy and help users get stuff done efficiently, speed doesn’t impact the user experience the same way in every interaction they have with your app.

For example, users expect fast read experiences (like searching a catalog on ecommerce sites) but slower write / update experiences (like submitting a payment). If you spend all your time optimizing a slow payment submit process when most of your users are bouncing because they can’t be bothered to even look at your inventory, you’re wasting your time.

What specifically needs to be faster and by how much?

Just like most metrics, optimization has diminishing returns. How fast do you want an experience to be? Under a second? Under 100 milliseconds? If your p99 is < 100ms and there’s no reason to believe that your users are unhappy with performance, the return on investing time in reducing latency may be low.

The marginal value of optimization depends on your use-case and current baseline performance. If you’re a backend API team with an SLA, the specific numbers may even be hard-coded into a contract. If there are no binding agreements on performance, the requirements are fuzzier and may be driven more by best practices and customer reports. In either case, do some cost-benefit analysis. This doesn’t have to be rigorous, and sometimes the justification, regardless of how high the lift is, may be as simple as “users will stop using this dashboard if it continues taking five seconds to load”.

Finally, once you’ve picked a target, don’t forget to measure your progress. Are there certain metrics (percentile latency, profiling stats, etc.) you can use to describe the current performance, and how will you know when you’ve reached your goal? I really enjoy performance work, but I also need to know when to stop so I don’t lose sight of other goals.

Can I optimize without caching?

Reaching for caching as a strategy in a first pass at performance work does two things that are not ideal for the resilience of your system.

  1. It hides inefficiencies (poorly written queries, N^2 algorithms) that exist in your program, which will be much more difficult to spot once the cache is in place.
  2. It can make your cache load bearing over time, meaning unless your cache is healthy, your application is unusable under normal load. Since most distributed in-memory caches are, by design, meant to be volatile and ephemeral, cold-cache events become an Achilles heel for your system.

Before you cache data, consider common performance bottlenecks in your system. Some common examples for web apps are:

  • Excess database querying, such as N+1 queries (see the sketch after this list).
  • Excess memory allocation, especially for garbage-collected languages. I currently use Rails for work, and Active Record object allocations are particularly expensive since they lead to longer GC cycles that impact latency.
  • Missing HTTP response compression (gzip compression for example at a reverse proxy).
  • Inefficient algorithms, such as slow compute inside nested loops. Worth knowing the basics of Big O here.
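For example, here’s roughly what finding and fixing an N+1 looks like in Rails (a minimal sketch; the Post and Comment models are assumed):

# N+1: one query to load posts, then one query per post to count its comments
Post.all.each { |post| puts post.comments.count }

# Eager load the association instead: two queries total, no matter how many posts
Post.includes(:comments).each { |post| puts post.comments.size }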

Can I cache outside of my origin server?

Ok, so you need to cache – but can you cache the data such that:

  1. The data is as close to the requester as possible, therefore much faster to serve!
  2. Your backend server can sit back and do zero work (isn’t that great?)

The good news for web applications is that browser clients have their own cache. By using proper Cache-Control HTTP headers, you can control which responses get cached and for how long. If the data is too dynamic (unique to individual users, for example) to cache for any length of time, consider whether you can make it static (the same for all users). If you can’t make it static, can you extract the parts that are static and cache those?
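In Rails, for example, you can set these headers per action (a minimal sketch; the controller and Product model are assumptions):

class CatalogController < ApplicationController
  def index
    # sends "Cache-Control: max-age=300, public" so browsers and shared
    # caches (like CDNs) can reuse this response for 5 minutes
    expires_in 5.minutes, public: true
    render json: Product.all
  end
end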

Most enterprise applications with global customer bases also make use of Content Delivery Networks (CDNs) that have points of presence (data centers strategically located in geographic locations around the world) that can serve requests out of their own caches without reaching the origin servers.

Delegating caching behavior to systems closer to your clients isn’t mutually exclusive with caching at the origin using remote in-memory systems, but it can be a lower-effort (implementation-wise) and significantly faster caching approach, so it’s worth trying first if it isn’t already being used.