Things not to do with string functions

Whatever programming language or framework you use, you are most likely familiar with the string-handling functions at your disposal. You probably even wield concat, replace, match and split like so many ninja weapons! Sometimes, however, the hard part is not solving a problem with strings but recognizing when you should refrain from using these otherwise tried-and-true tools and take another approach, lest your code be broken or insecure. A famous example of this is the Stack Overflow question “RegEx match open tags except XHTML self-contained tags”, where Jeff learns that regular expressions are not the right tool when it comes to parsing (X)HTML.

With this article I’ll try to highlight some tasks which, at first glance, seem like they could be accomplished using string-handling functions and regular expressions, when going down that path only leads to much sadness.

Matching URLs

Let’s imagine you are building the awesome new social network where users can keep in touch with their friends and family, have constructive debates and discover new ideas. In order to protect your community you want to forbid any link to a website outside the domains you control. More specifically, you want to redact any URL which does not point to your https://awesome.example.com website. The code which differentiates between allowed and disallowed URLs may look something like this.

function isUrlAllowed(url) {
  return Boolean(url.match('awesome.example.com'))
}

Your users cannot post links to other websites anymore. https://wikipedia.org certainly does not contain “awesome.example.com”, therefore it is forbidden. Mission accomplished! Right?

Of course not. Your astute users have quickly caught on and started using a neat trick! Rather than posting a link to https://wikipedia.org, they can post a link to https://wikipedia.org#awesome.example.com. This is a perfectly valid URL which points where it is supposed to, with the added benefit that it goes right through your filter.
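To see the bypass in action, here is a small sketch that runs the check against a few URLs. The function is the one shown above, renamed naiveIsUrlAllowed for this illustration, and the third URL is a made-up example. That third case shows a second weakness: when match receives a string, it is converted to a regular expression, so each dot matches any character.

```javascript
// The naive check from above, renamed for this illustration.
function naiveIsUrlAllowed(url) {
  return Boolean(url.match('awesome.example.com'))
}

// A plain external link is rejected, as intended.
const plainLink = naiveIsUrlAllowed('https://wikipedia.org')

// The same link with our domain tucked into the fragment slips through.
const disguisedLink = naiveIsUrlAllowed('https://wikipedia.org#awesome.example.com')

// The dots in the pattern match any character, so this hypothetical
// look-alike domain slips through as well.
const lookalikeDomain = naiveIsUrlAllowed('https://awesome-example.com.evil.test')

console.log(plainLink, disguisedLink, lookalikeDomain) // false true true
```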

Alright then. Let’s pour some more work into this function. Here’s the next iteration you might come up with.

function isUrlAllowed(url) {
  return Boolean(url.match(/^(https?:\/\/)?awesome.example.com/))
}

“Surely this ought to do it!” you may be thinking. Of course, one of your more astute users has found yet another way to circumvent your filter. This user owns the domain “astute.xyz” and has started hosting a URL shortening service at https://awesome.example.com.astute.xyz. Now each and every one of your users can use this service to post links to wherever they wish, since the URLs all start with “https://awesome.example.com”, which is exactly what you are matching.
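A quick sketch makes the problem visible. The function is the second iteration from above, renamed prefixIsUrlAllowed for this illustration, and the paths in the sample URLs are invented. Parsing the same URL shows the host the browser would actually contact.

```javascript
// The second iteration from above, renamed for this illustration.
function prefixIsUrlAllowed(url) {
  return Boolean(url.match(/^(https?:\/\/)?awesome.example.com/))
}

// A genuine link to our own site passes, as intended.
const genuine = prefixIsUrlAllowed('https://awesome.example.com/profile')

// So does a link to the shortener, because only the prefix is checked.
const shortener = 'https://awesome.example.com.astute.xyz/abc123'
const lookalike = prefixIsUrlAllowed(shortener)

// Parsing the URL reveals where the link really goes.
const realHostname = new URL(shortener).hostname

console.log(genuine, lookalike, realHostname)
// true true awesome.example.com.astute.xyz
```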

This issue (though thankfully not this use case) is one I have encountered in real production code. During an audit of the codebase the problems with this approach were pointed out to us, and the fix turned out to be easy and elegant. Your language or framework of choice probably has facilities to parse URLs for you already. Instead of building some brittle regular expression or string-handling machinery, you can just use tried-and-true standard library functions. In JavaScript, it looked like this.

function isUrlAllowed(url) {
  const parsedUrl = new URL(url)
  return parsedUrl.host === 'awesome.example.com'
}

URLs are more complex beasts than they may initially seem, so it is best to let a well-established library parse them.
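If you want to harden the parsed version a little further, the following is one possible sketch, not the code from the audit. The name isUrlAllowedStrict, the ALLOWED_HOSTNAMES set and the scheme check are assumptions made for this example. It relies on the fact that new URL() throws a TypeError when it cannot parse its input.

```javascript
// Hypothetical allow-list; adjust to the hosts you actually control.
const ALLOWED_HOSTNAMES = new Set(['awesome.example.com'])

function isUrlAllowedStrict(url) {
  let parsedUrl
  try {
    parsedUrl = new URL(url)
  } catch (error) {
    // new URL() throws on input it cannot parse; treat that as forbidden.
    return false
  }
  // Only accept web links (an assumption of this sketch).
  if (parsedUrl.protocol !== 'https:' && parsedUrl.protocol !== 'http:') {
    return false
  }
  // hostname excludes the port, whereas host includes it.
  return ALLOWED_HOSTNAMES.has(parsedUrl.hostname)
}

const checks = [
  'https://awesome.example.com/profile',
  'https://wikipedia.org#awesome.example.com',
  'https://awesome.example.com.astute.xyz',
  'https://awesome.example.com@evil.test/',
  'not a url',
].map(isUrlAllowedStrict)

console.log(checks) // [ true, false, false, false, false ]
```

Comparing hostname rather than host is a design choice: a link to a non-default port on your own domain is still accepted. If you would rather reject those, compare host instead, as in the function above.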

Concatenating file paths

Now let’s say you wish to allow your users to upload files through your brand-new desktop app. For some (very questionable) reasons you decided to have users write the path of the file they wish to upload relative to their home directory. In order to load the file, you write the following.

function uploadFile() {
  const pathInHome = promptUserForUploadedFilePath()
  const path = process.env.HOME + pathInHome
  return readFile(path)
}

Many things can go wrong. If, as a user, I want to upload the file located under /home/me/Pictures/cute-cat.png, I’d be tempted to input “Pictures/cute-cat.png”. Given that you don’t necessarily know whether the HOME environment variable ends with a path separator (it usually does not), you could end up in quite a predicament when you then try to read the file /home/mePictures/cute-cat.png. The obvious way to fix it is to simply concatenate with a path separator between the two fragments.

function uploadFile() {
  const pathInHome = promptUserForUploadedFilePath()
  const path = process.env.HOME + '/' + pathInHome
  return readFile(path)
}

This might be fine if you distribute your app only for GNU/Linux and OS X, but it will definitely break down on Windows. You can do some OS detection to include either the forward slash found in UNIX-like OSes or the backslash found on Windows, but this sounds like something that should be handled by your standard library. Turns out it often is!

const path = require('path')

function uploadFile() {
  const pathInHome = promptUserForUploadedFilePath()
  // A distinct name avoids shadowing the path module required above.
  const fullPath = path.join(process.env.HOME, pathInHome)
  return readFile(fullPath)
}

This operation is often found under the name path.join; for example, it is os.path.join in Python, File.join in Ruby or even std::filesystem::path::append in C++ (the usage for that one looks super weird). These implementations are perfectly capable of handling extra or missing separators. How they treat an absolute path given as the second argument varies from one library to the next, so check the documentation of the one you use.
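Here is a short sketch of what Node's path.join does with a few inputs. It uses path.posix and path.win32 so that the results are the same on every platform; the directories and file names are invented for the example.

```javascript
const path = require('path')

// Hypothetical home directory used throughout this demonstration.
const home = '/home/me'

// A missing separator is added for us.
const missingSeparator = path.posix.join(home, 'Pictures/cute-cat.png')

// Redundant separators are collapsed.
const extraSeparator = path.posix.join(home + '/', '/Pictures/cute-cat.png')

// The Windows flavour uses backslashes, even for the user's forward slash.
const windowsStyle = path.win32.join('C:\\Users\\me', 'Pictures/cute-cat.png')

// Caveat: ".." segments are resolved, so the result can leave the home directory.
const escaping = path.posix.join(home, '../other/secret.txt')

console.log(missingSeparator) // /home/me/Pictures/cute-cat.png
console.log(extraSeparator)   // /home/me/Pictures/cute-cat.png
console.log(windowsStyle)     // C:\Users\me\Pictures\cute-cat.png
console.log(escaping)         // /home/other/secret.txt
```

The last case is a reminder that joining paths correctly is not the same as restricting them: if the result must stay inside a given directory, that still has to be checked separately.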

Matching email addresses

Ah, good old venerable email. Anytime you need to work with email you can be sure things will be more complicated than initially planned. By a lot. It starts with a simple question: what is an email address? Let’s say you want to be helpful to your users and have your form validate in real time. Users should only be able to submit their email address if it is valid. You could write something like this. (I have seen a similar function in production.)

function isEmailValid(email) {
  return /^[a-z0-9-]+@([a-z0-9-]+\.)+[a-z]{2,3}$/.test(email)
  // One or more alphanumeric characters or dashes,
  // then the @ symbol,
  // then one or more alphanumeric characters or dashes,
  // followed by a dot,
  // at least once,
  // then two or three alphabetic characters.
}

A few things can go wrong with this approach.

What happens if the address includes a tag, as in username+comment@example.com? This is often called plus addressing. Such addresses sometimes map to multiple inboxes, or the user may simply have triage rules that depend on the tag. People do use those.
This regular expression might have worked in the old days when we did not have fancy TLDs such as .berlin, .museum, .flowers or .pizza, but now all bets are off. The longest TLD in the IANA’s official list to date is the 24-character monster .xn--vermgensberatung-pwb, which will show up as .vermögensberatung in your browser thanks to the magic of Punycode.
The expression also fails to account for many other obscure features of email addresses. Wikipedia has a very surprising list of valid emails to illustrate this.
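The points above can be checked with a small sketch. The pattern is a copy of the one in isEmailValid, anchored with ^ and $ so that it has to match the whole string rather than any substring of it; the sample addresses are made up for the example.

```javascript
// Copy of the pattern from isEmailValid, anchored to the whole string.
const naiveEmailPattern = /^[a-z0-9-]+@([a-z0-9-]+\.)+[a-z]{2,3}$/

// Every one of these sample addresses is syntactically valid.
const samples = [
  'username@example.com',         // the plain case
  'username+comment@example.com', // plus addressing
  'username@example.pizza',       // TLD longer than three letters
  'first.last@example.com',       // dot in the local part
]

const results = samples.map((email) => naiveEmailPattern.test(email))

console.log(results) // [ true, false, false, false ]
```

Only the plainest address survives, which means real users with perfectly good addresses would be turned away by the form.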

My recommendation for this is quite simple: don’t validate email addresses yourself. You’ll find many articles on the net with behemoth regular expressions claiming to match all email addresses perfectly; perhaps one of them does, but the chances are low. With HTML5, browsers have actually been given the ability to do some powerful form validation: rather than coming up with your own matching logic, you can just delegate to the browser. Simply make sure you give your inputs the “email” type.

<input type="email" required />

If you do that, however, you need to remember that browsers are free to define their own algorithm. “But this means I still need to have my own validation logic server-side!?” you may say. And of course you’d be right: even if you instructed browsers to ensure only email addresses go through, you can never trust user input. However, there is still something you can do to avoid having to validate email addresses.

Just send a verification email to the address, whatever it is.

After all, what you care about is that you can communicate with your user, right? Not that their email address obeys a regular expression? Isn’t the email infrastructure best suited to decide what is an acceptable email address and what is not anyway? Just send the email with a link, and if someone clicks the link, you know the email address is good.
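As a rough sketch of that flow, the code below generates an unguessable token, stores it next to the address and builds the link to send. Everything named here is an assumption made for the illustration: the function names, the VERIFY_URL endpoint, the in-memory Map (a real service would persist and expire pending verifications) and the sendEmail callback, which stands in for whatever mail-sending function you use.

```javascript
const crypto = require('crypto')

// In-memory store, for this sketch only.
const pendingVerifications = new Map()

// Hypothetical endpoint that handles clicks on the verification link.
const VERIFY_URL = 'https://awesome.example.com/verify'

function startEmailVerification(email, sendEmail) {
  // A random token ties the clicked link back to this address.
  const token = crypto.randomBytes(32).toString('hex')
  pendingVerifications.set(token, email)
  const link = `${VERIFY_URL}?token=${token}`
  // sendEmail is a placeholder supplied by the caller.
  sendEmail(email, `Please confirm your address: ${link}`)
  return token
}

function confirmEmailVerification(token) {
  const email = pendingVerifications.get(token)
  if (email === undefined) {
    return null // Unknown or already used token.
  }
  // Each token can be used only once.
  pendingVerifications.delete(token)
  return email
}
```

Note that no pattern matching happens anywhere: the address is considered good once confirmEmailVerification returns it, that is, once someone has received the message and followed the link.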

With this article I hope I was able to teach you something about solving problems which at first sight involve tricky string manipulations. Though often your trusty string functions will do the job well, there are certainly also elegant built-in solutions for those problems which resist your string-fu!

July 23, 2020