What is the ultimate goal of Domain Name and Email ID validation?

Finding the right balance between security and accessibility.

Many of the online services we use are accessed through "accounts" that we create with their providers. In many cases access is granted via the OAuth mechanism, in which portals with large user bases, such as Facebook, Google, GitHub and Stack Overflow, act as access enablers through their own sign-in mechanisms. Alternatively, users "Sign Up" with their own credentials, which typically involve an e-mail ID, or a custom login ID, and a password. In either case, the e-mail ID plays a vital role in the log-in/authentication process.

To put it differently, if you do not own an e-mail ID, you have no workable way to access the myriad benefits of online transactions. Since most of us use standard e-mail services, the problem rarely arises; or, the first time it does, we create an account with one of the popular e-mail service providers and get going. Because user behavior has been molded to work around the problem, many web services feel no need to admit e-mail IDs other than those of the most popular providers. Some of those that do allow custom e-mail IDs do not allow Internationalized Domain Names (IDNs) or Unicode-based user names. The stated reason is that no significant number of users is asking for such support. In fact, such users are either not using the internet at all or have been forced to use the workaround. Frankly, this does not augur well for the overall advancement of the internet.

For those new to the concept, IDNs are domain names that contain at least one character outside the ASCII repertoire (the letters a-z, the digits 0-9 and the hyphen), whether it comes from a non-Latin script or is an extended Latin letter. Here is an example: https://faß.de
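To make this concrete, here is a minimal Python sketch of how such a name is carried in the DNS: each non-ASCII label is converted to an ASCII-compatible form that starts with "xn--". The helper names are hypothetical, and the sketch deliberately skips the mapping and validity rules that the IDNA standards require; a production system should rely on a maintained IDNA 2008 library instead.

```python
def to_a_label(label: str) -> str:
    """Convert one domain label to its ASCII-compatible form (hypothetical helper).

    Assumption: the label is already normalized and valid; no IDNA mapping
    or validity rules are applied here.
    """
    if label.isascii():
        return label
    # Python's built-in "punycode" codec performs the raw Punycode encoding.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_domain(domain: str) -> str:
    """Convert every label of a dotted domain name (hypothetical helper)."""
    return ".".join(to_a_label(label) for label in domain.split("."))


print(to_ascii_domain("faß.de"))  # xn--fa-hia.de
```

The ASCII form is what travels on the wire; the Unicode form is what the user types and expects to see accepted.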

To go deeper into the topic, we need to understand the mechanisms that cause IDNs and IDN-based e-mail services to be rejected. One of the major hurdles is the "validation" implemented in most web applications. Let's see what it entails.

A typical regular expression used to validate domain names is:

((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$

This expression appears well designed: it limits each label to 63 characters, forbids a hyphen at the beginning or end of a label, and allows a TLD of up to 6 letters instead of 3. Even so, it has serious limitations of its own. One of the most obvious lies in the character classes

[A-Za-z0-9-]

and

[A-Za-z]

neither of which permits any character outside the basic Latin (ASCII) set.
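The effect is easy to observe. The Python sketch below applies the expression to a few names. It writes the pattern without spaces inside the quantifiers and with a single backslash before the dot, which is an assumption about how the expression is meant to be read, and the function name is hypothetical.

```python
import re

# The ASCII-only domain pattern discussed above; fullmatch() anchors both ends.
ASCII_ONLY_DOMAIN = re.compile(r"((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}")


def passes_ascii_regex(domain: str) -> bool:
    """Return True if the whole string matches the ASCII-only pattern."""
    return ASCII_ONLY_DOMAIN.fullmatch(domain) is not None


for name in ["example.com", "xn--fa-hia.de", "faß.de",
             "उदाहरण.भारत", "example.photography"]:
    print(name, passes_ascii_regex(name))
```

Only example.com and xn--fa-hia.de are accepted. The Unicode spelling of the same German name, the Hindi name, and the name whose TLD is longer than six letters are all rejected, although each is a well-formed domain name.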

Software developers typically validate e-mail IDs for the following reasons:

  • to stave off unwanted spammers
  • to ensure the user does not make an inadvertent mistake
  • to ensure that the e-mail ID belongs to the actual person entering it

Developers are not really interested in implementing the technically all-encompassing definition of RFC 5322, which allows many unusual forms of e-mail ID (IP address literals and the like). The solution suited to their use case only needs to ensure that every legitimate e-mail holder gets through. The definition of "legitimate", however, differs vastly between the technical standpoint (the RFC 5322 way) and the usability standpoint. From the usability standpoint, validation aims to ensure that every accepted e-mail ID belongs to an actual person who uses it for communication. This adds another angle to the validation process: ensuring that the e-mail ID is actually "in use", a requirement for which the RFC 5322 definition is clearly not sufficient.

Thus, on practical grounds, the actual requirements boil down to these:

  1. To perform some very basic validation checks
  2. To ensure that the e-mail ID entered is in use

The second requirement typically involves sending an e-mail that asks for a response to the address entered, and authenticating the user based on the action described in that e-mail. This is the most widely used mechanism for validating an "in use" e-mail ID. It does involve a round trip through the back-end server and is not a straightforward single-screen implementation; however, one cannot do away with it.
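The server side of such a round trip can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the in-memory dictionary are hypothetical, the actual sending of the e-mail is left out, and a real service would store the codes persistently and let them expire.

```python
import hmac
import secrets

# Hypothetical in-memory store: e-mail ID -> code that was mailed to it.
_pending = {}


def issue_confirmation_code(email: str) -> str:
    """Create a one-time code to be e-mailed to the address (sending omitted)."""
    code = secrets.token_urlsafe(16)
    _pending[email] = code
    return code


def confirm(email: str, code: str) -> bool:
    """Return True only if the code matches the one issued for this address."""
    expected = _pending.get(email)
    if expected is None:
        return False
    # Constant-time comparison on bytes, so non-ASCII input cannot raise.
    if not hmac.compare_digest(expected.encode("utf-8"), code.encode("utf-8")):
        return False
    del _pending[email]  # each code can be used only once
    return True
```

Note that nothing in this step inspects the characters of the e-mail ID: an address with a Unicode local part or an IDN is confirmed in exactly the same way as an ASCII one.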

The first requirement stems from the fact that developers do not want strings that are totally "non e-mail like" to pass as an e-mail ID. Such strings include blanks and strings without an "@" sign or without a domain name. Because domain names can also appear in their Punycode representation, anyone who wants to validate the domain itself needs a full-fledged implementation that ensures a valid domain name. Thus, given the basic nature of this requirement, validating for

<something>@<something>.<something>

is the only apt way of satisfying the requirement.

A typical regex that can satisfy this requirement is:

^[^@\s]+@[^@\s]+\.[^@\s.]+$

The above regex uses Perl-style regular-expression syntax, which is supported by the majority of programming languages. The validation statement is:

<anything except whitespaces and "@" sign>@<anything except whitespaces and "@" sign>.<anything except whitespaces, @ sign and dot>
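As a minimal sketch, the validation statement can be tried out in Python as follows. The pattern is written exactly as the statement reads, so the part between "@" and the last dot may itself contain dots; the function name is hypothetical.

```python
import re

# <no whitespace, no "@">@<no whitespace, no "@">.<no whitespace, no "@", no dot>
BASIC_EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s.]+")


def looks_like_email(value: str) -> bool:
    """Return True if the whole string has the basic shape of an e-mail ID."""
    return BASIC_EMAIL_SHAPE.fullmatch(value) is not None


for candidate in ["user@example.com", "उपयोगकर्ता@उदाहरण.भारत",
                  "user.example.com", "user name@example.com"]:
    print(candidate, looks_like_email(candidate))
```

The first two candidates pass and the last two fail. Because the pattern only excludes whitespace, "@" and, in the last part, the dot, it makes no distinction between ASCII and Unicode, which is what lets IDN-based e-mail IDs through.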

Those who want to go one step further, toward a more thorough implementation, can validate the two parts of the address separately:

<e-mail local part>@<domain name>

For <e-mail local part>: This can be either Unicode or ASCII based. E-mail service implementers are expected to follow certain guidelines so that their users have a good experience, including the availability of a good mailbox name. One important set of guidelines is given by the "Universal Acceptance Steering Group" in [UASG-028]. As a developer, it helps to understand these guidelines so that validations of domain names and e-mail IDs can be coded appropriately.

For <domain name>: This can be either IDN or ASCII based. There are various ways to check whether a domain name is valid. One is a static, protocol-based check that uses one of the libraries available on popular programming platforms. Another is a dynamic check of whether the domain actually resolves to a hosted website. Which check to use depends entirely on your business requirement. To test the available libraries in detail, one can apply any domain validation methodology to them. For recent studies on the subject, see [UASG-018A], which covers common programming languages such as C, C#, Go, Java, JavaScript, Python and Rust, and [UASG 037], which covers additional platforms such as iOS Swift, PHP and Android Kotlin. Both documents examine the popular libraries available to developers in depth and report their compliance level as of the documents' release dates. To make the compliance level easy to understand, the following color coding is used:

[Figure: UASG highlight colors]

Developers are advised not to use the libraries highlighted in red, as they will not give end users a UASG-compliant experience. It would further help the cause to reach out to the developers of those libraries and make them aware of the need to acknowledge these changes to the internet naming ecosystem.

It is worth highlighting that domain names and e-mail IDs can be validated in two ways. The first is static: only the text is validated, based on an analysis of its composition. The second is a live, or dynamic, check. For a domain name, the dynamic check is a DNS resolution query on the name. For an e-mail ID, it means sending an e-mail to the address. The user can then be required to prove ownership of the e-mail ID, either by clicking on a link given in the e-mail or by sharing a special code that can be obtained only by accessing that mailbox.
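A dynamic check on a domain name can be sketched with Python's standard socket module, as below. The function name and the injectable resolver parameter are hypothetical, the latter added only so that the logic can be exercised without network access. The sketch assumes the name is passed in its ASCII ("xn--") form, because Python's built-in conversion of Unicode host names follows the older IDNA 2003 rules.

```python
import socket


def domain_resolves(domain: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the name resolves to at least one address.

    Assumption: `domain` is already in ASCII (A-label) form.
    """
    try:
        return len(resolver(domain, None)) > 0
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name could not be encoded.
        return False


if __name__ == "__main__":
    # Needs network access; the result depends on the live DNS.
    print(domain_resolves("xn--fa-hia.de"))
```

A failed lookup may also be caused by a temporary network problem, so a negative result is better treated as a hint to retry or warn than as proof that the domain does not exist.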

Those interested in the overall process, and in the challenges and issues that may come up while implementing an Internationalized Email solution, can also go through the following RFCs:

  • RFC 6530 (Overview and Framework for Internationalized Email)
  • RFC 6531 (SMTP Extension for Internationalized Email)
  • RFC 6532 (Internationalized Email Headers)
  • RFC 6533 (Internationalized Delivery Status and Disposition Notifications)
  • RFC 6855 (IMAP Support for UTF-8)
  • RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8)
  • RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages)
  • RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email)

In addition, for e-mail software developers and mail administrators, [UASG 030] and [UASG 030A] are well-researched documents. They discuss the important building blocks of any e-mail system, viz. MUA, MSA, MTA and MDA, and list the compliance levels of various service providers and open-source software with respect to Internationalized Email requirements.

One thing is for sure: if we want a more inclusive internet that welcomes the next billion users on board, building "Universally Accepting" applications is the way to go. As consumers of services, it is incumbent upon us to explore and, where possible, adopt IDNs and IDN-based e-mail services. As software developers, we need to be aware of this user base and ensure that the code we use is "inclusive" in every way.

Happy coding!