The term SMILES refers to a line notation for encoding molecular structures and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms Canonical and Isomeric can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.
Typically, a number of equally valid SMILES can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to ensure the same SMILES is generated for a molecule regardless of the order of atoms in the structure. This SMILES is unique for each structure, although dependent on the canonicalisation algorithm used to generate it, and is termed the Canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure and do not simply manipulate strings as is sometimes thought. Various algorithms for generating Canonical SMILES have been developed, including those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT and Chemical Computing Group. A common application of Canonical SMILES is indexing and ensuring uniqueness of molecules in a database.
SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone and SMILES which encode this information are termed Isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term Isomeric SMILES is also applied to SMILES in which isotopes are specified.
Some of the examples of SMILES :
Creating SMILES using ChemSketch |
The colourful SMILES |
It's so easy! |
The Grammar of SMILES :
SMILES Grammar | Description |
---|---|
Atoms | Atoms are the nouns of the SMILES grammar. One represents each atom by its chemical symbol. Usually one encloses the symbol in square brackets, like this: [Cl]. However, the following organic subset symbols may appear without the brackets: B, C, N, O, P, S, F, Cl, Br, and I. These include the halogens, which would normally bond to only one other atom in any case, and other atoms that are assumed to be bound to hydrogen if they are not explicitly bound to something else. An atom participating in an aromatic ring structure is listed in lowercase. The use of the brackets is significant. For example, [S] refers to elemental sulfur, while the symbol S represents hydrogen sulfide, which has two atoms of hydrogen bound to the one of sulfur. (However, Cl-Cl refers to the diatomic molecule of chlorine, while Cl refers to hydrochloric acid.) |
Charges and positions of atoms | Charge signs (+ and -) and digits giving the multiple of a charge or the position of an atom are the adjectives (and sometimes the adverbs) of SMILES grammar. An ionic valence is a classic application. For example, [Fe+2] is the ferrous or iron (II) ion. Note that SMILES does not require, nor use, superscripts or subscripts. One does not multiply atoms themselves (except for atoms of hydrogen) by using numbers. Instead, one repeats the atomic symbol as many times as the atom appears. |
Bonds | Bonds are the verbs of the SMILES grammar.To simplify things even further, one may omit the - and : symbols for atoms that are adjacent to one another and have single or aromatic bonds joining them. This is the reason for representing an aromatically bound atom in lowercase instead of in UPPERCASE. Thus the SMILES for diatomic oxygen is O=O; that for carbon dioxide is O=C=O; for diatomic nitrogen, N#N; for hydrogen cyanide, C#N; for acetylene or ethyne, C#C; for hydrazine, N=N. |
Branches | Branches are the subordinating conjunctions of the SMILES grammar. A structure that branches from the main line is enclosed in parentheses. Nesting and stacking of branches is permitted. An atom other than carbon in a linear structure would also receive a branch. Thus the SMILES for chloromethane (formerly called "methyl chloride") would be C(Cl), and that for tetrachloromethane ("carbon tetrachloride") would be C(Cl)(Cl)(Cl)(Cl). Carboxylic acids are a common branching structure. The SMILES for acetic acid, for example, is CC(=O)O. |
Rings | To write a cyclic or ring structure, you "break" one of the bonds and write the structure as a line having digits following the atoms in the broken bond. Thus the SMILES for cyclohexane is C1CCCCC1. If a given atom is part of more than one ring structure, and you have to break more than one bond, you then use a different digit for each broken bond, in order to convey how to re-join the atoms. By convention, aromatic ring vertices are written in lowercase. Thus the SMILES for benzene is c1ccccc1 and that for pyridine is n1ccccc1. |
Disconnected Structures | A simple dot (.) serves as the most common example of a coordinating conjunction in SMILES. Two structures not having a covalent bond of any kind to join them are considered disconnected, and are joined with a dot. This is the proper method for representing ionic compounds. For example, the SMILES for sodium chloride is [Na+].[Cl-]. The SMILES for sodium acetate is [Na+].[CC(=O)O-]. |
For more information, come and look at SMILES!
No comments:
Post a Comment