diff --git a/doc/css_selectors.md b/doc/css_selectors.md new file mode 100644 index 0000000..b5a28ad --- /dev/null +++ b/doc/css_selectors.md @@ -0,0 +1,477 @@ +# CSS Selectors Specification + +This document acts as an alternative specification to the official W3 +[CSS3 Selectors Specification][w3spec]. This document specifies only the +selectors supported by Oga itself. Only CSS3 selectors are covered, CSS4 is not +part of this specification. + +This document is best viewed in the YARD generated documentation or any other +Markdown viewer that supports the [Kramdown][kramdown] syntax. Alternatively it +can be viewed in its raw form. + +## Abstract + +The official W3 specification on CSS selectors is anything but pleasant to read. +A lack of good examples and unspecified behaviour are just two of many problems. +This document was written as a reference guide for myself as well as a way for +others to more easily understand how CSS selectors work. + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", +"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be +interpreted as described in [RFC 2119][rfc-2119]. + +## Syntax + +To describe syntax elements of CSS selectors this document uses the same grammar +as [Ragel][ragel]. 
For example, an integer would be defined as follows: + + integer = [0-9]+; + +In turn an integer that can optionally be prefixed by `+` or `-` would be +defined as follows: + + integer = ('+' | '-')? [0-9]+; + +A quick and basic crash course of the Ragel grammar: + +* `*`: zero or more instances of the preceding token(s) +* `?`: zero or one instance of the preceding token(s) +* `+`: one or more instances of the preceding token(s) +* `(` and `)`: used for grouping expressions together +* `^`: inverts a match, thus `^[0-9]` means "anything but a single digit" +* `"..."` or `'...'`: a literal character, `"x"` would match the literal "x" +* `|`: the OR operator, `x | y` translates to "x OR y" +* `[...]`: used to define a sequence, `[0-9]` translates to "0 OR 1 OR 2 OR + 3..." all the way up to 9 + +Semicolons are used to terminate lines. While not strictly required in this +specification they are included in order to produce a Ragel syntax compatible +grammar. + +See the Ragel documentation for more information on the grammar. + +## Terminology + +local name +: The name of an element without a namespace. For the element `<strong>` the + local name is `strong`. + +namespace prefix +: The namespace prefix of an element. For the element `<foo:strong>` the + namespace prefix is `foo`. + +expression +: One or more selectors used together to retrieve a set of elements + from a document. + +## Selector Scoping + +Whenever a selector is used to match an element the selector applies to all +nodes in the context. For example, the selector `foo` would match all `foo` +elements at any position in the document. On the other hand, the selector +`foo bar` only matches `bar` elements that are descendants of any `foo` +element. + +In XPath the corresponding axis for this is `descendant-or-self`. 
In other +words, this CSS expression: + + foo + +is the same as this XPath expression: + + descendant-or-self::foo + +In turn this CSS expression: + + foo bar + +is the same as this XPath expression: + + descendant-or-self::foo/descendant-or-self::bar + +Note that in the various XPath examples the `descendant-or-self` axis is omitted +in order to enhance readability. + +### Syntax + +A CSS expression is made up of one or more selectors separated by one or more +spaces. There MUST be at least 1 space between two selectors; there MAY be more +than one. Multiple spaces do not alter the behaviour of the expression in any +way. + +## Universal Selector + +W3 chapter: + +The universal selector `*` (also known as the "wildcard selector") can be used +to match any element, regardless of its local name or namespace prefix. + +Example XML: + + + + + + +CSS: + + root * + +This would return a set containing two elements: `` and `` + +The corresponding XPath is also `*`. + +## Element Selector + +W3 chapter: + +The element selector (known as "Type selector" in the official W3 specification) +can be used to match a set of elements by their local name or namespace. The +selector `foo` is used to match all elements whose local name is +`foo`. + +Example XML: + + + + + + +CSS: + + root foo + +This would return a set with only the `<foo>` element. + +This selector can be used in combination with the +[Universal Selector][universal-selector]. This allows one to select elements +using both a given local name and namespace. The syntax for this is as +follows: + + ns-prefix|local-name + +Here the pipe (`|`) character separates the namespace prefix and the local name. +Both can either be an identifier or a wildcard. For example, the selector +`rb|foo` matches all elements with local name `foo` and namespace prefix `rb`. + +The namespace prefix MAY be left out producing the selector `|local-name`. In +this case the selector only matches elements _without_ a namespace prefix. 
+ +If a namespace prefix is given and it's _not_ a wildcard then elements without a +namespace prefix will _not_ be matched. + +The corresponding XPath expression for such a selector is +`ns-prefix:local-name`. For example, `rb|foo` in CSS is the same as `rb:foo` in +XPath. + +### Syntax + +The syntax for just the local name is as follows: + + identifier = '*' | [a-zA-Z]+ [a-zA-Z\-_0-9]*; + +The wildcard is put in place to allow a single rule to be used for both names +and wildcards. + +The syntax for selecting an element including a namespace prefix is as +follows: + + ns_plus_local_name = identifier? '|' identifier; + +This would match `|foo`, `*|foo` and `foo|bar`. In order to match `foo` the +regular `identifier` rule declared above can be used. + +## Attribute Selectors + +W3 chapter: + +Attribute selectors can be used to further narrow down a set of elements based +on their attribute list. In XPath these selectors are known as "predicates". For +example, the selector `foo[bar]` matches all `foo` elements that have a `bar` +attribute, regardless of the value of said attribute. + +Example XML: + + + + + + +CSS: + + root foo[number] + +This would return a set containing only the `<foo>` element since the other +element has no attributes. + +For the CSS expression `foo[number]` the corresponding XPath expression is the +following: + + foo[@number] + +When specifying an attribute you MAY include an operator and a value to match. +In this case you MUST include an attribute value surrounded by either single or +double quotes (but not a combination of the two). + +There are 6 operators available: + +* `=`: equals operator +* `~=`: whitespace-in operator +* `^=`: starts-with operator +* `$=`: ends-with operator +* `*=`: contains operator +* `|=`: hyphen-starts-with operator + +### Equals Operator + +The equals operator matches an element if a given attribute value equals the +value specified. 
For example, `foo[number="1"]` matches all `foo` elements that +have a `number` attribute whose value is _exactly_ "1". + +Example XML: + + + + + + +CSS: + + root foo[number="1"] + +This would return a set containing only the first `<foo>` element. + +The corresponding XPath expression is quite similar. For `foo[number="1"]` this +would be: + + foo[@number="1"] + +### Whitespace-in Operator + +This operator matches an element if the given attribute value consists of +space-separated values of which one is exactly the given value. For example, +`foo[numbers~="1"]` matches all `foo` elements that have the value `"1"` in the +`numbers` attribute. + +Example XML: + + + + + + +CSS: + + root foo[numbers~="1"] + +This would return a set containing only the first `foo` element. On the other +hand, if one were to use the expression `root foo[numbers~="bar"]` instead then +only the second `<foo>` element would be matched. + +The corresponding XPath expression is quite complex: `foo[numbers~="1"]` is +translated into the following XPath expression: + + foo[contains(concat(" ", @numbers, " "), concat(" ", "1", " "))] + +The `concat` calls are used to ensure the expression doesn't match a substring +of an attribute value and that the expression matches elements of which the +attribute only has a single value. If `foo[contains(@numbers, ' 1 ')]` were to +be used then elements such as `<foo numbers="1">` would not be matched. + +Software implementing this selector is free to decide how it concatenates +spaces around the value to match. Both Oga and Nokogiri use an extra call to +`concat` but the following would be perfectly valid too: + + foo[contains(concat(" ", @numbers, " "), " 1 ")] + +### Starts-with Operator + +This operator matches elements of which the attribute value starts _exactly_ +with the given value. For example, `foo[numbers^="1"]` would match the element +`` but _not_ the element ``. 
+ +For `foo[numbers^="1"]` the corresponding XPath expression is as follows: + + foo[starts-with(@numbers, "1")] + +### Ends-with Operator + +This operator matches elements of which the attribute value ends _exactly_ with +the given value. For example, `foo[numbers$="3"]` would match the element `` but _not_ the element ``. + +The corresponding XPath expression is quite complex due to the lack of an +`ends-with` function in XPath 1.0. Instead one has to resort to using the +`substring()` function. As such the corresponding XPath expression for +`foo[bar$="baz"]` is as follows: + + foo[substring(@bar, string-length(@bar) - string-length("baz") + 1, string-length("baz")) = "baz"] + +### Contains Operator + +This operator matches elements of which the attribute value contains the given +value. For example, `foo[bar*="baz"]` would match both `` +and ``. + +For `foo[bar*="baz"]` the corresponding XPath expression is as follows: + + foo[contains(@bar, "baz")] + +### Hyphen-starts-with Operator + +This operator matches elements of which the attribute value is a +hyphen-separated list of values that starts _exactly_ with the given value. For +example, `foo[numbers|="1"]` matches `` but not +``. + +For `foo[numbers|="1"]` the corresponding XPath expression is as follows: + + foo[@numbers = "1" or starts-with(@numbers, concat("1", "-"))] + +Note that this selector will also match elements such as +``. + +### Syntax + +The syntax of the various attribute selectors can be described as follows: + + # Strings are used for the attribute values + + dquote = '"'; + squote = "'"; + + string_dquote = dquote ^dquote* dquote; + string_squote = squote ^squote* squote; + + string = string_dquote | string_squote; + + # The six operators listed earlier in this section. + operator = '=' | '~=' | '^=' | '$=' | '*=' | '|='; + + # The `identifier` rule is the same as the one used for matching element + # names. + attr_test = identifier '[' space* identifier (space* operator space* string)? space* ']'; + +Whitespace inside the brackets does not affect the behaviour of the selector. 
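The six operators can also be read as plain string tests. The following Python sketch is only an illustration of the rules described in this section, not how Oga implements them; the function name `attribute_matches` and its parameters are hypothetical. The `~=` and `|=` branches deliberately mirror the XPath translations given above.

```python
def attribute_matches(value, operator=None, expected=None):
    """Return True if an attribute value satisfies a CSS attribute test.

    `value` is the attribute's value, or None when the attribute is absent.
    This is a hypothetical helper written for illustration only.
    """
    if value is None:
        return False  # the element lacks the attribute entirely
    if operator is None:
        return True  # bare `[attr]` test: presence is enough
    if operator == "=":
        return value == expected
    if operator == "~=":
        # Mirrors contains(concat(" ", @attr, " "), concat(" ", "x", " "))
        return f" {expected} " in f" {value} "
    if operator == "^=":
        return value.startswith(expected)
    if operator == "$=":
        return value.endswith(expected)
    if operator == "*=":
        return expected in value
    if operator == "|=":
        # Mirrors @attr = "x" or starts-with(@attr, concat("x", "-"))
        return value == expected or value.startswith(expected + "-")
    raise ValueError(f"unknown operator: {operator!r}")


# "1" is a whole space-separated token here ...
print(attribute_matches("1 2 3", "~=", "1"))
# ... but only a substring of the token "11" here.
print(attribute_matches("11 2 3", "~=", "1"))
```

Padding both strings with a space is what lets `~=` accept a lone value such as `"1"` while still rejecting `"11"`.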
+ +## Pseudo Classes + +W3 chapter: + +Pseudo classes can be used to further narrow down elements besides just their +names and attribute values. In essence they are a combination of XPath function +calls and axes. Some pseudo classes can take an argument to alter their +behaviour. + +Pseudo classes are often applied to element selectors. For example: + + foo:bar + +Here `:bar` would be a pseudo class applied to the `foo` element. Some pseudo +classes (e.g. the `:root` pseudo class) can also be used on their own, for +example: + + :root + +### :root + +The `:root` pseudo class selects an element only if it's the top-level element +in a document. + +Example XML: + + + + + +Using the CSS expression `root foo:root` we'd get an empty set as the `<foo>` +element is not the root element. On the other hand, `root:root` would return a +set containing only the `<root>` element. + +This selector can be applied to an element selector or used +on its own. + +For the selector `foo:root` the corresponding XPath expression is as follows: + + foo[not(parent::*)] + +For `:root` the XPath expression is: + + *[not(parent::*)] + +### :nth-child(n) + +The `:nth-child(n)` selector can be used to select a set of elements based on +their position and/or an interval. Here `n` is an argument that can be used to +specify one of the following: + +1. A literal node set index +2. A node interval used to match every N nodes +3. A node interval plus an initial offset + +The first element in a node set for `:nth-child()` is located at position 1, +_not_ position 0 (unlike most programming languages). As a result +`:nth-child(1)` matches the _first_ element, _not_ the second. + +Besides using a literal index argument you can also use an interval, optionally +with an offset. This can be used, for example, to match every 2nd element, or +every 2nd element starting at element number 4. 
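To make the three argument forms concrete, here is a small Python sketch that lists the 1-based positions selected for a given interval and offset. It is an illustration under stated assumptions rather than Oga's implementation: the function name `nth_child_positions` and the parameters `a` (interval), `b` (offset) and `count` (size of the node set) are hypothetical. A position matches when it equals `a * n + b` for some integer `n >= 0`.

```python
def nth_child_positions(a, b, count):
    """Return the 1-based positions in a set of `count` nodes that match
    an interval `a` with offset `b`, i.e. positions equal to a * n + b
    for some integer n >= 0. Hypothetical helper for illustration only.
    """
    positions = []
    for position in range(1, count + 1):
        if a == 0:
            # No interval: a literal index such as :nth-child(3)
            matches = position == b
        else:
            diff = position - b
            # n = diff / a has to be a whole, non-negative number
            matches = diff % a == 0 and diff // a >= 0
        if matches:
            positions.append(position)
    return positions


print(nth_child_positions(0, 1, 6))  # literal index 1
print(nth_child_positions(2, 0, 6))  # every 2nd element
print(nth_child_positions(2, 4, 8))  # every 2nd element starting at 4
```

Note that positions start at 1, so the loop runs from 1 to `count` inclusive.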
+ +The syntax of this argument is as follows: + + integer = ('+' | '-')? [0-9]+; + interval = ('n' | '-n' | integer 'n') integer?; + +Here `interval` would match any of the following: + + n + -n + 2n + 2n+5 + 2n-5 + -2n+5 + -2n-5 + +Due to `integer` also matching the `+` and `-`, the sign will be part of the same +token as the digits. If this is not desired the following grammar can be used instead: + + integer = [0-9]+; + modifier = '+' | '-'; + interval = ('n' | '-n' | modifier? integer 'n') (modifier integer)?; + +To match every 2nd element you'd use the following: + + :nth-child(2n) + +To match every 2nd element starting at element 2 you'd instead use this: + + :nth-child(2n+2) + +As mentioned the `+2` in the above example is the initial offset. This is +however _only_ the case if the second number is positive. That means that for +`:nth-child(2n-2)` the offset is _not_ `-2`. When using a negative offset the +actual offset has to first be calculated. When using an argument in the form of +`An-B` we can calculate the actual offset as follows: + + offset = A - (B % A) + +For example, this would effectively turn `:nth-child(2n-2)` into +`:nth-child(2n+2)` and `:nth-child(2n-5)` into `:nth-child(2n+1)`. Note that if +the minus sign is part of the number you can simply use the following formula +instead, provided `%` returns a non-negative result for a positive `A` (as it +does in Ruby): + + offset = B % A + +For `:nth-child(2n-5)` this translates to: + + offset = -5 % 2 + +This results in `:nth-child(2n+1)`. + +To ease the process of selecting even and odd elements you can also use +`even` and `odd` as an argument. Using `:nth-child(even)` is the same as +`:nth-child(2n)` while using `:nth-child(odd)` in turn is the same as +`:nth-child(2n+1)`. + +[w3spec]: http://www.w3.org/TR/css3-selectors/ +[rfc-2119]: https://www.ietf.org/rfc/rfc2119.txt +[kramdown]: http://kramdown.gettalong.org/ +[universal-selector]: #universal-selector +[bnf]: https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form +[ragel]: http://www.colm.net/open-source/ragel/
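The two offset formulas from the `:nth-child(n)` section can be checked with a short Python sketch. The function names `offset_from_unsigned` and `offset_from_signed` and the `KEYWORDS` table are hypothetical and exist only for this illustration. The signed variant relies on `%` being a floored modulo, which holds in Python and Ruby; in languages where `%` truncates towards zero (C, Java, JavaScript) `-5 % 2` evaluates to `-1`, so an extra step would be needed there.

```python
def offset_from_unsigned(a, b):
    """First formula: for an argument written as An-B, with B given
    without its minus sign, the actual offset is A - (B % A)."""
    return a - (b % a)


def offset_from_signed(a, b):
    """Second formula: when the minus sign is part of B the actual
    offset is simply B % A (assumes floored modulo and a positive A)."""
    if a <= 0:
        raise ValueError("this sketch only handles a positive interval")
    return b % a


# `even` and `odd` expressed as (interval, offset) pairs.
KEYWORDS = {"even": (2, 0), "odd": (2, 1)}

print(offset_from_unsigned(2, 5))  # offset for :nth-child(2n-5)
print(offset_from_signed(2, -5))   # same argument, sign kept on B
print(KEYWORDS["odd"])
```

For `:nth-child(2n-2)` the first formula yields 2 and the second yields 0. Both select the same elements, because offsets that differ by a multiple of the interval describe the same set of positions.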