Filter

Enclosing class: Characters

public static class Characters.Filter extends Character.Subset

Subsets of Unicode characters identified by their general category. The categories are identified by constants defined in the Character class, like LOWERCASE_LETTER, UPPERCASE_LETTER, DECIMAL_DIGIT_NUMBER and SPACE_SEPARATOR.

An instance of this class can be obtained from a list of character types using the forTypes(byte...) method, or from one of the constants predefined in this class. Unicode characters can then be tested for inclusion in the subset by calling the contains(int) method.
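The idea of a subset defined by general categories can be sketched with the JDK alone. The class below is a hypothetical stand-in (CategoryFilterSketch is not part of SIS and is not its implementation); it assumes that a bit mask indexed by the value returned by Character.getType(int) is a reasonable way to store the accepted categories.

```java
/**
 * Hypothetical stand-in for a category-based character subset.
 * Not the Apache SIS implementation; it uses only java.lang.Character.
 */
final class CategoryFilterSketch {
    /** Bit mask where bit n is set when general category n is accepted. */
    private final long typeMask;

    private CategoryFilterSketch(long typeMask) {
        this.typeMask = typeMask;
    }

    /** Builds a subset as the union of the given Character general categories. */
    static CategoryFilterSketch forTypes(byte... types) {
        long mask = 0;
        for (byte type : types) {
            if (type < 0 || type >= Long.SIZE) {
                throw new IllegalArgumentException("Unsupported character type: " + type);
            }
            mask |= 1L << type;
        }
        return new CategoryFilterSketch(mask);
    }

    /** Tests a code point by looking up its general category in the mask. */
    boolean contains(int codePoint) {
        return (typeMask & (1L << Character.getType(codePoint))) != 0;
    }

    public static void main(String[] args) {
        CategoryFilterSketch lettersAndSpaces = forTypes(
                Character.LOWERCASE_LETTER,
                Character.UPPERCASE_LETTER,
                Character.SPACE_SEPARATOR);
        System.out.println(lettersAndSpaces.contains('a'));
        System.out.println(lettersAndSpaces.contains(' '));
        System.out.println(lettersAndSpaces.contains('7'));
    }
}
```

With the real class, the equivalent calls would be Characters.Filter.forTypes(...) followed by contains(int), as described above.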

Relationship with international standards

ISO 19162:2015 §B.5.2 recommends ignoring spaces, case and the following characters when comparing two identified object names: “_” (underscore), “-” (minus sign), “/” (solidus), “(” (left parenthesis) and “)” (right parenthesis). The same specification also limits the set of valid characters in a name to the following (§6.3.1):

A-Z a-z 0-9 _ [ ] ( ) { } < = > . , : ; + - (space) % & ' " * ^ / \ ? | °

Note: SIS does not enforce this restriction in its programmatic API, but may perform some character substitutions when formatting Well Known Text (WKT).

If we take only the characters in the above list that are valid in a Unicode identifier and remove the characters that ISO 19162 recommends ignoring, the only characters left are letters and digits.
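The consequence of that reasoning can be illustrated with a small comparison routine. This is a minimal sketch using only the JDK; the class and method names (NameComparisonSketch, simplify, heuristicallyEqual) are hypothetical, and the actual SIS comparison may apply additional rules.

```java
/**
 * Hypothetical illustration of a name comparison that keeps only
 * letters and digits and ignores case. Not the Apache SIS algorithm.
 */
final class NameComparisonSketch {
    /** Keeps only letter and digit code points, folded to lower case. */
    static String simplify(String name) {
        StringBuilder buffer = new StringBuilder(name.length());
        name.codePoints()
            .filter(Character::isLetterOrDigit)   // drops spaces, '_', '-', '/', '(' and ')'
            .map(Character::toLowerCase)          // ignores case
            .forEach(buffer::appendCodePoint);
        return buffer.toString();
    }

    /** Compares two names after simplification. */
    static boolean heuristicallyEqual(String a, String b) {
        return simplify(a).equals(simplify(b));
    }

    public static void main(String[] args) {
        System.out.println(simplify("WGS 84"));
        System.out.println(heuristicallyEqual("NAD83(HARN) / Texas", "nad83_harn-texas"));
    }
}
```

Filtering on letters and digits removes every character that §B.5.2 lists as ignorable, so no separate list of punctuation needs to be maintained in this sketch.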

Since:

0.3

Field Summary

Fields

Modifier and Type

Field

Description

static final Characters.Filter

LETTERS_AND_DIGITS

The subset of all characters for which Character.isLetterOrDigit(int) returns true.

static final Characters.Filter

UNICODE_IDENTIFIER

The subset of all characters for which Character.isUnicodeIdentifierPart(int) returns true, excluding ignorable characters.

Method Summary

Modifier and Type

Method

Description

boolean

contains(int codePoint)

Returns true if this subset contains the given Unicode character.

static Characters.Filter

forTypes(byte... types)

Returns a subset representing the union of all Unicode characters of the given types.

Methods inherited from class Character.Subset
equals, hashCode, toString

Methods inherited from class Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- LETTERS_AND_DIGITS
  
  public static final Characters.Filter LETTERS_AND_DIGITS
  
  The subset of all characters for which Character.isLetterOrDigit(int) returns true. This subset includes the following general categories:
  Character.LOWERCASE_LETTER, UPPERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER and DECIMAL_DIGIT_NUMBER.
  SIS uses this filter when comparing two identified object names. See the Relationship with international standards section in the Javadoc of this class for more information.
  See Also:
  
  AbstractIdentifiedObject.isHeuristicMatchForName(String)
  
  Citations.identifierMatches(Citation, String)
- UNICODE_IDENTIFIER
  
  public static final Characters.Filter UNICODE_IDENTIFIER
  
  The subset of all characters for which Character.isUnicodeIdentifierPart(int) returns true, excluding ignorable characters. This subset includes all the LETTERS_AND_DIGITS categories with the addition of the following ones:
  Character.LETTER_NUMBER, CONNECTOR_PUNCTUATION, NON_SPACING_MARK and COMBINING_SPACING_MARK.
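The difference between the two predefined subsets can be explored with the JDK predicates named in their descriptions. The following is a sketch only: IdentifierSubsetSketch is a hypothetical helper, and it assumes that "excluding ignorable characters" corresponds to rejecting code points for which Character.isIdentifierIgnorable(int) returns true.

```java
/**
 * Hypothetical helper mirroring the descriptions of the two predefined
 * subsets with JDK predicates. Not the Apache SIS implementation.
 */
final class IdentifierSubsetSketch {
    /** Mirrors the description of LETTERS_AND_DIGITS. */
    static boolean isLetterOrDigit(int codePoint) {
        return Character.isLetterOrDigit(codePoint);
    }

    /** Mirrors the description of UNICODE_IDENTIFIER (assumed reading of "ignorable"). */
    static boolean isIdentifierPart(int codePoint) {
        return Character.isUnicodeIdentifierPart(codePoint)
                && !Character.isIdentifierIgnorable(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {'A', '7', '_', 0x2167 /* Roman numeral eight */, 0x0301 /* combining acute */, ' '};
        for (int cp : samples) {
            System.out.printf("U+%04X letterOrDigit=%b identifierPart=%b%n",
                    cp, isLetterOrDigit(cp), isIdentifierPart(cp));
        }
    }
}
```

In this sketch, the underscore (CONNECTOR_PUNCTUATION), U+2167 (LETTER_NUMBER) and U+0301 (NON_SPACING_MARK) are accepted as identifier parts but are not letters or digits.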
Method Details
- contains
  
  public boolean contains(int codePoint)
  
  Returns true if this subset contains the given Unicode character.
  
  Parameters:
  
  codePoint - the Unicode character, as a code point value.
  
  Returns:
  
  true if this subset contains the given character.
- forTypes
  
  public static Characters.Filter forTypes(byte... types)
  
  Returns a subset representing the union of all Unicode characters of the given types.
  Parameters:
  
  types - the character types, as Character constants.
  
  Returns:
  
  the subset of Unicode characters of the given types.
  
  See Also:
  
  Character.LOWERCASE_LETTER
  
  Character.UPPERCASE_LETTER
  
  Character.DECIMAL_DIGIT_NUMBER
  
  Character.SPACE_SEPARATOR
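Because contains(int) takes a code point rather than a char, callers would typically iterate over a string by code points so that supplementary characters are tested as a whole. The sketch below shows that pattern with a plain IntPredicate standing in for the subset; CodePointScanSketch and count are hypothetical names.

```java
import java.util.function.IntPredicate;

/**
 * Hypothetical helper that scans a string by code points.
 * The IntPredicate stands in for a subset test such as contains(int).
 */
final class CodePointScanSketch {
    /** Counts the code points of the text accepted by the given subset test. */
    static long count(String text, IntPredicate subset) {
        return text.codePoints().filter(subset).count();
    }

    public static void main(String[] args) {
        // U+1D538 is an uppercase letter outside the Basic Multilingual Plane.
        String text = "A" + new String(Character.toChars(0x1D538)) + " 1";
        System.out.println(count(text, Character::isLetterOrDigit));
    }
}
```

Iterating with String.chars() instead would present the two surrogates of U+1D538 separately, and neither surrogate is a letter on its own.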
