Regular Expressions(kompiler)


• Regular Expressions
• finite state machine is a good “visual” aid
– but it is not very suitable as a specification
(its textual description is too clumsy)
• regular expressions are a suitable specification
– a more compact way to define a language that can be accepted by an FSM
• used to give the lexical description of a programming language
– define each “token” (keywords, identifiers, literals, operators, punctuation, etc)
– define white-space, comments, etc
• these are not tokens, but must be recognized and ignored
• Example: Pascal identifier
• Lexical specification (in English):
– a letter, followed by zero or more letters or digits
• Lexical specification (as a regular expression):
– letter . (letter | digit)*
• Operands of a regular expression
• Operands are same as labels on the edges of an FSM
– single characters, or
– the special character e (the empty string)
– “letter” is a shorthand for
– a | b | c | … | z | A | B | C | … | Z
• “digit“ is a shorthand for
– 0 | 1 | 2 | … | 9
• sometimes we put the characters in quotes
– necessary when denoting | . * ( )
• Precedence of | . * operators.
• Consider regular expressions:
– letter.letter | digit*
– letter.(letter | digit)*
• TEST YOURSELF
Question 1: Describe (in English) the language defined by each of the following regular expressions:
– letter (letter* | digit*)
– (letter | _ ) (letter | digit | _ )*
– digit* “.” digit*
– digit digit* “.” digit digit*
• TEST YOURSELF
Question 2: Write a regular expression for each of these languages:
– The set of all C++ reserved words
• Examples: if, while, for, class, int, case, char, true, false
– C++ string literals that begin with ” and end with ” and don’t contain any other ” except possibly in the escape sequence \”
• Example: ”The escape sequence \” occurs in this string”
– C++ comments that begin with /* and end with */ and don’t contain any other */ within the string
• Example: /* This is a comment * still the same comment */
• Example: Integer Literals
• An integer literal with an optional sign can be defined in English as:
– “(nothing or + or -) followed by one or more digits”
• The corresponding regular expression is:
– (+|-|e) (digit.digit*)
• A new convenient operator ‘+’
– same precedence as ‘*’
– digit digit* is the same as
– digit + which means “one or more digits”
• Language Defined by a Regular Expression
• Recall: language = set of strings
• Language defined by an automaton
– the set of strings accepted by the automaton
• Language defined by a regular expression
– the set of strings that match the expression
• Concept of Reg Exp Generating a String
Rewrite regular expression until have only a sequence of letters (string) left
• Non–determinism in Generation
• Different rule applications may yield different final results
• Concept of Language Generated by Reg Exp
• Set of all strings generated by a regular expression is the language of the regular expression
• In general, language may be infinite
• String generated by regular expression language is often called a “token”
• Examples of Languages and Reg Exp
• å = { 0, 1, . }
– (0 | 1)+ “.” (0 | 1)* | (0 | 1)* “.” (0 | 1)+
Þ binary floating point numbers
– (0 0)* Þ even-length all-zero strings
– 1* (0 1* 0 1*)* Þ binary strings with even number of zeros
• å = { a,b,c, 0, 1, 2 }
– (a|b|c)(a|b|c|0|1|2)* Þ alphanumeric identifiers
– (0|1|2)+ Þ trinary numbers
• Reg Exp Notational Shorthand
• R + one or more strings of R: R(R*)
• R? optional R: (R|e)
• [abcd] one of listed characters: (a|b|c|d)
• [a-z] one character from this range: (a|b|c|d…|z)
• [^abc] anything but one of the listed chars
• [^a-z] any one character not from this range
• Equivalence of FSM and Regular Expressions
• Theorem:
– For each finite state machine M, we can construct a regular expression R such that M and R accept the same language.
– [proof omitted]
• Theorem:
– For each regular expression R, we can construct a finite state machine M such that R and M accept the same language.
– [proof outline follows]
• Regular Expressions to NFSM (1)
• Regular Expressions to NFSM (2)
• Regular Expressions to NFSM (3)
• For A*
• Example of RegExp -> NFSM conversion
• Consider the regular expression
(1|0)*1
• The NFSM is
• Converting NFSM to DFSM
• Simulate the NFSM
• Each state of DFSM
– is a non-empty subset of states of the NFSM
• Start state of DFSM
– is the set of NFSM states reachable from the NFSM start state using only e-moves
• Add a transition S a > S’ to DFSM iff
– S’ is the set of NFSM states reachable from any state in S after consuming only the input a, considering e-moves as well
• Remarks on converting NFSM to DFSM
• An NFSM may be in many states at any time
• How many different states ?
• If there are N states, the NFSM must be in some subset of those N states
• How many subsets are there?
• 2N = finitely many
• For example, if N = 5 then 2N = 32 subsets
• NFSM -> DFSM Example
• TEST YOURSELF
Question 3: First convert each of these regular expressions to a NFSM
– (a | b | e) (a | b)
– (ab | ba)* (aa | bb)
Question 4: Next convert each resulting NFSM to a DFSM

Grammars for Syntax Definition
A Context-free Grammar (CFG) Is Utilized to Describe the Syntactic Structure of a Language
A CFG Is Characterized By:
1. A Set of Tokens or Terminal Symbols
2. A Set of Non-terminals
3. A Set of Production Rules
Each Rule Has the Form

NT ® {T, NT}*
4. A Non-terminal Designated As the Start Symbol
Grammars for Syntax Definition
Example CFG
Grammars are Used to Derive Strings:
Grammars are Used to Derive Strings:
A More Complex Grammar
Defining a Parse Tree
More Formally, a Parse Tree for a CFG Has the Following Properties:
Root Is Labeled With the Start Symbol
Leaf Node Is a Token or Î
Interior Node (Now Leaf) Is a Non-Terminal
If A ® x1x2…xn, Then A Is an Interior; x1x2…xn Are Children of A and May Be Non-Terminals or Tokens
Other Important Concepts
Ambiguity
Other Important Concepts
Associativity of Operators
Other Important Concepts
Operator Precedence
Syntax-Directed Translation
Associate Attributes With Grammar Rules & Constructs and Translate As Parsing Occurs
Our Example Uses Infix to Postfix Notation Translation for Expressions
Translation May Be Defined Inductively As: Postfix(e), E is an Expression
Syntax-Directed Definition: (2 parts)
Each Production Has a Set of Semantic Rules
Each Grammar Symbol Has a Set of Attributes
For the Following Example, String Attribute “t” is Associated With Each Grammar Symbol, i.e.,
What is a Derivation for 9 + 5 – 2?
Syntax-Directed Definition: (2 parts)
Each Production Rule of the CFG Has a Semantic Rule
Note: Semantic Rules for expr Use Synthesized Attributes Which Obtain Their Values From Other Rules.
Semantic Rules are Embedded in Parse Tree
How Do Semantic Rules Work ?
What Type of Tree Traversal is Being Performed?
How Can We More Closely Associate Semantic Rules With Production Rules ?
Examples
Parsing – Top-Down & Predictive
Top-Down Parsing Þ Parse tree / derivation of a token string occurs in a top down fashion.
For Example, Consider:
Top-Down Parse (type = start symbol)
Top-Down Parse (type = start symbol)
Top-Down Process
Recursive Descent or Predictive Parsing
Parser Operates by Attempting to Match Tokens in the Input Stream
Utilize both Grammar and Input Below to Motivate Code for Algorithm
Top-Down Algorithm (Continued)
Problem with Top Down Parsing
Left Recursion in CFG May Cause Parser to Loop Forever
Solution: Algorithm to Remove Left Recursion
Comparing Grammars
with Left Recursion
Notice Location of Semantic Actions in Tree
What is Order of Processing?
Comparing Grammars
without Left Recursion
The Lexical Analysis Process
A Graphical Depiction
The Lexical Analysis Process
Functional Responsibilities
Input Token String Is Broken Down
White Space and Comments Are Filtered Out
Individual Tokens With Associated Values Are Identified
Symbol Table Is Initialized and Entries Are Constructed for Each “Appropriate” Token
Under What Conditions will a Character be Pushed Back?
Can You Cite Some Examples in Programming Language Statements?
Algorithm for Lexical Analyzer
Algorithm for Lexical Analyzer
Symbol Table Considerations

Tinggalkan Balasan

Please log in using one of these methods to post your comment:

Logo WordPress.com

You are commenting using your WordPress.com account. Logout / Ubah )

Gambar Twitter

You are commenting using your Twitter account. Logout / Ubah )

Foto Facebook

You are commenting using your Facebook account. Logout / Ubah )

Foto Google+

You are commenting using your Google+ account. Logout / Ubah )

Connecting to %s