Java Regular Expressions

1. What Is a Regular Expression?

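def) A formal language used to describe a set of strings that follow a particular rule.

A regular expression is a compact notation used for searching within strings.

cbcbcb181818

To pull only the digits out of a string like the one above, we need some way to express the notion of "a digit".

A person looking at a piece of data stored as a string can tell at a glance whether a character is a digit or a letter,

but a computer has to go through a more tedious process, such as checking an ASCII code range or comparing the character against each of 0 through 9,

before it knows that the character is a digit.

Regular expressions were introduced to remove this kind of inconvenience.

They are used not only in Java but across many languages, most of which provide a supporting library.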
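
To make the contrast concrete, here is a minimal sketch that extracts the digits from the sample string both ways. The class and method names (DigitExtraction, digitsByHand, digitsByRegex) are made up for this illustration, and the pattern [0-9]+ uses syntax that the following sections explain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitExtraction {

    // Manual approach: compare each character against the range '0'..'9'.
    static String digitsByHand(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Regex approach: the pattern [0-9]+ describes "a run of one or more digits".
    static String digitsByRegex(String input) {
        Matcher m = Pattern.compile("[0-9]+").matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(digitsByHand("cbcbcb181818"));  // 181818
        System.out.println(digitsByRegex("cbcbcb181818")); // 181818
    }
}
```

For a task this small the two versions are about the same length; the regex version pays off once the rule becomes more involved than "is it a digit".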

2. Character Classes

[abc]	a, b, or c
[^abc]	any character except a, b, or c
[a-zA-Z]	a through z or A through Z
[a-z[A-Z]]	a through z or A through Z (union)
[a-z&&[def]]	d, e, or f (intersection of a-z and [def])
[a-z&&[^m-p]]	a through z except m through p (subtraction)
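
The rows of this table can be checked with a short sketch like the one below. CharacterClassDemo and its matches helper are hypothetical names; Pattern.matches(regex, input) tests whether the whole input matches, so each call here checks a single character against one class.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {

    // Returns true when the whole string s is matched by the given character class.
    static boolean matches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        System.out.println(matches("[abc]", "b"));         // true
        System.out.println(matches("[^abc]", "b"));        // false
        System.out.println(matches("[a-zA-Z]", "Q"));      // true
        System.out.println(matches("[a-z[A-Z]]", "Q"));    // true  (union)
        System.out.println(matches("[a-z&&[def]]", "e"));  // true  (intersection)
        System.out.println(matches("[a-z&&[def]]", "g"));  // false
        System.out.println(matches("[a-z&&[^m-p]]", "n")); // false (subtraction)
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true
    }
}
```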

From the table above, digits can be expressed by giving a range such as [0-9].

Predefined character classes make this even simpler.

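.	any single character (line terminators are matched only in DOTALL mode)
\d	a digit: [0-9]
\D	a non-digit: [^0-9]
\s	a whitespace character: [ \t\n\x0B\f\r]
\S	a non-whitespace character: [^\s]
\w	a word character (letter, digit, or underscore): [a-zA-Z_0-9]
\W	a non-word character: [^\w]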
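
One way to see what each predefined class covers is to replace every matching character with a marker. The sketch below does that with String.replaceAll; PredefinedClassDemo and mask are hypothetical names, and the doubled backslashes are only Java string-literal escaping.

```java
class PredefinedClassDemo {

    // Replaces every character matched by regex with '#'.
    static String mask(String input, String regex) {
        return input.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        String input = "a1 _"; // a letter, a digit, a space, an underscore

        System.out.println(mask(input, "."));   // ####
        System.out.println(mask(input, "\\d")); // a# _
        System.out.println(mask(input, "\\D")); // #1##
        System.out.println(mask(input, "\\s")); // a1#_
        System.out.println(mask(input, "\\S")); // ## #
        System.out.println(mask(input, "\\w")); // ## #
        System.out.println(mask(input, "\\W")); // a1#_
    }
}
```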

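* Note

In a Java string literal the backslash (\) is itself an escape character, so a single regex backslash has to be written as \\.

To match a literal '\' character, write \\\\.

The \d above is therefore written as \\d in Java source code. A regex that is read at runtime, such as one typed into the test program in section 4, is not a string literal and needs no doubling.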
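
The sketch below shows both cases in Java source code. BackslashEscapeDemo, isDigit, and toForwardSlashes are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class BackslashEscapeDemo {

    // "\\d" in source is the two-character regex \d.
    static boolean isDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // "\\\\" in source is the two-character regex \\, which matches one literal backslash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        System.out.println("\\d".length());              // 2
        System.out.println("\\\\".length());             // 2
        System.out.println(isDigit("7"));                // true
        System.out.println(toForwardSlashes("C:\\temp")); // C:/temp
    }
}
```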

Enter your regex:
[0-9]
Enter input string to search:
cbcbcb181818
I found the "1" starting at 6 ending at 7
I found the "8" starting at 7 ending at 8
I found the "1" starting at 8 ending at 9
I found the "8" starting at 9 ending at 10
I found the "1" starting at 10 ending at 11
I found the "8" starting at 11 ending at 12

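3. Quantifiers

A quantifier specifies how many times the preceding element must occur for a match.

Greedy	Reluctant	Possessive	Meaning
x?	x??	x?+	x, once or not at all
x*	x*?	x*+	x, zero or more times
x+	x+?	x++	x, one or more times
x{n}	x{n}?	x{n}+	x, exactly n times
x{n,}	x{n,}?	x{n,}+	x, at least n times
x{n,m}	x{n,m}?	x{n,m}+	x, at least n but not more than m times

Applying the greedy quantifiers above to the string "abaabaaab" gives the following results. Because a? and a* can match zero occurrences, they also report empty matches at positions where there is no a, including the end of the input at index 9.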

Enter your regex:
a?
Enter input string to search:
abaabaaab
I found the "a" starting at 0 ending at 1
I found the "" starting at 1 ending at 1
I found the "a" starting at 2 ending at 3
I found the "a" starting at 3 ending at 4
I found the "" starting at 4 ending at 4
I found the "a" starting at 5 ending at 6
I found the "a" starting at 6 ending at 7
I found the "a" starting at 7 ending at 8
I found the "" starting at 8 ending at 8
I found the "" starting at 9 ending at 9

Enter your regex:
a*
Enter input string to search:
abaabaaab
I found the "a" starting at 0 ending at 1
I found the "" starting at 1 ending at 1
I found the "aa" starting at 2 ending at 4
I found the "" starting at 4 ending at 4
I found the "aaa" starting at 5 ending at 8
I found the "" starting at 8 ending at 8
I found the "" starting at 9 ending at 9

Enter your regex:
a+
Enter input string to search:
abaabaaab
I found the "a" starting at 0 ending at 1
I found the "aa" starting at 2 ending at 4
I found the "aaa" starting at 5 ending at 8

Enter your regex:
a{2}
Enter input string to search:
abaabaaab
I found the "aa" starting at 2 ending at 4
I found the "aa" starting at 5 ending at 7

Enter your regex:
a{2,}
Enter input string to search:
abaabaaab
I found the "aa" starting at 2 ending at 4
I found the "aaa" starting at 5 ending at 8

Enter your regex:
a{1,2}
Enter input string to search:
abaabaaab
I found the "a" starting at 0 ending at 1
I found the "aa" starting at 2 ending at 4
I found the "aa" starting at 5 ending at 7
I found the "a" starting at 7 ending at 8

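Quantifiers can also be applied to

1. a capturing group, (cat), or

2. a character class, [cat].

In the first case the quantifier sets how many times "cat" as a whole is repeated;

in the second it sets how many times 'any one of the characters c, a, t' is repeated. ex) cc ca at tt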

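A capturing group treats the characters inside the parentheses as a single unit.
The part of the input matched by a group is kept in memory and can be recalled later with a backreference.
For example, (\d\d)\1 matches two digits followed immediately by the same two digits.
Groups are numbered from left to right by counting their opening parentheses.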
In ( ( A ) ( B ( C ) ) ), for example, the groups

1	( ( A ) ( B ( C ) ) )
2	( A )
3	( B ( C ) )
4	( C )

are numbered as shown. Group 0 always stands for the entire expression.

This matters because several Matcher methods take a group number as a parameter

(start, end, group, and so on).
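
As a rough illustration of those methods, the sketch below matches the nested pattern against the input "ABC" and lists each group with its start and end index. GroupNumberingDemo and describeGroups are hypothetical names; group(int), start(int), end(int), and groupCount() are the Matcher methods being demonstrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {

    // Lists every group (0 through groupCount) of a full match with its index range.
    // Returns an empty list when the input does not match the regex as a whole.
    static List<String> describeGroups(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                lines.add("group " + i + " = " + m.group(i)
                        + " [" + m.start(i) + ", " + m.end(i) + ")");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeGroups("((A)(B(C)))", "ABC")) {
            System.out.println(line);
        }
        // group 0 = ABC [0, 3)
        // group 1 = ABC [0, 3)
        // group 2 = A [0, 1)
        // group 3 = BC [1, 3)
        // group 4 = C [2, 3)
    }
}
```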

** Without parentheses the quantifier applies only to the last character 't', so the pattern looks for catt.

Enter your regex:
cat{2}
Enter input string to search:
onecattwocatcatthreecatcatcat
I found the "catt" starting at 3 ending at 7
I found the "catt" starting at 12 ending at 16

Enter your regex:
(cat){2}
Enter input string to search:
onecattwocatcatthreecatcatcat
I found the "catcat" starting at 9 ending at 15
I found the "catcat" starting at 20 ending at 26

Enter your regex:
[cat]{2}
Enter input string to search:
onecattwocatcatthreecatcatcat
I found the "ca" starting at 3 ending at 5
I found the "tt" starting at 5 ending at 7
I found the "ca" starting at 9 ending at 11
I found the "tc" starting at 11 ending at 13
I found the "at" starting at 13 ending at 15
I found the "ca" starting at 20 ending at 22
I found the "tc" starting at 22 ending at 24
I found the "at" starting at 24 ending at 26
I found the "ca" starting at 26 ending at 28

Enter your regex:
(\d\d)\1
Enter input string to search:
12121233
I found the "1212" starting at 0 ending at 4
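
The same backreference can be used from Java code, remembering to double the backslashes in the string literal. This is only a sketch; BackreferenceDemo and repeatedPairs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {

    // Finds every place where a two-digit pair is immediately followed by the same pair.
    static List<String> repeatedPairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("(\\d\\d)\\1").matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(repeatedPairs("12121233"));       // [1212]
        System.out.println(repeatedPairs("1212 3434 1234")); // [1212, 3434]
    }
}
```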

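Finally, let's look at the difference between greedy, reluctant, and possessive quantifiers.

In the string imadogyou'readog

we want to find dog preceded by arbitrary characters.

1) Greedy: in .*dog, the .* means zero or more of any character.

It greedily swallows the whole input string (imadogyou'readog).

Next the d has to be matched, but no input is left,

so the matcher beats on .* until it coughs the characters back up one at a time.

g.. o.. d, and at last a d is available. The matcher then matches o, and then g.

The match succeeds at the very end of the input, so the whole string is returned.

2) Reluctant works the other way around: .*? starts by consuming nothing and takes one more character at a time, checking for a match after each step. Trying the whole string is its last resort.

The picky eater finds its first match at imadog.

It then keeps nibbling one character at a time and finds a second match at you'readog.

3) Possessive always consumes the entire string and tries to match exactly once.

Like the greedy quantifier, .*+ swallows the whole input.

But its possessiveness goes beyond greed: no matter how hard it is beaten, .*+ never gives a character back.

So the match fails.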

Enter your regex:
.*dog
Enter input string to search:
imadogyou'readog
I found the "imadogyou'readog" starting at 0 ending at 16

Enter your regex:
.*?dog
Enter input string to search:
imadogyou'readog
I found the "imadog" starting at 0 ending at 6
I found the "you'readog" starting at 6 ending at 16

Enter your regex:
.*+dog
Enter input string to search:
imadogyou'readog
No match found
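
The three behaviors can also be compared side by side in code. The following is a minimal sketch; QuantifierModeDemo and findAll are hypothetical names, and findAll simply collects every match that Matcher.find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModeDemo {

    // Collects the text of every successive match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "imadogyou'readog";

        System.out.println(findAll(".*dog", input));  // greedy:     [imadogyou'readog]
        System.out.println(findAll(".*?dog", input)); // reluctant:  [imadog, you'readog]
        System.out.println(findAll(".*+dog", input)); // possessive: []
    }
}
```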

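4. A Regex Test Harness

The java.util.regex package provides the Pattern and Matcher classes for working with regular expressions.

Pattern.compile() compiles a regex string into a Pattern.

pattern.matcher() creates a Matcher that applies that pattern to the given input string.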

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        // Loops until the process is interrupted or standard input is closed.
        while (true) {
            // Compile the regex typed by the user.
            System.out.println("Enter your regex: ");
            Pattern pattern = Pattern.compile(sc.nextLine());

            // Create a matcher for the input string typed by the user.
            System.out.println("Enter input string to search: ");
            Matcher matcher = pattern.matcher(sc.nextLine());

            // Report every match together with its start and end index.
            boolean found = false;
            while (matcher.find()) {
                System.out.println("I found the \"" + matcher.group() + "\""
                        + " starting at " + matcher.start() + " ending at " + matcher.end());
                found = true;
            }
            if (!found) {
                System.out.println("No match found");
            }
        }
    }
}

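The program above reports where in the input string each match of the entered regex is found.

The start() and end() methods of Matcher treat the indices as lying between the characters: start() is the index of the first matched character and end() is the index just past the last one.

0 c 1 a 2 t 3 s 4

That is why cat starts at index 0 and ends at index 3.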

Enter your regex:
cat
Enter input string to search:
cats
I found the "cat" starting at 0 ending at 3
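
Because start() is inclusive and end() is exclusive, the pair can be passed straight to String.substring, and their difference is the length of the match. A small sketch, with MatchIndexDemo and firstMatchRange as hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchIndexDemo {

    // Returns {start, end} of the first match, or null when nothing matches.
    static int[] firstMatchRange(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        String input = "cats";
        int[] range = firstMatchRange("cat", input);

        System.out.println(range[0] + " to " + range[1]);         // 0 to 3
        System.out.println(input.substring(range[0], range[1])); // cat
        System.out.println(range[1] - range[0]);                  // 3
    }
}
```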

Reference

https://docs.oracle.com/javase/tutorial/essential/regex
