Shell

Overview

Suppose you have a 6 GB pipe-delimited file on your laptop and you want to know how many distinct values a particular column contains. There are several ways to do this: you could load the file into a database and run an SQL query, or you could write a Python or Perl script.

Whichever approach you take, it is unlikely to be simpler or quicker to write than this:

TungNT:Others tungnt$ cat data.txt | cut -d "|" -f 1 | sort | uniq | wc -l
30

In many cases this will also run faster than an equivalent Perl or Python script.
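As a minimal sketch of the same idea, the pipeline can be shortened a little: the file is passed to cut directly instead of going through cat, and sort -u sorts and de-duplicates in one step. The temporary file and the variable name distinct are assumptions for illustration; the rows are a small sample in the same format as data.txt, and tail -n +2 is used so the header line is not counted as a value.

```shell
# Small stand-in for the large data.txt (hypothetical temporary file).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1986|BAL|AL|flanami01|641667
EOF

# tail -n +2 drops the header, cut keeps column 1,
# sort -u sorts and removes duplicates, wc -l counts what is left.
# tr -d ' ' strips the padding that some wc implementations add.
distinct=$(tail -n +2 "$tmp" | cut -d '|' -f 1 | sort -u | wc -l | tr -d ' ')
echo "$distinct"   # prints 2 (1985 and 1986)

rm -f "$tmp"
```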

Some basic shell commands

cat

TungNT:ShellBasic tungnt$ cat data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333

% time cat ut_2024_10.csv | wc -l
 7171315
cat ut_2024_10.csv  0.04s user 0.63s system 21% cpu 3.117 total
wc -l  2.79s user 0.13s system 93% cpu 3.115 total

head & tail

TungNT:ShellBasic tungnt$ head data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ head -n 3 data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
TungNT:ShellBasic tungnt$ tail data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ tail -n 2 data.txt 
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
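head and tail can also be combined to pick out a slice of a file. The sketch below assumes a small temporary file in the same format as data.txt; the variable names body and third are only illustrative. It relies on tail -n +N, which prints from line N to the end of the file rather than the last N lines.

```shell
tmp=$(mktemp)
printf '%s\n' \
  'yearID|teamID|lgID|playerID|salary' \
  '1985|BAL|AL|murraed02|1472819' \
  '1985|BAL|AL|lynnfr01|1090000' \
  '1985|BAL|AL|ripkeca01|800000' > "$tmp"

# Everything from line 2 onward, i.e. the data without its header.
body=$(tail -n +2 "$tmp")

# Only line 3: keep the first 3 lines, then the last one of those.
third=$(head -n 3 "$tmp" | tail -n 1)

echo "$body"
echo "$third"   # prints 1985|BAL|AL|lynnfr01|1090000

rm -f "$tmp"
```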

Piping

The head and tail commands above can also be combined with cat, like this:

TungNT:ShellBasic tungnt$ cat data.txt | head
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333

Read the “|” in the command as “pass the data on to”: the output of the command on its left becomes the input of the command on its right.

wc

The wc command counts the lines (-l), words (-w), or bytes (-c) in a given file. Note that -c counts bytes; use -m to count characters, which differs for multi-byte text.

TungNT:ShellBasic tungnt$ wc -l data.txt 
      10 data.txt
TungNT:ShellBasic tungnt$ wc -w data.txt 
      10 data.txt
TungNT:ShellBasic tungnt$ wc -c data.txt 
     296 data.txt
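When the count is needed inside a script, the file name that wc prints after the number gets in the way. One common way around it, sketched here on a hypothetical two-line temporary file, is to feed the file through standard input so that wc has no name to print. The variable names are illustrative.

```shell
tmp=$(mktemp)
printf 'a|b\nc|d\n' > "$tmp"   # two lines, 4 bytes each including the newline

# Reading from stdin means wc prints only the number.
# tr -d ' ' removes the leading padding some implementations add.
lines=$(wc -l < "$tmp" | tr -d ' ')
bytes=$(wc -c < "$tmp" | tr -d ' ')

echo "lines=$lines bytes=$bytes"   # prints lines=2 bytes=8

rm -f "$tmp"
```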

grep

TungNT:ShellBasic tungnt$ grep "1985|BAL" data.txt
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ grep "1985|BAL" data.txt | head
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ grep "1985|BAL" data.txt | head -n 2
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
TungNT:ShellBasic tungnt$ head -n 2 data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
TungNT:ShellBasic tungnt$ head -n 2 data.txt | grep "1985|BAL"
1985|BAL|AL|murraed02|1472819
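The searches above work because grep's default (basic) regular expressions treat "|" as a literal character. With grep -E the same pattern would mean "1985 or BAL". A hedged sketch of a more defensive form uses -F, which treats the whole pattern as a fixed string, together with -c (count matching lines) and -v (invert the match). The temporary file and variable names are assumptions for illustration.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1986|BAL|AL|flanami01|641667
EOF

# -F: fixed-string match, so "|" can never be read as regex alternation.
# -c: print the number of matching lines instead of the lines themselves.
matches=$(grep -c -F '1985|BAL' "$tmp")

# -v: keep only the lines that do NOT contain the string.
others=$(grep -v -F '1985|BAL' "$tmp")

echo "$matches"   # prints 2
echo "$others"    # prints the header and the 1986 row

rm -f "$tmp"
```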

sort

Options:

-t: which delimiter to use?
-k: which column (key) to sort on?
-n: sort numerically; not needed when sorting text.
-r: sort in descending order; the default is ascending.

TungNT:ShellBasic tungnt$ sort -t "|" -k 5 -n -r data.txt 
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ sort -t "|" -k 5 -n -r data.txt | head -2
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
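Two details are worth sketching here. First, -k 5 means "from field 5 to the end of the line", which happens to be the same thing in this file because salary is the last column; -k 5,5 restricts the key to field 5 only. Second, the header line is sorted along with the data unless it is removed first. The example below is a sketch on a hypothetical temporary file with rows taken from data.txt; the variable name top is illustrative.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
EOF

# tail -n +2 removes the header before sorting.
# -k 5,5nr: key is field 5 only, compared numerically, in reverse order.
top=$(tail -n +2 "$tmp" | sort -t '|' -k 5,5nr | head -n 1)

echo "$top"   # prints 1985|BAL|AL|murraed02|1472819

rm -f "$tmp"
```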

https://www.geeksforgeeks.org/sort-command-linuxunix-examples/

cut

This command selects specific columns from the data. Sometimes we only want to look at a few of the columns.

Options:

-d: which delimiter to use?
-f: which column or columns to keep?

TungNT:ShellBasic tungnt$ cut -d "|" -f 1,4,5 data.txt 
yearID|playerID|salary
1985|murraed02|1472819
1985|lynnfr01|1090000
1985|ripkeca01|800000
1985|lacyle01|725000
1985|flanami01|641667
1985|boddimi01|625000
1985|stewasa01|581250
1985|martide01|560000
1985|roeniga01|558333
TungNT:ShellBasic tungnt$ cut -d "|" -f 1,4,5 data.txt | head -2
yearID|playerID|salary
1985|murraed02|1472819
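Besides a comma-separated list, -f also accepts ranges. The following sketch, on a hypothetical two-line temporary file, shows a closed range (2-4) and an open-ended one (4-, meaning field 4 to the last field). The variable names are illustrative.

```shell
tmp=$(mktemp)
printf '%s\n' \
  'yearID|teamID|lgID|playerID|salary' \
  '1985|BAL|AL|murraed02|1472819' > "$tmp"

# Fields 2 through 4.
middle=$(cut -d '|' -f 2-4 "$tmp")

# Field 4 through the last field.
rest=$(cut -d '|' -f 4- "$tmp")

echo "$middle"
echo "$rest"

rm -f "$tmp"
```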

uniq

This command removes consecutive duplicate lines. Combined with sort, which places identical lines next to each other, it can therefore be used to get the distinct values in the data.

TungNT:ShellBasic tungnt$ cat data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ cat data.txt | cut -d "|" -f 4 | sort
boddimi01
flanami01
lacyle01
lynnfr01
martide01
murraed02
playerID
ripkeca01
roeniga01
stewasa01
TungNT:ShellBasic tungnt$ cat data.txt | cut -d "|" -f 4 | sort | uniq
boddimi01
flanami01
lacyle01
lynnfr01
martide01
murraed02
playerID
ripkeca01
roeniga01
stewasa01
TungNT:ShellBasic tungnt$ cat data.txt | cut -d "|" -f 4 | sort | uniq | head
boddimi01
flanami01
lacyle01
lynnfr01
martide01
murraed02
playerID
ripkeca01
roeniga01
stewasa01
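Because every playerID in data.txt is already unique, the transcript above does not show what uniq actually changes. The small sketch below uses three hypothetical input lines to show why sorting first matters, and adds uniq -c, which prefixes each distinct line with its count. The trailing awk only normalizes the padding that uniq -c puts before the count.

```shell
# Without sort, the two "1985" lines are not adjacent, so uniq keeps both.
unsorted=$(printf '1985\n1986\n1985\n' | uniq | wc -l | tr -d ' ')

# After sort, duplicates are adjacent and uniq collapses them.
sorted=$(printf '1985\n1986\n1985\n' | sort | uniq | wc -l | tr -d ' ')

echo "$unsorted $sorted"   # prints 3 2

# uniq -c prints "count value"; awk rewrites it as "value|count".
counts=$(printf '1985\n1986\n1985\n' | sort | uniq -c | awk '{ print $2 "|" $1 }')
echo "$counts"
```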

tr

Changing the delimiter in a file: the tr command replaces characters in its input with other characters. The same substitution can also be written with sed, as the second command below shows.

TungNT:ShellBasic tungnt$ cat data.txt | tr '|' ',' |  head -4
yearID,teamID,lgID,playerID,salary
1985,BAL,AL,murraed02,1472819
1985,BAL,AL,lynnfr01,1090000
1985,BAL,AL,ripkeca01,800000
TungNT:ShellBasic tungnt$ cat data.txt | sed -e 's/|/,/g' | head -4
yearID,teamID,lgID,playerID,salary
1985,BAL,AL,murraed02,1472819
1985,BAL,AL,lynnfr01,1090000
1985,BAL,AL,ripkeca01,800000
TungNT:ShellBasic tungnt$ cat data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
1985|BAL|AL|boddimi01|625000
1985|BAL|AL|stewasa01|581250
1985|BAL|AL|martide01|560000
1985|BAL|AL|roeniga01|558333
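tr works on single characters or character classes, not on multi-character strings, which is where sed becomes the better tool. As a brief sketch of what else tr can do, the example below applies a class-to-class mapping and the -d (delete) option to one row from data.txt. The variable names are illustrative.

```shell
line='1985|BAL|AL|murraed02|1472819'

# Map one character to another.
csv=$(printf '%s\n' "$line" | tr '|' ',')

# Map a whole character class: lower case to upper case.
upper=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]')

# -d deletes every occurrence of the listed characters.
joined=$(printf '%s\n' "$line" | tr -d '|')

echo "$csv"      # prints 1985,BAL,AL,murraed02,1472819
echo "$upper"    # prints 1985|BAL|AL|MURRAED02|1472819
echo "$joined"   # prints 1985BALALmurraed021472819
```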

awk

Sum of a column in a file: awk can compute the sum of a column. Divide that sum by the number of lines to get the mean.

TungNT:ShellBasic tungnt$ cat data.txt | awk -F "|" '{ sum += $5 } END { print sum }'
7054069
TungNT:ShellBasic tungnt$ head -n 1 cord_19_embeddings_2020-06-01.csv | awk 'BEGIN{FS=","} END{print NF}'

Structure of an awk program:

BEGIN {action}
pattern {action}
pattern {action}
.
.
pattern {action}
END {action}

Where:

BEGIN {action}: initializes variables before any input is read,
pattern {action}: processes the input; the action runs for each record that matches the pattern,
END {action}: runs once after all input has been processed.

awk 'BEGIN{SOMETHING HERE} {SOMETHING HERE: could put Multiple Blocks Like this} END {SOMETHING HERE}' file.txt

Built-in variables (FS and RS are usually set in the BEGIN block; awk updates NR and NF for every record):

FS: field separator. The default is whitespace (one or more spaces or tabs).
RS: record separator. The default is a newline.
NR: the number of the current record.
NF: the number of fields in the current record after it has been split using FS.
The $ variables: awk splits each incoming line using the given FS and stores the pieces in $ variables. For example, column 1 is $1 and column 2 is $2. $0 is the entire line as a string. Note that to access the last column you do not need to count the fields: $NF refers to it.
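The variables above can be seen together in a short sketch. It assumes two lines in the data.txt format supplied through printf; the variable name out is illustrative.

```shell
out=$(printf '%s\n' \
  'yearID|teamID|lgID|playerID|salary' \
  '1985|BAL|AL|murraed02|1472819' |
  awk 'BEGIN { FS = "|" }                       # set the separator before reading input
       { print NR ": " NF " fields, last = " $NF }')

echo "$out"
# prints:
# 1: 5 fields, last = salary
# 2: 5 fields, last = 1472819
```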

TungNT:ShellBasic tungnt$ awk 'BEGIN{sum=0; FS="|"} {sum += $5} END{print sum}' data.txt 
7054069
TungNT:ShellBasic tungnt$ awk 'BEGIN{sum=0; FS="|"; ctn=0} {sum += $5; ctn += 1} END{print sum/ctn}' data.txt 
705407
TungNT:ShellBasic tungnt$ awk 'BEGIN{sum=0; FS="|"} {sum += $5} END{print sum/NR}' data.txt 
705407
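One caveat about the two averages above: both divide by 10, because the header line is counted as a record even though it contributes 0 to the sum. A hedged sketch of a version that skips the header uses the pattern NR > 1 and keeps its own counter. The temporary file and the variable name mean are assumptions; the three rows are taken from data.txt.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
EOF

# NR > 1 skips the header; n counts only the rows that were summed.
mean=$(awk 'BEGIN { FS = "|" }
            NR > 1 { sum += $5; n++ }
            END { if (n > 0) printf "%.2f\n", sum / n }' "$tmp")

echo "$mean"   # prints 1120939.67 (3362819 / 3)

rm -f "$tmp"
```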

Filtering rows:

TungNT:ShellBasic tungnt$ awk 'BEGIN{FS="|"} $1=="1985" && $5 >= 700000 {print $0}' data.txt 
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
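The threshold in a filter like this does not have to be hard-coded. As a sketch, awk's -v option assigns a shell value to an awk variable before the program runs. The names min_salary, min and high are hypothetical, and the temporary file holds three rows from data.txt.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lacyle01|725000
1985|BAL|AL|flanami01|641667
EOF

min_salary=700000

# -v min=... makes the shell value available inside awk as the variable min.
# NR > 1 skips the header; only the playerID column is printed.
high=$(awk -F '|' -v min="$min_salary" 'NR > 1 && $5 >= min { print $4 }' "$tmp")

echo "$high"
# prints:
# murraed02
# lacyle01

rm -f "$tmp"
```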

SQL-like operations can be expressed on the command line:

Logical operators:

== equality operator; returns TRUE if both sides are equal
!= inequality operator
&& logical AND
|| logical OR
! logical NOT
<, >, <=, >= relational operators

Arithmetic operators: +, -, /, *, %, ^

Functions: length, substr, split, …

Group by:

TungNT:ShellBasic tungnt$ awk 'BEGIN{FS="|"}
{myArray[$1] += 1}
END{
for(k in myArray){if(k!="yearID")print k"|"myArray[k]};
}' data.txt
1985|4
1986|5
TungNT:ShellBasic tungnt$ awk 'BEGIN{FS="|"}
{myArray[$1] += 1}
END{
for(k in myArray){if(k!="yearID")print k"|"myArray[k]};
}' data.txt | tr '|' ',' > new_data.txt 
TungNT:ShellBasic tungnt$ 
TungNT:ShellBasic tungnt$ cat new_data.txt 
1985,4
1986,5
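The same associative-array pattern covers other aggregates. The sketch below computes the equivalent of SELECT yearID, SUM(salary) ... GROUP BY yearID on a hypothetical temporary file built from four rows of data.txt. The output is piped through sort because the order of a for (k in array) loop in awk is unspecified. The names total and sums are illustrative.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1986|BAL|AL|flanami01|641667
1986|BAL|AL|boddimi01|625000
EOF

# total[year] accumulates the salary column per year; NR > 1 skips the header.
sums=$(awk 'BEGIN { FS = "|" }
            NR > 1 { total[$1] += $5 }
            END { for (k in total) print k "|" total[k] }' "$tmp" | sort)

echo "$sums"
# prints:
# 1985|2562819
# 1986|1266667

rm -f "$tmp"
```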

TungNT:ShellBasic tungnt$ cat data.txt 
yearID|teamID|lgID|playerID|salary
1985|BAL|AL|murraed02|1472819
1985|BAL|AL|lynnfr01|1090000
1985|BAL|AL|ripkeca01|800000
1985|BAL|AL|lacyle01|725000
1986|BAL|AL|flanami01|641667
1986|BAL|AL|boddimi01|625000
1986|BAL|AL|stewasa01|581250
1986|BAL|AL|martide01|560000
1986|BAL|AL|roeniga01|558333
TungNT:ShellBasic tungnt$ awk 'BEGIN{FS="|"}
NR == 1 {next}  # skip the header; "salary" would be compared as a string and land in the last bucket
$5 < 600000 {myArr["0_600000"] += 1}
600000 <= $5 && $5 < 800000 {myArr["600000_800000"] += 1}
800000 <= $5 && $5 < 1000000 {myArr["800000_1000000"] += 1}
1000000 <= $5 {myArr["1000000_8"] += 1}
END{
for(k in myArr){print k "|" myArr[k]}
}' data.txt
800000_1000000|1
1000000_8|2
0_600000|3
600000_800000|3

TungNT:ShellBasic tungnt$ FILENAME="new_data.txt"
TungNT:ShellBasic tungnt$ echo $FILENAME
new_data.txt
TungNT:ShellBasic tungnt$ (awk 'BEGIN{FS=","; c=0} $1 ~ /^[-0-9]*(\.[0-9]*)?$/ {c=c+1;} END {print c;}' "$FILENAME"; \
> sort -n "$FILENAME")
2
1985,4
1986,5

    # Create a New file named A.txt to keep only the salary column.
    cat Salaries.csv | cut -d "," -f 5 > A.txt
    FILENAME="A.txt"

    # The first awk counts the lines that are numeric, using a regex to test each value.
    # ';' runs the commands sequentially, i.e. sort starts only after awk has finished.
    # The combined output of both commands is piped to a second awk, which does the real work.
    # The first line that the second awk receives is therefore the count of numeric lines,
    # and the lines from the second onward are the sorted file.
    (awk 'BEGIN {c=0} $1 ~ /^[-0-9]*(\.[0-9]*)?$/ {c=c+1;} END {print c;}' "$FILENAME"; \
            sort -n "$FILENAME") | awk '
      BEGIN {
        c = 0;
        sum = 0;
        med1_loc = 0;
        med2_loc = 0;
        med1_val = 0;
        med2_val = 0;
        min = 0;
        max = 0;
      }

      NR==1 {
        LINES = $1
        # We check whether numlines is even or odd so that we keep only
        # the locations in the array where the median might be.
        if (LINES%2==0) {med1_loc = LINES/2-1; med2_loc = med1_loc+1;}
        if (LINES%2!=0) {med1_loc = med2_loc = (LINES-1)/2;}
      }

      $1 ~ /^[-0-9]*(\.[0-9]*)?$/  &&  NR!=1 {
        # setting min value
        if (c==0) {min = $1;}
        # middle two values in array
        if (c==med1_loc) {med1_val = $1;}
        if (c==med2_loc) {med2_val = $1;}
        c++
        sum += $1
        max = $1
      }
      END {
        ave = sum / c
        median = (med1_val + med2_val ) / 2
        print "sum:" sum
        print "count:" c
        print "mean:" ave
        print "median:" median
        print "min:" min
        print "max:" max
      }
    '

find

Finding files in a directory that satisfy a given condition: this can be done with the find command.

TungNT:ShellBasic tungnt$ find . -name "h*.txt"
TungNT:ShellBasic tungnt$ find . -name "d*.txt"
./data.txt

Using find with a pattern (note that -name takes shell-style glob patterns such as [Dd]*, not regular expressions):

TungNT:ShellBasic tungnt$ find . -name "[Dd]*.txt"
./data.txt

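Finding and deleting files. rm takes file names as arguments and does not read them from standard input, so piping find straight into rm fails; xargs turns the lines it reads into command-line arguments: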

TungNT:ShellBasic tungnt$ find . -name "[hH]*.txt" | rm
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file

TungNT:ShellBasic tungnt$ find . -name "[hH]*.txt" | xargs
TungNT:ShellBasic tungnt$ find . -name "[Dd]*.txt" | xargs
./data.txt
TungNT:ShellBasic tungnt$ find . -name "[hH]*.txt" | xargs rm
TungNT:ShellBasic tungnt$ find . -name "*.txt" | xargs grep '625000'
1985|BAL|AL|boddimi01|625000
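One limitation of find | xargs rm is that plain xargs splits its input on whitespace, so a file name containing a space is passed to rm as two separate names. A hedged alternative is to let find run the command itself with -exec ... {} +, which hands the names over as arguments without any re-parsing. The temporary directory and the file names below are hypothetical.

```shell
dir=$(mktemp -d)
touch "$dir/hello world.txt" "$dir/Hi.txt" "$dir/data.txt"

# {} + appends as many matching paths as fit to a single rm invocation.
# The names never pass through a pipe, so spaces are preserved.
find "$dir" -name '[hH]*.txt' -exec rm -- {} +

remaining=$(ls "$dir")
echo "$remaining"   # prints data.txt

rm -rf "$dir"
```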

Creating a file with new content

TungNT:ShellBasic tungnt$ cat data.txt | tr '|' ',' |  head -4 > new_data.txt
TungNT:ShellBasic tungnt$ cat new_data.txt 
yearID,teamID,lgID,playerID,salary
1985,BAL,AL,murraed02,1472819
1985,BAL,AL,lynnfr01,1090000
1985,BAL,AL,ripkeca01,800000

sed -n 's/findWords/replaceWords/gpw output.txt' sample.txt

This is a sed command line that processes a text file named sample.txt.

It does the following:

"-n": suppresses sed's default behaviour of printing every input line to stdout.
"s/findWords/replaceWords/g": finds every occurrence of the string "findWords" and replaces it with "replaceWords".
"p": prints the lines on which a replacement was made to stdout.
"w output.txt": writes those same lines to a file named output.txt.
So this command looks for "findWords" in sample.txt, replaces it with "replaceWords", and writes the changed lines to output.txt. The changed lines are also printed to the terminal (because of the p flag), while all other lines are not printed at all (because of the -n option).
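The behaviour described above can be checked with a small sketch. It assumes a hypothetical three-line sample.txt created in a temporary directory; only the two lines that contain findWords should appear, both on stdout and in output.txt. The variable names shown and saved are illustrative.

```shell
dir=$(mktemp -d)
printf '%s\n' \
  'findWords here' \
  'nothing to see' \
  'findWords and findWords' > "$dir/sample.txt"

# Run inside the directory so that "w output.txt" writes next to sample.txt.
shown=$(cd "$dir" && sed -n 's/findWords/replaceWords/gpw output.txt' sample.txt)
saved=$(cat "$dir/output.txt")

echo "$shown"
# prints:
# replaceWords here
# replaceWords and replaceWords

rm -rf "$dir"
```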

TungNT (Blue)
